Tales of woe: Query performance
February 14th, 2007 by Peter Scott
The nuts and bolts of putting together a decent data warehouse are not that hard. You just need to get the I/O right ;-)
OK, in reality there is a little more to it: you need to make sure you store the correct granularity of data, you have all of the right constraints in place (plus all those other helpful bits of metadata that the query optimizer uses), the right backup and recovery regime, some useful pre-built summaries, a blisteringly fast ETL (preferably set-based) load process and enough grunt to delivery the data to the users. Not rocket science (but perhaps hard enough for people to want to employ me)
But the hard thing is to plan for those ad-hoc user queries created in a drag and drop query tool. Typically, the user picks columns from a palette of available data objects and drops them on to a canvas representing a report, graph, cross-tab or whatever. But sometimes I have the sneaking impression that the intellect of the mouse they are using to create these reports somehow transfers to the user’s brain. Just because it is easy to create a report on the screen does not mean it is simple to get the information back from the data warehouse.
Recently I have seen users looking for all of the customers that spent more than £1000 on a single item by looking at every item sold across 200 stores for a whole year and wondering why the query takes more than a few hours to run – looking at the SQL running against the database we are hitting the base fact table, finding about quarter a billion sales lines then using a having clause to isolate the results we need. Given a little thought about what the data means it is quite possible to do a few smart things – there is only one row per item per customer per store per day so we don’t really need to aggregate and use a having clause, and it would be much better to filter the items first to only include those that cost more than the threshold value – that way we just do a bitmap index join on a big data set, a lot of work, still not quick, but achievable.
But it is not always the user’s fault. I saw one slow query today where the user concatenated a store code and its name to create a meaningful long description and then used that in a query predicate. The query was of course slow running with a full-tablescan of the dimension table, change the predicate to a single column match and it runs in less than five seconds using a bitmap index. Our fault was not predicting that a user wants to use long descriptions and to create the required column and a suitable index in advance.


February 14th, 2007 at 11:43 pm
“and it would be much better to filter the items first to only include those that cost more than the threshold value”
and give up doing all the processing where it “should be done”: in the application server tier? Narh!
really hope Crystal Reports isn’t the ad-hoc reporting tool being used there…
February 15th, 2007 at 9:12 am
Noons:
Not Crystal Reports…
But, this report server pushes the having back to the database. If they are going to keep doing this I guess I should index the measures as well as the dimensions in fact table. But then if they insist on using SUM for GROUP BYs that are the same as the row unique key I might as well go home!
February 15th, 2007 at 9:12 am
The problem with reporting tools I think is they seem easy to use, only drag and drop. But after a few while doing easy reports, users need to construct a more complicated report, it seems easy, but some conditions aren’t so evident if you don’t know the data model and PL/SQL (or whatever) language, so soon they have to learn that kind of stuff, or, more properly, you have to teach them while you fix their reports.
I don’t know how is the situation in other countries, but my experience in Spain is that this reporting tools are sold to users as they don’t need to learn about tables, functions, and databases, but that’s not true, if you want to report more than just asking about sales per month, or opening and refreshing reports other people make, you need to know more about the database model and language.
I’m not used to write so long paragraphs in English, this late years I only read technical books or blogs, hope you can understand me…