Data Warehouse speed

January 6th, 2007 by Peter Scott

I see that David Aldridge is buying a new laptop, now that is a dilemma! Do you go for the biggest, fastest CPU, the most RAM, the best display or choose something more modest?. Of course it depends what you are going to do, I just need a machine that is plain vanilla enough (no fancy hardware with its incompatibility problems) to run virtual machines successfully for various databases and query tools, and to have the connectivity tools to allow wireless and Bluetooth access on the move and the ability to connect to a highspeed network to allow me to work with some “industrial strength” servers.

Although I can test concepts on a laptop PC, the one thing I doubt I could do is to look at some aspects of data warehouse performance, and that is down to insufficient IO capacity, or perhaps more correctly bandwidth. Most laptops have a single hard disk, most laptops only have one network port, USB 2 is not really fast enough to run multiple external drives. Although I can plug a lot of disk into a laptop, I doubt if I can get good data throughput.

More on speed

Testing, or benchmarking, data warehouses for query performance is somewhat difficult, after all what is a reasonable query workload in either the number of concurrent queries or size of the dataset being manipulated? And when you do get a measure of performance, is it scalable?

Obviously, we can’t extrapolate performance from

select sysdate from dual;

and use that to estimate the performance of a query to find the total number tubs of ice cream sold by all of the 7-11s in Seattle in August. But can we scale from the results for a single store or a single day? Even if the query optimiser decided to tackle a single day and a whole month in exactly the same way would be able to apply some simple (not necessarily linear) function to estimate the query time of 31 day’s worth of data against one?

One factor that can “throw a spanner in the works” is when the almost ubiquitous sort operations beloved of data warehouses moves from PGA memory sort to disk sort and then, in the case of those systems on RAID 5 (and there a lot of them), when the disc write buffer becomes saturated and the write speed drops to the physical disk write speed, each step seeing a dramatic drop in performance.

And don’t get me started on parallel queries that end up spending more time waiting than doing.

Comments

  1. Noons Says:

    One of the problems I have nowadays is with the methodology if one wants to plan for a given performance level with the VLDBs that make up most DSS systems.

    Unlike 10 years ago, the distinction between a pure DW and a DSS is starting to blur. There are a heap of systems in our clients that plain don’t fall into any of these two categories and yet have been labeled with one of them.

    The result is that it is incredibly difficult to make an extrapolation as to what constitutes a good level of performance, if it is sustainable, what happens when the SAN cache runs out, what sort of degradation one is going to hit, managing expectations, etc.

    There is just almost no data available to find this. And when there is a little bit of data, the methodology used to gather it is hazy or undocumented. This makes it nearly useless for anything else than a “wet thumb in the wind” approach to performance/speed planning.

    Many years ago we used to sit down with the application designers – they were called analysts back then – and extensive spreadsheets detailing the various logical access paths to specific data, per business function.

    This would let us figure out how many logical IOs it would take to satisfy any given application query. The totals would give us a good idea of the bandwidth we’d need.

    From there, it was relatively easy to extrapolate total hardware requirements and come up with a cost case for getting to the performance levels required.

    Then ad-hoc queries were invented…

  2. matjazc Says:

    Peter, you might want to consider SANdisk for notebook. It might mitigate some of your I/O problems a bit.
    Though I don’t believe it will make it possible to do any real WH testing – as you can see it doesn’t come in sufficient capacities. Plus it comes with unfriendly price tag.

Website Design & Build: tymedia.co.uk