Writing it large and reading it big

March 10th, 2007

Yesterday I ungraciously forgot to mention Nuno Souto’s blog piece from January, where he talked about just the same sort of issues with access to massive databases. Please go and read Noons’ work; it is well worth it.

I mentioned two challenges yesterday: putting the data into the database and finding it again. Systems that do one of these steps very well often falter at the other. If you use a sophisticated tokenized, column-based database to give you blistering read access, you may find that the computing required to work out where to insert the data makes the load too expensive (in cost or time) to be viable; conversely, writing data directly to disk may give the best insert rate but something less than optimal for reading. If you index the raw data, how much overhead is there in maintaining the index during the batch load, and does that process interfere with concurrent user activity – after all, trickle-feed is now a viable data load strategy? If you don’t index, how do you find what you are looking for? Not that indexes are always a help: take the example of looking for exception values – it is just not common to index fact measures.
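
As a rough illustration of that last point, here is the sort of exception-hunting query we might run against a hypothetical SALES fact table (the table and column names are invented for the example). With no index on the measure, the only way to find the rogue values is to scan the table, or at least every partition that might hold them.

    -- Hypothetical fact table; AMOUNT is a measure and is not indexed,
    -- so this exception hunt has to read every row rather than probe an index.
    SELECT s.sale_id, s.product_key, s.amount
    FROM   sales s
    WHERE  s.amount < 0          -- negative sales should not exist
       OR  s.amount > 1000000;   -- implausibly large values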

By necessity our data load process inserts one (or, hopefully, many) data records into our data warehouse; that is, we insert complete records. Whether they come from flat files, XML files, database links or whatever, they are converted into database rows, perhaps cleaned up first, but at the end of the day inserted into the DW. We may be doing fancy stuff along the way with transportable tablespaces, partition exchange or change data capture, but the net effect is the same: all of the data passes through our DW and ends up written to disk.
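
For example, a partition exchange load (one of the “fancy stuff” options above) still ends with the rows sitting on disk inside the warehouse. The sketch below uses invented table and partition names and assumes the staging table has already been loaded and indexed to match the fact table.

    -- Swap a pre-loaded staging table in as a partition of the fact table.
    -- This is a data dictionary operation: the rows are written to disk once,
    -- in the staging table, and simply become part of the warehouse table.
    ALTER TABLE sales
      EXCHANGE PARTITION sales_2007_03
      WITH TABLE sales_stage
      INCLUDING INDEXES
      WITHOUT VALIDATION;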

But getting the stuff back is not the same. We are generally interested in a subset of the information stored in the DW: we may be filtering on a group of dimensional attributes and looking at only one or two of the fact measures stored, and we may aggregate this data further to create a yet smaller result set. The problem with conventional DW technology is that the whole result set is brought back to the CPU of the data warehouse server to be manipulated, and as we know this can be slow. The database appliance vendors push pre-processing of the results out to the disk arrays: some vendors use clustered PCs each serving a single disk, and Netezza puts a field programmable gate array between the disk and the backplane. This approach is particularly interesting in that some very sophisticated query logic can be executed next to the disk. The downside to processing “at the disk” is the case where the data interaction is between rows stored on different drives, as it is in data mining and other pattern-finding applications.
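
To make the contrast concrete, a typical retrieval looks something like the hypothetical star query below: a couple of dimensional filters, one measure, and an aggregation that shrinks the answer to a handful of rows – yet a conventional engine still drags all the qualifying rows back to the database CPUs to do the work.

    -- Hypothetical star schema: filter on two dimensions, read one measure,
    -- aggregate down to a small result set.
    SELECT   p.product_category,
             t.calendar_month,
             SUM(s.amount) AS total_amount
    FROM     sales s
             JOIN products p ON p.product_key = s.product_key
             JOIN times    t ON t.time_key    = s.time_key
    WHERE    p.product_category = 'WIDGETS'
    AND      t.calendar_year    = 2006
    GROUP BY p.product_category, t.calendar_month;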

Comments

  1. Noons Says:

    This whole subject is fascinating! I’m approaching it from the angle of someone concerned mostly with application design and overall performance – the largest part of my professional work in the last 10 years and therefore the most familiar to me.

    While others approach it from the angle of: how long does it take to upload all the XML feeds we’re getting now. And that is a completely different problem.

    Yet others are mostly concerned with how to get back at the data in a usable and repeatable timeframe. And get back to it we must, otherwise why store it in the first place?

    All equally valid and important, IMHO.

    Agree entirely that indexing is not always the best solution. I can’t see for example how to maintain complex global indexing when we’re talking hundreds of TB volumes per single table. Note that the problem to me is not indexing itself, it’s how to maintain it in usable time at that volume. And maintain we must: these things are not yet self-running. Maybe that would be a whole new area of research?

    Jim Gray had this interesting concept that involved losing indexing altogether: instead, replace it with data cataloguing, like the old library systems used to do. As such, data would be localized into particular storage units by class – a range or attribute value, or multiple combinations of these – with retrieval being by full scan or local indexing *within* the class(es).

    Personally, I reckon he was spot-on: indexing other than local/partitioned is just not a solution for the large volumes we’re talking about here. Not in its current format. And I can’t see parallelism by itself being able to address the volume delta here.

    It’ll require a few years of tight work between db makers and storage architecture folks. But I reckon it’s the easiest avenue that can scale at these dimensions and well beyond: let’s not forget that the Petabyte requirement is only a stepping stone until the Exabyte dimension takes over.

    Who said IT was stale? Yeah! Right…

  2. Noons Says:

    sometimes I have to wonder if Scott Adams is not listening in…
    http://news.yahoo.com/comics/070311/cx_dilbert_umedia/20071103;_ylt=Akz9NxSvkiHBcYJV.ASCj4gA_b4F
    :-)

  3. Peter Scott Says:

    Often in the past DW designers got away with a compromise – fast load or fast retrieval – but massive databases are now so close to the edge that it is not going to be possible to carry on like that.

    Traditional indexing at a global level is not going to be the answer, but clustering records by some form of key is not going to be the complete answer either – there is a limit on how many dimensions we can sort on without harming the performance of selections based on the unsorted dimensional keys (a sketch of the partition-plus-local-index idea follows below).
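
    A minimal sketch of the “catalogue the data, index locally” idea both comments point at, using invented Oracle-style names: the fact table is range-partitioned by date so each partition behaves like a self-contained storage unit, and the index is LOCAL so it is built and maintained one partition at a time rather than as a single global structure.

        -- Range-partition the fact table so each partition is a self-contained
        -- storage unit; keep the index LOCAL so it is maintained and scanned
        -- partition by partition instead of as one global index.
        CREATE TABLE sales (
          sale_date    DATE          NOT NULL,
          product_key  NUMBER        NOT NULL,
          store_key    NUMBER        NOT NULL,
          amount       NUMBER(12,2)
        )
        PARTITION BY RANGE (sale_date) (
          PARTITION sales_2006 VALUES LESS THAN (DATE '2007-01-01'),
          PARTITION sales_2007 VALUES LESS THAN (DATE '2008-01-01')
        );

        CREATE INDEX sales_product_ix ON sales (product_key) LOCAL;

    Queries that include the partition key only touch the relevant partitions; the trade-off, as the comment above notes, is that the data can only be physically organised along one such dimension at a time.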

Website Design & Build: tymedia.co.uk