Writing it large and reading it big

Yesterday I ungraciously forgot to mention Nuno Souto’s blog piece from January where he was talking about just the same sort of issues with access to massive databases. Please go and read Noons’ work, it is well worth it.

I mentioned two challenges yesterday, putting the data into the database and finding it again. Systems that can do one of these steps very well often falter with the other process. If you decide to use a very sophisticated tokenized column based database to give you blistering data read access you may find the computing required to work out where to insert the data is too expensive (cost or time) to make the process viable, conversely direct writing of data to disk may give us the best insert rate but something less than optimal for reading. If you index the raw data how much of an overhead is there in maintaining the index in the batch and does that process interfere with concurrent user activity; after all trickle-feed is a now viable dataload strategy? If you don’t index then how do you find what you are looking for? Not that indexes are always a help; take the example of looking for exception values, it is just not common to index fact measures.

By necessity our data load process is inserting one (or hopefully, many) data records into our data warehouse; that is, we insert complete records. Whether they come from flatfiles, XML files, database links, or whatever they are converted into database rows, maybe cleaned up first, but at the end of the day, inserted into the DW – we may be doing fancy stuff on the way with transportable tablespaces, partition exchange or change data capture, but the net effect is the same. All of the data passes through our DW and ends up written to disk.

But getting the stuff back is not the same, we are generally interested in a subset of the information stored in the DW, we may be filtering on a group of dimensional attributes and further only looking at one or two of the fact measures stored. We may also aggregate this data down to create a yet smaller result set. But the problem with conventional DW technology is that the whole result set is brought back to the CPU on the data warehouse to be manipulated and as we know this can be slow. The database appliance vendors push out pre-processing of the results to the disk arrays; some vendors using clustered PCs serving single disks, and Netezza uses a field programmable gate array between the disk and backplane, this is approach is particularly interesting in that some very sophisticated query logic can be executed next to the disk. The downside to processing “at the disk” is the case where the data interaction is between rows stored on different drives – as is the case in data mining and other pattern finding applications