On Relational Databases and Column-Based Storage

"Are Relational Databases Obsolete?" asks Slashdot, pointing to this article by Michael Stonebraker in Computerworld.

Oh no I think, here we go again, another article from the Linux and Java fanboy crowd that's saying that databases are legacy technologies that merely provide storage for an application's objects. Or actually - it's not, and the Slashdot headline is a bit misleading - what the article it links to is saying is that for a particular class of database application, data warehousing and fast retrieval of, for example, customer attributes, the row-based storage mechanism in relational databases is suboptimal compared to the column-based storage mechanisms that are coming on to the market - and he may well have a point there. (Note also that the author of the article is part of a start-up based around this technology)

I talked about column-based databases a year or so ago when I covered Sybase IQ and SAND on this blog. In short, what they do is turn the traditional row-based storage mechanism on it's side, where data is organized into blocks made up of rows of data items, each row made up of the Customer ID, Name, Address, Hair Colour and son on - and instead store data in blocks of columns, with each block made up of all of the Customer Names in a table, then all of the Addresses, then all of the Hair Colours and so on. Where this model becomes impressive is when you need to search through all customer records to find those who's names start with "A", have brown hair and are in socioeconomic group "G" - all the data for names, hair colour and socioeconomic groups is stored next to each other in a small amount of blocks, rather than spread over much more blocks mixed in with all the other customer data in each row. Feedback I've had from people is that when querying large data sets and selecting from sets of attributes (to do, for example, market analysis or customer searches) it can indeed be a storage type that's an order of magnitude faster than traditional row-based databases.

I've never actually worked on a project using a column-based database, but I'd be interested in benchmarking one against an Oracle database for this type of application. Of course with Oracle, you've got bitmap indexes, index-organized tables and so on which speeds up this type of application, and of course with Oracle it's also very efficient at running OLTP application, which is where column-based databases are themselves slow (the same proble, but reversed), but even so, I'd be interested to see how they perform in real life.

See also "One Size Fits All - A Concept Whos Time Has Come and Gone."