On Relational Databases and Column-Based Storage

September 9th, 2007 by Mark Rittman

“Are Relational Databases Obsolete?” asks Slashdot, pointing to this article by Michael Stonebraker in Computerworld.

Oh no I think, here we go again, another article from the Linux and Java fanboy crowd that’s saying that databases are legacy technologies that merely provide storage for an application’s objects. Or actually - it’s not, and the Slashdot headline is a bit misleading - what the article it links to is saying is that for a particular class of database application, data warehousing and fast retrieval of, for example, customer attributes, the row-based storage mechanism in relational databases is suboptimal compared to the column-based storage mechanisms that are coming on to the market - and he may well have a point there. (Note also that the author of the article is part of a start-up based around this technology)

I talked about column-based databases a year or so ago when I covered Sybase IQ and SAND on this blog. In short, what they do is turn the traditional row-based storage mechanism on it’s side, where data is organized into blocks made up of rows of data items, each row made up of the Customer ID, Name, Address, Hair Colour and son on - and instead store data in blocks of columns, with each block made up of all of the Customer Names in a table, then all of the Addresses, then all of the Hair Colours and so on. Where this model becomes impressive is when you need to search through all customer records to find those who’s names start with “A”, have brown hair and are in socioeconomic group “G” - all the data for names, hair colour and socioeconomic groups is stored next to each other in a small amount of blocks, rather than spread over much more blocks mixed in with all the other customer data in each row. Feedback I’ve had from people is that when querying large data sets and selecting from sets of attributes (to do, for example, market analysis or customer searches) it can indeed be a storage type that’s an order of magnitude faster than traditional row-based databases.

I’ve never actually worked on a project using a column-based database, but I’d be interested in benchmarking one against an Oracle database for this type of application. Of course with Oracle, you’ve got bitmap indexes, index-organized tables and so on which speeds up this type of application, and of course with Oracle it’s also very efficient at running OLTP application, which is where column-based databases are themselves slow (the same proble, but reversed), but even so, I’d be interested to see how they perform in real life.

See also “One Size Fits All - A Concept Whos Time Has Come and Gone.”

Comments

  1. Jeff Moss Says:

    Colocation of data is possible, to a certain degree within an Oracle database (and probably other RDBMS too).

    IOTs or manual ORDER BY approaches can achieve this, although they are limited to a specific column or set of columns to colocate by.

    Colocation, as I’ve mentioned once or twice on my blog, is useful in enhancing the degree of compression achieved.

    Bitmap and Bitmap Join Indexes can provide orders of magnitude improvement on column based queries like “Brown Hair” and “Socio Economic Group G”.

    It’s funny, I was going through my blog aggregator and read the post by Michael Stonebraker before yours and I thought to myself…lots of “marketing speak there” but no actual substance or, dare I say it, benchmarks…then I began to think, I wonder who he works for…then I read your article.

    I’m not saying there isn’t a market for the column based approach but it does seem like a niche thing to my mind…as you say, it would be nice to do some benchmarking and testing of it’s approach in comparison to Oracle (or other RDBMS).

    I don’t particularly regard Oracle as a “one size fits all” - I regard it as a multi talented offering, which, with the appropriate people using it (to take advantage of the most appropriate features), can achieve excellent results in many disciplines…that’s something I did agree with the Stonebraker article on…it’s the people that make the difference, no matter how clever the product.

  2. Tim Berry Says:

    I would be keen on discovering how column based storage may provide drill down functionalility as the current offering from oracle suits both purposes. We can currently configure the database for fast retrieval using a variety of MI/ aggregation solutions then drill down through standard row based indexes or OLAP heirarchies to the granular set that defined the aggregate.

    It is difficult to see how storing in (potentially as I don’t know the column database design) an aggregate fashion that you could drill down further and back again!

  3. Peter Scott Says:

    I was thinking along the lines of Jeff’s response - how about going for a set of two column IOT tables, the attribute and the surrogate key of the item. We can use one table per attribute; so, the brown haired, group G people are in the set that comes from the join of the two tables on the surrogate keys.

    Of course I have no idea how this would work with real data and the need to update attributes when people move house or dye their hair, or how it would work with non-key searches such the people with ‘I’ as the second letter in their name.

    It could be a “playtime” thing to look at

  4. Mark Rittman Says:

    Hi Tim, Jeff, Pete,

    It’s certainly an interesting subject. Thinking about it, if all the data for a particular column was placed next to each other on disk, rather than spread over many blocks as part of whole rows of (unrelated) data, then logically it should be faster to retrieve. I’m conscious though that other things come into play, IOTs or bitmap indexes can do much the same thing, and there’s all other factors involved including how scalable the servers are, how much duplication of data has to go on if loading into a column-based database, and so on.

    That said though, it’s definately interesting and something I wouldn’t mind raising (informally, if appropriate) at the upcoming BIWA conference, where a lot of the Server Tech PM guys will be attending. Watch this space, as they say.

  5. Marco Gralike Says:

    Apparently I am still attracted in the same stuff (reading this). I once opted IOT in 2005 ;-) Comments posted now sounds like a deja vu. It only fills me with sadness to see how much has been lost after your website problems. Shame that all those good discussions are lost.

  6. Pythian Group Blog » Log Buffer #62: a Carnival of the Vanities for DBAs Says:

    […] Rittman provided a quick explanation of the technological background: columnar data […]

  7. Ileana Somesan Says:

    Hi all,

    is the column-orientation-paradigm another name for vertical partitioning in traditional RDBMS?

  8. Jeff Madsen Says:

    We’re looking at SuperSTAR from Space-Time Research. Early indications are that it’s a very good way of doing OLAP-type stuff at the detail level with a lot of data.

  9. Stewart Bryson Says:

    Mark:
    Concerning you comment: “…if all the data for a particular column was placed next to each other on disk, rather than spread over many blocks as part of whole rows of (unrelated) data…”: in todays database environments where most data is actually stored in SANS using RAID 5, the concept of contiguous space is really meaningless. The abundance of RAID renders antiquated processes like table rebuilds as neglibible for the same reason: data is never going to be contiguous anyway.

    Granted, I don’t understand the inner-workings of the column-based databases, but if they rely on the mere fact that data is stored together, then I am doubtful that the solution would really work all that well. The Oracle functionality listed by many of the posters here–bitmaps indexes especially–actually allow the database to retrieve the row quickly no matter where or how it’s stored, and that is the paradigm that seems most important.

Website Design & Build: tymedia.co.uk