SAND, Sybase IQ and Column-Based Databases
Several months ago a reader contacted me about about a new data warehousing technology he'd come across:
"Mark,
I have been reading a quite enthralling document (sounds nerdy I know), about how to actually achieve ROI on Data Warehousing.
The article is http://www.it-director.com/research.php?productid=748.
Its done by Bloor, and to be honest is really pitching you to use SAP. I'm not going to go there with that one, but I am interested in the concept. Just wanted to know if you heard of any similar such research by Oracle?
Particularly interesting was the discussion on Column Based Relational Databases."
I took a quick look at the article, which appeared to be a piece of work commissioned by SAND Technology Inc., on their new data warehousing technology. I'd heard of SAND being mentioned before but I didn't have any spare time at the time, so I mentally filed it away as "interesting" and didn't think much more of it.
I just happened to be going through some old notes though the other day and I came across the SAND paper again. I've actually taken a look through it now and I'd agree that it does in fact look interesting. What the paper talks about is a company, SAND Technology Inc., that has two products, SAND Searchable Archive and SAND Analytic Server, based on a development of a technology called Column-based Databases.
To try and summarise the technology, instead of storing data in a row-based database (each row containing several columns, which often make up the elements of a transaction) it stores data "column-based". Each column represents one attribute, fact or whatever, with the benefits being that:
- A typical column-based query that accesses just a small number of columns will move around much smaller blocks of data than a row-based database, which has to bring back a whole row of data even for just couple of columns, and
- These column-tables are effectively self-indexing, as you store the data in the correct order for that particular column.
In addition, the SAND products also tokenize the data, replacing town names, product names and so forth with tokens, and compress it. SAND are pitching the product as a replacement for operational data stores, allowing you to keep lots of historic data online and available, quick to query but only taking up a tenth of the space of a row-based database.
Doing a quick google search it appears that Sybase have a similar product, called Sybase IQ. A bit like Oracle with Express, Sybase bought the technology back in the 90's and is now positioning it as a specialised query engine for DSS queries. Further looking around Google showed a number of smaller, niche vendors (often marketing consultancies) with column-based database products, typically used for marketing and sales analysis of datasets with many thousands of attributes.
Now I'd vaguely heard of column-based databases a few years ago, I think it was actually Nigel Pendse who mentioned it to me (although I could be mistaken). It was funny though having taken a look through the SAND and Sybase white papers to come across a technology that clearly was aimed at data warehousing applications that I'd not properly come across before. It's certainly not a technology that Oracle provide, and it's not an OLAP replacement, and I'm not trying to picture my head how it works - talking about column, rather than row, based databases is (to me, with my Oracle upbringing) a bit like talking about a situation where the normal rules of science are altered, so that you can move along the time dimension, but not left and right or up and down, if you see what I mean. I guess this is how OLAP must seem to relational people.
Anyway, I was wondering if any readers of this blog use either of these two products, or any other column-based databases? Any views on how they work, whether the benefit is indeed there? Also, it seems to me that you could build much the same thing in Oracle by creating individual Index-Organised Tables for each of the columns in your dataset, maybe tokenize the data yourself, and use Oracle's table compression feature; am I missing something here or would this be a similar solution? Any feedback or experiences would be very much appreciated. Also, thanks to Chris Balodis for the original email.