Interview with Mike Stonebraker
December 15th, 2008 by Mark Rittman
And now for something a bit different. Some of you in the Oracle data warehousing world might remember Lilian Hobbs, who used to work in the Server Tech part of Oracle and worked on areas such as materialized views, partitioning and information lifecycle management, and who was co-author of Oracle Database 10g Data Warehousing, published by Digital Press. Well, Lilian left Oracle a year or so ago but we stayed in touch, and then I noticed that she’d ended up working with Vertica, one of the new column-store databases that compete with Oracle in the high-end data warehousing space. What makes Vertica particularly interesting is that Mike Stonebraker, one of the original figures in the relational database world, was involved in the formation of Vertica and Lilian offered to set an interview with him up for the blog.
Now I’m certainly aware that there’s quite a few differing opinions out there on what the best architecture for data warehouse databases is, and of course Oracle recently released Exadata and the HP Oracle Database Machine which competes directly in this space. Vertica, and other products like Sybase IQ and SAND address the problem in a different way in that they store data “column-orientated”, so I thought it’d be interesting to hear from Mike why he thinks column-store is the ideal solution for DSS-style applications. Feel free to add comments to the posting; given that columns-store and Exadata seem to be a hot-topics I’d be keen to hear any insights into the two technologies.
Anyway, here’s the interview, and thanks to Mike and Lilian for setting this up:
[Mark Rittman] Mike, you have a distinguished history in the database market, being involved in the creation of databases such as Ingres and Postgres, and now being involved in a new startup, Vertica. Tell us a bit about how you became involved in databases, how you got involved with Vertica and what your new company is offering?
[Mike Stonebraker] I received a PhD from the University of Michigan in Computer Science in 1971 and joined the faculty at the University of California, Berkeley. As you know an Assistant Professor must “publish or perish”. It was clear that my PhD topic was a dead end, and I had to find something else. It was a happy accident that I stumbled into the DBMS area in 1972.
At the time, Codd’s pioneering paper had just appears, and there was immediate controversy over whether the relational model could be efficiently implemented. Hence, I started (with Eugene Wong) the Ingres project to demonstrate the feasibility of Codd’s ideas.
After building and commercializing Ingres over the next 8 years, I moved on to build Postgres. In fact, Vertica is the 5th DBMS that I have built, first as a research prototype and then commercialized.
At some point I realized that column stores would clobber row stores in the data warehouse marketplace, and set about building a research prototype at M.I.T. (with help from a bunch of other professors and students). The success of this prototype (C-Store) led to Vertica Systems, which I founded (with Andy Palmer)
[Mark Rittman] Vertica to my understanding differs from the Oracle database in that it stores data in a “column-based” form. Tell us a bit more about this, how it differs to the Oracle database and why you think it has certain advantages?
[Mike StoneBraker] Most data warehouses either use a star schema or something close to a star schema. The central “fact” table in such a schema usually has many attributes. 50 attributes is very common and 200 is not unusual. So assume a fact table with 60 attributes.
However, the query from a business analyst invariably only accesses 3 or 4 of these 60 attributes.
A “row store” such as Oracle will read all 60 attributes because the attributes needed are intermixed with all of the others. In contrast, a column store, such as Vertica, will read only the relevant attributes. Note that this will result in a factor of 15-20 less I/O activity.
In addition, a column store has other benefits. Compression is more effective in a column store than a row store, because any storage block only contains one kind of thing. In contrast a storage block in Oracle contains many different kinds of objects. Obviously, it is easier to compress one thing than many things. In head-to-head bakeoffs, Vertica compression is typically a factor of 3 better than Oracle’s. This further lowers I/O activity in a column store.
In customer bakeoffs against Oracle, Vertica has yet to win a benchmark by less than a factor of 30, largely because of these two considerations.
[Mark Rittman] Vertica also has a “shared-nothing” architecture. Again, what does this mean, how does it differ from Oracle’s Real Application Clusters, and why do you think this is a superior architecture for data warehousing?
[Mike Stonebraker] A “shared nothing” architecture means that each computer system has its own private memory and private disk. This architecture is followed by essentially all high performance, scalable, DBMSs, including Teradata, Netezza, Greenplum, Paraccel, DB2 and Vertica. It is also used by most of the high-end e-commerce platforms, inclusing Amazon, Akamai, Yahoo, Google, and Facebook.
Oracle RAC does not run on a shared nothing system. It was built many years ago to run on a “shared disk” architecture. In this world, a computer system has private memory but shares a disk system with other computer systems. Such a “disk cluster” was popularized in the 1990′s by Sun and HP, among others In the 2000′s this architecture has been replaced by “grid computing”, which uses shared nothing.
Shared disk has well-known scalability problems, when applied to DBMSs. These result from the requirement that each system have its own lock table and buffer pool, which must be synchronized with their peers. This synchronization is painful and has serious performance problems, which limit the scalability of shared disk implementations. In fact, the largest Oracle RAC configuration I am aware if is 6 systems.
In contrast shared nothing DBMSs, such as Vertica, are routinely run on 50-100 nodes.
In summary, shared nothing scales much better than shared disk. hence, it is the superior architecture.
[Mark Rittman] Now that Oracle have released Exadata and the HP Oracle Database Machine, with predicate pushing, return of just the result set from a query and a shared-nothing disk architecture, does this not negate the benefits of Vertica? Why should an Oracle customer invest in another database technology?
[Mike Stonebraker] I have 2 comments concerning this question.
First, Exadata is intended to overcome the scalability limitations of RAC. Hence, it is an appliance with a total of 120 CPUs in 22 systems. Clearly, 22 is better than 6, and Oracle has managed to “push out” the RAC limit noted above. However, the scalability limit is still vastly lower than that available in a shared nothing system, such as Vertica.
Second, Exadata is still Oracle and is still a row store. Vertica on 120 systems will outperform Exadata by the same factor of 30-100. Hence, one can run Vertica on a comparable hardware platform and get 1-2 orders of magnitude better response time. Alternately, one can run Vertica on a low end server and get comparable performance to that offered by Exadata on a huge complex of machines.
[Mark Rittman] You presented a paper at the 2007 VLDB Conference on something called the H-Store project, that proposed running high-end OLTP databases entirely in RAM with SQL being replaced by Ruby or Python. Tell us a bit about this project and what it means to the typical Oracle (or even Vertica) customer?
[Mike Stonebraker] Oracle is selling “one size fits all” — i.e. a single DBMS code line intended to solve any kind of application. Vertica is proof that one can do vastly better with a specialized architecture oriented to data warehousing. In fact, a different specialized architecture will run circles around Oracle on OLTP. The details would take me sometime to explain, so I would like to refer the interested reader to the H-store literature. If you google “H-store VLDB”, you will get to three recent papers that explain the system in detail.
[Mark Rittman] Although Vertica appears to offer several benefits to data warehouse customers, how would you propose incorporating a column-store database into a typical business intelligence & data warehousing architecture that also needs to incorporate for example OLAP, Data Mining and Master Data Management technologies? Does it replace the Oracle database, OLAP and so on?
[Mike Stonebraker] I think any CIO would be crazy to suggest “cold turkey”, i.e. replacing one component of a complex system with another by flipping a switch. Instead, I would recommend that he set up a new system, such as Vertica, in parallel with his existing production system. His ETL load system would load both, and he could direct queries to either one. Over time he could shift his load onto the new engine and eventually disconnect the old one. Such a gradual transition from old to new technology is way less ricky than “cold turkey”
[Mark Rittman] Mike, you’ve been involved in the database industry for many years. How do you see the industry going in the future, particularly with regard to high-end business intelligence and data warehousing customers?
[Mike Stonebraker] Think petabytes — and eventually exabytes.


December 16th, 2008 at 2:08 am
Thanks Mark
Nice one !
December 16th, 2008 at 11:27 am
Very interesting, I’d like to see it in action.
We see from the widespread use of Microsoft, and Oracle that standardisation is a secret to success. It does not have to be the best solution, there are other factors which come into play – Support, development costs, skills base, ease of use, etc.
Oracle have been giving away the software for free, to help developers understand it, and there is a huge skills base, with mountains of books to read – How does a new db on the block compete with this?
December 17th, 2008 at 9:07 am
Very great article. I’m a beginner in this area and I enjoy everyday to read your team.
The column storage is a great idea but I don’t have until now the chance to try it.
In the same domain, you have also the “vertical partitionning” which is a solution for the same problem (more than 400 column in the fact table for only a few used).
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1277688700346608142
Happy Christmas
December 17th, 2008 at 2:50 pm
I think that vertical partitioning is probably most similar to Vertica, but a better technique in Oracle is to provision the system with one or more materialized views that have a subset of fact table columns. I made a comment about that on a previous posting — I think it was about experiences I’d had with Siebel Analytics.
Mike Stonebraker’s point that it is more efficient to query a table of 3 commonly used columns instead of projecting them from a table of all 200 columns is undeniably true, however as Oracle practitioners we do not have to accept a scan of a 200 column table as our only choice. We have the tools at hand to define skinny tables and aggregate tables (and combinations of both using HOLAP MV’s or OLAP MVs) that allow the optimizer to choose which one to query. The tables compress better, of course, and for single columns we can also provision extremely fast bitmap indexes.
So while Vertica tackles this particular problem, Oracle has its own solution for exactly that same problem.
I recall about 6 years ago that the business management at the organisation I was working for approached us, the technical team, with a proposition from a company that provided an indexing middle-tier between applications and Oracle data warehouses. Their major selling point was the acceleration of queries based on single columns — “count 60 million rows in two seconds!” … “add 100 million rows in 20 seconds!” … that sort of thing. In every case we were able to provide comparisons with real world performance benchmarks from our Oracle 9i system based simply on bitmap indexes or other common techniques.
I’m afraid that I see Vertica as something of a similar one-trick pony, and I haven’t seen any advertised capabilities that would not be more simply implemented within Oracle. I wonder whether in any of the cases that are quoted in favour of Vertica’s performance gains, were any attempts made to use Oracle to match those improvements by using these techniques?
December 29th, 2008 at 3:33 pm
It is a common misconception that column-stores are the same thing as as vertically partitioning a row-store like Oracle. See my blog post:
http://www.databasecolumn.com/2008/07/debunking-another-myth-columns.html
that debunks this myth.
Daniel Abadi
Assistant Professor
Yale University
January 12th, 2009 at 1:46 pm
Hi,
An interesting interview indeed.
I am yet to compare what Mike Stonebraker just said, with corressponding points from Oracle.
Let me search oracle’s answer to this.
A few years ago Ant-DB claimed it was 3 times faster than Oracle (leave aside sybase, SQL server etc), Around 4 years have passed snice I heard that and I am yet to see their installations around.
CUIL also says they can overapower google.
They say, Reality is stranger than fiction.
Let’s wait and watch.
At the same time, I have full confidence in Mike Stonebraker’s capabilities too (having seen his last 25 yrs career track).
Cheers!
Param
send2param@yahoo.com