Thoughts on extremely large databases and searching the unstructured

January 15th, 2007 by

Nuno Souto posts an interesting set of thoughts on Extremely Large Databases. As usual, it is a well-thought-out post from someone who is probably scarred for life from actually working with large databases. In a data warehouse (or even a very large transactional system) context, the reader is led to the inevitability of partitioning (distributing) the data so that techniques such as indexing or table scans can complete in a usably short time frame.

It can be argued that, coupled with sensible partitioning schemes, we can minimise IO in a data warehouse (of any large size) by building pre-aggregated summary tables: the pain of the build query happens just once instead of each time the aggregate is needed, and by getting the partition granularity right we can limit aggregation work to just the partitions that need a rebuild.
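To make the idea concrete, here is a minimal sketch in Python rather than in any particular database. The class, its method names and the monthly partition keys are all invented for illustration; a real warehouse would do this with partitioned summary tables or materialized views. The point is only that a load marks its own partition as stale, and a refresh re-aggregates the stale partitions and nothing else.

```python
from collections import defaultdict


def build_partition_summary(rows):
    """Aggregate one partition's rows into {product: total_amount}."""
    totals = defaultdict(float)
    for product, amount in rows:
        totals[product] += amount
    return dict(totals)


class PartitionedSummary:
    """Hypothetical summary store: one pre-aggregated summary per partition."""

    def __init__(self):
        self.partitions = {}    # partition key -> list of (product, amount)
        self.summaries = {}     # partition key -> {product: total}
        self.dirty = set()      # partitions whose summary is stale
        self.rebuild_count = 0  # how many partition rebuilds we have paid for

    def load(self, partition_key, rows):
        """Append rows to one partition and mark only that partition stale."""
        self.partitions.setdefault(partition_key, []).extend(rows)
        self.dirty.add(partition_key)

    def refresh(self):
        """Re-aggregate stale partitions only; clean ones are left alone."""
        for key in sorted(self.dirty):
            self.summaries[key] = build_partition_summary(self.partitions[key])
            self.rebuild_count += 1
        self.dirty.clear()

    def total(self, product):
        """Answer a query from the summaries, never from the base rows."""
        return sum(s.get(product, 0.0) for s in self.summaries.values())


summary = PartitionedSummary()
summary.load("2007-01", [("widget", 10.0), ("gadget", 5.0)])
summary.load("2007-02", [("widget", 7.5)])
summary.refresh()                            # both partitions are new: 2 rebuilds
summary.load("2007-02", [("widget", 2.5)])   # late-arriving data for February
summary.refresh()                            # only 2007-02 is rebuilt
```

After the second refresh the January summary has been built exactly once, however many times February is reloaded. If the partition grain were a year instead of a month, the same late-arriving row would force a twelve-times-larger rebuild, which is why the granularity matters.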

But what happens when we stray from the domain of structured DSS-type queries? That is, away from those queries where (even for ad-hoc queries) we remain able to use our wisely chosen aggregates or exploit the partitioning scheme of the base data. Suppose we look at data mining, where the statistical relationships between data items are being determined, or we search unstructured data for patterns or relationships, or just index its content. And nowadays unstructured data is not just text (if it ever was): we have speech and speaker recognition, and image recognition ranging from the (now) trivial OCR, through fingerprints and facial recognition, to the more complex ability to search libraries of image components to identify photographic locations. These require inventive indexing techniques, but they also require fast access to the underlying data, and for big datasets is that going to be possible?

Maybe ELDBs are not the way to go for unstructured data or data that needs extensive analysis. Perhaps the way to go here is through database federation, keeping the computation close to the disk and coordinating the outputs to produce the end result.
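As a rough illustration of what "computation close to the disk" could look like, the Python sketch below uses a scatter-gather pattern: each node searches only the documents it holds and returns a small list of its best hits, and a coordinator merges those lists. Everything here is an assumption made for the example (the Node class, the word-count scoring, threads standing in for separate servers); it is not a description of any particular federated product.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Hypothetical federation member that owns a slice of the documents."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # doc_id -> text, held locally on this node

    def local_search(self, term, k):
        """Scan only local data and return this node's k best (score, doc_id) hits."""
        term = term.lower()
        hits = []
        for doc_id, text in self.documents.items():
            score = text.lower().split().count(term)  # naive score: term frequency
            if score:
                hits.append((score, doc_id))
        return heapq.nlargest(k, hits)


def federated_search(nodes, term, k=3):
    """Scatter the query to every node, then gather and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: node.local_search(term, k), nodes))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)


nodes = [
    Node("a", {"a1": "disk disk scan", "a2": "index only"}),
    Node("b", {"b1": "disk seek", "b2": "tape"}),
]
best = federated_search(nodes, "disk", k=2)  # [(2, 'a1'), (1, 'b1')]
```

Only the short hit lists cross the network; the bulk data never leaves the node that stores it. Because any document in the global top k must also be in its own node's top k, merging the per-node lists gives the same answer as one big scan would.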


  1. Noons Says:

    well, not exactly “scarred for life”!

    but there are a few marks here and there, for sure.

    I must admit one thing though: mostly caused by folks listening too close to the presentations instead of working the maths of “is it feasible?”.

    Back in the 70s and most of the 80s, no major project would ever be attempted without what was then called a feasibility study.

    Which covered not only “bang for buck” but also the very simple and humble: “are we talking lah-lah-land or can this thing fly”?

    Somewhere along the “risk analysis” revolution, someone forgot to teach folks they can analyze all the risks they want but: if the blessed thing is not flight capable, it WILL hit the ground.

    And that’s why I remember “experts” telling me quicksort was “much faster” than Oracle ORDER BY. Therefore I should consider denormalizing my tables into flat files, sort them outside the db and then upload it all again, normalized:

    that HAD to be faster than ORDER BY! “Are you listening, boy? Oracle says we’re the experts, you better listen good and do as told!”

    These were the “experts”. You should have heard the OCPs.

    Any wonder why I get so violent when I hear a marketeer trying to tell me I should “shut up and listen”?

    yup!… :-)

    Anyways: 100% in agreement with most of your points. The only thing I’d add is something I’ve said before: it’s getting harder and harder to draw the line at what is a DW, what is a DM and what is a DSS db. It’s all blurring.

    Personally, I think they will blur even more. With the right kind of hardware and matching software, there is no reason why that shouldn’t be the case.

    But the warning is there for anyone eaves-dropping: do NOT listen to marketing! Do a proper study! Or get ready for scars and pain…

  2. naresh Says:

    From Noons’ comment: “it’s getting harder and harder to draw the line at what is a DW, what is a DM and what is a DSS db”

    what is a DSS db? A DM (Data Mart) is probably a mini DW that contains data for a particular business/application area. This is the first time I hear DSS db.

  3. Peter Scott Says:

    DSS = Decision Support System – maybe an old fashioned term these days but Noons & I are old ;-)

  4. Pythian Group Blog » Log Buffer #28: a Carnival of the Vanities for DBAs Says:

    […] Scott responded on Pete-s random notes with his thoughts on extremely large databases and searching the unstructured. He says, “Maybe (Extremely Large Databases) are not the way to go for unstructured data or […]
