Looking at the unstructured

Curt Monash has been talking about text mining a lot this week. He also notes that, from a text point of view, the four preeminent database vendors (data store, not query tool) are Oracle, Microsoft, Teradata and Netezza, which seems to reflect my experience of what is going on in the BI space as a whole.

I first started out in the intelligence space (deliberately omitting the word business) working on high-performance text stores. In reality this was all about storing streams of text in a highly searchable form; in effect the text was held as an index. Here I came across such then-esoteric concepts as tokenisation, tagging, synonym dictionaries and stop-words; however, I did not use stop-words, as every word stored (even if misspelt) was potentially significant. Since those days the analysis of free text has become more mainstream. Linguistics academics use it to unravel language semantics (my wife's dissertation was on computational linguistics), and others I know have worked on voice-to-text systems capable of identifying speakers and what they say. But although text is the most accessible (or understandable) form of unstructured data, increasingly people need to search binary data, whether it be biometrics, images, DNA sequences, spatial data or whatever.
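To make the "text held as an index" idea concrete, here is a minimal Python sketch of an inverted index that, as described above, indexes every token (misspellings included) and deliberately skips stop-word removal. The tokeniser and class names are illustrative inventions, not drawn from any real system I worked on.

```python
import re
from collections import defaultdict

def tokenise(text):
    """Lower-case and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

class InvertedIndex:
    """Stores text purely as an index: every token, even a typo, is searchable."""
    def __init__(self):
        self.postings = defaultdict(set)  # token -> set of document ids

    def add(self, doc_id, text):
        # No stop-word filtering: "the", "a", misspellings all get postings
        for token in tokenise(text):
            self.postings[token].add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())

idx = InvertedIndex()
idx.add(1, "The suspect bought biscuits")
idx.add(2, "Teh suspect bought cookies")  # misspelling is still indexed
print(idx.search("teh"))  # {2} -- even a typo is potentially significant
```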

Finding single records that contain text fragments, topics or DNA patterns can be an interesting indexing challenge, but when we start looking for patterns that occur across multiple records we move well away from the normal data mining domain. Determining that a particular DNA sequence correlates with an increased risk of disease is a complex task; there is little reason (apart from the sheer scale of the computational task) why DNA samples could not be used to predict facial appearance. Computational problems exist in text analysis too, especially where rich vocabularies make semantic matching difficult: how do you get cookie to match biscuit in some contexts, and biscuit to match brown in others?
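One naive way to attack the cookie/biscuit problem is a sense-tagged synonym dictionary, where the sense is chosen by looking at surrounding context words. The sketch below is purely illustrative: the dictionary entries, cue words and function names are all invented for the example, and real systems would use far richer disambiguation.

```python
# Sense-tagged synonym dictionary (hypothetical data for illustration)
SYNONYMS = {
    ("biscuit", "food"): {"cookie"},
    ("biscuit", "colour"): {"brown", "beige"},
}

# Context cue words that hint at each sense (also hypothetical)
CONTEXT_CUES = {
    "food": {"eat", "bake", "tea", "snack"},
    "colour": {"paint", "shade", "wall", "fabric"},
}

def infer_sense(term, context_tokens):
    """Pick the sense whose cue words overlap most with the context."""
    best_sense, best_overlap = None, 0
    for (word, sense) in SYNONYMS:
        if word != term:
            continue
        overlap = len(CONTEXT_CUES.get(sense, set()) & set(context_tokens))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

def expand(term, context_tokens):
    """Expand a query term with synonyms appropriate to its inferred sense."""
    sense = infer_sense(term, context_tokens)
    return {term} | SYNONYMS.get((term, sense), set())

print(expand("biscuit", ["bake", "tea"]))           # {'biscuit', 'cookie'}
print(expand("biscuit", ["paint", "the", "wall"]))  # {'biscuit', 'brown', 'beige'}
```

Even this toy version shows why the problem is computationally awkward: every query term potentially fans out into different synonym sets depending on context, and the context itself has to be analysed before the match can be made.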