Indexing the unusual

For many years I had an interest in non-standard indexing and exotic data types, that is, things that weren't NUMBER or VARCHAR2. In fact, before I came into data warehousing I was involved in indexing free text such as conversation transcripts and narrative reports; some of this was pushing the technology of the time, but it was achievable. As Noons pointed out in passing in a response to yesterday's piece, technology moves on, and we will soon have to resolve the challenge of developing new indexing techniques to cope with the grossly unstructured, such as HD video and recorded sound.

I briefly discussed this last year on my blog and probably also over at Datageekgal's blog (and congratulations Beth on the new job!). The indexing needs of the security services, medicine and a whole host of other organisations that need to index patterns within a LOB-type object will spawn some pretty clever indexing methods, and hopefully some of those will become accessible through database vendors' products.

In a way there are some similarities with data mining, except that a LOB could (I think would) contain bit patterns for more than one index key value. We are probably talking about non-unique indexes, since for non-archival purposes researchers are usually concerned with finding similar records.

But one good thing about the need to index LOB contents is that they are usually non-volatile: a recorded conversation or a DNA sequence is historical fact and is not going to change, so perhaps index updates are not going to be important. Most of the building blocks to do this type of indexing are already available (especially if we choose to create an index of the "index" by using some form of indirect table approach); the only bit left to do is to write the domain-specific code to identify the key values in the LOB... hang on, that's the hard bit!
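
To make the indirect table idea a little more concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is only an illustration under some loud assumptions: the table names (lob_store, lob_keys), the PATTERNS dictionary and the extract_keys function are all made up for this example, and extract_keys is a trivial byte-pattern match standing in for the genuinely hard domain-specific code. The point is just the shape: one LOB can yield several key values, the key table is indexed non-uniquely, and because the LOBs do not change the key table is populated once at load time.

```python
import sqlite3

# Hypothetical key patterns: stand-ins for whatever a real,
# domain-specific extractor would recognise inside a LOB.
PATTERNS = {"motif_a": b"GATTACA", "motif_b": b"CCGG"}


def extract_keys(blob):
    """Placeholder for the hard bit: return the set of key values found in a LOB."""
    return {name for name, pattern in PATTERNS.items() if pattern in blob}


def build_index(conn, lobs):
    """Store each LOB once and record its key values in an indirect key table."""
    conn.execute("CREATE TABLE lob_store (lob_id INTEGER PRIMARY KEY, content BLOB)")
    conn.execute("CREATE TABLE lob_keys (key_value TEXT, lob_id INTEGER)")
    # Non-unique index: many LOBs can share a key, and one LOB can have many keys.
    conn.execute("CREATE INDEX lob_keys_ix ON lob_keys (key_value)")
    for lob_id, blob in lobs.items():
        conn.execute("INSERT INTO lob_store VALUES (?, ?)", (lob_id, blob))
        conn.executemany(
            "INSERT INTO lob_keys VALUES (?, ?)",
            [(key, lob_id) for key in sorted(extract_keys(blob))],
        )
    conn.commit()


def find_lobs(conn, key_value):
    """Return the ids of all LOBs that contain the given key value."""
    rows = conn.execute(
        "SELECT lob_id FROM lob_keys WHERE key_value = ? ORDER BY lob_id",
        (key_value,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_index(conn, {1: b"AAGATTACATT", 2: b"TTCCGGAA", 3: b"GATTACACCGG"})
print(find_lobs(conn, "motif_a"))
print(find_lobs(conn, "motif_b"))
```

Searches never touch the LOBs themselves, only the small key table, which is the attraction of the indirect approach; all the cleverness (and all the cost) sits in whatever replaces extract_keys.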