<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rittman Mead Consulting &#187; Technology</title>
	<atom:link href="http://www.rittmanmead.com/category/technology/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rittmanmead.com</link>
	<description>Delivering Oracle Business Intelligence</description>
	<lastBuildDate>Mon, 06 Feb 2012 21:18:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Indexing the unusual</title>
		<link>http://www.rittmanmead.com/2008/03/indexing-the-unusual/</link>
		<comments>http://www.rittmanmead.com/2008/03/indexing-the-unusual/#comments</comments>
		<pubDate>Mon, 10 Mar 2008 16:18:20 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[dw]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/03/10/indexing-the-unusual/</guid>
		<description><![CDATA[For many years I had an interest in non-standard indexing and exotic data types, that is, things that weren&#8217;t NUMBER or VARCHAR2. In fact, before I came into data warehousing I was involved in indexing free text such as conversation transcripts and narrative reports; some of this was pushing the technology of the [...]]]></description>
			<content:encoded><![CDATA[<p>For many years I had an interest in non-standard indexing and exotic data types, that is, things that weren&#8217;t NUMBER or VARCHAR2. In fact, before I came into data warehousing I was involved in indexing free text such as conversation transcripts and narrative reports; some of this was pushing the technology of the time, but was achievable. As <a href="http://dbasrus.blogspot.com/" target="_blank">Noons</a> pointed out in passing in a response to yesterday&#8217;s <a href="../2008/03/09/data-warehouses-are-not-dead-yet/" target="_blank">piece</a>, technology moves on and we will soon have to resolve the challenge of developing new indexing techniques to cope with the grossly unstructured, such as HD video and recorded sound.</p>
<p>I briefly discussed this last year on my blog and probably also over at <a href="http://datageekgal.blogspot.com/" target="_blank">Datageekgal&#8217;s</a> blog (and congratulations Beth on the new job!). The indexing needs of the security services, medicine and a whole host of organisations that need to index patterns within a LOB type object will spawn some pretty clever indexing methods, and hopefully some of those will become accessible through database vendors&#8217; products.</p>
<p>In a way there are some similarities with data mining, except that a LOB could (I think would) contain bit patterns for more than one index key value. We are probably talking about non-unique indexes, as for non-archival purposes researchers are usually concerned with finding similar records.</p>
<p>But one good thing about the need to index LOB contents is that they are usually non-volatile: a recorded conversation or a DNA sequence is historical fact and is not going to change, so perhaps index updates are not going to be important. Most of the building blocks to do this type of indexing are already available (especially if we choose to create an index of the &#8220;index&#8221; by using some form of indirect table approach); the only bit left to do is to write the domain-specific code to identify the key values in the LOB&#8230; hang on, that&#8217;s the hard bit!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/03/indexing-the-unusual/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Oracle Data Integration Suite</title>
		<link>http://www.rittmanmead.com/2008/02/oracle-data-integration-suite/</link>
		<comments>http://www.rittmanmead.com/2008/02/oracle-data-integration-suite/#comments</comments>
		<pubDate>Tue, 05 Feb 2008 20:25:35 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Application Server]]></category>
		<category><![CDATA[Oracle Data Integrator]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/02/05/oracle-data-integration-suite/</guid>
		<description><![CDATA[Keeping my head down building Oracle 11g OLAP cubes for research and self-education meant that I missed yesterday&#8217;s product announcement from Oracle, but with the wonders of blog aggregators (and in particular Beth&#8217;s) I spotted a mention on Vincent McBurney&#8217;s blog of the newly announced (and available) Oracle Data Integration Suite. This is one of [...]]]></description>
			<content:encoded><![CDATA[<p>Keeping my head down building Oracle 11g OLAP cubes for research and self-education meant that I missed yesterday&#8217;s product announcement from Oracle, but with the wonders of blog aggregators (and in particular <a href="http://feeds.feedburner.com/InfoQualityAggregator" target="_blank">Beth&#8217;s</a>) I spotted a mention on <a href="http://blogs.ittoolbox.com/bi/websphere/archives/with-a-wave-of-the-magic-wand-oracle-produces-a-data-integration-suite-22287" target="_blank">Vincent McBurney&#8217;s blog</a> of the newly announced (and available) <a href="http://www.oracle.com/technologies/integration/odi-suite.html" target="_blank">Oracle Data Integration Suite</a>.</p>
<p>This is one of the fruits of Oracle&#8217;s recent purchases of Hyperion (Data Relationship Manager), Tangosol (Coherence *) and Sunopsis (Data Integrator), with a bit of Application Server, BPEL and Enterprise Service Bus thrown in, plus the ability to use an embedded data quality and profiling product from Trillium *.</p>
<p>* Coherence and the data quality options are add-ons to the base ODI Suite.</p>
<p>This looks interesting; I might write more on it later.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/02/oracle-data-integration-suite/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Virtually there</title>
		<link>http://www.rittmanmead.com/2007/11/virtually-there/</link>
		<comments>http://www.rittmanmead.com/2007/11/virtually-there/#comments</comments>
		<pubDate>Wed, 21 Nov 2007 20:15:35 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Apple]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/11/21/virtually-there/</guid>
		<description><![CDATA[I have a couple of days between assignments so I am spending a little time doing research. Ultimately, there is a piece to write on Oracle 11g OLAP, and some course notes on aspects of data warehousing. But to do this sort of thing I need to get myself set up with a database (Enterprise Edition [...]]]></description>
			<content:encoded><![CDATA[<p>I have a couple of days between assignments so I am spending a little time doing <em>research.</em> Ultimately, there is a piece to write on Oracle 11g OLAP, and some course notes on aspects of data warehousing. But to do this sort of thing I need to get myself set up with a database (Enterprise Edition &#8211; I will need partitions, OLAP and all of the other VLDB bells and whistles) for self-education purposes. In my old job I had a small database running under Windows on my laptop and access to several databases on Unix-style systems (Solaris, AIX, HP), so I could simply knock up a quick test to show me what happens and to get the syntax right. But now I have an Apple MacBook Pro and no Windows or Unix (I do know that OS X is a Unix, but that&#8217;s beside the point) to run Oracle on.</p>
<p>Step up VMware Fusion &#8211; I have relatively rapidly built an Oracle Enterprise Linux 5 64-bit guest VM to run on my Apple, cloned it as my base build, mastered how to change the name in the VM&#8217;s screen header so that I know which VM I am running, and proceeded to install the 11g database software and a small test database. This is great fun, as before this week I had not touched Linux (let alone GNOME) &#8211; true, I could speak Solaris, Dynix and AIX, but it is so long since anyone allowed me a root password. So now I have two Linux servers: an <em>empty</em> one that I can base subsequent builds on, and a working 11g database.</p>
<p>I suppose the next thing is to build a Windows VM for the stuff that needs Microsoft Windows (OBI SE-ONE, for example) &#8211; but why is Windows so expensive to license? If I do go with the clean &#8216;gold-build&#8217; VM and clone it before adding the minimal software required for the purpose in hand, I will need to watch how I clone the machine &#8211; or else VMware will change the MAC address of my virtual network adaptor and break the license key.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/11/virtually-there/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Looking at the unstructured</title>
		<link>http://www.rittmanmead.com/2007/10/looking-at-the-unstructured/</link>
		<comments>http://www.rittmanmead.com/2007/10/looking-at-the-unstructured/#comments</comments>
		<pubDate>Sat, 06 Oct 2007 21:43:30 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/10/06/looking-at-the-unstructured/</guid>
		<description><![CDATA[Curt Monash has been talking about text mining a lot this week. He also notes that, from a text point of view, the four preeminent database vendors (data store not query tool) are Oracle, Microsoft, Teradata and Netezza; this seems to reflect my experiences of what is going on in the BI space as [...]]]></description>
			<content:encoded><![CDATA[<p>Curt Monash has been talking about <a href="http://www.dbms2.com/2007/10/05/the-four-horsemen-of-data-warehousing/" target="_blank">text mining</a> a lot this week. He also <a href="http://www.dbms2.com/2007/10/05/the-four-horsemen-of-data-warehousing/" target="_blank">notes</a> that, from a text point of view, the four preeminent database vendors (data store not query tool) are Oracle, Microsoft, Teradata and Netezza; this seems to reflect my experiences of what is going on in the BI space as a whole.</p>
<p>I first started out in the intelligence space (deliberately omitting the word <em>business</em>) working on high performance text stores. In reality this was all about storing streams of text in a highly searchable form; in effect the text was held as an index. Here I came across such then-esoteric concepts as tokenisation, tagging, synonym dictionaries and stop-words; however, I did not use stop-words, as every word stored (even if misspelt) was potentially significant. Since those days analysis of free text has become more mainstream. Linguistics academics use it to unravel language semantics (my wife&#8217;s dissertation was on computational linguistics). Others I know have worked on voice-to-text systems capable of identifying speakers and what they say. But although text is the most accessible (or understandable) form of unstructured data, increasingly people need to search binary data, whether it be biometrics, images, DNA sequences, spatial data or whatever.</p>
<p>Finding single records that contain text fragments, topics or DNA patterns can be an interesting indexing challenge, but when we start looking for patterns that occur across multiple records we move well away from the normal data mining domain. Determining that a particular DNA sequence correlates with increased risk of disease is a complex task; there is little reason (apart from the enormity of the computational task) that DNA samples could not be used to predict facial appearance. Computational problems also exist in text analysis, especially where rich vocabularies make semantic matching difficult &#8211; so how do you get cookie to match biscuit in some contexts and biscuit to match brown in others?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/10/looking-at-the-unstructured/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Super models</title>
		<link>http://www.rittmanmead.com/2007/07/super-models/</link>
		<comments>http://www.rittmanmead.com/2007/07/super-models/#comments</comments>
		<pubDate>Wed, 18 Jul 2007 20:10:04 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/07/18/super-models/</guid>
		<description><![CDATA[No, not the size-zero (or below) fashion-sticks that pogo across the catwalks of the world enticing the somewhat-larger to buy, but the modeling of all of the information within an organisation in a single unified form. Said like that, it&#8217;s simple, but in reality there is a lot of complexity buried away that needs to be [...]]]></description>
			<content:encoded><![CDATA[<p>No, not the size-zero (or below) <em>fashion-sticks</em> that pogo across the catwalks of the world enticing the <em>somewhat-larger</em> to buy, but the modeling of all of the information within an organisation in a single unified form. Said like that, it&#8217;s simple, but in reality there is a lot of complexity buried away that needs to be teased out and resolved (or, pragmatically, glossed over).</p>
<p><strong>In the beginning there was the product</strong>, and then came the customers, the systems that counted the money made from selling, the mergers and acquisitions that brought in new products (and thinking, processes and models) and finally there was the desire to tie all of this together in one holistic view of the enterprise. And there is our first problem: dotted throughout the company are fragments of a model, each built to suit a specific purpose. How an ERP system such as SAP treats product may not be the same as how a product lifecycle system sees it, and it may well not be sensible to coerce the two models to look the same (have you ever wanted to make a major change to an ERP system just for the sake of it?).</p>
<p>As my e-friend Beth <a href="http://datageekgal.blogspot.com/2007/05/data-quality-standardizing-product.html" target="_blank">mentioned</a>, unifying product is one of the trickier problems in creating a super model. Take for example an ice cream maker. Product development devise hundreds of recipes and track them on their product development system; the best of these recipes go to taste trials and the results are recorded; the winners go through to the marketeers, who have the final say on whether they think that <em>fig and chili</em> ice cream will really be a commercial success. Then production recipes are created and costed, food standards clearances obtained, launch budgets set and sales performance targets cast. At each stage we are dealing with the same product, with the potential need to track back through this data (think compliance auditors). Then add in the need to combine all of our systems with those of the competitor we just bought, the need to make sure the commercial attributes of the product remain private for SoX compliance reasons, and the need to keep existing IT systems working without being rebuilt to accept the new data model, and the only practical solution is to build a metadata-based model that links the existing systems.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/07/super-models/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>More on fraud and money laundering</title>
		<link>http://www.rittmanmead.com/2007/02/more-on-fraud-and-money-laundering/</link>
		<comments>http://www.rittmanmead.com/2007/02/more-on-fraud-and-money-laundering/#comments</comments>
		<pubDate>Sat, 24 Feb 2007 16:22:27 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/02/24/more-on-fraud-and-money-laundering/</guid>
		<description><![CDATA[Yesterday I mentioned a small analytic project we are doing for a retail customer around money laundering. Joel Gary commented and linked to an article in his local press about the rise of anti-fraud analytic companies in his neck of California. Money laundering is a big issue in the UK, not necessarily because it is [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I mentioned a small analytic project we are doing for a retail customer around money laundering. Joel Gary <a href="http://pjsrandom.wordpress.com/2007/02/23/end-of-week-catch-up/#comment-5381" target="_blank">commented</a> and linked to an article in his local press about the rise of anti-fraud analytic companies in his neck of California.</p>
<p>Money laundering is a big issue in the UK, not necessarily because it is a major problem, but more because of its political focus. This manifests itself in two ways: the need to establish that you are who you say you are when you open a bank account (see <a href="http://oracledoug.com/serendipity/index.php?/archives/1211-The-Continuing-Tale-of-RBS.html" title="Doug Burns">here</a> &amp; <a href="http://wedonotuse.blogspot.com/2006/08/miracle-scotland-rbs-and-trolley.html" title="Mogens Norgaard">here</a>) and the need to track <em>large</em> cash transactions both in the banking world and at stores. Somewhere there is a conception that cash is not a legitimate form of currency; cash is the proceeds of crime (be it drug dealing or VAT (sales tax) avoidance). It is also the way in which people without bank accounts and credit cards trade, and often the main part of the daily finances of all those corner shops that sell the odd carton of milk and a newspaper to passers-by. So tracking potential laundering in retail transactions is not that clear cut; we need to detect the crime from the &#8220;noise&#8221;.</p>
<p>Looking at fraud &#8211; we have two significant strands with credit cards. The first is falsely obtaining cards from providers by using stolen IDs, that is, fraudulent applications, which was the main focus of the San Diego Union Tribune piece Joel cited. The second is where a legitimate card is hijacked (stolen, cloned, or just the details used on the web). Here we need to develop techniques to detect the fraud at transaction time. Goodness knows how much the banks are doing to achieve this.</p>
<p>But fraud is not restricted to credit card finance. Fraudulent return of goods is also costing retailers money. Again we need systems that can be used at point of sale to indicate whether the refund is valid. And this needs to happen quickly: the customer should not feel they are being questioned, and the extra transaction time should not slow the sales process too much.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/02/more-on-fraud-and-money-laundering/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Thoughts on extremely large databases and searching the unstructured</title>
		<link>http://www.rittmanmead.com/2007/01/thoughts-on-extremely-large-databases-and-searching-the-unstructured/</link>
		<comments>http://www.rittmanmead.com/2007/01/thoughts-on-extremely-large-databases-and-searching-the-unstructured/#comments</comments>
		<pubDate>Mon, 15 Jan 2007 20:37:18 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/01/15/thoughts-on-extremely-large-databases-and-searching-the-unstructured/</guid>
		<description><![CDATA[Nuno Souto posts an interesting set of thoughts on Extremely Large Databases. As usual, it is a well thought through post from someone who is probably scarred for life from actually working with large databases. In a data warehouse (or even a very large transactional system) context the reader is led to an inevitability of [...]]]></description>
			<content:encoded><![CDATA[<p>Nuno Souto posts an interesting set of thoughts on <a href="http://dbasrus.blogspot.com/2007/01/no-moore-part-2.html" target="_blank">Extremely Large Databases</a>. As usual, it is a well thought through post from someone who is probably scarred for life from actually working with large databases. In a data warehouse (or even a very large transactional system) context the reader is led to an inevitability of partitioning (distributing) the data so that techniques such as indexing or table scans can succeed in a usably short time frame.</p>
<p>It can be argued that, coupled with sensible partitioning schemes, we can minimise IO in a data warehouse (of any large size) by building pre-aggregated summary tables; the pain of the build query happens just once instead of each time the aggregate is needed, and by getting the partition granularity right we can restrict aggregation operations to just the partitions that need rebuilding.</p>
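<p>As a rough illustration of restricting aggregation to the partitions that need rebuilding, here is a minimal Python sketch. The class name, the month-style partition keys and the amounts are all hypothetical, and a real warehouse would do this with partitioned summary tables inside the database rather than in application code.</p>

```python
from collections import defaultdict


class PartitionedSummary:
    """Keeps one pre-aggregated total per partition and rebuilds only changed partitions."""

    def __init__(self):
        self.rows = defaultdict(list)  # partition key -> detail amounts
        self.summary = {}              # partition key -> pre-aggregated total
        self.stale = set()             # partitions whose summary needs rebuilding

    def load(self, partition, amount):
        """Add a detail row and mark its partition as needing a rebuild."""
        self.rows[partition].append(amount)
        self.stale.add(partition)

    def refresh(self):
        """Re-aggregate only the stale partitions and return how many were rebuilt."""
        rebuilt = len(self.stale)
        for partition in self.stale:
            self.summary[partition] = sum(self.rows[partition])
        self.stale.clear()
        return rebuilt

    def total(self, partitions):
        """Answer a query from the summary without touching the detail rows."""
        return sum(self.summary.get(p, 0) for p in partitions)


# Hypothetical monthly partitions and amounts.
summary = PartitionedSummary()
summary.load("2007-01", 100)
summary.load("2007-01", 50)
summary.load("2007-02", 70)
first_refresh = summary.refresh()    # both partitions are new, so both are built

summary.load("2007-02", 30)
second_refresh = summary.refresh()   # only 2007-02 changed, so only it is rebuilt

print(first_refresh, second_refresh, summary.total(["2007-01", "2007-02"]))
```

<p>The second refresh touches a single partition, which is the saving that a well chosen partition granularity is meant to buy.</p>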
<p>But what happens when we stray from the domain of structured DSS type queries? That is, away from those queries where (even for ad-hoc queries) we remain able to use our wisely chosen aggregates or exploit the partitioning scheme of the base data. Suppose we look at data mining, where the statistical relationships between data items are being determined, or we search unstructured data for patterns or relationships, or just index its content. And nowadays unstructured data is not just <em>text</em> (if it ever was): we have speech and speaker recognition, and image recognition from the (now) trivial OCR, through fingerprints and facial recognition, to the more complex ability to search libraries of image components to identify photographic locations. These require inventive indexing techniques, but they also require fast access to the underlying data, and for big datasets is that going to be possible?</p>
<p>Maybe ELDBs are not the way to go for unstructured data or data that needs extensive analysis. Perhaps the way to go here is through database federation, keeping the computation close to the disk and coordinating the outputs to produce the end result.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/01/thoughts-on-extremely-large-databases-and-searching-the-unstructured/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Is &quot;Data Warehouse&quot; a future-proof term?</title>
		<link>http://www.rittmanmead.com/2006/08/is-data-warehouse-a-future-proof-term/</link>
		<comments>http://www.rittmanmead.com/2006/08/is-data-warehouse-a-future-proof-term/#comments</comments>
		<pubDate>Sun, 13 Aug 2006 13:22:44 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2006/08/13/is-data-warehouse-a-future-proof-term/</guid>
		<description><![CDATA[A while back it was quite feasible to draw circles around discrete databases in an organisation&#8217;s IT structure and say &#8216;this is the data warehouse, here is the billing system and that is the blinkity-boo system&#8217;. But now those circles are pretty diffuse. It is harder to tell where document storage diverges from data [...]]]></description>
			<content:encoded><![CDATA[<p>A while back it was quite feasible to draw circles around discrete databases in an organisation&#8217;s IT structure and say &#8216;this is the data warehouse, here is the billing system and that is the blinkity-boo system&#8217;. But now those circles are pretty diffuse. It is harder to tell where document storage diverges from data warehousing, ERP systems, electronic content management and workflow. Federated databases and messaging (both XML and older queue technologies) mean that not everything is in one place; the challenge of providing a unified view of information becomes so much harder. Reporting is no longer about finding the sum of past activity; we are increasingly looking for predictive measures and trying to find patterns. This is happening everywhere: in commerce, finance and especially government.</p>
<p>Before I started out with data warehouses I did a lot of work on analyzing data, that is, searching masses of information looking for links between data items &#8211; seeking out new and unknown connections, often connections separated by considerable time delays. To make this work I had to impose structure on the unstructured, and this meant borrowing techniques from all over: artificial intelligence and fuzzy logic, network theory and even astrophysics (not all of my data sources were textual!). Borrowing continues apace, and now there seems little to delineate technologies. What I used to do then has become common, and not just in the traditional DW monoliths of the past. Real-time credit card fraud detection is being pushed out to the point of sale by tapping in on the XML streams back to the centre; events recorded on different systems can be linked through messaging rather than relying on them being transported to a single data repository and found by some batch process running long after the event has lost its significance. As I said in an earlier <a href="/2006/04/13/real-time-reporting/">post</a>, the number of executive choices after an event diminishes as time progresses &#8211; there will be a stage where there is no longer a choice.</p>
<p>Maybe we are heading to a future where the disciplines are no longer ERP and OLAP, or transactional and decision support, but become just <strong>Data Storage</strong> and <strong>Data Retrieval</strong>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2006/08/is-data-warehouse-a-future-proof-term/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Big, bad disk.</title>
		<link>http://www.rittmanmead.com/2006/04/big-bad-disk/</link>
		<comments>http://www.rittmanmead.com/2006/04/big-bad-disk/#comments</comments>
		<pubDate>Sat, 22 Apr 2006 22:35:55 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2006/04/22/big-bad-disk/</guid>
		<description><![CDATA[Over on Doug Burns’s blog there is a link to an interesting piece on large disks. Some people would think that the data warehousing community would welcome large disks. But probably for the majority (those of us that use conventional relational databases) this is not the case. An exception may be for those people that [...]]]></description>
			<content:encoded><![CDATA[<p>Over on Doug Burns’s blog there is a <a href="http://www.pythian.com/blogs/170/750g-disks-are-bahd-for-dbs-a-call-to-arms">link</a> to an interesting piece on large disks. Some people would think that the data warehousing community would welcome large disks. But probably for the majority (those of us that use conventional relational databases) this is not the case. An exception may be for those people that use data warehouse appliances; here data is hashed across all the available disks in the system and predicate processing is pushed out to the on-disk processors, that is, the system processor only sees the data after predicate and column filtering. Very few of these data warehouse appliances exist in the wild; I would guess low hundreds worldwide.</p>
<p>Disk drives can only read from one location on the disk at a time. True, the data read rate may be high and the time to jump between locations may be low, but this is still a slow process compared to CPU and memory operations. If all of a system’s data sits on one disk then every disk read and write will go through a single point and consequently IO throughput will suffer. This would be particularly noticeable in a data warehouse, where table scans are common events.</p>
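<p>To put rough numbers on the single-disk bottleneck, here is a back-of-envelope Python sketch. The 750 GB table size and the 75 MB per second sustained read rate are purely illustrative assumptions, and the model is idealised: it assumes the table is spread perfectly evenly across the drives and ignores seeks, caching and controller limits.</p>

```python
def scan_seconds(table_gb, mb_per_second_per_disk, disks=1):
    """Idealised full-scan time when the table is spread evenly over `disks` drives."""
    total_mb = table_gb * 1000  # decimal gigabytes, as disk vendors count them
    return total_mb / (mb_per_second_per_disk * disks)


# Assumed figures for illustration only.
one_disk = scan_seconds(750, 75)
ten_disks = scan_seconds(750, 75, disks=10)
print(one_disk, ten_disks)
```

<p>Under these assumptions the same scan takes ten times longer on one large drive than on ten smaller ones, which is why capacity per spindle matters less to a warehouse than the number of spindles.</p>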
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2006/04/big-bad-disk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Token databases</title>
		<link>http://www.rittmanmead.com/2005/10/token-databases/</link>
		<comments>http://www.rittmanmead.com/2005/10/token-databases/#comments</comments>
		<pubDate>Mon, 17 Oct 2005 12:41:00 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2005/10/17/token-databases/</guid>
		<description><![CDATA[Mark Rittman recently posted on the subject of column orientated databases such as Sybase IQ and various SAND Technology Inc products. One aspect Mark mentioned was the use of tokens to store attribute data. In a previous role (before moving back into the mainstream world of Oracle data warehouses) I worked on &#39;free-text&#39; information systems and [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.rittman.net/archives/001368.html">Mark Rittman</a> recently posted on the subject of column orientated databases such as Sybase IQ and various SAND Technology Inc products. One aspect Mark mentioned was the use of tokens to store attribute data.</p>
<p>In a previous role (before moving back into the mainstream world of Oracle data warehouses) I worked on &#39;free-text&#39; information systems and in particular those produced by the Memex company. The basic idea behind these databases is that each unique word in the database is assigned an integer token, and instead of storing the actual text only a stream of integers is stored. On data input each word is checked against a dictionary &#8211; if the word exists it is substituted with its token; if it is a new word the next available integer is assigned as the token and the word is added to the dictionary. For example, if the first phrase ever to be entered into the database was &#39;The cat sat on the mat&#39; then: the will be assigned 1, cat = 2, sat = 3, on = 4, the is already 1, and mat will be 5. Of course this is a gross simplification but you can see how this works. Data output requires each token to be decoded and built into a text stream.</p>
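<p>The dictionary scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Memex implemented it: the class and method names are invented, and folding words to lower case (so that &#39;The&#39; and &#39;the&#39; share a token, as in the example) is an assumption.</p>

```python
class TokenDictionary:
    """Maps each unique word to an integer token, assigned in order of first appearance."""

    def __init__(self):
        self.word_to_token = {}
        self.token_to_word = {}

    def encode(self, text):
        """Replace each word with its token, adding unseen words to the dictionary."""
        tokens = []
        for word in text.lower().split():  # lower-casing is an assumption
            token = self.word_to_token.get(word)
            if token is None:
                token = len(self.word_to_token) + 1  # next available integer
                self.word_to_token[word] = token
                self.token_to_word[token] = word
            tokens.append(token)
        return tokens

    def decode(self, tokens):
        """Rebuild a text stream from a stream of tokens."""
        return " ".join(self.token_to_word[t] for t in tokens)


dictionary = TokenDictionary()
stream = dictionary.encode("The cat sat on the mat")
print(stream)                     # [1, 2, 3, 4, 1, 5]
print(dictionary.decode(stream))  # the cat sat on the mat
```

<p>Two lookups are kept, one in each direction, because input needs word to token and output needs token to word.</p>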
<p>Of course this word dictionary based approach has its own set of problems. We are storing each unique character string as a token &#8211; bad typing or spelling can cause multiple tokens for the &#39;same&#39; thing; however, it is not always appropriate to &#39;correct&#39; the input text to a standard form. Likewise the use of synonyms can be problematic &#8211; if the text says &#39;car&#39;, is this the same as &#39;automobile&#39; or &#39;motor vehicle&#39;? Or indeed is &#39;colour&#39; the same as &#39;color&#39;? And then there are parts of speech &#8211; &#39;drive&#39;, &#39;driven&#39; and &#39;drove&#39; are distinct words, but people may need to search for any of them. The approach adopted by Memex is to store the &#39;as entered&#39; text (tokenized) in the database but to produce tools that change the search to use, in effect, an in-list of tokens for alternate words.<br />
Searching for the use of a single word in the database is fast &#8211; if it is not in the dictionary then it is not in the database; otherwise the database is scanned for a bitwise match with the token. It is possible to enhance the search by looking for multiple words, phrase searching and proximity searching (that is, word 1 appearing within n words of word 2). Further enhancements allow the documents to be tagged in much the same way as XML so that data between specific tags takes a particular semantic meaning (or if you like, structure). Tagged data is especially useful when coupled with specialised data mining and data relationship graphing tools.</p>
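<p>The search ideas above (dictionary check first, an in-list of tokens for alternate words, and proximity matching) can be sketched as follows. This is a simplified Python illustration with invented function names and sample documents; it scans plain lists of integers rather than doing the bitwise matching a real product would use, and it takes &#39;within n words&#39; to mean a position difference of at most n.</p>

```python
def build_dictionary(documents):
    """Tokenise documents, returning the word dictionary and the encoded token streams."""
    word_to_token = {}
    encoded = []
    for text in documents:
        tokens = []
        for word in text.lower().split():
            # setdefault returns the existing token, or stores the next free integer
            tokens.append(word_to_token.setdefault(word, len(word_to_token) + 1))
        encoded.append(tokens)
    return word_to_token, encoded


def search_any(words, word_to_token, encoded):
    """Return indexes of documents containing any of the words (an in-list of tokens)."""
    wanted = {word_to_token[w] for w in words if w in word_to_token}
    if not wanted:
        return []  # no word is in the dictionary, so no scan is needed
    return [i for i, tokens in enumerate(encoded) if wanted.intersection(tokens)]


def search_near(word1, word2, n, word_to_token, encoded):
    """Return indexes of documents where word1 appears within n words of word2."""
    t1 = word_to_token.get(word1)
    t2 = word_to_token.get(word2)
    if t1 is None or t2 is None:
        return []
    hits = []
    for i, tokens in enumerate(encoded):
        pos1 = [p for p, t in enumerate(tokens) if t == t1]
        pos2 = [p for p, t in enumerate(tokens) if t == t2]
        if any(n >= abs(a - b) for a in pos1 for b in pos2):
            hits.append(i)
    return hits


# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "the car was driven home",
    "an automobile drove past the cat",
]
words, streams = build_dictionary(docs)
print(search_any(["car", "automobile"], words, streams))        # [1, 2]
print(search_any(["drive", "driven", "drove"], words, streams)) # [1, 2]
print(search_near("cat", "mat", 4, words, streams))             # [0]
```

<p>The stored text is never altered; synonyms and word forms are handled entirely at search time by widening the set of tokens looked for.</p>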
<p>Back to Mark&#39;s piece &#8211; is this a useful technique in BI? And can something similar be implemented in Oracle? Firstly, the Memex approach is non-relational and as such does not fit readily into the conventional relational model DW approaches (Kimball or Inmon). The target users of these free-text databases are typically looking for a single record (or very few records) based on a search; this is perhaps closer to the Electronic Content Management community (interMedia?) than BI, where users are looking for quantitative results and trends based on multiple records. But can any part of this technique be used in a conventional Oracle data warehouse context? Of course it is possible to build a token dictionary, perhaps as an IOT of word-token pairs; this will need to be indexed on both token and word to permit efficient coding and decoding. But where could we use these tokens in a data warehouse? One option is to adopt a model where the dimensional attributes are only stored in one or more &#39;reference&#39; data tables and not stored in the &#39;fact&#39; tables at all (well, apart from the join key which is, in a way, a token!). If tokens are to save space in the database (because the token takes less storage than the original word &#8211; which may not be true anyway) we will need to use each token at least twice in the reference data table to recover the extra storage needed for the word-token dictionary. For a 3NF DW we will have a hierarchy of reference data tables with each table consisting of unique records; this reduces the potential for duplicating attribute tokens. It does not eradicate it though, as items could easily share attributes: for example, in a fashion DW a tee-shirt and a dress could share the attribute of &#39;blue&#39;. In a fully denormalized DW there is far more scope for attribute sharing. But is it worthwhile? The reference tables themselves are short &#8211; how many customers or products do you have? Probably fewer than 1,000,000 rows.
Add the necessity to join all the queries to the token dictionary table to create the selection criteria, and again to decode the tokens to make the output human friendly, and tokens seem to add a performance overhead that may not be worth living with.</p>
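<p>The &#39;at least twice&#39; point can be checked with some simple arithmetic, sketched here in Python. The storage model is a deliberately crude assumption: a plain attribute costs the length of the word each time it is used, a tokenised attribute costs the token size each time plus one dictionary row holding both word and token, and row and index overheads are ignored. The function name and byte sizes are hypothetical.</p>

```python
def break_even_uses(word_bytes, token_bytes):
    """Smallest number of uses at which tokenising a word saves space, or None if never."""
    if token_bytes >= word_bytes:
        return None  # the token is no smaller than the word, so nothing is ever saved
    # Plain storage:     uses * word_bytes
    # Tokenised storage: uses * token_bytes + (word_bytes + token_bytes)
    # Saving requires uses to exceed (word_bytes + token_bytes) / (word_bytes - token_bytes)
    return (word_bytes + token_bytes) // (word_bytes - token_bytes) + 1


print(break_even_uses(40, 4))  # 2
print(break_even_uses(10, 4))  # 3
print(break_even_uses(4, 4))   # None
```

<p>Under this model a long attribute pays for itself on its second use, a short one needs more repeats, and a word no longer than its token never does; the extra indexes on the dictionary would push the break-even point higher still.</p>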
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2005/10/token-databases/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

