<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.3.3" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Rittman Mead Consulting</title>
	<link>http://www.rittmanmead.com</link>
	<description>Delivered Intelligence</description>
	<pubDate>Thu, 20 Nov 2008 16:12:02 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.3</generator>
	<language>en</language>
			<item>
		<title>Preprocessing Input Files and 11.1.0.7 External Tables</title>
		<link>http://www.rittmanmead.com/2008/11/20/preprocessing-input-files-and-11107-external-tables/</link>
		<comments>http://www.rittmanmead.com/2008/11/20/preprocessing-input-files-and-11107-external-tables/#comments</comments>
		<pubDate>Thu, 20 Nov 2008 16:12:02 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
		
		<category><![CDATA[Oracle Database]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/20/preprocessing-input-files-and-11107-external-tables/</guid>
		<description><![CDATA[Just a quick note to point to Greg Rahn&#8217;s posting on External Table Preprocessors in Oracle 11g 11.1.0.7. Prior to 11.1.0.7, if a file that you were going to use via an external table needed uncompressing, you had to do the task prior to reading it using the external table, potentially ending up with an [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick note to point to <a href="http://structureddata.org/2008/11/19/preprocessor-for-external-tables/">Greg Rahn&#8217;s posting on External Table Preprocessors in Oracle 11g 11.1.0.7</a>. Prior to 11.1.0.7, if a file that you were going to use via an external table needed uncompressing, you had to do the task prior to reading it using the external table, potentially ending up with an uncompressed file several times larger than the compressed one, or use SQL*Loader and feed data into it via an uncompression utility and named pipes.</p>
<p>Now in 11g, you can add a PREPROCESSOR clause to your external table DDL to point to an executable that will preprocess and pipe your data, allowing the external table to read from the file via the uncompression utility without having to uncompress it all first. It&#8217;s one of those things that makes you think &#8220;why wasn&#8217;t that there before?&#8221; As Greg says, though, it&#8217;s now in 11.1.0.7, albeit sketchily documented, so if you&#8217;re planning to move large files around as part of your ETL process and you&#8217;re on the latest version of the database, it&#8217;ll be worth taking a look at Greg&#8217;s article.</p>
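<p>As a rough illustration of the clause described above, here is a minimal sketch of external table DDL that uses PREPROCESSOR. The directory objects (data_dir, exec_dir), the paths, the etl_user account, the table and column names and the file name are all hypothetical, and the sketch assumes a Unix-style host where zcat is available in the directory that exec_dir points to. Treat Greg&#8217;s article and the Oracle documentation as the reference for the exact syntax on your release:</p>
<pre>-- Hypothetical directory objects: one for the data files, one for the executable
CREATE DIRECTORY data_dir AS '/u01/app/etl/data';
CREATE DIRECTORY exec_dir AS '/u01/app/etl/bin';

-- The user needs READ/WRITE on the data directory and EXECUTE on the
-- directory that holds the preprocessor program
GRANT READ, WRITE ON DIRECTORY data_dir TO etl_user;
GRANT EXECUTE ON DIRECTORY exec_dir TO etl_user;

-- External table that reads a gzipped CSV file through zcat,
-- without uncompressing the file to disk first
CREATE TABLE sales_ext (
  sale_id  NUMBER,
  amount   NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'zcat'
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales.csv.gz')
)
REJECT LIMIT UNLIMITED;</pre>
<p>If the utility needs arguments, one option is to point the clause at a small wrapper script in the same directory rather than at the utility itself.</p>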
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/20/preprocessing-input-files-and-11107-external-tables/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Oracle OLAP Supports MDX (Sort Of&#8230;)</title>
		<link>http://www.rittmanmead.com/2008/11/20/oracle-olap-supports-mdx-sort-of/</link>
		<comments>http://www.rittmanmead.com/2008/11/20/oracle-olap-supports-mdx-sort-of/#comments</comments>
		<pubDate>Thu, 20 Nov 2008 15:43:43 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
		
		<category><![CDATA[Oracle OLAP]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/20/oracle-olap-supports-mdx-sort-of/</guid>
		<description><![CDATA[Dan Vlamis broke the news yesterday that Simba Technologies will be showing off their &#8220;Native Microsoft Excel 2007 Connectivity for Oracle OLAP&#8221; solution at the upcoming BIWA SIG in December. From looking at Dan&#8217;s blog posting and the Simba Technologies press release, the product uses an MDX 2005 to Oracle OLAP bridge that the company [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.vlamis.com/blog.php/?p=76">Dan Vlamis broke the news yesterday</a> that Simba Technologies will be showing off their &#8220;Native Microsoft Excel 2007 Connectivity for Oracle OLAP&#8221; solution at the upcoming BIWA SIG in December. From looking at Dan&#8217;s blog posting and the <a href="http://www.simba.com/news/Simba-Demonstrates-Native-Excel-2007-Connectivity-to-Oracle-OLAP-11g.htm">Simba Technologies press release</a>, the product uses an MDX 2005 to Oracle OLAP bridge that the company have developed in order to offer the same pivot table support that Microsoft Excel 2007 offers for Microsoft Analysis Services for Oracle OLAP customers.</p>
<p>This is of course pretty significant, as Oracle have been more or less the only major OLAP vendor to not support MDX as a query language, at least in their in-database OLAP product (Essbase of course supports MDX for Aggregate Storage Option cubes). The rumour about Oracle extending MDX support to Oracle OLAP has been circulating the various conferences for the past year or so; I&#8217;d heard that it might be through a third-party spreadsheet add-in rather than direct support in the database, and obviously this is it. </p>
<p>Of course this falls short of direct support for MDX; I guess with Oracle&#8217;s clear backing of SQL as their preferred OLAP query language that would have been a step too far. But this does give customers who prefer Excel as their OLAP query tool (i.e. most of them) and want to create calculations and selections in MDX the ability to choose Oracle OLAP, with all its clear security, scalability and manageability benefits, rather than having to use a separate OLAP server. I don&#8217;t know what pricing is like, and how well the bridge performs, but it&#8217;s certainly interesting and gives Oracle OLAP 11g customers a means to analyze their cubes in a proper multi-dimensional environment, at least until Answers+ arrives.</p>
<p>More details at the <a href="http://www.vlamis.com/blog.php/?p=76">Vlamis blog</a> and on the <a href="http://www.simba.com/news/Simba-Demonstrates-Native-Excel-2007-Connectivity-to-Oracle-OLAP-11g.htm">Simba Technologies website</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/20/oracle-olap-supports-mdx-sort-of/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Investigating the Oracle BI Management Pack for OBIEE and DAC</title>
		<link>http://www.rittmanmead.com/2008/11/18/investigating-the-oracle-bi-management-pack-for-obiee-and-dac/</link>
		<comments>http://www.rittmanmead.com/2008/11/18/investigating-the-oracle-bi-management-pack-for-obiee-and-dac/#comments</comments>
		<pubDate>Tue, 18 Nov 2008 22:54:27 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
		
		<category><![CDATA[Oracle BI Apps]]></category>

		<category><![CDATA[Oracle BI Suite EE]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/18/investigating-the-oracle-bi-management-pack-for-obiee-and-dac/</guid>
		<description><![CDATA[Although I posted the other week about having finished all my presentation writing, in fact I&#8217;ve actually been working on and off on a couple of fairly big demos. One, that I&#8217;ll try and post about soon, is an attempt to get all of the EPM 11.1.1 (or even, EPM 11.1.1.1) software up and running [...]]]></description>
			<content:encoded><![CDATA[<p>Although I posted the other week about having finished all my presentation writing, in fact I&#8217;ve actually been working on and off on a couple of fairly big demos. One, that I&#8217;ll try and post about soon, is an attempt to get all of the EPM 11.1.1 (or even, EPM 11.1.1.1) software up and running and integrated with OBIEE; the other is about getting the new BI Management Pack installed and running alongside OBIEE and the Oracle BI Applications.</p>
<p>The basic premise with the BI Management Pack is that it comes with Oracle Grid Control 10.2.0.4 and allows you to manage your BI infrastructure as well. Price-wise I think it adds about $11k to the standard $260k (list price) for OBIEE and it&#8217;s already pre-installed if you patch Grid Control up to 10.2.0.4, from the standard 10.2.0.1 version that&#8217;s on OTN. Once you apply the patch you then have to make sure the <a href="http://www.rittmanmead.com/2008/10/11/comparing-obiee-usage-tracking-with-nqsquerylog/">Usage Tracking</a> and Scheduler tables are installed and configured for OBIEE, there&#8217;s an additional JMX agent to get running so that OEM can check all the OBIEE counters and so on, and then you can go through the &#8220;discovery&#8221; process detailed in the docs to register your OBIEE and DAC components. </p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/add-biee-target.jpg" height="277" width="500" border="1" hspace="4" vspace="4" alt="Add Biee Target" /></p>
<p>This went fairly smoothly for me; the only issue I had was that the discovery process couldn&#8217;t initially &#8220;log on&#8221; to my Windows XP host until I made sure the account I was using had the &#8220;log on as a batch job&#8221; privilege. Once I did that, I was able to add my OBIEE BI Server, Presentation Server and Scheduler to the Grid Control list of targets, along with the DAC (Data Warehouse Administration Console, used for controlling BI Apps ETL jobs) repository. Once you&#8217;ve done this, and you&#8217;ve added the various Grid Control, Oracle Management Agent and Oracle Management Service elements to your OBIEE setup, your architectural diagram now looks like this:</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/bi-mgmt-pack-arch.jpg" height="287" width="500" border="1" hspace="4" vspace="4" alt="Bi Mgmt Pack Arch" /></p>
<p>So once you&#8217;ve set this all up, what do you get?</p>
<p>You access the various OBIEE components by first selecting the host they&#8217;re on, and then locating the &#8220;targets&#8221; in the host&#8217;s list, like this:</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/targets.jpg" height="221" width="500" border="1" hspace="4" vspace="4" alt="Targets" /></p>
<p>The little &#8220;up arrow&#8221; next to the target shows whether, for example, the BI Server is currently running, the Scheduler is up and so on. Notice the DAC Server bit in the middle? That&#8217;s reporting on whether the DAC Server is up and running and able to service Execution Plan requests from the DAC Console.</p>
<p>The overview for the BI Server shows the general CPU usage, memory usage and so on of the BI Server process. You can also use this page to get a general overview of how &#8220;busy&#8221; the BI Server is, either currently, at intervals of a minute or so, or over the past day, seven days and so on.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/bi-server-general-performance.jpg" height="364" width="500" border="1" hspace="4" vspace="4" alt="Bi Server General Performance" /></p>
<p>The Dashboard Reports tab shows you data from the Usage Tracking tables, so that you can see which dashboards (note, not requests/reports) are the most active, which users are making the most use of dashboards and so on.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/top-dashboards-by-resource-usage.jpg" height="364" width="500" border="1" hspace="4" vspace="4" alt="Top Dashboards By Resource Usage" /></p>
<p>In my case, the OMMgr and admin users have been making the most use of the system.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/top-by-user.jpg" height="247" width="500" border="1" hspace="4" vspace="4" alt="Top By User" /></p>
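<p>For comparison, much the same rankings can be pulled straight from the Usage Tracking table. The queries below are only a sketch: they assume the standard S_NQ_ACCT table with its SAW_DASHBOARD, USER_NAME and TOTAL_TIME_SEC columns, so check the names against your own Usage Tracking schema before relying on them:</p>
<pre>-- Sketch: dashboards ranked by total elapsed query time
SELECT saw_dashboard,
       COUNT(*)                  AS requests,
       COUNT(DISTINCT user_name) AS users,
       SUM(total_time_sec)       AS total_secs
FROM   s_nq_acct
WHERE  saw_dashboard IS NOT NULL
GROUP  BY saw_dashboard
ORDER  BY total_secs DESC;

-- Sketch: the same idea, ranked by user instead
SELECT user_name,
       COUNT(*)            AS requests,
       SUM(total_time_sec) AS total_secs
FROM   s_nq_acct
GROUP  BY user_name
ORDER  BY total_secs DESC;</pre>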
<p>You can also see to what degree the cache has been used, the cache hit ratio (!) and so on - note that this is the BI Server cache, not the database buffer cache, before anyone gets too excited&#8230;</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/cache-perforamnce.jpg" height="317" width="500" border="1" hspace="4" vspace="4" alt="Cache Performance" /></p>
<p>The Presentation Server pages give you a bit more information about how active the front-end of the application has been, how much activity the charting engine has undertaken and so on.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/presentation-server-details.jpg" height="368" width="500" border="1" hspace="4" vspace="4" alt="Presentation Server Details" /></p>
<p>Here&#8217;s the overview page; there&#8217;s not much to see here, though, apart from uptime and load.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/presentation-server.jpg" height="331" width="500" border="1" hspace="4" vspace="4" alt="Presentation Server" /></p>
<p>Now something I did find interesting was integration with the DAC Server and Repository. The BI Management Pack links in to the DAC Repository and shows you graphs on ETL runs, for example:</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/dac-etl-runs.jpg" height="378" width="500" border="1" hspace="4" vspace="4" alt="Dac Etl Runs" /></p>
<p>Here&#8217;s a bit more detail on the specific ETL runs that have happened.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/dac-etl-runs-2.jpg" height="169" width="500" border="1" hspace="4" vspace="4" alt="Dac Etl Runs 2" /></p>
<p>So, it looks quite interesting, particularly if you&#8217;re already a user of Grid Control 10g and you&#8217;d like to bring your OBIEE targets into the fold. This is obviously a pretty early version of what&#8217;s possible: the data that Grid Control is getting from the various OBIEE servers is pretty basic, and nothing really more than what you can get from the Usage Tracking tables and reports combined with operating system CPU, RAM and I/O reports. For me, though, the interesting thing isn&#8217;t what&#8217;s here at the moment but what&#8217;s possible going into the future. </p>
<p>If you think about your complete Oracle BI stack - going from storage, through to the database, application server and then the various OBIEE servers, then you can see how a tool like this can help you manage performance and your architecture from top to bottom. If you get the classic request from users - &#8220;the system is running slow, can you take a look and find out what&#8217;s up?&#8221; - well with a system like this, if Grid Control in the future links together the reports that are running, with the application server session that hosted them, with the database query that provided the data, combined with the self-tuning (ASH, AWR etc) bits of the database, plus the diagnostics and so on that are in ASM and the general storage layer, well you can really imagine being able to trace a query through from report through to the underlying server system and understand just exactly where the time is going. I don&#8217;t think we&#8217;re quite there yet, the reports that the BI Management Pack produces are a bit simple and somewhat disjointed, but if the product management team start adding in the <a href="http://www.rittmanmead.com/2008/09/17/tuning-an-oracle-11g-data-warehouse-using-the-dbconsole-advisors/">sort of advisors that we get from the database side</a> - perhaps an &#8220;Aggregate Storage Advisor&#8221; that recommends aggregate tables or even Essbase cubes, then passes the details through to the Aggregate Persistence Wizard to create the required summaries; maybe a cache advisor that recommends caching, maybe even a request performance wizard that recommends changes to the underlying data structures including the addition of indexes or materialized views, well, that would be very interesting.</p>
<p>Of course this is all potentially quite a while off, so my next task is to come up with some sort of approach where we can use what&#8217;s currently in the BI Management Pack, along with the various database advisors and application server reports to try and diagnose and improve the performance of queries today. I&#8217;m actually writing this up as an article for OTN, so keep an eye out after Christmas and hopefully I&#8217;ll come up with some good guidelines and a simple methodology so that you can start using this interesting new addition to Grid Control 10gR4.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/18/investigating-the-oracle-bi-management-pack-for-obiee-and-dac/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Rittman Mead at the UKOUG Conference 2008</title>
		<link>http://www.rittmanmead.com/2008/11/15/rittman-mead-at-the-ukoug-conference-2008/</link>
		<comments>http://www.rittmanmead.com/2008/11/15/rittman-mead-at-the-ukoug-conference-2008/#comments</comments>
		<pubDate>Sat, 15 Nov 2008 15:04:40 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
		
		<category><![CDATA[User Groups &amp; Conferences]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/15/rittman-mead-at-the-ukoug-conference-2009/</guid>
		<description><![CDATA[It&#8217;s just a few weeks now until the user group event of the year, the UK Oracle User Group Conference &#038; Exhibition 2008. This year, as well as presenting at the conference we will have a stand in the exhibition hall, where we&#8217;ll be demonstrating all the latest Oracle BI&#038;DW tools and answering any development [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s just a few weeks now until the user group event of the year, the <a href="http://conference.ukoug.org">UK Oracle User Group Conference &#038; Exhibition 2008</a>. This year, as well as presenting at the conference we will have a stand in the <a href="http://conference.ukoug.org/default.asp?p=822">exhibition hall</a>, where we&#8217;ll be demonstrating all the latest Oracle BI&#038;DW tools and answering any development or project questions you might have.</p>
<p>In terms of presentations, our schedule is as follows:</p>
<ul>
<li><strong>&#8220;Extending &#038; Customizing the Oracle BI Apps Data Warehouse&#8221;</strong> - Mark Rittman, Tuesday 12:25 - 13:10</li>
<li><strong>&#8220;A Better BI Methodology&#8221;</strong> - Jon Mead, Tuesday 15:15 - 16:00</li>
<li><strong>&#8220;Oracle BI EE Performance Optimization&#8221;</strong> - Mark Rittman, Wednesday 14:35 - 15:20</li>
<li><strong>&#8220;High Availability in Oracle BI EE&#8221;</strong> - Jon Mead, Wednesday 15:40 - 16:25</li>
<li><strong>&#8220;Deploying a Dimensional Model on the Oracle Database&#8221;</strong> - Stewart Bryson, Thursday 13:00 - 13:45</li>
<li><strong>&#8220;Oracle 11g Data Warehousing Masterclass&#8221;</strong> - Mark Rittman, Friday 09:30 - 11:30</li>
</ul>
<p>As well as doing the presentations and manning the stand, we&#8217;ll also be giving away t-shirts and chilled bottles of an exclusive Rittman Mead beer that Borkur&#8217;s sourced from Belgium. Hopefully we&#8217;ll see some of you there; if you make it to the exhibition, keep a look out for stand 90 and come and say hello.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/15/rittman-mead-at-the-ukoug-conference-2008/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Welcome to Jennifer and Ragnar</title>
		<link>http://www.rittmanmead.com/2008/11/13/welcome-to-jennifer-and-ragnar/</link>
		<comments>http://www.rittmanmead.com/2008/11/13/welcome-to-jennifer-and-ragnar/#comments</comments>
		<pubDate>Thu, 13 Nov 2008 16:20:18 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
		
		<category><![CDATA[Rittman Mead]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/13/welcome-to-jennifer-and-ragnar/</guid>
		<description><![CDATA[I&#8217;ve been meaning to post this for a couple of months now, but we&#8217;ve recently grown our team at Rittman Mead and taken on two new Principal Consultants, Jennifer Albu and Ragnar Wessels, bringing our consulting team up to six full-timers.
Jennifer comes to us from a major IT services provider and specialises in Informatica, Warehouse [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been meaning to post this for a couple of months now, but we&#8217;ve recently grown our team at Rittman Mead and taken on two new Principal Consultants, Jennifer Albu and Ragnar Wessels, bringing our <a href="http://www.rittmanmead.com/about/our-team/">consulting team</a> up to six full-timers.</p>
<p>Jennifer comes to us from a major IT services provider and specialises in Informatica, Warehouse Builder, data warehouse design and project management, whilst Ragnar comes from another big IT services provider and specialises in OWB, OBIEE, Discoverer and database development. Continuing our international jet-setting theme, Ragnar is Dutch, worked in the past with Borkur in Iceland, and was up until recently living and working in Canada, whilst Jen is originally from Canada but has lived and worked in the UK for the past few years, and is now working with Borkur and Ragnar on a project in Brussels. Given the fact that three of the team are over in Belgium, Pete&#8217;s in Athens whilst Jon and I are working in London, Ireland and Oslo over the next couple of weeks, working out where to hold our team Christmas party is turning into a tricky decision (On the Eurostar, perhaps, or at Schiphol Airport?)</p>
<p>Anyway welcome, Jen and Ragnar, and we look forward to working with you (and reading your blog posts) over the next few months!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/13/welcome-to-jennifer-and-ragnar/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Data quality is not a one-off</title>
		<link>http://www.rittmanmead.com/2008/11/12/data-quality-is-not-a-one-off/</link>
		<comments>http://www.rittmanmead.com/2008/11/12/data-quality-is-not-a-one-off/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 16:59:59 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
		
		<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/12/data-quality-is-not-a-one-off/</guid>
		<description><![CDATA[In my Blog post on End-to-end data quality I mentioned the desirability of fixing bad data at source. This certainly attracted comment both here and on other blogs for example 
One point to keep in mind about fixing bad data on the source; it is just that, fixing DATA. We are not fixing bad applications [...]]]></description>
			<content:encoded><![CDATA[<p>In my blog post on <a href="http://www.rittmanmead.com/2008/10/25/end-to-end-data-quality/" target="_blank">End-to-end data quality</a> I mentioned the desirability of fixing bad data at source. This certainly attracted comment both here and on other blogs, for <a href="http://blogs.oracle.com/robreynolds/2008/11/continuing_the_data_quality_co_1.html" target="_blank">example</a>.</p>
<p>One point to keep in mind about fixing bad data at the source: it is just that, fixing DATA. We are not fixing bad applications or bad processes. Often fixing source applications is just not going to happen, especially in the case of legacy and third-party packaged applications. Likewise, processes can be hard to fix; if we rely on humans to key information in and the application is incapable of enforcing data rules, then data entry problems will occur. For example, take a university admissions system. Here data entry is highly seasonal: first a lot of prospective candidates (say about 10x the number who will start as students in the new year), then a reduced number of place offers, which get further reduced by candidates choosing another university or not reaching the entrance grades. This workflow often requires extra temporary staff just to key in details. They are not full-time employees, they do not know the significance of the data, and they do not know about the reporting systems that give insight into the data; they are just there to get the data in as quickly as possible and with enough accuracy to let the process as a whole work. It really does not matter to the process if the country name is not standardised, providing the funding code is correct; even the course code may not matter that much providing the faculty is correct, especially if all faculty members follow a common scheme of study in the first year.<br />
So, fixing data at source is most unlikely to be a one-off exercise. True, a one-off data fix will improve the quality of our historic reporting, but what we are not doing is preventing data errors from occurring in the future. We must build into our ETL processes methods to continually monitor source data quality and build processes so that any errors detected can be corrected at source.</p>
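<p>As a sketch of what such a recurring check might look like for the country name example, the query below lists source values that do not match a reference list, so they can be reported back for correction at source. The table and column names (stg_admissions, ref_countries, country_name) are purely hypothetical, and a real check would be scheduled as part of each ETL run:</p>
<pre>-- Sketch: non-standard country names in the latest source extract
SELECT a.country_name,
       COUNT(*) AS bad_rows
FROM   stg_admissions a
WHERE  NOT EXISTS (SELECT 1
                   FROM   ref_countries r
                   WHERE  r.country_name = a.country_name)
GROUP  BY a.country_name
ORDER  BY bad_rows DESC;</pre>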
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/12/data-quality-is-not-a-one-off/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Bitmap Indexes Redux</title>
		<link>http://www.rittmanmead.com/2008/11/06/bitmap-indexes-redux/</link>
		<comments>http://www.rittmanmead.com/2008/11/06/bitmap-indexes-redux/#comments</comments>
		<pubDate>Thu, 06 Nov 2008 09:10:40 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
		
		<category><![CDATA[Data Warehousing]]></category>

		<category><![CDATA[Oracle Database]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/06/bitmap-indexes-redux/</guid>
		<description><![CDATA[I was running a data warehousing training event for a client recently, and in one of the sessions we looked at bitmap indexes and bitmap join indexes. Whilst the session went OK I remember thinking to myself at the end, &#8220;you know what, there&#8217;s a fair bit about bitmap indexes I&#8217;m not too clear on&#8221;, [...]]]></description>
			<content:encoded><![CDATA[<p>I was running a data warehousing training event for a client recently, and in one of the sessions we looked at bitmap indexes and bitmap join indexes. Whilst the session went OK I remember thinking to myself at the end, &#8220;you know what, there&#8217;s a fair bit about bitmap indexes I&#8217;m not too clear on&#8221;, and so I made a mental note after the session to take a closer look at how they work.</p>
<p>Of course bitmap indexes, and indexes in general, are a subject that people like <a href="http://richardfoote.wordpress.com/">Richard Foote</a> and <a href="http://jonathanlewis.wordpress.com">Jonathan Lewis</a> have covered very well in the past and there are a number of good articles on the internet that explain how the feature works. In particular three articles by Jonathan, on <a href="http://www.dbazine.com/oracle/or-articles/jlewis3">bitmap indexes</a>, <a href="http://www.dbazine.com/oracle/or-articles/jlewis6">bitmap star transformations</a> and <a href="http://www.dbazine.com/oracle/or-articles/jlewis7">bitmap join indexes</a>, are about the closest I&#8217;ve seen to a definitive introduction to the subject, whereas <a href="http://www.oracle.com/technology/pub/articles/sharma_indexes.html">this article by Vivek Sharma</a> compares bitmap indexes to b-tree indexes in the context of decision support with some useful examples of test cases to illustrate the differences. There&#8217;s also a <a href="http://en.wikipedia.org/wiki/Bitmap_index">good article on Wikipedia</a> that sets out a generic background to bitmap indexes, which of course are used across the database industry and aren&#8217;t just an Oracle feature. Looking through these articles was certainly useful in filling in some of the gaps in my knowledge, and in particular confirmed the basic understanding that I had around the subject.</p>
<p>Bitmap indexes differ from the more common b-tree indexes in that they use bit arrays (also known as &#8220;bitmaps&#8221;) to indicate whether a particular row in a table contains a particular column value. To take an example, you might have a table of customer details, and you have a column called &#8220;hair colour&#8221;. A bitmap index on this column would logically store the index data in the following way:</p>
<p style="text-align: center"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/bitmap-index-1.jpg" alt="Bitmap Index 1" border="1" height="165" hspace="4" vspace="4" width="450" /></p>
<p>So what you have there is a string of ones and zeros (the &#8220;bit array&#8221; mentioned above) that records whether a particular row has that particular column value. Now the clever bit comes when you want to filter based on more than one bitmap-indexed value; in this case, the bit arrays are logically ANDed, ORed or XORed so that you can quickly find out, for example, which customers have red hair, big feet, snore and have bad breath (so that presumably you can avoid sending them an invite to next week&#8217;s customer reception):</p>
<p style="text-align: center"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/11/bitmap-index-2.jpg" alt="Bitmap Index 2" border="1" height="130" hspace="4" vspace="4" width="500" /></p>
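<p>In Oracle terms, the combination step might look something like the sketch below. The table and column names (customer_details, hair_colour, foot_size) are hypothetical, and whether the optimizer actually combines the two indexes with a BITMAP AND operation depends on statistics and data distribution, so check the execution plan rather than assuming it:</p>
<pre>-- One bitmap index per low-cardinality attribute column
CREATE BITMAP INDEX cust_hair_bix ON customer_details (hair_colour);
CREATE BITMAP INDEX cust_feet_bix ON customer_details (foot_size);

-- Filter on both attributes; the plan may show the two bitmaps being ANDed
EXPLAIN PLAN FOR
SELECT COUNT(*)
FROM   customer_details
WHERE  hair_colour = 'Red'
AND    foot_size   = 'Big';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);</pre>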
<p>So if that&#8217;s how they logically work, the next question I had was &#8220;how are these details stored internally?&#8221; Jonathan&#8217;s <a href="http://www.dbazine.com/oracle/or-articles/jlewis3">first article</a>, on bitmap indexes, does a good job of explaining this (it&#8217;s written for Oracle 9i so I&#8217;m not sure if the details have changed) and includes a symbolic dump that shows bitmap index leaf blocks being made up of five components:</p>
<p>1) A set of flags<br />
2) A lock byte<br />
3) The indexed value (&#8220;Red Hair&#8221;, for example, or &#8220;Big Feet&#8221;)<br />
4) A pair of row IDs, and<br />
5) A stream of bits</p>
<p>The lock byte is used to determine whether this particular part of the index is locked, because of some DML happening to the underlying table. The pair of row IDs designates the portion of the table that this index leaf block covers (back in the 9i days, Jonathan reckons this range can cover up to 24,000 rows; this may have changed with more recent releases), whilst the stream of bits tells Oracle whether a particular row ID contains the indexed value in question. Presumably then some sort of compression algorithm is applied and, depending on the number of distinct values that are being indexed and the distribution of the data, the index is then stored taking up potentially far less space than a regular b-tree index.</p>
<p>So the general thinking here is that bitmap indexes are particularly suited to decision support applications because:</p>
<p>(a) they can be easily combined using bit-wise operations to filter data based on multiple criteria<br />
(b) their high compression ratios mean that they can take up less space than comparable b-tree indexes<br />
(c) they can be quick to create, meaning you can create more of them in a given ETL window or your ETL process can take less time, and<br />
(d) if you are using Oracle, you can use additional features such as bitmap star transformations to rapidly return data from your data warehouse.</p>
<p>Now there are a lot of claims, myths and ideas out there about bitmap indexes which are partly explained by the fact that it&#8217;s often difficult to test them under the right conditions (multiple concurrent users, for example, or with big enough data sets) and the characteristics of the feature have changed over time (the effect of multiple updates onto bitmap-indexed columns has become less pronounced since the advent of 10g), with these two blog postings (<a href="http://technology.amis.nl/blog/1420/myths-on-bitmap-indexes">here</a>, and <a href="http://jonathanlewis.wordpress.com/2006/12/19/mything-in-action/#comments">here</a>) being good examples of where people can end up drawing the wrong conclusions.</p>
<p>That said though, looking back at the session I ran, in hindsight what it lacked were some basic examples of bitmap index creation, with some tests afterwards to compare their build time, size and query performance with those of regular b-tree indexes. I&#8217;m also curious to see what extra benefit is gained by using bitmap join indexes, both in comparison to regular bitmap indexes and compared to what seems to me to be a similar thing: creating join-only materialized views (though I could be wrong here). Given this, I took the SH Sales History data set and started to knock up some examples. Now bear with me here, as this is still only a very limited example. I&#8217;m also happy for anyone who knows more about this (Pete? Richard? Stuart?) to chip in and point out if I&#8217;ve drawn the wrong conclusion.</p>
<p>To start off then, I&#8217;m going to take two copies of the SH.CUSTOMERS table, and load the table data back into the tables a couple of times to increase the table row counts, otherwise most of the tests I do will return data too quickly. The version I&#8217;m using of Oracle is 11.1.0.6 on Windows XP SP2, running on a VMWare Fusion virtual machine with 1.25GB of RAM and 1 processor enabled.</p>
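<p>The statements used to build the two copies aren&#8217;t shown in the post. As a minimal sketch of one way to do it, assuming the SH sample schema is readable from the current schema: each copy holds 1,776,000 rows, which is 32 times the 55,500 rows in SH.CUSTOMERS, so five self-doubling inserts would give exactly that count. That is an inference from the row counts rather than the recorded method.</p>
<pre>-- Assumed setup, not taken from the original post
create table customers_copy_btree as select * from sh.customers;

begin
  -- each pass doubles the table: 55,500 rows x 2^5 = 1,776,000 rows
  for i in 1 .. 5 loop
    insert into customers_copy_btree select * from customers_copy_btree;
    commit;
  end loop;
end;
/

-- second, identical copy to carry the bitmap indexes
create table customers_copy_bitmap as select * from customers_copy_btree;</pre>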
<pre>SQL&gt; select count(*) from customers_copy_btree;

COUNT(*)
----------
1776000                                                                                                                                                                                          

SQL&gt; select count(*) from customers_copy_bitmap;

COUNT(*)
----------
1776000</pre>
<p>OK, so both CUSTOMERS table copies have now got around 1.8M rows within them. The SH schema from which they are sourced is a star schema, so the CUSTOMERS table has lots of &#8220;attribute&#8221; columns that I can index and then filter on.</p>
<pre>SQL&gt; select count(distinct cust_gender) from customers_copy_bitmap;

COUNT(DISTINCTCUST_GENDER)
--------------------------
2                                                                                                                                                                          

SQL&gt; select count(distinct cust_year_of_birth) from customers_copy_bitmap;

COUNT(DISTINCTCUST_YEAR_OF_BIRTH)
---------------------------------
75                                                                                                                                                                   

SQL&gt; select count(distinct cust_last_name) from customers_copy_bitmap;

COUNT(DISTINCTCUST_LAST_NAME)
-----------------------------
908                                                                                                                                                                       

SQL&gt; select count(distinct cust_street_address) from customers_copy_bitmap;

COUNT(DISTINCTCUST_STREET_ADDRESS)
----------------------------------
50945</pre>
<p>So looking at those distinct value counts, the table&#8217;s attributes range from 2 distinct values (gender) through 75 (year of birth) and 908 (last name) to 50,945 for street addresses. That should give us a range of different column cardinalities to test.</p>
<pre>SQL&gt; set timing on

SQL&gt; create index cus_gender_btree_idx on customers_copy_btree(cust_gender);

Index created.

Elapsed: 00:00:34.01

SQL&gt; create index cus_yob_btree_idx on customers_copy_btree(cust_year_of_birth);

Index created.

Elapsed: 00:00:37.12

SQL&gt; create index cus_lname_btree_idx on customers_copy_btree(cust_last_name);

Index created.

Elapsed: 00:00:29.25

SQL&gt; create index cus_adr_btree_idx on customers_copy_btree(cust_street_address);

Index created.

Elapsed: 00:00:46.43</pre>
<p>Creating b-tree indexes on a selection of columns comes in at a total of 34+37+29+46 = 2 mins 26 seconds. How long do comparable bitmap indexes take to create?</p>
<pre>SQL&gt; create bitmap index cus_gender_bix on customers_copy_bitmap(cust_gender);

Index created.

Elapsed: 00:00:36.11

SQL&gt; create bitmap index cus_yob_bix on customers_copy_bitmap(cust_year_of_birth);

Index created.

Elapsed: 00:00:10.57

SQL&gt; create bitmap index cus_lname_bix on customers_copy_bitmap(cust_last_name);

Index created.

Elapsed: 00:00:13.92

SQL&gt; create bitmap index cus_adr_bix on customers_copy_bitmap(cust_street_address);

Index created.

Elapsed: 00:00:10.87</pre>
<p>So to create the bitmap indexes, the total time was 36+10+14+10 = 1 min 10 seconds, about half the time of the b-tree indexes.  I then gather statistics on the tables and indexes and then run a query to determine the table and index sizes.</p>
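<p>The statistics-gathering step isn&#8217;t shown here. A minimal sketch using DBMS_STATS, where cascade =&gt; true also gathers statistics for each table&#8217;s indexes; the exact command and options actually used are not recorded, so treat the parameters as assumptions.</p>
<pre>-- Assumed statistics-gathering step, not taken from the original post
begin
  dbms_stats.gather_table_stats(
    ownname =&gt; user,
    tabname =&gt; 'CUSTOMERS_COPY_BTREE',
    cascade =&gt; true);   -- include the table's indexes
  dbms_stats.gather_table_stats(
    ownname =&gt; user,
    tabname =&gt; 'CUSTOMERS_COPY_BITMAP',
    cascade =&gt; true);
end;
/</pre>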
<pre>SQL&gt; select segment_name
2  ,	    bytes/1024/1024 "Size in MB"
3  from   user_segments
4  where  segment_name like 'CUSTOMERS_COPY%';

SEGMENT_NAME                             Size in MB
---------------------------------------- ----------
CUSTOMERS_COPY_BITMAP                           368
CUSTOMERS_COPY_BTREE                            368                                                                                                                                                 

SQL&gt; select segment_name
2  ,	    bytes/1024/1024 "Size in MB"
3  from   user_segments
4  where  segment_name like '%_IDX';

SEGMENT_NAME                             Size in MB
---------------------------------------- ----------
CUS_ADR_BTREE_IDX                                72
CUS_GENDER_BTREE_IDX                             26
CUS_LNAME_BTREE_IDX                              36
CUS_YOB_BTREE_IDX                                30                                                                                                                                                 

SQL&gt; select segment_name
2  ,	    bytes/1024/1024 "Size in MB"
3  from   user_segments
4  where  segment_name like '%_BIX';

SEGMENT_NAME                             Size in MB
---------------------------------------- ----------
CUS_ADR_BIX                                      11
CUS_GENDER_BIX                                 .625
CUS_LNAME_BIX                                     3
CUS_YOB_BIX                                       5</pre>
<p>The tables we&#8217;re indexing are 368MB in size, and the b-tree indexes range from 72MB for the high-cardinality address index through to 26MB for the low-cardinality gender index. The bitmap indexes, though, are at least six times smaller, going from 11MB for the address index down to just over half a MB for the gender index. So they&#8217;re certainly smaller, at least for this dataset&#8217;s data distribution, and as we saw before, they take less time to create.</p>
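<p>To put a number on the difference for each column, the two sets of segments can be paired up in a single query. This is just a sketch built on the index names used above; from the sizes already listed, the ratios work out at roughly 6 (year of birth), 6.5 (address), 12 (last name) and nearly 42 (gender).</p>
<pre>-- Pair each b-tree index with its bitmap counterpart and compare sizes
select bt.segment_name               btree_index
,      bm.segment_name               bitmap_index
,      round(bt.bytes / bm.bytes, 1) size_ratio
from   user_segments bt
,      user_segments bm
where  (bt.segment_name, bm.segment_name) in (
         ('CUS_ADR_BTREE_IDX',    'CUS_ADR_BIX'),
         ('CUS_GENDER_BTREE_IDX', 'CUS_GENDER_BIX'),
         ('CUS_LNAME_BTREE_IDX',  'CUS_LNAME_BIX'),
         ('CUS_YOB_BTREE_IDX',    'CUS_YOB_BIX'))
order by size_ratio;</pre>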
<p>How about some queries then? How do the explain plans look for simple queries using these columns as filters? Let&#8217;s try filtering on year of birth, which had 75 different values in our test data.</p>
<pre>SQL&gt; set lines 200
SQL&gt; set autotrace traceonly
SQL&gt; select *
  2  from   customers_copy_btree
  3  where  cust_year_of_birth = 1967;

30592 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2533345702
------------------------------------------------------------------------------------------
| Id  | Operation         | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                      | 30592 |  5526K| 12728   (1)| 00:02:33 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS_COPY_BTREE | 30592 |  5526K| 12728   (1)| 00:02:33 |
------------------------------------------------------------------------------------------                                                                                                              

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                                                                                     

   1 - filter("CUST_YEAR_OF_BIRTH"=1967)                                                                                                                                                                

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      48332  consistent gets
      46337  physical reads
          0  redo size
    4711189  bytes sent via SQL*Net to client
      22845  bytes received via SQL*Net from client
       2041  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      30592  rows processed                                                                                                                                                                             

SQL&gt; select *
  2  from   customers_copy_bitmap
  3  where  cust_year_of_birth = 1967;

30592 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3003431057                                                                                                                                                                             

------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                       | 30592 |  5526K|  5652   (1)| 00:01:08 |
|   1 |  TABLE ACCESS BY INDEX ROWID | CUSTOMERS_COPY_BITMAP | 30592 |  5526K|  5652   (1)| 00:01:08 |
|   2 |   BITMAP CONVERSION TO ROWIDS|                       |       |       |            |          |
|*  3 |    BITMAP INDEX SINGLE VALUE | CUS_YOB_BIX           |       |       |            |          |
------------------------------------------------------------------------------------------------------                                                                                                  

Predicate Information (identified by operation id):
---------------------------------------------------                                                                                                                                                     

   3 - access("CUST_YEAR_OF_BIRTH"=1967)                                                                                                                                                                

Statistics
----------------------------------------------------------
        170  recursive calls
          0  db block gets
      24427  consistent gets
      16164  physical reads
          0  redo size
    5975063  bytes sent via SQL*Net to client
      22845  bytes received via SQL*Net from client
       2041  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      30592  rows processed</pre>
<p>So in that example, Oracle actually used a full table scan for the b-tree indexed table, whilst the bitmap indexed one used an index lookup. The cost of the execution plan against the bitmap indexed table (5652) was the lower of the two (against 12728). Now what about if we filter on several columns? Surely this is where bitmap indexes come into their own.</p>
<pre>SQL&gt; select *
  2  from   customers_copy_bitmap
  3  where  cust_gender = 'M'
  4  and    cust_year_of_birth = 1980
  5  and    cust_last_name = 'SMITH';

no rows selected

Execution Plan
----------------------------------------------------------
Plan hash value: 2767299321                                                                                                                                                                             

------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                       |    21 |  3885 |     8   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID | CUSTOMERS_COPY_BITMAP |    21 |  3885 |     8   (0)| 00:00:01 |
|   2 |   BITMAP CONVERSION TO ROWIDS|                       |       |       |            |          |
|   3 |    BITMAP AND                |                       |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE| CUS_LNAME_BIX         |       |       |            |          |
|*  5 |     BITMAP INDEX SINGLE VALUE| CUS_YOB_BIX           |       |       |            |          |
|*  6 |     BITMAP INDEX SINGLE VALUE| CUS_GENDER_BIX        |       |       |            |          |
------------------------------------------------------------------------------------------------------                                                                                                  

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("CUST_LAST_NAME"='SMITH')
   5 - access("CUST_YEAR_OF_BIRTH"=1980)
   6 - access("CUST_GENDER"='M')                                                                                                                                                                        

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          2  consistent gets
          2  physical reads
          0  redo size
       1701  bytes sent via SQL*Net to client
        405  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed                                                                                                                                                                             

SQL&gt; select *
  2  from   customers_copy_btree
  3  where  cust_gender = 'M'
  4  and    cust_year_of_birth = 1980
  5  and    cust_last_name = 'SMITH';

no rows selected

Execution Plan

----------------------------------------------------------
Plan hash value: 2580896903                                                                                                                                                                             

---------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                      |    21 |  3885 |    77   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID     | CUSTOMERS_COPY_BTREE |    21 |  3885 |    77   (2)| 00:00:01 |
|   2 |   BITMAP CONVERSION TO ROWIDS    |                      |       |       |            |          |
|   3 |    BITMAP AND                    |                      |       |       |            |          |
|   4 |     BITMAP CONVERSION FROM ROWIDS|                      |       |       |            |          |
|*  5 |      INDEX RANGE SCAN            | CUS_LNAME_BTREE_IDX  |  1956 |       |     7   (0)| 00:00:01 |
|   6 |     BITMAP CONVERSION FROM ROWIDS|                      |       |       |            |          |
|*  7 |      INDEX RANGE SCAN            | CUS_YOB_BTREE_IDX    |  1956 |       |    62   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------                                                                                               

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_GENDER"='M')
   5 - access("CUST_LAST_NAME"='SMITH')
   7 - access("CUST_YEAR_OF_BIRTH"=1980)                                                                                                                                                                

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          3  consistent gets
          2  physical reads
          0  redo size
       1701  bytes sent via SQL*Net to client
        405  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          0  rows processed</pre>
<p>Interestingly, when querying the b-tree indexed table, Oracle has converted the row IDs returned by the b-tree index range scans into bitmaps so that it can do a bitmap AND on the filter results, which I presume brings the performance of data warehouse b-tree queries closer to &#8220;native&#8221; bitmap ones (this, I think, has been a feature of the database since the 10g version at least). Note that only the last name and year of birth indexes take part in the b-tree plan; the gender predicate is applied as a filter on the table rows. Even so, the cost of the bitmap query (8) is a lot lower than that of the b-tree one (77), and so far the results are as you&#8217;d expect.</p>
<p>So what about the famous issue with bitmap indexes, the one where updates to the indexed column cause the index to grow in size because it loses compression? That this happens seems to be received wisdom, especially when the updates come as lots of little updates rather than a small number of large ones, although people have mentioned that in recent versions the effect is less pronounced. The other commonly cited issue is that updates against bitmap-indexed tables are slow in comparison to tables with regular b-tree indexes, but this seems to be more of an issue when the updates are applied concurrently, due to portions of the indexed table (or perhaps portions of the indexed column) getting locked whilst the associated bitmap index is reconfigured. The concurrency test is tricky to perform when you&#8217;ve just got a single-user laptop, but what about the size issue? If I applied 500 small, individually committed updates to a bitmap-indexed column using Oracle 11.1.0.6, would this increase the size of the index now?</p>
<pre>SQL&gt; create index test_cust_id_1_idx on customers_copy_btree (cust_id);

Index created.

SQL&gt; create index test_cust_id_2_idx on customers_copy_bitmap (cust_id);

Index created.

SQL&gt; set timing on

SQL&gt; declare
  2  upd_cust_id number(5);
  3  cust_yob_value number(4);
  4  begin
  5    for i in 1 .. 500 loop
  6  	 upd_cust_id := dbms_random.value(1,55000);
  7  	 cust_yob_value := dbms_random.value(1900,2000);
  8  	 update customers_copy_btree
  9  	   set cust_year_of_birth = cust_yob_value
 10  	   where cust_id = upd_cust_id;
 11  	 commit;
 12    end loop;
 13  end;
 14  /

PL/SQL procedure successfully completed.

Elapsed: 00:03:15.67

SQL&gt; declare
  2  upd_cust_id number(5);
  3  cust_yob_value number(4);
  4  begin
  5    for i in 1 .. 500 loop
  6  	 upd_cust_id := dbms_random.value(1,55000);
  7  	 cust_yob_value := dbms_random.value(1900,2000);
  8  	 update customers_copy_bitmap
  9  	   set cust_year_of_birth = cust_yob_value
 10  	   where cust_id = upd_cust_id;
 11  	 commit;
 12    end loop;
 13  end;
 14  /

PL/SQL procedure successfully completed.

Elapsed: 00:05:46.21

SQL&gt; set timing off

SQL&gt; select segment_name
  2  ,	    bytes/1024/1024 "Size in MB"
  3  from   user_segments
  4  where  segment_name in ('CUS_YOB_BTREE_IDX','CUS_YOB_BIX');

SEGMENT_NAME                             Size in MB
---------------------------------------- ----------
CUS_YOB_BIX                                       5
CUS_YOB_BTREE_IDX                                30</pre>
<p>So the updates to the bitmap-indexed column took a fair bit longer than those to the b-tree indexed column (5 mins 46 seconds against 3 mins 15 seconds), although as I said this should really be tested with lots of concurrent updates as this is where, I believe, the effect is most pronounced. When it comes to the size of the indexes though, they report the same sizes as before the updates took place (5MB and 30MB), which seems to indicate that in this release, bitmap indexes don&#8217;t grow in size so much when updates are applied. YMMV though.</p>
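<p>One caveat with that size check: USER_SEGMENTS reports allocated extents, so any growth that fits within space already allocated to the index would not show up there. A sketch of a finer-grained check, which is an addition rather than part of the original test, is to re-gather index statistics and compare the leaf block counts in USER_INDEXES before and after the update run.</p>
<pre>-- Assumed follow-up check, not part of the original test
begin
  dbms_stats.gather_index_stats(ownname =&gt; user, indname =&gt; 'CUS_YOB_BIX');
  dbms_stats.gather_index_stats(ownname =&gt; user, indname =&gt; 'CUS_YOB_BTREE_IDX');
end;
/

-- leaf_blocks is populated from the statistics gathered above
select index_name
,      index_type
,      leaf_blocks
,      distinct_keys
from   user_indexes
where  index_name in ('CUS_YOB_BIX','CUS_YOB_BTREE_IDX');</pre>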
<p>OK, so how about bitmap join indexes though? If we&#8217;ve got a fact table that we want to join to the dimension table, how much better performance do we get if we create bitmap join indexes on the fact table that join to the dimension key and dimension attribute values, compared to just standard bitmap indexes? To test this I took a copy of the SH.SALES and SH.CUSTOMERS tables and created some example indexes.</p>
<pre>SQL&gt; select count(*) from customers_bijx_test_bitmap;

  COUNT(*)
----------
     55500                                                                                                                                                                                              

SQL&gt; select count(*) from sales_bijx_test_bitmap;

  COUNT(*)
----------
  14701488                 

SQL&gt; alter table customers_bijx_test_bitmap add constraint cust_bijx_test_bitmap_pk primary key (cust_id);

Table altered.

SQL&gt; select count(*) from customers_bijx_test_bitjoin;

  COUNT(*)
----------
     55500                                                                                                                                                                                              

SQL&gt; select count(*) from sales_bijx_test_bitjoin;

  COUNT(*)
----------
  14701488             

SQL&gt; alter table customers_bijx_test_bitjoin add constraint cust_bijx_test_bitjoin_pk primary key (cust_id);

SQL&gt; set timing on

SQL&gt; create bitmap index sales_bijx_test_bitmap_bix1 on sales_bijx_test_bitmap(cust_id);

Index created.

Elapsed: 00:00:33.43

SQL&gt; create bitmap index cust_bijx_test_bitmap_bix1 on customers_bijx_test_bitmap(cust_last_name);

Index created.

Elapsed: 00:00:00.73

SQL&gt; create bitmap index sales_bijx_test_bitjoin_bjx1 on sales_bijx_test_bitjoin(customers_bijx_test_bitjoin.cust_id)
  2  from sales_bijx_test_bitjoin, customers_bijx_test_bitjoin
  3  where sales_bijx_test_bitjoin.cust_id = customers_bijx_test_bitjoin.cust_id;

Index created.

Elapsed: 00:02:38.64

SQL&gt; create bitmap index sales_bijx_test_bitjoin_bjx2 on sales_bijx_test_bitjoin(customers_bijx_test_bitjoin.cust_last_name)
  2  from sales_bijx_test_bitjoin, customers_bijx_test_bitjoin
  3  where sales_bijx_test_bitjoin.cust_id = customers_bijx_test_bitjoin.cust_id;

Index created.

Elapsed: 00:02:46.70

SQL&gt; set timing off</pre>
<p>That&#8217;s the bitmap indexes and bitmap join indexes created then, with the bitmap join indexes obviously taking a lot more time to create, as they have to perform the join between the fact and dimension tables. As far as I can see I&#8217;ve now got the same sets of columns indexed, so I then gather statistics for the tables and their indexes (&#8220;analyze table &#8230; compute statistics for table for all indexes for all indexed columns&#8221;) and check the sizes of the indexes.</p>
<pre>SQL&gt; col segment_name for a30

SQL&gt; select segment_name
  2  ,	    bytes/1024/1024 "Size in MB"
  3  from   user_segments
  4  where  segment_name in ('SALES_BIJX_TEST_BITMAP_BIX1','CUST_BIJX_TEST_BITMAP_BIX1',
  5  'SALES_BIJX_TEST_BITJOIN_BJX1','SALES_BIJX_TEST_BITJOIN_BJX2');

SEGMENT_NAME                   Size in MB
------------------------------ ----------
CUST_BIJX_TEST_BITMAP_BIX1           .125
SALES_BIJX_TEST_BITJOIN_BJX1           48
SALES_BIJX_TEST_BITJOIN_BJX2           37
SALES_BIJX_TEST_BITMAP_BIX1            48</pre>
<p>So the two indexes (SALES_BIJX_TEST_BITJOIN_BJX1 and SALES_BIJX_TEST_BITMAP_BIX1) that index the CUST_ID column on the fact table are the same size (48MB), even though one of them includes the join through to the dimension table. The bitmap index on just the customer name column of the dimension table is very small, whereas the bitmap join index that joins the fact table through to the customer name dimension column is, at 37MB, a bit smaller than the ones that just index the CUST_ID column. How do things look when we run some queries then, and generate some execution plans?</p>
<pre>SQL&gt; set lines 200

SQL&gt; set autotrace traceonly

SQL&gt; select sum(s.amount_sold)
  2  from   sales_bijx_test_bitmap s
  3  ,	    customers_bijx_test_bitmap c
  4  where  s.cust_id = c.cust_id
  5  and    c.cust_last_name = 'SMITH';

Execution Plan
----------------------------------------------------------
Plan hash value: 3492243084                                                                                                                                                                             

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                            |     1 |    28 | 19594   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE                |                            |     1 |    28 |            |          |
|*  2 |   HASH JOIN                    |                            |   127K|  3480K| 19594   (2)| 00:03:56 |
|   3 |    TABLE ACCESS BY INDEX ROWID | CUSTOMERS_BIJX_TEST_BITMAP |    61 |   671 |    15   (0)| 00:00:01 |
|   4 |     BITMAP CONVERSION TO ROWIDS|                            |       |       |            |          |
|*  5 |      BITMAP INDEX SINGLE VALUE | CUST_BIJX_TEST_BITMAP_BIX1 |       |       |            |          |
|   6 |    TABLE ACCESS FULL           | SALES_BIJX_TEST_BITMAP     |    14M|   238M| 19512   (2)| 00:03:55 |
-------------------------------------------------------------------------------------------------------------                                                                                           

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("S"."CUST_ID"="C"."CUST_ID")
   5 - access("C"."CUST_LAST_NAME"='SMITH')                                                                                                                                                             

Statistics
----------------------------------------------------------
         24  recursive calls
          0  db block gets
          4  consistent gets
          2  physical reads
          0  redo size
        426  bytes sent via SQL*Net to client
        416  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed                                                                                                                                                                             

SQL&gt; select sum(s.amount_sold)
  2  from   sales_bijx_test_bitjoin s
  3  ,	    customers_bijx_test_bitjoin c
  4  where  s.cust_id = c.cust_id
  5  and    c.cust_last_name = 'SMITH';

Execution Plan
----------------------------------------------------------
Plan hash value: 4184703125                                                                                                                                                                             

--------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                              |     1 |    26 |  3352   (1)| 00:00:41 |
|   1 |  SORT AGGREGATE               |                              |     1 |    26 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID | SALES_BIJX_TEST_BITJOIN      | 18240 |   463K|  3352   (1)| 00:00:41 |
|   3 |    BITMAP CONVERSION TO ROWIDS|                              |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE | SALES_BIJX_TEST_BITJOIN_BJX2 |       |       |            |          |
--------------------------------------------------------------------------------------------------------------                                                                                          

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("S"."SYS_NC00009$"='SMITH')                                                                                                                                                               

Statistics
----------------------------------------------------------
         24  recursive calls
          0  db block gets
          5  consistent gets
          2  physical reads
          0  redo size
        426  bytes sent via SQL*Net to client
        416  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed                                                                                                                                                                             

SQL&gt; set autotrace off</pre>
<p>Looking at the execution plan for the table with just a regular bitmap index, you can see both tables being accessed, though you can also see a full table scan on the sales table, which isn&#8217;t what I&#8217;d expected as the CUST_ID column is indexed. The execution plan for the query against the table with a bitmap join index clearly shows just the one table (the sales table) being accessed, as Oracle can get the dimension information it needs from the bitmap join index, and its estimated cost (3352) is well below that of the first plan (19594). So I&#8217;m not sure about the first test, but the second shows the bitmap join index working and the dimension table not needing to be accessed on this occasion. Again, I&#8217;d like to test how much extra time updates take on a table with bitmap join indexes on it, but again I think the effect will be most noticeable with lots of concurrent updates, so a simple single-user test won&#8217;t really show this properly. Suffice to say, my impression is that if bitmap indexes cause contention issues with lots of concurrent updates, bitmap join indexes will be even worse.</p>
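<p>One avenue for exploring that unexpected full table scan, offered as a sketch and not something tried in the tests above, is to enable star transformation for the session and look at the plan again. The optimizer is still free to choose the full scan if it costs it lower, so this may well produce the same plan.</p>
<pre>-- Assumed experiment: let the optimizer consider a star transformation
alter session set star_transformation_enabled = true;

explain plan for
select sum(s.amount_sold)
from   sales_bijx_test_bitmap s
,      customers_bijx_test_bitmap c
where  s.cust_id = c.cust_id
and    c.cust_last_name = 'SMITH';

-- show the plan just explained
select * from table(dbms_xplan.display);</pre>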
<p>Now all this thinking about &#8220;pre-joining&#8221; tables through the use of bitmap join indexes got me thinking, &#8220;this sounds a lot like the pre-joining you can do through join-only materialized views&#8221;. Join-only materialized views differ from your common or garden materialized views in that they don&#8217;t aggregate or calculate any data, they just store the join in advance so that subsequent queries don&#8217;t need to incur the join cost on every query execution. Surely this sounds a lot like bitmap join indexes? Let&#8217;s create one and compare it to the bitmap join index we created earlier, against another copy of the SH.SALES and SH.CUSTOMERS tables.</p>
<pre>SQL&gt; create materialized view log on sales_mvjoin_test with rowid;

Materialized view log created.

SQL&gt; create materialized view log on cust_mvjoin_test with rowid;

Materialized view log created.

SQL&gt; create materialized view sales_cust_join_mv
2  build immediate
3  refresh fast
4  enable query rewrite as
5  select s.rowid sales_rid, c.rowid cust_rid,
6  	    s.amount_sold, c.cust_id c_cust_id, s.cust_id s_cust_id, c.cust_last_name
7  from   sales_mvjoin_test s, cust_mvjoin_test c
8  where  s.cust_id = c.cust_id;

Materialized view created.

SQL&gt; create bitmap index sales_cust_join_mv_bix1 on sales_cust_join_mv(cust_rid);

Index created.

SQL&gt; create bitmap index sales_cust_join_mv_bix2 on sales_cust_join_mv(sales_rid);

Index created.

SQL&gt; create bitmap index sales_cust_join_mv_bix3 on sales_cust_join_mv(s_cust_id);

Index created.

SQL&gt; create bitmap index sales_cust_join_mv_bix4 on sales_cust_join_mv(c_cust_id);

Index created.

SQL&gt; create bitmap index sales_cust_join_mv_bix5 on sales_cust_join_mv(cust_last_name);

Index created.

SQL&gt; analyze table sales_cust_join_mv
2  compute statistics for table for all indexes
3  for all indexed columns;

Table analyzed.

SQL&gt; alter session set query_rewrite_enabled=true;

Session altered.</pre>
<p>Notice that I had to create materialized view logs on the underlying tables so that my materialized view would &#8220;fast refresh&#8221; (this is the closest comparison to how indexes behave), and I had to subsequently index and then analyze the materialized view, which obviously adds to the preparation time and space taken up, making the creation of the materialized view a fair bit more involved than the creation of the bitmap join index. So how do queries against the bitmap-indexed table, bitmap join indexed-table and tables with a materialized view containing joins compare?</p>
<pre>SQL&gt; set lines 200

SQL&gt; set autotrace traceonly

SQL&gt; 

SQL&gt; select sum(s.amount_sold)
  2  from   sales_bijx_test_bitmap s
  3  ,	    customers_bijx_test_bitmap c
  4  where  s.cust_id = c.cust_id
  5  and    c.cust_last_name = 'SMITH';

Execution Plan
----------------------------------------------------------
Plan hash value: 3492243084                                                                                                                                                                             

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                            |     1 |    28 | 19594   (2)| 00:03:56 |
|   1 |  SORT AGGREGATE                |                            |     1 |    28 |            |          |
|*  2 |   HASH JOIN                    |                            |   127K|  3480K| 19594   (2)| 00:03:56 |
|   3 |    TABLE ACCESS BY INDEX ROWID | CUSTOMERS_BIJX_TEST_BITMAP |    61 |   671 |    15   (0)| 00:00:01 |
|   4 |     BITMAP CONVERSION TO ROWIDS|                            |       |       |            |          |
|*  5 |      BITMAP INDEX SINGLE VALUE | CUST_BIJX_TEST_BITMAP_BIX1 |       |       |            |          |
|   6 |    TABLE ACCESS FULL           | SALES_BIJX_TEST_BITMAP     |    14M|   238M| 19512   (2)| 00:03:55 |
-------------------------------------------------------------------------------------------------------------                                                                                           

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("S"."CUST_ID"="C"."CUST_ID")
   5 - access("C"."CUST_LAST_NAME"='SMITH')                                                                                                                                                             

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          2  consistent gets
          2  physical reads
          0  redo size
        426  bytes sent via SQL*Net to client
        416  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed                                                                                                                                                                             

SQL&gt; select sum(s.amount_sold)
  2  from   sales_bijx_test_bitjoin s
  3  ,	    customers_bijx_test_bitjoin c
  4  where  s.cust_id = c.cust_id
  5  and    c.cust_last_name = 'SMITH';

Execution Plan
----------------------------------------------------------
Plan hash value: 4184703125                                                                                                                                                                             

--------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                              |     1 |    26 |  3352   (1)| 00:00:41 |
|   1 |  SORT AGGREGATE               |                              |     1 |    26 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID | SALES_BIJX_TEST_BITJOIN      | 18240 |   463K|  3352   (1)| 00:00:41 |
|   3 |    BITMAP CONVERSION TO ROWIDS|                              |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE | SALES_BIJX_TEST_BITJOIN_BJX2 |       |       |            |          |
--------------------------------------------------------------------------------------------------------------                                                                                          

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("S"."SYS_NC00009$"='SMITH')                                                                                                                                                               

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          3  consistent gets
          2  physical reads
          0  redo size
        426  bytes sent via SQL*Net to client
        416  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed                                                                                                                                                                             

SQL&gt; select sum(s.amount_sold)
  2  from   sales_mvjoin_test s
  3  ,	    cust_mvjoin_test c
  4  where  s.cust_id = c.cust_id
  5  and    c.cust_last_name = 'SMITH';

Execution Plan
----------------------------------------------------------
Plan hash value: 1046520199                                                                                                                                                                             

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                         |     1 |    20 |  3497   (1)| 00:00:42 |
|   1 |  SORT AGGREGATE                         |                         |     1 |    20 |            |          |
|   2 |   MAT_VIEW REWRITE ACCESS BY INDEX ROWID| SALES_CUST_JOIN_MV      | 18240 |   356K|  3497   (1)| 00:00:42 |
|   3 |    BITMAP CONVERSION TO ROWIDS          |                         |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE           | SALES_CUST_JOIN_MV_BIX5 |       |       |            |          |
-------------------------------------------------------------------------------------------------------------------                                                                                     

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("SALES_CUST_JOIN_MV"."CUST_LAST_NAME"='SMITH')                                                                                                                                            

Statistics
----------------------------------------------------------
         81  recursive calls
          0  db block gets
        183  consistent gets
       2865  physical reads
          0  redo size
        426  bytes sent via SQL*Net to client
        416  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
          1  rows processed                                                                                                                                                                             

SQL&gt; set autotrace off</pre>
<p>So there&#8217;s not a lot in it between the costs in the execution plans for the bitmap join indexed-table (3352) and the table with an associated materialized view containing joins only (3497), but the consistent gets and physical reads when using the materialized view (183 and 2865) are far higher than with the bitmap join index approach (3 and 2). Given that the materialized view took a fair bit more work to set up than the bitmap join index, based on these figures I&#8217;d stick with the bitmap join index. Two things I know I&#8217;ve ignored, though, are (a) I could just as easily make the materialized view contain the aggregate of sales rather than just the detail, which you can&#8217;t do with indexes, and (b) again, one thing I can&#8217;t really test here is how well both of these solutions stand up to concurrent inserts, updates and deletes. So far, to me there&#8217;s not a lot in it, except that bitmap join indexes appear to be more efficient, at least for querying when comparing apples to apples, but I suspect there&#8217;s a bit more to it here - anyone else have any thoughts?</p>
<p>Anyway, to conclude. With the big caveats - &#8220;your mileage may vary&#8221; and &#8220;these were only simple tests, on a single-user, one-CPU system&#8221; - here are my thoughts, using 11.1.0.6:</p>
<ul>
<li>As is generally accepted, bitmap indexes can be a good solution for DSS-style applications that filter data based on multiple columns</li>
<li>In recent versions of Oracle though (10g+ I think), Oracle will often convert b-tree indexes to bitmap ones on the fly, bringing b-tree indexes nearer to &#8220;native&#8221; bitmap index performance</li>
<li>Updates to bitmap indexes are still slow compared to b-tree indexes, but the size increase issue is perhaps less pronounced now with recent (10g+) versions of Oracle</li>
<li>The main issue around bitmap index updates though is where updates are concurrent, and you can only really test this with a realistic workload. However this isn&#8217;t really the situation where you&#8217;d want to use bitmap indexes anyway (OLTP applications)</li>
<li>Bitmap join indexes can give you even more of a DSS-query performance boost, by removing the need at query time to join to the dimension table. However my suspicion is that they are even more costly to update, especially when updates are highly concurrent</li>
<li>Materialized Views using joins only are an alternative to bitmap join indexes, but probably have little additional value in their non-aggregated form, and they take a lot more work to set up. If you&#8217;re considering using materialized views, you might as well go the whole hog and include the aggregates in them as well.</li>
</ul>
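<p>To make the last point a bit more concrete, here&#8217;s a rough sketch of what an aggregate version of the materialized view might look like. I haven&#8217;t run this as part of the tests above; the name sales_cust_agg_mv is made up for illustration, and I&#8217;ve assumed a complete, on-demand refresh to keep things simple, as fast refresh of an aggregate materialized view places extra requirements on the materialized view logs.</p>

```sql
-- Hypothetical aggregate variant of sales_cust_join_mv. The view name and
-- the REFRESH COMPLETE ON DEMAND choice are assumptions, not part of the
-- tests in this posting.
create materialized view sales_cust_agg_mv
build immediate
refresh complete on demand
enable query rewrite as
select c.cust_last_name,
       sum(s.amount_sold)   as total_amount_sold,
       count(s.amount_sold) as cnt_amount_sold,
       count(*)             as cnt_rows
from   sales_mvjoin_test s, cust_mvjoin_test c
where  s.cust_id = c.cust_id
group  by c.cust_last_name;
```

<p>With query rewrite enabled, the sum(amount_sold) query used in the tests would be a candidate for rewrite against this much smaller summary, though whether the optimizer actually chooses to rewrite it depends on the statistics and rewrite settings in place.</p>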
<p>Any thoughts from anyone on this?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/06/bitmap-indexes-redux/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Oracle BIWA Summit, 2nd-3rd December 2008</title>
		<link>http://www.rittmanmead.com/2008/11/01/oracle-biwa-summit-2nd-3rd-december-2008/</link>
		<comments>http://www.rittmanmead.com/2008/11/01/oracle-biwa-summit-2nd-3rd-december-2008/#comments</comments>
		<pubDate>Sat, 01 Nov 2008 09:03:26 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
		
		<category><![CDATA[User Groups &amp; Conferences]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/01/oracle-biwa-summit-2nd-3rd-december-2008/</guid>
		<description><![CDATA[One conference that I won&#8217;t be going to, but I really wish I was, is the second Oracle BIWA Summit on December 2nd and 3rd 2008 at Redwood Shores, California. The Oracle BIWA SIG (Business Intelligence, Warehousing and Analytics Special Interest Group) is a SIG within the IOUG that focuses on the database-driven part of [...]]]></description>
			<content:encoded><![CDATA[<p>One conference that I won&#8217;t be going to, but I really wish I was, is the second <a href="http://ioug.itconvergence.com/pls/htmldb/f?p=219:25:1610458228956689::NO:::">Oracle BIWA Summit</a> on December 2nd and 3rd 2008 at Redwood Shores, California. The <a href="http://www.oraclebiwa.org">Oracle BIWA SIG</a> (Business Intelligence, Warehousing and Analytics Special Interest Group) is a SIG within the IOUG that focuses on the database-driven part of BI and analytics, which contrasts with ODTUG which mainly focuses on the tools. I was one of the keynote speakers at last year&#8217;s event whilst Jon did one of the presentations; this year Pete is going along to represent Rittman Mead, as the UK Oracle User Group conference is on at the same time.</p>
<p>Looking at the <a href="http://ioug.itconvergence.com/pls/htmldb/f?p=219:45:1610458228956689::NO">agenda</a>, you&#8217;ve got sessions from Bob Stackowiak (who shared the stage with me at the recent PHLOUG event), Tim Gorman, Chris Claterbos, Ian Abramson (IOUG Chair), Bud Endress and Marty Gubar (Oracle OLAP), Joe Leva, Edward Roske (Essbase Oracle ACE Director, entering the lion&#8217;s den), Charlie Berger, Scott Rappoport and many other US-based and international speakers. The BIWA event is about as close as you can get to a worldwide Oracle BI&#038;DW conference, and I&#8217;m pleased to see that ODTUG are taking part as one of the official sponsors. If you&#8217;re interested, I&#8217;d definitely recommend the event and it&#8217;s a shame I can&#8217;t be there. Good luck to Shyam, Mark, Charlie and the other organizers and I hope it&#8217;ll be a great event.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/01/oracle-biwa-summit-2nd-3rd-december-2008/feed/</wfw:commentRss>
		</item>
		<item>
		<title>ODTUG Kaleidoscope 2009 : Abstract Deadline About To Close</title>
		<link>http://www.rittmanmead.com/2008/10/29/odtug-kaleidoscope-2009-abstract-deadline-about-to-close/</link>
		<comments>http://www.rittmanmead.com/2008/10/29/odtug-kaleidoscope-2009-abstract-deadline-about-to-close/#comments</comments>
		<pubDate>Wed, 29 Oct 2008 06:28:28 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/10/29/odtug-kaleidoscope-2009-abstract-deadline-about-to-close/</guid>
		<description><![CDATA[ODTUG Kaleidoscope 2009 is being held next June in Monterey, California, and the call for papers closes on November 3rd. If you&#8217;re thinking about putting an abstract in, there&#8217;s just a few days left now.

ODTUG Kaleidoscope is different to all the other conferences in that it&#8217;s practical, technical, focused on developers and has a very [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.odtugkaleidoscope.com/">ODTUG Kaleidoscope 2009</a> is being held next June in Monterey, California, and the <a href="http://www.odtugkaleidoscope.com/abstracts.html">call for papers</a> closes on November 3rd. If you&#8217;re thinking about putting an abstract in, there&#8217;s just a few days left now.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp2/wp-content/uploads/2008/10/image001.jpg" height="59" width="400" border="1" hspace="4" vspace="4" alt="Image001" /></p>
<p>ODTUG Kaleidoscope is different to all the other conferences in that it&#8217;s practical, technical, focused on developers and has a very strong BI and Essbase slant. Together with Kent Graziano I chair the ODTUG BI&#38;DW SIG, and we&#8217;re keen to get papers this year that go beyond the basics and really get under the hood of tools like Oracle Warehouse Builder, Oracle BI Enterprise Edition, Oracle Discoverer and Oracle Essbase. The event runs over five days and has a relatively small, select list of attendees (compared to events like Collaborate or OOW) meaning that you get to meet everyone, the catering is superb and we can focus just on the needs of developers.</p>
<p>If you&#8217;re interested in speaking, <a href="http://www.odtugkaleidoscope.com/abstracts.html">submit an abstract</a>, and if you&#8217;re thinking of attending and would like to suggest a topic or speaker, take a look at the <a href="http://www.odtugkaleidoscope.com/forum/">Kaleidoscope Forum</a> where you can make suggestions or discuss the conference. Whichever way, make a note in your diary and hopefully we&#8217;ll see you in Monterey next year!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/10/29/odtug-kaleidoscope-2009-abstract-deadline-about-to-close/feed/</wfw:commentRss>
		</item>
		<item>
		<title>End-to-end data quality</title>
		<link>http://www.rittmanmead.com/2008/10/25/end-to-end-data-quality/</link>
		<comments>http://www.rittmanmead.com/2008/10/25/end-to-end-data-quality/#comments</comments>
		<pubDate>Sat, 25 Oct 2008 10:56:14 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
		
		<category><![CDATA[BI (General)]]></category>

		<category><![CDATA[Data Quality]]></category>

		<category><![CDATA[Methodology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/10/25/end-to-end-data-quality/</guid>
		<description><![CDATA[One of our customers is about to embark on a significant BI project; but being in the &#8220;public sector&#8221; they have to (by EU law) publish tender documents so that qualified suppliers throughout the EU can bid to do the work. This means they have a gap of almost a year before the, yet to [...]]]></description>
			<content:encoded><![CDATA[<p>One of our customers is about to embark on a significant BI project; but being in the &#8220;public sector&#8221; they have to (by EU law) publish tender documents so that qualified suppliers throughout the EU can bid to do the work. This means they have a gap of almost a year before the yet-to-be-selected BI infrastructure can be implemented and work on building the solution can start.</p>
<p>In the interim, the customer can work on data quality; they know what they need to report on (it&#8217;s in the project mandate!) and they know the sources of information (their operational systems) so they can start to verify that all of the required facts can be found in the source systems and more importantly look at the data content and assess &#8220;fitness for purpose&#8221;. If data defects are found then it may be possible to get them fixed before the serious construction of the ETL layer starts. Besides, the knowledge of source and target gives a good head start in the specification of ETL interfaces.</p>
<p>One particular issue they might meet, and one that is sadly far too common across many business sectors, is the use of operational systems that do not enforce data integrity. For whatever reason, there is just too much freedom in data entry, and although it may not affect the operational system much, it really can cause problems when you try to aggregate information on the BI system.</p>
<p>But how do we deal with this? Recently I joined in on a thread on one of the LinkedIn BI groups where it was proposed that a &#8220;receive garbage, store garbage&#8221; strategy be adopted - in my opinion this might be OK for a mature BI system where users understand that the reporting accurately reflects the source, but for a new venture into BI? To me, this seems to be too much of a risk; it might be that the new BI users do not have sufficient exposure to the source systems to realise that the data is at fault on the source. We could prevent data that fails a quality threshold from loading on the BI system, but then we would show <em>incomplete</em> results which, although correctly aggregated, are misleading because of omission; at the end of the day, load policy is a business choice. If we go with the &#8220;reject poor data&#8221; route we should seriously think about providing a data quality dashboard on the reporting system to indicate the numbers of records that failed to be loaded, with drill-down to the reasons why they failed.</p>
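<p>As a rough illustration of what might sit behind such a dashboard, the sketch below assumes the load process writes every rejected record to a reject table along with the reason it failed. The table dq_rejected_rows and its columns are hypothetical names, not something from the customer&#8217;s system.</p>

```sql
-- Hypothetical reject table: one row per source record that failed a
-- quality rule during the load, with the reason it was rejected.
create table dq_rejected_rows (
  load_date      date           not null,
  source_system  varchar2(30)   not null,
  source_key     varchar2(100)  not null,
  reject_reason  varchar2(200)  not null
);

-- Dashboard summary: how many rows failed in each load, and why.
select load_date,
       source_system,
       reject_reason,
       count(*) as rejected_rows
from   dq_rejected_rows
group  by load_date, source_system, reject_reason
order  by load_date desc, rejected_rows desc;
```

<p>Keeping the source key on each rejected row is what would allow the drill-down from a count to the individual records.</p>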
<p>So what do we do with data that fails the quality standard? Ideally, we should get it fixed at source. Auto-fixing on load is possible, but then we need to think about data governance and the possible &#8216;trust&#8217; problems of the data not being aligned with the source. Maybe you could &#8216;standardise&#8217; country names and other columns on loading; I&#8217;ve seen systems with &#8216;USA&#8217;, &#8216;U.S.A&#8217;, &#8216;U S A&#8217;, &#8216;US of A&#8217;, &#8216;America&#8217;, and &#8216;US&#8217; in the country data feed, and that&#8217;s before we get to the mis-keying of &#8216;United&#8217; to get &#8216;Untied&#8217;! But maybe that sort of improvement in quality should also be available to operational systems users.</p>
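<p>One way the country name standardisation could be sketched is with a lookup table of known spellings that is joined to the incoming data on load. Everything named here (country_synonyms, stg_customers and their columns) is an assumption for illustration; the spellings are the ones listed above.</p>

```sql
-- Hypothetical lookup table mapping raw spellings (upper-cased, trimmed)
-- to one agreed country name.
create table country_synonyms (
  raw_value      varchar2(100) primary key,
  standard_name  varchar2(100) not null
);

insert into country_synonyms values ('USA',           'United States');
insert into country_synonyms values ('U.S.A',         'United States');
insert into country_synonyms values ('U S A',         'United States');
insert into country_synonyms values ('US OF A',       'United States');
insert into country_synonyms values ('AMERICA',       'United States');
insert into country_synonyms values ('US',            'United States');
insert into country_synonyms values ('UNTIED STATES', 'United States');

-- Keep the source value next to the standardised one so that the BI
-- figures can still be traced back to what the operational system holds.
-- Spellings with no match in the lookup pass through unchanged.
select s.customer_id,
       s.country                        as source_country,
       nvl(m.standard_name, s.country)  as standard_country
from   stg_customers s
left   outer join country_synonyms m
on     m.raw_value = upper(trim(s.country));
```

<p>Carrying both columns through to the BI system is one way to ease the trust problem, and the same lookup could be offered back to the operational system users.</p>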
<p>For this customer, I have suggested that they construct a source to BI target matrix and include some basic traffic light measures on the source data:</p>
<ul>
<li>How good is it?</li>
<li>What sort of errors are present: missing items, typographical errors, missing or incorrect parents, inconsistent use of names, even data entered in the wrong fields?</li>
<li>How important is it to be correct in the BI system? For example, street address cannot be aggregated in reporting and we may not be going to use BI to create mailing lists, but postal code (or a substring of it) can be used to aggregate people by location areas.</li>
<li>How important is it to be correct on the operational source - do we need to apply the corrections at source to improve the operational use of the system?</li>
</ul>
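<p>A traffic light for one cell of that matrix could be derived with a simple profiling query. The sketch below only measures completeness (the &#8220;missing items&#8221; kind of error) for a single column; the table stg_customers, the column postal_code and the 99% and 90% thresholds are all placeholders.</p>

```sql
-- Hypothetical staging table and thresholds; the 99% and 90% cut-offs
-- stand in for whatever the business agrees is good enough.
-- Note that an empty table reports GREEN, which you may want to treat
-- as a separate case.
select 'STG_CUSTOMERS.POSTAL_CODE' as source_column,
       count(*)                    as total_rows,
       count(postal_code)          as populated_rows,
       case
         when count(postal_code) >= 0.99 * count(*) then 'GREEN'
         when count(postal_code) >= 0.90 * count(*) then 'AMBER'
         else 'RED'
       end                         as traffic_light
from   stg_customers;
```

<p>Other error types, such as missing parents or inconsistent names, would each need their own check feeding the same matrix.</p>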
<p>But this type of quality review may not tackle the data problem that is probably hardest to deal with: what is a correct fact? How do I know if a house value of £20,000 is reasonable (it could be in a shared ownership scheme), or £2,000,000, or £20,000,000? We could set a validation range, but where is the point at which one penny more is obviously wrong but the current value is OK?</p>
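<p>A validation range does not answer that question, but one way to soften the hard boundary is to flag out-of-range values for review rather than reject them. In this sketch the table stg_properties, its columns and the limits of £20,000 and £2,000,000 are purely illustrative.</p>

```sql
-- Hypothetical table, column and limits; rows outside the range are
-- flagged for a person to review rather than rejected outright.
select property_id,
       house_value,
       case
         when house_value is null then 'MISSING'
         when house_value between 20000 and 2000000 then 'OK'
         else 'REVIEW'
       end as value_check
from   stg_properties;
```

<p>The limits themselves remain a business decision, and a value just inside the range is no more guaranteed to be correct than one just outside it.</p>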
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/10/25/end-to-end-data-quality/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
