<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rittman Mead Consulting &#187; Peter Scott</title>
	<atom:link href="http://www.rittmanmead.com/author/peter-scott/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rittmanmead.com</link>
	<description>Delivering Oracle Business Intelligence</description>
	<lastBuildDate>Mon, 06 Feb 2012 21:18:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>More Notes on Right-Time BI</title>
		<link>http://www.rittmanmead.com/2011/11/more-notes-on-right-time-bi/</link>
		<comments>http://www.rittmanmead.com/2011/11/more-notes-on-right-time-bi/#comments</comments>
		<pubDate>Tue, 01 Nov 2011 15:02:23 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Warehousing]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=9066</guid>
		<description><![CDATA[Over the past couple of years Stewart Bryson and I have been looking into things &#8220;right time&#8221; (or is that realtime?). It is great to have him around to trade ideas (and graphics for presentations!). Most of what we have been discussing has been about &#8220;traditional reporting&#8221;, either with or without a data warehouse, and [...]]]></description>
			<content:encoded><![CDATA[<p>Over the past couple of years Stewart Bryson and I have been looking into things &#8220;right time&#8221; (or is that realtime?). It is great to have him around to trade ideas (and graphics for presentations!). Most of what we have been discussing has been about &#8220;traditional reporting&#8221;, either with or without a data warehouse, and definitely in the realms of &#8220;how well have we done&#8221;. However, that is not the sole use case for right time BI.</p>
<p>I have long felt that BI is only done for one of three reasons &#8211; the law says we must report things, it saves us money, or it makes us money; so if knowing something sooner gives us competitive advantage then surely that is a good thing. Knowing sooner is not enough though; it is also about being able to act on the information to facilitate a change in the organization that enhances return (or lowers costs). To my mind we are moving from the traditional &#8220;let&#8217;s look at this in aggregate&#8221; stance to a world where we ask &#8220;what is the significance of this newly observed fact&#8221;. This type of analysis requires a body of data to create a reference model and access to smart statistical tools to allow us to make judgments based on probabilities. Making such decisions based on dynamic events is not just for stock markets and bankers; the same principles apply in many sectors. I know of some restaurant chains that have investigated using centrally monitored sales across all outlets to dynamically adjust staff levels based on likely demand &#8211; staff are sent home, brought in, or moved between outlets based on a predictive model that uses past trading patterns across many outlets.</p>
<p>As usual, most of the building blocks we need to do this are available to us; we just need a bit of creativity to join them together into an architecture. For this kind of use I feel that messaging should be core to the data capture &#8211; we want to look at single items of &#8220;fact&#8221; and do some statistical analysis on them before adding them to the data warehouse (or whatever form our data repository takes) so that the new fact can become part of the base data set we use to analyze the next fact to arrive. Micro-batch loading of log-based change data is probably less suited here for two reasons:</p>
<ol>
<li>loading at discrete, fixed intervals adds to the latency, and</li>
<li>processing many items at a time complicates the statistical analysis and alerting phases (after all, if we get 2453 credit card transactions in a batch, only a few will be potentially fraudulent).</li>
</ol>
<p>After we capture a message from the source system we can pass the information through a chain of processes to analyze the information, propagate alerts based on the statistical significance of the item and add the data to the data store so that it becomes part of the knowledge. This last stage of adding the message to the data will probably need to be in a micro-batch mode rather than one-row-at-a-time-as-it-arrives &#8211; the latencies of adding facts to a conformed OLAP system (database, cubes, whatever) are such that single row additions will just take too much time, even if our target is an in-memory system. Here the art of the designer is to balance the availability of data, the time to reprocess the OLAP structures and the desire to keep the system up to date. It is always worth noting that for many data domains having to-the-second data is not that important, as any new rows are unlikely to change the statistical results; however, some subject domains will need access to all of the recent information, including that which has not yet made it to the data warehouse, and here the creativity comes in.</p>
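<p>The chain described above (analyze each message, alert on significance, then add to the store in micro-batches) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the z-score test, the threshold of three standard deviations and the batch size of three are all hypothetical choices, and a plain Python list stands in for the data warehouse. In this sketch the running statistics are updated as each message arrives, which is one way of letting analysis see facts that have not yet reached the store.</p>

```python
import math
from dataclasses import dataclass, field


@dataclass
class RunningStats:
    """Online mean and variance (Welford's method): a tiny reference model."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def z_score(self, value: float) -> float:
        # With no spread observed yet we cannot judge significance.
        if self.m2 == 0.0:
            return 0.0
        std = math.sqrt(self.m2 / (self.count - 1))
        return abs(value - self.mean) / std


@dataclass
class Pipeline:
    threshold: float = 3.0   # assumed alert threshold, in standard deviations
    batch_size: int = 3      # assumed micro-batch size
    stats: RunningStats = field(default_factory=RunningStats)
    pending: list = field(default_factory=list)
    store: list = field(default_factory=list)    # stand-in for the warehouse
    alerts: list = field(default_factory=list)

    def on_message(self, amount: float) -> None:
        # 1. Analyze the single new fact against what we already know.
        if self.stats.z_score(amount) >= self.threshold:
            self.alerts.append(amount)
        # 2. The fact joins the reference model straight away...
        self.stats.add(amount)
        # 3. ...but reaches the data store only in micro-batches.
        self.pending.append(amount)
        if len(self.pending) == self.batch_size:
            self.flush()

    def flush(self) -> None:
        self.store.extend(self.pending)
        self.pending.clear()


pipeline = Pipeline()
for amount in [20.0, 22.0, 19.0, 21.0, 20.5, 500.0, 21.5]:
    pipeline.on_message(amount)
print(pipeline.alerts, len(pipeline.store), pipeline.pending)
```

<p>Only the 500.0 transaction is flagged, six facts have been flushed to the store in two batches, and the seventh waits for the next batch. A real design would tune the batch trigger (row count, elapsed time or both) against the cost of reprocessing the OLAP structures.</p>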
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2011/11/more-notes-on-right-time-bi/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Oracle Warehouse Builder and Data Integrator</title>
		<link>http://www.rittmanmead.com/2011/10/oracle-warehouse-builder-and-data-integrator/</link>
		<comments>http://www.rittmanmead.com/2011/10/oracle-warehouse-builder-and-data-integrator/#comments</comments>
		<pubDate>Sun, 30 Oct 2011 13:22:29 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Oracle Data Integrator]]></category>
		<category><![CDATA[Oracle Warehouse Builder]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=9043</guid>
		<description><![CDATA[Sometimes, when I am working with customers on data warehousing projects, I am asked questions about Oracle Warehouse Builder and its future. I know no more on this than what I read in Oracle&#8217;s reposted statement of direction from May 2011 and in recent internet postings elsewhere, which state that OWB 11gR2 will be the [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes, when I am working with customers on data warehousing projects, I am asked questions about Oracle Warehouse Builder and its future. I know no more on this than what I read in Oracle&#8217;s <a title="Statement of Direction" href="http://www.oracle.com/technetwork/middleware/data-integrator/overview/sod-1-134268.pdf" target="_blank">reposted</a> statement of direction from May 2011 and in recent internet postings elsewhere, which state that OWB 11gR2 will be the final release, although it will be patched to work with the Oracle 12 database when that comes along. To me this means that for existing OWB projects there is no hurry to migrate to ODI &#8211; Oracle have signaled in their statement of direction that a future ODI release will help smooth the migration path. However, I think that for new projects ODI should be considered as first choice &#8211; unless you only require the basic OWB functionality that is included with the Oracle database&#8217;s license, and even then I would be tempted to look at the advantages of using the enterprise-quality features you gain with the purchase of ODI.</p>
<p>One question that often comes up is &#8220;How is OWB different from ODI, after all they both do E-LT?&#8221; I have written a small series of blogs to be published over the next few months that look at this subject from the point of view of an OWB developer moving to ODI.</p>
<p>To start things off, here is the first of the series, where I look at OWB and ODI in high-level terms and point out some of the key differences and similarities. I will be considering the two current releases (OWB 11.2 and ODI 11.1.1.5). Later blogs will look in more detail at the actual development of ETL processes and how to orchestrate them.</p>
<p>Both ODI and OWB have a similar (I am being very simplistic here) three-component design of: a metadata repository, a development environment where the developer defines the processes and data flows and a runtime component that executes the code and flows. It is the &#8220;how&#8221; of these things that is different for the two tools.</p>
<p>Both are repository driven; that is, the metadata that describes the ELT processes, the data structures being accessed and a host of other things is held in a database schema. For OWB the repository is pre-installed (the user needs to create a workspace though) in an Oracle 11gR2 database; optionally, the OWB repository can be installed into another Oracle database if required. ODI&#8217;s repository is installed using Oracle Fusion Middleware&#8217;s Repository Creation Utility into a supported (and not necessarily Oracle) database. With ODI, the repository can be shared with other components that use the Fusion Middleware stack such as OBIEE 11g; whether this is desirable would depend on your circumstances and factors such as your organization&#8217;s software release process and network topology &#8211; just because it is possible to have all on one database does not make it desirable.</p>
<p>Cosmetically, there is a lot of similarity between the two development environments; they are both part of the same unified family of Java IDE applications as JDeveloper and SQL Developer. The look and feel is similar; for example, double-clicking on a tab has the same effect (it toggles the tab&#8217;s panel between full-sized and windowed). What is different, however, is the content of the windows and navigators, and that is a big topic for later postings.</p>
<p>In practice, with OWB the key parts of the IDE are those for the development of MAPPINGS and (optionally) the design of process flows to orchestrate mappings. In the ODI world think INTERFACES for mappings and PACKAGES for process flows. This is simplistic though, as ODI also has PROCEDURES (code developed in one of the ODI supported languages) and LOAD PLANS (multiple packages orchestrated to execute in serial or parallel). OWB mappings require the developer to include all of the components needed to facilitate the mapping &#8211; we connect source columns to target columns through a logic flow of joiners, filters, expressions, aggregates and a whole palette of other activities. Typically, this would generate a single, but large, SQL statement with much use of in-line views. ODI interfaces are simply about connecting source columns to target columns in a logical relationship (we also create expressions, joins and filters here) and allowing the physical implementation to be supplied by a knowledge module.</p>
<p>In its most common usage mode, OWB deploys its executable code into PL/SQL packages in the target database. Even pure SQL set-based insert code is wrapped into a package that contains the control and audit methods that allow it to execute under the control of the Control Center and the OWB runtime. The code generated by ODI depends on the knowledge modules used and might be native SQL which is executed directly against the target database by the Java agent executing the code. Again this is a big topic and more will follow in later blogs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2011/10/oracle-warehouse-builder-and-data-integrator/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Geography Hierarchies</title>
		<link>http://www.rittmanmead.com/2011/05/geography-hierarchies/</link>
		<comments>http://www.rittmanmead.com/2011/05/geography-hierarchies/#comments</comments>
		<pubDate>Tue, 24 May 2011 17:28:22 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Dimensional Modelling]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2011/05/geography-hierarchies/</guid>
		<description><![CDATA[I have been thinking about addresses a lot recently; in part it was moving house and all of the 1001 people that need to be notified. In the main, though, it was thoughts inspired by a data warehouse project I am working on. For this DWH people are geo-located by their street address, but for [...]]]></description>
			<content:encoded><![CDATA[<p>I have been thinking about addresses a lot recently; in part it was moving house and all of the 1001 people that need to be notified. In the main, though, it was thoughts inspired by a data warehouse project I am working on. For this DWH people are geo-located by their street address, but for most reporting we are only concerned with a grain of city. This all sounds so simple, but how do we build a hierarchy from address, the line between the street where you live and the planet you live on? I remember as a child thinking it cool to address a letter to 23 Railway Cuttings, East Cheam then adding Surrey, England, Europe, The World, and however far I could get through navigating the solar system and the universe. To a child the hierarchy of address is relatively straightforward. But in data warehouse modelling things are not quite so simple.</p>
<p>Take the postal code (or zip code): where does it fit in the hierarchy? Well, the answer is that it might not fit at all. Postal codes were developed to help post offices deliver mail &#8211; and each postal authority did their own thing. The UK and the Netherlands have postal code systems that can identify a single street or even a cluster of houses within a street. Other countries work on a code per town or group of nearby towns &#8211; so straight away we have a difference in grain: a few houses in the UK, a few towns in France. In Germany postal codes relate to geographic areas but those areas are not aligned to the Bundesländer; on the other hand, France ties postal code to Department but there are anomalies, notably where a river runs through a village and opposite banks share a postal code but are in different Departments (and in one case, different regions). Some national postal codes are numeric, some are alphanumeric (like Canadian and UK ones), and the length of the postcode varies between countries too.</p>
<p>Perhaps the sensible thing, especially if you are dealing with addresses from multiple countries, is to not use postal code as a level in geographical hierarchies. If you use them at all just make them an attribute of the address and remember that they don&#8217;t always have geographical parents.</p>
<p>I think the key point about modelling geography is that just because you know how addresses work in your own country you can&#8217;t assume that they work like that in the country next door. If you have a requirement to report, for example, the efficiency of the postal service in delivering your goods by postal region you need to ensure that your reporting handles the anomalies and exceptions. As always, knowing your data is key to creating a correct model.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2011/05/geography-hierarchies/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Not The Only 11.1.1.5.0 In Town</title>
		<link>http://www.rittmanmead.com/2011/05/not-the-only-11-1-1-5-0-in-town/</link>
		<comments>http://www.rittmanmead.com/2011/05/not-the-only-11-1-1-5-0-in-town/#comments</comments>
		<pubDate>Mon, 23 May 2011 13:39:17 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Oracle Data Integrator]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=8235</guid>
		<description><![CDATA[Just a short posting to mention that earlier this month Oracle released version 11.1.1.5.0 of Data Integrator. Look out for more Rittman Mead postings on ODI over the coming months.]]></description>
			<content:encoded><![CDATA[<p>Just a short posting to mention that earlier this month Oracle released version 11.1.1.5.0 of Data Integrator.</p>
<p>Look out for more Rittman Mead postings on ODI over the coming months.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2011/05/not-the-only-11-1-1-5-0-in-town/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Moving Data</title>
		<link>http://www.rittmanmead.com/2011/02/moving-data/</link>
		<comments>http://www.rittmanmead.com/2011/02/moving-data/#comments</comments>
		<pubDate>Sat, 19 Feb 2011 16:37:17 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Oracle Data Integrator]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2011/02/moving-data/</guid>
		<description><![CDATA[I just looked at the Rittman Mead Blog archive and saw the last post that was written by me was back in August 2010! Where has the time gone? Some of the time has been spent preparing and giving conference and user group talks, (as I write this I am just finishing off a new [...]]]></description>
			<content:encoded><![CDATA[<p>I just looked at the Rittman Mead Blog archive and saw the last post that was written by me was back in August 2010! Where has the time gone? Some of the time has been spent preparing and giving conference and user group talks (as I write this I am just finishing off a new talk on external data sources for Collaborate 11), but the majority of my time has been consumed working on two large and very different ETL projects: developing a new data warehouse over JD Edwards for an international distribution company (using Oracle Warehouse Builder) and my current project to move large amounts of data around a company using Oracle Data Integrator with the odd touch of GoldenGate replication.</p>
<p>The more that I use ODI the more I like it, especially the recent 11g version with its ability to build source data sets within the GUI using set operators such as UNION and MINUS. Developing ETL processes can be as simple as connecting columns on the mapping tab and then specifying any knowledge modules needed to load, validate and store the data. However, the real challenge for me (and I love to be challenged) comes from making the generated code do exactly what I want it to do and in this ODI gives me immense flexibility. I can specify where individual transformations occur (source, stage, target or even a mix of all three), I am able to specify which columns are updated and which are inserted in an upsert type of incremental load, and, probably most significantly, I can adapt knowledge modules to do exactly what I want them to do.</p>
<p>Sometimes a change to a knowledge module can be as trivial as altering the name of the Oracle directory created for loading files as Oracle External Tables &#8211; this would be important if we need to process multiple external table locations within a single database schema. Other times we need to make more substantial alterations to the load, or even repurpose it. For example, the Oracle MERGE incremental load includes a step using SQL MINUS to detect changes between source and target; we could change the table order in the MINUS operator so that we look for rows in the target that don&#8217;t exist in the source and thus build a knowledge module to do a &#8220;logical delete&#8221;.</p>
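<p>To illustrate the effect of swapping the operands, here is a small sketch that runs against an in-memory SQLite database rather than Oracle. SQLite spells the operator EXCEPT where Oracle uses MINUS, and the table names, column names and the deleted_flag column are invented for the example; this is not the code an ODI knowledge module generates.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tgt (id INTEGER PRIMARY KEY, name TEXT,
                      deleted_flag INTEGER DEFAULT 0);
    INSERT INTO src VALUES (1, 'alpha'), (2, 'beta-v2'), (4, 'delta');
    INSERT INTO tgt (id, name) VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
""")

# Source EXCEPT target: rows that are new or changed in the source,
# which is the direction an incremental load is interested in.
changed = conn.execute(
    "SELECT id, name FROM src EXCEPT SELECT id, name FROM tgt ORDER BY id"
).fetchall()

# Operands swapped (keys only): rows still in the target that have
# disappeared from the source.
missing = conn.execute(
    "SELECT id FROM tgt EXCEPT SELECT id FROM src ORDER BY id"
).fetchall()

# Logical delete: flag those rows instead of removing them.
conn.execute(
    "UPDATE tgt SET deleted_flag = 1 "
    "WHERE id IN (SELECT id FROM tgt EXCEPT SELECT id FROM src)"
)
flagged = conn.execute(
    "SELECT id FROM tgt WHERE deleted_flag = 1 ORDER BY id"
).fetchall()

print(changed, missing, flagged)
```

<p>Comparing on the key alone in the swapped query matters: if every column were compared, a row that had merely changed in the source would also look as though it had been deleted.</p>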
<p>Which for me means that ODI is not a cookie-cutter tool (though some people treat it so) but one that allows me to use my skill and judgement to do ETL in the way that I think is most appropriate for the dataset.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2011/02/moving-data/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>More on Interval Partitioning</title>
		<link>http://www.rittmanmead.com/2010/08/more-on-interval-partitioning/</link>
		<comments>http://www.rittmanmead.com/2010/08/more-on-interval-partitioning/#comments</comments>
		<pubDate>Sat, 07 Aug 2010 12:34:24 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=5214</guid>
		<description><![CDATA[As I mentioned in the past, I am a great fan of Oracle bitmap indexes; they allow the database to do some really good optimizations and reduce the impact of typical data warehouse queries that need to filter and fetch large chunks of fact data. The flip side to their success in queries is that [...]]]></description>
			<content:encoded><![CDATA[<p>As I mentioned in the past, I am a great fan of Oracle bitmap indexes; they allow the database to do some really good optimizations and reduce the impact of typical data warehouse queries that need to filter and fetch large chunks of fact data. The flip side to their success in queries is that they can slow the ETL data load process as we update or create individual index entries that refer to many rows of data. The traditional way to deal with this is to set the index to be unusable before loading and rebuild it after the ETL completes; for partitioned indexes (and if the indexed table is partitioned then any bitmap indexes must be locally partitioned) you can do this at a partition level and just rebuild the unusable partitions.</p>
<p>Typically, I set indexes to be unusable by issuing an <em>ALTER INDEX my_bitmap_index UNUSABLE</em> and, for partitioned tables, <em>ALTER INDEX my_bitmap_index MODIFY PARTITION aug_10_part UNUSABLE</em>. Here I use a &#8220;smart partition name&#8221; that allows me to programmatically determine the correct index partitions to manipulate &#8211; using the HIGH_VALUE of the partition from DBA_IND_PARTITIONS is problematic as it is a LONG data type and thus a tad convoluted to query (see <a href="http://www.oracle-developer.net/display.php?id=430" target="_blank">Adrian Billington&#8217;s note on working with LONG columns</a>).</p>
<p>However, when we use Oracle 11g interval partitioning the partition names are system generated, so how do we find the index&#8217;s partitions that need to be set unusable? There is no &#8220;<em>PARTITION FOR</em>&#8221; construct to alter an index&#8217;s partitions to be unusable, and the name of the index partition is not inherited from the table &#8211; it gets its own system-generated name. There is still a way to do this by working through the table instead; assuming that my_table is range partitioned by DATE interval: <em>ALTER TABLE my_table MODIFY PARTITION FOR (TO_DATE('1-Aug-2010','dd-mon-yyyy')) UNUSABLE LOCAL INDEXES</em><br />
Before interval partitions I rebuilt unusable index partitions by writing a procedure to loop through all of the index partitions marked as unusable in the data dictionary and issue an ALTER INDEX &#8230; REBUILD PARTITION command &#8211; here I can easily find the partition name to rebuild as the unusable marker is not held in a LONG. There is also an Oracle package (DBMS_PCLXUTIL) that can do this. Now, with Oracle 11g, maybe the simplest thing is to use <em>ALTER TABLE my_table MODIFY PARTITION FOR (TO_DATE('1-Aug-2010','dd-mon-yyyy')) REBUILD UNUSABLE LOCAL INDEXES</em>.</p>
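<p>In an ETL job the two statements would normally bracket the load for whichever date is being processed. The helper below is a hypothetical sketch that only builds the statement text for a given load date; it does not connect to a database, and the function names are mine. It uses a numeric date format so that the generated SQL does not depend on the session&#8217;s language for month names.</p>

```python
from datetime import date


def partition_for_clause(day: date) -> str:
    """Build the PARTITION FOR (...) clause for a DATE-interval partitioned table."""
    literal = day.strftime("%Y-%m-%d")
    return f"PARTITION FOR (TO_DATE('{literal}', 'YYYY-MM-DD'))"


def local_index_ddl(table: str, day: date) -> tuple:
    """Return (pre-load, post-load) statements for the partition holding `day`.

    The table name is interpolated as-is, so it is assumed to come from
    trusted ETL metadata rather than from user input.
    """
    clause = partition_for_clause(day)
    disable = f"ALTER TABLE {table} MODIFY {clause} UNUSABLE LOCAL INDEXES"
    rebuild = f"ALTER TABLE {table} MODIFY {clause} REBUILD UNUSABLE LOCAL INDEXES"
    return disable, rebuild


before_load, after_load = local_index_ddl("my_table", date(2010, 8, 1))
print(before_load)
print(after_load)
```

<p>The first statement would be issued before the load and the second after it completes, using whatever database client the ETL tool provides.</p>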
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2010/08/more-on-interval-partitioning/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Realtime Data Warehouse Challenges – Part 2</title>
		<link>http://www.rittmanmead.com/2010/06/realtime-data-warehouse-challenges-%e2%80%93-part-2/</link>
		<comments>http://www.rittmanmead.com/2010/06/realtime-data-warehouse-challenges-%e2%80%93-part-2/#comments</comments>
		<pubDate>Sun, 27 Jun 2010 17:04:27 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Warehousing]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=4987</guid>
		<description><![CDATA[Last time I mentioned some of the challenges of taking realtime feeds and publishing them into a data warehouse. This time I am going to propose a way to meet those challenges. But before that I will take a small detour around what Oracle refers to as their Reference Data Warehouse and Business Intelligence Architecture. Here [...]]]></description>
			<content:encoded><![CDATA[<p>Last time I <a href="http://www.rittmanmead.com/2010/05/27/realtime-data-warehouse-challenges-part-1/" target="_blank">mentioned</a> some of the challenges of taking realtime feeds and publishing them into a data warehouse. This time I am going to propose a way to meet those challenges.</p>
<p>But before that I will take a small detour around what Oracle refers to as their Reference Data Warehouse and Business Intelligence Architecture. Here we divide the data warehouse into three &#8220;sections&#8221;. The first is &#8220;staging&#8221;, which is a local copy of data from the source systems. The second is &#8220;foundation&#8221;, which is typically a &#8220;process neutral&#8221; 3NF representation of the business data such as &#8220;customer&#8221;, &#8220;product&#8221; and &#8220;orders&#8221;. It is likely to be different in structure from the staged tables in that it could well be the merging of data from multiple sources; for example, customer attributes could come from both CRM and ERP systems. This foundation layer is likely to be versioned (that is, whenever a dimensional attribute changes a new, current version is created) and non-aggregated. The data in this layer are our BI jewels; we don&#8217;t know what future analysis and data mining needs will be and by aggregating things we lose the flexibility we might need; remember there is no UNGROUP BY clause in SQL. The third tier is the performance and access layer, where we typically present optimized table structures to the query tools; it&#8217;s here that we have the aggregated tables, bitmap indexes and all of those other &#8216;traditional&#8217; data warehouse features. This is not really a revolutionary (or even new) architecture &#8211; I have been doing similar things in my data warehouse design since the 1990s.</p>
<p>One of the key things to note is that the staging tables for the dimensional data should be complete replicas of the source tables and not just a set of extracted rows provided by the source data owner. Here is the ideal place to bring in replication technology and hence the beginnings of a real time data warehouse. Facts (or more accurately in this case &#8220;events&#8221;) in the staging area are not going to be full replicas of the fact source but rather all of the events that have occurred since the last load; again, this can be achieved by realtime replication. Remember we only need to replicate the tables of interest and not all of the structures of the source applications. At first glance it might seem extravagant to have effectively three copies of dimensional data (one in each layer) and two of facts &#8211; but these days disk is cheap and it is also (tongue firmly in cheek) a good way to use some more of the disk space you had to buy to get the required data throughput.</p>
<p>Acquiring data in realtime is not going to be our problem, and if we can use the stage tables directly for reporting then we can say &#8220;job done&#8221; and not worry more. Our problems arise if we need to do significant work on the staged data to report over it. We might see problems with data quality, surrogate key management for dimensions, particularly slowly changing dimensions, and the need to aggregate facts to improve performance of the query tool.</p>
<p>I am not going to get into the debate on what to do about data quality, I have blogged about that in the <a href="http://www.rittmanmead.com/category/data-quality/" target="_blank">past</a>. The only thing I will say though is that the resolution to data quality problems should be in the source system(s) &#8211; data warehouses should report the same data as used in the transactional systems and if that requires a master data management program then so be it.</p>
<p>I suspect that this next point might be considered heresy, but if you have immutable business keys for your dimensional data (perhaps from a master data management system) then consider using them in the data warehouse &#8211; this will reduce the complexity of the ETL processes needed to push data from stage to the data warehouse; it might also remove a time dependency of pushing dimensional data through to the foundation and access layers in realtime. The need to track slowly changing type 2 dimensions (where we keep a history of change) might force the use of surrogate keys, but other approaches are possible that might avoid the need for surrogate keys being used on the fact tables. One approach is to split the non-volatile and SCD-1 attributes from the versioned (SCD-2) attributes and store the dimension in two tables, with the first table joined to the FACT table (on the business key) and the versioned table of SCD-2 attributes joined to the first dimension table on business key; queries against the second table will need to also pass a date so the correct version is selected, but this is not hard to achieve with most query tools. By far the easiest thing to do, though, is to avoid SCD-2 altogether; many organizations think they need to implement SCD-2, but when they come to use the system they find that SCD-1 actually fits the reporting requirements of the vast majority of their users.</p>
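<p>The two-table split can be sketched as below, again using an in-memory SQLite database purely for illustration. All table and column names, the validity-range columns and the sample rows are assumptions made for the example. Here the date passed to the versioned table is the sale date, giving the &#8220;as it was at the time&#8221; view; passing a fixed reporting date would work the same way.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Non-volatile and SCD-1 attributes: one row per business key.
    CREATE TABLE customer (customer_code TEXT PRIMARY KEY, name TEXT);

    -- Versioned (SCD-2) attributes: one row per key per validity period.
    CREATE TABLE customer_version (
        customer_code TEXT, segment TEXT, valid_from TEXT, valid_to TEXT
    );

    -- The fact table carries the business key, not a surrogate key.
    CREATE TABLE sales (customer_code TEXT, sale_date TEXT, amount REAL);

    INSERT INTO customer VALUES ('C1', 'Acme Ltd');
    INSERT INTO customer_version VALUES
        ('C1', 'Bronze', '2010-01-01', '2010-06-30'),
        ('C1', 'Gold',   '2010-07-01', '9999-12-31');
    INSERT INTO sales VALUES
        ('C1', '2010-03-15', 100.0),
        ('C1', '2010-08-02', 250.0);
""")

# Fact joins to the first dimension table on the business key; the versioned
# table joins to the first table on the same key plus a date range.
rows = conn.execute("""
    SELECT s.sale_date, c.name, v.segment, s.amount
    FROM sales s
    JOIN customer c ON c.customer_code = s.customer_code
    JOIN customer_version v ON v.customer_code = c.customer_code
         AND s.sale_date BETWEEN v.valid_from AND v.valid_to
    ORDER BY s.sale_date
""").fetchall()

for row in rows:
    print(row)
```

<p>The March sale reports against the Bronze segment and the August sale against Gold, without any surrogate key on the fact table. Dates are stored as ISO-format text here only because SQLite has no DATE type; an Oracle implementation would use real DATE columns.</p>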
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2010/06/realtime-data-warehouse-challenges-%e2%80%93-part-2/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Realtime Data Warehouse Challenges &#8211; Part 1</title>
		<link>http://www.rittmanmead.com/2010/05/realtime-data-warehouse-challenges-part-1/</link>
		<comments>http://www.rittmanmead.com/2010/05/realtime-data-warehouse-challenges-part-1/#comments</comments>
		<pubDate>Thu, 27 May 2010 07:00:29 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Exadata]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=4868</guid>
		<description><![CDATA[In a previous part of my notes on Realtime Data Warehousing I mentioned some of the challenges of reducing latency. The piece picked up quite a few comments &#8211; to which I say thanks to all that posted responses. One of the comments from Matt Hosking mentioned some of the points I was to raise [...]]]></description>
			<content:encoded><![CDATA[<p>In a previous part of my notes on Realtime Data Warehousing I mentioned some of the challenges of reducing latency. The piece picked up quite a few comments &#8211; to which I say thanks to all that posted responses. One of the comments from Matt Hosking mentioned some of the points I was to raise in this posting.</p>
<p>If you ask someone on the outside of developing a (near) realtime data warehouse what the greatest challenge will be they probably would say &#8220;capturing the change&#8221; since they know we can already &#8220;do&#8221; data warehouses. I think that is wrong, capturing change is easy; the big problem is applying that change in a timely fashion to a data warehouse that also remains available for query. Adding relatively few rows of new fact to a table is trivial compared to the actions needed to validate, transform, apply keys, index, and publish the fact; and then think about the impact of merging that new fact into existing aggregate tables or materialized views. A lot of moving parts, a lot of challenge.</p>
<p>Realistically, we could populate an &#8220;atomic data store style&#8221; layer in realtime with what is in effect a versioned (timestamped, journalized or however you term it) replica of the source. Such a replica is probably suited to realtime reporting, but what we don&#8217;t get are the features we have come to expect in a traditional star schema DW. We possibly miss out on: data validation through the ETL process, data enrichment and derived measures, conformed dimensions, and slowly changing dimensions (especially type 2 SCD) through surrogate keys. It may well be that you don&#8217;t actually need a star model; after all, one of the viable DW models for an Exadata warehouse is just that: a bunch of conventional tables joined on the natural business keys.</p>
<p>Another point to consider is that it is quite unlikely that all of the fact domains in a data warehouse need to be realtime ones; for example, data sourced from a supplier&#8217;s EDI feed may arrive far less frequently than, say, sales transactions from the company&#8217;s web-store. Obviously, if we have a realtime feed of sales, we must ensure we have all of the dimensional (reference) data loaded before a new transaction arrives, or else develop robust ways to handle its absence. This is a situation where we need business knowledge: if a new customer can be created at the time of purchase (as is often the case for a web sale) we will need a realtime customer feed along with the realtime sales feed, but for banks with strict money laundering regulations customers are registered well before transactions occur, so a timely load of customer is likely to be sufficient.</p>
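<p>One possible way to cope with a sale that arrives before its customer record is to insert a placeholder dimension row and correct it when the real customer data turns up. The sketch below is only an illustration of that idea, not a description of any particular product; the function name, the dictionary standing in for the customer dimension and the way the surrogate key is generated are all assumptions.</p>

```python
def surrogate_key_for(customer_dim, natural_key):
    """Return the surrogate key for a customer, adding a placeholder row if it is missing.

    customer_dim is a hypothetical in-memory stand-in for the customer
    dimension table, keyed on the natural (business) key.
    """
    row = customer_dim.get(natural_key)
    if row is None:
        # Assumption: a real warehouse would take the key from a sequence.
        row = {
            "surrogate_key": len(customer_dim) + 1,
            "name": "UNKNOWN",
            "inferred": True,  # flag the row so a later customer load can fix it
        }
        customer_dim[natural_key] = row
    return row["surrogate_key"]


# A sale for customer C2 arrives before the customer feed has delivered C2.
customer_dim = {"C1": {"surrogate_key": 1, "name": "Acme", "inferred": False}}
known_key = surrogate_key_for(customer_dim, "C1")
placeholder_key = surrogate_key_for(customer_dim, "C2")
```

<p>The fact row can then be loaded against the placeholder key straight away, and the customer attributes are filled in when the slower feed catches up.</p>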
<p>Not only is it unlikely that all data feeds to a data warehouse need be realtime, it is quite likely that for some &#8220;facts&#8221; only some of the measures are realtime measures. Consider sales: we know the quantity and the price charged to the customer at the time of the sale, but we may well not know the cost of goods until the time the order is fulfilled.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2010/05/realtime-data-warehouse-challenges-part-1/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Realtime Data Warehouse Loading</title>
		<link>http://www.rittmanmead.com/2010/05/realtime-data-warehouse-loading/</link>
		<comments>http://www.rittmanmead.com/2010/05/realtime-data-warehouse-loading/#comments</comments>
		<pubDate>Thu, 06 May 2010 12:36:59 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=4789</guid>
		<description><![CDATA[Last time I wrote about the use of replication to provide a source for realtime BI. This time I am going to look at putting data in to a data warehouse in &#8220;realtime&#8221;. Note the quotes, when we speak of realtime there is always some degree of latency: the time from the source transaction to [...]]]></description>
			<content:encoded><![CDATA[<p>Last time I wrote about the use of replication to provide a source for realtime BI. This time I am going to look at putting data into a data warehouse in &#8220;realtime&#8221;. Note the quotes; when we speak of realtime there is always some degree of latency: the time from the source transaction to its committal on the source database, the time to notice this change, the time to propagate it to the target database. And then there is the time to process the newly arrived change: loading, validating, transforming, publishing, and perhaps aggregating; oh, and then the time to actually query the data and to react to the query. We can strive to reduce this time, but we can never totally eradicate it as the laws of physics are not on our side. Latency reduction also comes at a cost, and it is going to be a business call between the value of knowing something promptly and the cost of knowing it.</p>
<p>Non-realtime data warehouses often use a periodic batch data load paradigm; once a month, a week, a day, or whenever, we execute a batch process that extracts data from the source, identifies what has changed and applies that to the data warehouse. But what if we modify this batch to run much more frequently, say half-hourly? We are moving towards realtime. Loading data more frequently should reduce the volume of data in the individual loads, and less data should equate to reduced batch times (but this is not likely to be a linear reduction in batch duration). But we are imposing some new challenges on the source extract. We will have frequently running queries that access multiple source rows, and this will have an impact on source performance. We will also probably need to modify the extract code to robustly identify the contents of the extract windows; that is, we must not miss or duplicate data. In addition we are imposing an impact on the target: we need a good method to publish the received data to the data warehouse without adversely affecting the query workload on the data warehouse, and we must ensure that our micro-batch load-to-publish time is substantially shorter than the interval between the micro-batches.</p>
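<p>To illustrate the point about extract windows, here is a minimal sketch that assumes each source row carries a reliable change timestamp. The function names, the changed_at field and the stored high-water mark are all hypothetical. Each window is half-open, so a row stamped exactly on a boundary belongs to one window only.</p>

```python
from datetime import datetime, timedelta


def extract_windows(high_water_mark, now, interval):
    """Yield consecutive half-open (start, end) windows between the mark and now."""
    start = high_water_mark
    while start + interval <= now:
        end = start + interval
        yield (start, end)
        start = end  # the next window starts exactly where this one ended


def rows_in_window(rows, start, end):
    """Return rows changed at or after start and strictly before end."""
    return [r for r in rows if start <= r["changed_at"] < end]


# Two half-hourly micro-batches over three hypothetical source rows.
t0 = datetime(2010, 5, 6, 12, 0)
source_rows = [
    {"id": 1, "changed_at": t0},
    {"id": 2, "changed_at": t0 + timedelta(minutes=30)},  # exactly on a boundary
    {"id": 3, "changed_at": t0 + timedelta(minutes=59)},
]
batches = [
    rows_in_window(source_rows, start, end)
    for start, end in extract_windows(t0, t0 + timedelta(minutes=60), timedelta(minutes=30))
]
```

<p>Bear in mind that a row can be committed some time after the timestamp it carries, so a wall-clock window like this can still miss late commits; a marker based on commit order is usually a safer boundary.</p>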
<p>However, say &#8220;realtime data warehousing&#8221; to most people and they think of continuous data capture, possibly through an extract built on streams or SOA messaging from the source, but more than likely through synchronous or asynchronous change data capture using database triggers or redo logs. Again there is going to be some degree of latency between the event and the data being replicated to the target. But now we have a design choice on the target. Do we consume the captured changes as a continuous trickle-feed process? Or do we run a series of micro-batches to consume the data? By necessity, true trickle-feed will move us into row-by-row processing and possibly a significant impact on processes that need to aggregate data. I feel that most continuously-fed data warehouses will use some micro-batch for the majority of the DW transform and publish process, even if trickle-feed processing is used to populate a non-aggregated ODS style layer for the special cases when people need to see &#8220;now data&#8221;.</p>
<p>As I mentioned in a previous blog, captured commit-based change can generate a lot of &#8220;noise&#8221;: commits associated with no data change, multiple updates to the same row, changes to columns of no interest to the DW system. How we choose to handle this in our load procedures will depend on what the business needs to see: the final status for a row within the batch, clusters of row changes within a short period (such as a few seconds) treated as a single change, or all changes applied in the order they happened.</p>
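<p>As a rough sketch of the first of those options (final status for a row within the batch), the following collapses a micro-batch of captured changes to one net change per key and drops changes that touch only columns of no interest. The record layout and names are assumptions made for the example, not the format of any particular change data capture tool.</p>

```python
def final_state_per_row(changes, columns_of_interest):
    """Collapse captured changes, in commit order, to one net change per key."""
    latest = {}
    for change in changes:
        # Keep only the columns the warehouse cares about.
        relevant = {
            column: value
            for column, value in change["values"].items()
            if column in columns_of_interest
        }
        if not relevant:
            continue  # noise: an empty commit or a change to ignored columns
        # Later changes overwrite earlier values for the same key.
        latest.setdefault(change["key"], {}).update(relevant)
    return latest


# A hypothetical micro-batch: order 1 is updated three times, order 2 has no data change.
captured = [
    {"key": 1, "values": {"status": "NEW", "audit_user": "a"}},
    {"key": 1, "values": {"audit_user": "b"}},
    {"key": 1, "values": {"status": "SHIPPED"}},
    {"key": 2, "values": {}},
]
net_changes = final_state_per_row(captured, {"status"})
```

<p>Applying every change in order, or grouping changes by a short time interval, would need the individual change records kept rather than collapsed like this.</p>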
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2010/05/realtime-data-warehouse-loading/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Realtime BI Is Not Realtime DW</title>
		<link>http://www.rittmanmead.com/2010/04/realtime-bi-is-not-realtime-dw/</link>
		<comments>http://www.rittmanmead.com/2010/04/realtime-bi-is-not-realtime-dw/#comments</comments>
		<pubDate>Sat, 24 Apr 2010 11:22:55 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Warehousing]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=4712</guid>
		<description><![CDATA[This posting is slightly edited to clarify that the majority of  the post is about using transactional or transactionally structured sources for reporting. One of the points I make early on in my talk &#8220;Getting Real Data Warehouse Loading as it Happens&#8221; is that realtime business intelligence is not the same as realtime data warehousing; [...]]]></description>
			<content:encoded><![CDATA[<p><em>This posting is slightly edited to clarify that the majority of the post is about using transactional or transactionally structured sources for reporting.</em></p>
<p>One of the points I make early on in my talk &#8220;Getting Real Data Warehouse Loading as it Happens&#8221; is that realtime business intelligence is not the same as realtime data warehousing; having one does not imply having the other.</p>
<p>People can, and do, direct their BI reporting at the source transactional system or systems and for these people there is no data warehouse involved; conversely others populate their data warehouse continuously through a realtime trickle-feed since that is the only viable way to load their data but choose to keep reporting aligned to an event in the past (such as midnight yesterday) so that users do not see changing data throughout the day.</p>
<p>So if it is possible to do realtime BI against the source systems, why have a data warehouse? In my opinion the data warehouse is doing three things: it is protecting the transactional source from excessive, potentially sub-optimal queries; it is acting as a platform to conform data from multiple source systems; and it allows the use of slowly changing dimensions.</p>
<p>Assuming we need realtime BI but do not want a realtime data warehouse, we will need to mix transactional and reporting functionalities on the same system. Although it could be argued that a system such as Oracle&#8217;s Exadata V2 allows mixed reporting / transactional workloads, few people have implemented Exadata yet; for many people it therefore means that they are mixing two distinctly different workloads on a platform designed for one workload. DML operations that are designed to insert or update a single row as quickly as possible are mixed with queries that access a great number of rows based on a combination of predicates that may well not be indexed on the OLTP system. Furthermore, some of the database features that assist performance in a data warehouse, such as bitmap indexes and parallel query, can actually inhibit performance of transactional systems. The fundamental problem of mixing workloads is that if a resource is busy doing one thing it is not available to do something else, and performance for all users suffers.</p>
<p>There are some things that we can do to reduce the BI query impact on the source. We can use data federation in a reporting application such as OBIEE and restrict the transactional source to current data and use a separate data warehouse, perhaps with pre-aggregated data, to supply historic information; that is, we remove the need to repeatedly scan the whole of the transactional system to provide historic data that is no longer changing, and by reducing the volume of data we are accessing we lessen the conflict with transactional activity. Of course we will still need a process that updates the data warehouse, but that can now be deferred to a later time. However, federation has its own drawbacks; where we once had one query we now have two, and the results of these two queries need to be stitched together. That process has to happen on the BI server, where it consumes memory and processing resource to join the result sets.</p>
<p>Another thing we could do is to use a replica database as the source of reporting, thus reducing the impact on the transactional source (but not fully removing it, as any form of replication has some impact since it will involve some process executing on or against the source). We could use a standby instance of the OLTP system for reporting, but we need to consider the licensing requirements of the replica, the location of the replica (it may be at a different site) and what happens if we need to use the standby system for its real purpose, business continuity &#8211; do we lose BI when that happens? Perhaps a better approach is to build a specific reporting replica. This replica can be the right size and have the appropriate product licensing for a reporting load. It does not have to be a complete replica of the source system, just the tables of interest; if we have no reporting interest in audit and journal tables then we need not replicate them.</p>
<p>In fact we can get quite sophisticated with a reporting replica in that we change the database so that it is better geared for reporting; this might include dropping unhelpful indexes and possibly partitioning the data to allow partition pruning to occur. Adding bitmap indexes may be a step too far, though: we are essentially receiving row-by-row updates and inserts, and maintaining the bitmaps as they change locks many rows at once (in effect multi-row locking rather than true table locking), which lowers throughput.</p>
<p>We can even combine the federated approach with a database replica and use the replica as the source for the recent transactions and the data warehouse for the historic data; and if we choose to implement our replica on a fast database, perhaps an in-memory one, we could get excellent performance with low impact on the transactional source.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2010/04/realtime-bi-is-not-realtime-dw/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

