<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rittman Mead Consulting &#187; Data Quality</title>
	<atom:link href="http://www.rittmanmead.com/category/data-quality/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rittmanmead.com</link>
	<description>Delivering Oracle Business Intelligence</description>
	<lastBuildDate>Mon, 06 Feb 2012 21:18:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Geography Hierarchies</title>
		<link>http://www.rittmanmead.com/2011/05/geography-hierarchies/</link>
		<comments>http://www.rittmanmead.com/2011/05/geography-hierarchies/#comments</comments>
		<pubDate>Tue, 24 May 2011 17:28:22 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Dimensional Modelling]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2011/05/geography-hierarchies/</guid>
		<description><![CDATA[I have been thinking about addresses a lot recently; in part it was moving house and all of the 1001 people that need to be notified. In the main, though, it was thoughts inspired by a data warehouse project I am working on. For this DWH people are geo-located by their street address, but for [...]]]></description>
			<content:encoded><![CDATA[<p>I have been thinking about addresses a lot recently; in part it was moving house and all of the 1001 people that need to be notified. In the main, though, it was thoughts inspired by a data warehouse project I am working on. For this DWH people are geo-located by their street address, but for most reporting we are only concerned with a grain of city. This all sounds so simple, but how do we build a hierarchy from an address, the line between the street where you live and the planet you live on? I remember as a child thinking it cool to address a letter to 23 Railway Cuttings, East Cheam then adding Surrey, England, Europe, The World, and however far I could get through navigating the solar system and the universe. To a child the hierarchy of address is relatively straightforward. But in data warehouse modelling things are not quite so simple.</p>
<p>Take the postal code (or zip code): where does it fit in the hierarchy? Well, the answer is that it might not fit at all. Postal codes were developed to help post offices deliver mail &#8211; and each postal authority did their own thing. The UK and the Netherlands have postal code systems that can identify a single street or even a cluster of houses within a street. Other countries work on a code per town or group of nearby towns &#8211; so straight away we have a difference in grain: a few houses in the UK, a few towns in France. In Germany postal codes relate to geographic areas but those areas are not aligned to the Bundesländer; on the other hand, France ties postal code to Department, but there are anomalies, notably where a river runs through a village and opposite banks share a postal code but are in different Departments (and in one case, different regions). Some national postal codes are numeric, some are alphanumeric (like Canadian and UK ones), and the length of the postcode varies between countries too.</p>
<p>Perhaps the sensible thing, especially if you are dealing with addresses from multiple countries, is to not use postal code as a level in geographical hierarchies. If you use them at all, just make them an attribute of the address and remember that they don&#8217;t always have geographical parents.</p>
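<p>As a rough sketch of that idea (the table and column names here are purely illustrative assumptions, not taken from any real project), the postal code sits on the address row as a plain attribute while the reporting hierarchy runs from address up through city and region to country:</p>
<pre>
-- Hypothetical address dimension: every name here is an assumption for illustration.
CREATE TABLE address_dim (
  address_key   NUMBER        PRIMARY KEY,
  street_line   VARCHAR2(200),
  postal_code   VARCHAR2(20),            -- attribute only; text because formats and lengths vary by country
  city_name     VARCHAR2(100) NOT NULL,  -- the usual reporting grain
  region_name   VARCHAR2(100),
  country_name  VARCHAR2(100) NOT NULL
);

-- Roll up by the hierarchy levels; postal_code is never treated as a parent of anything.
SELECT country_name, region_name, city_name, COUNT(*) AS num_addresses
  FROM address_dim
 GROUP BY country_name, region_name, city_name;
</pre>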
<p>I think the key point about modelling geography is that just because you know how addresses work in your own country you can&#8217;t assume that they work like that in the country next door. If you have a requirement to report, for example, the efficiency of the postal service in delivering your goods by postal region you need to ensure that your reporting handles the anomalies and exceptions. As always, knowing your data is key to creating a correct model.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2011/05/geography-hierarchies/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Regular Expressions in OBIEE</title>
		<link>http://www.rittmanmead.com/2009/12/regular-expressions-in-obiee/</link>
		<comments>http://www.rittmanmead.com/2009/12/regular-expressions-in-obiee/#comments</comments>
		<pubDate>Fri, 18 Dec 2009 16:20:00 +0000</pubDate>
		<dc:creator>Stewart Bryson</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Oracle BI Suite EE]]></category>
		<category><![CDATA[Oracle Database]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/?p=3924</guid>
		<description><![CDATA[When reporting against an OLTP system, in many ways, OBIEE works like an ETL tool, transforming the source system data and presenting it as a star schema. After years of writing ETL code, if there&#8217;s one thing I hate to be without, it&#8217;s regular expressions. So, when working on a project to map an OLTP [...]]]></description>
			<content:encoded><![CDATA[<p>When reporting against an OLTP system, in many ways, OBIEE works like an ETL tool, transforming the source system data and presenting it as a star schema. After years of writing ETL code, if there&#8217;s one thing I hate to be without, it&#8217;s regular expressions. So, when working on a project to map an OLTP source system to a logical model in OBIEE, I came across the following issue, and knew immediately that I would need regular expressions to solve it.</p>
<p>The CONTACT table in the source system had two columns storing the name: FIRST_NAME and LAST_NAME. There were two different processes that wrote entries in the CONTACT table&#8230; and one of them was faulty, writing the entire name in the FIRST_NAME column, though it was never corrected. So the following is a decent representation of what the data looked like:</p>
<pre>
SQL&gt; create table CONTACT (contact_id NUMBER, first_name varchar2(50), last_name varchar2(50));

Table created.

Elapsed: 00:00:00.02
SQL&gt; insert into CONTACT values (1, 'Bryson, Stewart W.', NULL);

1 row created.

Elapsed: 00:00:00.02
SQL&gt; insert into CONTACT values (2, 'Mead, Jon', NULL);

1 row created.

Elapsed: 00:00:00.00
SQL&gt; insert into CONTACT values (3, 'Mark', 'Rittman');

1 row created.

Elapsed: 00:00:00.00
SQL&gt; select * from CONTACT;

CONTACT_ID | FIRST_NAME           | LAST_NAME
---------- | -------------------- | --------------------
         1 | Bryson, Stewart W.   |
         2 | Mead, Jon            |
         3 | Mark                 | Rittman

3 rows selected.

Elapsed: 00:00:00.00
SQL&gt;
</pre>
<p>I needed to map the BMM such that I could return FIRST_NAME and LAST_NAME regardless of whether the entire name was concatenated into the FIRST_NAME, or whether it was correctly distributed across both columns. Additionally, the fact that the middle initial needed to be included with FIRST_NAME also proved a little troubling. At the end of the day, this is what I came up with in SQL:</p>
<pre>
SQL&gt; SELECT CASE
  2           WHEN last_name IS null THEN trim(regexp_substr(first_name,'[^,]+$'))
  3           ELSE first_name
  4         END first_name,
  5         CASE
  6           WHEN last_name IS null THEN regexp_substr(first_name,'^([^,]+)')
  7           ELSE last_name
  8         END last_name
  9    FROM contact
 10  /

FIRST_NAME           | LAST_NAME
-------------------- | --------------------
Stewart W.           | Bryson
Jon                  | Mead
Mark                 | Rittman

3 rows selected.

Elapsed: 00:00:00.01
SQL&gt;
</pre>
<p>To explain a bit, I&#8217;ll start with how I extracted the first name information from the FIRST_NAME column. I needed to start just after the comma and then get the entire string until the end of the column. So I used the [^] structure in regular expressions, which basically says: match any character EXCEPT those listed between the brackets after the caret (^). The plus (+) instructs the RegEx engine to match one or more instances of the previous structure. And at the end, the dollar sign ($) dictates that the match must run to the end of the column value. So taken all together, [^,]+$ instructs the RegEx engine to:</p>
<p>&#8220;Find the run of non-comma characters that reaches the end of the column value and return it, which is everything after the last comma.&#8221;</p>
<p>The only kludge introduced here is that the first character after the comma was actually a space, and to remove it, I simply used a TRIM. If someone has a way to do this without a TRIM, then I&#8217;d be glad to hear it.</p>
<p>To extract the last name information from the FIRST_NAME column, I used a similar mechanism, except that, instead of using the dollar sign ($) at the end, I put the caret (^) at the beginning. It&#8217;s the same concept: it means that the match has to begin at the start of the column value. So, ^([^,]+) instructs the RegEx engine to:</p>
<p>&#8220;Start at the beginning of the column value, and return the whole string until a comma is encountered.&#8221;</p>
<p>Easy enough.</p>
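<p>For what it&#8217;s worth, one possible way to drop the TRIM is sketched below. It is only a sketch, and it assumes the only padding is spaces directly after the comma. The idea is to make the first matched character exclude spaces as well as commas:</p>
<pre>
-- Sketch: [^, ] forces the match to begin on a character that is neither a comma nor a space;
-- [^,]*$ then takes the remaining non-comma characters through to the end of the value.
SELECT regexp_substr(first_name, '[^, ][^,]*$') AS first_name
  FROM contact
 WHERE last_name IS NULL;
</pre>
<p>Against the sample rows above this should return the first names without the leading space; any trailing spaces would still need handling separately.</p>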
<p>Now I want OBIEE to accept this SQL in the BMM. The only issue here is that OBIEE does not support regular expressions in its SQL language, so I have to use the EVALUATE command to pass Oracle&#8217;s regular expression syntax back through to the database. So I&#8217;ll demonstrate how to do this, but first I&#8217;ll need to create a fact table to join to the CONTACT table in OBIEE.</p>
<pre>
SQL&gt; CREATE TABLE activity (contact_id number, activity_date date, num_calls NUMBER);

Table created.

Elapsed: 00:00:00.04
SQL&gt; INSERT INTO activity VALUES (1, SYSDATE-2, 10);

1 row created.

Elapsed: 00:00:00.00
SQL&gt; INSERT INTO activity VALUES (2, SYSDATE-1, 20);

1 row created.

Elapsed: 00:00:00.00
SQL&gt; INSERT INTO activity VALUES (3, sysdate, 30);

1 row created.

Elapsed: 00:00:00.00
SQL&gt;
SQL&gt; SELECT * FROM activity;

CONTACT_ID | ACTIVITY_DATE          |  NUM_CALLS
---------- | ---------------------- | ----------
         1 | 12/16/2009 10:21:34 AM |         10
         2 | 12/17/2009 10:21:34 AM |         20
         3 | 12/18/2009 10:21:34 AM |         30

3 rows selected.

Elapsed: 00:00:00.01
SQL&gt;
</pre>
<p>To demonstrate the whole process in OBIEE, I first built the BMM to bring the data in exactly as it is stored in the database:</p>
<div style="text-align:center"><img src="http://www.rittmanmead.com/wp-content/uploads/2009/12/non-regexp-data.png" alt="non-regexp data.png" border="0" width="500" height="179" /></div>
<p></p>
<p>To generate the correct data, using the regular expressions developed above, here is how I mapped the First Name attribute:</p>
<pre>CASE
WHEN "bidw".""."STEWART"."CONTACT"."LAST_NAME" IS NULL
  THEN  Trim(BOTH ' ' FROM Evaluate('regexp_substr(%1,''[^,]+$'')', "bidw".""."STEWART"."CONTACT"."FIRST_NAME" ))
ELSE
  "bidw".""."STEWART"."CONTACT"."FIRST_NAME"
END
</pre>
<div style="text-align:center"><img src="http://www.rittmanmead.com/wp-content/uploads/2009/12/first_name-column-mapping.png" alt="first_name column mapping.png" border="0" width="500" height="280" /></div>
<p></p>
<p>And here is how I mapped the Last Name attribute:</p>
<pre>CASE
WHEN "bidw".""."STEWART"."CONTACT"."LAST_NAME" IS NULL
  THEN Evaluate('regexp_substr(%1,''^[^,]+'')', "bidw".""."STEWART"."CONTACT"."FIRST_NAME" )
ELSE
  "bidw".""."STEWART"."CONTACT"."LAST_NAME"
END
</pre>
<div style="text-align:center"><img src="http://www.rittmanmead.com/wp-content/uploads/2009/12/last_name-column-mapping.png" alt="last_name column mapping.png" border="0" width="500" height="312" /></div>
<p></p>
<p>And finally&#8230; the results:</p>
<div style="text-align:center"><img src="http://www.rittmanmead.com/wp-content/uploads/2009/12/regexp-data.png" alt="regexp data.png" border="0" width="500" height="197" /></div>
<p></p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2009/12/regular-expressions-in-obiee/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Data quality is not a one-off</title>
		<link>http://www.rittmanmead.com/2008/11/data-quality-is-not-a-one-off/</link>
		<comments>http://www.rittmanmead.com/2008/11/data-quality-is-not-a-one-off/#comments</comments>
		<pubDate>Wed, 12 Nov 2008 16:59:59 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/11/12/data-quality-is-not-a-one-off/</guid>
		<description><![CDATA[In my Blog post on End-to-end data quality I mentioned the desirability of fixing bad data at source. This certainly attracted comment both here and on other blogs for example One point to keep in mind about fixing bad data on the source; it is just that, fixing DATA. We are not fixing bad applications [...]]]></description>
			<content:encoded><![CDATA[<p>In my Blog post on <a href="http://www.rittmanmead.com/2008/10/25/end-to-end-data-quality/" target="_blank">End-to-end data quality</a> I mentioned the desirability of fixing bad data at source. This certainly attracted comment both here and on other blogs, <a href="http://blogs.oracle.com/robreynolds/2008/11/continuing_the_data_quality_co_1.html" target="_blank">for example</a>.</p>
<p>One point to keep in mind about fixing bad data at the source: it is just that, fixing DATA. We are not fixing bad applications or bad processes. Often fixing source applications is just not going to happen, especially in the case of legacy and third-party packaged applications. Likewise, processes can be hard to fix; if we rely on humans to key information in and the application is incapable of enforcing data rules, then data entry problems will occur. For example, take a university admissions system. Here data entry is highly seasonal: first comes a lot of prospective candidates (say about 10x the number who will start as students in the new year), then a reduced number of place offers, which get further reduced by the candidates choosing another university or not reaching the entrance grades. This work flow often requires extra temporary staff just to key in details. They are not full-time employees, they do not know the significance of the data, and they do not know about the reporting systems that give insight to the data; they are just there to get the data in as quickly as possible and with enough accuracy to let the process as a whole work. It really does not matter to the process if the country name is not standardised providing the funding code is correct; even the course code may not matter that much providing the faculty is correct, especially if all faculty members follow a common scheme of study in the first year.<br />
So, fixing data at source is most unlikely to be a one-off exercise. True, a one-off data fix will improve the quality of our historic reporting, but what we are not doing is preventing data errors from occurring in the future. We must build into our ETL processes methods to continually monitor source data quality, and build processes so that any errors detected can be corrected at source.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/11/data-quality-is-not-a-one-off/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>End-to-end data quality</title>
		<link>http://www.rittmanmead.com/2008/10/end-to-end-data-quality/</link>
		<comments>http://www.rittmanmead.com/2008/10/end-to-end-data-quality/#comments</comments>
		<pubDate>Sat, 25 Oct 2008 10:56:14 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Methodology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/10/25/end-to-end-data-quality/</guid>
		<description><![CDATA[One of our customers is about to embark on a significant BI project; but being in the &#8220;public sector&#8221; they have to (by EU law) publish tender documents so that qualified suppliers throughout the EU can bid to do the work. This means they have a gap of almost a year before the, yet to [...]]]></description>
			<content:encoded><![CDATA[<p>One of our customers is about to embark on a significant BI project; but being in the &#8220;public sector&#8221; they have to (by EU law) publish tender documents so that qualified suppliers throughout the EU can bid to do the work. This means they have a gap of almost a year before the, yet to be selected, BI infrastructure can be implemented and work on building the solution can start.</p>
<p>In the interim, the customer can work on data quality; they know what they need to report on (it&#8217;s in the project mandate!) and they know the sources of information (their operational systems) so they can start to verify that all of the required facts can be found in the source systems and more importantly look at the data content and assess &#8220;fitness for purpose&#8221;. If data defects are found then it may be possible to get them fixed before the serious construction of the ETL layer starts. Besides, the knowledge of source and target gives a good head start in the specification of ETL interfaces.</p>
<p>One particular issue they might meet, and one that is sadly far too common across many business sectors, is the use of operational systems that do not enforce data integrity. For whatever reasons there is just too much freedom in data entry and although it may not affect the operational system much it really can cause problems when you try to aggregate information on the BI system.</p>
<p>But how do we deal with this? Recently I joined in on a thread on one of the LinkedIn BI groups where it was proposed that a &#8220;receive garbage, store garbage&#8221; strategy be adopted &#8211; in my opinion this might be OK for a mature BI system where users can understand that the reporting accurately reflects the source, but for a new venture into BI? To me, this seems to be too much of a risk; it might be that the new BI users do not have sufficient exposure to the source systems to realise that the data is at fault on the source. We could prevent data that fails a quality threshold from loading on the BI system, but then we would show <em>incomplete</em> results which, although correctly aggregated, are misleading because of omission; at the end of the day load policy is a business choice. If we go with the &#8220;reject poor data&#8221; route we should seriously think about providing a data quality dashboard on the reporting system to indicate the numbers of records that failed to be loaded, with drill-down to the reasons why they failed.</p>
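<p>A minimal sketch of what might sit behind such a dashboard, assuming a hypothetical ETL_REJECTS table that the load process writes to (the table and its columns are invented here purely for illustration):</p>
<pre>
-- Hypothetical reject log populated by the ETL process; all names are assumptions.
CREATE TABLE etl_rejects (
  load_date      DATE          NOT NULL,
  source_system  VARCHAR2(30)  NOT NULL,
  source_table   VARCHAR2(30)  NOT NULL,
  source_key     VARCHAR2(100),
  reject_reason  VARCHAR2(200) NOT NULL
);

-- Top-level dashboard measure: rejected rows per day, per source system, per reason.
-- Drill-down would add source_table and source_key to the query.
SELECT TRUNC(load_date) AS load_day,
       source_system,
       reject_reason,
       COUNT(*)         AS rows_rejected
  FROM etl_rejects
 GROUP BY TRUNC(load_date), source_system, reject_reason
 ORDER BY load_day, rows_rejected DESC;
</pre>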
<p>So what do we do with data that fails the quality standard? Ideally, we should get it fixed at source. Auto-fixing on load is possible, but then we need to think about data governance and the possible &#8216;trust&#8217; problems of the data not being aligned with the source. Maybe you could &#8216;standardise&#8217; country names and other columns on loading; I&#8217;ve seen systems with &#8216;USA&#8217;, &#8216;U.S.A&#8217;, &#8216;U S A&#8217;, &#8216;US of A&#8217;, &#8216;America&#8217;, and &#8216;US&#8217; in the country data feed, and that&#8217;s before we get to the mis-keying of &#8216;United&#8217; to get &#8216;Untied&#8217;! But maybe that sort of improvement in quality should also be available to operational systems users.</p>
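<p>If you did go down the standardise-on-load route, one simple shape for it is a synonym lookup. This is a sketch under assumptions: COUNTRY_SYNONYM and STG_CUSTOMER are hypothetical tables, and the raw values are assumed to be stored trimmed and in upper case.</p>
<pre>
-- Hypothetical lookup of raw spellings to a standard name.
CREATE TABLE country_synonym (
  raw_value      VARCHAR2(100) PRIMARY KEY,
  standard_name  VARCHAR2(100) NOT NULL
);

INSERT INTO country_synonym VALUES ('USA',   'United States');
INSERT INTO country_synonym VALUES ('U.S.A', 'United States');
INSERT INTO country_synonym VALUES ('U S A', 'United States');
INSERT INTO country_synonym VALUES ('US',    'United States');

-- Keep the supplied value next to the standardised one so the load stays auditable;
-- unmatched values come back with a NULL standard name for a human to review.
SELECT s.customer_id,
       s.country       AS country_as_supplied,
       m.standard_name AS country_standardised
  FROM stg_customer s
  LEFT OUTER JOIN country_synonym m
    ON UPPER(TRIM(s.country)) = m.raw_value;
</pre>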
<p>For this customer, I have suggested that they construct a source to BI target matrix and include some basic traffic light measures on the source data:</p>
<ul>
<li>How good is it?</li>
<li>What sort of errors are present: missing items, typographical errors, missing or incorrect parents, inconsistent use of names, even data entered in the wrong fields?</li>
<li>How important is it to be correct in the BI system? For example, street address cannot be aggregated in reporting and we may not be going to use BI to create mailing lists, but postal code (or a substring of it) can be used to aggregate people by location areas.</li>
<li>How important is it to be correct on the operational source &#8211; do we need to apply the corrections at source to improve the operational use of the system?</li>
</ul>
<p>But this type of quality review may not tackle the data problem that is probably hardest to deal with: what is a correct fact? How do I know if a house value of £20,000 is reasonable (it could be in a shared ownership scheme), or £2,000,000, or £20,000,000? We could set a validation range, but where is the point at which one penny more is obviously wrong, but the current value is OK?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/10/end-to-end-data-quality/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Being Flexible</title>
		<link>http://www.rittmanmead.com/2008/10/being-flexible/</link>
		<comments>http://www.rittmanmead.com/2008/10/being-flexible/#comments</comments>
		<pubDate>Sat, 18 Oct 2008 13:21:57 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Courses]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[General]]></category>
		<category><![CDATA[Rittman Mead]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/10/18/being-flexible/</guid>
		<description><![CDATA[One of the things I really love about working with the team here at Rittman Mead is how we can offer our clients a flexible service, I guess this flexibility must be appreciated by our customers as we have just won the UKOUG Business Intelligence Partner of the Year award. The past week I have [...]]]></description>
			<content:encoded><![CDATA[<p>One of the things I really love about working with the team here at Rittman Mead is how we can offer our clients a flexible service. I guess this flexibility must be appreciated by our customers, as we have just won the <a href="http://www.ukoug.org/communities/show_community.jsp?id=1401&amp;parent=771">UKOUG Business Intelligence Partner of the Year</a> award.</p>
<p>The past week I have been with a customer delivering a mixture of training and best practice advice. The training was based on my data warehouse design course, but heavily customised to fit their specific needs, and with completely new sections to look at data quality strategies and some specific approaches to managing a major BI development project. In reality, I guess that 40% of the material I presented came from existing Rittman Mead courses and the rest I created to fit the customer&#8217;s needs. I think this is a good way to work &#8211; I had always intended to write more extensively on data quality and the new slide set will be a useful add-in to our other training materials &#8211; and, of course, the customer gets the course they want. I will write about some of the data quality issues we discussed in a future posting; a lot of this will be highly relevant to most BI / DW owners.</p>
<p>Next week I am working on some course customisations for another customer and then joining the others at our first ever <a href="http://www.rittmanmead.com/oracle-bi-training-days-october-22nd-24th-london-uk/">Oracle BI Training Days</a> &#8211; I have been working on this project for quite a while now and am really looking forward to seeing the fruits of our work. The whole team have put in a lot of effort; it should be a good event!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/10/being-flexible/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Quality thoughts as I continue to chill out</title>
		<link>http://www.rittmanmead.com/2008/01/quality-thoughts-as-i-continue-to-chill-out/</link>
		<comments>http://www.rittmanmead.com/2008/01/quality-thoughts-as-i-continue-to-chill-out/#comments</comments>
		<pubDate>Wed, 23 Jan 2008 19:11:30 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Data Warehousing]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/01/23/quality-thoughts-as-i-continue-to-chill-out/</guid>
		<description><![CDATA[Over at another Blog, Beth notes that she is editor of the month for the Carnival of Data Quality. When it comes out I urge you to take a look. With those strange quirks of global IT companies Beth and I once worked for the same employer, but we have never met &#8211; I did [...]]]></description>
			<content:encoded><![CDATA[<p>Over at another <a href="http://datageekgal.blogspot.com/2008/01/data-quality-changes-large-and-small.html" target="_blank">Blog, Beth</a> notes that she is editor of the month for the Carnival of Data Quality. When it comes out I urge you to take a look. With those strange quirks of global IT companies Beth and I once worked for the same employer, but we have never met &#8211; I did see her photo once on the internet and if she donates to charity I&#8217;ll tell her where!</p>
<p>So what has data quality got to do with BI? Everything. For a BI system to be successful you need to fulfil three objectives:<br />
•    It must tell people what they need to know &#8211; by that I mean it must encompass enough detail to have real use<br />
•    It must tell people what they need to know rapidly enough<br />
•    And it must tell people the truth<br />
The last item is something that has to be designed in from the beginning of a BI project; performance and scope can be enhanced later but quality can&#8217;t, or not without having to replace already loaded data.<br />
When I get involved in an ETL project I probably spend less than a quarter of my time on building ETL code and getting it to run; the majority of my time goes on data quality and on finding and explaining anomalies. Where we can, we get data fixed at source, but sometimes that is not going to be possible; data profiling then allows us to formulate rules to handle the expected exceptions (whether these are auto-fix procedures or &#8220;park it to one side and let a human make the call&#8221; rules). And then there is the unexpected exception, which of course must be handled.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/01/quality-thoughts-as-i-continue-to-chill-out/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A flat world</title>
		<link>http://www.rittmanmead.com/2007/12/a-flat-world/</link>
		<comments>http://www.rittmanmead.com/2007/12/a-flat-world/#comments</comments>
		<pubDate>Thu, 27 Dec 2007 22:27:32 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Data Warehousing]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/12/27/a-flat-world/</guid>
		<description><![CDATA[A lot of data warehouses get some (if not all) of their data feeds through data files; in fact I can&#8217;t recall a recent project where at least one feed from an external system did not arrive as some form of flat file. I have had customers whose corporate data movement strategy is to go [...]]]></description>
			<content:encoded><![CDATA[<p>A lot of data warehouses get some (if not all) of their data feeds through data files; in fact I can&#8217;t recall a recent project where at least one feed from an external system did not arrive as some form of flat file. I have had customers whose corporate data movement strategy is to move everything through a piece of heavy-weight middleware and use a sophisticated son-of-FTP technology to ensure data delivery, and, even if the data existed in a database on the same network, not use inter-database links; I have also had customers that try to do everything as database to database movements using change data capture, remote tables, materialized views, transportable tablespaces or whatever the latest method might be; but somewhere there is always that little nugget to be loaded that exists only in an accountant&#8217;s spreadsheet or as a piece of XML from a web server.</p>
<p>Ignoring mock-database techniques such as reading spreadsheets using an ODBC driver and querying directly against the data, the most common way of loading pure text files into Oracle is to describe the data in some form of control structure and then use SQL Loader or external tables to bring the data into a structure that can be queried. We may well need to do some work making sure that the data types are recognised correctly; this is often a case of making sure that the file is appropriate for the NLS settings involved (or is that the NLS settings appropriate for the file?). How many times do you see the 6-<strong>MAY</strong>-1955 failing to load because the database wants <strong>mai</strong> or even <strong>05</strong>, or the wrong symbol is used for the decimal point? Some of these NLS-induced difficulties may only affect a subset of the data; for example, <em>some</em> month abbreviations might be common across languages, or perhaps a thousand separator is causing problems only with numbers bigger than 999.</p>
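<p>One way to take the session NLS settings out of the equation, sketched here with literal values rather than a real feed, is to land the awkward fields as text and then state the language and separators explicitly in the conversion:</p>
<pre>
-- The third argument pins the conversion, so it no longer depends on the session's NLS settings.
-- In NLS_NUMERIC_CHARACTERS the first character is the decimal marker, the second the group separator.
SELECT TO_DATE('6-MAY-1955', 'DD-MON-YYYY', 'NLS_DATE_LANGUAGE = AMERICAN')  AS birth_date,
       TO_NUMBER('1,234.56', '9G999D99', 'NLS_NUMERIC_CHARACTERS = ''.,''') AS amount
  FROM dual;
</pre>
<p>In a real load the literals would be the text columns of a staging or external table; the values above are only there to make the example self-contained.</p>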
<p>But loading the data is only part of the battle; we need to check it for sense, to profile it for anomalies, and to try to impose some referential (or dimensional) integrity rules on it. If we load data relating to products and product type we need to verify that we already know about the product types (the parent keys) or come up with a mechanism that allows us to load both parents and children; and what constitutes a duplicate record? Where we find problems we need to feed back to the data provider and work out how we can fix the source data if it is lacking, or work with them on rules to fix up the data once it is loaded. The key thing is to get this right before we start to propagate <em>suspect</em> data into the data warehouse.</p>
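<p>The parent-key and duplicate checks can be as simple as the two queries sketched below. STG_PRODUCT and PRODUCT_TYPE_DIM are hypothetical names, and &#8220;same PRODUCT_ID&#8221; is only one possible working definition of a duplicate:</p>
<pre>
-- Children whose parent key we do not yet know about.
SELECT p.product_id, p.product_type_code
  FROM stg_product p
 WHERE NOT EXISTS (SELECT NULL
                     FROM product_type_dim t
                    WHERE t.product_type_code = p.product_type_code);

-- Candidate duplicates under the assumed definition.
SELECT product_id, COUNT(*) AS copies
  FROM stg_product
 GROUP BY product_id
HAVING COUNT(*) &gt; 1;
</pre>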
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/12/a-flat-world/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Data quality thoughts</title>
		<link>http://www.rittmanmead.com/2007/10/data-quality-thoughts/</link>
		<comments>http://www.rittmanmead.com/2007/10/data-quality-thoughts/#comments</comments>
		<pubDate>Thu, 18 Oct 2007 20:14:15 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/10/18/data-quality-thoughts/</guid>
		<description><![CDATA[My e-friend, Beth, has been busy posting from the Information on Demand conference in Las Vegas. She holds dear the belief that data is key, and quality data is the master key. I think saying it like that is putting words into her mouth, but hopefully I am not distorting her view point too much. [...]]]></description>
			<content:encoded><![CDATA[<p>My e-friend, Beth, has been busy posting from the Information on Demand conference in Las Vegas. She holds dear the belief that data is key, and quality data is the master key. I think saying it like that is putting words into her mouth, but hopefully I am not distorting her view point too much. In fact we are probably kindred spirits on this, everything I do professionally is underpinned by data quality. Anyway, <a href="http://datageekgal.blogspot.com/2007/10/iod-conference-day-2-quick-comments.html" target="_blank">Beth pondered</a> a question about data quality and unstructured data, an intriguing concept that I thought I might think about too.</p>
<p>Quality in structured data is simple (from a rules viewpoint) for me: if it is not clean I won&#8217;t store it. That is, upstream of my data warehouse I have processes to clean data; it is de-duplicated, referentially checked, sense checked or <em>whatever-else-is-appropriate</em> checked before it is loaded, and the nonconforming data is parked for investigation or rule-based fix-up. Dirty data is bad for reporting accuracy and user confidence; it is also problematic to remove once it has been propagated through a BI system. Tools exist to profile data, and database features help to clean up some of the problems with data hierarchy mismatches. For example, Oracle dimension objects can be examined by a call to DBMS_DIMENSION.VALIDATE_DIMENSION (or before 10g, the similar DBMS_OLAP.VALIDATE_DIMENSION) and inspecting the exceptions table for &#8216;bad&#8217; rows. This may not be the technique of choice for a data load process, but it is a good method to validate that already loaded data complies with your sense of dimensionality.</p>
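<p>To make the idea of a hierarchy mismatch concrete, here is a small Python sketch of one such rule: every child value should roll up to exactly one parent. This is not Oracle&#8217;s implementation, only an illustration of the kind of check involved; the function name, the column names and the sample rows are assumptions, and missing (null) parents are not handled.</p>

```python
from collections import defaultdict

def hierarchy_violations(rows, child, parent):
    """Return {child_value: sorted parent values} for children with several parents."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    # Every child seen has at least one parent, so anything other than one is a conflict
    return {c: sorted(p) for c, p in parents.items() if len(p) != 1}

# Hypothetical dimension rows
geography = [
    {"city": "Leeds", "country": "England"},
    {"city": "Cardiff", "country": "Wales"},
    {"city": "Leeds", "country": "Wales"},  # conflicting rollup
]

bad = hierarchy_violations(geography, "city", "country")
```

<p>With these rows, <em>bad</em> holds Leeds with both England and Wales as parents, which is the sort of &#8216;bad&#8217; row you would expect to investigate.</p>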
<p>But when we move to unstructured data we come across a fundamental problem&#8230; what is quality? By definition there are no clearly defined dimension hierarchies, and there probably aren&#8217;t even tags to identify whether data is <em>&#8216;reference&#8217;</em> (unstructured reference data sounds a bizarre concept!) or <em>&#8216;fact&#8217;</em>, let alone <em>customer</em> or <em>product</em> related. We are probably not concerned with duplicates, as each unstructured item should be considered unique, but do we need to consider near duplicates to be versions (revisions) of the same item or new items? In practice the only sensible thing to do <em>today</em> is to store the data &#8216;as is&#8217;, maybe correcting spelling to some standard form (but that could well contravene governance rules), and rely on smart contextual indexing techniques to identify data items that possess a given attribute.</p>
<h4>Establishing context is hard</h4>
<p>A long while back I worked on a project for a publisher of online classified ads: a system that took copy submitted electronically and assessed it for compliance with standards of legality and decency. Apart from a complex rule base that required certain professions to supply validation of right to practice, the brunt of the system was about scanning for rude words, their phonetic equivalents, and IM and web variants where numb3rs might appear in place of letters and vowels are dropped. But this approach only really worked with words in isolation; moving to spans of text and scanning for expressions becomes harder &#8211; just think about electronically trying to establish meaning from poetry, where metaphors and non-standard word meanings abound. And it gets harder still when the feed is not keyed text but machine recognised speech or scanned handwriting.</p>
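<p>The word-in-isolation scanning described above can be sketched in a few lines of Python. This is not the system from that project; the substitution table, the function names and the (deliberately harmless) blocked words are all assumptions for illustration. Each word is normalised by undoing digit-for-letter swaps, then reduced to a vowel-less skeleton so that dropped vowels still match.</p>

```python
import re

# Hypothetical substitutions; a real rule base would be far larger
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
)
VOWELS = set("aeiou")

def normalise(word):
    """Lower-case the word, undo character substitutions, drop non-letters."""
    word = word.lower().translate(SUBSTITUTIONS)
    return re.sub(r"[^a-z]", "", word)

def skeleton(word):
    """Drop vowels so that 'nmbrs' and 'numbers' compare equal."""
    return "".join(ch for ch in normalise(word) if ch not in VOWELS)

def flag_words(text, blocklist):
    """Return the words in text whose skeleton matches that of a blocked word."""
    blocked = {skeleton(w) for w in blocklist}
    return [w for w in text.split() if skeleton(w) in blocked]

flagged = flag_words("Call now for n0 h4ssle numb3rs", ["numbers", "hassle"])
```

<p>Here <em>flagged</em> contains h4ssle and numb3rs. The skeleton comparison is lossy, so innocent words that share consonants with a blocked word will be flagged too, and, as noted above, it says nothing about meaning across a span of text.</p>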
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/10/data-quality-thoughts/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Yet more data quality posts</title>
		<link>http://www.rittmanmead.com/2007/06/yet-more-data-quality-posts/</link>
		<comments>http://www.rittmanmead.com/2007/06/yet-more-data-quality-posts/#comments</comments>
		<pubDate>Wed, 13 Jun 2007 20:00:28 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Quality]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/06/13/yet-more-data-quality-posts/</guid>
		<description><![CDATA[You&#8217;d almost think that Beth and I were the same person &#8211; we both seem to post about the same topics at around the same time. Today she posts on the joys of establishing standards in product descriptions, I had intended to write something similar, but instead I write this Often the biggest problems in [...]]]></description>
			<content:encoded><![CDATA[<p><em>You&#8217;d almost think that Beth and I were the same person &#8211; we both seem to post about the same topics at around the same time. Today she posts on the </em><a href="http://datageekgal.blogspot.com/2007/05/data-quality-standardizing-product.html" target="_blank"><em>joys</em></a><em> of establishing standards in product descriptions, I had intended to write something similar, but instead I write this</em></p>
<p>Often the biggest problems in data quality projects are nothing at all to do with the technology needed but are to do with people. And those people are not always the technologists involved directly with the work; often it is the information workers spread throughout the organisation. Fixing up these people issues needs a degree of charm and tact from the data architect running the data quality project, oh, and it needs a corporate big-hitter to sponsor the project and to enforce change.</p>
<ul>
<li>Common taxonomies need to be established</li>
<li>Data stewards, or owners, (and these are business people, not IT staff) need to be identified and empowered to maintain specific types of data.</li>
<li>Where regulatory compliance is important (and isn&#8217;t that everywhere?) data quality processes need to be engineered so that only the right people can make changes, and that might mean that the ownership of an item is split on attribute domains &#8211; say the financial attributes are owned by an accountant and the physical attributes by someone in supply chain.</li>
</ul>
<p>As I said, sorting out the people is often the hard part; moving the clean data around is remarkably easy in comparison.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/06/yet-more-data-quality-posts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Product dimensions</title>
		<link>http://www.rittmanmead.com/2007/06/product-dimensions/</link>
		<comments>http://www.rittmanmead.com/2007/06/product-dimensions/#comments</comments>
		<pubDate>Fri, 08 Jun 2007 19:56:19 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Quality]]></category>
		<category><![CDATA[Dimensional Modelling]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/06/08/product-dimensions/</guid>
		<description><![CDATA[The other day a colleague invited me to sit in on workshop session with a customer to discuss changes to an attribute of product. On the face of it it sounds a bit of overkill to spend two hours discussing the meaning of just one attribute, but when that attribute is cost and the impact [...]]]></description>
			<content:encoded><![CDATA[<p>The other day a colleague invited me to sit in on a workshop session with a customer to discuss changes to an attribute of product. On the face of it, it sounds a bit of overkill to spend two hours discussing the meaning of just one attribute, but when that attribute is cost and the impact of the change is reflected in the cost of sales and ultimately reported profit, it all seems worthwhile.</p>
<p>However, for this customer, <strong><em>product</em></strong> is not simple: they have <em>three</em> ways to view a product depending on whether they are buying, selling or just counting it. Let me explain.</p>
<p>Assume my customer runs a burger bar (which of course they don&#8217;t). They sell things such as <em>double cheese burger</em>, which is two burger patties, lettuce, tomato, relish, mayo, cheese and dill pickle in a bun. They also buy from their supplier disks of ground meat, the burger patty; these are delivered in boxes of 50, 100, or 250. And finally the bar manager counts his stock in terms of individual burgers. In BI terms this sort of thing does not easily map into conventional hierarchies; for one thing there is a many-to-many relationship between counted stock items and items in a burger recipe &#8211; do we have one burger patty or two? And what about all of the other ingredients? This is not exactly the classic parent-child rollup that is so easy to model in a data warehouse.</p>
<p>The relationship between goods supplied and counted stock is much simpler: each case size could be considered the child of the counted item. That is, 50 patty boxes and 250 patty boxes all roll up to &#8220;burger patty&#8221;. But this rollup also has an impact on the attribute stock cost &#8211; the cost of a single patty. Obviously it is 1/50th of the cost of a carton of 50 patties, but what if the supplier discounts the larger boxes? What value is assigned to the item cost? And then, by extension, we try to calculate the cost of a cheese burger as the sum of the individual recipe items (plus the prep costs, fuel, staff, etc.), but what if we used burger patties from the big carton? Do our costs go down and profits increase?</p>
<p>This type of discussion takes a long time to work through, with loads of &#8220;what ifs&#8221;, and usually ends up with the accountants having to think things through. But the end result is that the reporting is more accurate than before, or at least I hope so.</p>
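<p>The arithmetic behind that question can be sketched in Python. Every number here is made up: the carton prices, the figure for the other ingredients and the idea of using a purchase-weighted average are assumptions for illustration, not the rule the customer settled on.</p>

```python
from fractions import Fraction

# Hypothetical carton prices; the larger carton is cheaper per patty
CARTONS = {50: Fraction(25), 250: Fraction(100)}

def unit_cost(carton_size):
    """Cost of one patty if it came from a carton of the given size."""
    return CARTONS[carton_size] / carton_size

def weighted_unit_cost(purchases):
    """Average patty cost over a mix of cartons, given {carton_size: cartons_bought}."""
    total_cost = sum(CARTONS[size] * n for size, n in purchases.items())
    total_patties = sum(size * n for size, n in purchases.items())
    return total_cost / total_patties

def recipe_cost(patty_cost, patties=2, other_ingredients=Fraction(3, 10)):
    """Cost of a double cheese burger: the patties plus an assumed figure for the rest."""
    return patties * patty_cost + other_ingredients

small = recipe_cost(unit_cost(50))                       # patties from the 50 carton
large = recipe_cost(unit_cost(250))                      # patties from the 250 carton
blended = recipe_cost(weighted_unit_cost({50: 2, 250: 1}))
```

<p>With these made-up prices a patty costs 1/2 from the small carton and 2/5 from the large one, so the same burger costs 13/10 or 11/10 depending on which box was opened. A blended cost sits in between, which is one possible answer to &#8220;what value is assigned to the item cost?&#8221;, but choosing it is exactly the sort of decision the accountants need to make.</p>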
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/06/product-dimensions/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

