Last time I wrote about the performance aspects of a BI system and how they could affect a user’s confidence. I concluded by mentioning that incorrect data might be generated by poorly coded ETL routines causing data loss or duplication. This time I am looking more at the quality of the data we load (or don’t load).
Back in the 1990’s I worked with a 4.5 TB DWH that had a single source for fact and reference data, that is the data loaded was self-consistent. Less and less these days we find a single source DWH to be the case; we are adding multiple data sources (both internal and external). Customers can now appear on CRM, ERP, social media, credit referencing, loyalty, and a whole host of other systems. This proliferation of data sources gives rise to a variety of issues we need to be at least aware of, and in reality, should be actively managing. Some of these issues require us to work out processing rules within our data warehouse such as what do we do with fact data that arrives before its supporting reference data; I once had a system where our customer source could only be extracted once a week but purchases made by new customers would appear in our fact feed immediately after customer registration. Obviously, it is a business call on whether we publish facts that involve yet to be loaded customers straight away or defer those loads until the customer has been processed in the DWH. In the case of my example we needed to auto-create new customers in the data warehouse with just the minimum of data, the surrogate key and the business key and then do a SCD type 1update when the full customer data profile is loaded the following week. Technical issues such as these are trivial, we formulate and agree a business rule to define our actions and we implement it in our ETL or, possibly, the reporting code. In my opinion the bigger issues to resolve are in Data Governance and Data Quality.
Some people combine Data Quality and Governance together as a single topic and believe that a single solution will put all right. However, to my mind, they are completely separate issues. Data quality is about the content of the data and governance is about ownership, providence and business management of the data. Today, Data Governance is increasingly becoming a regulatory requirement, especially in finance.
Governance is much more than the data lineage tools we might access in ETL tools such as ODI and even OWB. ETL lineage is about source to target mappings; our ability to say that 'bank branch name’ comes from this source attribute, travels through these multiple ODI mappings and finally updates that column in our BANK_BRANCH dimension table. In true Data Governance we probably do some or all of these:
- Create a dictionary of approved business terms. This will define every attribute in business terms and also provide translations between geographic and business-unit centric ways of viewing data. In finance one division may talk about "customer", another division will say “investor”, a third says “borrower”; in all three cases we are really talking about the same kind of object, a person. This dictionary should go down to the level of individual attribute and measures and include the type of data being held such as text, currency, date-time, these data types are logical types and not physical types as seen on the actual sources. It is important that this dictionary is shared throughout the organisation and is “the true definition" of what is reported.
- Define ownership (or stewardship) for the approved business data item.
- Map business data sources and targets to our approved list of terms (at attribute level). It is very possible that some attributes will have multiple potential sources, in such cases we must specify which source will be the master source.
- Define processes to keep our business data aligned.
- Define ownership for the sources for design (and for static data such as ISO country codes, content) change accountability. Possibility integrate into change notification mechanism of change process.
- Define data release processes for approved external reference data.
- Define data access and redaction rules for compliance purposes.
- Build-in audit and control.
Whereas governance is about using the right data and having processes and people to guarantee it is correctly sourced, Data Quality is much finer in grain and looks at the actual content. Here a tool such as Oracle Enterprise Data Quality is invaluable. By the way I have noticed that OEDQ version 12 has recently been released, I have a blog on this in the pipeline.
I tend to divide Data Quality into three disciplines:
- Data Profiling is always going to be our first step. Before we fix things we need to know what to fix! Generally, we try to profile a sample of the data and assess it column by column, row by row to build a picture of the actual content. Typically we look at data range, nulls, number of distinct values and in the case of text data: character types used (alpha, letter case, numeric, accents, punctuation etc), regular expressions. From this we develop a plan to tackle quality, for example on a data entry web-page we may want to tighten processing rules to prevent certain “anticipated” errors; more usually we come up with business rules to apply in our next stage.
- Data Assessment. Here we test the full dataset against the developed rules to identify data that conforms or needs remedy. This remedy could be referring the data back to the source system owner for correction, providing a set of data fixes to apply to the source which can be validated and applied as a batch, creating processes to “fix” data on the source at initial data entry, or (and I would strongly advise against this for governance reasons) dynamically fix in an ETL process. The reason I am against fixing data downstream in ETL is that the data we report on in our Data Warehouse is not going to match the source and this will be problematic when we try to validate if our data warehouse fits reality.
- Data de-duplication. This final discipline of our DQ process is the most difficult, identifying data that is potentially duplicated in our data feed. In data quality terms a duplicate is where two or more rows refer to what is probably (statistically) the same item, this is a lot more fuzzy than an exact match in database terms; people miskey data, call centre staff mis-hear names, companies merge and combine data sets, I have even seen customers registering a new email address because they can not be bothered to reset their password on a e-selling website. De-duplication is important to improve the accuracy of BI in general, it is nigh-on mandatory for organisations that need to manage risk and prevent fraud.