February 11th, 2009 by Mark Rittman
Now that Oracle Data Integrator and Oracle Warehouse Builder are starting to converge, I thought it worth taking a closer look at what’s new with the 10.1.3.4 and 10.1.3.5 releases of ODI. One of the most interesting new features introduced with ODI 10.1.3.4 was new data quality and profiling capabilities, and so given the new interest in ODI I thought it’d be worth downloading the new release and seeing how these new features work.
The new data quality features in ODI comes in the form of a separate application to ODI, called Oracle Data Profiling and Quality for Oracle Data Integrator (or ODQ for short), that runs alongside ODI and creates data cleansing jobs that can be executed as part of the ODI workflow. ODQ was actually developed by Oracle in association with Trillium, and as such has a different user interface to ODI, is a Windows native executable, and at this stage apart from the job export facility doesn’t really have many hooks into ODI. If you’re used to using the OWB Data Quality Option though, it’s certainly an interesting piece of software and I’ll try and draw some comparisons between the two tools as we go along.
To see how this new feature works, I worked my way through the tutorial that’s available on OTN and started working through the project. Assuming that you want to profile some data first and then clean it up, the first task you have to perform is to create what’s called a “metabase” that stores the detailed profile that you generate for your source data. As such this metabase is similar to the _PRF profiling schema that gets created when you start profiling data in OWB, except it seems to be bit easier to set up and doesn’t require an instance of Oracle to store the data in. Of course as the product is separate from ODI the metabase you create here is different from the repository you create for ODI.
Once you have a metabase in place, you can start to import the metadata on the source tables and files that you’d like to profile. This process works in two stages; first you define the metadata for the source, either entering it yourself manually, importing it from a DDL file or reading the table definitions from the relational data source.
For anyone familiar with the file and table import processes in OWB this is not too dissimilar, except of course it supports non-Oracle database sources such as SQL Server and IBM DB/2. Once you’ve imported your file or table metadata in, the second step you perform is to kick-off a job that reads the file or table and produces the initial set of profiling data.
As such, this is like the file and table import wizards that you get with OWB, coupled with the Profle > Create Profile option you get with the OWB Data Profile Editor (see this article on OTN for more details on how this OWB feature works).
Once your data is initially profiled, you can start to view information at the individual entity level, such as this overall profile “metadata”:
Statistics are also generated at the entity column level, so that you can see for example, how many unique values a column has, what the high and low values are, and so on.
In OWB, when you’re in the Data Profile Editor you get presented with three panes when viewing your data profile; the first pane gives you the overall aggregates, domain summaries and so on, whilst the second and third panes break the summaries down by values and example rows. ODQ takes a slightly different approach in that it lets you drill from data item to data item, exploring your data and allowing you to follow a line of investigation. To take an example, if I right-click on the City attribute, like this:
I can then display the individual metadata elements for that attribute. At this point I can drill further, requesting for example the complete set of unique values for that attribute:
with the unique values then being listed out in a grid. From this point on, I can display the rows that contain these values, and so on.
Other tasks you can do at the individual entity level include checking whether a column would qualify for a unique key (like the OWB Data Quality Option, obvious keys are detailed during the first profiling run, but you can ask for more details to be produced on columns that initially fell below the uniqueness threshold). Compared to OWB, the first thing you notice is that the whole process is a lot slicker and faster than working with the OWB graphical interface, and the profiling itself seems to take lot less time to run. I expect that both products have one or two features that the other ones don’t have; I know the OWB Data Quality feature a lot better, and I didn’t see in ODQ some of the OWB features such as Six Sigma ratings, comparisons of documented column definitions vs. discovered definitions, and I suspect OWB documents a lot more Oracle-specific column metadata items as well. But certainly the speed and inter-activeness of ODQ was certainly impressive, and I was able to discover information about my data far easier than when I’ve worked with the OWB Data Profile Editor, which suffers a bit from the clunkiness of the OWB Java GUI.
Like the OWB Data Profiler, you can generate statistics and venn diagrams about table joins, like this:
Compared to the OWB Data Profiler, setting this up took a couple more steps as you have (so I believe) to set up and profile the joins manually, whereas OWB tests for joins automatically. Interestingly though I was always loath to profile more than one table at a time in OWB as the whole process took so long, whereas in ODQ the profiling process is so fast that there’s very little overhead in gathering this sort of metadata. So +1 to OWB for doing a lot to auto-detect and document joins, and +1 to ODQ for making join detection, when you set it up, so fast to run.
The OWB Data Quality Option has the concept of Data Rules, where you can define rules and constraints which then get applied to your data in order to ensure compliance with these rules downstream in your ETL process. OWB goes one better in that it auto-detects and proposes data rules for you (where, for example, 90% of your column values are one of five values, and the rest are probably spelling mistakes), whereas in ODQ you set these up yourself manually. Again the ODQ interface is very slick although you obviously have to know what the rules are before you apply them.
Once you’ve set some rules up, you can run your data past them and report on how compliant it is.
So far, so good. If I had to sum the two products up, I’d say that the OWB Data Quality Option
- Discovers a lot of metadata for you, suggests business rules and more appropriate data type definitions, but
- Suffers from the clunky OWB Java Interface, non-standard cut-and-paste of results and so on
whilst ODQ in it’s profiling guise
- Makes you set up more profile data yourself, and involves more work in creating the profile in the first place, but
- Covers more than just Oracle databases, and
- Has a really fast, easy-to-use interface for exploring your profile information.
Now one of the coolest features in the OWB Data Quality Option was the ability to firstly, embed data rules and corrections into your OWB mappings, and secondly to auto-correct your data through the generation of OWB mappings. Again the article on OTN I linked to earlier sets out how this process works. So how does the equivalent feature in ODQ – the “Data Quality” part of “Oracle Data Profiling and Quality” – compare with this?
Well up until now, what we’ve been creating in the tool is referred to as a “Profiling” project, You can also create “Quality” projects are the equivalent the auto-generated OWB mappings that the OWB Data Profiler creates, except that you need to manually set all the rules and define the inputs/outputs yourself. Taking a look at one such Quality project, this cleanses names and addresses and moves the source data through a number of stages, indicated by the book icons, and transformations, indicated by the arrows.
Once the Quality project framework is in place, you add logic to the various stages, stripping out for example characters that would not be valid in a phone number.
Your Quality Project then builds up by adding conditions and logic to the various mapping transformations. Some of these transformations do specialist jobs, such as the various name and address transforms that do fuzzy matches on address, use postal address files to cleanse street addresses and so on. Others are more general and can implement the business rules you defined in the profiling stage.
Once you’ve created your project, you can run it from within the tool or export it, as a batch file, so that it can be executed from within an ODI package along with the rest of your ETL routine.
So, in summary, what do I think of the tool? Well the profiling part of ODQ certainly seems more responsive and easier to use than the OWB Data Profile Editor, and of course it handles non-Oracle sources as well as Oracle ones. OWB to me seems to auto-detect more object metadata, and of course it takes this information and suggests business rules around it. I’d say ODQ seems to have more features and is probably more of a mature tool, but if you’re an OWB user then the integrated nature of the OWB Data Profile Editor, together with it’s ability to auto-generate corrections (however useful they end up being, in reality they don’t lend themselves well to being altered after their initial generation) makes the OWB option seem quite attractive.
From the aspect of processing data against data rules, the “Quality” element of the ODI product, I’d say that the tool again looks very well featured, with some particularly strong support for name and address cleansing. The OWB Data Quality Option has the advantage that it generates standard OWB mappings, whereas the mapping element of ODQ is quite separate to ODI’s interfaces, and of course the metadata is quite separate. The ODI OdiDataQuality tool offers basic integration between the two toolsets, also my understanding is that at some point ODQ can work off of ODI sources and targets (known as models) though I work out how to do this during my tests.
It’s also worth considering how the two sets of tools are priced. According to the price list, the OWB Data Quality Option (now officially renamed the “Data Profiling and Quality Option”) costs $17.5k/CPU on top of the $47.5k/CPU for the Enterprise Edition of the Oracle Database. The equivalent features for ODI are broken down in the price list into Oracle Data Profiling, which costs costs $34.5k per named user, whereas Oracle Data Quality for Oracle Data Integrator (mapping and transformation part of ODQ together with the tool you plug into an ODI package) costs $70k per CPU, which probably explains why Oracle didn’t throw it into the ODI Enterprise Edition bundle. Pricing for ODI is already a little bit complex as you’ll also need to license ODI Enterprise Edition itself ($23k/CPU), you can license it as an option to OBIEE or you can license Data Integration Suite for $70k, but I’m not sure if the latter includes ODQ (I suspect not, as why would you then license ODQ on it’s own?). Whichever way it goes it’s probably fair to say that ODQ is going to be more expensive than the OWB Data Profiling and Quality Option, but then you’d expect it as it’s more of an enterprise-class solution, and these are generally priced a lot higher (think OBIEE vs Discoverer) than their historic Oracle equivalent.
Any, that’s the basics on the new data quality and profiling features in Oracle Data Integrator. In a few days, I’ll post something similar on Oracle OLAP and Essbase support in the recent versions of ODI, and try and wrap-up with something on the new Lightweight Designer.