Regular Expressions in OBIEE

December 18th, 2009 by Stewart Bryson

When reporting against an OLTP system, in many ways, OBIEE works like an ETL tool, transforming the source system data and presenting it as a star schema. After years of writing ETL code, if there’s one thing I hate to be without, it’s regular expressions. So, when working on a project to map an OLTP source system to a logical model in OBIEE, I came across the following issue, and knew immediately that I would need regular expressions to solve it.

The CONTACT table in the source system had two columns storing the name: FIRST_NAME and LAST_NAME. There were two different processes that wrote entries in the CONTACT table… and one of them was faulty, writing the entire name in the FIRST_NAME column, though it was never corrected. So the following is a decent representation of what the data looked like:

SQL> create table CONTACT (contact_id NUMBER, first_name varchar2(50), last_name varchar2(50));

Table created.

Elapsed: 00:00:00.02
SQL> insert into CONTACT values (1, 'Bryson, Stewart W.', NULL);

1 row created.

Elapsed: 00:00:00.02
SQL> insert into CONTACT values (2, 'Mead, Jon', NULL);

1 row created.

Elapsed: 00:00:00.00
SQL> insert into CONTACT values (3, 'Mark', 'Rittman');

1 row created.

Elapsed: 00:00:00.00
SQL> select * from CONTACT;

CONTACT_ID | FIRST_NAME           | LAST_NAME
---------- | -------------------- | --------------------
         1 | Bryson, Stewart W.   |
         2 | Mead, Jon            |
         3 | Mark                 | Rittman

3 rows selected.

Elapsed: 00:00:00.00
SQL>

I needed to map the BMM such that I could return FIRST_NAME and LAST_NAME regardless of whether the entire name was concatenated into the FIRST_NAME, or whether it was correctly distributed across both columns. Additionally, the fact that the middle initial needed to be included with FIRST_NAME also proved a little troubling. At the end of the day, this is what I came up with in SQL:

SQL> SELECT CASE
  2           WHEN last_name IS null THEN trim(regexp_substr(first_name,'[^,]+$'))
  3           ELSE first_name
  4         END first_name,
  5         CASE
  6           WHEN last_name IS null THEN regexp_substr(first_name,'^([^,]+)')
  7           ELSE last_name
  8         END last_name
  9    FROM contact
 10  /

FIRST_NAME           | LAST_NAME
-------------------- | --------------------
Stewart W.           | Bryson
Jon                  | Mead
Mark                 | Rittman

3 rows selected.

Elapsed: 00:00:00.01
SQL>

To explain a bit, I’ll start with how I extracted the first name information from the FIRST_NAME column. I needed to start at the comma and then get the entire string until the end of the column. So I used the [^] structure in regular expressions, which basically says, return anything EXCEPT the character between the brackets and after the carrot (^). The plus (+) instructs the RegEx engine to return one or more instances of the previous structure. And at the end, the dollar sign ($) dictates that the entire string must run to the end of the column value. So taken all together, [^,]+$ instructs the RegEx engine to:

“Start at the first character that is not a comma, and return all non-comma characters all the way to the end of the column value.”

The only kludge introduced here is that the first non-comma character was actually a space, and to remove it, I simply used a TRIM. If some one has a way to do this without a TRIM, then I’d be glad to hear it.

To extract the last name information from the FIRST_NAME column, I used a similar mechanism, except that, instead of using the dollar sign ($) at the end, I put the carrot (^) at the beginning. It’s the same concept: it means that the expression returned has to begin at the start of the column value. So, the ^([^,]+ instructs the RegEx engine to:

“Start at the beginning of the column value, and return the whole string until a comma is encountered.”

Easy enough.

Now I want OBIEE to accept this SQL in the BMM. The only issue here is that OBIEE does not support regular expressions in it’s SQL language, so I have to use the EVALAUTE command to pass Oracle’s regular expression syntax back through to the database. So I’ll demonstrate how to do this, but first I’ll need to create a fact table to join to the CONTACT table in OBIEE.

SQL> CREATE TABLE activity (contact_id number, activity_date date, num_calls NUMBER);

Table created.

Elapsed: 00:00:00.04
SQL> INSERT INTO activity VALUES (1, SYSDATE-2, 10);

1 row created.

Elapsed: 00:00:00.00
SQL> INSERT INTO activity VALUES (2, SYSDATE-1, 20);

1 row created.

Elapsed: 00:00:00.00
SQL> INSERT INTO activity VALUES (3, sysdate, 30);

1 row created.

Elapsed: 00:00:00.00
SQL>
SQL> SELECT * FROM activity;

CONTACT_ID | ACTIVITY_DATE          |  NUM_CALLS
---------- | ---------------------- | ----------
         1 | 12/16/2009 10:21:34 AM |         10
         2 | 12/17/2009 10:21:34 AM |         20
         3 | 12/18/2009 10:21:34 AM |         30

3 rows selected.

Elapsed: 00:00:00.01
SQL>

To demonstrate the whole process in OBIEE, I first built the BMM to bring the data in how it is from the database:

non-regexp data.png

To generate the correct data, using the regular expressions developed above, here is how I mapped the First Name attribute:

CASE
WHEN "bidw".""."STEWART"."CONTACT"."LAST_NAME" IS NULL
  THEN  Trim(BOTH ' ' FROM Evaluate('regexp_substr(%1,''[^,]+$'')', "bidw".""."STEWART"."CONTACT"."FIRST_NAME" ))
ELSE
  "bidw".""."STEWART"."CONTACT"."FIRST_NAME"
END
first_name column mapping.png

And here is how I mapped the Last Name attribute:

CASE
WHEN "bidw".""."STEWART"."CONTACT"."LAST_NAME" IS NULL
  THEN Evaluate('regexp_substr(%1,''^[^,]+'')', "bidw".""."STEWART"."CONTACT"."FIRST_NAME" )
ELSE
  "bidw".""."STEWART"."CONTACT"."LAST_NAME"
END
last_name column mapping.png

And finally… the results:

regexp data.png

Comments

  1. Doug Clark Says:

    Stewart – while I appreciate technically what you’ve done here (which might meet a clients requirements), I would argue strongly against ever using this tool (OBIEE) or an ETL tool to ‘clense’ data.

    If there is a faulty process writing data into OLTP, ensure that that process is changed – don’t mask the root issue – use the BI tools to highlight the issue.

    In my opinion it is garbage in, garbage out – this is GOOD as you are highlighting deficiencies elsewhere.

    Cheers

    Doug

  2. Stewart Bryson Says:

    Thanks for the comment Doug.

    In the perfect world, then you are correct, and hopefully, my client will correct this issue. But since I wasn’t engaged by the operations team, but instead the BI team, I’m not certain whether it will.

    There’s a whole sub-industry devoted to data quality in the BI world, and the usual implementation of these tools is for viewing enterprise data… not in correcting source systems. The reason for this is simple… sometimes, data irregularities in the source system don’t move the meter with the operations team the way they do for the BI team. QA may raise a ticket for a data quality issue and then send it to the queue to be corrected. However, it’s difficult to say how heavy an issue such as this may weigh with an operations team that has different priorities… such as rolling out new features that customers are demanding. There are only so many hours in the day, after all.

    And while this example is a rather egregious one, in my humble opinion… it’s not the worst I’ve seen by a long shot. Thank goodness we have OBIEE, OWB, etc. to help us clean data in situations such as these until the time when the operations team can correct the issue.

Write a comment





Website Design & Build: tymedia.co.uk