We have developed a tool that translates complex SQL SELECT statements into ODI Mappings.
It was implemented using the Scala Parser Combinators library. We started by combining its Parser Generators into a context-free grammar to describe the Oracle SQL SELECT statement and… hang on, this blog post is not about impressing anyone with terminology. Let us start from the beginning, with the simple.
SQL-to-ODI translation - why bother?
ETL is most often prototyped with SQL statements. But once we have shown that the data transformations we are after work in SQL, we have to discard those SELECT statements and re-implement the same logic in ODI. Would it not be nice to refactor our SQL-based prototype into a first draft of ODI Mappings?
When migrating ETL logic from a legacy ETL system to ODI, it is likely that much of the legacy ETL logic will be presented in the form of a SELECT statement. If the number of those old mappings is in the tens or even hundreds, would we not welcome any opportunity to accelerate the migration process?
Bigger ETL implementations will typically have large sets of simple mappings following the same design template. If we want to replicate all tables in an OLTP Source schema to our Staging database, we could probably generate SELECT * FROM <source_table> statements for 100 source tables in an Excel spreadsheet in a minute. Creating that many ODI Mappings, no matter how simple, will take much longer than that. Do we not want to spend precious developer time on something more creative and challenging?
However, we do not always need a full-blown SQL parser and ODI Content generator for that. The attention-seeking introduction of this blog post is actually an extreme, complex example of ODI development acceleration. Simple ODI content generation can be done by simple means. There is no need for any form of SQL parsing if your mapping is based on a simple SELECT * FROM <source_table> statement. Now, for a moment let us forget about parsing and take a quick detour into the world of Groovy scripting, which is the first step towards generating ODI content.
Need to accelerate ODI development? Write a Groovy script!
I have been asked a couple of times about my favourite ODI feature. Without fail, I have always given the same reply - Groovy scripting!
In ODI Studio, from the main menu you navigate to Tools → Groovy, then choose New Script and write as much Groovy script as you please. It is that simple… sort of.
When scripting in Groovy, essentially the whole ODI SDK that ODI Studio itself is based on is at your disposal. So, in theory, everything you can do in ODI Studio, you can also script. You can design a Mapping, set up its Physical architecture (including IKM and LKM configuration), validate it and then create a scenario for it - all that with a Groovy script. It is a great tool to accelerate simple, repetitive build and migration tasks. On the flip side, the ODI SDK public API documentation is not very friendly to developers, especially beginners. Fortunately, Google is quite good at helping you with the simpler questions you may have - Stack Overflow has quite a few ODI-related Groovy questions asked and answered. Also, if you like Rittman Mead as much as I do, ask them for the 2-day Advanced ODI 12c Bootcamp - you will not become an ODI scripting expert in two days but it will get you started.
It will never be quicker to script the creation of a mapping in Groovy than to create the same mapping in ODI Studio. However, if we are talking about many Mappings that are simple, hence easy to script, we can save a lot of time by involving Groovy.
We can create a script that generates mappings for all tables in a source schema to replicate them to our Staging schema. We can even add ETL Batch ID or Replication Date if required. But if we need more than that, we will either need to provide heaps of metadata to base the Mapping generation on, or need to better understand the SELECT statement. The SELECT statement will usually be the preferred option, because you can easily test it on the Source database.
Now, let us return to the original subject of SQL-to-ODI translation.
SQL-to-ODI translation - can it be simple?
Yes. If all your source extract SELECT statements come in the form of SELECT * FROM <source_table>, you could write a regular expression to extract the table name, assuming that everything that comes after the FROM keyword is a single table name (i.e. no joins, no multiple tables). Then, if you can find a Data Store with the same name in an ODI Model, your script can add it into your Mapping and map the source columns to a Target table (mapping the columns by name or position). All that can be done with relative ease.
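To make the idea concrete, here is a minimal sketch of such a table name extraction, written in Python rather than Groovy purely for illustration. The function name `extract_table_name` and the rule that a table name is an unquoted, optionally schema-qualified identifier are assumptions of this sketch, not part of any ODI API.

```python
import re

# Matches only the simplest form: SELECT * FROM <one table name>, optional semicolon.
# Assumption: the table name is an unquoted identifier, optionally schema-qualified.
SIMPLE_SELECT = re.compile(
    r"^\s*SELECT\s+\*\s+FROM\s+([A-Za-z_][\w$#]*(?:\.[A-Za-z_][\w$#]*)?)\s*;?\s*$",
    re.IGNORECASE,
)


def extract_table_name(sql):
    """Return the upper-cased table name, or None if the statement is not in the simple form."""
    match = SIMPLE_SELECT.match(sql)
    return match.group(1).upper() if match else None


print(extract_table_name("SELECT * FROM customer"))
print(extract_table_name("select * from STG.ORDERS;"))
print(extract_table_name("SELECT * FROM a JOIN b ON a.id = b.id"))
```

Anything that is not a single table name after FROM (a join, a second table, a filter) makes the function return None, so a script could fall back to manual handling for those statements.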
What if we add a filter to the SELECT statement? We can still parse it with Regex - everything between the FROM and WHERE keywords is a table name, everything after the WHERE keyword is a filter condition:
SELECT * FROM <this is all table name> WHERE <filter from here onwards>
That sounds simple enough and it is - sort of. You need to be careful when pinpointing the FROM and WHERE keywords - a simple text search will not do the trick. The two SELECT statements below are 100% valid, but:
- in the first one the FROM is neither space- nor carriage return-separated from the SELECT expression
- the second one has two FROM in it, where the first one is part of the column alias WHERE_FROM and the second one initialises the FROM block
SELECT (123)FROM DUAL;
SELECT 'Milton Keynes' WHERE_FROM FROM DUAL;
Still, pinpointing the FROM and WHERE blocks is quite doable with a well-written regular expression (assuming no unions or subqueries).
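One possible way to do that is sketched below, again in Python for illustration only. It blanks out string literals first and then looks for FROM and WHERE as whole words. The function name `split_select` is hypothetical, and the sketch assumes a single-table statement that starts with SELECT and contains no comments, quoted identifiers, unions or subqueries.

```python
import re

# Single-quoted SQL string literal; '' inside a literal is an escaped quote.
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# \b stops FROM from matching inside WHERE_FROM, yet still matches right after ")".
FROM_KEYWORD = re.compile(r"\bFROM\b", re.IGNORECASE)
WHERE_KEYWORD = re.compile(r"\bWHERE\b", re.IGNORECASE)


def split_select(sql):
    """Split a single-table SELECT into (select_list, table, filter or None)."""
    sql = sql.strip().rstrip(";").strip()
    # Replace each string literal with spaces of the same length, so keywords
    # inside literals are ignored while character offsets stay aligned.
    masked = STRING_LITERAL.sub(lambda m: " " * len(m.group(0)), sql)
    from_match = FROM_KEYWORD.search(masked)
    if from_match is None:
        raise ValueError("no FROM keyword found")
    # Only look for WHERE after the FROM keyword.
    where_match = WHERE_KEYWORD.search(masked, from_match.end())
    # Assumption: the statement starts with the SELECT keyword.
    select_list = sql[:from_match.start()].strip()[len("SELECT"):].strip()
    if where_match is None:
        return select_list, sql[from_match.end():].strip(), None
    table = sql[from_match.end():where_match.start()].strip()
    return select_list, table, sql[where_match.end():].strip()


print(split_select("SELECT (123)FROM DUAL;"))
print(split_select("SELECT 'Milton Keynes' WHERE_FROM FROM DUAL;"))
print(split_select("SELECT * FROM CUSTOMER WHERE SURNAME LIKE 'S%'"))
```

The filter comes back as one untouched string, which suits a component that accepts a single condition string.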
Hopefully we do not need to pick the filter statement apart, because the ODI Filter Component accepts a single condition string.
Let us look at an example:
SELECT * FROM CUSTOMER WHERE NAME IN ('John','Paul','Peter') AND SURNAME LIKE 'S%' AND AGE BETWEEN 25 AND 85
In the WHERE clause above, we have a compound filter condition, consisting of three basic conditions joined with the AND keyword. If we used the above in ODI, it would not be recognised as a valid ODI filter, because ODI requires column references to be in the form of TABLE_NAME.COLUMN_NAME (instead of a table name, we could also have an Expression, an Aggregate or another Mapping Component). Whereas the SQL statement above is perfectly valid, the filter requires adjustment before we can use it in ODI. Perhaps a text search/replace could do the trick but how do we know what to search for? And if we search for NAME and replace it with CUSTOMER.NAME, we will corrupt SURNAME.
You could write a Regex parser for this very particular filter condition to extract the column names from it. But if the filter changes to … AND AGE >= 25 AND AGE <= 85, we will have to change or extend the Regex string accordingly. No matter how good with Regex you are, you may have to give up here.
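The search/replace pitfall is easy to reproduce. The Python sketch below first shows the naive replace going wrong, then a whole-word variant that prefixes a known list of column names. The helper `qualify_columns` and the idea of passing the column list in (for example, read from the matching Data Store) are assumptions of this sketch. It would still rewrite a column name that happens to appear inside a string literal or is already qualified, which is part of why this approach runs out of road.

```python
import re

filter_sql = "NAME IN ('John','Paul','Peter') AND SURNAME LIKE 'S%' AND AGE BETWEEN 25 AND 85"

# Naive text replace: it also rewrites the NAME hidden inside SURNAME.
naive = filter_sql.replace("NAME", "CUSTOMER.NAME")
print(naive)


def qualify_columns(condition, table, columns):
    """Prefix whole-word occurrences of the given column names with the table name."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in columns) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: table + "." + m.group(1), condition)


# Hypothetical column list for the CUSTOMER table.
print(qualify_columns(filter_sql, "CUSTOMER", ["NAME", "SURNAME", "AGE"]))
```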
We could ask for the SELECT statement to be rewritten in an ODI-friendly form instead of trying to transform the filter condition in our script.
In addition to the above, if we have expressions (column references, constants, functions, CASE statements, etc.) in the SELECT part of the query, can we extract those with a script so we can add them to an ODI Mapping's Expression Component?
SELECT 123, 'string, with, commas,,,', COALESCE(column1, ROUND(column2,2), column3, TRUNC(column4), column5), ...
When armed with just Regex, it will be very difficult. We cannot assume that every comma separates two expressions - many SQL functions have commas as part of their expression.
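A small Python sketch illustrates the comma problem. Splitting on every comma breaks the select list into far too many pieces, whereas a character scanner that tracks parenthesis depth and string literals recovers the three expressions. The function name `split_top_level` is hypothetical. Note that this is already no longer a Regex solution: it assumes single-quoted strings only, and it says nothing about what is inside each expression.

```python
select_list = (
    "123, 'string, with, commas,,,', "
    "COALESCE(column1, ROUND(column2,2), column3, TRUNC(column4), column5)"
)

# Splitting on every comma cuts through the string literal and the function calls.
naive_parts = select_list.split(",")
print(len(naive_parts))


def split_top_level(text):
    """Split on commas that are outside parentheses and outside string literals."""
    parts, current, depth, in_string = [], [], 0, False
    for ch in text:
        if ch == "'":
            # An escaped quote ('') toggles twice, so the state nets out.
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                # Top-level comma: close the current expression and skip the comma.
                parts.append("".join(current).strip())
                current = []
                continue
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


for expression in split_top_level(select_list):
    print(expression)
```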
Transformation logic in the SELECT part of a query (i.e. anything beyond simple column references), joins, subqueries, unions - I would not try to attack that armed with Regex alone. In the context of SQL-to-ODI translation, I would deem such SELECT statements complex or non-trivial. That is where Combinator Parsing and more serious software engineering come into play. Let us discuss that in Part Two of this blog post series.