Slightly Fuzzy Lookups

Recently I had a requirement to selectively "translate" the data in one column of a table before loading into a data warehouse. In this case we had to "standardise" a list of countries. Normally, this is classic use of an outer join to a table containing the incorrect expression and its translation and where a match is found use the translation or else the original expression.

select case when t.new_country is null then o.country else t.new_country end country_name
from
translation_table t, orignal_table o
where o.country = t.country(+);

But in this case the usage of the column and the spellings and abbreviations used were highly inconsistent. Sometimes we had the country name entered in the local language, sometimes additional words in same column before or after the country name such as a town name or a post code, sometimes we had punctuation, occasionally people confused county with country, even in some cases delivery instructions like "leave behind garage". Although is feasible to create a translation list based on all of the incorrect variations, it is not the best idea; there can be many variations and coding all eventualities may not be sustainable

One thing that can get forgotten is that is possible to use an expression instead of an equality in the outer join where clause. For example we could use LIKE and in the translation_table store match strings (with % and _ characters as required) instead of the exact strings.

select case when t.new_country is null then o.country else t.new_country end country_name
from
translation_table t, orignal_table o
where o.country LIKE t.country(+);

But with Oracle 10g and later we have an even more flexible option, the use of regular expressions. This allows us to optionally match parts of strings, to define the length of match and even to specify alternative matching characters.
We store the regular expressions in the translation_table.

select case when t.new_country is null then o.country else t.new_country end country_name
from
translation_table t, orignal_table o
where REGEXP_LIKE(o.country , t.country(+));

The key thing to watch out for is that a string might match more than one regular expression - for example "WALES" would match "WALES" (obviously) but also "NEW SOUTH WALES" - so choose the regular expressions with care.

But can you use this is a ETL tool such as Oracle Warehouse Builder? There is no real problem with the use of LIKE, except that you should uncheck the use ANSI SQL check box on the mapping properties, but as David Allan mentions there is a difficulty in using the REGEXP_LIKE operator as OWB does not know that the REGEXP_LIKE expression is boolean. David's solution is quite simple - wrap the logic in a case operator and then check that for equality; simple and effective.