December 15th, 2010 by Stewart Bryson
In the last post, I described the concept of a “Mapping” entity in Transcend as a bundling of pre- and post-mapping processes that can be executed with a single call. The “Dimension” entity is very similar. In fact, in strict programming terms (for those of you who are interested… the rest I’ll bore for just a moment), a Dimension actually is a polymorphed Mapping for Transcend. We wrote Transcend using lots of object types, and a Dimension type actually inherits–or is “under” in Oracle object-relational speak–a Mapping type. So what this means is that a Dimension is a special kind of Mapping. It uses the Mapping’s methods for bundling processes before and after an ETL mapping to facilitate the loading of hybrid Type 1 and Type 2 slowly-changing dimensions.
First, I need a dimension table that I can work with, and one that most of us are familiar with. I’m going to use the SH.PRODUCTS table… but with a few modifications. The dimension tables in the SH schema don’t use surrogate keys, so I’m going to add one. Additionally… I’m going to pare down the columns a little bit, while also updating the data slightly to be a little more standard:
SQL> create table stewart.product_dim 2 as select 3 prod_id product_key, 4 prod_id, 5 prod_name, 6 prod_desc, 7 prod_status, 8 prod_eff_from, 9 prod_eff_to, 10 prod_valid 11 from sh.products; Table created. SQL> update stewart.product_dim 2 set prod_eff_to='12/31/9999', 3 prod_valid='Y' 4 where prod_valid='A'; 72 rows updated. SQL> update stewart.product_dim 2 set prod_valid='N' 3 where prod_valid='I'; 0 rows updated. SQL>
Notice that, with the SH.PRODUCTS table, there are only ‘Active’ rows in the table. You’ll see how this degrades the quality of the test case later, but I’ll trudge on. I’ll also add a few indexes to demonstrate Transcend’s capability for handling those, just like we saw with the Mapping. I’ll do some dog-fooding and use the product to generate these indexes:
SQL> BEGIN 2 trans_etl.build_indexes 3 ( p_owner => 'stewart', 4 p_table => 'product_dim', 5 p_source_owner => 'sh', 6 p_source_table => 'products', 7 p_index_type => 'bitmap' 8 ); 9 END; 10 / 1 index creation process executed for STEWART.PRODUCT_DIM PL/SQL procedure successfully completed. SQL>
I’ll also need a source table, which would be the target table for our ETL mapping. Hold tight… let me explain. Transcend does not try to replace your ETL tool… it just tries to make some of the heavy-lifting easier. As the SCD implementation in most of these tools is either non-standard or non-perfromant, we tried to deliver a best practices approach that could be easily called from any tool. Obviously, this would work well with custom-developed ETL mappings as well. So all you have to do in your ETL mapping is construct the business logic for how the data set from your source systems needs to be joined and transformed to be ready to roll into your dimension. You drop that in a staging table, tell Transcend that this data needs to make it’s way into the dimension table, and then the product will do the rest. My holding table will be called PRODUCT_SRC, and needs a subset of the columns that exist in the dimension table. The only columns that aren’t required are the ones that are calculated and stored to facilitate SCD processing. For my table, this is: PRODUCT_KEY, PROD_EFF_TO, and PROD_VALID. I’ll add a few rows to this table from SH.PRODUCTS and modify them to look like source-system updates:
SQL> create table staging.product_src 2 as select 3 prod_id, 4 prod_name, 5 prod_desc, 6 prod_status, 7 prod_eff_from 8 from sh.products 9 where rownum < 11; Table created. SQL> update staging.product_src 2 set prod_name = 'New '||prod_name; 10 rows updated. SQL>
Now I need to configure the Dimension and how I want it loaded, much like I configured the Mapping in the last post. Actually, many of the parameters are the same, but there are some new ones in there as well:
SQL> BEGIN 2 trans_adm.create_dimension 3 ( p_mapping => 'map_product_dim', 4 -- dimension table 5 p_owner => 'stewart', 6 p_table => 'product_dim', 7 -- SCHEMA for intermediate tables 8 p_staging_owner => 'staging', 9 -- intermediate source table 10 p_source_owner => 'staging', 11 p_source_table => 'product_src', 12 -- SEQUENCE for the dimension 13 p_sequence_owner => 'stewart', 14 p_sequence_name => 'product_key_seq', 15 p_default_scd_type => 2, 16 p_description => 'load for PRODUCT_DIM', 17 -- MANAGE indexes and constraints 18 p_indexes => 'both', 19 p_index_type => 'bitmap', 20 p_constraints => 'both' 21 ); 22 END; 23 / PL/SQL procedure successfully completed. SQL>
So I’ve told Transcend about my dimension table, including the sequence to use for the surrogate key, the default SCD type (more on this later), and the staging (or work) schema to hold any intermediate tables that Transcend needs to create. I’m also registering the PRODUCT_SRC table which I described above, and I’m telling it how to handle constraints and indexes, which I described in the last post.
With the basics around the dimension table configured, I now need to work on the columns. I’ve configured the default SCD type as a Type 2. That means that I don’t have to do anything with the columns that are regular dimensional attributes for which I want to capture change. So the only things I need to register with Transcend are the following: any Type 1 dimensional attributes, the surrogate key, the natural key, the current indicator column, the effective date and the expiration date:
SQL> BEGIN 2 trans_adm.create_dim_attribs 3 ( p_mapping => 'map_product_dim', 4 p_surrogate => 'product_key', 5 p_effective_dt => 'prod_eff_from', 6 p_expiration_dt => 'prod_eff_to', 7 p_current_ind => 'prod_valid', 8 p_nat_key => 'prod_id', 9 p_scd1 => 'prod_status' 10 ); 11 END; 12 / PL/SQL procedure successfully completed. SQL> commit; Commit complete. SQL>
It’s worth noting that the P_NAT_KEY and P_SCD1 columns take comma-separated lists of values in case you need to pass multiple column names in. I can also use the P_SCD2 parameter if my default SCD type was a 1.
Since I haven’t told Transcend the specific approach I want to take to getting the rows into the dimension table, the default ‘merge’ methodology is employed. This represents a change in the default behavior beginning in version 2.5 away from the old default of ‘exchange’, which I will demonstrate in a bit. A single “INSERT into… SELECT…” statement will be issued to pull all the rows with a PROD_VALID of ‘Y’ from PRODUCT_DIM along with all the incoming rows from PRODUCT_SRC into an intermediate table, doing complex SQL analytics along the way to evaluate and process any Type 1 and Type 2 changes, as well as setting PROD_EFF_TO, PROD_EFF_FROM and PROD_VALID. The changes are then MERGED back into the dimension table. Because we use a single, set-based process, the performance over comparable row-by-row processing can be substantial. Executing the Dimension functionality is done as if it were a regular mapping:
SQL> exec trans_etl.start_mapping( 'map_product_dim' ); Pre-mapping processes beginning Pre-mapping processes completed PL/SQL procedure successfully completed. SQL> -- here you would execute the mapping SQL> exec trans_etl.end_mapping( 'map_product_dim' ); Post-mapping processes beginning Table STAGING.STG$PRODUCT_DIM created Number of records processed with analytics into STAGING.STG$PRODUCT_DIM: 82 5 constraint disablement processes for STEWART.PRODUCT_DIM executed 1 index and 0 local index partitions affected on table STEWART.PRODUCT_DIM Number of SCD1 attributes updated with a MERGE in STEWART.PRODUCT_DIM: 0 Number of records merged into STEWART.PRODUCT_DIM: 82 1 index rebuild process for table STEWART.PRODUCT_DIM executed 5 constraint enablement processes for STEWART.PRODUCT_DIM executed Post-mapping processes completed PL/SQL procedure successfully completed. SQL>
It’s not an overly impressive test case because, as I mentioned above, there are no historical rows in this table at all: PROD_VALID is ‘Y’ for every record. Transcend brings all the current dimension rows into the working set for the following reasons: the effective dates and the current indicator might have to be modified, and all the current Type 2 attributes have to be in the working set to be able to determine whether there have been Type 2 changes. But before this, a separate MERGE statement executes to update all the Type 1 changes for the historical rows. A Type 1 change is seen as a correction, and requires updating ALL the rows in the dimension table for that entity, even ones that may be years old.
Now, I’ll reset the test case in preparation for using a PARTITION EXCHANGE instead of a MERGE. This will issue two insert statements: one that does the same insert statement as above, and another that brings all the old dimension rows into the intermediate table as well. This is because we are replacing PRODUCT_DIM table with a brand new version of the table. Exchanges are handy in cases where 100% uptime is required, and users can’t afford to be down for even the amount of time it takes to load dimension tables. Partition exchanges don’t disrupt queries to the dimension tables, as the load is done to the intermediate table while the dimension is still available. When that load is complete, the dimension table is just swapped out for the new version of the table in an instantaneous dictionary update. Transcend makes this possible by creating the intermediate table with a single partition using MAXVALUE so that the one partition contains all data, and swapping that single partition with the dimension table. Note, however, that we still have to process all the Type 1 updates to the historical rows the same as we did before. However, we do that in the intermediate table so it is done prior to being exchanged in to the dimension table:
SQL> BEGIN 2 trans_adm.modify_dimension 3 ( p_mapping => 'map_product_dim', 4 p_replace_method => 'exchange' 5 ); 6 END; 7 / PL/SQL procedure successfully completed. SQL> exec trans_etl.start_mapping( 'map_product_dim' ); Pre-mapping processes beginning Pre-mapping processes completed PL/SQL procedure successfully completed. SQL> -- here you would execute the mapping SQL> exec trans_etl.end_mapping( 'map_product_dim' ); Post-mapping processes beginning Table STAGING.STG$PRODUCT_DIM created Number of history records inserted into STAGING.STG$PRODUCT_DIM: 0 Number of records processed with analytics into STAGING.STG$PRODUCT_DIM: 82 Number of SCD1 attributes updated with a MERGE in STAGING.STG$PRODUCT_DIM: 10 Statistics from STEWART.PRODUCT_DIM transferred to partition PMAX of STAGING.STG$PRODUCT_DIM 1 index creation process executed for STAGING.STG$PRODUCT_DIM 5 constraints built for STAGING.STG$PRODUCT_DIM STEWART.PRODUCT_DIM exchanged for partition PMAX of table STAGING.STG$PRODUCT_DIM Post-mapping processes completed PL/SQL procedure successfully completed. SQL>
Again… since there are no historical records in the PRODUCT_DIM table, the first insert brought over no records.
And in case you aren’t confused enough as is… Transcend supports late-arriving dimensions as well. In the default mode, if you get an incoming dimension row that has en effective date before the effective date of the current row for that dimensional entity, then the PROD_EFF_TO and PROD_EFF_FROM date ranges, as well as the tracking of SCD 2 attributes, won’t work correctly. But if we turn on the late-arriving feature, Transcend will handle this seamlessly. It sets the effective dates and current indicator correctly over the entire range of data, but more importantly… it will represent the history of SCD 2 attributes correctly regardless of which order the rows came in. The complex analytics statement used to evaluate Type 1 and Type 2 attributes, as well as setting effective dates and current indicators, will process the entire dimension table along with the incoming rows. Though this sounds like it would be excruciatingly slow, the fact is that it isn’t. Compared to the typical row-by-row approach that ETL tools take, requiring lots and lots of updates, a pure set-based process, even if it reads the entire dimension table, is generally the winner. But remember, this is only required if late-arriving dimensions are a possibility. However, for many clients, the processing of the entire table combined with a partition exchange outperforms the MERGE approach of just loading the current records. Updates are expensive folks.
You can see that the process is down to a single insert again:
SQL> BEGIN 2 trans_adm.modify_dimension 3 ( p_mapping => 'map_product_dim', 4 p_late_arriving => 'yes' 5 ); 6 END; 7 / PL/SQL procedure successfully completed. SQL> exec trans_etl.start_mapping( 'map_product_dim' ); Pre-mapping processes beginning Pre-mapping processes completed PL/SQL procedure successfully completed. SQL> -- here you would execute the mapping SQL> exec trans_etl.end_mapping( 'map_product_dim' ); Post-mapping processes beginning Table STAGING.STG$PRODUCT_DIM created Number of records processed with analytics into STAGING.STG$PRODUCT_DIM: 82 Statistics from STEWART.PRODUCT_DIM transferred to partition PMAX of STAGING.STG$PRODUCT_DIM 1 index creation process executed for STAGING.STG$PRODUCT_DIM 5 constraints built for STAGING.STG$PRODUCT_DIM STEWART.PRODUCT_DIM exchanged for partition PMAX of table STAGING.STG$PRODUCT_DIM Post-mapping processes completed PL/SQL procedure successfully completed. SQL>
The moral of this story: if you have any Type 1 attributes, don’t immediately go for the default MERGE approach with it’s multiple updates required. Give the late-arriving approach a try, even if you don’t have any. You might see the single insert statement outperform all the rest.