The role of Data Lakes or Staging Areas for Data Warehouse ETL


We were asked recently by a client, about the role of a staging area for ETL processes when loading a Data Warehouse, specifically they wanted to know if this was still required if using a Data Lake.


TLDR: Data Lakes and Staging areas could be interchangeable in terms of ETL processes, the key consideration is who else and what else will make use of the data within a Data Lake and do you have the right policies and procedures in place to ensure good data governance.


As with so many things people often see Data Lakes as a technology solution, but the reality is that it is a service. Data Lakes provide a method of surfacing data in its raw/native form to a variety of users and downstream processes, these are intended to use relatively cheap storage and to help accelerate insights into business decisions. We see clients opting to implement Data Lakes on a variety of different technologies which have various individual benefits, drawbacks and considerations, however the prevailing trend in terms of operating an effective Data Lake and in terms of controlling cost is the need for careful governance in terms of various aspects of data quality and security including items such as data retention, and data dictionary.

A staging area for a Data Warehouse serves a single focused purpose of holding raw data from source systems and providing a location for transient tables that are part of the transformation steps. Depending on the design methodology and ETL toolset the purpose of the staging area varies slightly but the target audience is always simply the ETL process and the Data Engineers who are responsible for developing and maintaining the ETL. This doesn't negate the need the data governance that is required in a Data Lake but it does simplify it significantly when compared to the multitude of users and processes which may access a Data Lake.


Traditional direct ETL from source system to Data Warehouse 
Data Warehousing with the inclusion of a Data Lake



CONCLUSIONS


  1. Depending on the toolset chosen for the Data Lake, ETL, and Data Warehouse the location and method for performing transformations and storing transient intimidate tables could be either in the Data Lake or within a sub schema of the Data Warehouse database.

  2. If your ETL and Data Warehouse is the only downstream user of a Data Lake is it even a Data Lake?

  3. Get your processes and policies right in terms of data governance, retention, and security.