Data sources

“The data source: Traditionally, this is one (or quite possibly more) remote system(s)” I wrote in a recent Blog and followed that up with a promise to expand on that statement later

Firstly, note the emphasis on the word “remote”; I am a strong believer that in order to get the best performance out of a data warehouse it should reside in its own database instance and certainly not one used for transactional purposes. There are just too many conflicts between performance tuning a database for OLTP and for data warehousing. The second point is that we can have more than one data source. If I have a chain of supermarkets I may get my sales data from a central system, or the individual POS systems in the stores, I might get warehouse stock data from an inventory management system and the reference data could come from the same systems or an operational data store serving the whole business

Until recently the only viable way to get data into the data warehouse was to load flat files through SQL/Loader: the remote system wrote its data to a flat file, the file was then loaded into a staging area in the DW. A later improvement to this technique was to use the markedly faster “direct path” loading option in SQL/Loader. The most recent releases of Oracle can even query these flat files directly as external tables. Flat files, however, are not the only way to move data between source and data warehouse. It has long been possible to load a data warehouse by querying remote systems directly though database links or gateway. Newer dataload techniques borrow heavily on the methods used in database backup and database replication; log shipping and transportable tablespaces have both found a place in dataload processes