Staging DB in Datawarehousing – The WHYs and WHY NOTs
 

Whenever we talk about designing datawarehouses, staging database is one consideration we have to make. And it depends on many factors whether you choose to have one or not. I feel when we start small, like implementing small datamarts, we may not see the need for a staging database but as the DW implementations grow in size and complexity, staging DBs have a lot of benefits to offer. Staging DBs are often the place to dump source data, the workplace to create temperory intermediate tables for processing the transformation steps in ETLs, to keep what is good, to reject what is bad and so on.

Lets quickly look through the PROs and CONs of having Staging DBs in the DW implementations:

 
The WHYs

Source data accumalation: If data is being pulled from multiple systems Source systems and not all data is available at the same time, we may need to accumalate all raw data pulled at different times and then integrate/merge and load the same.

Data transformation, cleansing, data alignment: Staging DB may be the workplace for the ETL to do the required cleansing, apply transformation rules and perform alignment of data from various systems.

Decoupling of Source from Datawarehouse: Staging DB helps decouple the Source systems from the DW. Source systems are accessed only by ETL jobs which pull data into staging, or they may even choose to push data there. This greatly reduces the amount of time that source systems are accessed by DW loading jobs, reducing contention too.

ETL Performance: Loading staging tables is simple as they are mostly the exact replicas of the source tables, minus the indexes / foreign key relationships. As most of the joins on the data , cleansing, calculations are done on staging tables and may be stored in intermediate temporary tables, loading of DW from staging tables is also simpler. This greatly increases the ETL performance as well as quality of the DW data.

Handle Failure, debugging, restart , recovery: As staging DBs have snapshots of source data, can have extensive logging, intermediate tables used for data transformations etc, they make it easier to debug errors/failures. In case of restarts / failures, the step of pulling the source data may not have to be redone.

Data reconciliation / pre / post load validations : This is very much similar to the point above. Staging DBs are excellent to capture auditing information which can be used for reconciling data, validating data to reject bad data, make sure the data we loaded is correct , consistent with the source, nothing is missed out.

Change Data Capture: Staging DB is also used as one of the mechanisms for CDC. For this the copy of source data tables from previous load are preserved till the next execution. The current and previous source data are compared to identify changes to the data. This is mostly used in scenarios where the source systems do not have means to support change detections.

Simpler, manintenable ETLs: With staging, we divide the process of Extract Transform Load into different processes. Combining the extraction, cleansing, transformation logic, precalculations, summarizations into one single ETL makes it very complex and also a single point for failure. With staging step, ETLs are smaller, simpler and easy to maintain as well.

Prevent Orphan records / race conditions: When we fetch data from source directly and load into DW, there are chances that we sometimes may pull some partial updates happening in the transactional system or the new data might be getting inserted as we pull causing race conditions. With Staging, the data from source is pulled in one single go preventing such problems.

Backup of source: The staging DBs if archived / retained for some specific period of time may act like a backup of the source data which is loaded into the DW.

 

The WHY NOTs

Increases processing time , as data is first loaded into staging tables, transformed and then loaded to the DW. Due to the additional step, there is slight increase in processing time.
Increases latency, source changes take more time to reflect in DW. In realtime/near realtime DW, staging is not usually included.
Environment requirements : Additional Database servers / storage space required.

The overhead that staging may impose is quite less as compared to the benefits we derive. But then again, it completely depends on what suits a given requirement. The above points may however help us in making a choice.

 
Shraddha Tambe | Helical IT Solutions

 
References:

1) http://data-warehouses.net/architecture/staging.html
2) http://dwbi.org/etl/etl/52-why-do-we-need-staging-area-during-etl-load
3) http://danlinstedt.com/allposts/datavaultcat/data-vault-and-staging-area/
4) http://www.dwh-club.com/dwh-bi-articles/change-data-capture-methods.html

Leave a Reply