Extraction - Transformation - Loading is known as the ETL process.
ETL tools extract data from different RDBMS source systems, transform it by applying calculations, concatenations, etc., and then load the data into the Data Warehouse system.
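A minimal sketch of the overall flow in Python, assuming DB-API style connections (e.g. sqlite3) and purely hypothetical table and column names:

```python
# Hypothetical example: extract orders from a source RDBMS, apply a simple
# transformation, and load the result into a warehouse table.
# Connections, table names, and columns are illustrative only.

def extract(source_conn):
    # Pull raw rows from the operational source system.
    cur = source_conn.execute("SELECT order_id, amount, currency FROM orders")
    return cur.fetchall()

def transform(rows, usd_rates):
    # Example transformation: convert every amount to USD.
    return [
        (order_id, round(amount * usd_rates.get(currency, 1.0), 2), "USD")
        for order_id, amount, currency in rows
    ]

def load(target_conn, rows):
    # Write the cleansed rows into the warehouse table.
    target_conn.executemany(
        "INSERT INTO dw_orders (order_id, amount_usd, currency) VALUES (?, ?, ?)",
        rows,
    )
    target_conn.commit()
```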
Need of ETL
Helps companies analyze their business data and take critical business decisions.
Transactional databases cannot answer the complex business questions that can be answered with ETL.
Provides common data repository.
ETL enables moving data from various sources into the data warehouse.
As data sources change, the data warehouse updates automatically.
Allow verification of data transformation, aggregation and calculation rules.
ETL offers historical context for business.
Fig. ETL Process
Step 1 - Extraction.
In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded.
There are three types of data extraction methods.
a] Full Extraction.
b] Partial Extraction without update notification.
c] Partial Extraction with update notification.
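A minimal sketch of these three extraction methods, assuming a DB-API style connection (e.g. sqlite3) and hypothetical customers, last_modified, and change-log tables/columns:

```python
from datetime import datetime

def full_extraction(conn):
    # Full extraction: pull the entire table on every run.
    return conn.execute("SELECT * FROM customers").fetchall()

def partial_extraction(conn, last_run: datetime):
    # Partial extraction without update notification: the source does not
    # tell us what changed, so we filter on a modification timestamp.
    return conn.execute(
        "SELECT * FROM customers WHERE last_modified > ?",
        (last_run.isoformat(),),
    ).fetchall()

def partial_extraction_with_notification(conn, last_run: datetime):
    # Partial extraction with update notification: the source exposes a
    # change log (e.g. a CDC table) listing exactly which rows changed.
    return conn.execute(
        "SELECT * FROM customers_change_log WHERE changed_at > ?",
        (last_run.isoformat(),),
    ).fetchall()
```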
Validations done during extraction.
Reconcile records with source data.
Make sure no spam/unwanted data is loaded.
Data type check.
Remove all types of duplicate/fragmented data.
Check whether all keys are in place.
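A short sketch of these extraction-time checks applied to rows represented as dictionaries; the key and field names are assumptions:

```python
def validate_extracted(rows, required_keys=("customer_id",)):
    seen = set()
    clean = []
    for row in rows:
        # Check whether all key fields are in place.
        if any(row.get(key) is None for key in required_keys):
            continue
        # Data type check: the key must be an integer.
        if not isinstance(row["customer_id"], int):
            continue
        # Remove duplicate records.
        if row["customer_id"] in seen:
            continue
        seen.add(row["customer_id"])
        clean.append(row)
    return clean
```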
Step 2 - Transformation.
Data extracted from the source server is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped and transformed.
In fact, this is the key step where the ETL process adds value and changes data such that insightful reports can be generated.
In this step, we apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.
In this step, we can perform customized operations on data.
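A minimal sketch distinguishing pass-through (direct move) columns from columns that get customized operations; all field names are hypothetical:

```python
def transform_row(row):
    return {
        # Direct move / pass-through: copied unchanged.
        "order_id": row["order_id"],
        "store_code": row["store_code"],
        # Customized operations: cleanse names and derive a combined address.
        "customer_name": row["customer_name"].strip().title(),
        "full_address": ", ".join(
            part for part in (row["street"], row["city"], row["zip"]) if part
        ),
    }
```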
Validations done during transformation
Filtering - Select only certain columns to load.
Using rules & lookup tables for data standardization.
Character set conversion & encoding handling.
Conversion of units of measurement like date/time conversion, currency conversion, numerical conversion, etc.
Data threshold validation check.
Data flow validation from staging area to the intermediate tables.
Required fields should not be left blank.
Transposing rows and columns.
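A sketch of a few of the rules above: lookup-based standardization, currency conversion as a unit-of-measure conversion, and a required-field check. The lookup table and rates are illustrative values, not real reference data:

```python
GENDER_LOOKUP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
USD_RATES = {"EUR": 1.08, "INR": 0.012, "USD": 1.0}

def standardize_gender(value: str) -> str:
    # Lookup table so every source spelling maps to one standard value.
    return GENDER_LOOKUP.get(value.strip().lower(), "Unknown")

def to_usd(amount: float, currency: str) -> float:
    # Currency conversion as an example of unit-of-measure conversion.
    return round(amount * USD_RATES[currency], 2)

def check_required(row: dict, required=("customer_id", "order_date")) -> bool:
    # Required fields should not be left blank.
    return all(row.get(field) not in (None, "") for field in required)
```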
Step 3 - Loading
Loading data into the target data warehouse database is the last step of the ETL process.
In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short period; hence, the load process should be optimized for performance.
In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity.
Data warehouse admins need to monitor, resume, or cancel loads as per the prevailing server performance.
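One way to restart from the point of failure is a simple checkpoint. The sketch below assumes batches are loaded in order and records the last committed batch in a small JSON file; the file name, table, and batch shape are assumptions:

```python
import json
import os

CHECKPOINT_FILE = "load_checkpoint.json"

def last_committed_batch() -> int:
    # Read the index of the last batch that was committed successfully.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["batch"]
    return -1

def load_batches(target_conn, batches):
    # Skip batches that were already committed before a failure.
    start = last_committed_batch() + 1
    for i in range(start, len(batches)):
        target_conn.executemany(
            "INSERT INTO dw_sales (sale_id, amount) VALUES (?, ?)", batches[i]
        )
        target_conn.commit()
        # Record the checkpoint only after the batch is committed.
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump({"batch": i}, f)
```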
Types of loading:
a] Initial load - Populating all the data warehouse tables.
b] Incremental Load - Applying ongoing changes as and when needed, periodically.
c] Full refresh - Erasing the contents of one or more tables and reloading with fresh data.
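A sketch of the three load types against a hypothetical dw_customers table, using SQLite-style SQL (the upsert syntax varies by database):

```python
def initial_load(target_conn, all_rows):
    # Initial load: populate the warehouse table for the first time.
    target_conn.executemany(
        "INSERT INTO dw_customers (id, name) VALUES (?, ?)", all_rows
    )
    target_conn.commit()

def incremental_load(target_conn, changed_rows):
    # Incremental load: apply only the rows that changed since the last run.
    target_conn.executemany(
        "INSERT OR REPLACE INTO dw_customers (id, name) VALUES (?, ?)",
        changed_rows,
    )
    target_conn.commit()

def full_refresh(target_conn, all_rows):
    # Full refresh: erase the table contents and reload fresh data.
    target_conn.execute("DELETE FROM dw_customers")
    target_conn.executemany(
        "INSERT INTO dw_customers (id, name) VALUES (?, ?)", all_rows
    )
    target_conn.commit()
```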
Validations done during loading process.
Ensure that key field data is neither missing nor null.
Test modelling views based on target tables.
Check combined values and calculated measures.
Data checks in the dimension table as well as the history table.
Check BI reports on the loaded fact and dimension table.
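A sketch of two of these post-load checks expressed as SQL queries against the warehouse; table and column names are hypothetical:

```python
def key_fields_not_null(target_conn) -> bool:
    # Ensure that key field data is neither missing nor null.
    missing = target_conn.execute(
        "SELECT COUNT(*) FROM dw_orders WHERE order_id IS NULL"
    ).fetchone()[0]
    return missing == 0

def fact_joins_to_dimension(target_conn) -> bool:
    # Check the loaded fact table against the dimension table:
    # every fact row should join to a dimension row.
    orphans = target_conn.execute(
        """SELECT COUNT(*) FROM dw_orders o
           LEFT JOIN dw_customers c ON o.customer_id = c.id
           WHERE c.id IS NULL"""
    ).fetchone()[0]
    return orphans == 0
```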
ETL tools:
1] MarkLogic.
2] Oracle.
3] Amazon Redshift.