ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse.
• At its simplest, it is a process of copying data from one database to another.
• Data is mostly extracted from an OLTP database, transformed to match the data warehouse schema and loaded into the data warehouse database.
• Many data warehouses also incorporate data from non‐OLTP systems such as text files, legacy systems, and spreadsheets.
• When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation.
• ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers.
• It is not a one-time event, as new data is added to the data warehouse periodically (monthly, daily, or even hourly).
• It creates a logical and physical separation between the source systems and the data warehouse.
• ETL operations should be performed on a relational database server separate from both the source databases and the data warehouse database.
ETL involves the following tasks:
- Extracting the data from source systems (SAP, ERP, and other operational systems). During extraction, data from the different source systems is converted into one consolidated format that is ready for transformation processing (a minimal extraction sketch follows this list).
- Transforming the data
• Data transformation is the main step where the ETL process adds value.
• Data transformation actually changes the data and provides guidance on whether it can be used for its intended purposes.
• It is performed in the staging area.
• Transforming the data may involve the following tasks (see the transformation sketch after this list):
o cleansing - by definition, a transformation process in which data that violates business rules is changed to conform to these rules. It is usually done by ETL programs that determine or derive correct data values and then write them into the BI target databases.
o summarization - values are summarized to obtain total figures, which are calculated and stored at multiple levels as business facts in multidimensional fact tables.
o derivation - new data is created from existing (detailed) source data by calculations, program logic, or table lookups. Examples of derivation include calculating profit from income and expense items, or calculating a customer's age from their date of birth and the current year.
o aggregation - data elements for customers may be aggregated from multiple source files and databases (e.g. a Customer Master File, Sales File, and Prospect File). In multidimensional database terminology, this term sometimes also refers to roll-ups of data values.
o integration - the expected result is that each unique data element is known by one standard name, with one standard, approved definition. Data integration forces the reconciliation of different data names and values for the same data element. Each data element should also be associated with its source databases as well as with its BI target databases.
- Loading the data into a data warehouse, data repository, or other reporting applications (see the load sketch after this list).
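To make the extraction step concrete, here is a minimal sketch in Python (pandas plus the standard sqlite3 module). Everything in it — the column names, the CSV contents, and the `orders` table — is invented for illustration; an inlined CSV and an in-memory SQLite database stand in for real spreadsheet and OLTP sources so the sketch runs standalone.

```python
import io
import sqlite3
import pandas as pd

# Hypothetical source 1: a spreadsheet/CSV export. The CSV text is
# inlined so the sketch runs standalone; in practice this would be a
# file path or an extract from a legacy system.
csv_source = io.StringIO("cust,amt\nC001,120.50\nC002,80.00\n")
sales_csv = pd.read_csv(csv_source)

# Hypothetical source 2: an operational OLTP database. An in-memory
# SQLite database stands in for it here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (CustomerID TEXT, Total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("C001", 45.00), ("C003", 60.00)])
orders = pd.read_sql_query("SELECT * FROM orders", conn)

# Consolidate both sources into one common column layout so that the
# transformation step can treat all records uniformly.
sales_csv = sales_csv.rename(columns={"cust": "customer_id", "amt": "amount"})
orders = orders.rename(columns={"CustomerID": "customer_id", "Total": "amount"})
raw = pd.concat([sales_csv, orders], ignore_index=True)
print(raw)
```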
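The transformation tasks — cleansing, derivation, summarization/aggregation, and integration — could then look roughly like this. The business rule (expenses must be non-negative), the derived columns, and the customer master file are all assumptions made for the example, not a prescribed design.

```python
import pandas as pd

# Hypothetical consolidated input produced by the extraction step.
raw = pd.DataFrame({
    "customer_id":   ["C001", "C001", "C002", "C003"],
    "date_of_birth": ["1980-03-14", "1980-03-14", "1975-07-01", "1990-11-30"],
    "income":        [120.50, 45.00, 80.00, 60.00],
    "expense":       [70.00, 20.00, 90.00, -5.00],
})

# Cleansing: data violating a business rule (here: expenses must be
# non-negative) is corrected before it reaches the target database.
raw.loc[raw["expense"] < 0, "expense"] = 0.0

# Derivation: new data created from existing source data, e.g. profit
# from income and expense items, and age from the date of birth and
# the current year.
raw["profit"] = raw["income"] - raw["expense"]
raw["age"] = pd.Timestamp.now().year - pd.to_datetime(raw["date_of_birth"]).dt.year

# Summarization / aggregation: detailed rows are rolled up into totals
# per customer, as they would be stored in a multidimensional fact table.
facts = (raw.groupby("customer_id", as_index=False)
            .agg(total_income=("income", "sum"),
                 total_profit=("profit", "sum")))

# Integration: reconcile the same data element across sources under one
# standard name by joining against a (hypothetical) customer master file.
master = pd.DataFrame({"customer_id":   ["C001", "C002", "C003"],
                       "customer_name": ["Acme", "Globex", "Initech"]})
facts = facts.merge(master, on="customer_id", how="left")
print(facts)
```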
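Finally, a sketch of the load step, again using an in-memory SQLite database as a stand-in for the warehouse server (which, as noted above, would normally be a separate machine).

```python
import sqlite3
import pandas as pd

# Hypothetical fact table produced by the transformation step.
facts = pd.DataFrame({"customer_id":  ["C001", "C002", "C003"],
                      "total_profit": [75.50, -10.00, 65.00]})

# An in-memory SQLite database stands in for the warehouse server.
warehouse = sqlite3.connect(":memory:")

# if_exists="append" matches the periodic nature of ETL: each run adds
# the new facts instead of rebuilding the table from scratch.
facts.to_sql("customer_profit_fact", warehouse,
             if_exists="append", index=False)

print(warehouse.execute("SELECT * FROM customer_profit_fact").fetchall())
```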