written 7.9 years ago by | modified 7.5 years ago by |
Dimensions in data management and data warehousing contain relatively static data about such entities as geographical locations, customers, or products. Data captured by Slowly Changing Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule.
Some scenarios can cause Referential integrity problems.
For example, a database may contain a fact table that stores sales records. This fact table would be linked to dimensions by means of foreign keys. One of these dimensions may contain data about the company's salespeople: e.g., the regional offices in which they work. However, the salespeople are sometimes transferred from one regional office to another. For historical sales reporting purposes it may be necessary to keep a record of the fact that a particular sales person had been assigned to a particular regional office at an earlier date, whereas that sales person is now assigned to a different regional office. Dealing with these issues involves SCD management methodologies referred to as Type 0 through 6. Type 6 SCDsare also sometimes called Hybrid SCDs.
1. Type 0: retain original :
The Type 0 method is passive. It manages dimensional changes and no action is performed. Values remain as they were at the time the dimension record was first inserted. In certain circumstances history is preserved with a Type 0. Higher order types are employed to guarantee the preservation of history whereas Type 0 provides the least or no control. This is rarely used.
2. Type 1: overwrite:
This methodology overwrites old with new data, and therefore does not track historical data.
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
---|---|---|---|
123 | ABC | Acme Supply Co | CA |
In the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the row will be unique by the natural key (Supplier_Code). However, to optimize performance on joins use integer rather than character keys (unless the number of bytes in the character key is less than the number of bytes in the integer key).
If the supplier relocates the headquarters to Illinois the record would be overwritten:
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
---|---|---|---|
123 | ABC | Acme Supply Co | IL |
The disadvantage of the Type 1 method is that there is no history in the data warehouse. It has the advantage however that it's easy to maintain.
If one has calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.
3. Type 2: add new row:
This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert.
For example, if the supplier relocates to Illinois the version numbers will be incremented sequentially:
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Version |
---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | 0 |
123 | ABC | Acme Supply Co | IL | 1 |
Another method is to add 'effective date' columns.
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Start_Date | End_Date |
---|---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | 01-Jan-2000 | 21-Dec-2004 |
123 | ABC | Acme Supply Co | IL | 22-Dec-2004 | NULL |
The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.
Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed. To reference the entity via the natural key, it is necessary to remove the unique constraint making Referential integrity by DBMS impossible.
If there are retroactive changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.
4. Type 3: add new attribute
- This method tracks changes using separate columns and preserves limited history. The Type 3 preserves limited history as it is limited to the number of columns designated for storing historical data. The original table structure in Type 1 and Type 2 is the same but Type 3 adds additional columns. In the following example, an additional column has been added to the table to record the supplier's original state - only the previous history is stored.
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Effective_Date | Current_Supplier_State |
---|---|---|---|---|---|
123 | ABC | Acme Supply Co | CA | 22-Dec-2004 | IL |
This record contains a column for the original state and current state—cannot track the changes if the supplier relocates a second time.
One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State which would track only the most recent historical change.
5. Type 4: add history table
- The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes. Both the surrogate keys are referenced in the Fact table to enhance query performance.
For the above example the original table name is Supplier and the history table is Supplier_History.Supplier
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State |
---|---|---|---|
123 | ABC | Acme& Johnson Supply Co | IL |
Supplier_History
Supplier_Key | Supplier_Code | Supplier_Name | Supplier_State | Create_Date |
---|---|---|---|---|
123 | ABC | Acme Supply Co,CA | CA | 14-June-2003 |
123 | ABC | Acme & Johnson Supply Co | IL | 22-Dec-2004 |
- This method resembles how database audit tables and change data capture techniques function.
6.Type 6: hybrid
The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term was that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido[citation needed]. Ralph Kimball calls this method "Unpredictable Changes with Single-Version Overlay" in The Data Warehouse Toolkit.[1]
The Supplier table starts out with one record for our example supplier:
Supplier_Key | Row_Key | Supplier_Code | Supplier_Name | Supplier_State | Create_Date | End_Date | Current_Flag |
---|---|---|---|---|---|---|---|
123 | 1 | ABC | Acme Supply Co, | CA | 01-Jan-2000 | 31-Dec-2009 | Y |
The Current_State and the Historical_State are the same. The optional Current_Flag attribute indicates that this is the current or most recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing, however a row key is included to ensure we have a unique key for each row:
Supplier_Key | Row_Key | Supplier_Code | Supplier_Name | Supplier_State | Create_Date | End_Date | Current_Flag |
---|---|---|---|---|---|---|---|
123 | 1 | ABC | Acme Supply Co, | CA | 01-Jan-2000 | 21-Dec-2004 | N |
123 | 2 | ABC | Acme Supply Co. | IL | 22-Dec-2004 | 31-Dec-2009 | Y |
We overwrite the Current_Flag information in the first record (Row_Key = 1) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing.
For example, if the supplier were to relocate again, we would add another record to the Supplier dimension, and we would overwrite the contents of the Current_State column:
Supplier_Key | Row_Key | Supplier_Code | Supplier_Name | Supplier_State | Create_Date | End_Date | Current_Flag |
---|---|---|---|---|---|---|---|
123 | 1 | ABC | Acme Supply Co, | NY | 01-Jan-2000 | 21-Dec-2004 | N |
123 | 2 | ABC | Acme Supply Co, | NY | 22-Dec-2004 | 03-Feb-2008 | N |
123 | 3 | ABC | Acme Supply Co, | NY | 04-Feb-2008 | 31-Dec-2009 | Y |
Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.