Data Integration
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id = B.cust-#
Integrate metadata from different sources
■ Entity identification problem:
- Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
■ Detecting and resolving data value conflicts
- For the same real-world entity, attribute values from different sources may differ
- Possible reasons: different representations or different scales, e.g., metric vs. British units, or GPA scales in the US and China
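The schema-integration and value-conflict points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `height_cm`/`height_in` fields, and the helper names `to_shared_schema` and `integrate` are hypothetical and do not come from the notes.

```python
# Hypothetical records from two sources; field names and values are illustrative.
source_a = [{"cust-id": 101, "name": "William Clinton", "height_cm": 188.0}]
source_b = [{"cust-#": 101, "name": "Bill Clinton", "height_in": 74.0}]

# Schema integration: map source-B field names onto the shared schema.
FIELD_MAP_B = {"cust-#": "cust-id", "height_in": "height_cm"}
INCH_TO_CM = 2.54


def to_shared_schema(record):
    """Rename source-B fields and convert inches to centimetres."""
    out = {}
    for key, value in record.items():
        target = FIELD_MAP_B.get(key, key)
        if key == "height_in":
            value = round(value * INCH_TO_CM, 2)
        out[target] = value
    return out


def integrate(records_a, records_b):
    """Merge records by cust-id, keeping source-A values and listing conflicts."""
    merged = {r["cust-id"]: dict(r) for r in records_a}
    conflicts = []
    for raw in records_b:
        rec = to_shared_schema(raw)
        key = rec["cust-id"]
        if key not in merged:
            merged[key] = rec
            continue
        for field, value in rec.items():
            # Same entity, same attribute, different value: a data value conflict.
            if field in merged[key] and merged[key][field] != value:
                conflicts.append((key, field, merged[key][field], value))
            merged[key].setdefault(field, value)
    return merged, conflicts


merged, conflicts = integrate(source_a, source_b)
for conflict in conflicts:
    print(conflict)
```

Here the name mismatch (William vs. Bill) is an entity identification issue, while the height mismatch remains after unit conversion and would need a resolution rule, such as trusting one source or applying a tolerance.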
Handling Redundancy in Data Integration
■ Redundant data often occur when multiple databases are integrated
• Object identification: the same attribute or object may have different names in different databases
• Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
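As a rough illustration of the "derivable data" case, the sketch below checks whether one attribute is simply the sum of others in every row. The quarterly columns, the sample figures, and the helper name `is_derivable` are assumptions made for the example, not part of the notes.

```python
def is_derivable(rows, derived, sources, tol=1e-9):
    """Return True if `derived` equals the sum of `sources` in every row."""
    return all(
        abs(row[derived] - sum(row[s] for s in sources)) <= tol
        for row in rows
    )


# Hypothetical table: annual revenue stored alongside quarterly revenue.
rows = [
    {"q1": 10, "q2": 12, "q3": 9, "q4": 14, "annual_revenue": 45},
    {"q1": 20, "q2": 18, "q3": 22, "q4": 25, "annual_revenue": 85},
]

quarters = ["q1", "q2", "q3", "q4"]
if is_derivable(rows, "annual_revenue", quarters):
    print("annual_revenue is redundant: it can be derived from the quarters")
```

A check like this only covers one assumed derivation rule (a plain sum); other derived attributes would need their own rule.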
Correlation Analysis (Numerical Data)
• Correlation coefficient (also called Pearson's product moment coefficient)
$$ r_{A, B}=\frac{\sum(A-\bar{A})(B-\bar{B})}{(n-1) \sigma_{A} \sigma_{B}}=\frac{\sum(A B)-n \bar{A} \bar{B}}{(n-1) \sigma_{A} \sigma_{B}} $$
where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of $A$ and $B$, $\sigma_{A}$ and $\sigma_{B}$ are the respective standard deviations of $A$ and $B$, and $\sum(A B)$ is the sum of the $AB$ cross-products.
- If $r_{A, B}>0$, $A$ and $B$ are positively correlated ($A$'s values increase as $B$'s do). The higher the value, the stronger the correlation.
- $r_{A, B}=0$: uncorrelated, i.e., no linear relationship (this alone does not guarantee independence)
- $r_{A, B}<0$: negatively correlated
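The coefficient can be computed directly from its definition. The following is a minimal sketch using only the standard library; the function name `pearson_r` and the sample lists are illustrative, and the standard deviations use the $n-1$ denominator to match the $(n-1)\sigma_{A}\sigma_{B}$ term in the formula.

```python
import math


def pearson_r(a, b):
    """Pearson correlation: sum((A - mean_A)(B - mean_B)) / ((n - 1) * sd_A * sd_B)."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1 in the denominator).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    if sd_a == 0 or sd_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))        # perfectly increasing together
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))        # perfectly decreasing
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # B = A squared
```

In the last call $B$ is fully determined by $A$, yet the coefficient is zero, because the measure only captures linear association. When screening for redundant attributes, a value close to $+1$ or $-1$ suggests one of the two attributes may be dropped.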
Data Transformation
■ Smoothing: remove noise from data
■ Aggregation: summarization
■ Generalization: concept hierarchy climbing
■ Normalization: scaled to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
■ Attribute/feature construction
- New attributes constructed from the given ones
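The three normalization methods listed above can be sketched as follows. The function names and sample values are illustrative, and the z-score version assumes the population standard deviation (`statistics.pstdev`); a sample standard deviation would be an equally reasonable choice.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max: v' = (v - min) / (max - min) * (new_max - new_min) + new_min."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max is undefined when all values are equal")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Z-score: v' = (v - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mean) / sd for v in values]


def decimal_scaling(values):
    """Decimal scaling: v' = v / 10**j, with j the smallest integer giving max|v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```

Min-max preserves the relative spacing of the values but is sensitive to outliers, since a single extreme value fixes the range. Z-score is useful when the true minimum and maximum are unknown, and decimal scaling only shifts the decimal point.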