Steps of data preprocessing:
1. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
2. Data integration: combine data from multiple databases, data cubes, or files.
3. Data transformation: normalization and aggregation.
4. Data reduction: reduce the data volume while producing the same or similar analytical results.
5. Data discretization: part of data reduction; replace numerical attributes with nominal ones.
Data cleaning
Fill in missing values (attribute or class value):
⦁ Ignore the tuple: usually done when the class label is missing.
⦁ Use the attribute mean (or the majority nominal value) to fill in the missing value.
⦁ Use the attribute mean (or the majority nominal value) of all samples belonging to the same class.
⦁ Predict the missing value by using a learning algorithm: treat the attribute with the missing value as the dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.
Identify outliers and smooth out noisy data:
⦁ Binning
Sort the attribute values and partition them into bins (see "Unsupervised discretization" below).
Then smooth by bin means, bin medians, or bin boundaries.
⦁ Clustering: group values in clusters and then detect and remove outliers (automatic or manual)
⦁ Regression: smooth by fitting the data to regression functions.
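The binning step above can be illustrated with a minimal sketch. It assumes equal-depth (equal-frequency) bins of a fixed size, which is only one way to partition the sorted values; the function names and the sample prices are made up for illustration.

```python
from statistics import mean


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace
    every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = mean(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def smooth_by_bin_boundaries(values, bin_size):
    """Same partitioning, but replace every value by the closer of the
    bin's lowest and highest value (ties go to the lower boundary)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        low, high = bin_values[0], bin_values[-1]
        smoothed.extend(low if v - low <= high - v else high for v in bin_values)
    return smoothed


# Illustrative data only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))       # [9, 9, 9, 22, 22, 22, 29, 29, 29]
print(smooth_by_bin_boundaries(prices, 3))  # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```

Smoothing by bin medians would follow the same pattern with `statistics.median` in place of `mean`.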
Correct inconsistent data: use domain knowledge or expert decision.
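Going back to the missing-value strategies listed at the start of this section, the two mean-based ones can be sketched as follows. This is a rough illustration, assuming that missing numeric values are represented as `None`; the function names and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_with_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    overall_mean = mean(observed)
    return [overall_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace each None with the mean of the observed values that
    share the same class label.

    Assumes every class has at least one observed value; otherwise
    the lookup below raises KeyError.
    """
    by_class = defaultdict(list)
    for value, label in zip(values, labels):
        if value is not None:
            by_class[label].append(value)
    class_means = {label: mean(vs) for label, vs in by_class.items()}
    return [
        class_means[label] if value is None else value
        for value, label in zip(values, labels)
    ]


# Illustrative data only.
income = [30, None, 50, 70, None, 90]
labels = ["a", "a", "a", "b", "b", "b"]
print(fill_with_mean(income))                # [30, 60, 50, 70, 60, 90]
print(fill_with_class_mean(income, labels))  # [30, 40, 50, 70, 80, 90]
```

For a nominal attribute the same structure applies with the most frequent value in place of the mean.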
Data transformation
Normalization:
⦁ Scaling attribute values to fall within a specified range. Example: to transform V in [Min, Max] to V' in [0, 1], apply V' = (V - Min) / (Max - Min)
⦁ Scaling by using mean and standard deviation (useful when min and max are unknown or when there are outliers): V' = (V - Mean) / StDev
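Both normalization formulas above translate directly into code. The sketch below is illustrative; it assumes the population standard deviation (`statistics.pstdev`) for StDev, whereas the sample standard deviation (`statistics.stdev`) is an equally common choice.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """V' = (V - Min) / (Max - Min), mapping the values onto [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - low) / (high - low) for v in values]


def z_score_scale(values):
    """V' = (V - Mean) / StDev, using the population standard deviation
    (an assumption of this sketch)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Illustrative data only.
values = [10, 20, 30, 40, 50]
print(min_max_scale(values))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score_scale(values))
```

After z-score scaling the values have mean 0 and standard deviation 1, so a single extreme value does not compress all the other values into a narrow range the way it does with min-max scaling.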
Aggregation: moving up in the concept hierarchy on numeric attributes.
Generalization: moving up in the concept hierarchy on nominal attributes.
Attribute construction: replacing existing attributes with, or adding, new attributes derived from existing ones.
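The three transformations above can be sketched on a toy record set. Everything in this example is hypothetical: the city-to-country hierarchy, the attribute names, and the sales figures are invented purely to show the mechanics.

```python
# Hypothetical concept hierarchy for a nominal attribute: city -> country.
CITY_TO_COUNTRY = {"Mumbai": "India", "Pune": "India", "Paris": "France"}


def construct_revenue(record):
    """Attribute construction: add a new attribute derived from existing ones."""
    return {**record, "revenue": record["units"] * record["unit_price"]}


def generalize_city(record):
    """Generalization: replace a nominal value by its parent in the hierarchy."""
    generalized = {k: v for k, v in record.items() if k != "city"}
    generalized["country"] = CITY_TO_COUNTRY[record["city"]]
    return generalized


def aggregate(records, key, measure):
    """Aggregation: total a numeric attribute at the coarser level."""
    totals = {}
    for record in records:
        totals[record[key]] = totals.get(record[key], 0) + record[measure]
    return totals


# Illustrative data only.
sales = [
    {"city": "Mumbai", "units": 3, "unit_price": 10},
    {"city": "Pune", "units": 2, "unit_price": 10},
    {"city": "Paris", "units": 5, "unit_price": 8},
]
transformed = [generalize_city(construct_revenue(r)) for r in sales]
print(aggregate(transformed, "country", "revenue"))  # {'India': 50, 'France': 40}
```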
Data reduction
Reducing the number of attributes
⦁ Data cube aggregation: applying roll-up, slice, or dice operations.
⦁ Removing irrelevant attributes: attribute selection (filter and wrapper methods), searching the attribute space.
⦁ Principal component analysis (numeric attributes only): searching for a lower-dimensional space that can best represent the data.
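To make the idea of principal component analysis more concrete, here is a minimal sketch that finds only the first principal component. It uses power iteration on the sample covariance matrix, which is a swapped-in simplification (libraries normally use an eigendecomposition or SVD), and it reduces each tuple to a single coordinate. The function names and the data are assumptions for illustration.

```python
import math
from statistics import mean


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("zero vector; the data may have no variance")
    return [x / norm for x in vec]


def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of the direction of largest variance)."""
    n, d = len(rows), len(rows[0])
    means = [mean(column) for column in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Start from the covariance column of the highest-variance attribute,
    # then repeatedly multiply by the covariance matrix and renormalize.
    start = max(range(d), key=lambda k: cov[k][k])
    vec = _normalize([cov_row[start] for cov_row in cov])
    for _ in range(iterations):
        vec = _normalize([sum(c * v for c, v in zip(cov_row, vec)) for cov_row in cov])
    return means, vec


def project(rows, means, component):
    """Replace each tuple by its coordinate along the component."""
    return [
        sum((x - m) * c for x, m, c in zip(row, means, component))
        for row in rows
    ]


# Illustrative data only: the second attribute is exactly twice the first,
# so one dimension is enough to represent both.
rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
means, component = first_principal_component(rows)
print(component)  # a unit vector proportional to (1, 2)
print(project(rows, means, component))
```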
Reducing the number of attribute values
⦁ Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
⦁ Clustering: grouping values in clusters.
⦁ Aggregation or generalization
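As an illustration of reducing the number of attribute values by binning, the sketch below maps each numeric value to the index of an equal-width interval. Equal-width intervals are an assumption of this example; equal-frequency bins are another common choice. The function name and the ages are made up.

```python
def equal_width_bin_indices(values, n_bins):
    """Map each value to the index (0 .. n_bins - 1) of its equal-width bin."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; cannot form intervals")
    width = (high - low) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


# Illustrative data only: eight distinct ages collapse to four bin labels.
ages = [22, 25, 31, 38, 45, 52, 60, 62]
print(equal_width_bin_indices(ages, 4))  # [0, 0, 0, 1, 2, 3, 3, 3]
```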
Reducing the number of tuples
⦁ Sampling
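Sampling can be sketched with the standard `random` module. The example shows simple random sampling without replacement and, as one common variant that is not spelled out above, stratified sampling that keeps roughly the same fraction of every class. The function names, the `label_of` parameter, and the data are hypothetical.

```python
import random
from collections import defaultdict


def simple_random_sample(tuples, n, seed=None):
    """Pick n tuples uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(tuples, n)


def stratified_sample(tuples, label_of, fraction, seed=None):
    """Sample the given fraction from each class separately
    (at least one tuple per class)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in tuples:
        strata[label_of(t)].append(t)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Illustrative data only: (id, class label) pairs, 8 of class "a", 4 of class "b".
data = [(i, "a") for i in range(8)] + [(i, "b") for i in range(8, 12)]
print(simple_random_sample(data, 3, seed=0))
print(stratified_sample(data, lambda t: t[1], 0.25, seed=0))
```

Passing a `seed` makes the sample reproducible, which helps when comparing results on the reduced and the full data set.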