Discuss different steps involved in Data Preprocessing.
1 Answer

Steps of data preprocessing:

1. Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

2. Data integration: combine data from multiple databases, data cubes, or files.

3. Data transformation: normalization and aggregation.

4. Data reduction: reduce the volume of the data while producing the same or similar analytical results.

5. Data discretization: part of data reduction, replacing numerical attributes with nominal ones.

Data cleaning

  1. Fill in missing values (attribute or class value):

    ⦁ Ignore the tuple: usually done when the class label is missing.

    ⦁ Use the attribute mean (or the majority nominal value) to fill in the missing value.

    ⦁ Use the attribute mean (or the majority nominal value) computed over all samples belonging to the same class.

    ⦁ Predict the missing value by using a learning algorithm: treat the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or a decision tree) to predict it.

  2. Identify outliers and smooth out noisy data:

    ⦁ Binning

    1. Sort the attribute values and partition them into bins (for example equal-width or equal-frequency bins, as in unsupervised discretization).

    2. Then smooth by bin means, bin medians, or bin boundaries.

    ⦁ Clustering: group values in clusters and then detect and remove outliers (automatic or manual)

    ⦁ Regression: smooth by fitting the data to regression functions.

  3. Correct inconsistent data: use domain knowledge or expert decision.
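
Two of the cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: rows are dictionaries, a missing value is represented by None, every class has at least one known value, and bins are equal-frequency. The function names and sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_mean(rows, attr, label):
    """Fill None in `attr` with the mean of `attr` over rows of the same class."""
    known = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            known[row[label]].append(row[attr])
    # Assumption: every class has at least one known value for `attr`.
    class_means = {cls: mean(vals) for cls, vals in known.items()}
    return [
        {**row, attr: class_means[row[label]]} if row[attr] is None else dict(row)
        for row in rows
    ]


def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into equal-frequency bins, use bin means."""
    ordered = sorted(values)
    size = -(-len(ordered) // n_bins)  # ceiling division
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical sample data.
rows = [
    {"income": 30, "cls": "low"},
    {"income": None, "cls": "low"},
    {"income": 50, "cls": "low"},
    {"income": 90, "cls": "high"},
]
filled = fill_with_class_mean(rows, "income", "cls")

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

Here the missing income in class "low" becomes 40, the mean of 30 and 50, and the twelve prices fall into three bins of four values whose means are 9, 22.75 and 29.25. Smoothing by bin medians or bin boundaries would only change the value written back into each bin.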

Data transformation

  1. Normalization:

    ⦁ Scaling attribute values to fall within a specified range. Example: to transform V in [Min, Max] to V' in [0, 1], apply V' = (V - Min) / (Max - Min).

    ⦁ Scaling by using the mean and standard deviation (useful when Min and Max are unknown or when there are outliers): V' = (V - Mean) / StDev.

  2. Aggregation: moving up in the concept hierarchy on numeric attributes.

  3. Generalization: moving up in the concept hierarchy on nominal attributes.

  4. Attribute construction: replacing existing attributes with, or adding, new attributes derived from existing ones.
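
The two normalization formulas above translate directly into code. The sketch below uses only the Python standard library; the function names are hypothetical, the population standard deviation is an assumption (the answer does not say which variant is meant), and mapping a constant attribute to 0 is likewise just one reasonable choice.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """V' = (V - Min) / (Max - Min), mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute is mapped to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """V' = (V - Mean) / StDev, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant attribute is mapped to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample data.
ages = [20, 30, 40, 50, 60]
scaled = min_max_scale(ages)
standardized = z_score_scale(ages)
```

For these ages, min-max scaling gives 0, 0.25, 0.5, 0.75 and 1, while the standardized values are centred on 0 with a standard deviation of 1.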

Data reduction

  1. Reducing the number of attributes

    ⦁ Data cube aggregation: applying roll-up, slice or dice operations.

    ⦁ Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space.

    ⦁ Principal component analysis (numeric attributes only): searching for a lower dimensional space that can best represent the data.

  2. Reducing the number of attribute values

    ⦁ Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).

    ⦁ Clustering: grouping values in clusters.

    ⦁ Aggregation or generalization

  3. Reducing the number of tuples

    ⦁ Sampling
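
As a rough illustration of two of the reduction techniques listed above, the sketch below replaces numeric values with equal-width bin indices and draws a simple random sample of tuples without replacement. Equal-width bins, the fixed seed, and all names are assumptions made for the example; other binning or sampling schemes (equal-frequency, stratified) fit the same description.

```python
import random


def equal_width_bins(values, n_bins):
    """Replace each numeric value with the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # Assumption: a constant attribute falls entirely into bin 0.
        return [0 for _ in values]
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def sample_tuples(rows, k, seed=0):
    """Simple random sample of k tuples without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(rows, k)


# Hypothetical sample data.
temps = [0, 5, 10, 15, 20, 25, 30]
bins = equal_width_bins(temps, 3)

rows = list(range(100))
subset = sample_tuples(rows, 10)
```

The seven temperatures collapse to three distinct bin indices, and the sample keeps 10 of the 100 tuples.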
