Why is Data Preprocessing required? Explain the different steps involved in Data Preprocessing

Data Preprocessing is required because:

  • Real-world data are generally:

    Incomplete: Missing attribute values, missing certain attributes of importance, or having only aggregate data

    Noisy: Containing errors or outliers

    Inconsistent: Containing discrepancies in codes or names

  • Steps in Data preprocessing:

    1. Data cleaning:

    • Data cleaning is also called data cleansing or scrubbing.
    • It fills in missing values, smooths noisy data, identifies or removes outliers, and resolves inconsistencies (a short pandas sketch follows step 1.7 below).
    • Data cleaning is required because source systems contain “dirty data” that must be cleaned before loading.
    • Steps in Data cleaning:

    1.1 Parsing:

    • Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
    • Examples include parsing a full name into its first, middle, and last name.
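For illustration, a minimal Python sketch of name parsing (the splitting rule and field names are assumptions; production parsers also handle prefixes, suffixes, and multi-part surnames):

```python
def parse_name(full_name: str) -> dict:
    # Naive rule: first token = first name, last token = last name,
    # anything in between = middle name(s).
    parts = full_name.strip().split()
    first = parts[0]
    last = parts[-1] if len(parts) > 1 else ""
    middle = " ".join(parts[1:-1])
    return {"first": first, "middle": middle, "last": last}

print(parse_name("John Fitzgerald Kennedy"))
# {'first': 'John', 'middle': 'Fitzgerald', 'last': 'Kennedy'}
```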

    1.2 Correcting:

    • Corrects the parsed individual data components using sophisticated algorithms and secondary data sources.
    • Examples include replacing a vanity address with the official street address and adding a missing ZIP code (a combined sketch follows step 1.3).

    1.3 Standardizing:

    • Standardizing applies conversion routines to transform data into its preferred and consistent format using both standard and custom business rules.
    • Examples include adding a prename (e.g. Mr./Ms.) and replacing a nickname with the formal name.
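A combined sketch of correcting (1.2) and standardizing (1.3); the lookup tables below are hypothetical stand-ins for real secondary data sources such as postal directories:

```python
# Hypothetical secondary sources — real systems use postal files, etc.
ZIP_BY_CITY = {"Mumbai": "400001"}
FORMAL_NAMES = {"Bob": "Robert", "Liz": "Elizabeth"}

def correct_and_standardize(record: dict) -> dict:
    rec = dict(record)
    # Correcting: fill in a missing ZIP code from a secondary source
    if not rec.get("zip"):
        rec["zip"] = ZIP_BY_CITY.get(rec.get("city", ""), "")
    # Standardizing: replace a nickname with its formal equivalent
    rec["first"] = FORMAL_NAMES.get(rec["first"], rec["first"])
    return rec

print(correct_and_standardize({"first": "Bob", "city": "Mumbai", "zip": ""}))
# {'first': 'Robert', 'city': 'Mumbai', 'zip': '400001'}
```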

    1.4 Matching:

    • Matching searches for and matches records within and across the parsed, corrected, and standardized data, based on predefined business rules, to eliminate duplicates.
    • Examples include identifying similar names and addresses.
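A minimal matching sketch using Python's standard difflib; the single-field comparison and the 0.85 threshold are assumptions — real systems compare several fields and tune their thresholds:

```python
from difflib import SequenceMatcher

def is_match(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    # Similarity ratio in [0, 1]; records at or above the threshold are
    # flagged as likely duplicates.
    ratio = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return ratio >= threshold

print(is_match("Robert Smith", "Robert Smyth"))  # True  — likely duplicates
print(is_match("Robert Smith", "Alice Jones"))   # False
```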

    1.5 Consolidating:

    • Analyzing and identifying relationships between matched records and consolidating/merging them into one representation.
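A sketch of consolidating two matched records into one representation; the “first non-empty value wins” precedence rule is an assumption:

```python
def consolidate(rec_a: dict, rec_b: dict) -> dict:
    keys = list(rec_a) + [k for k in rec_b if k not in rec_a]
    # Precedence rule (assumed): prefer rec_a's value unless it is empty.
    return {k: rec_a.get(k) or rec_b.get(k) for k in keys}

a = {"name": "Robert Smith", "phone": "", "zip": "400001"}
b = {"name": "Robert Smyth", "phone": "555-0100", "zip": ""}
print(consolidate(a, b))
# {'name': 'Robert Smith', 'phone': '555-0100', 'zip': '400001'}
```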

    1.6 Data cleansing must deal with many types of possible errors:

    • These include missing data and incorrect data at one source.

    1.7 Data Staging:

    • Accumulates data from asynchronous sources.
    • At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse.
    • There is usually no end user access to the staging file.
    • An operational data store may be used for data staging.
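The pandas sketch referenced in step 1: filling missing values and smoothing an outlier (the column names, the median fill, and the 95th-percentile cap are all assumptions):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 31, 240],   # None = missing, 240 = outlier
                   "city": ["Pune", "Pune", None, "Mumbai"]})

df["age"] = df["age"].fillna(df["age"].median())            # fill missing value
df["age"] = df["age"].clip(upper=df["age"].quantile(0.95))  # cap the outlier
df["city"] = df["city"].fillna("Unknown")                   # flag missing category
print(df)
```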


2. Data integration and Transformation:

  • Data integration: Combines data from multiple sources into a coherent data store, e.g. a data warehouse.

  • Sources may include multiple databases, data cubes or data files.

    Issues in data integration:

    • Schema integration:

      • Integrate metadata from different sources.
      • Entity identification problem: identify real-world entities from multiple data sources, e.g. A.cust-id ≡ B.cust# (a sketch follows this list).
    • Detecting and resolving data value conflicts:

      • For the same real world entity, attribute values from different sources are different.
      • Possible reasons: different representations, different scales.
    • Redundant data often occur when integrating multiple databases:
      • The same attribute may have different names in different databases.
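The sketch referenced above: two sources name the customer key differently, so we map both to one schema and merge (the table and column names are assumptions):

```python
import pandas as pd

source_a = pd.DataFrame({"cust_id": [1, 2], "name": ["Asha", "Ravi"]})
source_b = pd.DataFrame({"cust_no": [1, 2], "balance": [500.0, 1200.0]})

# Entity identification: A.cust_id and B.cust_no refer to the same entity,
# so rename one key and merge into a coherent store.
source_b = source_b.rename(columns={"cust_no": "cust_id"})
combined = source_a.merge(source_b, on="cust_id")
print(combined)
```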


  • Data Transformation: The transformation process rectifies any inconsistencies found in the data.
  • One of the most common transformation issues is ‘Attribute Naming Inconsistency’: it is common for a given data element to be referred to by different data names in different databases.
  • E.g., Employee Name may be EMP_NAME in one database and ENAME in another.
  • Thus one set of data names is picked and used consistently in the data warehouse.
  • Once all the data elements have the right names, their values must be converted to common formats.
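A sketch of the EMP_NAME/ENAME example plus format conversion; the standard name and the date formats chosen here are assumptions:

```python
import pandas as pd

db1 = pd.DataFrame({"EMP_NAME": ["A. Rao"], "HIRE_DATE": ["2021-03-15"]})
db2 = pd.DataFrame({"ENAME": ["B. Das"], "HIRE_DATE": ["15/03/2022"]})

# Pick one data name and use it consistently in the warehouse
db1 = db1.rename(columns={"EMP_NAME": "employee_name"})
db2 = db2.rename(columns={"ENAME": "employee_name"})

# Convert values to a common format (here, a common date type)
db1["HIRE_DATE"] = pd.to_datetime(db1["HIRE_DATE"])
db2["HIRE_DATE"] = pd.to_datetime(db2["HIRE_DATE"], format="%d/%m/%Y")

warehouse = pd.concat([db1, db2], ignore_index=True)
print(warehouse)
```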


3. Data Reduction:

  • Obtains a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results.
  • Need for data reduction:

    • Reducing the number of attributes
    • Reducing the number of attribute values
    • Reducing the number of tuples
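A sketch showing each kind of reduction listed above on a toy table (the column names and the 10% sample rate are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"age": range(1000), "salary": range(1000),
                   "noise": range(1000)})

df = df.drop(columns=["noise"])                # fewer attributes
df["age"] = (df["age"] // 10) * 10             # fewer attribute values
sample = df.sample(frac=0.10, random_state=0)  # fewer tuples
print(len(sample))  # 100
```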


4. Discretization and Concept Hierarchy Generation (or summarization):

  • Discretization: Reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
  • Interval labels can then be used to replace actual data values.
  • Concept Hierarchies: Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
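A sketch of discretization plus a concept hierarchy for age using pandas; the bin edges and labels are assumptions:

```python
import pandas as pd

ages = pd.Series([19, 34, 52, 67, 45, 23])
# Divide the continuous range into intervals, then replace values with
# higher-level concept labels.
labels = pd.cut(ages, bins=[0, 30, 55, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())
# ['young', 'middle-aged', 'middle-aged', 'senior', 'middle-aged', 'young']
```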