• Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data because they are typically huge and often originate from multiple, heterogeneous sources.
• Low-quality data will lead to low-quality mining results.
• Data Preprocessing helps to improve the quality of the data and, consequently, of the mining results. There are several data preprocessing techniques.
Data cleaning can be applied to remove noise and correct inconsistencies in data.
Data integration merges data from multiple sources into a coherent data store such as a data warehouse.
Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.
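As an illustration of the scaling described above, here is a minimal sketch of min-max normalization in plain Python. The function name `min_max_normalize`, its parameters, and the sample income values are assumptions made for this example; they do not come from any particular library.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale;
        # map everything to new_min (an arbitrary but safe choice).
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical attribute values (annual income), made up for illustration.
incomes = [12000, 35000, 58000, 98000]
scaled = min_max_normalize(incomes)
print(scaled)
```

The smallest income maps to 0.0 and the largest to 1.0. After scaling, an attribute with a large raw range (such as income) no longer dominates an attribute with a small range (such as age) when distances between records are computed.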
Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
- Accuracy: correct or wrong, accurate or not
- Completeness: not recorded, unavailable, ..
- Consistency: some modified but some not, dangling, ..
- Timeliness: timely update?
- Believability: how much are the data trusted to be correct?
- Interpretability: how easily can the data be understood?
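Of the measures above, completeness is the easiest to quantify. A rough sketch, assuming records are stored as dictionaries and that `None` or an empty string means "not recorded" (the `completeness` helper and the sample records are hypothetical):

```python
def completeness(records, attributes):
    """Return, per attribute, the fraction of records with a recorded value."""
    total = len(records)
    result = {}
    for attr in attributes:
        # Treat None and the empty string as "not recorded" (an assumption).
        present = sum(1 for r in records if r.get(attr) not in (None, ""))
        result[attr] = present / total if total else 0.0
    return result


# Made-up customer records for illustration only.
customers = [
    {"name": "Ann", "age": 34, "city": "Pune"},
    {"name": "Bob", "age": None, "city": "Mumbai"},
    {"name": "Cal", "age": 51, "city": ""},
    {"name": "Dee", "age": None, "city": "Delhi"},
]
scores = completeness(customers, ["name", "age", "city"])
print(scores)
```

Here `name` is fully recorded, `age` is recorded for half of the records, and `city` for three out of four. The other measures (believability, interpretability) depend on the users and the application and are harder to reduce to one number.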
There are many possible reasons for inaccurate data (i.e., having incorrect attribute values).
- The data collection instruments used may be faulty.
- There may have been human or computer errors occurring at data entry.
- Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is known as disguised missing data.
- Errors in data transmission can also occur.
- There may be technology limitations, such as a limited buffer size for coordinating synchronized data transfer and consumption.
- Incorrect data may also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields (e.g., date).
- Duplicate tuples also require data cleaning.
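Three of the problems listed above (inconsistent date formats, disguised missing data, and duplicate tuples) can be checked mechanically. The following is only a sketch: the helper names, the list of date layouts, the 0.3 threshold, and the sample rows are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

# Date layouts assumed to occur in the input; extend as needed.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def parse_date(text):
    """Try each known layout; return a date, or None if none matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def suspicious_defaults(dates, threshold=0.3):
    """Flag (month, day) pairs that make up more than `threshold` of the values.

    An over-represented pair such as (1, 1) hints at a default value that
    users left unchanged, i.e. disguised missing data.
    """
    counts = Counter((d.month, d.day) for d in dates if d is not None)
    n = sum(counts.values())
    return {md for md, c in counts.items() if n and c / n > threshold}


def drop_duplicates(rows):
    """Keep the first occurrence of each tuple, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Made-up (name, birthday) tuples with mixed date formats.
raw = [
    ("Ann", "1990-01-01"),
    ("Bob", "01/01/1985"),
    ("Cal", "14.03.1979"),
    ("Dee", "1972-01-01"),
    ("Ann", "1990-01-01"),  # exact duplicate of the first tuple
]
rows = drop_duplicates(raw)
birthdays = [parse_date(d) for _, d in rows]
flagged = suspicious_defaults(birthdays)
print(rows, birthdays, flagged)
```

A flagged value is only a hint: some people really are born on January 1, so such records should be reviewed rather than discarded automatically.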
Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization
  - Normalization
  - Concept hierarchy generation
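A few of these tasks can be sketched in a handful of lines. The example below shows one possible way to fill in missing values (with the attribute mean), smooth noisy data (by equal-frequency bin means), and replace numeric values with higher-level concepts. The function names, the age cut-offs, and the sample data are assumptions made for illustration; other strategies (e.g., the median, regression, or clustering) are equally valid.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the recorded values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate the mean from
    m = mean(known)
    return [m if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size` values each,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for i in range(0, len(ordered), bin_size):
        bin_values = ordered[i:i + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def age_to_concept(age):
    """Map a numeric age to a higher-level label (cut-offs are assumptions)."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"


# Made-up sample data.
ages = [20, None, 40]
filled = fill_missing_with_mean(ages)

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)

labels = [age_to_concept(a) for a in filled]
print(filled, smoothed, labels)
```

With a bin size of 3, the nine prices fall into three bins whose means are 9, 22, and 29, so each price is replaced by one of those three values.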