A concept hierarchy reduces the data by collecting low-level concepts (such as numeric values for the attribute age) and replacing them with higher-level concepts (such as young, middle-aged, or senior).
Concept hierarchy generation for numeric data is as follows:
- Binning (see sections before)
- Histogram analysis (see sections before)
- Clustering analysis (see sections before)
- Entropy-based discretization
- Segmentation by natural partitioning
Binning
- In binning, first sort the data and partition it into (equi-depth) bins; then smooth each bin by its mean, by its median, by its boundaries, etc.
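The binning steps above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample `prices` list are assumptions, not part of the original answer, and the sketch assumes there are at least as many values as bins.

```python
from statistics import mean

def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

# Hypothetical sample data, used only for illustration.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by bin medians would follow the same pattern with `statistics.median` in place of `mean`.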
Histogram analysis
- A histogram is a popular data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems.
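As a rough sketch of the bucket idea, the following Python builds an equi-width histogram that keeps only a count and a sum per bucket, from which the average can be derived. Equi-width bucketing is an assumption made here for simplicity (it is not the optimal dynamic-programming construction mentioned above), and the function name and sample values are hypothetical.

```python
def equi_width_histogram(values, n_buckets):
    """Summarize values as n_buckets equal-width buckets holding count and sum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(n_buckets)
    ]
    for v in values:
        # The maximum value (and the all-equal case) is clamped into a valid bucket.
        idx = 0 if width == 0 else min(int((v - lo) / width), n_buckets - 1)
        buckets[idx]["count"] += 1
        buckets[idx]["sum"] += v
    return buckets

# Hypothetical sample data, used only for illustration.
hist = equi_width_histogram([0, 2, 4, 11, 12, 21, 30], 3)
averages = [b["sum"] / b["count"] for b in hist if b["count"]]
```

Only the bucket summaries need to be stored; the original values can then be discarded.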
Clustering analysis
- Partition the data set into clusters and store only the cluster representations
- Can be very effective if the data is clustered, but not if the data is “smeared”
- Clustering can be hierarchical, and the clusters can be stored in multi-dimensional index tree structures
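To make "store the cluster representation only" concrete, here is a minimal one-dimensional k-means sketch that returns each cluster as a (centroid, size) pair. The choice of k-means, the evenly spaced initial centroids, and all names are assumptions made for illustration; the original answer does not prescribe a particular clustering algorithm.

```python
def assign(data, centroids):
    """Put each value into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in data:
        nearest = min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
        clusters[nearest].append(v)
    return clusters

def kmeans_1d(values, k, iterations=20):
    """Return a list of (centroid, cluster_size) pairs for 1-D data."""
    data = sorted(values)
    # Assumed initialization: k evenly spaced picks from the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(data, centroids)
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged
        centroids = updated
    return [(c, len(m)) for c, m in zip(centroids, assign(data, centroids))]

# Hypothetical sample data, used only for illustration.
summary = kmeans_1d([1, 2, 3, 20, 21, 22, 50, 51, 52], k=3)
```

Nine values are reduced here to three pairs, which works well because the sample is clearly clustered; on "smeared" data the centroids would represent the values poorly.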
Entropy-based discretization
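Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
$$E(T,S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$$
- S1 and S2 correspond to the samples in S satisfying the conditions A < v and A >= v, respectively, where v is the boundary value T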
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is applied recursively to the resulting partitions until some stopping criterion is met, e.g., when the information gain Ent(S) - E(T,S) no longer exceeds a threshold δ
- Experiments show that it may reduce data size and improve classification accuracy
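The recursive procedure can be sketched as follows. This is a hedged illustration, not a reference implementation: candidate boundaries are taken as midpoints between adjacent distinct values (an assumption), the weighted entropy is computed as the size-weighted sum of the entropies of the two partitions, and the names `entropy`, `best_split`, `discretize`, and the default `delta` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(S): class entropy of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(samples):
    """Return (T, E(T,S)) minimizing the weighted entropy, or None."""
    ordered = sorted(samples, key=lambda s: s[0])
    values = [v for v, _ in ordered]
    labels = [c for _, c in ordered]
    n, best = len(ordered), None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # no boundary between equal values
        t = (values[i - 1] + values[i]) / 2
        e = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        if best is None or e < best[1]:
            best = (t, e)
    return best

def discretize(samples, delta=0.1):
    """Recursively collect boundaries while the gain Ent(S) - E(T,S) exceeds delta."""
    split = best_split(samples)
    if split is None or entropy([c for _, c in samples]) - split[1] <= delta:
        return []
    t = split[0]
    left = [s for s in samples if s[0] < t]
    right = [s for s in samples if s[0] >= t]
    return discretize(left, delta) + [t] + discretize(right, delta)

# Hypothetical (value, class) samples, used only for illustration.
samples = [(1, "a"), (2, "a"), (3, "a"), (10, "b"), (11, "b"), (12, "b")]
boundaries = discretize(samples)
```

The returned boundaries define the intervals of the discretized attribute; a larger `delta` stops the recursion earlier and yields fewer intervals.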
Segmentation by natural partitioning
- The 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
- If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 intervals (equi-width for 3, 6, and 9; grouped 2-3-2 for 7)
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 equi-width intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 equi-width intervals
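A single level of the 3-4-5 rule might look like the sketch below. It rounds the range outward at the most significant digit, counts the distinct values at that digit, and splits accordingly; for 7 distinct values it uses the 2-3-2 grouping from the usual textbook statement of the rule. The function name and the sample ranges are assumptions, and the sketch assumes the range is not [0, 0] and does not recurse into the resulting intervals.

```python
from math import ceil, floor, log10

def natural_partition(low, high):
    """Split [low, high] into 3, 4 or 5 'natural' intervals (one level only)."""
    # Unit of the most significant digit of the larger endpoint.
    msd = 10 ** floor(log10(max(abs(low), abs(high))))
    low_r = floor(low / msd) * msd    # round low down at the msd
    high_r = ceil(high / msd) * msd   # round high up at the msd
    distinct = round((high_r - low_r) / msd)
    if distinct in (3, 6, 9):
        shares = [distinct / 3] * 3   # 3 equi-width intervals
    elif distinct == 7:
        shares = [2, 3, 2]            # 2-3-2 grouping
    elif distinct in (2, 4, 8):
        shares = [distinct / 4] * 4   # 4 equi-width intervals
    elif distinct in (1, 5, 10):
        shares = [distinct / 5] * 5   # 5 equi-width intervals
    else:
        raise ValueError(f"rule does not cover {distinct} distinct values")
    edges = [low_r]
    for s in shares:
        edges.append(edges[-1] + s * msd)
    return list(zip(edges, edges[1:]))

# Hypothetical ranges, used only for illustration.
three = natural_partition(-159876, 1838761)
seven = natural_partition(0, 700)
```

Applying the same function to each returned interval would produce the next, finer level of the hierarchy.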
Concept hierarchy generation for categorical data is as follows:
Specification of a set of attributes, but not of their partial ordering
- Automatically generate the attribute ordering based on the observation that an attribute defining a high-level concept has fewer distinct values than an attribute defining a lower-level concept
- Example: country (15), state_or_province (365), city (3567), street (674,339)
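The distinct-value heuristic is simple to sketch: count the distinct values of each attribute and sort the attributes by that count, fewest first. The function name and the tiny `rows` table below are hypothetical and serve only to illustrate the ordering step.

```python
def order_by_distinct_values(rows, attributes):
    """Order attributes from highest-level (fewest distinct values) to lowest."""
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a])

# Hypothetical location records, used only for illustration.
rows = [
    {"street": "1 Main St", "city": "Vancouver", "state_or_province": "BC", "country": "Canada"},
    {"street": "2 Oak Ave", "city": "Victoria", "state_or_province": "BC", "country": "Canada"},
    {"street": "3 Pine Rd", "city": "Toronto", "state_or_province": "Ontario", "country": "Canada"},
    {"street": "4 Elm St", "city": "Toronto", "state_or_province": "Ontario", "country": "Canada"},
]
hierarchy = order_by_distinct_values(
    rows, ["street", "city", "state_or_province", "country"]
)
```

The heuristic is not foolproof, so the generated ordering is best treated as a proposal to be reviewed rather than a final hierarchy.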
Specification of only a partial set of attributes
- Try to parse the database schema to determine the complete hierarchy