Data discretization techniques reduce the number of values of a continuous attribute by dividing the attribute's range into intervals. Interval labels can then replace the actual data values.
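As a minimal sketch of replacing values with interval labels, the snippet below maps each number to the label of the interval it falls in. The function name `discretize`, the sample ages, the cut points and the labels are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

def discretize(values, cut_points, labels):
    """Replace each numeric value with the label of the interval it falls in.

    cut_points must be sorted, and there must be one label per interval,
    i.e. len(labels) == len(cut_points) + 1. A value equal to a cut point
    is assigned to the interval on its right.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one label per interval")
    return [labels[bisect_right(cut_points, v)] for v in values]

# Hypothetical ages and interval boundaries (assumptions, not from the text).
ages = [13, 25, 37, 41, 58, 64, 70]
age_labels = discretize(ages, cut_points=[30, 60],
                        labels=["young", "middle", "senior"])
print(age_labels)
```

Here seven distinct ages collapse to three labels, which is the reduction in the number of values described above. The cut points are fixed by hand in this sketch; a method such as entropy-based discretization would choose them from the data.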
Entropy-Based Discretization
• Entropy is one of the most commonly used discretization measures.
• Entropy-based discretization is a supervised, top-down splitting technique.
• It explores class distribution information in its calculation and determination of split-points (data values for partitioning an attribute range).
• Entropy-based discretization can reduce data size.
• Unlike the other methods mentioned here so far, entropy-based discretization uses class information.
• This makes it more likely that the interval boundaries (split-points) are defined to occur in places that may help improve classification accuracy.
• The entropy and information gain measures described here are also used for decision tree induction.
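• The expected information requirement for classifying a tuple in a data set D, after a split-point on attribute A partitions D into subsets D1 and D2, is given by:
$$\mathrm{Info}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Entropy}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Entropy}(D_2), \qquad \mathrm{Entropy}(D_j) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where p_i is the fraction of tuples in D_j that belong to class C_i, and m is the number of classes. The split-point that minimizes Info_A(D) is selected.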
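The split-point selection can be sketched as follows. This assumes the usual form of the expected information for a binary split, namely the size-weighted sum of the entropies of the two partitions, and it takes midpoints between consecutive distinct values as candidate split-points. The function names and the sample data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels.

    Written as sum(p * log2(1/p)), which equals -sum(p * log2(p)).
    """
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, expected_info) with the lowest expected information.

    Candidates are midpoints between consecutive distinct sorted values.
    Returns None if all values are identical (no split is possible).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if best is None or info < best[1]:
            best = ((lo + hi) / 2, info)
    return best

# Hypothetical attribute values and class labels, for illustration only.
values = [1, 2, 3, 10, 11, 12]
classes = ["no", "no", "no", "yes", "yes", "yes"]
split_point, info = best_split(values, classes)
print(split_point, info)
```

This performs a single split. Because the method is top-down, the same procedure would then be applied recursively to each resulting interval until some stopping criterion is met; that criterion is not covered in this sketch.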
DATA SUMMARIZATION
• Summarization is a key data mining concept which involves techniques for finding a compact description of a dataset.
• Simple summarization methods, such as tabulating means and standard deviations, are often applied for data analysis, data visualization and automated report generation.
• Clustering [13, 23] is another data mining technique that is often used to summarize large datasets.
• Summarization can be viewed as compressing a given set of transactions into a smaller set of patterns while retaining the maximum possible information.
• A trivial summary for a set of transactions would be itself.
• The information loss here is zero but there is no compaction.
• Another trivial summary would be the empty set, which stands for all the transactions.
• In this case the gain in compaction is maximum but the summary has no information content.
• A good summary is one which is small but still retains enough information about the data as a whole and also for each transaction.
Summarization Using Clustering
o Here we present a direct application of clustering to obtain a summary for a given set of transactions with categorical attributes.
o This simple algorithm involves clustering of the data using any standard clustering algorithm and then replacing each cluster with a representation using feature-wise intersection of all transactions in that cluster.
o The number of clusters here determines the compaction gain for the summary.
o Step 2 of the algorithm generates l clusters, while steps 3 and 4 generate the summary description for each of the individual clusters.
o For example, consider the sample data set of 8 transactions in Table.
o Suppose clustering generates two clusters for this data set: C1 = {T1, T2, T3, T4, T8} and C2 = {T5, T6, T7}.
o Table shows a summary obtained using the clustering based algorithm.
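The summary step can be sketched as below, assuming that "feature-wise intersection" means keeping an attribute value only when every transaction in the cluster shares it. The `"*"` wildcard for attributes that disagree, the attribute values, and the function name `summarize_cluster` are hypothetical; the values are not those of the original table. The cluster assignment is the one given in the text, where a real run would obtain it from a clustering algorithm.

```python
def summarize_cluster(transactions):
    """Feature-wise intersection of equal-length categorical transactions.

    An attribute keeps its value only if all transactions agree on it;
    otherwise it is replaced by the wildcard "*" (an assumed convention).
    """
    first, *rest = transactions
    return tuple(
        value if all(t[i] == value for t in rest) else "*"
        for i, value in enumerate(first)
    )

# Hypothetical transactions with three categorical attributes
# (protocol, port, status). These are illustrative values only.
data = {
    "T1": ("tcp", "80", "ok"),
    "T2": ("tcp", "80", "err"),
    "T3": ("tcp", "80", "ok"),
    "T4": ("tcp", "80", "ok"),
    "T5": ("udp", "53", "ok"),
    "T6": ("udp", "53", "ok"),
    "T7": ("udp", "123", "ok"),
    "T8": ("tcp", "80", "ok"),
}

# Cluster membership as stated in the text.
clusters = {
    "C1": ["T1", "T2", "T3", "T4", "T8"],
    "C2": ["T5", "T6", "T7"],
}

summary = {
    name: summarize_cluster([data[t] for t in members])
    for name, members in clusters.items()
}
print(summary)
```

Eight transactions are reduced to two summary rows, one per cluster. Any attribute on which a cluster's members disagree is generalized away, which is where information is lost.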