Numerosity Reduction
Numerosity reduction replaces the original data with a smaller form of data representation in order to reduce the volume of data.
These techniques may be parametric or nonparametric.
Parametric:
For parametric methods, a model is used to estimate the data, so that typically only the model parameters need to be stored instead of the actual data. (Outliers may also be stored.)
Example: log-linear models, which estimate discrete multidimensional probability distributions.
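To make the parametric idea concrete, here is a minimal sketch that assumes a simple linear regression model. It keeps only the slope, the intercept, and any points the line misses by more than a chosen tolerance. The function names, the `tolerance` threshold, and the sample values are illustrative assumptions, not part of any particular library.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def reduce_data(xs, ys, tolerance):
    """Return the two line parameters plus the points treated as outliers.

    A point counts as an outlier (hypothetical rule) when the fitted line
    misses it by more than `tolerance`.
    """
    slope, intercept = fit_line(xs, ys)
    outliers = {
        x: y for x, y in zip(xs, ys)
        if abs(y - (slope * x + intercept)) > tolerance
    }
    return slope, intercept, outliers

def reconstruct(x, slope, intercept, outliers):
    """Approximate y for x: exact for stored outliers, the line otherwise."""
    return outliers.get(x, slope * x + intercept)

# Made-up data: ten points on y = 2x + 1, with one injected outlier at x = 5.
xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
ys[5] = 40.0

slope, intercept, outliers = reduce_data(xs, ys, tolerance=10.0)
print(slope, intercept, outliers)
```

Ten stored values shrink to two parameters plus one outlier. The reconstructed values are approximations, and the outlier itself pulls the fitted line away from the remaining points.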
Nonparametric:
Nonparametric methods for storing reduced representations of the data include histograms, clustering, and sampling.
Regression and Log-Linear Models
• Regression and log-linear models can be used to approximate the given data.
• In (simple) linear regression, the data are modeled to fit a straight line.
• Multiple linear regression is an extension of (simple) linear regression, which allows a response variable y to be modeled as a linear function of two or more predictor variables.
• Log-linear models approximate discrete multidimensional probability distributions.
• Log-linear models can be used to estimate the probability of each point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from lower-dimensional spaces.
• Log-linear models are therefore also useful for dimensionality reduction and data smoothing.
• Regression and log-linear models can both be used on sparse data, although their application may be limited.
• While both methods can handle skewed data, regression handles it exceptionally well. Regression can be computationally intensive when applied to high-dimensional data, whereas log-linear models show good scalability for up to 10 or so dimensions.
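The point about building a higher-dimensional space from lower-dimensional ones can be sketched with the simplest log-linear model, one with main effects only, which assumes the two attributes are independent. Only the one-dimensional marginals are stored, and each two-dimensional cell probability is estimated from them. The attribute values ("young", "low", and so on) and the counts are made up for illustration.

```python
import math
from collections import Counter

def fit_independence_model(rows):
    """Store only the log marginals of two discretized attributes."""
    n = len(rows)
    counts_a = Counter(a for a, _ in rows)
    counts_b = Counter(b for _, b in rows)
    log_p_a = {a: math.log(c / n) for a, c in counts_a.items()}
    log_p_b = {b: math.log(c / n) for b, c in counts_b.items()}
    return log_p_a, log_p_b

def cell_probability(a, b, log_p_a, log_p_b):
    """Estimate p(a, b) from the marginals.

    log p(a, b) = log p(a) + log p(b): additive on the log scale,
    with no interaction term (the independence assumption).
    """
    return math.exp(log_p_a[a] + log_p_b[b])

# Made-up discretized data: (age group, income band) for 100 records.
rows = (
    [("young", "low")] * 30 + [("young", "high")] * 10
    + [("old", "low")] * 30 + [("old", "high")] * 30
)

log_p_a, log_p_b = fit_independence_model(rows)
estimate = cell_probability("young", "low", log_p_a, log_p_b)
print(round(estimate, 2))  # 0.4 * 0.6 = 0.24, versus 0.30 observed
```

The gap between the estimated 0.24 and the observed 0.30 is the smoothing effect of dropping the interaction term. A fuller model would add interaction terms for the attribute combinations that matter, at the cost of storing more parameters.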
Histograms
• Histograms use binning to approximate data distributions and are a popular form of data reduction.
• A histogram partitions the data distribution into disjoint subsets, or buckets.
• If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
• Singleton buckets are useful for storing outliers with high frequency.
• Histograms are highly effective at approximating both sparse and dense data, as well as highly skewed and uniform data.
• The histograms for single attributes can be extended for multiple attributes.
• Multidimensional histograms can capture dependencies between attributes.
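The bucket idea above can be sketched as follows, assuming equal-width partitioning (one of several possible partitioning rules). A width of 1 gives singleton buckets for integer data, while a larger width gives a coarser, smaller summary. The price list is made up for illustration.

```python
from collections import Counter

def equal_width_histogram(values, width):
    """Count values per bucket [k * width, (k + 1) * width).

    Returns a dict mapping each bucket's lower bound to its frequency.
    """
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Made-up integer prices.
prices = [1, 1, 5, 5, 5, 8, 10, 10, 14, 15, 15, 18, 20, 21, 21, 25, 28, 30]

singleton = equal_width_histogram(prices, 1)   # one bucket per distinct value
buckets = equal_width_histogram(prices, 10)    # four buckets of width 10

print(len(prices), len(singleton), len(buckets))
print(buckets)
```

Eighteen values reduce to four bucket counts with width 10. The exact values inside each bucket are lost; only the bucket boundaries and frequencies remain.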