Outlier
- An outlier is a data object that lies exceptionally far from the bulk of the data.
- It deviates significantly from the rest of the objects and behaves differently from them.
- Outliers can arise from measurement or execution errors.
- Outlier detection is the process of identifying outliers in a given set of data, often so that they can be excluded.
- However, there is no single standard method for identifying outliers, because the appropriate method depends largely on the dataset.
Outlier Detection Methods
- Many methods and approaches are used to detect outliers.
- They can be grouped into the following categories:
Extreme Value Analysis -
- This is the most basic method and is useful for one-dimensional data.
- In this method, values that are too large or too small are considered outliers.
- Examples include statistical tests such as the z-test and t-test.
- This method is generally used as the final step for interpreting the outputs of other outlier detection methods.
- It is a good heuristic for the initial analysis of data, but it has little value in multivariate settings.
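As a rough illustration of extreme value analysis on one-dimensional data, the sketch below flags values whose z-score is large in absolute value. It uses the z-score rule as a simple stand-in for the tests named above. The function name zscore_outliers, the sample readings, and the threshold are assumptions made for this example only.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for values whose |z-score| exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing to flag
    flagged = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged

# Made-up sample data: eight ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
flagged = zscore_outliers(readings, threshold=2.5)
print(flagged)
```

The threshold is set to 2.5 here because with only nine values no z-score computed from the population standard deviation can exceed sqrt(8), about 2.83, so the common cutoff of 3 would flag nothing. On larger samples a cutoff of 3 is a frequent choice, but the right value depends on the data.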
Linear Approach -
- In this method, the data is modeled in a lower-dimensional subspace by exploiting linear correlations.
- The distance of each data point to the plane that fits this subspace is then calculated.
- Points with a large distance are treated as outliers.
- An example of this method is Principal Component Analysis (PCA).
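The idea can be sketched for two-dimensional data, where the fitted subspace is a single line: find the first principal axis and measure each point's perpendicular distance to it. This is only a minimal illustration under that 2-D assumption. The function name pca_line_distances and the sample points are invented for the example, and real data with more dimensions would normally call for a linear algebra library.

```python
import math

def pca_line_distances(points):
    """Fit the first principal axis of 2-D points and return each point's
    perpendicular distance to that axis."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the leading eigenvector (direction of largest variance)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Perpendicular distance = |cross product of centered point and unit direction|
    return [abs((x - mx) * uy - (y - my) * ux) for x, y in points]

# Made-up data: eight points close to the line y = 2x, plus one point far off it.
points = [(1, 2.1), (2, 3.9), (3, 6.1), (4, 8.0), (5, 9.9),
          (6, 12.1), (7, 14.0), (8, 15.9), (4, 14.0)]
distances = pca_line_distances(points)
suspect = distances.index(max(distances))
print(suspect, points[suspect])
```

Note that the off-line point also pulls the fitted axis toward itself, so in practice the distances are a ranking aid rather than an exact measure.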
Probabilistic and Statistical Methods -
- These methods assume a particular distribution for the data.
- The parameters of that distribution are generally estimated with the expectation-maximization (EM) algorithm.
- Finally, the probability of each data object under the fitted distribution is computed.
- Data objects with a low probability are considered outliers.
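The steps above can be sketched with a small one-dimensional example: assume the data comes from a mixture of two Gaussians, estimate the parameters with a plain EM loop, and then rank objects by their density under the fitted mixture. The choice of two components, the initialization, the iteration count, and all function names are assumptions made for illustration, not a recommended configuration.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(xs, iterations=50, min_var=1e-3):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    n = len(xs)
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    mus = [min(xs), max(xs)]              # crude deterministic initialization
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, min_var)  # floor avoids a collapsing component
            weights[k] = nk / n
    return weights, mus, variances

def mixture_density(x, weights, mus, variances):
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances))

# Made-up data: two tight groups near 0 and 10, and one value in between.
xs = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2, 5.0]
weights, mus, variances = fit_two_gaussians(xs)
densities = [mixture_density(x, weights, mus, variances) for x in xs]
suspect = densities.index(min(densities))
print(xs[suspect])
```

The object with the lowest density is the candidate outlier. How low is "low" is a separate decision, for example a fixed cutoff or the bottom few percent of densities.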
Proximity Methods -
- In these methods, outliers are objects that are isolated from the rest of the data.
- In distance-based variants, an object is considered an outlier if its neighborhood does not contain enough other points.
- In density-based variants, an object is considered an outlier if its density is much lower than that of its neighbors.
- Examples of this type of method are cluster analysis, density-based analysis, and nearest-neighbor analysis.
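A minimal nearest-neighbor sketch of the proximity idea: score each object by the distance to its k-th nearest neighbor, so that objects with sparse neighborhoods receive high scores. The function name knn_distance_scores, the value k=2, and the sample points are assumptions for illustration. The brute-force search is quadratic in the number of points, so it only suits small data sets.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Made-up data: a small cluster near the origin and one isolated point.
points = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.8), (1.0, 1.0), (0.8, 0.1), (8.0, 8.0)]
scores = knn_distance_scores(points, k=2)
suspect = scores.index(max(scores))
print(points[suspect], round(scores[suspect], 2))
```

This corresponds to the distance-based view. A density-based method would instead compare each object's local density with that of its neighbors.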
Information-Theoretic Methods -
- In these methods, outliers are the objects that increase the minimum code length needed to describe a data set.
- These methods measure the regularity of audit data and perform appropriate data transformations.
- They are best suited to datasets that have high regularity.
- Relative entropy is used to determine whether the approach is suitable for a new dataset.
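One simple way to make the code-length idea concrete is to measure how many bits are saved when a single object is removed from the data set. The sketch below uses the ideal (Shannon) code length under the data's own empirical distribution. That coding scheme, the function names, and the sample events are all assumptions for the example, since no specific scheme is named above.

```python
import math
from collections import Counter

def code_length_bits(items):
    """Ideal code length in bits of the items under their own
    empirical distribution (n times the Shannon entropy)."""
    n = len(items)
    counts = Counter(items)
    return -sum(c * math.log2(c / n) for c in counts.values())

def removal_savings(items):
    """For each distinct value, the bits saved by removing one occurrence."""
    base = code_length_bits(items)
    savings = {}
    for value in set(items):
        reduced = list(items)
        reduced.remove(value)  # drops a single occurrence
        savings[value] = base - code_length_bits(reduced)
    return savings

# Made-up categorical events: two common values and one rare value.
events = ["a"] * 8 + ["b"] * 7 + ["z"]
savings = removal_savings(events)
suspect = max(savings, key=savings.get)
print(suspect, savings)
```

Removing the rare value shortens the description far more than removing one of the common values, which is why it stands out as the outlier candidate.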