Write a short note on measuring similarity and dissimilarity.

Similarity Measures

  • Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest-neighbour classification, and anomaly detection.
  • The term proximity is used to refer to either similarity or dissimilarity.

Definitions

  • The similarity between two objects is a numerical measure of the degree to which the two objects are alike. Consequently, similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and often between 0 (no similarity) and 1 (complete similarity).
  • The dissimilarity between two objects is the numerical measure of the degree to which the two objects are different. Dissimilarity is lower for more similar pairs of objects.

  • Frequently, the term distance is used as a synonym for dissimilarity. Dissimilarities sometimes fall in the interval [0,1], but it is also common for them to range from 0 to ∞.

Proximity Measures

  • Proximity measures, especially similarities, are defined to have values in the interval [0,1]. If the similarity between objects can range from 1 (not at all similar) to 10 (completely similar), we can make them fall into the range [0,1] by using the formula: s'=(s-1)/9, where s and s' are the original and the new similarity values, respectively.
  • In the more general case, s' is calculated as s'=(s-min_s)/(max_s-min_s), where min_s and max_s are the minimum and maximum similarity values, respectively.

  • Likewise, dissimilarity measures with a finite range can be mapped to the interval [0,1] by using the formula d'=(d-min_d)/(max_d-min_d).

  • If the proximity measure originally takes values in the interval [0, ∞], then we usually use the formula d'=d/(1+d) to map the dissimilarity measure into [0,1].
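The rescaling formulas above can be sketched in Python; the function names here are illustrative, not from any library:

```python
def rescale_similarity(s, min_s, max_s):
    """Map a similarity from [min_s, max_s] to [0, 1]: s' = (s - min_s) / (max_s - min_s)."""
    return (s - min_s) / (max_s - min_s)

def rescale_dissimilarity(d):
    """Map a dissimilarity from [0, inf) into [0, 1): d' = d / (1 + d)."""
    return d / (1 + d)

# A similarity on a 1..10 scale reduces to s' = (s - 1) / 9:
print(rescale_similarity(10, 1, 10))  # 1.0 (completely similar)
print(rescale_similarity(1, 1, 10))   # 0.0 (not at all similar)
print(rescale_dissimilarity(3))       # 0.75
```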

Similarity and dissimilarity between simple attributes

The proximity of objects with a number of attributes is defined by combining the proximities of individual attributes.

  • Attribute Types and Similarity Measures:

1) For interval or ratio attributes, the natural measure of dissimilarity between two objects is the absolute difference of their values. For example, we might compare our current weight to our weight one year ago. In such cases the dissimilarities range from 0 to ∞.

2) For objects described with one nominal attribute, the attribute value describes whether the attribute is present in the object or not. Comparing two objects with one nominal attribute means comparing the values of this attribute. In that case, similarity is traditionally defined as 1 if attribute values match and as 0 otherwise. A dissimilarity would be defined in the opposite way: 0 if the attribute values match, 1 if they do not.

3) For objects with a single ordinal attribute, information about order should be taken into account. Consider an attribute that measures the quality of a product on the scale {poor, fair, OK, good, wonderful}. It would be reasonable that a product P1 rated wonderful is closer to a product P2 rated good than to a product P3 rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g. {poor=0, fair=1, OK=2, good=3, wonderful=4}. Then d(P1, P2) = 4 − 3 = 1.
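The nominal and ordinal cases above can be sketched as follows (function names and the example quality scale mapping are taken from this note; they are illustrative):

```python
def nominal_similarity(a, b):
    """Similarity for a single nominal attribute: 1 if the values match, 0 otherwise."""
    return 1 if a == b else 0

# Ordinal values mapped to successive integers, as described above.
QUALITY = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}

def ordinal_dissimilarity(a, b, mapping=QUALITY):
    """Dissimilarity for a single ordinal attribute: distance between mapped integers."""
    return abs(mapping[a] - mapping[b])

print(nominal_similarity("red", "red"))            # 1
print(nominal_similarity("red", "blue"))           # 0
print(ordinal_dissimilarity("wonderful", "good"))  # 1  (d(P1, P2) = 4 - 3)
print(ordinal_dissimilarity("wonderful", "OK"))    # 2  (d(P1, P3) = 4 - 2)
```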

Dissimilarities Between Data Objects

Distances:

Distances are dissimilarities with certain properties. The Euclidean distance, d, between two points x and y in one-, two-, or higher-dimensional space is given by the formula:

$$ \mathrm{d}(\mathbf{x}, \mathbf{y})=\sqrt{\sum_{k=1}^{n}\left(x_{k}-y_{k}\right)^{2}} $$

  • where n is the number of dimensions and $x_k$ and $y_k$ are, respectively, the kth attributes (components) of x and y.

The Euclidean distance measure is generalized by the Minkowski distance metric, shown as:

$$ \mathrm{d}(\mathbf{x}, \mathbf{y})=\left(\sum_{k=1}^{n}\left|x_{k}-y_{k}\right|^{r}\right)^{1 / r} $$

The following are the 3 most common examples of Minkowski distances:

  • r = 1. City block (Manhattan or L1 norm) distance. A common example is the Hamming distance, which is the number of bits that differ between two objects that have only binary attributes (i.e., binary vectors).

  • r = 2. Euclidean distance (L2 norm).

  • $\mathrm{r}=\infty$. Supremum ($L_{\max}$ or $L_{\infty}$ norm) distance. This is the maximum difference between any attributes of the objects. The $L_{\infty}$ distance is defined more formally by: $$ \mathrm{d}(\mathbf{x}, \mathbf{y})=\lim_{r \rightarrow \infty}\left(\sum_{k=1}^{n}\left|x_{k}-y_{k}\right|^{r}\right)^{1 / r} $$
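A minimal sketch of the three Minkowski distances above (the example vectors are ours, chosen so the results come out exact):

```python
def minkowski(x, y, r):
    """Minkowski distance of order r between equal-length vectors x and y."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def supremum(x, y):
    """L-infinity norm: the maximum attribute-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1, 0, 3), (4, 4, 3)   # attribute-wise differences: 3, 4, 0
print(minkowski(x, y, 1))  # 7.0  city block (L1): 3 + 4 + 0
print(minkowski(x, y, 2))  # 5.0  Euclidean (L2): sqrt(9 + 16 + 0)
print(supremum(x, y))      # 4    supremum (Lmax): max(3, 4, 0)
```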

Similarities Between Data Objects

If s(x, y) is the similarity between points x and y, then typically:

  1. s(x, y) = 1 only if x = y (0 ≤ s ≤ 1).

  2. s(x, y) = s(y, x) for all x and y. (Symmetry)

Non-symmetric Similarity Measures – confusion matrix

Consider an experiment in which people are asked to classify a small set of characters as they flash on the screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often it is classified as another character. For example, suppose "0" appeared 200 times and was classified as "0" 160 times but as "o" 40 times. Likewise, suppose that "o" appeared 200 times and was classified as "o" 170 times and as "0" 30 times.


  • If we take these counts as a measure of similarity between the two characters, then we have a similarity measure, but not a symmetric one.

s("0", "o") = 40/200 = 20%

s("o", "0") = 30/200 = 15%
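The counts from the experiment above can be turned into this non-symmetric similarity directly; the dictionary layout below is just one way to hold the confusion matrix:

```python
# Confusion counts from the example: (appeared-as, classified-as) -> count.
counts = {("0", "0"): 160, ("0", "o"): 40,
          ("o", "o"): 170, ("o", "0"): 30}
appearances = {"0": 200, "o": 200}

def s(a, b):
    """Fraction of a's appearances that were classified as b."""
    return counts[(a, b)] / appearances[a]

print(s("0", "o"))  # 0.2  -> 20%
print(s("o", "0"))  # 0.15 -> 15%, so s("0", "o") != s("o", "0")
```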
