For interval data the most common distance measure used is the Euclidean distance.
Euclidean distance:
In general, if you have p variables X1, X2, ..., Xp measured on a sample of n subjects, the observed data for subject i can be denoted by xi1, xi2, ..., xip and the observed data for subject j by xj1, xj2, ..., xjp. The Euclidean distance between these two subjects is given by
$$d_{ij} = \sqrt{(x_{i1}-x_{j1})^2+(x_{i2}-x_{j2})^2+\cdots+(x_{ip}-x_{jp})^2}$$
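As a quick illustration of the formula, here is a minimal Python sketch. The function name `euclidean_distance` and the two example subjects are hypothetical, chosen only for this example.

```python
import math

def euclidean_distance(x_i, x_j):
    """Euclidean distance between two subjects measured on the same p variables."""
    if len(x_i) != len(x_j):
        raise ValueError("both subjects must have the same number of variables")
    # Sum the squared differences variable by variable, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Hypothetical subjects i and j measured on p = 2 variables.
subject_i = [1.0, 2.0]
subject_j = [4.0, 6.0]
d_ij = euclidean_distance(subject_i, subject_j)  # sqrt(3^2 + 4^2) = 5.0
print(d_ij)
```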
Hierarchical agglomerative methods:
Nearest neighbour method (single linkage method):
In this method the distance between two clusters is defined to be the distance between the two closest members, or neighbours.
Furthest neighbour method (complete linkage method):
In this case the distance between two clusters is defined to be the maximum distance between members, i.e. the distance between the two subjects that are furthest apart.
Average (between groups) linkage method (sometimes referred to as UPGMA):
The distance between two clusters is calculated as the average distance between all pairs of subjects in the two clusters.
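The three linkage rules described so far differ only in how they summarise the distances between every pair of subjects drawn from the two clusters. A rough sketch, assuming Euclidean distance between subjects and Python 3.8+ for `math.dist`; the function names and the two example clusters are hypothetical:

```python
import math
from itertools import product

def pairwise_distances(cluster_a, cluster_b):
    """Distances between every subject in cluster_a and every subject in cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

def single_linkage(cluster_a, cluster_b):
    # Nearest neighbour: distance between the two closest members.
    return min(pairwise_distances(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    # Furthest neighbour: distance between the two members furthest apart.
    return max(pairwise_distances(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    # Between-groups average: mean distance over all cross-cluster pairs.
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)

# Hypothetical clusters of two subjects each, measured on two variables.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # closest pair: (0, 0) and (3, 0)
print(complete_linkage(cluster_a, cluster_b))  # furthest pair: (0, 1) and (4, 0)
print(average_linkage(cluster_a, cluster_b))   # mean of the four pairwise distances
```

For the same pair of clusters, the single linkage distance is never larger than the average linkage distance, which in turn is never larger than the complete linkage distance.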
Centroid method:
Here the centroid (mean value for each variable) of each cluster is calculated and the distance between centroids is used. Clusters whose centroids are closest together are merged.
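A small sketch of the centroid calculation, again assuming Euclidean distance between centroids and Python 3.8+ for `math.dist`. The names `centroid` and `centroid_distance` and the example clusters are made up for illustration.

```python
import math

def centroid(cluster):
    """Mean value of each variable across the subjects in a cluster."""
    n = len(cluster)
    # zip(*cluster) groups the values variable by variable.
    return tuple(sum(values) / n for values in zip(*cluster))

def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters measured on two variables.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]  # centroid (0, 1)
cluster_b = [(3.0, 1.0), (5.0, 1.0)]  # centroid (4, 1)

print(centroid(cluster_a))
print(centroid(cluster_b))
print(centroid_distance(cluster_a, cluster_b))  # 4.0
```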
Ward’s method:
In this method every possible pair of clusters is provisionally merged in turn. For each candidate merge, the sum of squared distances from each subject to the centroid of its cluster is calculated within each cluster and then summed over all clusters. The merge that gives the lowest total sum of squares is chosen.
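One step of this procedure can be sketched as follows. The sketch assumes that "sum of the squared distances within each cluster" means squared Euclidean distances from each subject to its cluster centroid, which is the usual reading. The function names and the example data are hypothetical, and the brute-force search is meant to mirror the description rather than to be efficient.

```python
from itertools import combinations

def within_cluster_ss(cluster):
    """Sum of squared distances from each subject to the cluster centroid."""
    n = len(cluster)
    center = [sum(values) / n for values in zip(*cluster)]
    return sum((x - c) ** 2 for point in cluster for x, c in zip(point, center))

def best_ward_merge(clusters):
    """Try every pair of clusters and return the merge with the lowest total sum of squares."""
    best_pair, best_total = None, float("inf")
    for i, j in combinations(range(len(clusters)), 2):
        merged = clusters[i] + clusters[j]
        others = [c for k, c in enumerate(clusters) if k not in (i, j)]
        # Total within-cluster sum of squares if clusters i and j were merged.
        total = within_cluster_ss(merged) + sum(within_cluster_ss(c) for c in others)
        if total < best_total:
            best_pair, best_total = (i, j), total
    return best_pair, best_total

# Hypothetical starting point: three single-subject clusters on two variables.
clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(10.0, 0.0)]]
pair, total = best_ward_merge(clusters)
print(pair, total)  # merging clusters 0 and 1 gives the smallest total, 0.5
```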