written 7.1 years ago by | • modified 5.7 years ago |
In the point assignment classs which assumes Euclidean space.
It does not assume anything about the shape of clusters; they need not be normally distributed, and can even have strange bends, S-shapes, or even rings.
Instead of representing clusters by their centroid, it uses a collection of representative points, as the name implies.
The CURE algorithm is divided into into phases:
- Initialization in CURE
- Completion of the CURE Algorithm
Initialization in CURE:
Take a small sample of the data and cluster it in main memory. In principle, any clustering method could be used, but as CURE is designed to handle oddly shaped clusters, it is often advisable to use a hierarchical method in which clusters are merged when they have a close pair of points.
Select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible, using the K-means method.
Move each of the representative points a fixed fraction of the distance between its location and the centroid of its cluster. Perhaps 20% is a good fraction to choose. Note that this step requires a Euclidean space, since otherwise, there might not be any notion of a line between two points.
Completion of the CURE Algorithm:
The next phase of CURE is to merge two clusters if they have a pair of representative points, one from each cluster, that are sufficiently close. The user may pick the distance that defines “close.” This merging step can repeat, until there are no more sufficiently close clusters.
Example:
Figure A is an illustration of two clusters. The inner cluster is an ordinary circle, while the second is a ring around the circle. This arrangement is not completely pathological. A creature from another galaxy might look at our solar system and observe that the objects cluster into an inner circle (the planets) and an outer ring (the Kuyper belt), with little in between.
We could use a hierarchical clustering algorithm on a sample of the data from Fig. A If we took as the distance between clusters the shortest distance between any pair of points, one from each cluster, then we would correctly find the two clusters. That is, pieces of the ring would stick together, and pieces of the inner circle would stick together, but pieces of ring would always be far away from the pieces of the circle. Note that if we used the rule that the distance between clusters was the distance between their centroids, then we might not get the intuitively correct result. The reason is that the centroids of both clusters are in the center of the diagram.
For the second step, we pick the representative points. If the sample from which the clusters are constructed is large enough, we can count on a cluster’s sample points at greatest distance from one another lying on the boundary of the cluster. Figure B suggests what our initial selection of sample points might look like.
Finally, we move the representative points a fixed fraction of the distance from their true location toward the centroid of the cluster. Note that in Fig. B both clusters have their centroid in the same place: the center of the inner circle. Thus, the representative points from the circle move inside the cluster, as was intended. Points on the outer edge of the ring also move into their cluster, but points on the ring’s inner edge move outside the cluster. The final locations of the representative points from Fig. B are suggested by Fig.C.
The situation of Fig. C serves as a useful illustration. There is some argument that the ring and circle should really be merged, because their centroids are the same. For instance, if the gap between the ring and circle were much smaller, it might well be argued that combining the points of the ring and circle into a single cluster reflected the true state of affairs. For instance, the rings of Saturn have narrow gaps between them, but it is reasonable to visualize the rings as a single object, rather than several concentric objects. In the case of Fig.C the choice of:
- The fraction of the distance to the centroid that we move the representative points and
- The choice of how far apart representative points of two clusters need to be to avoid merger
The last step of CURE is point assignment. Each point p is brought from secondary storage and compared with the representative points. We assign p to the cluster of the representative point that is closest to p.