written 2.6 years ago by |
CURE Algorithm:
It is a large scale clustering algorithm which uses a combination of partition based and hierarchical algorithms.
This algorithm as sums Euclidean space and identifies cluster of any space.
CURE starts by treating each data point as a single cluster and then recursively merges two existing clusters into one until we have only K clusters.
CURE uses two data structures to compute minimum distance between representative points:
1) Heap to track the distance of each existing cluster to its closet cluster.
2) Uses k-d tree to store all representative points for each cluster.
- The two stages of CURE implementation are as follows:
1) Initialization in CURE
• Take a small sample data and cluster it in main memory using hierarchical clustering algorithm
• Select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible.
• Move each of the representative points a fixed fraction of the distance between its location and the centro-id of its cluster. The fraction could be about 20% or 30% of the original distance
2) Completion of CURE Algorithm:
Once initialization is complete, we have to cluster the renaming points and output the final cluster:
The final step of CURE is to merge two clusters if they have a pair of representative points, one from each cluster, that are sufficiently close. The user may pick the distance threshold. The merging step can be repeated until there are no more sufficiently close clusters.
Each point P is brought from secondary storage and compared with the representative points.
We assign p to the cluster of the representative point that is closes to P.
Figures illustrating CURE :