written 5.6 years ago by | modified 20 months ago by |
K-means Clustering:
In K-means approach the data objects are classified based on their attributes or features into k number of clusters. The number of clusters i.e K is an input given by the user. K-means is one of the simplest unsupervised learning algorithms.
K-means Algorithm:
K- number of clusters
n- sample vectors $x_{1},x_{2},....x_{n}$
$m_{i}$- The mean of vectors in cluster i
- Assume k < n
- Make initial guesses for the means $m_{1}, m_{2},...m_{k}$
- Until there are no changes in any mean
Use the estimated means to classify the samples into clusters.
for i = 1 to k
Replace $m_i$ with the mean of all of the samples for cluster i
end_for
end_until
Following three steps are repeated until convergence:
- Iterate till no object moves to a different group:
- Find the centroid coordinate.
- Find the distance of each object to the centroids
- Based on minimum distance group the objects.
K-Means Advantages :
If variables are huge, then K-Means most of the times computationally faster than hierarchical clustering, if we keep k smalls.
K-Means produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
K-Means Disadvantages :
- Difficult to predict K-Value.
- With global cluster, it didn't work well.
- It does not work well with clusters (in the original data) of Different size and Different density.
Numerical
Given:
Dataset -{15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65}
Solution:
Assume K=2
So, two clusters are formed (Users point of view) i.e
$Cluster 1$ = 15, 16, 19, 20, 21, 22, 28 $\therefore C_1$ = 20 (mean value)
$Cluster 2$ = 35, 40, 41, 42, 43, 44, 60, 61, 65 $\therefore C_2$ = 43 (mean value)
Below table:
Calculate Cluster 1 and Cluster 2 values, by subtracting the datapoint from $C_1$ and $C_2$
i.e (20 - 15) = 5 and (43 - 15) = 28
then choose min value (5) and assign that cluster.
Data point | Cluster 1 | Cluster 2 | Cluster assign |
---|---|---|---|
15 | 5 | 28 | 1 |
15 | 5 | 28 | 1 |
16 | 4 | 27 | 1 |
19 | 1 | 24 | 1 |
19 | 1 | 24 | 1 |
20 | 0 | 23 | 1 |
20 | 0 | 23 | 1 |
21 | 1 | 22 | 1 |
22 | 2 | 21 | 1 |
28 | 8 | 15 | 1 |
35 | 15 | 8 | 2 |
40 | 20 | 3 | 2 |
41 | 21 | 2 | 2 |
42 | 22 | 1 | 2 |
43 | 23 | 0 | 2 |
44 | 24 | 1 | 2 |
60 | 40 | 17 | 2 |
61 | 41 | 18 | 2 |
65 | 45 | 22 | 2 |
New centroid is calculated by taking average value of clusters.
Old centroid | New centroid |
---|---|
20 | 19.5 |
43 | 47.88 |
Data point | Cluster 1 | Cluster 2 | Cluster assign |
---|---|---|---|
15 | 4.5 | 32.8 | 1 |
15 | 4.5 | 32.8 | 1 |
16 | 3.5 | 31.8 | 1 |
19 | 0.5 | 28.8 | 1 |
19 | 0.5 | 28.8 | 1 |
20 | 0.5 | 27.8 | 1 |
20 | 0.5 | 27.8 | 1 |
21 | 1.5 | 26.8 | 1 |
22 | 2.5 | 25.8 | 1 |
28 | 8.5 | 19.8 | 1 |
35 | 15.5 | 12.8 | 2 |
40 | 20.5 | 7.8 | 2 |
41 | 21.5 | 6.8 | 2 |
42 | 22.5 | 5.8 | 2 |
43 | 23.5 | 4.8 | 2 |
44 | 24.5 | 3.8 | 2 |
60 | 40.5 | 12.1 | 2 |
61 | 41.5 | 13.1 | 2 |
65 | 45.5 | 17.1 | 2 |
Old centroid | New centroid |
---|---|
19.5 | 19.5 |
47.88 | 47.88 |
Since, Old centriod is same as New centroid stop the iteration.
$\therefore$ Final answer is $k_1$ = {15,15,16,19,19,20,20,21,22,28} $k_2$ = {35,40,41,42,43,44,60,61,65}