In the previous article, we covered two types of clustering (partitional clustering and hierarchical clustering) and their applications. Today, we will focus on partitional clustering: the principles of K-Means clustering and the pros and cons of the K-Means method.
■ K-Means Clustering
K-Means clustering is a typical partitional clustering method, and it is a fairly simple algorithm: all you have to do is assign each data point to the cluster whose centroid it is closest to. The number of clusters, ‘K’, has to be set by the researcher.
The step-by-step algorithm goes like this:
1. K centers are selected at random.
2. Each data point is grouped into the cluster of its nearest center.
3. Each center is moved to the mean point of the data assigned to it.
4. Data points are re-assigned to whichever moved center is now closest.
5. Repeat steps 3 and 4. Until when? Until the data points are no longer re-assigned, the centers no longer move, and the sum of squared errors (SSE) is minimized!
The most common criterion is the SSE. Here, an error refers to the distance between the center of a cluster and each data point in it.
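To make the steps above concrete, here is a minimal NumPy sketch of the loop, including the SSE computation. The function name and the convergence check are my own choices, and edge cases such as empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K data points at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2 and 4: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of the points assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and SSE: sum of squared distances to the cluster centers.
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse
```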
In the picture above, let’s say the red points are the centers of the clusters. Which clustering seems better to you? Yes, the right side of the figure seems better, because its clusters have shorter lines between each point and its center; in other words, it minimizes the SSE.
■ Disadvantages of K-Means: K
K-Means is not a perfect solution. One of its disadvantages is that the researcher needs to set K, and the K initial centers are placed at random.
Suppose the centers are randomly placed badly, as shown on the left above. Even if you assign clusters accordingly and move the centers to the mean points (right figure), the result is not a clustering that represents the data well.
■ How to overcome the shortcomings of K-Means – finding the appropriate value of K
One way to overcome the disadvantage that the starting positions of the K centers are random is to run the analysis several times. By restarting from different starting points, you can keep the clustering that minimizes the SSE and discard such unlucky ‘outlier’ runs.
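In practice you rarely write this restart loop yourself. For example, scikit-learn’s KMeans refits the model from several random starting points and keeps the run with the smallest SSE (exposed as inertia_). A brief sketch, assuming X is a NumPy data matrix and that three clusters are wanted:

```python
from sklearn.cluster import KMeans

# Refit from 10 different random starting points and keep the best run,
# i.e. the one with the smallest SSE.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)  # SSE of the best of the 10 runs
```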
The value of K can also be found through simple trial and error. That is, start with a small value of K and increase it one by one, comparing the results (empirically, K is usually between 1 and 10). Fortunately, researchers use several algorithms to find K automatically, most typically the knee value (also known as the elbow value).
The appropriate K can be chosen as shown above. After all, even as K keeps increasing, you can find the knee value (in the figure above, the slope flattens when K is 6) and set K at the point where the improvement in SSE drops off most sharply.
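Here is a minimal sketch of that trial-and-error search, assuming X is the data matrix as before and that scikit-learn and matplotlib are available; the range of K values follows the empirical 1–10 rule mentioned above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for K = 1..10, record the SSE of each fit, and look for the
# knee/elbow where the SSE stops improving quickly.
ks = range(1, 11)
sses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sses, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE")
plt.show()
```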
■ Another shortcoming of K-Means: Outliers
Another shortcoming of K-Means is its vulnerability to outliers. We don’t want the center of a cluster to shift because of the single outlier data point shown below.
To overcome this shortcoming of K-Means, you can ‘get rid of’ data that lies far from the center in the first place. At this point, it is important to make sure that the data point you want to remove really is an outlier. In other words, carefully observe the data that is likely to be an outlier before removing it. Of course, this is only a second-best option, because it means manipulating the original data.
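One illustrative way to do this pre-filtering is sketched below; the 3-standard-deviation cutoff is an arbitrary choice for the example, not a rule, and the flagged points should be inspected before they are actually dropped:

```python
import numpy as np

# Flag points that are unusually far from the overall mean of the data.
dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
cutoff = dist_to_mean.mean() + 3 * dist_to_mean.std()
X_filtered = X[dist_to_mean < cutoff]
```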
Another method is random sampling. If the data set is large enough, the probability of drawing an outlier into the sample is very small. An added advantage of random sampling is that you can build models more quickly. However, this approach cannot be used when the data set is small.
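A brief sketch of the sampling approach, assuming X is large; the sample size of 10,000 and the three clusters are only illustrative numbers:

```python
import numpy as np
from sklearn.cluster import KMeans

# Fit on a random sample (rare outliers are unlikely to be drawn), then
# assign every point, including those left out of the sample, to the centers.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=min(len(X), 10_000), replace=False)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[sample_idx])
labels = km.predict(X)
```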
The last method is to use a modified form of K-Means called K-Medoids. K-Medoids uses an actual data point (the medoid) as each cluster’s center instead of the mean and, much like the median, a medoid is far less affected by outliers.
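If you want to try K-Medoids without implementing it yourself, one option is the KMedoids estimator from the scikit-learn-extra package (a separate install, assumed here; it is not part of scikit-learn itself):

```python
# Requires: pip install scikit-learn-extra
from sklearn_extra.cluster import KMedoids

# Each cluster center is an actual data point (a medoid), so a single extreme
# outlier cannot drag the center away the way it can drag a mean.
kmed = KMedoids(n_clusters=3, random_state=0).fit(X)
print(kmed.cluster_centers_)  # these rows are real points from X
```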
■ Clustering Evaluation
Again, let’s think about the purpose of clustering: the distance between the data points within a cluster should be small (i.e. the cohesion is high) and the distance between clusters should be large (i.e. the separation is high)!
Cohesion a(x) refers to the average distance from x to all the other data points (vectors) in the same cluster; a small a(x) means that the cohesion within the cluster is high. Separation b(x), on the other hand, is computed by first taking, for each cluster that x does not belong to, the average distance from x to every data point in that cluster; b(x) is the smallest of these averages. That means you have selected the neighboring cluster closest to x.
The silhouette s(x) and the silhouette coefficient are defined as: s(x) = (b(x) − a(x)) / max(a(x), b(x)), and the silhouette coefficient is the average of s(x) over all data points.
Let’s take a closer look. It’s easy to think of the ideal case, where all the data in a cluster sits at a single point, i.e. a(x) is 0. Substituting into the formula gives (b(x) − 0) / b(x) = 1. Thus, the closer a(x) is to 0 and the closer b(x) is to infinity, in other words, the larger the distance between clusters and the smaller the distances within each cluster, the better the clustering performs.
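To tie the formula to code, here is a small sketch that computes a(x), b(x), and s(x) for a single point and compares the result with scikit-learn’s silhouette_samples. It assumes X and labels come from an earlier K-Means fit and that the point’s own cluster contains more than one point:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_of_point(i, X, labels):
    """Compute s(x) for the i-th point exactly as defined above."""
    own = labels[i]
    # a(x): average distance to the other points in the same cluster.
    same = labels == own
    same[i] = False
    a = np.linalg.norm(X[same] - X[i], axis=1).mean()
    # b(x): smallest average distance to the points of any other cluster.
    b = min(
        np.linalg.norm(X[labels == c] - X[i], axis=1).mean()
        for c in np.unique(labels) if c != own
    )
    return (b - a) / max(a, b)

# Should agree with scikit-learn's own implementation.
print(silhouette_of_point(0, X, labels), silhouette_samples(X, labels)[0])
```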
Silhouette values range from −1 to 1. A silhouette value of 0 means the point lies right on the boundary between two clusters, so reassigning it would change nothing. And of course, the closer the value is to −1, the worse the result of the cluster analysis.
The silhouette coefficient can also be used to select K: you can pick the K with the largest silhouette coefficient. Keep in mind how this differs from the knee value, which picks the point where the slope of the SSE curve flattens.
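A short sketch of that search, again assuming X is the data matrix from before; note that the silhouette is only defined for K ≥ 2:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compute the silhouette coefficient (the average silhouette) for each K
# and keep the K with the largest value.
best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)
```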