There are various data mining methods: clustering, classification, regression, association rule mining, text mining, anomaly detection, sequential pattern mining, and time series prediction. In this article, let’s learn about clustering.
■ Why clustering? To better understand the data!
Clustering groups the data according to the following guidelines:
1) The data within a cluster is as similar as possible (red arrow in the figure below).
2) The data in different clusters is as far apart as possible (blue arrow in the figure below).
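The two guidelines above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the measure of similarity and a small set of made-up 2-D points whose cluster assignment is already known; the names `clusters`, `mean_intra_distance`, and `mean_inter_distance` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points with an assumed cluster assignment (illustration only).
clusters = {
    "a": [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9)],
    "b": [(5.0, 5.0), (5.2, 4.8), (4.9, 5.3)],
}

def mean_intra_distance(clusters):
    """Average Euclidean distance between pairs of points in the same cluster."""
    pairs = [
        dist(p, q)
        for pts in clusters.values()
        for p, q in combinations(pts, 2)
    ]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    pairs = [
        dist(p, q)
        for pts_a, pts_b in combinations(clusters.values(), 2)
        for p in pts_a
        for q in pts_b
    ]
    return sum(pairs) / len(pairs)

intra = mean_intra_distance(clusters)  # guideline 1: should be small
inter = mean_inter_distance(clusters)  # guideline 2: should be large
print(f"intra={intra:.2f}, inter={inter:.2f}")
```

For a reasonable grouping, `intra` comes out clearly smaller than `inter`. Other distance measures could be substituted without changing the idea.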
Clustering enables descriptive analysis because it forms clusters from data that is already given. In the figure below, the right side shows the actual data labeled by color. But what if you don’t know which point is which color?
As shown on the left side of the figure, you then need to group nearby data points together. Why? To understand the structure of the data well!
Clustering is a typical unsupervised learning approach: the data has no labels (i.e., “red”, “blue”, or “green”), and the goal is to explore its underlying structure.
■ Clustering applications
Clustering is useful for market segmentation, which means dividing a market into subgroups of customers. We can evaluate a cluster by checking whether purchasing patterns are similar within the cluster and dissimilar from those in other clusters.
Another example is document clustering. The purpose is to analyze how similar or different documents are, based on the keywords that appear in each one. To do this, first determine which terms appear frequently in each document, then measure similarity according to those frequencies. Based on this, a search engine can show keyword-related documents when we search for a keyword.
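The term-frequency idea can be sketched in a few lines. This is only an illustration under stated assumptions: the three example documents are made up, words are split on whitespace, and cosine similarity is used as the similarity measure (the article does not name a specific measure). The helper names `term_frequencies` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical example documents (illustration only).
docs = {
    "doc1": "clustering groups similar data points",
    "doc2": "clustering groups similar documents by keywords",
    "doc3": "the stock market fell sharply today",
}

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term appears."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

tfs = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tfs["doc1"], tfs["doc2"])  # share three terms
sim_13 = cosine_similarity(tfs["doc1"], tfs["doc3"])  # share no terms
print(f"doc1 vs doc2: {sim_12:.2f}, doc1 vs doc3: {sim_13:.2f}")
```

Documents that share many frequent terms score close to 1 and would tend to land in the same cluster, while documents with no terms in common score 0.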
In addition, there are countless other examples, such as clustering communities by interest on social media and image recognition in autonomous driving.
■ Types of clustering
There are two types of clustering: partitional and hierarchical. Partitional clustering divides the data into clusters that do not overlap, while hierarchical clustering organizes the clusters in a tree structure.
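To make the difference in output shape concrete, here is a minimal sketch. It assumes made-up 1-D data and uses single-linkage agglomerative merging as one possible way to build the tree; this choice and the helper names `single_linkage_merges` and `cut_into_k` are assumptions for illustration, not the method described in this article. The list of merges plays the role of the tree, and stopping the merges early gives a flat, non-overlapping partition.

```python
from itertools import combinations

# Hypothetical 1-D data points (illustration only).
points = [1.0, 1.2, 5.0, 5.1, 9.0]

def merge(clusters, a, b):
    """Replace clusters a and b with their union."""
    return [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]

def single_linkage_merges(points):
    """Hierarchical (bottom-up) clustering: repeatedly merge the two closest clusters."""
    clusters = [(i,) for i in range(len(points))]  # each cluster is a tuple of point indices
    merges = []
    while len(clusters) > 1:
        # Cluster distance = smallest distance between any two members (single linkage).
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(
                abs(points[i] - points[j]) for i in pair[0] for j in pair[1]
            ),
        )
        clusters = merge(clusters, a, b)
        merges.append((a, b))
    return merges

def cut_into_k(points, merges, k):
    """Flat partition obtained by applying merges until only k clusters remain."""
    clusters = [(i,) for i in range(len(points))]
    for a, b in merges[: len(points) - k]:
        clusters = merge(clusters, a, b)
    return sorted(clusters)

merges = single_linkage_merges(points)   # the tree, recorded as a sequence of merges
flat = cut_into_k(points, merges, k=3)   # a flat partition: every point in exactly one cluster
print(merges)
print(flat)
```

A partitional method produces something like `flat` directly for a chosen number of clusters, whereas a hierarchical method keeps the whole merge history so the data can be viewed at any level of granularity.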
Next, we’ll learn more about K-Means, a representative partitional clustering method.