
K-Means Clustering Python Example – Analyzing Customer Purchase Data

This is a Python example of K-Means clustering, applied to customer purchase data to extract meaningful insights. For the underlying concept, please refer to K-Means.

** Please check the GitHub link at the bottom of the post for the Excel data and a guided Jupyter notebook. Any comments are welcome 🙂

■ How does the K value affect clustering?

First, load the Excel data with the pandas package in your Jupyter notebook.

# Import pandas packages
import pandas as pd

# Use the read_excel() function to load the file
data = pd.read_excel('CustomerDataSet.xls')

# View only a few rows of data
data.head()

The data is made up of five columns: CustomerID, ItemsBought, ItemsReturned, ZipCode, and Product. What we want to know is the relationship between our product portfolio and the region. We will use ItemsBought and ItemsReturned for clustering. Customer IDs are not used because they carry no meaning for this analysis; ZipCode and Product are kept to interpret the clustering results.
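Before clustering, it is worth checking the scale of the two features. The snippet below is a minimal sketch (using the DataFrame loaded above) that shows why the normalization in the next step is needed:

# Check the ranges of the two clustering features
data[['ItemsBought', 'ItemsReturned']].describe()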

Now we need to calculate the distance between records in order to cluster them. Before moving on, we have to normalize the data so that ItemsBought and ItemsReturned are on the same scale. Here we will use MinMaxScaler() to perform the normalization.

# Load required packages (KMeans, matplotlib, preprocessing)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn import preprocessing

# Copy and preprocess the original data (do not preprocess on the original data immediately)
processed_data = data.copy()

# Data preprocessing: tasks for normalization
scaler = preprocessing.MinMaxScaler()
processed_data[['ItemsBought', 'ItemsReturned']] = scaler.fit_transform(processed_data[['ItemsBought', 'ItemsReturned']])

# Create a figure
plt.figure(figsize = (10, 6))
# Repeat the test while increasing the K value
for i in range(1, 7):
    # Cluster creation
    estimator = KMeans(n_clusters = i)
    ids = estimator.fit_predict(processed_data[['ItemsBought', 'ItemsReturned']])
    # Add a subplot in a 3-row by 2-column grid (index = i)
    plt.subplot(3, 2, i)
    plt.tight_layout()
    # Label the subplot
    plt.title("K value = {}".format(i))
    plt.xlabel('ItemsBought')
    plt.ylabel('ItemsReturned')
    # Draw the clustering result
    plt.scatter(processed_data['ItemsBought'], processed_data['ItemsReturned'], c=ids)
plt.show()

The result is as follows:

We can see that the clustering looks good when K is either 3 or 4. With K = 3, the clusters can be interpreted as: customers who buy a lot and keep their purchases, customers who buy a lot but return a few items, and customers who buy little and return a lot.
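If you want a numeric check in addition to eyeballing the subplots, a common complement is the elbow method: plot the within-cluster sum of squares (exposed by scikit-learn as the inertia_ attribute) for each K and look for the bend. Here is a minimal sketch on the same normalized data:

# Elbow method: compute the inertia (within-cluster sum of squares) for each K
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters = k)
    model.fit(processed_data[['ItemsBought', 'ItemsReturned']])
    inertias.append(model.inertia_)

# Plot inertia against K; the 'elbow' suggests a reasonable K
plt.plot(range(1, 7), inertias, marker = 'o')
plt.xlabel('K value')
plt.ylabel('Inertia')
plt.show()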

■ Let’s look at the relationship between each cluster and product ID

This time, let’s set K to 3 and label each point with its product ID.

# Cluster with K = 3
estimator = KMeans(n_clusters = 3)

# Clustering
cluster_ids = estimator.fit_predict(processed_data[['ItemsBought', 'ItemsReturned']])

# create a scatter plot
plt.scatter(processed_data['ItemsBought'], processed_data['ItemsReturned'], c=cluster_ids)

# Annotate each point with its cluster and product ID
for index, c_id, bought, returned, zip_code, product in processed_data.itertuples():
    plt.annotate("Clu{}: {}".format(cluster_ids[index], product), (bought, returned))
    
plt.xlabel('ItemsBought')
plt.ylabel('ItemsReturned')
plt.show()

You will get the following results:

If you look at the graph, you can see that in Clu1, product 2435 is not sold frequently, yet many returns occur. Since the plotted data is normalized, let’s go back to the original data and take a closer look at that product.

# Let's extract data classified as cluster 1
data[cluster_ids == 1]

Now we know which cluster is the ‘bad cluster’ and where those customers live. From here, you can work on the problem by digging into the customer data!
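To profile all three clusters at once rather than only cluster 1, you can also average the original (non-normalized) columns per cluster ID. This is a sketch assuming the cluster_ids array from the K = 3 run above:

# Average purchases and returns per cluster, on the original scale
data.groupby(cluster_ids)[['ItemsBought', 'ItemsReturned']].mean()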

■ Let’s look at the relationship between each cluster and region

This time, let’s use the clustering to gain insights for local marketing. ZipCode does not need to be preprocessed separately, and we can reuse the cluster IDs computed on the preprocessed data because the order of records stays the same after preprocessing.

# Plotting
plt.scatter(data['ItemsBought'], data['ItemsReturned'], c=cluster_ids)

# Annotate each point with its zip code
for (index, c_id, bought, returned, zip_code, product) in data.itertuples():
    plt.annotate(zip_code, (bought + 0.6, returned + 0.6))
    
plt.xlabel('ItemsBought')
plt.ylabel('ItemsReturned')

plt.show()

Looking at the results, you can see that performance is good in the areas with zip codes 1 and 2. Check what marketing you are doing in regions 1 and 2, see whether it can be applied to other regions, and look for ways to improve performance elsewhere!
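If you want to quantify this instead of reading it off the scatter plot, one option is to cross-tabulate zip codes against cluster IDs; again a sketch assuming the cluster_ids array from the K = 3 run:

# Count how many customers from each zip code fall into each cluster
pd.crosstab(data['ZipCode'], cluster_ids, rownames = ['ZipCode'], colnames = ['Cluster'])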

Please check the guided file with the link below:
https://github.com/lucy-the-marketer/k-means-clustering
