Introduction:
In this experiment, we will implement the K-means clustering algorithm, understand how it clusters data, and apply it to image compression.
The datasets used in this experiment include:
- ex3data1.mat - a 2D dataset
- hzau.jpeg - an image used to test the image compression performance of the K-means clustering algorithm
The scoring criteria are as follows:
- Key point 1: find the nearest class center (20 points)
- Key point 2: calculate the mean class center (20 points)
- Key point 3: randomly initialize the class centers (10 points)
- Key point 4: K-means clustering algorithm (20 points)
- Key point 5: image compression (30 points)
In [1]:
# Import the required library files
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sb
from scipy.io import loadmat
%matplotlib inline
1 K-means Clustering
In this part of the experiment, the K-means clustering algorithm will be implemented.
In each iteration, the algorithm mainly includes two parts: finding the nearest class center and calculating the mean class center.
In addition, based on the needs of initialization, it is necessary to create a function to select random samples and use them as the initial cluster center.
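For orientation, one run of K-means simply alternates these two steps until the class centers stop moving. Below is a minimal sketch of the overall loop; k_means_sketch is only an illustrative name, and it assumes the three helper functions implemented in the sections that follow (the full algorithm is completed as Key point 4).

def k_means_sketch(X, k, max_iters=10):
    centroids = init_centroids(X, k)                 # Key point 3: random initialization
    for _ in range(max_iters):
        idx = find_closest_centroids(X, centroids)   # Key point 1: assignment step
        centroids = compute_centroids(X, idx, k)     # Key point 2: update step
    return idx, centroids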
1.1 Find the Nearest Class Center
In this part of the experiment, we will find the nearest class center for each sample point and assign it to the corresponding class.
The specific update formula is as follows:
$$c_i := \underset{j=1,\dots,K}{\arg\min}\; \lVert x_i - \mu_j \rVert^2$$

where $x_i$ is the $i$-th sample point and $\mu_j$ is the mean class center of the $j$-th class.
**Key point 1:** In the cell below, please implement the code for **"finding the nearest class center"**.
In [2]:
# ======================Fill in the code here=======================
def find_closest_centroids(X, centroids):
    """
    Input
    ----------
    X : (m, n) matrix; the i-th row is the i-th sample and n is the sample dimension.
    centroids : (k, n) matrix, where k is the number of classes.

    Output
    -------
    idx : (m,) array; the i-th component is the class index of the i-th sample.
    """
    m = X.shape[0]
    k = centroids.shape[0]
    idx = np.zeros(m, dtype=int)
    for i in range(m):
        min_dist = np.inf
        for j in range(k):
            # Squared Euclidean distance from sample i to center j
            dist = np.sum((X[i, :] - centroids[j, :]) ** 2)
            if dist < min_dist:
                min_dist = dist
                idx[i] = j
    return idx
# =============================================================
Once find_closest_centroids is implemented, the following code can be used to test it. If the result is [0 2 1], the test passes.
In [3]:
# Import data
data = loadmat('ex3data1.mat')
X = data['X']
X1 = X
initial_centroids = np.array([[3, 3], [6, 2], [8, 5]])
idx = find_closest_centroids(X, initial_centroids)
idx[0:3]
Out[3]:
array([0, 2, 1])
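As a side note, the same assignment step can also be written without the explicit loops by broadcasting all pairwise squared distances at once. The sketch below is only an alternative; find_closest_centroids_vec is an illustrative name, not part of the assignment.

# Vectorized alternative: compute the full (m, k) matrix of squared distances
def find_closest_centroids_vec(X, centroids):
    # X[:, None, :] has shape (m, 1, n); centroids[None, :, :] has shape (1, k, n)
    dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.argmin(dists, axis=1)  # index of the nearest center for each sample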
In [4]:
# Display and view some data
data2 = pd.DataFrame(data.get('X'), columns=['X1', 'X2'])
data2.head()
Out[4]:
|   | X1       | X2       |
|---|----------|----------|
| 0 | 1.842080 | 4.607572 |
| 1 | 5.658583 | 4.799964 |
| 2 | 6.352579 | 3.290854 |
| 3 | 2.904017 | 4.612204 |
| 4 | 3.231979 | 4.939894 |
In [5]:
# Visualize the 2D data
fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(X[:, 0], X[:, 1], s=30, color='k', label='Original')
ax.legend()
plt.show()
1.2 Calculate the Mean Class Center
In this part of the experiment, we take the mean of each class of samples as the new class center.
The specific update formula is as follows:
$$\mu_j := \frac{1}{|C_j|} \sum_{i \in C_j} x_i$$

where $C_j$ is the set of indices of the samples assigned to class $j$ and $|C_j|$ is the number of elements in $C_j$.
**Key point 2:** In the cell below, please implement the code for **"calculating the mean class center"**.
In [6]:
# ======================Fill in the code here=======================
def compute_centroids(X, idx, k):
    m, n = X.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        # Accumulate the samples assigned to class i and take their mean
        num = 0
        total = np.zeros(n)
        for j in range(m):
            if idx[j] == i:
                num = num + 1
                total = total + X[j, :]
        centroids[i] = total / num
    return centroids
# =============================================================
In [7]:
# Test the compute_centroids implementation
compute_centroids(X, idx, 3)
Out[7]:
array([[2.42830111, 3.15792418],
       [5.81350331, 2.63365645],
       [7.11938687, 3.6166844 ]])
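The same update can also be expressed with a boolean mask per class. The sketch below is only an alternative formulation; compute_centroids_vec is an illustrative name, and it assumes every class has at least one member.

# Vectorized alternative: average the members of each class selected by a boolean mask
def compute_centroids_vec(X, idx, k):
    return np.array([X[idx == i].mean(axis=0) for i in range(k)])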
1.3 Randomly Initialize the Class Centers
k samples are randomly selected and used as the initial class centers.
**Key point 3:** In the cell below, please implement the code for **"randomly initializing the class centers"**. Specifically, randomly select k samples and use them as the initial class centers.
In [8]:
# ======================Fill in the code here=======================
def init_centroids(X, k):
    m, n = X.shape
    # Pick k random sample indices and copy those samples as the initial centers
    idx = np.random.randint(0, m, k)
    centroids = np.zeros((k, n))
    for i in range(k):
        centroids[i, :] = X[idx[i], :]
    return centroids
# =============================================================
In [9]:
# Test the random initialization code
init_centroids(X, 3)
Out[9]:
array([[3.30063655, 1.28107588],
       [1.02285128, 5.0105065 ],
       [6.59702155, 3.07082376]])
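One caveat: np.random.randint samples with replacement, so two initial centers can occasionally coincide, which later leaves a cluster empty. If that matters, drawing the indices without replacement is a safe alternative; the sketch below is only illustrative and init_centroids_no_repeat is a hypothetical name.

# Draw k distinct sample indices so no two initial centers coincide
def init_centroids_no_repeat(X, k):
    idx = np.random.choice(X.shape[0], k, replace=False)
    return X[idx, :]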
1.4 Implement the K-means Clustering Algorithm
**Key point 4:** In the cell below, please combine the steps above to implement the code for the **"K-means clustering algorithm"**.
In [10]:
# ======================Fill in the code here=======================
def run_k_means(X, initial_centroids, max_iters):
    m, n = X.shape
    k = initial_centroids.shape[0]
    idx = np.zeros(m)
    centroids = initial_centroids
    centroids_last = initial_centroids
    for i in range(max_iters):
        # Assignment step: assign each sample to its nearest class center
        idx = find_closest_centroids(X, centroids)
        # Update step: recompute each class center as the mean of its samples
        centroids = compute_centroids(X, idx, k)
        # Stop early if the class centers no longer change
        if (centroids == centroids_last).all():
            break
        centroids_last = centroids
    return idx, centroids
# =============================================================
2 Apply the K-means Clustering Algorithm to Dataset 1
In this part of the experiment, the implemented K-means clustering algorithm is applied to dataset 1. The samples in this dataset are two-dimensional, so the clustering result can be inspected visually.
In [11]:
idx, centroids = run_k_means(X, initial_centroids, 10)
In [12]:
# Visualize the clustering result
cluster1 = X[np.where(idx == 0)[0], :]
cluster2 = X[np.where(idx == 1)[0], :]
cluster3 = X[np.where(idx == 2)[0], :]
fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(cluster1[:, 0], cluster1[:, 1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:, 0], cluster2[:, 1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:, 0], cluster3[:, 1], s=30, color='b', label='Cluster 3')
ax.legend()
plt.show()
3 Apply the K-means Clustering Algorithm to Image Compression
In [13]:
# Read the image
A = mpl.image.imread('hzau.jpeg')
A.shape
Out[13]:
(96, 150, 3)
Now the data needs some preprocessing before it is provided to the K-means algorithm.
In [14]:
# Normalize the range of image pixel values to [0, 1]
A = A / 255.
# Reshape the image into an (m, 3) matrix, one row per pixel
X = np.reshape(A, (A.shape[0] * A.shape[1], A.shape[2]))
X.shape
Out[14]:
(14400, 3)
**Key point 5:** In the cell below, please use the K-means clustering algorithm to **implement image compression**. Specifically, replace each original pixel with the pixel value of its corresponding mean class center.
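This is where the compression comes from: with 16 class centers, each pixel only needs a 4-bit index into a 16-color palette instead of 24 bits of RGB. The storage therefore drops from roughly 96 × 150 × 24 = 345,600 bits to 96 × 150 × 4 + 16 × 24 ≈ 58,000 bits, about a factor of six.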
In [21]:
# ======================Fill in the code here=======================
# Randomly initialize 16 class centers (one per compressed color)
initial_centroids = init_centroids(X, 16)
m = X.shape[0]
# Run K-means, then assign every pixel to its final class center
idx, centroids = run_k_means(X, initial_centroids, 10)
idx = find_closest_centroids(X, centroids)
# Replace each pixel with the color of its class center
A_compressed = X.copy()
for i in range(m):
    A_compressed[i, :] = centroids[idx[i], :]
A_compressed = np.reshape(A_compressed, (A.shape[0], A.shape[1], A.shape[2]))
print(A_compressed.shape)
# =============================================================
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:13: RuntimeWarning: invalid value encountered in true_divide
  del sys.path[0]
(96, 150, 3)
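The RuntimeWarning above is raised inside compute_centroids when one of the 16 clusters ends up with no assigned pixels: num is then 0 and the 0/0 division produces NaN values for that center. One simple way to guard against this is to keep the previous center (or re-draw a random sample) whenever a cluster is empty. The sketch below is only a hedged variant; compute_centroids_safe and prev_centroids are hypothetical names, not part of the assignment.

# A variant of compute_centroids that keeps a cluster's previous center
# when no samples are assigned to it (avoiding the 0/0 division).
def compute_centroids_safe(X, idx, k, prev_centroids):
    m, n = X.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        members = X[idx == i]                      # samples currently assigned to class i
        if members.shape[0] == 0:
            centroids[i] = prev_centroids[i]       # empty cluster: keep the old center
        else:
            centroids[i] = members.mean(axis=0)
    return centroids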
In [22]:
# Display the images before and after compression
fig, ax = plt.subplots(1, 2, figsize=(9, 6))
ax[0].imshow(A)
ax[0].set_axis_off()
ax[0].set_title('Original image')
ax[1].imshow(A_compressed)
ax[1].set_axis_off()
ax[1].set_title('Compressed image')
plt.show()
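4 Extension: k-center (Medoid-based) Clustering on Dataset 1
The remaining cells repeat the clustering of dataset 1 with a medoid-style variant of the algorithm: instead of the class mean, each class center is the member sample whose total Manhattan distance to the other samples in its class is smallest. The assignment step and the outer loop stay the same.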
In [27]:
# Compute each class center as the member sample (medoid) with the smallest
# total Manhattan distance to the other samples in its class
# ======================Fill in the code here=======================
def Manhattan(x, y):
    # Manhattan (L1) distance between two samples
    return np.sum(np.abs(x - y))

def compute_mid(X, idx, k):
    m, n = X.shape
    centroids = np.zeros((k, n))
    for i in range(k):
        min_total = np.inf
        for j in range(m):
            if idx[j] == i:
                # Total Manhattan distance from candidate j to its classmates
                total = 0
                for h in range(m):
                    if idx[h] == i and j != h:
                        total = total + Manhattan(X[j, :], X[h, :])
                if total < min_total:
                    min_total = total
                    centroids[i, :] = X[j, :]
    return centroids
In [28]:
# Implement the k-center (medoid-based) clustering algorithm
def run_k_mid(X, initial_centroids, max_iters):
    m, n = X.shape
    k = initial_centroids.shape[0]
    idx = np.zeros(m)
    centroids = initial_centroids
    centroids_last = initial_centroids
    for i in range(max_iters):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_mid(X, idx, k)
        # Stop early if the class centers no longer change
        if (centroids == centroids_last).all():
            break
        centroids_last = centroids
    return idx, centroids
# =============================================================
In [29]:
initial_centroids = init_centroids(X1, 3)
idx, centroids = run_k_mid(X1, initial_centroids, 10)
In [30]:
# Visualize the k-center clustering result
cluster1 = X1[np.where(idx == 0)[0], :]
cluster2 = X1[np.where(idx == 1)[0], :]
cluster3 = X1[np.where(idx == 2)[0], :]
fig, ax = plt.subplots(figsize=(9, 6))
ax.scatter(cluster1[:, 0], cluster1[:, 1], s=30, color='r', label='Cluster 1')
ax.scatter(cluster2[:, 0], cluster2[:, 1], s=30, color='g', label='Cluster 2')
ax.scatter(cluster3[:, 0], cluster3[:, 1], s=30, color='b', label='Cluster 3')
ax.legend()
plt.show()