k-means algorithm and its optimization

Posted by kark_1999 on Fri, 12 Nov 2021 23:20:12 +0100

k-means clustering algorithm

Algorithm principle

The principle behind k-means is actually very simple.

I will explain it with the simplest possible example: a two-dimensional scatter plot.

As shown in the figure above, we can intuitively see that the points fall into two groups, represented by red dots and blue dots respectively.

Let's simulate how k-means would cluster this original two-dimensional scatter plot.

First, randomly initialize two cluster centers. As for what exactly a cluster center is, let's set that aside for the moment; for now, just treat it as a point.

Then we mark every sample point that is closer to the red center as red, and every sample point that is closer to the blue center as blue.

Then we reset the positions of the cluster centers: the red cluster center moves to the center (mean) of the current red points, the blue cluster center moves to the center (mean) of the current blue points, and all the sample points are reset to black.

Then we again mark every sample point closer to the red center as red, and every sample point closer to the blue center as blue.

At this point we can see that the red and blue groups are already clearly separated and the desired effect has been achieved, but the machine has no eyes and cannot see this intuitively, so it does not yet know that it can stop.

So it continues, resetting the cluster centers by taking the mean of each group.

Then it again marks every sample point closer to the red center as red, and every sample point closer to the blue center as blue.

Then it resets the cluster centers by taking the means once more. This time, however, the machine finds that the new cluster centers have barely moved from the previous ones. That tells it the algorithm has converged and the category of each sample is essentially settled, so the algorithm terminates and the clustering is complete.

Algorithm steps

  1. Select K initial cluster centers

  2. Calculate the distance from each sample point to each of the K cluster centers, and assign the sample to the cluster whose center is nearest

  3. Compute the mean of the samples in each cluster and use it as that cluster's new center

  4. Repeat steps 2 and 3 until the new cluster centers move less than a set threshold from the previous ones (a sketch of one iteration follows this list)
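
Concretely, one round of steps 2 and 3 could look like the minimal NumPy sketch below (the toy data and variable names are illustrative, not part of the implementation that follows):

import numpy as np

# Toy data: six 2-D points that form two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
centers = X[[0, 3]]  # two samples picked as initial centers

# Step 2: label each point with the index of its nearest center
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # shape (6, 2)
labels = dists.argmin(axis=1)

# Step 3: move each center to the mean of the points assigned to it
centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])

Repeating these two steps until centers stops moving is the whole algorithm.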

Algorithm implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

class KMeans:
    def __init__(self, k, threshold):
        self.k = k
        self.threshold = threshold

    def train(self, X):
        # Initialize the cluster centers:
        # randomly select k distinct samples as the initial centers
        self.centers = X[np.random.choice(X.shape[0], size=self.k, replace=False)]
        while True:
            # Group samples by their nearest cluster center
            clusters = [[] for _ in range(self.k)]
            for i, x in enumerate(X):
                # clusters stores sample indices rather than the samples themselves to save memory
                clusters[self.predict(x)].append(i)
            new_centers = self.centers.copy()
            # Calculate the new cluster centers (skip empty clusters, which have no mean)
            for i, cluster in enumerate(clusters):
                if cluster:
                    new_centers[i] = X[cluster].mean(axis=0)
            # If the cluster centers have barely moved, terminate
            if np.max(np.abs(new_centers - self.centers)) < self.threshold:
                break
            # Otherwise, update the cluster center and repeat the above steps
            self.centers = new_centers
        # Return clustering results
        y_pred = np.zeros(shape=(X.shape[0],))
        for cluster_index, cluster in enumerate(clusters):
            for i in cluster:
                y_pred[i] = cluster_index
        return y_pred

    def predict(self, x):
        dis = []
        # Calculate the distance from this sample to each cluster center
        for c in self.centers:
            dis.append(np.linalg.norm(x - c))
        # Return the index of the nearest cluster center
        return np.argmin(dis)


def main():
    # k = 3 clusters, convergence threshold 1e-2
    model = KMeans(3, 1e-2)
    # Generate synthetic 2-D data drawn from 3 Gaussian blobs
    X, y = make_blobs(n_samples=1000, n_features=2, centers=3)
    y_pred = model.train(X)

    # Plot the true labels, then the predicted clusters
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()

    plt.scatter(X[:, 0], X[:, 1], c=y_pred)
    plt.show()

if __name__ == '__main__':
    main()
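
As a sanity check, scikit-learn ships its own optimized KMeans; a quick comparison run (parameter values here are illustrative) could look like:

from sklearn.cluster import KMeans as SklearnKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=2, centers=3)
y_sk = SklearnKMeans(n_clusters=3, n_init=10).fit_predict(X)

Note that cluster indices are arbitrary, so the labels from two runs can only agree up to a permutation of the cluster numbers.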

Optimization and improvement

OK, now that we have walked through the k-means algorithm, let's think about what can go wrong with it.

Cluster center initialization

Look at the following two figures, which show a result the above program occasionally produces. On the left is the original data; on the right, the clustered data.

Although the clustered data is cleanly separated, it is wildly inconsistent with the distribution of the original data. This is a consequence of how the cluster centers were initialized.

For example, in the example introduced earlier, suppose the cluster centers happen to be initialized here:

Then everything below the green line gets marked red and everything above the green line gets marked blue. Once we take the means and reset the cluster centers, we find that their positions barely change, so this is the clustering we end up with.

It is also cleanly separated, but it is obviously not the clustering we want.

So how do we solve this problem? Clearly, purely random initialization of the cluster centers is not good enough.
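
A widely used remedy is k-means++ initialization: pick the first center uniformly at random, then pick each subsequent center with probability proportional to its squared distance from the nearest center chosen so far, so the initial centers tend to spread out. A minimal sketch of that idea (the helper name kmeans_pp_init is mine; this is one possible fix, not the method described in this post), which could replace the random initialization in train:

import numpy as np

def kmeans_pp_init(X, k):
    # First center: a uniformly random sample
    centers = [X[np.random.randint(X.shape[0])]]
    for _ in range(k - 1):
        # Squared distance from every sample to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Choose the next center with probability proportional to d2
        centers.append(X[np.random.choice(X.shape[0], p=d2 / d2.sum())])
    return np.array(centers)

Another common mitigation is to run k-means several times from different random initializations and keep the run with the smallest total within-cluster distance (this is what scikit-learn's n_init parameter does).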

To be added...

Topics: Algorithm Data Mining kmeans