Implementation of hierarchical clustering in Python

Posted by hostcord on Sun, 31 Oct 2021 08:34:21 +0100

Hierarchical clustering algorithm

As the name suggests, hierarchical clustering builds clusters layer by layer. Large clusters can be split from the top down, which is called the divisive (splitting) method; small clusters can also be merged from the bottom up, which is called the agglomerative (condensation) method. The bottom-up agglomerative method is the one more commonly used in practice.

Splitting (divisive) method:

The splitting method starts with all samples in a single cluster, which is then split step by step according to some criterion until a stopping condition or the set number of clusters is reached. Described as an algorithm:
Input: sample set D, and the number of clusters or a stopping condition (usually a threshold on sample distance, in which case the number of clusters need not be set)
Output: clustering result

1. Assign all samples in the sample set to a single cluster;
repeat:
    2. Within one cluster (call it c), compute the pairwise distances between samples and find the two samples a, b that are farthest apart;
    3. Assign samples a and b to two new clusters c1 and c2;
    4. For every remaining sample point in the original cluster c, compute its distance to a and to b; if dis(a) < dis(b), assign the point to c1, otherwise to c2;
until: the number of clusters or the stopping condition is reached
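To make steps 2-4 concrete, here is a minimal sketch of a single split step, assuming Euclidean distance and a NumPy array of samples (split_cluster is an illustrative helper, not a library function):

import numpy as np

def split_cluster(points):
    """One divisive step: split a cluster at its two farthest points."""
    # Pairwise Euclidean distances between all samples in the cluster.
    diff = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))
    # The two samples that are farthest apart become the seeds a and b.
    a, b = np.unravel_index(np.argmax(dists), dists.shape)
    # Every remaining point joins the seed it is closer to.
    to_a = dists[:, a] <= dists[:, b]
    return points[to_a], points[~to_a]

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
c1, c2 = split_cluster(X)
print(c1)  # the points near (0, 0)
print(c2)  # the points near (5, 5)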

Agglomerative (condensation) method:

The agglomerative method initially treats every sample point as its own cluster, so the number of initial clusters equals the number of sample points; these initial clusters are then merged according to some criterion until a stopping condition or the set number of clusters is reached. Described as an algorithm:
Input: sample set D, and the number of clusters or a stopping condition (usually a threshold on sample distance, in which case the number of clusters need not be set)
Output: clustering result

1. Treat every sample point in the sample set as an independent cluster;
repeat:
    2. Compute the pairwise distances between clusters (discussed below) and find the two clusters c1 and c2 with the smallest distance;
    3. Merge clusters c1 and c2 into a single cluster;
until: the number of clusters or the stopping condition is reached
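The loop above maps almost directly to code. Below is a minimal, unoptimized sketch assuming Euclidean distance and single linkage (the nearest pair of points) as the between-cluster distance; agglomerate is an illustrative helper, and in practice you would use scipy.cluster.hierarchy.linkage, as the example later in this post does:

import numpy as np

def agglomerate(points, k):
    """Naive agglomerative clustering down to k clusters (single linkage)."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    while len(clusters) > k:
        best = (None, None, np.inf)
        # Find the two clusters whose closest members are nearest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters[j]  # merge the closest pair of clusters
        del clusters[j]
    return clusters

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [9.0, 0.0]])
print(agglomerate(X, 3))  # [[0, 1], [2, 3], [4]]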

The distance between individual samples is usually measured with the Euclidean distance:

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xm - ym)^2)
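In NumPy this is a one-liner, which makes for a quick check of the formula:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])
# sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = sqrt(9 + 16 + 0) = 5.0
print(np.linalg.norm(x - y))  # 5.0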

There are several ways to measure the distance between two clusters:
(1) Single linkage: take the distance between the nearest two samples, one from each cluster, as the distance between the two sets; the smaller the distance between this nearest pair, the more similar the two clusters.
(2) Complete linkage: take the distance between the two farthest points, one from each cluster, as the distance between the two sets.
(3) Average linkage: collect the distances between all pairs of points, one from each cluster, and take the mean; this usually gives more robust results.
e.g. this is how the distance from the merged data point (A,F) to (B,C) is calculated: take the mean of the pairwise distances between the members of (A,F) and the members of (B,C) (see the code sketch after this list).
(4) Median linkage: taking the median of the pairwise distances, instead of the mean, is better at suppressing the influence of individual outlier samples on the result.
(5) Centroid linkage: compute the center point of each set (the vector obtained by summing all elements of the set in each dimension and dividing by the number of elements), replace each set with its center point, and measure the distance between the center points.
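As a short illustration of linkages (1)-(3), here is a sketch computing single, complete, and average linkage between two small clusters with NumPy. The labels follow the (A,F)/(B,C) example above, but the coordinates are made up for illustration:

import numpy as np

def pairwise_dists(c1, c2):
    """All pairwise Euclidean distances between points of c1 and c2."""
    return np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2)

# Hypothetical coordinates for the merged cluster (A, F) and the cluster (B, C).
AF = np.array([[1.0, 1.0], [2.0, 1.0]])   # A, F
BC = np.array([[4.0, 5.0], [5.0, 5.0]])   # B, C

d = pairwise_dists(AF, BC)
print(d.min())    # (1) single linkage: nearest pair
print(d.max())    # (2) complete linkage: farthest pair
print(d.mean())   # (3) average linkage: mean of all pairwise distances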
Next, the method is demonstrated on a World Bank sample data set, stored in a standard-format CSV file named WBClust2013.csv with 80 rows of data and 14 variables.

To make the results easier to visualize, I dropped the last column (population data) and, after the plain hierarchical clustering, added a PCA dimensionality-reduction step so that the clustering of the reduced data can be compared with the original result.

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


data = pd.read_csv('data/WBClust2013.csv')
data.pop('Pop')                  # drop the population column
# data.pop('RuralWater')
# data.pop('CellPhone')
# data.pop('LifeExp')
data = data[:20]                 # keep the first 20 countries
country = list(data['Country'])  # save the labels for the dendrogram
data.pop('Country')              # cluster on the numeric columns only

# The following code uses hierarchical clustering only

Z = linkage(data, method='average')  # average linkage on the raw features
print(Z)
plt.figure(figsize=(9, 7))
plt.title("original data")
dendrogram(Z, labels=country, leaf_rotation=45, leaf_font_size=8)
plt.show()
# Cut the tree into at most 3 clusters.
cluster_assignments = fcluster(Z, t=3.0, criterion='maxclust')
print(cluster_assignments)
for i in range(1, 4):
    print('cluster', i, ':')
    num = 1
    for index, value in enumerate(cluster_assignments):
        if value == i:
            if num % 5 == 0:
                print()          # wrap the country names every few entries
            num += 1
            print(country[index], end='  ')
    print()
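If you would rather cut the dendrogram at a distance threshold instead of forcing a fixed number of clusters (the "certain condition" mentioned in the algorithm descriptions above), fcluster also supports a distance criterion. A small sketch, with the threshold value chosen arbitrarily:

# Cut the same linkage matrix at an (arbitrarily chosen) distance threshold
# instead of a maximum number of clusters.
labels_by_distance = fcluster(Z, t=50.0, criterion='distance')
print(labels_by_distance)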

# The following code is to add PCA for comparison
class myPCA():

    def __init__(self, X, d=2):
        self.X = X.astype(float)  # work in float so centering is exact
        self.d = d

    def mean_center(self):
        """
        Decentralization: subtract each column's mean so that every
        feature is centered at zero.
        """
        self.X = self.X - self.X.mean(axis=0)

    def runPCA(self):
        # Scatter matrix of the centered data; its eigenvectors are the
        # principal directions (eigh suits real symmetric matrices).
        S = np.dot(self.X.T, self.X)
        S_val, S_vectors = np.linalg.eigh(S)
        # Keep the d eigenvectors with the largest eigenvalues.
        index = np.argsort(-S_val)[0:self.d]
        W = S_vectors[:, index]
        # Project the samples onto the principal directions.
        Y = np.dot(self.X, W)
        return Y
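As a sanity check, the projection from myPCA can be compared with scikit-learn's PCA. This assumes scikit-learn is installed, and the comparison takes absolute values because each principal axis is only defined up to a sign flip:

from sklearn.decomposition import PCA

X_check = np.random.rand(20, 5)
ours = myPCA(X_check.copy(), d=2)
ours.mean_center()
Y_ours = ours.runPCA()
Y_sklearn = PCA(n_components=2).fit_transform(X_check)
# Each column should agree up to a sign flip.
print(np.allclose(np.abs(Y_ours), np.abs(Y_sklearn)))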

# data_for_pca = np.array(data)
# pcaObject = myPCA(data_for_pca, d=2)
# pcaObject.mean_center()
# res = pcaObject.runPCA()

# Z = linkage(res, method='average')
# print(Z)
# plt.figure(figsize=(9, 7))
# plt.title("after pca")
# dendrogram(Z, labels=country, leaf_rotation=45, leaf_font_size=8)
# plt.show()
# cluster_assignments = fcluster(Z, t=3.0, criterion='maxclust')
# print(cluster_assignments)
# for i in range(1, 4):
#     print('cluster', i, ':')
#     num = 1
#     for index, value in enumerate(cluster_assignments):
#         if value == i:
#             if num % 5 == 0:
#                 print()
#             num += 1
#             print(country[index], end='  ')
#     print()

The two clustering results (with and without PCA) are the same:

cluster 1 :
China  United States  Indonesia  Brazil  
Russian Federation  Japan  Mexico  Philippines  Vietnam  
Egypt, Arab Rep.  Germany  Turkey  Thailand  France  
United Kingdom  
cluster 2 :
India  Pakistan  Nigeria  Bangladesh  
cluster 3 :
Ethiopia  

The results can also be visualized as dendrograms.

Dendrogram of the original data:

Dendrogram after PCA dimensionality reduction:

Topics: Python, Machine Learning