Hierarchical Clustering (Partition Clustering)
Clustering takes a large amount of unlabeled data and divides it into several categories according to characteristics inherent in the data, so that data within a category are similar to each other while data in different categories have low similarity; it belongs to unsupervised learning.
Algorithmic steps (partition clustering, e.g. K-Means)
1. Initialize k centers
2. Assign a category to each sample based on its distance to the centers
3. Update the center of each category (set it to the mean of all samples in that category)
4. Repeat steps 2 and 3 until a termination condition is reached
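A minimal NumPy sketch of this partition-clustering loop, assuming Euclidean distance and stopping once the centers no longer move (the function name kmeans and the toy data are illustrative, not from the original text):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Minimal k-means: assign points to the nearest center, then move centers to the mean
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # 1. initialize k centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # 2. assign each sample to the closest center
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])  # 3. update centers to the cluster means
        if np.allclose(new_centers, centers):                     # 4. stop when the centers no longer move
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2) + c for c in ((-2, -2), (2, 2), (2, -2))])
centers, labels = kmeans(X, k=3)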
Hierarchical clustering decomposes a given dataset hierarchically until a certain condition is met. Traditional hierarchical clustering algorithms are divided into two main categories:
Agglomerative Hierarchical Clustering
The AGNES algorithm (AGglomerative NESting) ==> uses a bottom-up strategy.
Each object is initially treated as a cluster of its own, and these clusters are then merged step by step according to some criterion (a measure of similarity between two clusters); for example, the distance between two clusters can be taken as the distance between their closest data points. The merging process is repeated until all objects end up in the required number of clusters.
Intuitively, AGNES is like gathering individual fruits, one by one, into piles.
Selection of merge points:
- Maximum distance between points of the two clusters (complete linkage)
- Minimum distance between points of the two clusters (single linkage); ward linkage instead merges the pair of clusters that gives the smallest increase in within-cluster variance
- Average distance between points of the two clusters (average linkage)
For chain-shaped (elongated) data, the choice of linkage matters: minimum-distance linkage tends to follow the chain, while complete and ward linkage favor compact, blob-shaped clusters.
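As a quick illustration of these merge criteria, the sketch below computes the complete, single, and average inter-cluster distances for two toy clusters (ward is omitted because it compares the increase in within-cluster variance rather than a pairwise distance); the arrays A and B are illustrative:

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # cluster A
B = np.array([[3.0, 0.0], [4.0, 1.0]])   # cluster B
D = cdist(A, B)                          # all pairwise distances between the two clusters

complete_dist = D.max()    # maximum distance (complete linkage)
single_dist = D.min()      # minimum distance (single linkage)
average_dist = D.mean()    # average distance (average linkage)
print(complete_dist, single_dist, average_dist)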
Code:
linkages: ward, complete, average
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# AGNES (agglomerative clustering) from scikit-learn
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph  # k-nearest-neighbor graph
import sklearn.datasets as ds
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

# Set font properties so that Chinese labels render correctly
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False

# Simulated data: blob data with 4 centers
np.random.seed(0)
n_clusters = 4
N = 1000
data1, y1 = ds.make_blobs(n_samples=N, n_features=2,
                          centers=((-1, 1), (1, 1), (1, -1), (-1, -1)),
                          random_state=0)

# Add 10% uniform noise inside the bounding box of the data
n_noise = int(0.1 * N)
r = np.random.rand(n_noise, 2)
min1, min2 = np.min(data1, axis=0)
max1, max2 = np.max(data1, axis=0)
r[:, 0] = r[:, 0] * (max1 - min1) + min1
r[:, 1] = r[:, 1] * (max2 - min2) + min2
data1_noise = np.concatenate((data1, r), axis=0)
y1_noise = np.concatenate((y1, [4] * n_noise))

# Crescent (two-moon) data, with the same kind of noise added
data2, y2 = ds.make_moons(n_samples=N, noise=.05)
data2 = np.array(data2)
n_noise = int(0.1 * N)
r = np.random.rand(n_noise, 2)
min1, min2 = np.min(data2, axis=0)
max1, max2 = np.max(data2, axis=0)
r[:, 0] = r[:, 0] * (max1 - min1) + min1
r[:, 1] = r[:, 1] * (max2 - min2) + min2
data2_noise = np.concatenate((data2, r), axis=0)
y2_noise = np.concatenate((y2, [3] * n_noise))

def expandBorder(a, b):
    d = (b - a) * 0.1
    return a - d, b + d

# Plotting: fixed colors for the categories
cm = mpl.colors.ListedColormap(['#FF0000', '#00FF00', '#0000FF', '#d8e507', '#F0F0F0'])
plt.figure(figsize=(14, 12), facecolor='w')
linkages = ("ward", "complete", "average")  # linkage strategies to compare
for index, (n_clusters, data, y) in enumerate(((4, data1, y1),
                                               (4, data1_noise, y1_noise),
                                               (2, data2, y2),
                                               (2, data2_noise, y2_noise))):
    # 4x4 grid of subplots; the first column shows the raw data
    plt.subplot(4, 4, 4 * index + 1)
    plt.scatter(data[:, 0], data[:, 1], c=y, cmap=cm)
    plt.title(u'Raw data', fontsize=17)
    plt.grid(True, ls=':')
    min1, min2 = np.min(data, axis=0)
    max1, max2 = np.max(data, axis=0)
    plt.xlim(expandBorder(min1, max1))
    plt.ylim(expandBorder(min2, max2))

    # Connectivity constraint: only the 7 nearest neighbors of each sample
    # are considered when AGNES computes distances, to avoid redundant distance computations
    connectivity = kneighbors_graph(data, n_neighbors=7, mode='distance',
                                    metric='minkowski', p=2, include_self=True)
    connectivity = (connectivity + connectivity.T)

    for i, linkage in enumerate(linkages):
        # Build and fit the agglomerative clustering model
        ac = AgglomerativeClustering(n_clusters=n_clusters, affinity='euclidean',
                                     connectivity=connectivity, linkage=linkage)
        ac.fit(data)
        y = ac.labels_

        plt.subplot(4, 4, i + 2 + 4 * index)
        plt.scatter(data[:, 0], data[:, 1], c=y, cmap=cm)
        plt.title(linkage, fontsize=17)
        plt.grid(True, ls=':')
        plt.xlim(expandBorder(min1, max1))
        plt.ylim(expandBorder(min2, max2))

plt.tight_layout(pad=0.5, rect=(0, 0, 1, 0.95))
plt.show()
Results of AGNES with different merge (linkage) strategies are shown in the figure produced by the code above.
Divisive Hierarchical Clustering (similar to the top-down splitting of a decision tree)
The DIANA algorithm (DIvisive ANAlysis) ==> uses a top-down strategy.
All objects are first placed in a single cluster, which is then subdivided into smaller and smaller clusters according to some rule (e.g. by k-means) until a termination condition is reached (the number of clusters, or the distance between clusters, reaches a threshold).
1. Place all sample data into a single cluster and put it in a queue
2. Split that cluster into two subclusters (initialize two center points and cluster around them), and add the subclusters to the queue
3. Repeat step 2 until a termination condition is reached (number of clusters, minimum squared error, or number of iterations); a sketch of this loop is given after the list of split-point strategies below
Selection of Split Points:
- The error of each cluster
- The SSE (sum of squared errors) of each cluster (the preferred strategy)
- The cluster with the largest number of samples
Intuitively, DIANA is like breaking a loaf of bread into smaller and smaller pieces.
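The sketch below combines the three steps above with the SSE split criterion: the cluster with the largest SSE is repeatedly split in two with k-means. It is a simplified illustration, not the full DIANA algorithm; the names divisive_clustering and sse and the toy data are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def sse(X):
    # sum of squared errors of a cluster around its own mean
    return ((X - X.mean(axis=0)) ** 2).sum()

def divisive_clustering(X, n_clusters):
    clusters = [X]                                    # 1. start with one cluster holding all samples
    while len(clusters) < n_clusters:                 # 3. repeat until the target number of clusters
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))  # pick the cluster with the largest SSE
        target = clusters.pop(worst)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(target)  # 2. split it into two subclusters
        clusters += [target[labels == 0], target[labels == 1]]
    return clusters

X = np.vstack([np.random.randn(100, 2) + c for c in ((-3, 0), (3, 0), (0, 4))])
parts = divisive_clustering(X, n_clusters=3)
print([len(p) for p in parts])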
AGNES and DIANA
- Simple and easy to understand
- Selecting the merge/split point is not easy
- A merge/split operation cannot be undone (once the bread is cut, it cannot be joined back together)
- Not suitable for large datasets
- Low efficiency: O(t·n²), where t is the number of iterations and n is the number of samples
Optimization of AGNES
BIRCH (must master)
BIRCH algorithm (Balanced Iterative Reducing and Clustering using Hierarchies):
A clustering feature (CF) is a triple that summarizes the information of a cluster. BIRCH computes clustering features by building a clustering feature tree (CF tree) that satisfies constraints on the branching factor and the cluster diameter. The CF tree is a height-balanced tree with two parameters: the branching factor specifies the maximum number of children of each node, and the class diameter constrains how spread out the points summarized by a leaf entry may be; the CF of a non-leaf node is the sum of the CFs of its children.
The construction of the clustering feature tree is a dynamic process: the model can be updated with new data at any time.
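In scikit-learn this incremental behaviour is exposed through Birch.partial_fit, so the CF tree can be grown batch by batch; the sketch below assumes toy blob data and an arbitrary batch split:

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=5, random_state=0)

birch = Birch(threshold=0.5, n_clusters=5)
for batch in np.array_split(X, 10):   # feed the data in 10 batches
    birch.partial_fit(batch)          # the CF tree is updated with each batch

labels = birch.predict(X)             # assign all samples using the final tree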
The clustering feature triple: CF = (N, LS, SS), where N is the number of samples in the cluster, LS is the linear sum of the samples, and SS is the sum of the squared samples.
Construction of BIRCH
To insert a new sample, the tree is traversed level by level from the root down to a leaf, at each level choosing the child whose clustering feature is closest to the sample; the sample is absorbed into the nearest leaf entry if the diameter threshold is still satisfied, otherwise a new entry is created (splitting the node if the branching factor would be exceeded).
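The key property that makes this cheap is that clustering features are additive: merging two subclusters is just component-wise addition of their triples, and the centroid and radius needed for the diameter check can be derived from the triple alone. A small sketch (the function names cf, merge, and centroid_and_radius are illustrative):

import numpy as np

def cf(X):
    # clustering feature of a set of points: (N, linear sum, squared sum)
    return len(X), X.sum(axis=0), (X ** 2).sum()

def merge(cf1, cf2):
    # CF triples are additive, so merging two subclusters is component-wise addition
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_and_radius(cf_):
    n, ls, ss = cf_
    c = ls / n                                      # centroid of the subcluster
    r = np.sqrt(max(ss / n - (c ** 2).sum(), 0.0))  # average distance of the points to the centroid
    return c, r

A = np.random.randn(50, 2)
B = np.random.randn(30, 2) + 2
print(centroid_and_radius(merge(cf(A), cf(B))))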
Advantages and disadvantages:
- Suitable for large-scale datasets, with linear efficiency;
- Only suitable for datasets with a convex or spherical distribution, and it requires the number of clusters and the related inter-cluster parameters to be specified
Code implementation:
Library parameters:
- threshold: the (sub)cluster diameter threshold
- branching_factor: the branching factor
- n_clusters: the number of clusters
from itertools import cycle
from time import time
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Set font properties so that Chinese labels render correctly
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False

# Generate simulated data: a 10x10 grid of blob centers
xx = np.linspace(-22, 22, 10)
yy = np.linspace(-22, 22, 10)
xx, yy = np.meshgrid(xx, yy)
n_centres = np.hstack((np.ravel(xx)[:, np.newaxis],
                       np.ravel(yy)[:, np.newaxis]))
# 100,000 samples with 2 features drawn from 100 Gaussian blobs
X, y = make_blobs(n_samples=100000, n_features=2, centers=n_centres, random_state=28)

# Birch hierarchical clustering with different parameters (cluster diameter threshold / n_clusters)
birch_models = [
    Birch(threshold=1.7, n_clusters=None),
    Birch(threshold=0.5, n_clusters=None),
    Birch(threshold=1.7, n_clusters=100)
]
# threshold: threshold on the subcluster diameter; branching_factor: maximum number of CF subclusters per node
# Other parameters such as branching_factor can also be varied to see how the clustering changes

# Plotting
final_step = [u'diameter=1.7;n_clusters=None',
              u'diameter=0.5;n_clusters=None',
              u'diameter=1.7;n_clusters=100']
plt.figure(figsize=(12, 8), facecolor='w')
plt.subplots_adjust(left=0.02, right=0.98, bottom=0.1, top=0.9)
colors_ = cycle(list(colors.cnames.keys()))
cm = mpl.colors.ListedColormap(list(colors.cnames.keys()))

for ind, (birch_model, info) in enumerate(zip(birch_models, final_step)):
    t = time()
    birch_model.fit(X)
    time_ = time() - t

    # Get the model results (labels and subcluster centers)
    labels = birch_model.labels_
    centroids = birch_model.subcluster_centers_
    n_clusters = len(centroids)
    print("Birch algorithm, parameters: %s; fit time: %.3f s; number of cluster centers: %d"
          % (info, time_, len(np.unique(labels))))

    # Plot the clustering result of this model
    subinx = 221 + ind
    plt.subplot(subinx)
    for this_centroid, k, col in zip(centroids, range(n_clusters), colors_):
        mask = labels == k
        plt.plot(X[mask, 0], X[mask, 1], 'w', markerfacecolor=col, marker='.')
        if birch_model.n_clusters is None:
            plt.plot(this_centroid[0], this_centroid[1], '*',
                     markerfacecolor=col, markeredgecolor='k', markersize=2)
    plt.ylim([-25, 25])
    plt.xlim([-25, 25])
    plt.title(u'Birch algorithm %s, time %.3fs' % (info, time_))
    plt.grid(False)

# Show the original dataset
plt.subplot(224)
plt.scatter(X[:, 0], X[:, 1], c=y, s=1, cmap=cm, edgecolors='none')
plt.ylim([-25, 25])
plt.xlim([-25, 25])
plt.title(u'Raw data')
plt.grid(False)
plt.show()
Run result:
Birch algorithm, parameters: diameter=1.7;n_clusters=None; fit time: 2.510 s; number of cluster centers: 171
Birch algorithm, parameters: diameter=0.5;n_clusters=None; fit time: 6.689 s; number of cluster centers: 3205
Birch algorithm, parameters: diameter=1.7;n_clusters=100; fit time: 3.013 s; number of cluster centers: 100
Process finished with exit code 0
CURE (for understanding only)
CURE algorithm (Clustering Using REpresentatives):
The algorithm starts by treating each data point as its own cluster and then merges the closest clusters until the required number of clusters is reached. The difference from AGNES is that a cluster is represented neither by all of its points nor by a single center point plus a distance;
instead, a fixed number of well-scattered points are selected from each cluster as its representative points, and these representative points are shrunk toward the cluster center by an appropriate shrinking factor.
Shrinking the representative points lets the model adapt to non-spherical shapes, and the shrinking factor reduces the influence of noise on the clustering.
In short, a few representative points stand in for all the samples of the cluster.
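A sketch of how such representative points could be chosen for a single cluster: pick a few well-scattered points, then shrink them toward the centroid by a shrinking factor. This is only an illustration of the representative-point idea, not the full CURE algorithm; the function name representative_points and the parameters n_rep and alpha are illustrative:

import numpy as np

def representative_points(X, n_rep=4, alpha=0.3):
    # Pick n_rep well-scattered points and shrink them toward the cluster centroid
    centroid = X.mean(axis=0)
    reps = [X[np.argmax(np.linalg.norm(X - centroid, axis=1))]]   # start with the point farthest from the centroid
    while len(reps) < n_rep:
        # next representative: the point farthest from all representatives chosen so far
        d = np.min([np.linalg.norm(X - r, axis=1) for r in reps], axis=0)
        reps.append(X[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)    # shrink toward the centroid to dampen the effect of noise

cluster = np.random.randn(200, 2)
print(representative_points(cluster))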
Advantages and disadvantages: CURE can handle non-spherical distributions, and random sampling and partitioning can improve the execution efficiency of the algorithm.