Birch clustering algorithm

Posted by kevintynfron on Thu, 10 Feb 2022 16:17:41 +0100

1 Principle

1.1 B-tree

(1) m-way search tree

An m-way search tree is either an empty tree or a tree satisfying the following properties:

  • The root has at most m subtrees and has the following structure:

    $(n,\; P_0,\; K_1,\; P_1,\; K_2,\; P_2,\; \ldots,\; K_n,\; P_n)$

    where each $P_i$ is a pointer to a subtree, each $K_i$ is a key, and $K_1 < K_2 < \cdots < K_n$.

  • In subtree $P_i$ ($0 < i < n$), all keys are greater than $K_i$ and less than $K_{i+1}$.
  • In subtree $P_n$, all keys are greater than $K_n$.
  • In subtree $P_0$, all keys are less than $K_1$.
  • Each subtree $P_i$ is itself an m-way search tree.

(2) B tree

A B-tree of order m is an m-way search tree that is either empty or satisfies the following properties:

  • Each node in the tree has at most m subtrees.
  • The root node has at least two subtrees.
  • All non-terminal nodes except the root have at least $\lceil m/2 \rceil$ subtrees. For example, in a B-tree of order m = 4, every non-root internal node has between $\lceil 4/2 \rceil = 2$ and 4 subtrees.
  • All leaf nodes are on the same level.

1.2 Steps

For a step-by-step simulation of the process, see: https://www.cnblogs.com/pinard/p/6179132.html

References:

BIRCH can recognize uneven data distributions in a data set: points in dense regions are clustered, while points in sparse regions are removed as outliers. In addition, BIRCH is an incremental clustering method: the clustering decision for each point is based only on the data points processed so far, not on the global data set.
① Build the clustering feature (CF) tree
First, all the data is scanned once. Using the given amount of memory and recyclable space on disk, an initial in-memory CF tree is built to summarize the clustering information of the data set. Dense data is divided into finer subclusters, and sparse data points are removed as outliers.
② Condense the clustering feature tree (optional)
This optional phase bridges step ① and step ③. Similar to step ①, it scans the leaf entries of the initial CF tree to rebuild a smaller tree, removing more outliers and merging crowded subclusters.
③ Global clustering
A global or semi-global clustering algorithm is applied to all leaf entries. Any clustering algorithm designed for data points can be adapted to work on the subclusters, each represented by its CF vector: for example, the centroid of each subcluster is computed and the subcluster is represented by that centroid, which captures the main distribution of the data.
④ Cluster refinement
Because step ③ is only a rough summary of the data (the original data has been scanned just once), the clusters can be refined further. The centers of the clusters produced in the previous phase are used as seeds, and the data points are redistributed to the nearest seed to obtain a new set of clusters. This allows points to migrate between subclusters and ensures that all duplicates of a given data point end up in the same cluster. It also provides an option to discard outliers: a point that is too far from its nearest seed can be treated as an outlier and excluded from the result.
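A rough sketch of one refinement pass in step ④, with hypothetical names (this is an illustration of the idea, not the scikit-learn implementation):

import numpy as np

def refine_once(X, seeds):
    """One refinement pass: reassign every point to its nearest seed,
    then recompute each seed as the mean of its assigned points."""
    # Squared Euclidean distance from every point to every seed
    dists = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)
    # Keep the old seed if a cluster ends up empty
    new_seeds = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else seeds[j]
        for j in range(len(seeds))
    ])
    return labels, new_seeds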

2. Parameter description

Function: sklearn.cluster.Birch

Parameters:

  • threshold: (float, default=0.5) the radius of the subcluster obtained by merging a new sample with the closest subcluster must be smaller than the threshold; otherwise a new subcluster is started.
  • branching_factor: (int, default=50) the maximum number of CF subclusters in each node.
  • n_clusters: (int, sklearn.cluster estimator instance, or None, default=3) the number of clusters for the final clustering step. If None, the final clustering step is skipped and the subclusters are returned as they are; if an estimator instance is given, it performs the final clustering, treating the subcluster centroids as new samples.
  • compute_labels: (bool, default=True) whether to compute labels for each fit.
  • copy: (bool, default=True) whether to make a copy of the given data. If set to False, the initial data will be overwritten.

Properties:

  • root_: root of the CF tree
  • dummy_leaf_: dummy node whose next_leaf_ pointer gives access to the chain of all leaf nodes
  • subcluster_centers_: centroids of the subclusters read off all the leaves
  • subcluster_labels_: labels of the subcluster centroids after global clustering
  • labels_: labels of all the input data

3. Specific implementation

You can refer to the scikit-learn example: https://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html#sphx-glr-auto-examples-cluster-plot-birch-vs-minibatchkmeans-py
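A minimal usage sketch (synthetic data; the parameter values are only illustrative):

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Three well-separated Gaussian blobs
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# threshold and branching_factor control how the CF tree is built;
# n_clusters drives the final global clustering step
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(model.subcluster_centers_.shape)  # (n_subclusters, n_features)
print(np.unique(labels))                # array([0, 1, 2])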

4 Source code analysis

Source code: anaconda3/lib/site-packages/sklearn/cluster/birch.py

(1) Prefix knowledge

  • The hasattr() function checks whether a class instance has an attribute or method with the given name, returning True or False.

hasattr(obj, name), where obj is a class instance and name is the name of the attribute or method to look for.

  • The getattr() function returns the value of the specified attribute of a class instance.

getattr(obj, name[, default]), where obj is the class instance, name is the attribute name, and default is an optional fallback return value. If the attribute is not found, the call raises an AttributeError unless default is given, in which case default is returned.

  • setattr() is a bit more versatile: its basic use is to modify the value of an attribute of a class instance, but it can also dynamically add attributes or methods to the instance.
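A quick demonstration of all three built-ins:

class Point:
    def __init__(self):
        self.x = 1

p = Point()
print(hasattr(p, 'x'))     # True
print(hasattr(p, 'y'))     # False
print(getattr(p, 'x'))     # 1
print(getattr(p, 'y', 0))  # 0: the default instead of AttributeError
setattr(p, 'y', 2)         # dynamically adds attribute y
print(p.y)                 # 2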

(2) Birch function

  • class Birch(BaseEstimator, TransformerMixin, ClusterMixin); the mixin base classes are defined in sklearn's base module
  • Its parameters are the ones described in Section 2

  • fit function (the core computation happens in the _fit function)
 def fit(self, X, y=None):
        """
        Build a CF Tree for the input data.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Input data.

        y : Ignored
            Not used, present here for API consistency by convention.

        Returns
        -------
        self
            Fitted estimator.
        """
        self.fit_, self.partial_fit_ = True, False
        return self._fit(X)

    def _fit(self, X):
        
        X = self._validate_data(X, accept_sparse='csr', copy=self.copy)
        
        threshold = self.threshold
        branching_factor = self.branching_factor

        if branching_factor <= 1:
            raise ValueError("Branching_factor should be greater than one.")
        n_samples, n_features = X.shape

        # If partial_fit is called for the first time or fit is called, we
        # start a new tree.
        partial_fit = getattr(self, 'partial_fit_')
        has_root = getattr(self, 'root_', None)
        if getattr(self, 'fit_') or (partial_fit and not has_root):
            # The first root is the leaf. Manipulate this object throughout.
            self.root_ = _CFNode(threshold=threshold,
                                 branching_factor=branching_factor,
                                 is_leaf=True,
                                 n_features=n_features)

            # To enable getting back subclusters.
            self.dummy_leaf_ = _CFNode(threshold=threshold,
                                       branching_factor=branching_factor,
                                       is_leaf=True, n_features=n_features)
            self.dummy_leaf_.next_leaf_ = self.root_
            self.root_.prev_leaf_ = self.dummy_leaf_

        # Cannot vectorize. Enough to convince to use cython.
        if not sparse.issparse(X):
            iter_func = iter
        else:
            iter_func = _iterate_sparse_X
        
        # Traverse the data and build subclusters one sample at a time
        for sample in iter_func(X):
            
            subcluster = _CFSubcluster(linear_sum=sample)
            split = self.root_.insert_cf_subcluster(subcluster)
            # If the root was split, create a new (non-leaf) root and
            # attach the two resulting subclusters to it
            if split:
                new_subcluster1, new_subcluster2 = _split_node(
                    self.root_, threshold, branching_factor)
                del self.root_
                self.root_ = _CFNode(threshold=threshold,
                                     branching_factor=branching_factor,
                                     is_leaf=False,
                                     n_features=n_features)
                self.root_.append_subcluster(new_subcluster1)
                self.root_.append_subcluster(new_subcluster2)
        # Collect the centroids of all leaf subclusters
        centroids = np.concatenate([
            leaf.centroids_ for leaf in self._get_leaves()])
        self.subcluster_centers_ = centroids

        self._global_clustering(X)
        return self

Other functions:

Sparse matrix construction

def _iterate_sparse_X(X):
    """This little hack returns a densified row when iterating over a sparse
    matrix, instead of constructing a sparse matrix for every row that is
    expensive.
    """
    n_samples = X.shape[0]
    X_indices = X.indices
    X_data = X.data
    X_indptr = X.indptr
    
    for i in range(n_samples):
        row = np.zeros(X.shape[1])
        startptr, endptr = X_indptr[i], X_indptr[i + 1]
        nonzero_indices = X_indices[startptr:endptr]
        row[nonzero_indices] = X_data[startptr:endptr]
        yield row
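For example (toy matrix, assuming the numpy/scipy imports used by birch.py), each yielded row is a dense 1-D array:

import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[0., 2., 0.],
                                [1., 0., 3.]]))
for row in _iterate_sparse_X(X):
    print(row)  # [0. 2. 0.] then [1. 0. 3.]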

Node-splitting function: two new subclusters and two new CF nodes are created, and each new node is attached as the child of one of the new subclusters. If the node being split is a leaf, the leaf-chain pointers are rewired first. Then the pairwise squared distances between the subcluster centroids are computed, the pair with the largest distance is chosen as the two seeds, and each subcluster is assigned to whichever seed is closer.

def _split_node(node, threshold, branching_factor):
    """The node has to be split if there is no place for a new subcluster
    in the node.
    1. Two empty nodes and two empty subclusters are initialized.
    2. The pair of distant subclusters are found.
    3. The properties of the empty subclusters and nodes are updated
       according to the nearest distance between the subclusters to the
       pair of distant subclusters.
    4. The two nodes are set as children to the two subclusters.
    """
    new_subcluster1 = _CFSubcluster()
    new_subcluster2 = _CFSubcluster()
    new_node1 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features)
    new_node2 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features)
    new_subcluster1.child_ = new_node1
    new_subcluster2.child_ = new_node2

    if node.is_leaf:
        if node.prev_leaf_ is not None:
            node.prev_leaf_.next_leaf_ = new_node1
        new_node1.prev_leaf_ = node.prev_leaf_
        new_node1.next_leaf_ = new_node2
        new_node2.prev_leaf_ = new_node1
        new_node2.next_leaf_ = node.next_leaf_
        if node.next_leaf_ is not None:
            node.next_leaf_.prev_leaf_ = new_node2

    dist = euclidean_distances(
        node.centroids_, Y_norm_squared=node.squared_norm_, squared=True)
    n_clusters = dist.shape[0]

    farthest_idx = np.unravel_index(
        dist.argmax(), (n_clusters, n_clusters))
    node1_dist, node2_dist = dist[(farthest_idx,)]

    node1_closer = node1_dist < node2_dist
    for idx, subcluster in enumerate(node.subclusters_):
        if node1_closer[idx]:
            new_node1.append_subcluster(subcluster)
            new_subcluster1.update(subcluster)
        else:
            new_node2.append_subcluster(subcluster)
            new_subcluster2.update(subcluster)
    return new_subcluster1, new_subcluster2
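The indexing trick dist[(farthest_idx,)] pulls out the two rows of the distance matrix belonging to the farthest pair; comparing them elementwise sends every subcluster to whichever of the two seeds it is closer to. A toy illustration with made-up centroids:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

centroids = np.array([[0., 0.], [0.5, 0.], [5., 5.], [5.5, 5.]])
dist = euclidean_distances(centroids, squared=True)
n = dist.shape[0]

farthest_idx = np.unravel_index(dist.argmax(), (n, n))  # here (0, 3)
node1_dist, node2_dist = dist[(farthest_idx,)]  # rows 0 and 3 of dist
print(node1_dist < node2_dist)  # [ True  True False False]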

Get leaf node:

    def _get_leaves(self):
        """
        Retrieve the leaves of the CF Node.

        Returns
        -------
        leaves : list of shape (n_leaves,)
            List of the leaf nodes.
        """
        leaf_ptr = self.dummy_leaf_.next_leaf_
        leaves = []
        while leaf_ptr is not None:
            leaves.append(leaf_ptr)
            leaf_ptr = leaf_ptr.next_leaf_
        return leaves

Global clustering: by default, an agglomerative clustering algorithm (AgglomerativeClustering, covered separately) is run on the subcluster centroids.

    def _global_clustering(self, X=None):
        """
        Global clustering for the subclusters obtained after fitting
        """
        clusterer = self.n_clusters
        centroids = self.subcluster_centers_
        compute_labels = (X is not None) and self.compute_labels

        # Preprocessing for the global clustering.
        not_enough_centroids = False
        if isinstance(clusterer, numbers.Integral):
            clusterer = AgglomerativeClustering(
                n_clusters=self.n_clusters)
            # There is no need to perform the global clustering step.
            if len(centroids) < self.n_clusters:
                not_enough_centroids = True
        elif (clusterer is not None and not
              hasattr(clusterer, 'fit_predict')):
            raise ValueError("n_clusters should be an instance of "
                             "ClusterMixin or an int")

        # To use in predict to avoid recalculation.
        self._subcluster_norms = row_norms(
            self.subcluster_centers_, squared=True)

        if clusterer is None or not_enough_centroids:
            self.subcluster_labels_ = np.arange(len(centroids))
            if not_enough_centroids:
                warnings.warn(
                    "Number of subclusters found (%d) by Birch is less "
                    "than (%d). Decrease the threshold."
                    % (len(centroids), self.n_clusters), ConvergenceWarning)
        else:
            # The global clustering step that clusters the subclusters of
            # the leaves. It assumes the centroids of the subclusters as
            # samples and finds the final centroids.
            self.subcluster_labels_ = clusterer.fit_predict(
                self.subcluster_centers_)

        if compute_labels:
            self.labels_ = self.predict(X)
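Because n_clusters may itself be a clustering estimator, the global step can be swapped for any model exposing fit_predict; for example:

from sklearn.cluster import Birch, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# The estimator clusters the subcluster centroids instead of the raw data
model = Birch(n_clusters=AgglomerativeClustering(n_clusters=4))
labels = model.fit_predict(X)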

 

(3) _CFNode

Parameters:

  • threshold: (float) threshold that bounds the radius of a subcluster
  • branching_factor: (int) branching factor
  • is_leaf: (bool) whether the node is a leaf node
  • n_features: (int) number of features

Attributes:

  • subclusters_: (list) subclusters stored in this node
  • prev_leaf_: (_CFNode) previous leaf node
  • next_leaf_: (_CFNode) next leaf node
  • init_centroids_: initialized centroids, shape=(branching_factor + 1, n_features)
  • init_sq_norm_: initialized squared norms, shape=(branching_factor + 1,)
  • centroids_: the actual centroids, a view of init_centroids_
  • squared_norm_: the actual squared norms, a view of init_sq_norm_

 

_CFNode consists of three methods:

The first function: append_subcluster(self, subcluster) appends a subcluster to the node and updates the CF statistics.

    def append_subcluster(self, subcluster):
        # Number of subclusters already stored in this node
        n_samples = len(self.subclusters_)
        # Append the new subcluster to this node
        self.subclusters_.append(subcluster)
        # Record the new subcluster's centroid and squared norm
        self.init_centroids_[n_samples] = subcluster.centroid_
        self.init_sq_norm_[n_samples] = subcluster.sq_norm_

        # Keep centroids and squared norm as views. In this way
        # if we change init_centroids and init_sq_norm_, it is
        # sufficient.
        # Refresh the views so they also cover the newly added subcluster
        self.centroids_ = self.init_centroids_[:n_samples + 1, :]
        self.squared_norm_ = self.init_sq_norm_[:n_samples + 1]
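The slicing on the last two lines keeps centroids_ and squared_norm_ as numpy views, so later in-place writes to init_centroids_ / init_sq_norm_ become visible without copying. A minimal illustration of the view semantics:

import numpy as np

init_centroids = np.zeros((4, 2))
centroids = init_centroids[:2, :]  # a view, not a copy

init_centroids[0] = [1., 2.]       # write through the backing array
print(centroids[0])                # [1. 2.]: the view sees the update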

The second function: update_split_subclusters(self, subcluster, new_subcluster1, new_subcluster2) replaces a split subcluster with the two subclusters it was split into.

def update_split_subclusters(self, subcluster,
                                 new_subcluster1, new_subcluster2):
        """Remove a subcluster from a node and update it with the
        split subclusters.
        """
        ind = self.subclusters_.index(subcluster)
        self.subclusters_[ind] = new_subcluster1
        self.init_centroids_[ind] = new_subcluster1.centroid_
        self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
        self.append_subcluster(new_subcluster2)

The third function: insert_cf_subcluster(self, subcluster) inserts a new subcluster into the node; a return value of True signals the caller that this node has overflowed and must be split.

 def insert_cf_subcluster(self, subcluster):
        """Insert a new subcluster into the node."""
        # If the node has no subclusters yet, simply append the new one
        if not self.subclusters_:
            self.append_subcluster(subcluster)
            return False
        threshold = self.threshold
        branching_factor = self.branching_factor
        # We need to find the closest subcluster among all the
        # subclusters so that we can insert our new subcluster.
        # Compute ||c_i||^2 - 2 c_i . s for every stored centroid c_i;
        # since ||s||^2 is the same for all i, the argmin is the same as
        # for the squared Euclidean distance
        dist_matrix = np.dot(self.centroids_, subcluster.centroid_)
        dist_matrix *= -2.
        dist_matrix += self.squared_norm_
        closest_index = np.argmin(dist_matrix)
        closest_subcluster = self.subclusters_[closest_index]

        # If the subcluster has a child, we need a recursive strategy.
        # If the closest subcluster has a child node, recurse into the
        # child and update the CF statistics on the way back up
        if closest_subcluster.child_ is not None:
            split_child = closest_subcluster.child_.insert_cf_subcluster(
                subcluster)

            if not split_child:
                # If it is determined that the child need not be split, we
                # can just update the closest_subcluster
                closest_subcluster.update(subcluster)
                self.init_centroids_[closest_index] = \
                    self.subclusters_[closest_index].centroid_
                self.init_sq_norm_[closest_index] = \
                    self.subclusters_[closest_index].sq_norm_
                return False

            # things not too good. we need to redistribute the subclusters in
            # our child node, and add a new subcluster in the parent
            # subcluster to accommodate the new child.
            else:
                new_subcluster1, new_subcluster2 = _split_node(
                    closest_subcluster.child_, threshold, branching_factor)
                self.update_split_subclusters(
                    closest_subcluster, new_subcluster1, new_subcluster2)

                if len(self.subclusters_) > self.branching_factor:
                    return True
                return False

        # good to go!
        
        else:
            # Try to merge: the merge succeeds when the radius of the
            # merged subcluster stays below the threshold
            merged = closest_subcluster.merge_subcluster(
                subcluster, self.threshold)
            # If the merge succeeded, update the stored centroid and
            # squared norm of the closest subcluster
            if merged:
                self.init_centroids_[closest_index] = \
                    closest_subcluster.centroid_
                self.init_sq_norm_[closest_index] = \
                    closest_subcluster.sq_norm_
                return False

            # not close to any other subclusters, and we still
            # have space, so add.
            # Could not merge, but there is still room in this node,
            # so append the new subcluster
            elif len(self.subclusters_) < self.branching_factor:
                self.append_subcluster(subcluster)
                return False

            # We do not have enough space nor is it closer to an
            # other subcluster. We need to split.
            else:
                self.append_subcluster(subcluster)
                return True

(4) _CFSubcluster

Parameters:

  • linear_sum: (ndarray) the sample

Attributes:

  • n_samples_: (int) number of samples in the subcluster
  • linear_sum_: (ndarray) linear sum of all samples in the subcluster
  • squared_sum_: (float) sum of the squared l2 norms of all samples
  • centroid_: centroid of the subcluster
  • child_: child node
  • sq_norm_: squared norm of the centroid

_CFSubcluster consists of three methods:

The first function: update(self, subcluster) merges another subcluster into this one, updating the sample count, linear sum, squared sum, centroid and squared norm.

def update(self, subcluster):
        self.n_samples_ += subcluster.n_samples_
        self.linear_sum_ += subcluster.linear_sum_
        self.squared_sum_ += subcluster.squared_sum_
        self.centroid_ = self.linear_sum_ / self.n_samples_
        self.sq_norm_ = np.dot(self.centroid_, self.centroid_)

The second function: merge_subcluster(self, nominee_cluster, threshold) tentatively merges a candidate subcluster, and keeps the merge only if the resulting radius stays within the threshold.

def merge_subcluster(self, nominee_cluster, threshold):
        """Check if a cluster is worthy enough to be merged. If
        yes then merge.
        """
        new_ss = self.squared_sum_ + nominee_cluster.squared_sum_
        new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
        new_n = self.n_samples_ + nominee_cluster.n_samples_
        new_centroid = (1 / new_n) * new_ls
        new_norm = np.dot(new_centroid, new_centroid)
        dot_product = (-2 * new_n) * new_norm
        sq_radius = (new_ss + dot_product) / new_n + new_norm
        if sq_radius <= threshold ** 2:
            (self.n_samples_, self.linear_sum_, self.squared_sum_,
             self.centroid_, self.sq_norm_) = \
                new_n, new_ls, new_ss, new_centroid, new_norm
            return True
        return False

The third function: radius(self) computes the radius of the subcluster.

def radius(self):
        """Return radius of the subcluster"""
        dot_product = -2 * np.dot(self.linear_sum_, self.centroid_)
        return sqrt(
            ((self.squared_sum_ + dot_product) / self.n_samples_) +
            self.sq_norm_)
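Both merge_subcluster and radius rely on the same identity: for a subcluster with $n$ samples, linear sum $LS$, squared sum $SS$ and centroid $c = LS/n$,

$$R^2 = \frac{1}{n}\sum_{i=1}^{n}\lVert x_i - c\rVert^2 = \frac{SS}{n} - \frac{2\,c \cdot LS}{n} + \lVert c\rVert^2 = \frac{SS}{n} - \lVert c\rVert^2,$$

since $c \cdot LS = n\,\lVert c\rVert^2$. The dot_product term in the code supplies the middle cross term, and sq_norm_ (or new_norm in merge_subcluster) supplies the last one.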

 

Topics: Machine Learning