# Birch clustering algorithm

Posted by kevintynfron on Thu, 10 Feb 2022 16:17:41 +0100

# 1 Principle

1.1 B-tree

(1) m-way search tree

An m-way search tree is either an empty tree or a tree satisfying the following properties:

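• The root has at most m subtrees and has the following structure:

$(n, P_0, K_1, P_1, K_2, P_2, \dots, K_n, P_n)$, where each $P_i$ is a pointer to a subtree, each $K_i$ is a key, $n < m$, and $K_i < K_{i+1}$ for $1 \le i < n$.

• All keys in subtree $P_i$ are greater than $K_i$ and less than $K_{i+1}$, for $0 < i < n$.
• All keys in subtree $P_n$ are greater than $K_n$.
• All keys in subtree $P_0$ are less than $K_1$.
• Each subtree $P_i$ is itself an m-way search tree.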
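As a rough illustration of the ordering property, the following sketch searches a small hand-built 3-way search tree. The Node class and the search function are hypothetical names introduced here for illustration only; they are not part of any library.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # keys are sorted; children[i] holds the keys that fall between
    # keys[i - 1] and keys[i] (an empty list means the node is a leaf).
    keys: List[int]
    children: List["Node"] = field(default_factory=list)


def search(node: Optional[Node], key: int) -> bool:
    """Return True if key is stored in the m-way search tree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        # Descend into the only subtree that can still contain the key.
        node = node.children[i] if node.children else None
    return False


# A 3-way search tree: at most 2 keys and 3 subtrees per node.
root = Node(
    keys=[20, 40],
    children=[Node([5, 10]), Node([25, 30]), Node([50])],
)

found_30 = search(root, 30)
found_35 = search(root, 35)
print(found_30, found_35)
```

Because each node keeps its keys sorted, a binary search inside the node is enough to pick the next subtree.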

(2) B tree

A B-tree of order m is an m-way search tree that is either empty or satisfies the following properties:

• Each node in the tree has at most m subtrees
• The root node has at least two subtrees (unless it is a leaf)
• All non-terminal nodes except the root node have at least $\lceil m/2 \rceil$ subtrees
• All leaf nodes are on the same level

1.2 Steps

For a step-by-step walkthrough of the process, see: https://www.cnblogs.com/pinard/p/6179132.html

References:

BIRCH takes into account that data is usually not uniformly distributed: it clusters the points in dense regions and removes the points in sparse regions as outliers. BIRCH is also an incremental clustering method: the clustering decision for each point is based only on the data points processed so far, not on the whole data set.
① Build the clustering feature (CF) tree
The first step scans all the data and, using the given amount of memory plus recyclable space on disk, builds an initial in-memory CF tree that summarizes the clustering information of the data set. Dense data is grouped into fine subclusters, and sparse data points are removed as outliers.
② Condense the clustering feature tree
This step is optional and bridges step ① and step ③. Like step ①, it scans the leaf entries of the initial CF tree and rebuilds a smaller tree, removing more outliers and grouping crowded subclusters into larger ones.
③ Global clustering
A global or semi-global clustering algorithm is applied to all leaf entries. An existing clustering algorithm for data points can be readily adapted to work on a set of subclusters, each represented by its CF vector. The centroid of each subcluster is computed and the subcluster is represented by that centroid, which captures the main distribution pattern of the data.
④ Cluster refinement
Step ③ yields only a coarse summary of the data, and the original data has been scanned just once, so the clusters can be refined further. The centroids of the clusters produced in the previous step are used as seeds, and every data point is reassigned to its nearest seed to obtain a new set of clusters. This not only lets points migrate between clusters, but also ensures that all copies of a given data point end up in the same cluster. This step also offers the option of discarding outliers: a point that is too far from its nearest seed can be treated as an outlier and excluded from the result.
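The refinement in step ④ can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the refine function, its max_dist parameter and the -1 outlier label are hypothetical choices made here, not something the BIRCH description or scikit-learn prescribes.

```python
import numpy as np


def refine(points, seeds, max_dist=None):
    """Assign each point to its nearest seed; mark far-away points with -1."""
    # Pairwise Euclidean distances, shape (n_points, n_seeds).
    dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    if max_dist is not None:
        # Assumed outlier rule: farther than max_dist from every seed.
        labels[dists.min(axis=1) > max_dist] = -1
    return labels


points = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0], [20.0, 20.0]])
seeds = np.array([[0.0, 0.0], [5.0, 5.0]])

labels = refine(points, seeds, max_dist=1.0)
print(labels)
```

The last point lies far from both seeds, so with max_dist=1.0 it is labelled as an outlier instead of being forced into a cluster.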

# 2. Parameter description

Function: sklearn.cluster.Birch

Parameters:

• threshold: (float, default=0.5) the radius of the subcluster obtained by merging a new sample with its closest subcluster must not exceed the threshold; otherwise a new subcluster is started.
• branching_factor: (int, default=50) the maximum number of CF subclusters in each node. If a new sample would push a node above this number, the node is split in two.
• n_clusters: (int, sklearn.cluster estimator instance or None, default=3) the number of clusters after the final clustering step, which treats the subclusters from the leaves as new samples. If None, the final clustering step is skipped and the subclusters are returned as is; if a sklearn.cluster estimator, that model is fitted with the subclusters as new samples; if an int, AgglomerativeClustering is used with n_clusters set to that int.
• compute_labels: (bool, default=True) whether to compute labels for each fit.
• copy: (bool, default=True) whether to copy the input data. If False, the input data will be overwritten.
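A short sketch of how the three kinds of n_clusters values behave may help. The synthetic blobs and the variable names (brc_int, brc_none, brc_est) are made up for this illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, Birch

rng = np.random.RandomState(0)
# Three well-separated 2-D blobs, 50 points each (synthetic data).
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# n_clusters as an int: the subclusters are merged by agglomerative clustering.
brc_int = Birch(threshold=0.5, branching_factor=50, n_clusters=3).fit(X)

# n_clusters=None: the final step is skipped, each subcluster keeps its own label.
brc_none = Birch(threshold=0.5, branching_factor=50, n_clusters=None).fit(X)

# n_clusters as an estimator: it is fitted on the subcluster centroids.
brc_est = Birch(threshold=0.5, n_clusters=AgglomerativeClustering(n_clusters=3)).fit(X)

print("int:      ", len(np.unique(brc_int.labels_)), "clusters")
print("None:     ", len(brc_none.subcluster_centers_), "subclusters")
print("estimator:", len(np.unique(brc_est.labels_)), "clusters")
```

With None, the number of labels equals the number of subclusters, which depends on threshold rather than on a requested cluster count.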

Properties:

• root_: Root of CF tree
• dummy_leaf_: Start pointer to the linked list of all leaf nodes
• subcluster_centers_: Centroids of all subclusters, read directly from the leaves
• subcluster_labels_: Labels assigned to the subcluster centroids after global clustering
• labels_: Labels of all input data
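To make the attributes more concrete, here is a small sketch that fits Birch on synthetic data and inspects their shapes. The data set is made up for illustration, and root_ is an instance of a private class (_CFNode), so its is_leaf field is an implementation detail rather than a documented interface.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(42)
# Two synthetic 2-D blobs with 30 points each.
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(30, 2)) for c in (0.0, 4.0)])

brc = Birch(threshold=0.5, n_clusters=2).fit(X)

n_sub = brc.subcluster_centers_.shape[0]
print("subcluster_centers_:", brc.subcluster_centers_.shape)  # (n_sub, n_features)
print("subcluster_labels_: ", brc.subcluster_labels_.shape)   # (n_sub,)
print("labels_:            ", brc.labels_.shape)              # (n_samples,)
print("root is a leaf:     ", brc.root_.is_leaf)
```

There is one centroid and one label per subcluster, while labels_ has one entry per input sample.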

# 3. Specific implementation

You can refer to the scikit-learn example: https://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html#sphx-glr-auto-examples-cluster-plot-birch-vs-minibatchkmeans-py

# 4 Source code analysis

Source code: anaconda3/lib/site-packages/sklearn/cluster/birch.py

(1) Prerequisite knowledge

• The hasattr() function checks whether an object has an attribute or method with the specified name, and returns True or False.

hasattr(obj, name), where obj is an instance of a class and name is the attribute or method name to look for.

• The getattr() function returns the value of the specified attribute of an object.

getattr(obj, name[, default]), where obj is the instance, name is the attribute name, and default is an optional fallback return value. If the attribute is not found and default is not given, an AttributeError is raised; otherwise the function returns the value of default.

• The setattr() function is more versatile. Its most basic use is to modify the value of an attribute on an instance; it can also dynamically add attributes or methods to an instance.
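The three built-ins can be tried on a throwaway object. The Model class below is a hypothetical stand-in, not the Birch estimator.

```python
class Model:
    """Hypothetical stand-in class used only for this illustration."""

    def __init__(self):
        self.threshold = 0.5


m = Model()

has_threshold = hasattr(m, "threshold")   # the attribute exists
has_root = hasattr(m, "root_")            # not set yet
root_or_none = getattr(m, "root_", None)  # the default avoids AttributeError

setattr(m, "root_", "tree")               # dynamically add an attribute
root_after = getattr(m, "root_")

try:
    getattr(m, "missing")                 # no default given: raises
    raised = False
except AttributeError:
    raised = True

print(has_threshold, has_root, root_or_none, root_after, raised)
```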

(2) Birch class

• Birch(BaseEstimator, TransformerMixin, ClusterMixin): the three base classes are defined in sklearn's base module
• Other parameters

• fit function (the core computation is in the _fit function)
```python
# Excerpt from the Birch class; np (numpy) and sparse (scipy.sparse)
# are imported at module level in the original file.
def fit(self, X, y=None):
    """
    Build a CF Tree for the input data.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Input data.

    y : Ignored
        Not used, present here for API consistency by convention.

    Returns
    -------
    self
        Fitted estimator.
    """
    self.fit_, self.partial_fit_ = True, False
    return self._fit(X)

def _fit(self, X):

    X = self._validate_data(X, accept_sparse='csr', copy=self.copy)

    threshold = self.threshold
    branching_factor = self.branching_factor

    if branching_factor <= 1:
        raise ValueError("Branching_factor should be greater than one.")
    n_samples, n_features = X.shape

    # If partial_fit is called for the first time or fit is called, we
    # start a new tree.
    partial_fit = getattr(self, 'partial_fit_')
    has_root = getattr(self, 'root_', None)
    if getattr(self, 'fit_') or (partial_fit and not has_root):
        # The first root is the leaf. Manipulate this object throughout.
        self.root_ = _CFNode(threshold=threshold,
                             branching_factor=branching_factor,
                             is_leaf=True,
                             n_features=n_features)

        # To enable getting back subclusters.
        self.dummy_leaf_ = _CFNode(threshold=threshold,
                                   branching_factor=branching_factor,
                                   is_leaf=True, n_features=n_features)
        self.dummy_leaf_.next_leaf_ = self.root_
        self.root_.prev_leaf_ = self.dummy_leaf_

    # Cannot vectorize. Enough to convince to use cython.
    if not sparse.issparse(X):
        iter_func = iter
    else:
        iter_func = _iterate_sparse_X

    # Iterate over the samples and insert each one as a subcluster.
    for sample in iter_func(X):

        subcluster = _CFSubcluster(linear_sum=sample)
        split = self.root_.insert_cf_subcluster(subcluster)
        # If the root has to be split, split it into two subclusters
        # and make them the entries of a new, non-leaf root.
        if split:
            new_subcluster1, new_subcluster2 = _split_node(
                self.root_, threshold, branching_factor)
            del self.root_
            self.root_ = _CFNode(threshold=threshold,
                                 branching_factor=branching_factor,
                                 is_leaf=False,
                                 n_features=n_features)
            self.root_.append_subcluster(new_subcluster1)
            self.root_.append_subcluster(new_subcluster2)
    # Collect the centroids of the subclusters in all leaves.
    centroids = np.concatenate([
        leaf.centroids_ for leaf in self._get_leaves()])
    self.subcluster_centers_ = centroids

    self._global_clustering(X)
    return self
```
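The fit_ and partial_fit_ flags in the code above decide whether a new tree is started: fit always starts one, while partial_fit reuses an existing root. A minimal sketch of the incremental path, using synthetic data made up for this illustration, might look like this:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
# Two synthetic 2-D blobs, shuffled so that every batch contains both.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0.0, 6.0)])
rng.shuffle(X)

brc = Birch(threshold=0.5, n_clusters=2)
# Feed the data in four batches; the CF tree grows with every call.
for batch in np.array_split(X, 4):
    brc.partial_fit(batch)

labels = brc.predict(X)
print(np.bincount(labels))
```

Calling fit on the same estimator afterwards would discard this tree and build a new one from scratch.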

Other functions:

Iterating over a sparse matrix:

```python
def _iterate_sparse_X(X):
    """This little hack returns a densified row when iterating over a sparse
    matrix, instead of constructing a sparse matrix for every row that is
    expensive.
    """
    n_samples = X.shape[0]
    X_indices = X.indices
    X_data = X.data
    X_indptr = X.indptr

    for i in range(n_samples):
        row = np.zeros(X.shape[1])
        startptr, endptr = X_indptr[i], X_indptr[i + 1]
        nonzero_indices = X_indices[startptr:endptr]
        row[nonzero_indices] = X_data[startptr:endptr]
        yield row
```
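The row-by-row densification relies on the CSR layout (data, indices, indptr). The following sketch repeats the same indexing on a tiny matrix made up for illustration and compares the result with SciPy's own dense conversion.

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [1.0, 0.0, 3.0]]))

# In CSR, the nonzeros of row i live in data[indptr[i]:indptr[i + 1]],
# and their column positions in indices[indptr[i]:indptr[i + 1]].
rows = []
for i in range(X.shape[0]):
    row = np.zeros(X.shape[1])
    start, end = X.indptr[i], X.indptr[i + 1]
    row[X.indices[start:end]] = X.data[start:end]
    rows.append(row)

dense = np.vstack(rows)
print(X.indptr, X.indices, X.data)
print(dense)
```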

Splitting a node: _split_node creates two new subclusters and two new CF nodes, and sets each node as the child of one of the subclusters. If the node being split is a leaf, the leaf linked-list pointers are rewired so that the two new nodes take its place. The function then computes the pairwise squared distances between the centroids of the node's subclusters, picks the pair that is farthest apart, and assigns every subcluster to whichever member of that pair it is closer to.

```python
def _split_node(node, threshold, branching_factor):
    """The node has to be split if there is no place for a new subcluster
    in the node.
    1. Two empty nodes and two empty subclusters are initialized.
    2. The pair of distant subclusters are found.
    3. The properties of the empty subclusters and nodes are updated
       according to the nearest distance between the subclusters to the
       pair of distant subclusters.
    4. The two nodes are set as children to the two subclusters.
    """
    new_subcluster1 = _CFSubcluster()
    new_subcluster2 = _CFSubcluster()
    new_node1 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features)
    new_node2 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features)
    new_subcluster1.child_ = new_node1
    new_subcluster2.child_ = new_node2

    # Replace the old leaf by the two new leaves in the linked list.
    if node.is_leaf:
        if node.prev_leaf_ is not None:
            node.prev_leaf_.next_leaf_ = new_node1
        new_node1.prev_leaf_ = node.prev_leaf_
        new_node1.next_leaf_ = new_node2
        new_node2.prev_leaf_ = new_node1
        new_node2.next_leaf_ = node.next_leaf_
        if node.next_leaf_ is not None:
            node.next_leaf_.prev_leaf_ = new_node2

    # Pairwise squared distances between the subcluster centroids.
    dist = euclidean_distances(
        node.centroids_, Y_norm_squared=node.squared_norm_, squared=True)
    n_clusters = dist.shape[0]

    # The two subclusters that are farthest apart act as seeds.
    farthest_idx = np.unravel_index(
        dist.argmax(), (n_clusters, n_clusters))
    node1_dist, node2_dist = dist[(farthest_idx,)]

    node1_closer = node1_dist < node2_dist
    for idx, subcluster in enumerate(node.subclusters_):
        if node1_closer[idx]:
            new_node1.append_subcluster(subcluster)
            new_subcluster1.update(subcluster)
        else:
            new_node2.append_subcluster(subcluster)
            new_subcluster2.update(subcluster)
    return new_subcluster1, new_subcluster2
```
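The seed selection inside _split_node can be reproduced on plain arrays. The centroids below are made-up values, and the sketch covers only the distance logic, not the node bookkeeping.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Centroids of four hypothetical subclusters in an overfull node.
centroids = np.array([[0.0, 0.0], [0.5, 0.0], [9.0, 9.0], [10.0, 10.0]])

dist = euclidean_distances(centroids, squared=True)
# Indices (i, j) of the two subclusters that are farthest apart.
far_i, far_j = np.unravel_index(dist.argmax(), dist.shape)

# A subcluster goes to the first new node if it is closer to seed i
# than to seed j, and to the second new node otherwise.
node1_closer = dist[far_i] < dist[far_j]
print(far_i, far_j, node1_closer)
```

Seed j compares its distance to seed i against a distance of zero to itself, so it always lands in the second node.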

Getting the leaf nodes:

```python
def _get_leaves(self):
    """
    Retrieve the leaves of the CF Node.

    Returns
    -------
    leaves : list of shape (n_leaves,)
        List of the leaf nodes.
    """
    # Walk the linked list of leaves, starting after the dummy leaf.
    leaf_ptr = self.dummy_leaf_.next_leaf_
    leaves = []
    while leaf_ptr is not None:
        leaves.append(leaf_ptr)
        leaf_ptr = leaf_ptr.next_leaf_
    return leaves
```

Global clustering: the subcluster centroids are clustered with an agglomerative clustering algorithm (covered separately).

```python
def _global_clustering(self, X=None):
    """
    Global clustering for the subclusters obtained after fitting
    """
    clusterer = self.n_clusters
    centroids = self.subcluster_centers_
    compute_labels = (X is not None) and self.compute_labels

    # Preprocessing for the global clustering.
    not_enough_centroids = False
    if isinstance(clusterer, numbers.Integral):
        clusterer = AgglomerativeClustering(
            n_clusters=self.n_clusters)
        # There is no need to perform the global clustering step.
        if len(centroids) < self.n_clusters:
            not_enough_centroids = True
    elif (clusterer is not None and not
          hasattr(clusterer, 'fit_predict')):
        raise ValueError("n_clusters should be an instance of "
                         "ClusterMixin or an int")

    # To use in predict to avoid recalculation.
    self._subcluster_norms = row_norms(
        self.subcluster_centers_, squared=True)

    if clusterer is None or not_enough_centroids:
        self.subcluster_labels_ = np.arange(len(centroids))
        if not_enough_centroids:
            warnings.warn(
                "Number of subclusters found (%d) by Birch is less "
                "than (%d). Decrease the threshold."
                % (len(centroids), self.n_clusters), ConvergenceWarning)
    else:
        # The global clustering step that clusters the subclusters of
        # the leaves. It assumes the centroids of the subclusters as
        # samples and finds the final centroids.
        self.subcluster_labels_ = clusterer.fit_predict(
            self.subcluster_centers_)

    if compute_labels:
        self.labels_ = self.predict(X)
```
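The last two lines of _global_clustering obtain labels_ through predict. Conceptually, each sample takes the global label of its nearest subcluster centroid. The sketch below rebuilds that mapping by hand on synthetic data; it illustrates the idea and is not a copy of the predict implementation.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.RandomState(1)
# Three synthetic 2-D blobs with 40 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in (0.0, 5.0, 10.0)])

brc = Birch(threshold=0.4, n_clusters=3).fit(X)

# Index of the nearest subcluster centroid for each sample ...
nearest = pairwise_distances_argmin(X, brc.subcluster_centers_)
# ... mapped to the label the global step gave to that subcluster.
manual_labels = brc.subcluster_labels_[nearest]

agreement = float(np.mean(manual_labels == brc.labels_))
print("agreement with labels_:", agreement)
```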

(3) CFNode

Parameters:

• threshold: (float) threshold a subcluster must satisfy for a new sample to be merged into it
• branching_factor: (int) maximum number of subclusters in the node
• is_leaf: (bool) whether the node is a leaf
• n_features: (int) number of features

Attributes:

• subclusters_: (list) subclusters of this node
• prev_leaf_: (_CFNode) previous leaf node
• next_leaf_: (_CFNode) next leaf node
• init_centroids_: preallocated buffer for the centroids, shape=(branching_factor + 1, n_features)
• init_sq_norm_: preallocated buffer for the squared norms, shape=(branching_factor + 1,)
• centroids_: view of init_centroids_ holding the centroids of the current subclusters
• squared_norm_: view of init_sq_norm_ holding the squared norms of the current subclusters

CFNode has three methods:

The first function: append_subcluster(self, subcluster) appends a subcluster to the node and records its centroid and squared norm

```python
def append_subcluster(self, subcluster):
    # Number of subclusters currently stored in the node.
    n_samples = len(self.subclusters_)
    # Add the new subcluster to the node.
    self.subclusters_.append(subcluster)
    # Store its centroid and squared norm in the preallocated buffers.
    self.init_centroids_[n_samples] = subcluster.centroid_
    self.init_sq_norm_[n_samples] = subcluster.sq_norm_

    # Keep centroids and squared norm as views. In this way
    # if we change init_centroids and init_sq_norm_, it is
    # sufficient.
    # Expose the filled part of the buffers.
    self.centroids_ = self.init_centroids_[:n_samples + 1, :]
    self.squared_norm_ = self.init_sq_norm_[:n_samples + 1]
```
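The comment about views in append_subcluster refers to NumPy's slicing semantics: a basic slice shares memory with the array it was taken from. A small sketch with made-up sizes and values:

```python
import numpy as np

branching_factor, n_features = 3, 2
# Preallocated buffer, one spare row as in the node's init_centroids_.
init_centroids = np.zeros((branching_factor + 1, n_features))

# Pretend two subclusters have been appended so far.
init_centroids[0] = [1.0, 1.0]
init_centroids[1] = [4.0, 4.0]
centroids = init_centroids[:2, :]  # basic slicing returns a view

# An in-place update of the buffer is visible through the view.
init_centroids[0] = [2.0, 2.0]
shares_memory = np.shares_memory(centroids, init_centroids)
print(centroids[0], shares_memory)
```

This is why the other methods only need to write into the init_ buffers when a subcluster changes.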

The second function: update_split_subclusters(self, subcluster, new_subcluster1, new_subcluster2) replaces a subcluster that has been split with the two subclusters produced by the split

```python
def update_split_subclusters(self, subcluster,
                             new_subcluster1, new_subcluster2):
    """Remove a subcluster from a node and update it with the
    split subclusters.
    """
    # The first new subcluster takes the slot of the old one,
    # the second is appended at the end.
    ind = self.subclusters_.index(subcluster)
    self.subclusters_[ind] = new_subcluster1
    self.init_centroids_[ind] = new_subcluster1.centroid_
    self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
    self.append_subcluster(new_subcluster2)
```

The third function: insert_cf_subcluster(self, subcluster) inserts a new subcluster into the node and returns True if the node has to be split

```python
def insert_cf_subcluster(self, subcluster):
    """Insert a new subcluster into the node."""
    # If the node has no subclusters yet, simply append the new one.
    if not self.subclusters_:
        self.append_subcluster(subcluster)
        return False
    threshold = self.threshold
    branching_factor = self.branching_factor
    # We need to find the closest subcluster among all the
    # subclusters so that we can insert our new subcluster.
    # Reduced distance ||c||^2 - 2 c.x to every centroid c.
    dist_matrix = np.dot(self.centroids_, subcluster.centroid_)
    dist_matrix *= -2.
    dist_matrix += self.squared_norm_
    closest_index = np.argmin(dist_matrix)
    closest_subcluster = self.subclusters_[closest_index]

    # If the subcluster has a child, we need a recursive strategy.
    if closest_subcluster.child_ is not None:
        split_child = closest_subcluster.child_.insert_cf_subcluster(
            subcluster)

        if not split_child:
            # If it is determined that the child need not be split, we
            # can just update the closest_subcluster
            closest_subcluster.update(subcluster)
            self.init_centroids_[closest_index] = \
                self.subclusters_[closest_index].centroid_
            self.init_sq_norm_[closest_index] = \
                self.subclusters_[closest_index].sq_norm_
            return False

        # things not too good. we need to redistribute the subclusters in
        # our child node, and add a new subcluster in the parent
        # subcluster to accommodate the new child.
        else:
            new_subcluster1, new_subcluster2 = _split_node(
                closest_subcluster.child_, threshold, branching_factor)
            self.update_split_subclusters(
                closest_subcluster, new_subcluster1, new_subcluster2)

            if len(self.subclusters_) > self.branching_factor:
                return True
            return False

    # good to go!
    else:
        # Try to merge; this succeeds only if the radius of the merged
        # subcluster stays within the threshold.
        merged = closest_subcluster.merge_subcluster(
            subcluster, self.threshold)
        # On success, refresh the stored centroid and squared norm.
        if merged:
            self.init_centroids_[closest_index] = \
                closest_subcluster.centroid_
            self.init_sq_norm_[closest_index] = \
                closest_subcluster.sq_norm_
            return False

        # not close to any other subclusters, and we still
        # have space, so add.
        elif len(self.subclusters_) < self.branching_factor:
            self.append_subcluster(subcluster)
            return False

        # We do not have enough space nor is it closer to an
        # other subcluster. We need to split.
        else:
            self.append_subcluster(subcluster)
            return True
```
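The closest subcluster is found without computing full distances: since the squared norm of the new sample is the same for every centroid, it can be dropped before taking the argmin. A small numeric check with made-up centroids:

```python
import numpy as np

centroids = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
x = np.array([2.5, 3.5])

# Reduced distance ||c||^2 - 2 c.x (the constant ||x||^2 is dropped).
reduced = -2.0 * centroids.dot(x) + (centroids ** 2).sum(axis=1)
# Full squared Euclidean distance for comparison.
full = ((centroids - x) ** 2).sum(axis=1)

closest_reduced = int(np.argmin(reduced))
closest_full = int(np.argmin(full))
print(closest_reduced, closest_full)
```

Both versions select the same centroid, and the two arrays differ by exactly the squared norm of x.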

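(4) CFSubcluster

Parameters:

• linear_sum: (ndarray) a sample, so that a subcluster can be created from a single sample

Attributes:

• n_samples_: (int) number of samples in the subcluster
• linear_sum_: (ndarray) linear sum of all samples in the subcluster
• squared_sum_: (float) sum of the squared l2 norms of all samples in the subcluster
• centroid_: centroid of the subcluster
• child_: child node of the subcluster
• sq_norm_: squared norm of the centroid

CFSubcluster has three methods:

The first function: update(self, subcluster) adds another subcluster's statistics to this one and recomputes the derived values (sample count, linear sum, squared sum, centroid, squared norm)

```python
def update(self, subcluster):
    # CF vectors are additive: add counts, linear sums and squared sums.
    self.n_samples_ += subcluster.n_samples_
    self.linear_sum_ += subcluster.linear_sum_
    self.squared_sum_ += subcluster.squared_sum_
    # Recompute the centroid and its squared norm.
    self.centroid_ = self.linear_sum_ / self.n_samples_
    self.sq_norm_ = np.dot(self.centroid_, self.centroid_)
```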
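The update method works because clustering features are additive. The sketch below checks this on a few made-up points; the cf helper is a hypothetical function written for this illustration.

```python
import numpy as np


def cf(points):
    """Clustering feature (N, LS, SS) of a set of points (illustrative helper)."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), float((points ** 2).sum())


a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0]])

n_a, ls_a, ss_a = cf(a)
n_b, ls_b, ss_b = cf(b)

# Merging two subclusters is component-wise addition of their CFs.
n, ls, ss = n_a + n_b, ls_a + ls_b, ss_a + ss_b
centroid = ls / n
print(n, ls, ss, centroid)
```

The merged values equal the clustering feature computed directly from all three points, so the raw points never need to be stored.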

The second function: merge_subcluster(self, nominee_cluster, threshold) merges the nominee subcluster into this one if the radius of the merged subcluster does not exceed the threshold

```python
def merge_subcluster(self, nominee_cluster, threshold):
    """Check if a cluster is worthy enough to be merged. If
    yes then merge.
    """
    # Statistics the subcluster would have after the merge.
    new_ss = self.squared_sum_ + nominee_cluster.squared_sum_
    new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
    new_n = self.n_samples_ + nominee_cluster.n_samples_
    new_centroid = (1 / new_n) * new_ls
    new_norm = np.dot(new_centroid, new_centroid)
    dot_product = (-2 * new_n) * new_norm
    sq_radius = (new_ss + dot_product) / new_n + new_norm
    # Accept the merge only if the radius stays within the threshold.
    if sq_radius <= threshold ** 2:
        (self.n_samples_, self.linear_sum_, self.squared_sum_,
         self.centroid_, self.sq_norm_) = \
            new_n, new_ls, new_ss, new_centroid, new_norm
        return True
    return False
```
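The sq_radius expression can be checked against the direct definition of the squared radius, the mean squared distance of the points to their centroid. The random points below are made up for this check.

```python
import numpy as np

rng = np.random.RandomState(0)
points = rng.normal(size=(20, 3))

n = len(points)
ls = points.sum(axis=0)    # linear sum
ss = (points ** 2).sum()   # sum of squared norms
centroid = ls / n
norm = centroid.dot(centroid)

# Same expression as in merge_subcluster.
sq_radius = (ss + (-2 * n) * norm) / n + norm
# Direct definition: mean squared distance to the centroid.
direct = ((points - centroid) ** 2).sum(axis=1).mean()

print(sq_radius, direct)
```

Both numbers agree up to floating-point error, so the merge test needs only the three CF components.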

The third function: radius(self) returns the radius of the subcluster

```python
def radius(self):
    """Return radius of the subcluster"""
    dot_product = -2 * np.dot(self.linear_sum_, self.centroid_)
    return sqrt(
        ((self.squared_sum_ + dot_product) / self.n_samples_) +
        self.sq_norm_)
```
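The expression inside sqrt follows from expanding the definition of the radius. Writing $N$ for n_samples_, $LS$ for linear_sum_, $SS$ for squared_sum_ and $c = LS / N$ for the centroid (symbols introduced here for brevity), a short derivation is:

$$
R^2 = \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - c \rVert^2
    = \frac{1}{N}\Big(\sum_{i=1}^{N} \lVert x_i \rVert^2 - 2\, c \cdot \sum_{i=1}^{N} x_i + N \lVert c \rVert^2\Big)
    = \frac{SS - 2\, LS \cdot c}{N} + \lVert c \rVert^2
$$

Since $LS \cdot c = N \lVert c \rVert^2$, this simplifies further to $R^2 = SS / N - \lVert c \rVert^2$, which is the same quantity that merge_subcluster compares against the squared threshold.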

Topics: Machine Learning