# OpenCV4 machine learning: principle and implementation of K-means

Posted by Christopher on Tue, 07 Sep 2021 06:41:32 +0200

Preface:

This column uses OpenCV4 to implement basic image processing operations, classical machine learning algorithms (such as K-Means, KNN, SVM, decision trees, and Bayesian classifiers), and common deep learning algorithms.

## 1, Basic introduction

K-means is an iterative clustering algorithm. Clustering groups data members that are similar in some respect, and in doing so uncovers the internal structure of the data. Because it works without labeled examples, clustering is usually described as unsupervised learning.

K-means is the best-known partitional clustering algorithm, and its simplicity and efficiency have made it one of the most widely used clustering algorithms. Given a set of data points and a user-specified number of clusters K, the algorithm iteratively partitions the data into K clusters according to a distance function.
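The distance function can take several forms. As a minimal sketch, and not part of OpenCV, the helpers below compute a few common distances with the standard library only. The function names (`minkowski`, `manhattan`, `euclidean`, `hamming`) are hypothetical, and both inputs are assumed to have the same length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; inputs are assumed to have equal length.

// Minkowski distance of order p between two vectors.
// p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
double minkowski(const std::vector<double>& a, const std::vector<double>& b, double p) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += std::pow(std::abs(a[i] - b[i]), p);
    }
    return std::pow(sum, 1.0 / p);
}

double manhattan(const std::vector<double>& a, const std::vector<double>& b) {
    return minkowski(a, b, 1.0);
}

double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    return minkowski(a, b, 2.0);
}

// Hamming distance: number of positions at which two sequences differ.
int hamming(const std::vector<int>& a, const std::vector<int>& b) {
    int count = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++count;
    }
    return count;
}
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 and the Manhattan distance is 7, so the choice of distance changes which center counts as "nearest".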

## 2, Algorithm principle

For a given data set, the process of clustering by K-means method is as follows:

• Initialize K cluster centers.
• Sample allocation. Use the chosen distance function to find the cluster center nearest to each sample, and assign the sample to that center's cluster. Common distance functions include Euclidean, Manhattan, Minkowski, and Hamming distance.
• Update cluster centers. For each cluster, compute the mean of the samples assigned to it and take that mean as the new cluster center.
• Check the termination condition. Stop once the result has converged to the required accuracy or the maximum number of iterations has been reached; otherwise, repeat from the sample allocation step.
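The steps above can be sketched in plain C++ without OpenCV. This is a simplified illustration, not the code behind cv::kmeans: it assumes 2-D points, squared Euclidean distance, and initial centers supplied by the caller (step 1). The names `Pt`, `sqDist`, and `kmeansSketch` are hypothetical, and keeping the old center for an empty cluster is an assumption made for brevity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Pt { double x, y; };  // hypothetical 2-D point type

// Squared Euclidean distance between two 2-D points.
double sqDist(const Pt& a, const Pt& b) {
    const double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Hypothetical helper: iterates from the given initial centers (step 1 is
// done by the caller), updates `centers` in place, and returns the labels
// produced by the last assignment step.
std::vector<int> kmeansSketch(const std::vector<Pt>& samples,
                              std::vector<Pt>& centers,
                              int maxIter, double eps) {
    std::vector<int> labels(samples.size(), 0);
    for (int iter = 0; iter < maxIter; ++iter) {
        // Step 2: assign every sample to its nearest center.
        for (std::size_t i = 0; i < samples.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t k = 0; k < centers.size(); ++k) {
                const double d = sqDist(samples[i], centers[k]);
                if (d < best) { best = d; labels[i] = static_cast<int>(k); }
            }
        }
        // Step 3: move each center to the mean of its assigned samples.
        std::vector<Pt> sums(centers.size(), Pt{0.0, 0.0});
        std::vector<int> counts(centers.size(), 0);
        for (std::size_t i = 0; i < samples.size(); ++i) {
            sums[labels[i]].x += samples[i].x;
            sums[labels[i]].y += samples[i].y;
            ++counts[labels[i]];
        }
        double maxShift = 0.0;
        for (std::size_t k = 0; k < centers.size(); ++k) {
            if (counts[k] == 0) continue;  // assumption: empty cluster keeps its old center
            const Pt updated{sums[k].x / counts[k], sums[k].y / counts[k]};
            maxShift = std::max(maxShift, std::sqrt(sqDist(updated, centers[k])));
            centers[k] = updated;
        }
        // Step 4: stop once no center moved farther than eps.
        if (maxShift <= eps) break;
    }
    return labels;
}
```

The `maxIter` and `eps` parameters play the same role as the iteration count and accuracy in the termination condition: whichever limit is hit first ends the loop.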

## 3, Function interpretation

In OpenCV4, the cv::kmeans function implements K-means: it finds the centers of K clusters and groups the input samples around those centers.

The cv::kmeans function is defined as follows:

```cpp
double cv::kmeans(
    InputArray data,                // Input samples (floating-point coordinates)
    int K,                          // Number of clusters
    InputOutputArray bestLabels,    // Output integer array storing the cluster index of each sample
    TermCriteria criteria,          // Termination condition: maximum number of iterations and/or required accuracy
    int attempts,                   // Number of times the algorithm is run with different initial labelings
    int flags,                      // Method for initializing the cluster centers
    OutputArray centers = noArray() // Output matrix of cluster centers, one row per center
)
```
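The `attempts` parameter is easiest to understand as a "best of N" loop: according to the OpenCV documentation, the function runs the clustering several times from different initializations and returns the labels that give the best (lowest) compactness. The sketch below shows only that selection logic, without OpenCV; the `Attempt` type and the `bestOfAttempts` function are hypothetical names, and the single-run clustering is passed in as a callback.

```cpp
#include <functional>
#include <limits>
#include <vector>

// Result of one clustering run (hypothetical type for this sketch).
struct Attempt {
    double compactness;       // sum of squared sample-to-center distances
    std::vector<int> labels;  // one cluster index per sample
};

// Runs `runOnce` `attempts` times and keeps the run with the lowest compactness.
// `runOnce` receives the attempt index so it can choose a different initialization.
Attempt bestOfAttempts(int attempts, const std::function<Attempt(int)>& runOnce) {
    Attempt best{std::numeric_limits<double>::max(), {}};
    for (int a = 0; a < attempts; ++a) {
        Attempt current = runOnce(a);
        if (current.compactness < best.compactness) {
            best = current;
        }
    }
    return best;
}
```

A larger `attempts` value costs proportionally more time but reduces the chance of returning a poor local optimum caused by an unlucky initialization.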

## 4, Hands-on demonstration

The following example clusters a set of two-dimensional points using the kmeans() function in OpenCV.

```cpp
#include <iostream>
#include <vector>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;

int main() {
    const int MAX_CLUSTERS = 5; // Maximum number of clusters
    Scalar colorTab[] = {       // Drawing colors, one per cluster
        Scalar(0, 0, 255),
        Scalar(0, 255, 0),
        Scalar(255, 100, 100),
        Scalar(255, 0, 255),
        Scalar(0, 255, 255)
    };

    Mat img(500, 500, CV_8UC3); // New canvas
    img = Scalar::all(255);     // Set canvas to white
    RNG rng(35345);             // Random number generator

    // Pick a random number of clusters in [2, MAX_CLUSTERS]
    int clusterCount = rng.uniform(2, MAX_CLUSTERS + 1);
    // Pick a random number of samples in [1, 1000]
    int sampleCount = rng.uniform(1, 1001);
    // Input sample matrix: sampleCount rows x 1 column, floating point, 2 channels
    Mat points(sampleCount, 1, CV_32FC2);
    Mat labels;
    // The number of clusters must not exceed the number of samples
    clusterCount = MIN(clusterCount, sampleCount);

    // Cluster centers computed by kmeans
    vector<Point2f> centers;

    // Generate samples from a mixture of Gaussians, one Gaussian per cluster
    for (int k = 0; k < clusterCount; k++) {
        Point center;
        center.x = rng.uniform(0, img.cols);
        center.y = rng.uniform(0, img.rows);

        // Rows of points that belong to cluster k; the last chunk takes the remainder
        Mat pointChunk = points.rowRange(k * sampleCount / clusterCount,
            k == clusterCount - 1 ? sampleCount : (k + 1) * sampleCount / clusterCount);

        // Fill pointChunk with Gaussian-distributed points centered on center
        rng.fill(pointChunk, RNG::NORMAL, Scalar(center.x, center.y), Scalar(img.cols*0.05, img.rows*0.05));
    }
    // Shuffle the rows of points
    randShuffle(points, 1, &rng);

    // Run k-means
    double compactness = kmeans(points, // Samples
        clusterCount,                   // Number of clusters
        labels,                         // Output cluster index of each sample
        TermCriteria(TermCriteria::EPS + TermCriteria::COUNT, 10, 1.0), // Stop after 10 iterations or at accuracy 1.0
        3,                              // Number of runs with different initial labelings
        KMEANS_PP_CENTERS,              // Initialize centers with k-means++
        centers);                       // Output cluster centers

    // Draw each sample in the color of its cluster
    for (int i = 0; i < sampleCount; i++) {
        int clusterIdx = labels.at<int>(i);

        Point ipt = points.at<Point2f>(i);
        circle(img, ipt, 2, colorTab[clusterIdx], FILLED, LINE_AA);
    }

    // Draw a circle around each cluster center
    for (int i = 0; i < (int)centers.size(); ++i) {
        Point2f c = centers[i];
        circle(img, c, 40, colorTab[i], 1, LINE_AA);
    }

    cout << "Compactness: " << compactness << endl;
    imshow("clusters", img);
    waitKey(0);

    return 0;
}
```
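The value that kmeans() returns, printed above as "Compactness", is described in the OpenCV documentation as the sum of squared distances from each sample to the center of its assigned cluster. As a rough sketch of that formula, the standalone function below recomputes it from samples, labels, and centers; `Pt` and `compactnessOf` are hypothetical names, and the code assumes 2-D points and valid label indices.

```cpp
#include <cstddef>
#include <vector>

struct Pt { double x, y; };  // hypothetical 2-D point type

// Sum of squared distances from each sample to the center of its assigned cluster.
// Assumes labels[i] is a valid index into centers.
double compactnessOf(const std::vector<Pt>& samples,
                     const std::vector<int>& labels,
                     const std::vector<Pt>& centers) {
    double total = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        const Pt& c = centers[labels[i]];
        const double dx = samples[i].x - c.x;
        const double dy = samples[i].y - c.y;
        total += dx * dx + dy * dy;
    }
    return total;
}
```

A smaller value means the samples sit closer to their centers, which is why it can be used to compare runs on the same data with the same K.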

The clustering results are shown in the figure below:

The complete code for this column will be published in my GitHub repository.
