OpenCV4 machine learning: principle and implementation of K-means

Posted by Christopher on Tue, 07 Sep 2021 06:41:32 +0200

Preface:

This column combines OpenCV4 with basic image processing operations, classical machine learning algorithms (such as K-means, KNN, SVM, decision trees, and Bayesian classifiers), and common deep learning algorithms.

This article is part of a series that is continuously updated.

1, Basic introduction

K-means is an iterative clustering algorithm. Clustering is the process of grouping data members that are similar in some respect; clustering techniques discover this internal structure in the data. Because it requires no labels, clustering is commonly referred to as unsupervised learning.

K-means is the best-known partitional clustering algorithm. Because of its simplicity and efficiency, it is the most widely used of all clustering algorithms. Given a set of data points and a user-specified number of clusters K, the K-means algorithm repeatedly partitions the data into K clusters according to a distance function.
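In formula form (standard K-means notation, not anything specific to OpenCV), the algorithm looks for cluster assignments C_1, ..., C_K and centers mu_1, ..., mu_K that minimize the within-cluster sum of squared distances, which is also the "compactness" value returned by cv::kmeans:

\min_{\mu_1,\dots,\mu_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the set of samples assigned to cluster k and \mu_k is its center (the mean of those samples).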

2, Algorithm principle

For a given data set, K-means clustering proceeds as follows (a minimal C++ sketch of a single iteration follows the list):

  • Initialize K cluster centers.
  • Assign samples. Assign each sample to the cluster whose center is nearest, where "nearest" is measured by the chosen distance function; common choices are the Euclidean, Manhattan, Minkowski, and Hamming distances.
  • Update cluster centers. For the samples assigned to each cluster, compute their mean and take it as the new cluster center.
  • Check the termination condition. Stop when the cluster centers (or labels) have converged to the required accuracy or the maximum number of iterations has been reached.
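The following is a minimal, self-contained sketch of a single iteration (assignment step followed by update step) on 2-D points, using squared Euclidean distance. The Pt type and the kmeansStep helper are illustrative names, not part of OpenCV; in practice you would call cv::kmeans, described in the next section, which also handles initialization and the termination test.

#include <cstddef>
#include <iostream>
#include <vector>

//Illustrative 2-D point type (not an OpenCV type)
struct Pt { double x, y; };

//Squared Euclidean distance between two points
static double dist2(const Pt& a, const Pt& b) {
	double dx = a.x - b.x, dy = a.y - b.y;
	return dx * dx + dy * dy;
}

//One K-means iteration: assign every sample to its nearest center,
//then recompute each center as the mean of its assigned samples.
static void kmeansStep(const std::vector<Pt>& samples,
					   std::vector<Pt>& centers,
					   std::vector<int>& labels) {
	const std::size_t K = centers.size();
	labels.assign(samples.size(), 0);

	//Assignment step
	for (std::size_t i = 0; i < samples.size(); i++) {
		double best = dist2(samples[i], centers[0]);
		for (std::size_t k = 1; k < K; k++) {
			double d = dist2(samples[i], centers[k]);
			if (d < best) { best = d; labels[i] = (int)k; }
		}
	}

	//Update step: mean of the samples assigned to each cluster
	std::vector<Pt> sum(K, {0.0, 0.0});
	std::vector<int> count(K, 0);
	for (std::size_t i = 0; i < samples.size(); i++) {
		sum[labels[i]].x += samples[i].x;
		sum[labels[i]].y += samples[i].y;
		count[labels[i]]++;
	}
	for (std::size_t k = 0; k < K; k++)
		if (count[k] > 0)
			centers[k] = { sum[k].x / count[k], sum[k].y / count[k] };
}

int main() {
	std::vector<Pt> samples = { {1, 1}, {1.5, 2}, {8, 8}, {9, 9} };
	std::vector<Pt> centers = { {0, 0}, {10, 10} };  //K = 2 initial centers
	std::vector<int> labels;
	kmeansStep(samples, centers, labels);  //in practice, repeat until converged
	for (std::size_t k = 0; k < centers.size(); k++)
		std::cout << "center " << k << ": (" << centers[k].x << ", " << centers[k].y << ")\n";
	return 0;
}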

3, Function interpretation

In OpenCV4, the cv::kmeans function implements K-means clustering: it finds the centers of K clusters and assigns each input sample to the cluster whose center is nearest to it.

The cv::kmeans function is defined as follows:

double cv::kmeans(InputArray data,  //Input samples: floating-point matrix, one sample per row
				  int K, //Number of clusters
				  InputOutputArray bestLabels,  //Output integer array storing the cluster index of each sample
				  TermCriteria criteria,  //Algorithm termination condition: maximum number of iterations and/or required accuracy
				  int attempts,  //Number of times the algorithm is executed using different initial labellings
				  int flags,  //Flag specifying how the initial cluster centers are chosen
				  OutputArray centers = noArray()  //Output matrix of cluster centers, one cluster center per row
				  )
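The function returns the compactness, i.e. the sum of squared distances from each sample to the center of its assigned cluster, and flags is typically KMEANS_RANDOM_CENTERS, KMEANS_PP_CENTERS (k-means++ initialization), or KMEANS_USE_INITIAL_LABELS. Below is a minimal usage sketch; the 1-D sample values are arbitrary and only illustrate the required data layout (a floating-point Mat with one sample per row):

#include <iostream>
#include <opencv2/opencv.hpp>

int main() {
	//Six 1-D samples, one per row, in a floating-point matrix as cv::kmeans requires
	cv::Mat data = (cv::Mat_<float>(6, 1) << 1.0f, 1.2f, 0.9f, 8.0f, 8.3f, 7.9f);

	cv::Mat labels, centers;
	double compactness = cv::kmeans(data, 2, labels,
		cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 10, 1.0),
		3, cv::KMEANS_PP_CENTERS, centers);

	//labels holds the cluster index of each sample; centers holds one cluster center per row
	std::cout << "compactness: " << compactness << "\n"
			  << "labels:\n" << labels << "\n"
			  << "centers:\n" << centers << std::endl;
	return 0;
}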

4, Actual combat demonstration

The following example demonstrates how to cluster a set of two-dimensional points with cv::kmeans in OpenCV.

#include <iostream>
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;

int main() {
	const int MAX_CLUSTERS = 5; //Maximum number of categories
	Scalar colorTab[] = {   //Drawing colors, one per cluster
						 Scalar(0, 0, 255),
						 Scalar(0, 255, 0),
						 Scalar(255, 100, 100),
						 Scalar(255, 0, 255),
						 Scalar(0, 255, 255)
						};

	Mat img(500, 500, CV_8UC3); //New canvas
	img = Scalar::all(255); //Set canvas to white
	RNG rng(35345); //Random number generator

	//Randomly choose the number of clusters in [2, MAX_CLUSTERS]
	int clusterCount = rng.uniform(2, MAX_CLUSTERS + 1);
	//Randomly choose the number of samples in [1, 1000]
	int sampleCount = rng.uniform(1, 1001);
	//Input sample matrix: sampleCount rows x 1 column, floating point, 2 channels (x, y)
	Mat points(sampleCount, 1, CV_32FC2);
	Mat labels; 
	//The number of clusters cannot exceed the number of samples
	clusterCount = MIN(clusterCount, sampleCount);

	//Output cluster centers
	vector<Point2f> centers;

	//Randomly generate samples from a mixture of Gaussian distributions, one Gaussian per cluster
	for (int k = 0; k < clusterCount; k++) {
		Point center;
		center.x = rng.uniform(0, img.cols);
		center.y = rng.uniform(0, img.rows);

		//Rows of the sample matrix that belong to the k-th cluster
		Mat pointChunk = points.rowRange(k * sampleCount / clusterCount,
										 k == clusterCount - 1 ? sampleCount : (k + 1) * sampleCount / clusterCount);

		//Fill pointChunk with Gaussian-distributed points centered at center
		rng.fill(pointChunk, RNG::NORMAL, Scalar(center.x, center.y), Scalar(img.cols * 0.05, img.rows * 0.05));
	}
	//Shuffle the sample points so the clusters are interleaved
	randShuffle(points, 1, &rng);

	//Execute k-means
	double compactness = kmeans(points,  //Input samples
								clusterCount, //Number of clusters
								labels,  //Output integer array storing the cluster index of each sample
								TermCriteria(TermCriteria::EPS + TermCriteria::COUNT, 10, 1.0),  //Termination: at most 10 iterations or an accuracy of 1.0
								3, //Number of times the algorithm is run with different initial labellings
								KMEANS_PP_CENTERS, //Use the k-means++ method to choose the initial centers
								centers); //Output cluster centers, one per row
			
	//Draw or output clustering results
	for (int i = 0; i < sampleCount; i++) {
		int clusterIdx = labels.at<int>(i);

		Point ipt = points.at<Point2f>(i);
		circle(img, ipt, 2, colorTab[clusterIdx], FILLED, LINE_AA);
	}

	//Draw a circle with the cluster center as the center of the circle
	for (int i = 0; i < (int)centers.size(); ++i) {
		Point2f c = centers[i];
		circle(img, c, 40, colorTab[i], 1, LINE_AA);
	}

	cout << "Compactness: " << compactness << endl;
	imshow("clusters", img);
	waitKey(0);

	return 0;
}

The clustering results are shown in the figure below:

The complete code for this column is kept up to date in my GitHub repository. You are welcome to visit it, learn from it, and leave a star.

The best relationship is one of mutual achievement. Your likes, favorites, and shares are the biggest motivation for [AI bacteria] to keep creating. See you next time!

Topics: OpenCV Machine Learning Deep Learning