In-depth inventory: 15 essential scikit-learn skills for beginners

Posted by buluk21 on Sun, 28 Nov 2021 10:07:58 +0100

Scikit-learn is a great Python library for machine learning and statistical modeling. It not only implements a wide range of regression, classification, and clustering models, but also provides dimensionality reduction, feature selection, feature extraction, ensemble methods, and built-in datasets.

Today I will introduce scikit-learn in detail. I hope this article gives you a deeper understanding of the library and how to apply it.

1. Data set

When learning an algorithm, we all want some datasets to practice on. Scikit-learn ships with several handy ones, such as the iris dataset, a house price dataset, and the diabetes dataset.

These datasets are easy to load and easy to understand. You can fit ML models on them directly, which makes them well suited to beginners.

You can load one as follows:

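from sklearn import datasets
import pandas as pd
# Load the iris dataset and wrap its features in a DataFrame
dataset = datasets.load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Features (x) and labels (y) used by the examples in later sections
x = df
y = dataset.target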

You can load the other built-in datasets in the same way.
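
As a small illustration of loading a different built-in dataset, the sketch below uses load_wine(), which is not mentioned above and is chosen here only as an example. The variable names wine and wine_df are assumptions for this sketch.

```python
import pandas as pd
from sklearn import datasets

# load_wine() is another dataset bundled with scikit-learn (chosen for illustration)
wine = datasets.load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df["target"] = wine.target  # class label of each sample

print(wine_df.shape)   # (number of samples, number of features + 1)
print(wine_df.head())
```

The same pattern (load, then build a DataFrame from .data and .feature_names) should work for the other load_* functions that return tabular data.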

2. Data splitting

Sklearn can split a dataset into a training set and a test set. Evaluating a model on data it has not seen during training is essential for an unbiased estimate of its predictive performance, and you can choose what proportion of the data goes into each set.

We can split the dataset as follows:

from sklearn.model_selection import train_test_split
# test_size=0.2 keeps 20% of the samples for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

With train_test_split, we split the dataset so that 80% of the data goes into the training set and 20% into the test set.
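
To check the proportions, here is a minimal self-contained sketch on the iris data (150 samples). The variable names are assumptions for this sketch; test_size=0.2 is the fraction that gives the 80/20 split described above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target  # 150 samples, 4 features

# A float test_size is read as a fraction of the dataset
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
```

Note that an integer test_size is interpreted as an absolute number of test samples, not as a percentage.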

3. Linear regression

Linear regression is a supervised machine learning model used when the output variable is continuous and has a linear relationship with the input (independent) variables. For example, it can forecast sales for the next few months by analyzing the sales data of the previous months.

With sklearn, we can easily implement a linear regression model, as shown below:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
regression_model = LinearRegression()
regression_model.fit(x_train, y_train)
y_predicted = regression_model.predict(x_test)
# mean_squared_error returns the MSE; take the square root to get the RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_predicted))
r2 = r2_score(y_test, y_predicted)

First, LinearRegression() creates a linear regression object, which we then fit on the training set. Finally, we use the fitted model to predict on the test set. "rmse" (root mean squared error) and "r2" (the R² score) can be used to check how well the model fits.
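
Since linear regression needs a continuous target, the following self-contained sketch uses the built-in diabetes dataset mentioned in section 1. The choice of dataset and the variable names are assumptions for illustration, not part of the original walkthrough.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# The diabetes target is a continuous measure of disease progression
x, y = datasets.load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

model = LinearRegression().fit(x_train, y_train)
y_predicted = model.predict(x_test)

mse = mean_squared_error(y_test, y_predicted)
rmse = np.sqrt(mse)                  # same units as the target
r2 = r2_score(y_test, y_predicted)   # 1.0 would be a perfect fit

print(f"RMSE: {rmse:.2f}, R^2: {r2:.2f}")
```

A lower RMSE and an R² closer to 1 indicate a better fit on the held-out data.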

4. Logistic regression

Logistic regression is also a supervised algorithm, like linear regression. The main difference is that the output variable is categorical, so it is used for classification. For example, it can predict whether a patient has heart disease.

With sklearn, we can easily implement a logistic regression model, as shown below:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
# Store the result under a new name so the confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

The confusion matrix and the classification report are used to check the accuracy of a classification model.
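
As a runnable sketch of a binary case, the example below uses the built-in breast cancer dataset as a stand-in for the heart disease example (an assumption, since no heart disease data ships with scikit-learn). Scaling the features in a pipeline is also an addition here, made so that the solver converges with default settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_breast_cancer(return_X_y=True)  # binary labels: 0 or 1
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

# Standardize the features, then fit the logistic regression model
logreg = make_pipeline(StandardScaler(), LogisticRegression())
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(cm)
print(classification_report(y_test, y_pred))
```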

5. Decision tree

A decision tree is a powerful tool for both classification and regression problems. It consists of a root, internal nodes, and leaves. The root and the internal nodes represent splitting decisions, and the leaves represent values of the output variable. Decision trees are useful when the dependent and independent variables do not follow a linear relationship.

Implementation of a decision tree for classification:

from io import StringIO  # sklearn.externals.six is no longer available in current scikit-learn
from IPython.display import Image
from pydot import graph_from_dot_data
from sklearn.tree import DecisionTreeClassifier, export_graphviz
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
# Export the fitted tree in Graphviz DOT format
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=dataset.feature_names)
(graph,) = graph_from_dot_data(dot_data.getvalue())
# In a Jupyter notebook, this renders the tree as an image
Image(graph.create_png())

We use a DecisionTreeClassifier() object to fit the model, and the remaining code visualizes the fitted decision tree.
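
If Graphviz is not installed, a text rendering can serve the same purpose. The sketch below uses export_text from sklearn.tree. Limiting the tree to max_depth=2 is an assumption made only to keep the printed rules short.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()

# A shallow tree keeps the printed rules easy to read
dt = DecisionTreeClassifier(max_depth=2, random_state=0)
dt.fit(iris.data, iris.target)

# Each indented line is a split decision; "class:" lines are the leaves
rules = export_text(dt, feature_names=list(iris.feature_names))
print(rules)
```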

6. Bagging

Bagging is a technique that trains multiple models of the same type on random samples drawn from the training set. The inputs of the different models are independent of each other.

For example, a random forest makes its prediction with many decision trees rather than a single one.
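
A minimal sketch of bagging with scikit-learn's BaggingClassifier, assuming decision trees as the base model and the iris data as input. The number of estimators and the variable names are illustrative choices.

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = datasets.load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

# 50 trees, each trained on its own bootstrap sample of the training set
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(x_train, y_train)

accuracy = bagging.score(x_test, y_test)  # mean accuracy on the test set
print("Accuracy:", accuracy)
```

Because every tree sees a different random sample, the trees can be trained independently and their votes are combined at prediction time.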

7. Boosting

Boosting trains multiple models sequentially, so that the input of each model depends on the output of the previous one. In boosting, samples that were predicted incorrectly are given higher priority in the next round.
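
One boosting algorithm that reweights misclassified samples in this way is AdaBoost. The sketch below uses scikit-learn's AdaBoostClassifier on the iris data. The dataset and the parameter values are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

x, y = datasets.load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

# Models are fitted one after another; each round puts more weight
# on the training samples the previous rounds got wrong
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(x_train, y_train)

accuracy = boost.score(x_test, y_test)
print("Accuracy:", accuracy)
```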

8. Random forest

Random forest is a bagging technique that builds a model from hundreds of decision trees for classification and regression problems. Typical applications include classifying loan applicants, identifying fraudulent activity, and predicting diseases.

The implementation in Python is as follows:

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
num_trees = 100    # number of trees in the forest
max_features = 3   # number of features considered at each split
clf = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

9. XGBoost

XGBoost is a boosting technique that provides a high-performance implementation of gradient boosted decision trees. It can handle missing data by itself, supports regularization, and often gives more accurate results than other models.

The implementation in Python is as follows:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
# reg_alpha sets the L1 regularization strength
xgb = XGBClassifier(colsample_bytree=0.3, learning_rate=0.1, max_depth=5, reg_alpha=10, n_estimators=10)
xgb.fit(x_train, y_train)
y_pred = xgb.predict(x_test)
# XGBClassifier predicts class labels, so report accuracy rather than RMSE
print("Accuracy:", accuracy_score(y_test, y_pred))

10. Support vector machine (SVM)

SVM is a supervised machine learning algorithm that classifies data by finding the hyperplane that best separates the classes. It is used in many applications, such as face detection and e-mail classification.

It can be implemented in Python as follows:

from sklearn import svm
from sklearn import metrics
# A linear kernel looks for a separating hyperplane in the original feature space
clf = svm.SVC(kernel='linear')
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

11. Confusion matrix

A confusion matrix is a table that describes the performance of a classification model. It is analyzed in terms of the following four quantities:

  • True positive (TP)

This means that the model predicts positive and it is actually positive.

  • True negative (TN)

This means that the model predicts negative and it is actually negative.

  • False positive (FP)

This means that the model predicts positive, but it is actually negative.

  • False negative (FN)

This means that the model predicts negative, but it is actually positive.

It can be implemented in Python as follows:

from sklearn.metrics import confusion_matrix
# Use a new name for the result so the confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)
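
To connect the four quantities to the matrix itself, here is a small worked sketch with hand-written labels (y_true and y_hat are made-up values for illustration). For binary labels 0 and 1, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (made up)
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted labels (made up)

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tp, tn, fp, fn)  # 3 3 1 1

# Accuracy is the share of correct predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75
```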

12. K-means clustering

K-means clustering is an unsupervised machine learning algorithm used to solve clustering problems. Unsupervised algorithms work on datasets that have no labels or output variable.

In clustering, a dataset is divided into groups according to the characteristics of its samples, and each group is called a cluster. K-means clustering has many applications, such as market segmentation, document clustering, and image segmentation.

It can be implemented in Python as:

from sklearn.cluster import KMeans
# Group the samples into 3 clusters
kmeans = KMeans(n_clusters=3)
# fit_predict fits the model and returns the cluster index of each sample
identified_clusters = kmeans.fit_predict(x)
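
A self-contained sketch on synthetic data may make the result easier to inspect. make_blobs and its parameter values are assumptions chosen for illustration: it generates three well-separated groups of 2-D points.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 150 two-dimensional points scattered around 3 centers; the true labels are ignored
x, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(x)  # cluster index (0, 1, or 2) for each point

print(kmeans.cluster_centers_)  # one row of coordinates per cluster center
print(labels[:10])
```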

13. DBSCAN clustering

DBSCAN is also an unsupervised clustering algorithm, which groups data points according to how densely they are packed. In DBSCAN, a cluster is formed only when at least a minimum number of points lie within a specified radius.

The advantage of DBSCAN is that it is robust to outliers: unlike k-means clustering, it labels them as noise instead of forcing them into a cluster. DBSCAN is used for building heat maps, geospatial analysis, and anomaly detection in temperature data.

It can be implemented as follows:

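import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Scale the features first, because eps is a distance threshold
X = StandardScaler().fit_transform(x)
# eps is the neighborhood radius; min_samples is the minimum number of points in it
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
# Boolean mask that marks the core samples
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# The label -1 marks noise, so it is excluded from the cluster count
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print(labels)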
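
The outlier handling can be seen on a tiny hand-made example. The points and the parameter values below are made up for illustration: two tight groups of three points each, plus one point far away from both.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],   # first dense group
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # second dense group
    [20.0, 20.0],                         # isolated point
])

# A point is a core sample if at least 3 points (itself included) lie within 0.5 of it
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(labels)      # [ 0  0  0  1  1  1 -1]
print(n_clusters)  # 2
```

The isolated point receives the label -1, which is how DBSCAN marks noise.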

14. Standardization and normalization

Standardization

Standardization is a scaling technique. It rescales an attribute so that its mean is 0 and its standard deviation is 1, which centers the values on the mean with unit standard deviation. It is computed as X' = (X - μ) / σ, where μ is the mean and σ is the standard deviation of the attribute.

Normalization

Normalization is a technique that rescales values to the range 0 to 1. It is also called min-max scaling. It is computed as X' = (X - Xmin) / (Xmax - Xmin).

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

Scikit-learn provides the StandardScaler class for standardization and the MinMaxScaler class for normalization.
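
A short sketch of both scalers on a single made-up column of values (the input array is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one feature, five samples

# Standardization: subtract the mean and divide by the standard deviation
standardized = StandardScaler().fit_transform(X)
print(standardized.ravel())

# Normalization: map the minimum to 0 and the maximum to 1
normalized = MinMaxScaler().fit_transform(X)
print(normalized.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```

In practice, a scaler is usually fitted on the training set only and then applied to the test set with transform(), so that no information from the test data leaks into training.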

15. Feature extraction

Feature extraction is a method of deriving features from raw data. Data can be passed to a machine learning model only after it has been converted to numerical form, and scikit-learn provides tools for converting text and images into numbers.

Bag of Words and TF-IDF, both provided by scikit-learn, are the most commonly used methods for converting words into numbers in natural language processing.
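
Both methods are available in sklearn.feature_extraction.text as CountVectorizer (Bag of Words) and TfidfVectorizer (TF-IDF). The three short documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]  # made-up example corpus

# Bag of Words: one column per word, each cell holds a word count
bow = CountVectorizer()
counts = bow.fit_transform(docs)
vocabulary = sorted(bow.vocabulary_)  # learned words in column order
print(vocabulary)        # ['cat', 'dog', 'ran', 'sat', 'the']
print(counts.toarray())

# TF-IDF: counts are reweighted so that words found in every document count less
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```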

Summary

This article briefly introduced 15 of the most important features of scikit-learn, together with their Python implementations.

References

https://ml2quantum.com/scikit-learn/

Topics: Python Machine Learning Data Mining