Scikit-learn is a great Python library for implementing machine learning models and statistical modeling. With it, we can not only build various machine learning models for regression, classification, and clustering, but also use its facilities for dimensionality reduction, feature selection, feature extraction, and ensemble techniques, along with its built-in datasets.
Today, I will introduce scikit-learn in detail. I believe this article will give you a deeper understanding of the library and how to apply it. If you find it helpful, please like and bookmark it.
1. Datasets
When learning algorithms, we all want some datasets to practice on. Scikit-learn comes with some very nice datasets, such as the iris dataset, the house price dataset, the diabetes dataset, and so on.
These datasets are very easy to obtain and understand. You can fit ML models on them directly, which makes them very suitable for beginners.
You can load one as follows:
```python
from sklearn import datasets
import pandas as pd

# Load the iris dataset and wrap it in a DataFrame
dataset = datasets.load_iris()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
```
The other built-in datasets can be imported in the same way.
2. Data splitting
Sklearn provides the ability to split datasets for training and testing. Splitting the data is essential for an unbiased evaluation of prediction performance, and you can define the proportion of data that goes into the training and test sets.
We can split the dataset as follows:
```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)
```
With the help of train_test_split, we have split the dataset so that the training set holds 80% of the data and the test set holds 20%.
3. Linear regression
This supervised machine learning model is used when the output variable is continuous and has a linear relationship with the input variables. For example, it can predict the sales of the next few months by analyzing the sales data of the previous months.
With sklearn, we can easily build a linear regression model, as shown below:
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

regression_model = LinearRegression()
regression_model.fit(x_train, y_train)
y_predicted = regression_model.predict(x_test)

# Evaluate on the test set: root mean squared error and R^2 score
rmse = np.sqrt(mean_squared_error(y_test, y_predicted))
r2 = r2_score(y_test, y_predicted)
```
First, LinearRegression() creates a linear regression object, then we fit the model on the training set, and finally we generate predictions on the test set. The RMSE and R² score can be used to check the accuracy of the model.
4. Logistic regression
Logistic regression is also a supervised algorithm, just like linear regression. The only difference is that the output variable is categorical. It can be used, for example, to predict whether a patient has heart disease.
With sklearn, we can easily implement the Logistic regression model, as shown below:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)

# Evaluate the classifier
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```
The confusion matrix and the classification report are used to check the accuracy of a classification model.
5. Decision tree
A decision tree is a powerful tool for classification and regression problems. It consists of a root and nodes: the root and the internal nodes represent splitting decisions, while the leaf nodes hold the values of the output variable. Decision trees are useful when the dependent and independent variables do not follow a linear relationship.
An implementation of a decision tree for classification:
```python
# Note: sklearn.externals.six was removed from scikit-learn; io.StringIO works the same here
from io import StringIO

from IPython.display import Image
from pydot import graph_from_dot_data
from sklearn.tree import DecisionTreeClassifier, export_graphviz

dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)

# Export the fitted tree to DOT format and render it with pydot
# (`iris` is the iris dataset object loaded via datasets.load_iris())
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=iris.feature_names)
(graph,) = graph_from_dot_data(dot_data.getvalue())

y_pred = dt.predict(x_test)
```
We use the DecisionTreeClassifier() object to fit the model, and the rest of the code visualizes the resulting decision tree in Python.
6. Bagging
Bagging is a technique for training multiple models of the same type on random samples drawn from the training set. The inputs of the different models are independent of each other.
For example, instead of a single decision tree, multiple decision trees can be used for prediction; this is the idea behind the random forest, as sketched below.
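As a minimal sketch, scikit-learn's BaggingClassifier can bag decision trees; the iris data and the 80/20 split here simply reuse the examples from the earlier sections:

```python
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = datasets.load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# Train 50 decision trees, each on an independent bootstrap sample
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=4)
bag.fit(x_train, y_train)
print("Accuracy:", bag.score(x_test, y_test))
```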
7. Boosting
Boosting trains multiple models sequentially, so that the input of each model depends on the output of the previous one. In Boosting, higher priority is given to the data points that were predicted incorrectly, as in the sketch below.
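A minimal sketch, assuming the same x_train/x_test split as in the bagging example above; AdaBoostClassifier is one boosting implementation that scikit-learn provides:

```python
from sklearn.ensemble import AdaBoostClassifier

# Train 50 weak learners in sequence; misclassified samples get higher weight
boost = AdaBoostClassifier(n_estimators=50, random_state=4)
boost.fit(x_train, y_train)
print("Accuracy:", boost.score(x_test, y_test))
```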
8. Random forest
Random forest is a bagging technique that uses hundreds of decision trees to build models for classification and regression problems, for example: classifying loan applicants, identifying fraudulent activity, and predicting diseases.
The implementation in Python is as follows:
```python
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

num_trees = 100
max_features = 3

clf = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```
9. XGBoost
XGBoost is a boosting technique that provides a high-performance implementation of gradient boosted decision trees. It handles missing data on its own, supports regularization, and usually gives more accurate results than other models.
The implementation in Python is as follows:
```python
import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBClassifier

xgb = XGBClassifier(colsample_bytree=0.3, learning_rate=0.1,
                    max_depth=5, alpha=10, n_estimators=10)
xgb.fit(x_train, y_train)
y_pred = xgb.predict(x_test)

# Report the root mean squared error on the test set
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: %f" % rmse)
```
10. Support vector machine (SVM)
SVM is a supervised machine learning algorithm that classifies data by finding the optimal hyperplane. It is used in many applications, such as face detection, e-mail classification, and so on.
It can be implemented in Python as:
```python
from sklearn import svm
from sklearn import metrics

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```
11. Confusion matrix
A confusion matrix is a table used to describe the performance of a classification model. It is analyzed with the help of the following four items:
- True positive (TP)
The model predicted positive, and the actual value is positive.
- True negative (TN)
The model predicted negative, and the actual value is negative.
- False positive (FP)
The model predicted positive, but the actual value is negative.
- False negative (FN)
The model predicted negative, but the actual value is positive.
In Python, it can be implemented as:
```python
from sklearn.metrics import confusion_matrix

# Use a separate name so the imported function is not shadowed
cm = confusion_matrix(y_test, y_pred)
print(cm)
```
12. K-means clustering
K-Means clustering is an unsupervised machine learning algorithm used to solve clustering problems. Unsupervised algorithms are algorithms where the dataset has no labels or output variable.
In clustering, the dataset is divided into different groups, called clusters, according to the features of the data points. K-means clustering has many applications, such as market segmentation, document clustering, and image segmentation.
It can be implemented in python as:
```python
from sklearn.cluster import KMeans

# Cluster the data into 3 groups
kmeans = KMeans(3)
kmeans.fit(x)
identified_clusters = kmeans.fit_predict(x)
```
13. DBSCAN clustering
DBSCAN is also an unsupervised clustering algorithm, which clusters data points according to their similarity. In DBSCAN, a cluster is only formed when there are at least a minimum number of points within a specified radius.
The advantage of DBSCAN is that it is robust to outliers, i.e. it handles outliers by itself, unlike k-means clustering. The DBSCAN algorithm is used for creating heat maps, for geospatial analysis, and for anomaly detection in temperature data.
It can be implemented as:
```python
import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

# Mark the core samples and count the clusters (label -1 marks noise/outliers)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print(labels)
```
14. Standardization and normalization
Standardization
Standardization is a scaling technique in which we set the mean of an attribute to 0 and its standard deviation to 1, so that the values are centered on the mean with unit standard deviation. The formula is X' = (X − μ) / σ.
Normalization
Normalization is a technique that rescales values into the range 0 to 1. It is also called min-max scaling. Normalization is done with the formula X' = (X − X_min) / (X_max − X_min).
```python
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
```
Scikit-learn provides the StandardScaler class for standardization and the MinMaxScaler class for normalization, as sketched below.
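A minimal usage sketch, assuming x is a numeric feature matrix like the one loaded in section 1:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Standardization: each feature gets mean 0 and standard deviation 1
x_standardized = StandardScaler().fit_transform(x)

# Normalization: each feature is rescaled into the [0, 1] range
x_normalized = MinMaxScaler().fit_transform(x)
```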
15. Feature extraction
Feature extraction is a method of extracting features from data. Data can only be passed to a machine learning model once it is converted into a numeric format, and scikit-learn provides functions for converting text and images into numbers.
Bag of Words and TF-IDF are the most commonly used methods scikit-learn provides for converting words into numbers in natural language processing, as sketched below.
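A minimal sketch with a toy corpus (the two sentences here are only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "scikit-learn makes machine learning simple",
    "machine learning models need numeric features",
]

# Bag of Words: raw token counts per document
bow = CountVectorizer().fit_transform(corpus)

# TF-IDF: counts re-weighted by how informative each term is
tfidf = TfidfVectorizer().fit_transform(corpus)

print(bow.toarray())
print(tfidf.toarray())
```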
Summary
This article introduced the 15 most important features of scikit-learn, along with their Python implementations.
References
https://ml2quantum.com/scikit-learn/