A machine learning regression workflow with Python's sklearn

Posted by pristanski on Wed, 09 Mar 2022 08:29:32 +0100

Implementing a basic machine learning workflow with Python's sklearn package is straightforward; most of the work is calling functions the library already provides.

A basic machine learning workflow breaks down into four parts:

  1. Import the required packages and data
  2. Data preprocessing
  3. Model establishment and fitting
  4. Model evaluation and prediction

1. Import the required packages and data

First, let's do step one: import the packages and the data. Sklearn is organized into six main categories of tools, which are listed on the official website's main page. Here I import several functions from sklearn: StandardScaler is used for data standardization, train_test_split splits the data into training and test sets, and RandomForestRegressor builds the regression model.

1.1 Import packages

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from xgboost import XGBRegressor as XGBR

1.2 import data

data = pd.read_csv(r'./data.csv')

2. Data preprocessing

The basic, necessary preprocessing operations are deleting missing values, standardizing the data, and splitting it into training and test sets.

2.1 deletion of missing values

data.isnull().any()  # check which columns contain missing values

data = data.dropna()  # drop rows that contain missing values
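To make the two calls above concrete, here is a minimal sketch on a made-up DataFrame (the toy values are not from the original data): isnull().any() reports which columns contain missing values, and dropna() removes every row that has one.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with one missing value (illustrative only)
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

print(df.isnull().any())  # column 'a' contains a missing value
df = df.dropna()          # drops the middle row
print(len(df))            # 2 rows remain
```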

2.2 data standardization

# Extract the feature variables and the dependent variable
XX = data.loc[:,['SAVI_MIN','SAVI_MEDIAN','SAVI_MAX',
                 'ASPECT','DEM','SLOPE',
                 'elev_percentile_30th','elev_percentile_60th','elev_percentile_90th',
                 'elev_mean','elev_variance',
                 'CanopyCover','GapFraction']]

Y = data['cha_'].values

# Data standardization
X = StandardScaler().fit_transform(XX)
Y = StandardScaler().fit_transform(Y.reshape(-1, 1)).ravel()  # flatten back to 1-D for the regressor
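As a quick sanity check on what fit_transform does, here is a small sketch with made-up numbers (not the real features): after standardization every column has mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative 2-feature matrix on very different scales
X_raw = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
X_std = StandardScaler().fit_transform(X_raw)

# Each column now has mean ~0 and standard deviation ~1
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```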

2.3 Data splitting

The key function here is train_test_split; its parameters and return values matter, and they are worth studying carefully in the official documentation.

#Split 70% of the training set and 30% of the test set
validation_size = 0.3
seed = 10
x_train,x_test,y_train,y_test = train_test_split(X,Y,
    test_size = validation_size,random_state = seed)
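The split above can be verified on a tiny made-up dataset (the numbers below are illustrative, not the real data): with test_size=0.3, 10 samples split into 7 training rows and 3 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_demo = np.arange(10)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10)

print(x_tr.shape, x_te.shape)  # (7, 2) (3, 2)
```

Fixing random_state makes the split reproducible, so reruns give the same train/test partition.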

3. Model establishment and fitting

Next, establish the model and fit it. There are two steps: the first line of code creates the model; the second line fits it (fitting must use the training data).

The two most important parameters of random forest regression are n_estimators and max_depth; tuning them appropriately can improve model accuracy.

RF = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=10)
RF.fit(x_train, y_train)
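One simple way to tune n_estimators is to compare cross-validated scores, using the cross_val_score function imported earlier. The sketch below uses synthetic data from make_regression as a stand-in for the real features (an assumption, purely illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the real features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=10)

# Compare a few n_estimators values by 5-fold cross-validated R^2
for n in (10, 50, 100):
    model = RandomForestRegressor(n_estimators=n, n_jobs=-1, random_state=10)
    score = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(n, round(score, 3))
```

The same loop works for max_depth; pick the value with the best mean score.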

4. Model evaluation and prediction

Common regression evaluation metrics include MSE, MAE, and R2; their formulas are easy to look up. Evaluating the predictions on the test set gives the final measure of the model.

y_pred = RF.predict(x_test)  # predictions on the test set
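The metrics mentioned above are all available in sklearn.metrics. Here is a minimal sketch with made-up true values and predictions (not results from the real model):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Illustrative true values and predictions
y_test_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

print(mean_squared_error(y_test_demo, y_pred_demo))   # MSE: 0.375
print(mean_absolute_error(y_test_demo, y_pred_demo))  # MAE: 0.5
print(r2_score(y_test_demo, y_pred_demo))             # R2: ~0.949
```

In the actual workflow, pass y_test and the model's predictions in place of the demo arrays.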




That is the overall machine learning workflow. Thank you for reading!

Topics: Python Machine Learning Data Analysis sklearn