Implementing a basic machine learning workflow with Python's scikit-learn (sklearn) package is straightforward: most of the work comes down to calling the library's ready-made functions.

A basic machine learning workflow comes down to four parts:

- Import the corresponding packages and data
- Data preprocessing
- Model establishment and fitting
- Model evaluation and prediction

# 1. Import corresponding packages and data

First, let's take the first step: importing the packages and the data. scikit-learn is organized into six main categories of tools, which you can see on the main page of the official website. Here I import five pieces of sklearn, of which StandardScaler is used for data standardization, train_test_split splits the data into a training set and a test set, and RandomForestRegressor is used to build the regression model.

## 1.1 Import packages

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from xgboost import XGBRegressor as XGBR
```

## 1.2 Import data

```python
data = pd.read_csv(r'./data.csv')
data.info()
```

# 2. Data preprocessing

The basic, necessary preprocessing steps here are deleting missing values, standardizing the data, and splitting it into training and test sets.

## 2.1 Deletion of missing values

```python
data.isnull().any()   # check for missing values
data = data.dropna()  # delete rows with missing values
data.info()           # view the cleaned data
```
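As a quick illustration of what `dropna` does (on a toy DataFrame with made-up values, not the real data), any row containing a NaN is removed:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with one missing value (illustrative data only)
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})
print(df.isnull().any())  # which columns contain missing values
clean = df.dropna()       # drop every row with any NaN
print(len(clean))         # 2 rows remain
```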

## 2.2 Data standardization

```python
# Extract the feature variables and the dependent variable
XX = data.loc[:, ['SAVI_MIN', 'SAVI_MEDIAN', 'SAVI_MAX',
                  'NEAR_DIST', 'NEAR_NUMBER',
                  'NDMI_MIN', 'NDMI_MEDIAN', 'NDMI_MAX',
                  'ASPECT', 'DEM', 'SLOPE',
                  'NDRE_MEDIAN', 'NDRE_MIN', 'NDRE_MAX',
                  'elev_percentile_30th', 'elev_percentile_60th', 'elev_percentile_90th',
                  'elev_mean', 'elev_variance',
                  'CanopyCover', 'GapFraction',
                  'density_metrics[0]', 'density_metrics[3]',
                  'density_metrics[6]', 'density_metrics[9]']]
Y = data['cha_'].values

# Standardize the features and the target
X = StandardScaler().fit_transform(XX)
Y = StandardScaler().fit_transform(Y.reshape(-1, 1))
```
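To see what `StandardScaler` actually does, here is a minimal sketch on a made-up column: each feature is rescaled to mean 0 and (population) standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative array: StandardScaler rescales each column to mean 0, variance 1
arr = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(arr)
print(scaled.mean(), scaled.std())  # approximately 0.0 and 1.0
```

Note that, strictly speaking, the scaler should be fit on the training data only and then applied to the test data, to avoid information leaking from the test set; the one-shot transform above simply mirrors the simple workflow in this post.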

## 2.3 Data segmentation

The key function here is train_test_split; it is worth reading its page in the official documentation carefully, since both its parameters and its return values matter.

```python
# Split the data: 70% training set, 30% test set
validation_size = 0.3
seed = 10
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
```
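A small self-contained sketch (on synthetic arrays, not the real features) shows what the split returns: with test_size=0.3 on 100 samples you get a 70-row training set and a 30-row test set, and fixing random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (100 samples, 3 features) just to show the split sizes
X_demo = np.random.RandomState(0).rand(100, 3)
y_demo = np.random.RandomState(1).rand(100)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10)
print(x_tr.shape, x_te.shape)  # (70, 3) (30, 3)
```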

# 3. Model establishment and fitting

Building and fitting the model takes just two steps: the first line of code creates the model, and the second line fits it (fitting must use the training data).

The two most important parameters of random forest regression are n_estimators and max_depth; tuning them can noticeably improve model accuracy.

```python
RF = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=10)
RF.fit(x_train, y_train.ravel())  # ravel() flattens y to the 1-D shape sklearn expects
```
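As a rough sketch of how tuning n_estimators might look (on synthetic data from make_regression, not the real dataset), candidate values can be compared with cross_val_score, which was imported earlier:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data stands in for the real features
Xs, ys = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

best_score, best_n = -np.inf, None
for n in (10, 50, 100):
    model = RandomForestRegressor(n_estimators=n, random_state=10, n_jobs=-1)
    # Mean 5-fold cross-validated R2 for this candidate
    score = cross_val_score(model, Xs, ys, cv=5, scoring="r2").mean()
    if score > best_score:
        best_score, best_n = score, n
print(best_n, round(best_score, 3))
```

The same loop could be run over max_depth values; for larger grids, sklearn's GridSearchCV does this search more systematically.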

# 4. Model evaluation and prediction

Common regression evaluation metrics include MSE, MAE, and R2; their formulas are easy to look up. The scores below are the model's final results.

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = RF.predict(x_test)
print("mse:", mean_squared_error(y_test, y_pred))   # MSE
print("mae:", mean_absolute_error(y_test, y_pred))  # MAE
print("r2:", r2_score(y_test, y_pred))              # R2
```
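If you want to see the formulas behind these metrics, here is a small check (on made-up toy vectors) that computes each one by hand with NumPy and compares it with sklearn's implementation:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy vectors to check the formulas against sklearn
y_t = np.array([1.0, 2.0, 3.0])
y_p = np.array([1.1, 1.9, 3.2])

mse = np.mean((y_t - y_p) ** 2)               # mean squared error
mae = np.mean(np.abs(y_t - y_p))              # mean absolute error
r2 = 1 - np.sum((y_t - y_p) ** 2) / np.sum((y_t - y_t.mean()) ** 2)

assert np.isclose(mse, mean_squared_error(y_t, y_p))
assert np.isclose(mae, mean_absolute_error(y_t, y_p))
assert np.isclose(r2, r2_score(y_t, y_p))
```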

That is the overall machine learning workflow. Thanks for reading!