Reference source: https://github.com/datawhalechina/team-learning-data-mining/tree/master/EnsembleLearning
1. Data set
1.1 background introduction
The basic principle of thermal power generation is: fuel is burned to heat water and produce steam, the steam pressure drives a steam turbine, and the turbine in turn drives a generator to produce electricity. In this chain of energy conversions, the key factor for generating efficiency is the combustion efficiency of the boiler, that is, how well the burning fuel heats water into high-temperature, high-pressure steam. Many factors affect boiler combustion efficiency, including the adjustable boiler parameters, such as fuel feed, primary and secondary air, induced air, return air and feed water, as well as the boiler's operating conditions, such as bed temperature and bed pressure, furnace temperature and pressure, and superheater temperature. The question is how to use this information to predict the amount of steam produced from the boiler's operating conditions, and thereby contribute to output prediction for China's industry.
This case therefore uses the industrial indicators above as features to predict the steam volume. For information-security reasons, the data are desensitized readings from boiler sensors (collected at minute-level frequency).
In practice the background matters little, because the data are desensitized: we do not know the real meaning of each attribute, so there is no need to hand-craft features for individual attributes, add domain-specific attributes, build cross features, and so on.
1.2 data set
The data are split into training data (train.txt) and test data (test.txt), in which the fields "V0"-"V37" are the feature variables and "target" is the target variable. We need to train a model on the training data and predict the target variable of the test data.
1.3 evaluation metric
The final evaluation metric is the mean squared error (MSE), namely:
Score = \frac{1}{n} \sum_{i=1}^{n} (y_i - y_i^*)^2
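For reference, a minimal sketch of how this score could be computed with scikit-learn (the array values and names y_true and y_pred below are placeholders, not part of the competition data):

from sklearn.metrics import mean_squared_error
import numpy as np

# Hypothetical example values, only to illustrate the metric
y_true = np.array([0.5, 1.2, -0.3])   # ground-truth target values
y_pred = np.array([0.4, 1.0, -0.1])   # model predictions
score = mean_squared_error(y_true, y_pred)  # equals mean((y_true - y_pred) ** 2)
print(score)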
2. Data preprocessing
2.1 importing dependencies
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
# Modeling
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score, cross_val_predict, KFold
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, StandardScaler
2.2 reading data
data_train = pd.read_csv('train.txt', sep='\t')
data_test = pd.read_csv('test.txt', sep='\t')

# Merge training data and test data.
# Before merging, label each row to indicate its source.
data_train["oringin"] = "train"
data_test["oringin"] = "test"
data_all = pd.concat([data_train, data_test], axis=0, ignore_index=True)

# Display the first 5 rows
data_all.head()
2.3 exploratory analysis of data
Here, kdeplot (a kernel density estimation plot) is used for a preliminary exploratory analysis of the data (EDA).
Kernel density estimation
Kernel density estimation: https://www.zhihu.com/question/27301358
Intuitively, kernel density estimation takes the scattered sample points and smooths them into a continuous density curve, so that the distribution of the data can be roughly observed. In this example, kernel density estimation is used to compare the distributions of the training set and the test set, so that features whose distributions differ markedly between the two sets can be removed.
# Kernel density estimation estimates an unknown density function; it is a nonparametric method.
# The KDE plot gives an intuitive view of the distribution of the samples.
# kdeplot manual: https://www.cntofu.com/book/172/docs/25.md
# ax: the axis to draw on; if None, the current axis is used
for column in data_all.columns[0:-2]:
    g = sns.kdeplot(data_all[column][(data_all["oringin"] == "train")], color="Red", shade=True)
    g = sns.kdeplot(data_all[column][(data_all["oringin"] == "test")], ax=g, color="Blue", shade=True)
    g.set_xlabel(column)
    g.set_ylabel("Frequency")
    g = g.legend(["train", "test"])
    plt.show()
There are too many plots, so only a few typical features are shown for inspection.
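Besides eyeballing the KDE plots, the train/test distribution gap can also be quantified. The sketch below uses a two-sample Kolmogorov-Smirnov test from scipy; this is a supplement, not part of the original tutorial, and the 0.05 threshold is only an illustrative choice:

from scipy import stats

# For each feature, compare the train and test samples with a two-sample KS test;
# a small p-value suggests the two distributions differ noticeably.
for column in data_all.columns[0:-2]:
    train_vals = data_all.loc[data_all["oringin"] == "train", column]
    test_vals = data_all.loc[data_all["oringin"] == "test", column]
    stat, p_value = stats.ks_2samp(train_vals, test_vals)
    if p_value < 0.05:  # illustrative threshold
        print(column, "train/test distributions differ (p = {:.4f})".format(p_value))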
Next, view the degree of correlation between the features.
# Draw a correlation heatmap (the "oringin" column is excluded)
data_train1 = data_all[data_all["oringin"] == "train"].drop("oringin", axis=1)
plt.figure(figsize=(20, 20))  # width and height of the figure
colnm = data_train1.columns.tolist()  # list of column names
mcorr = data_train1[colnm].corr(method="spearman")  # correlation coefficient between any two variables
mask = np.zeros_like(mcorr, dtype=bool)  # boolean array with the same shape as mcorr (plain bool; np.bool was removed from NumPy)
mask[np.triu_indices_from(mask)] = True  # mask the upper triangle above the diagonal
cmap = sns.diverging_palette(220, 10, as_cmap=True)  # matplotlib colormap object (palette)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')  # heatmap of pairwise correlations
plt.show()
Dimensionality reduction: delete the features whose absolute correlation with the target is below a threshold.
threshold = 0.1
corr_matrix = data_train1.corr().abs()
drop_col = corr_matrix[corr_matrix["target"] < threshold].index
data_all.drop(drop_col, axis=1, inplace=True)
Then normalize each column with the min-max formula:
y = \frac{col - col.min()}{col.max() - col.min()}
cols_numeric = list(data_all.columns)
cols_numeric.remove("oringin")

def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())

scale_cols = [col for col in cols_numeric if col != 'target']
data_all[scale_cols] = data_all[scale_cols].apply(scale_minmax, axis=0)
data_all[scale_cols].describe()
# By the formula, every scaled column now has minimum 0 and maximum 1.
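As a side note, the same min-max scaling could be done with sklearn's MinMaxScaler (already imported above). A minimal sketch, not used by the rest of the tutorial, fitting on all rows just as the manual version does:

# Equivalent min-max scaling with scikit-learn
scaler = MinMaxScaler()
data_all_scaled = data_all.copy()
data_all_scaled[scale_cols] = scaler.fit_transform(data_all_scaled[scale_cols])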
This completes the data preprocessing. Missing values are not imputed at this stage, and outlier removal is deferred until model training.
3. Feature engineering
For the quantile-quantile (Q-Q) plot, please refer to: https://blog.csdn.net/u012193416/article/details/83210790
fcols = 6
frows = len(cols_numeric) - 1
plt.figure(figsize=(4 * fcols, 4 * frows))
i = 0  # i tracks the subplot position

# var iterates over the feature columns
for var in cols_numeric:
    # Plot each feature against target, so skip target itself
    if var != 'target':
        dat = data_all[[var, 'target']].dropna()

        i += 1
        # plt.subplot(nrows, ncols, index): the index starts at 1 at the upper-left corner
        # and increases left to right, row by row
        plt.subplot(frows, fcols, i)
        # Fit a normal distribution (stats.norm) as a reference curve for the original feature
        sns.distplot(dat[var], fit=stats.norm)
        plt.title(var + ' Original')
        plt.xlabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        # stats.probplot draws a Q-Q plot; by default it compares against the normal distribution
        _ = stats.probplot(dat[var], plot=plt)
        # stats.skew: sample skewness of the data.
        # For normally distributed data the skewness is approximately zero;
        # for a unimodal continuous distribution, skewness > 0 means more weight in the right tail.
        plt.title('skew=' + '{:.4f}'.format(stats.skew(dat[var])))
        plt.xlabel('')
        plt.ylabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        # Scatter plot of the current feature against target ('#' in the original is not a valid marker; use '.')
        plt.plot(dat[var], dat['target'], '.', alpha=0.5)
        # np.corrcoef computes the correlation coefficient
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(dat[var], dat['target'])[0][1]))

        i += 1
        plt.subplot(frows, fcols, i)
        trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)
        trans_var = scale_minmax(trans_var)
        sns.distplot(trans_var, fit=stats.norm)
        plt.title(var + ' Transformed')
        plt.xlabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        _ = stats.probplot(trans_var, plot=plt)
        plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
        plt.xlabel('')
        plt.ylabel('')

        i += 1
        plt.subplot(frows, fcols, i)
        plt.plot(trans_var, dat['target'], '.', alpha=0.5)
        plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var, dat['target'])[0][1]))
Box-Cox transformation
The Box-Cox transformation is commonly used in statistical modeling when a continuous response variable does not follow a normal distribution. For example, in linear regression, if the residual ϵ is not normally distributed, the modeling assumptions are not met, so the response variable Y is transformed to bring the data closer to normality. After the Box-Cox transform, the correlation between the residuals and the predictor variables can also be reduced to some extent.
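For reference, the one-parameter Box-Cox transform that stats.boxcox applies (to strictly positive y) is:

y(\lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}

where scipy chooses λ by maximum likelihood when it is not supplied; this is also why the code below shifts every column by +1 to keep the input positive.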
# Box-Cox transformation
# Meaning of the underscore in Python: https://www.runoob.com/w3cnote/python-5-underline.html
cols_transform = data_all.columns[0:-2]
for col in cols_transform:
    # boxcox requires the input data to be positive, hence the +1 shift.
    # stats.boxcox() returns the Box-Cox power-transformed data (and the fitted lambda).
    data_all.loc[:, col], _ = stats.boxcox(data_all.loc[:, col] + 1)

print(data_all.target.describe())
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.distplot(data_all.target.dropna(), fit=stats.norm)
plt.subplot(1, 2, 2)
_ = stats.probplot(data_all.target.dropna(), plot=plt)
Transform the target value to improve the normality of the data (the code below actually applies an exponential transform, np.power(1.5, target), rather than a logarithm). Refer to: https://www.zhihu.com/question/22012482
sp = data_train.target
data_train['target1'] = np.power(1.5, sp)  # assign as a column (attribute assignment would not create one)
print(data_train.target1.describe())

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.distplot(data_train.target1.dropna(), fit=stats.norm)
plt.subplot(1, 2, 2)
_ = stats.probplot(data_train.target1.dropna(), plot=plt)
As can be seen from the Q-Q plot, the normality has improved.
4. Model building and ensemble learning
Build training set and test set
Split the previously merged data back into the training set and the test set, and drop the manually added helper columns that are irrelevant to training.
# Function to get the training samples
def get_training_data():
    df_train = data_all[data_all["oringin"] == "train"]
    df_train["label"] = data_train.target1
    # split target and features
    y = df_train.target
    X = df_train.drop(["oringin", "target", "label"], axis=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=100)
    return X_train, X_valid, y_train, y_valid

# Extract the test data (which has no target)
def get_test_data():
    df_test = data_all[data_all["oringin"] == "test"].reset_index(drop=True)
    return df_test.drop(["oringin", "target"], axis=1)
Evaluation functions: RMSE and MSE
from sklearn.metrics import make_scorer

# Metrics for evaluation
def rmse(y_true, y_pred):
    diff = y_pred - y_true
    sum_sq = sum(diff ** 2)
    n = len(y_pred)
    return np.sqrt(sum_sq / n)

def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)

# Scorers to be used in sklearn model fitting.
# greater_is_better is True (the default) for score functions;
# it is set to False here because these are loss functions.
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)
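Because greater_is_better=False, scikit-learn negates the value returned by the wrapped function, so cross-validation scores computed with these scorers come out negative. A small usage sketch, assuming the X_train and y_train produced by get_training_data() below:

# Illustration only: evaluate a Ridge model with the custom RMSE scorer.
# The returned scores are negative; take their absolute value to recover the RMSE.
rmse_cv = cross_val_score(Ridge(), X_train, y_train, scoring=rmse_scorer, cv=5)
print('mean RMSE:', abs(rmse_cv).mean())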
Find and remove outliers
# Detect outliers based on the residuals of a fitted model
def find_outliers(model, X, y, sigma=3):
    # predict y values using the model
    model.fit(X, y)
    y_pred = pd.Series(model.predict(X), index=y.index)

    # calculate residuals between the model prediction and the true y values
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()

    # calculate the z statistic; define outliers as |z| > sigma
    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index

    # print and plot the results
    print('R2=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print("mse=", mean_squared_error(y, y_pred))
    print('---------------------------------------')
    print('mean of residuals:', mean_resid)
    print('std of residuals:', std_resid)
    print('---------------------------------------')
    print(len(outliers), 'outliers:')
    print(outliers.tolist())

    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y_pred')

    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('y')
    plt.ylabel('y - y_pred')

    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.legend(['Accepted', 'Outlier'])
    plt.xlabel('z')

    return outliers
# get the training data
X_train, X_valid, y_train, y_valid = get_training_data()
test = get_test_data()

# find and remove outliers using a Ridge model
outliers = find_outliers(Ridge(), X_train, y_train)
X_outliers = X_train.loc[outliers]
y_outliers = y_train.loc[outliers]
X_t = X_train.drop(outliers)
y_t = y_train.drop(outliers)
Model training
def get_trainning_data_omitoutliers():
    # Return the training data with outliers removed
    y = y_t.copy()
    X = X_t.copy()
    return X, y

def train_model(model, param_grid=[], X=[], y=[], splits=5, repeats=5):
    # get data
    if len(y) == 0:
        X, y = get_trainning_data_omitoutliers()

    # cross-validation scheme
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)

    # grid search for the optimal hyperparameters
    if len(param_grid) > 0:
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)
        # train
        gsearch.fit(X, y)

        # best model
        model = gsearch.best_estimator_
        best_idx = gsearch.best_index_

        # cross-validation metrics
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx, 'mean_test_score'])
        cv_std = grid_results.loc[best_idx, 'std_test_score']

    # no grid search
    else:
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)

    # combine the scores
    cv_score = pd.Series({'mean': cv_mean, 'std': cv_std})

    # predictions on the training data
    y_pred = model.predict(X)

    # model performance statistics
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=', model.score(X, y))
    print('rmse=', rmse(y, y_pred))
    print('mse=', mse(y, y_pred))
    print('cross_val: mean=', cv_mean, ', std=', cv_std)

    # residual analysis
    y_pred = pd.Series(y_pred, index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid) / std_resid
    n_outliers = sum(abs(z) > 3)
    outliers = z[abs(z) > 3].index

    return model, cv_score, grid_results

# Containers to store the trained models and their scores
opt_models = dict()
score_models = pd.DataFrame(columns=['mean', 'std'])
splits = 5
repeats = 5

model = 'Ridge'              # replaceable; see the various models in case analysis I
opt_models[model] = Ridge()  # replaceable; see the various models in case analysis I
alph_range = np.arange(0.25, 6, 0.25)
param_grid = {'alpha': alph_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model],
                                                        param_grid=param_grid,
                                                        splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(alph_range, abs(grid_results['mean_test_score']),
             abs(grid_results['std_test_score']) / np.sqrt(splits * repeats))
plt.xlabel('alpha')
plt.ylabel('score')
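The same train_model helper can be reused for the other imported regressors by swapping the model and its parameter grid. For example, a sketch for Lasso; the alpha range here is only illustrative and not the tuning grid from case analysis I:

# Train a Lasso model with the same cross-validated grid search
model = 'Lasso'
opt_models[model] = Lasso()
alph_range = np.arange(1e-4, 1e-3, 4e-5)  # illustrative alpha grid
param_grid = {'alpha': alph_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model],
                                                        param_grid=param_grid,
                                                        splits=splits, repeats=repeats)
cv_score.name = model
score_models = score_models.append(cv_score)  # as in the Ridge example above (DataFrame.append requires pandas < 2.0)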
# Prediction function: average the predictions of the trained models
def model_predict(test_data, test_y=[]):
    i = 0
    y_predict_total = np.zeros((test_data.shape[0],))
    for model in opt_models.keys():
        if model != "LinearSVR" and model != "KNeighbors":
            y_predict = opt_models[model].predict(test_data)
            y_predict_total += y_predict
            i += 1
        if len(test_y) > 0:
            print("{}_mse:".format(model), mean_squared_error(y_predict, test_y))
    y_predict_mean = np.round(y_predict_total / i, 6)
    if len(test_y) > 0:
        print("mean_mse:", mean_squared_error(y_predict_mean, test_y))
    else:
        y_predict_mean = pd.Series(y_predict_mean)
        return y_predict_mean
# Model prediction and saving the results
y_ = model_predict(test)
y_.to_csv('predict.txt', header=None, index=False)