# 1 data preprocessing

Data preprocessing is roughly divided into three steps: data preparation, data conversion and data output.

## 1.1 formatted data

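scikit-learn provides two standard ways to apply a data transformation. With Fit and Multiple Transform, `fit()` is called once to learn the transformation parameters and `transform()` is then called as many times as needed. With Combined Fit-and-Transform, `fit_transform()` does both in a single call. Prefer Fit and Multiple Transform, because the fitted parameters can be reused on new data.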
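The difference between the two approaches can be sketched on a toy array. The data and variable names below are made up for illustration; only `StandardScaler`, `fit`, `transform`, and `fit_transform` come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: a small training set and one new row with the same columns.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
new_batch = np.array([[2.0, 40.0]])

# Fit and Multiple Transform: learn the parameters once, then reuse them.
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new_batch)  # uses the mean/std learned from train

# Combined Fit-and-Transform: one call, same result on the data it was fitted on.
train_scaled_combined = StandardScaler().fit_transform(train)

print(train_scaled)
print(new_scaled)
```

Keeping the fitted `scaler` object around is what makes it possible to transform later data with exactly the parameters learned from the training data.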

## 1.2 adjust data scale

```python
# Adjust data scale (0..1)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input data and output results
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

transformer = MinMaxScaler(feature_range=(0, 1))
# Data conversion
newX = transformer.fit_transform(X)

# Set the print format of data
set_printoptions(precision=3)
print(newX)
```

Note: in scikit-learn, the `MinMaxScaler` class rescales each attribute to a specified range (0 to 1 here), so that data measured in different units share the same scale. This helps classification and clustering algorithms that are sensitive to the magnitude of their inputs. `MinMaxScaler` does not center the data around 0 with a variance of 1; that is what `StandardScaler` does.
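The rescaling itself is a simple linear map. The following plain-Python sketch shows the arithmetic for one column; the function name `min_max_scale` and the sample values are hypothetical, and this is an illustration of the formula rather than scikit-learn's implementation.

```python
def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Rescale a list of numbers linearly into feature_range."""
    low, high = feature_range
    col_min, col_max = min(column), max(column)
    span = col_max - col_min
    if span == 0:
        # Constant column: map every value to the lower bound (a choice made for this sketch).
        return [low for _ in column]
    return [low + (x - col_min) / span * (high - low) for x in column]


ages = [21, 30, 50, 81]  # hypothetical values
scaled = min_max_scale(ages)
print(scaled)  # the smallest value maps to 0.0 and the largest to 1.0
```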

## 1.3 standardized data (StandardScaler)

```python
"""Standardized data"""
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import StandardScaler

# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input data and output results
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

transformer = StandardScaler().fit(X)
# Data conversion
newX = transformer.transform(X)

# Format data for printing
set_printoptions(precision=3)
print(newX)
```

Note: standardization is an effective way to handle data that follow a Gaussian distribution. `StandardScaler` rescales each attribute to a mean of 0 and a standard deviation of 1, and the result is used as input to algorithms that assume Gaussian-distributed data.
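What `StandardScaler` computes per column can be written out by hand: subtract the column mean and divide by the population standard deviation. The sketch below uses only the standard library; the function name `standardize` and the sample values are hypothetical, and it assumes the column is not constant.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population std, i.e. dividing by n rather than n - 1
    return [(x - mu) / sigma for x in column]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical column
z = standardize(values)
print(z)
print(mean(z), pstdev(z))  # mean is 0 and standard deviation is 1, up to rounding
```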

## 1.4 normalized data (Normalizer)

```python
"""Normalized data"""
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer

# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input data and output results
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

transformer = Normalizer().fit(X)
# Data conversion
newX = transformer.transform(X)

# Set the print format of data
set_printoptions(precision=3)
print(newX)
```

Note: normalization rescales each row of data so that its vector length (norm) is 1; `Normalizer` uses the L2 norm by default. It is well suited to sparse data, and it can noticeably improve the accuracy of algorithms that weight their inputs, such as neural networks, and of algorithms that rely on distances, such as K-nearest neighbors.
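Unlike the column-wise scalers, this operation works row by row. A minimal plain-Python sketch of L2 row normalization follows; the function name `l2_normalize_rows` and the sample rows are hypothetical, and leaving all-zero rows unchanged is a choice made for this sketch.

```python
import math


def l2_normalize_rows(rows):
    """Divide every row by its Euclidean (L2) length."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        # An all-zero row has no direction, so it is left as-is here.
        normalized.append([x / norm for x in row] if norm else list(row))
    return normalized


rows = [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]]  # hypothetical data
unit_rows = l2_normalize_rows(rows)
print(unit_rows)  # the first row has length 5, so it becomes roughly [0.6, 0.8]
```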

## 1.5 binarized data

```python
"""Binarized data"""
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Binarizer

# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input data and output data
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

transformer = Binarizer(threshold=0.0).fit(X)
# Data conversion
newX = transformer.transform(X)

# Set the print format of data
set_printoptions(precision=3)
print(newX)
```

Note: binarization converts data into binary values using a threshold: values greater than the threshold are set to 1, and values less than or equal to the threshold are set to 0. This process is called binarization or thresholding.
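The thresholding rule is small enough to write by hand. In this plain-Python sketch the function name `binarize` and the sample rows are hypothetical; note that a value equal to the threshold maps to 0.

```python
def binarize(rows, threshold=0.0):
    """Map values strictly greater than threshold to 1.0, everything else to 0.0."""
    return [[1.0 if x > threshold else 0.0 for x in row] for row in rows]


rows = [[0.0, 148.0, -1.5], [6.0, 0.0, 33.6]]  # hypothetical data
binarized = binarize(rows, threshold=0.0)
print(binarized)  # [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
```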

# 2 data feature selection

## 2.1 feature selection

Feature selection is the process of choosing the features that contribute most to the accuracy of the prediction, or to the output we are interested in. Irrelevant features reduce the accuracy of an algorithm and interfere with predictions on new data, especially for linear algorithms such as linear regression and logistic regression. Performing feature selection before modeling therefore helps to:

- Reduce overfitting: with less redundant data, the algorithm has fewer opportunities to draw conclusions from noise.
- Improve accuracy: less misleading data improves the accuracy of the algorithm.
- Reduce training time: the less data there is, the less time it takes to train the model.

## 2.2 univariate feature selection

```python
"""Select data features with the chi-square test"""
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input data and output results
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

# Feature selection
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
print(features)
```

Note: statistical tests can be used to select the features that have the strongest relationship with the output. The classical chi-square test measures the association between a qualitative independent variable and a qualitative dependent variable. Its statistic quantifies how far the observed values of a sample deviate from the values expected in theory. The larger the chi-square value, the greater the deviation and the less the observations agree with expectation; the smaller the value, the closer the agreement. If the observed and expected values are exactly equal, the chi-square value is 0.
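The statistic described here is the sum of squared deviations between observed and expected counts, each divided by the expected count. The sketch below shows that arithmetic in plain Python with made-up counts; the function name `chi_square` is hypothetical, and this is the textbook formula rather than a reproduction of how scikit-learn's `chi2` derives its observed and expected tables from `X` and `Y`.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [30, 10]  # hypothetical observed counts
expected = [20, 20]  # counts expected if there were no association

stat = chi_square(observed, expected)
print(stat)  # 10.0, since 100/20 + 100/20 = 10

# Identical observed and expected counts give a statistic of 0.
print(chi_square(expected, expected))  # 0.0
```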

## 2.3 recursive feature elimination

```python
"""Select features by recursive elimination"""
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input data and output results
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

# Feature selection
model = LogisticRegression()
rfe = RFE(estimator=model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Number of features:")
print(fit.n_features_)
print("Selected features:")
print(fit.support_)
print("Feature ranking:")
print(fit.ranking_)
```

Note: recursive feature elimination (RFE) trains a base model over multiple rounds. After each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the remaining features. In this way RFE identifies the features that have the greatest impact on the final prediction. The scikit-learn documentation describes recursive feature elimination in more detail.
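The round-by-round procedure can be sketched as an explicit loop. This is a simplified illustration, not scikit-learn's `RFE` implementation: it drops exactly one feature per round based on the absolute logistic regression coefficient, and it runs on synthetic data from `make_classification` rather than on the Pima dataset. The names `remaining` and `n_to_keep` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption made for this sketch).
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

remaining = list(range(X.shape[1]))  # column indices still in play
n_to_keep = 3

while len(remaining) > n_to_keep:
    # Train the base model on the features that are still in play.
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # Find the feature with the smallest absolute weight and eliminate it.
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    remaining.pop(weakest)

print("Selected feature indices:", remaining)
```

Comparing coefficient magnitudes is only meaningful when the features are on similar scales, which is one reason to rescale the data before running this kind of selection.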

## 2.4 principal component analysis

```python
"""Principal component analysis"""
from pandas import read_csv
from sklearn.decomposition import PCA

# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input data and output results
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

# Feature selection
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
```

Note: principal component analysis (PCA) uses linear algebra to transform and compress data, and is commonly described as a dimensionality reduction technique. It is unsupervised, since it does not use the output labels. Common dimensionality reduction methods include PCA and linear discriminant analysis (LDA); LDA is also a classification model.
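To see the compression at work, the sketch below builds a synthetic three-column array in which the second column is almost a multiple of the first, then projects it onto two principal components. The data are made up for illustration; `PCA`, `fit_transform`, and `explained_variance_ratio_` are the scikit-learn names used in the section's own code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: column 2 is nearly 2 * column 1, column 3 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    2 * base + 0.01 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)  # project 3 columns down to 2 components

print(reduced.shape)
print(pca.explained_variance_ratio_)  # share of total variance kept by each component
```

Because two of the three columns carry nearly the same information, two components should retain almost all of the variance in this example.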

## 2.5 feature importance

```python
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier

# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Divide the data into input data and output results
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

# Feature selection
model = ExtraTreesClassifier()
fit = model.fit(X, Y)
print(fit.feature_importances_)
```

Note: bagged decision trees, random forests, and extremely randomized trees (extra trees) can all be used to estimate the importance of each feature. All three are tree-based ensemble methods, usually grouped with the bagging family.
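The importance scores are easier to read once they are turned into a ranking. The sketch below does this on synthetic data from `make_classification` instead of the Pima dataset; the variable names are hypothetical, and fixing `random_state` is an assumption made so that repeated runs give the same scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data; with shuffle=False the informative columns come first.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = model.feature_importances_

# Rank feature indices from most to least important.
ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)

print(importances)  # non-negative scores that sum to 1
print("Features ranked by importance:", ranked)
```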

# references

[1] Wei Zhenyuan. Machine learning: Python practice [M]. Beijing: Electronic Industry Press, 2018.

[2] scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation