# Introduction to Feature Engineering

Posted by JoWiGo on Wed, 22 Apr 2020 18:37:01 +0200

Author: Lin Zelong

## 1 What is feature engineering?

Excellent models often depend on excellent features, and producing those features is the job of feature engineering. The purpose of feature engineering is to extract as much useful information as possible from raw data for use by algorithms and models. In practice, most of the work lies in feature processing. The sections below introduce several classical and effective feature engineering methods.

The sklearn library needs to be installed before you follow along. It provides a fairly complete set of feature processing tools, including data preprocessing, feature selection, and dimension reduction. This article uses the IRIS dataset that ships with sklearn to illustrate the feature processing functions. The IRIS dataset, organized by Fisher in 1936, contains four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), all of which are positive floating-point numbers in centimeters. The target values are Iris Setosa, Iris Versicolour, and Iris Virginica.

The code to import the IRIS dataset is as follows:

from sklearn.datasets import load_iris

#Import IRIS Dataset
iris = load_iris()

#Characteristic Matrix
iris.data

#Target Vector
iris.target


## 2 Data preprocessing

Feature extraction gives us unprocessed features, which may have the following problems:

1. Not on the same scale: the features have different units and magnitudes and cannot be compared directly. Unifying the dimensions (nondimensionalization) solves this problem.
2. Information redundancy: for some quantitative features, the useful information is only which interval the value falls into. For academic performance, for example, if only "pass" or "fail" matters, the numeric score can be converted to 1 and 0. Binarization solves this problem.
3. Missing values: missing values need to be filled in.

### 2.1 Uniform Dimension

Unifying dimensions means converting data of different specifications to the same specification. Standardization and interval scaling are the common methods. Standardization assumes that the feature values follow a normal distribution; after standardization they follow the standard normal distribution. Interval scaling uses the boundary values of a feature to scale its value range to a fixed interval such as [0, 1].

#### 2.1.1 Standardization

Standardization requires the mean $\bar{X}$ and the standard deviation $S$ of each feature. The formula is as follows:

$$x^{\prime}=\frac{x-\bar{X}}{S}$$

The code is as follows:

from sklearn.preprocessing import StandardScaler

#Standardize; the return value is the standardized data
StandardScaler().fit_transform(iris.data)
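
To make the formula concrete, the sketch below applies it by hand with NumPy and compares the result with `StandardScaler`. It assumes the population standard deviation (`ddof=0`), since that is what `StandardScaler` computes; the variable names are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual z-score, column by column. ddof=0 (population standard deviation)
# is chosen here because it matches what StandardScaler computes.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_scaled = StandardScaler().fit_transform(X)

# Both approaches should agree up to floating-point error.
print(np.allclose(X_manual, X_scaled))
# After scaling, every column has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```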


#### 2.1.2 Interval Scaling

The idea of interval scaling is to rescale all values proportionally into a fixed interval. There are many ways to do this; a common one uses the two extreme values, the minimum and the maximum:

$$x^{\prime}=\frac{x-\text{Min}}{\text{Max}-\text{Min}}$$

The code is as follows:

from sklearn.preprocessing import MinMaxScaler

#Interval scaling; the return value is the data scaled to the [0, 1] interval
MinMaxScaler().fit_transform(iris.data)
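
As a quick check of the formula, the following sketch computes the min-max scaling per column by hand and compares it with `MinMaxScaler` using its default target range of [0, 1]. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Manual interval scaling: subtract the column minimum,
# then divide by the column range (Max - Min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

X_scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))
# Every column now spans exactly [0, 1].
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```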


#### 2.1.3 Normalization

Simply put, standardization processes data along the columns of the feature matrix: computing the z-score converts the values of each feature to the same scale. Normalization processes data along the rows of the feature matrix. Its purpose is to give sample vectors a uniform standard when similarity is computed with a dot product or another kernel function, that is, every sample is converted to a "unit vector". The normalization formula with the L2 norm is as follows:

$$x^{\prime}=\frac{x}{\sqrt{\sum_{j=1}^{m} x[j]^{2}}}$$

The code is as follows:

from sklearn.preprocessing import Normalizer

#Normalize; the return value is the normalized data
Normalizer().fit_transform(iris.data)
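
The row-wise nature of normalization can be seen in a small sketch: each sample is divided by its own L2 norm, so every row ends up with length 1. This assumes the default `norm="l2"` of `Normalizer`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer

X = load_iris().data

# Manual L2 normalization: divide each row by the square root
# of the sum of its squared entries.
X_manual = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

X_norm = Normalizer(norm="l2").fit_transform(X)

print(np.allclose(X_manual, X_norm))
# Each sample is now a unit vector.
row_norms = np.linalg.norm(X_norm, axis=1)
print(row_norms[:5])
```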


### 2.2 Binarization

The core of binarization is to set a threshold: values greater than the threshold become 1, and values less than or equal to the threshold become 0. The formula is as follows:

$$x^{\prime}=\left\{\begin{array}{ll}1 & x>\text{threshold} \\ 0 & x \leqslant \text{threshold}\end{array}\right.$$

The code is as follows:

from sklearn.preprocessing import Binarizer

#Binarize with the threshold set to 3; the return value is the binarized data
Binarizer(threshold=3).fit_transform(iris.data)
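
The piecewise rule can be reproduced with a single NumPy comparison. The sketch below is only an illustration of the rule, with the same threshold of 3 as above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Binarizer

X = load_iris().data

# Manual binarization: 1 where the value is strictly greater
# than the threshold, 0 otherwise.
X_manual = (X > 3).astype(float)

X_bin = Binarizer(threshold=3).fit_transform(X)

print(np.array_equal(X_manual, X_bin))
# Only the values 0 and 1 remain.
print(np.unique(X_bin))
```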


### 2.3 Missing Value Calculation

The IRIS dataset has no missing values, but real-world data often does. A common approach is to fill the gaps from adjacent data, among other methods. You can also fill in missing values directly with the SimpleImputer class of the impute library.

The code is as follows:

from numpy import vstack, array, nan
from sklearn.impute import SimpleImputer

#Missing value calculation; the return value is the data after the missing values are filled in
#The parameter missing_values is the representation of a missing value and defaults to NaN
#The parameter strategy is the filling strategy and defaults to "mean"
#A sample consisting entirely of NaN is prepended to IRIS to demonstrate the imputation
Imp=SimpleImputer().fit_transform(vstack((array([nan, nan, nan, nan]), iris.data)))
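
To see what the default `mean` strategy does, the sketch below prepends an all-NaN sample to the data, as in the code above, and checks that the imputed row equals the column means of the observed samples. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

X = load_iris().data

# Prepend one sample whose four features are all missing.
X_missing = np.vstack((np.full(4, np.nan), X))

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(X_missing)

# The formerly missing row is filled with the per-column means
# of the 150 observed samples.
print(imputed[0])
print(X.mean(axis=0))
```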


## 3 Feature Selection

When data preprocessing is complete, we need to select meaningful features to feed into the machine learning algorithm or model for training. Generally speaking, features are chosen from two angles:

1. Whether a feature diverges: if a feature does not diverge, for example its variance is close to 0, the samples barely differ on this feature, so it is of little use for distinguishing between samples.
2. Relevance of the feature to the target: features that are highly correlated with the target should be preferred. Apart from the variance method, all the methods described in this article are based on relevance.

Based on the form of feature selection, the feature selection methods can be divided into three types: Filter, Wrapper and Embedded.

### 3.1 Filter Method

The filter method scores each feature according to its divergence or correlation, then selects features by setting a threshold or the number of features to keep.

#### 3.1.1 Variance Selection Method

With the variance selection method, the variance of each feature is calculated first, and then the features whose variance is greater than the threshold are selected. The code to select features using the VarianceThreshold class of the feature_selection library is as follows:

from sklearn.feature_selection import VarianceThreshold

#Variance selection; the return value is the data after feature selection
#Parameter threshold is the threshold of variance
VarianceThreshold(threshold=3).fit_transform(iris.data)
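
It can help to look at the variances themselves before picking a threshold. The sketch below fits the selector, prints the per-feature variances it computed, and shows which features pass the threshold of 3 used above. On IRIS only the petal length exceeds that threshold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()
X = iris.data

selector = VarianceThreshold(threshold=3).fit(X)

# Per-feature variances computed during fit.
for name, var in zip(iris.feature_names, selector.variances_):
    print(f"{name}: {var:.4f}")

# Boolean mask of the features whose variance exceeds the threshold.
mask = selector.get_support()
print(mask)
X_selected = selector.transform(X)
print(X_selected.shape)
```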


#### 3.1.2 Correlation Coefficient Method

With the correlation coefficient method, the correlation coefficient of each feature with the target value and the P value of that coefficient are calculated. The code for selecting features using the SelectKBest class of the feature_selection library together with the correlation coefficient is as follows:

from numpy import array
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr

#Select the K best features and return the data after feature selection
#The first parameter is the function that scores the features: it takes the feature matrix and the target vector
#and here returns an array whose item i is the Pearson correlation coefficient of feature i with the target
#(the P values computed by pearsonr are discarded by the trailing [0])
#Parameter k is the number of selected features
Pea=SelectKBest(lambda X, Y: array(list(map(lambda x:pearsonr(x, Y), X.T))).T[0], k=2).fit_transform(iris.data, iris.target)
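
The one-line lambda above is compact but hard to read. A more explicit variant might look like the sketch below. Two things here are assumptions rather than part of the original code: the helper name `abs_pearson` is hypothetical, and it scores by the absolute value of the coefficient so that strongly negative correlations also rank highly. Treating the class labels 0, 1, 2 as a numeric target is a simplification inherited from the original example.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest


def abs_pearson(X, y):
    """Hypothetical helper: score each column by |Pearson r| with y."""
    results = [pearsonr(column, y) for column in X.T]
    scores = np.array([abs(r[0]) for r in results])
    pvalues = np.array([r[1] for r in results])
    # SelectKBest accepts a (scores, pvalues) pair from the score function.
    return scores, pvalues


iris = load_iris()
selector = SelectKBest(abs_pearson, k=2).fit(iris.data, iris.target)

print(selector.scores_)
print(selector.get_support())
X_selected = selector.transform(iris.data)
```

With `k=2`, the two petal measurements are kept, since they have the strongest linear relationship with the label.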


#### 3.1.3 Chi-Square Test

The classical chi-square test checks the correlation between a qualitative independent variable and a qualitative dependent variable. Assume the independent variable takes N values and the dependent variable takes M values. The statistic is built from the difference between the observed frequency A and the expected frequency E of the samples whose independent variable equals i and whose dependent variable equals j:

$$\chi^{2}=\sum \frac{(A-E)^{2}}{E}$$

The code is as follows:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

#Select the K best features and return the data after selecting the features
SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)
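
To see which features the test favors, the scores can be inspected directly. Note that sklearn's `chi2` expects non-negative feature values and treats them like frequencies, so applying it to the continuous IRIS measurements, as this sketch does, is a pragmatic shortcut rather than a textbook contingency-table test.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()

# chi2 returns one statistic and one P value per feature.
scores, pvalues = chi2(iris.data, iris.target)
for name, score, p in zip(iris.feature_names, scores, pvalues):
    print(f"{name}: chi2={score:.2f}, p={p:.4g}")

selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)
# Mask of the two features with the highest chi-square statistic.
print(selector.get_support())
```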


#### 3.1.4 Mutual Information Method

Classical mutual information also evaluates the correlation between a qualitative independent variable and a qualitative dependent variable. It is calculated as follows:

$$I(X ; Y)=\sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}$$

To handle quantitative data, the maximal information coefficient (MIC) method was proposed. The code for selecting features using the SelectKBest class of the feature_selection library together with the maximal information coefficient is as follows:

from numpy import array
from sklearn.feature_selection import SelectKBest
from minepy import MINE

#MINE does not have a functional interface, so define a mic function that returns a tuple whose second item is a fixed P value of 0.5
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)

#Select the K best features and return the data after feature selection
MIN=SelectKBest(lambda X, Y: array(list(map(lambda x:mic(x, Y), X.T))).T[0], k=2).fit_transform(iris.data, iris.target)
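
If minepy is not available, sklearn's own `mutual_info_classif` can serve as a stand-in. This is a different estimator from MIC (it is a nearest-neighbor estimate of mutual information, not the maximal information coefficient), so the sketch below is an alternative, not an equivalent. `random_state=0` is fixed only to make the estimate repeatable.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

iris = load_iris()

# Bind random_state so the score function keeps the (X, y) signature
# that SelectKBest expects.
score_func = partial(mutual_info_classif, random_state=0)

selector = SelectKBest(score_func, k=2).fit(iris.data, iris.target)

# Estimated mutual information between each feature and the label.
print(selector.scores_)
print(selector.get_support())
```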


### 3.2 Wrapper Method

The wrapper method selects or excludes several features at a time according to an objective function (usually a prediction performance score).

#### 3.2.1 Recursive Feature Elimination Method

The recursive feature elimination method trains a base model over multiple rounds. After each round, the features with the smallest weight coefficients are eliminated, and the next round is trained on the remaining features. The code to select features using the RFE class of the feature_selection library is as follows:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

#Recursive Feature Elimination, returning data after feature selection
#Parameter estimator is the base model
#Parameter n_features_to_select is the number of features selected
RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(iris.data, iris.target)
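
The fitted `RFE` object records the order in which features were eliminated. The sketch below prints that information; `max_iter=1000` is an assumption added here only so that the logistic regression converges cleanly on the unscaled data.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()

rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=2,
).fit(iris.data, iris.target)

# support_ marks the surviving features; ranking_ is 1 for survivors,
# and larger numbers mean the feature was eliminated earlier.
for name, kept, rank in zip(iris.feature_names, rfe.support_, rfe.ranking_):
    print(f"{name}: kept={kept}, rank={rank}")

X_selected = rfe.transform(iris.data)
print(X_selected.shape)
```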


### 3.3 Embedded Method

The embedded method first trains a machine learning algorithm or model to obtain a weight coefficient for each feature, then selects features in descending order of those coefficients. It is similar to the filter method, except that the quality of the features is determined through training.

#### 3.3.1 Feature Selection Method Based on Penalty Items

With a base model that carries a penalty term, features are filtered out and the dimension is reduced at the same time. The code for selecting features using the SelectFromModel class of the feature_selection library together with a logistic regression model with an L1 penalty is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

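#Feature selection with logistic regression with an L1 penalty as the base model
#The default lbfgs solver does not support the L1 penalty, so the saga solver is specified here
SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000)).fit_transform(iris.data, iris.target)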
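
The effect of the L1 penalty is easiest to see in the fitted coefficients: features whose coefficients are driven to zero for every class are the ones that get dropped. The sketch below assumes the `saga` solver and `max_iter=5000`, which are choices made here so that the L1 penalty is supported and the fit converges; they are not taken from the original article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# saga is one of the solvers that supports the L1 penalty.
model = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000)
selector = SelectFromModel(model).fit(iris.data, iris.target)

# Summed absolute coefficient of each feature across the classes.
importance = np.abs(selector.estimator_.coef_).sum(axis=0)
print(importance)

# Mask of the features that SelectFromModel keeps.
print(selector.get_support())
X_selected = selector.transform(iris.data)
print(X_selected.shape)
```

A smaller `C` means a stronger penalty, which tends to zero out more coefficients and keep fewer features.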


#### 3.3.2 Feature Selection Method Based on Tree Model

GBDT, one of the tree models, can also be used as a base model for feature selection. The code for selecting features using the SelectFromModel class of the feature_selection library together with the GBDT model is as follows:

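from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier

#Feature selection with GBDT as the base model
SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)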
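
For tree models, `SelectFromModel` relies on the `feature_importances_` attribute instead of coefficients, and by default it keeps the features whose importance is at least the mean importance. The sketch below shows those importances; `random_state=0` is an assumption added here only for repeatability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

gbdt = GradientBoostingClassifier(random_state=0)
selector = SelectFromModel(gbdt).fit(iris.data, iris.target)

# Normalized importances from the fitted GBDT (they sum to 1).
importances = selector.estimator_.feature_importances_
for name, imp in zip(iris.feature_names, importances):
    print(f"{name}: {imp:.4f}")

# Features at or above the mean importance are kept.
print(selector.get_support())
```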


## 4 Feature Dimension Reduction

Once feature selection is complete, the model can be trained directly. However, the feature matrix may still be too large, which leads to heavy computation and long training times, so it is often necessary to reduce its dimension. In addition to the L1 penalty-based model mentioned above, the common dimension reduction methods are principal component analysis (PCA) and linear discriminant analysis (LDA); LDA is itself also a classification model. To learn more, you can study Mr. Li Hongyi's course as organized on the MO platform.

### 4.1 Principal Component Analysis (PCA)

The code to reduce dimensions using the PCA class of the decomposition library is as follows:

from sklearn.decomposition import PCA

#Principal Component Analysis returns reduced dimension data
#The parameter n_components is the number of principal components
PCA(n_components=2).fit_transform(iris.data)
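
A useful check after PCA is how much of the original variance the retained components explain. The sketch below fits the same two-component PCA and prints the explained variance ratio of each component.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)

# Shape goes from (150, 4) to (150, 2).
print(X_reduced.shape)

# Fraction of the total variance captured by each principal component.
ratios = pca.explained_variance_ratio_
print(ratios)
print(ratios.sum())
```

On IRIS, the first two components retain well over 95% of the variance, which is why a two-dimensional projection is usually considered adequate for this dataset.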


### 4.2 Linear Discriminant Analysis (LDA)

The code to reduce dimensions using the LinearDiscriminantAnalysis class of the discriminant_analysis library is as follows:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

#Linear Discriminant Analysis returns reduced-dimension data
#The parameter n_components is the reduced dimension
Lda=LDA(n_components=2).fit_transform(iris.data, iris.target)
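
Unlike PCA, LDA is supervised: it uses the labels to find directions that separate the classes, and it can produce at most one fewer component than there are classes (two for IRIS). The sketch below shows the reduced shape and, assuming the default `svd` solver, the share of between-class variance explained by each discriminant.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

iris = load_iris()

lda = LDA(n_components=2).fit(iris.data, iris.target)
X_lda = lda.transform(iris.data)

# Shape goes from (150, 4) to (150, 2).
print(X_lda.shape)

# Share of the between-class variance captured by each discriminant.
lda_ratios = lda.explained_variance_ratio_
print(lda_ratios)

# Because LDA is also a classifier, the same fitted object can predict.
print(lda.score(iris.data, iris.target))
```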