[Kaggle Learning Notes] | Intermediate Machine Learning

Posted by landavia on Wed, 17 Jun 2020 05:04:27 +0200

Building on the Intro to Machine Learning course, these notes cover how to handle missing values, categorical (non-numeric) data, and data leakage, as well as pipelines, cross-validation, and XGBoost.

Missing value handling

  1. Drop columns that contain missing values.
  2. Imputation: replace each missing value with the column mean (using the SimpleImputer class from sklearn.impute).
  3. An extension to imputation: impute as in method 2, and additionally add a new boolean column for each imputed column recording which entries were originally missing.
# Find column names containing missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Method 1: Remove data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# Method 2: Impute missing values with the column mean
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed the column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# Method 3: Impute, and also record which values were originally missing
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Add a boolean column marking which entries will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
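
To choose between the three approaches, each version of the data can be scored with the same model. Below is a minimal sketch of such a comparison, assuming a RandomForestRegressor and mean absolute error as used later in these notes; the score_dataset helper is illustrative rather than part of the original code.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical helper: fit a random forest on one version of the data and return its validation MAE
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

print("MAE (drop columns):      ", score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
print("MAE (imputation):        ", score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
print("MAE (imputation + flags):", score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))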

Categorical data handling

For example, a categorical feature named Date might take seven values, Monday through Sunday. There are three common ways to handle it:

  1. Drop: remove the categorical column entirely
  2. Label encoding: map the values to the integers 1 to 7 in order (the drawback is that the integers imply an ordering and relative size, even though the days are really just equal, unordered categories)
  3. One-hot encoding: each value becomes a vector of seven 0/1 entries, e.g. Tuesday is (0,1,0,0,0,0,0) and Wednesday is (0,0,1,0,0,0,0)
# Find column names containing categorical data
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

# Method 1: Discard columns containing categorical data directly
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

# Method 2: Label encoding
from sklearn.preprocessing import LabelEncoder

# Copy the training and validation sets before encoding, so the original data is not modified
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply a label encoder to each categorical column (note: transform() on the validation
# set will fail if it contains categories that never appear in the training set)
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

# Method 3: One-hot encoding
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the columns with categorical data.
# handle_unknown='ignore' avoids errors when the validation data contains categories that
# never appeared in the training data (e.g. the training set's color column only has red,
# but the validation set also has blue, for which no encoding was learned).
# sparse=False returns a dense numpy array instead of a sparse matrix.
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed the index; put it back (the rows were shuffled when the data was split into training and validation sets)
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove the original categorical columns (they will be replaced by their one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Concatenate the numerical columns with the one-hot encoded categorical columns
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
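
One caveat worth noting: pandas gives the one-hot columns integer names, and newer scikit-learn versions refuse to fit a model on a DataFrame whose column names mix integers and strings. A small fix, applied to the frames built above:

# Ensure all column names are strings so scikit-learn accepts the DataFrame
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)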

Using pipelines

# Step 1: Define a data preprocessing method
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: impute with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle both preprocessing steps into a ColumnTransformer
# (numerical_cols and categorical_cols are lists of the corresponding column names)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Step 2: Define the model (n_estimators is the number of trees in the forest, random_state is the random seed)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Step 3: Create and analyze a Pipeline
from sklearn.metrics import mean_absolute_error

# Bundle the preprocessing and the model into a Pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Fit the pipeline on the training data
my_pipeline.fit(X_train, y_train)

# Predict on the validation set (the pipeline applies the same preprocessing automatically)
preds = my_pipeline.predict(X_valid)

# Compute the mean absolute error to evaluate model performance
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

# Write the test-set predictions to a CSV file for submission
preds_test = my_pipeline.predict(X_test)
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

Cross-validation

Cross-validation is a way of splitting the dataset so that a single lucky or unlucky split does not distort the evaluation. The data is divided into k folds; in each round one fold is held out for validation and the remaining k-1 folds are used to train the model. This yields k mean absolute errors, and their average is used as the evaluation result.

from sklearn.model_selection import cross_val_score

# Multiply by -1 because sklearn's scoring convention returns the negative MAE (higher is better)
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,  # number of folds (k)
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

XGBoost

XGBoost is an ensemble model based on gradient boosting, unlike the random forests used above.

from xgboost import XGBRegressor

# n_estimators is the number of boosting rounds (trees), learning_rate shrinks each tree's
# contribution, n_jobs is the number of parallel cores (little benefit on small datasets)
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,               # training set
             early_stopping_rounds=5,        # stop if the validation score does not improve for 5 rounds
             eval_set=[(X_valid, y_valid)],  # validation set used for early stopping
             verbose=False)                  # whether to print evaluation results (Boolean, or an integer meaning print every n rounds)
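
As a quick usage check, the fitted model can be evaluated on the validation set with the same MAE metric as before:

from sklearn.metrics import mean_absolute_error

# Predict on the validation set and report the boosted model's MAE
predictions = my_model.predict(X_valid)
print("MAE:", mean_absolute_error(y_valid, predictions))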

Data leakage

Data leakage comes in two main forms: target leakage and train-test contamination.

Target leakage: a feature whose value is determined after (or because of) the target, so it will not be available in the same form when real predictions are made. For example, when predicting whether a patient will get pneumonia, a "took medication" feature changes its value only after the patient has fallen ill and been treated, so it effectively reveals the label we are trying to predict; the model looks accurate in validation but fails on new patients.

Train-test contamination: leakage that can occur around the split into training and validation/test sets, for example when data preprocessing (imputing missing values, encoding categorical data) is fitted on the full dataset before the split. Validation results then look very good, but the model may perform poorly on genuinely new data, which was never preprocessed together with the training data.
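
As a sketch of why the order matters (the variable names here are illustrative): fitting the imputer on the full dataset lets validation statistics leak into training, whereas fitting it on the training fold only, or wrapping preprocessing in a Pipeline as in the earlier section, keeps the split clean.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Leaky: the imputer's means are computed from ALL rows, including future validation rows
imputer = SimpleImputer()
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
X_train_bad, X_valid_bad, y_train, y_valid = train_test_split(X_imputed, y, random_state=0)

# Clean: split first, then fit the imputer on the training fold only
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
imputer = SimpleImputer()
X_train_clean = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_valid_clean = pd.DataFrame(imputer.transform(X_valid), columns=X_valid.columns)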

Therefore, while preparing data, check whether leakage can occur and, if necessary, drop the suspect features. The validation score may drop slightly, but the model will generalize better to real data.
