Building on the Intro to Machine Learning course, these notes explain how to deal with missing values, non-numeric (categorical) data, and data leakage.
Missing value handling
- Drop the columns that contain missing values
- Impute missing values with the column mean (using SimpleImputer from sklearn.impute)
- Impute as above, but also add a new boolean column for each affected column that records which entries were originally missing
import pandas as pd

# Find the columns that contain missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Method 1: Drop the columns with missing values
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# Method 2: Impute missing values with the column mean
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed the column names, so put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# Method 3: Impute, and add a column marking which entries were missing
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
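To choose between these approaches, each preprocessed train/validation pair can be scored with the same model. A minimal sketch, assuming a random forest scored by mean absolute error and the targets y_train and y_valid from the same split:

# Helper that fits a random forest and returns the validation MAE
# (the model choice and n_estimators=100 are assumptions for illustration)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

print("MAE (drop columns):", score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
print("MAE (imputation):", score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
print("MAE (imputation + indicator):", score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))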
Categorical data handling
For example, suppose an attribute named Date takes seven values, Monday through Sunday. There are three ways to handle it:
- Drop: remove the column entirely
- Label encoding: map the values to the numbers 1 through 7 in order (the drawback is that the encoded numbers imply an ordering and magnitude, while the days are actually equal categories)
- One-hot encoding: each value becomes a vector of seven entries, e.g. Tuesday is (0,1,0,0,0,0,0) and Wednesday is (0,0,1,0,0,0,0)
# Find the columns that contain categorical data
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

# Method 1: Drop the columns containing categorical data
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

# Method 2: Label encoding
from sklearn.preprocessing import LabelEncoder
# Copy the original training and validation sets before encoding, to avoid modifying the original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply the LabelEncoder to each categorical column
label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

# Method 3: One-hot encoding
from sklearn.preprocessing import OneHotEncoder
# handle_unknown='ignore' avoids errors when the validation set contains categories
# not seen in the training set (e.g. the training set's color column only has red,
# but the validation set has red and blue, and blue has no corresponding code);
# sparse=False returns a numpy array instead of a sparse matrix
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed the row index, so put it back
# (rows were previously shuffled into the training and validation sets)
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove the original categorical columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Concatenate the remaining numerical columns with the one-hot encoded columns
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
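One-hot encoding can create a very large number of columns when a categorical column has many distinct values. A common refinement (a sketch; the threshold of 10 is an arbitrary choice, not part of the code above) is to one-hot encode only the low-cardinality columns:

# Columns with relatively few unique values are suitable for one-hot encoding
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
# High-cardinality columns can be dropped or handled with another encoding
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))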
Using pipelines
# Step 1: Define the data preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# numerical_cols and categorical_cols are the lists of numerical and
# categorical column names selected from X_train beforehand

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle the preprocessing steps in a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Step 2: Define the model (n_estimators is the number of trees, random_state is the random seed)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Step 3: Create and evaluate a Pipeline
from sklearn.metrics import mean_absolute_error
# Bundle the preprocessing and the model in a Pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                              ])
# Fit the training data
my_pipeline.fit(X_train, y_train)
# Predict on the validation set
preds = my_pipeline.predict(X_valid)
# Compute the mean absolute error to evaluate model performance
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
# Write the test-set predictions to a csv file
preds_test = my_pipeline.predict(X_test)
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
Cross-validation
Cross-validation is a way of partitioning the dataset that reduces the chance effects of a single split. The data is divided into k folds; in turn, each fold is used as the validation set while the remaining k-1 folds are used to train the model. This gives k mean absolute errors, and their average is used as the evaluation result.
from sklearn.model_selection import cross_val_score

# Multiply by -1 because sklearn returns the MAE as a negative score
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,  # k, the number of folds in cross-validation
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
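Since the evaluation result is the mean of the k fold scores, it can be reported directly from the array returned above:

# Average the k fold scores into a single evaluation result
print("Average MAE score (across folds):", scores.mean())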
XGBoost
XGBoost is an ensemble model based on gradient boosting, in contrast to the random forest used above.
from xgboost import XGBRegressor

# n_estimators is the number of boosting rounds, learning_rate is the learning rate,
# n_jobs is the number of parallel cores (not useful for small datasets)
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,               # training set
             early_stopping_rounds=5,        # early stopping
             eval_set=[(X_valid, y_valid)],  # validation set
             verbose=False)                  # whether to print evaluation results
                                             # (boolean, or an integer meaning "print every N rounds")
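Once fitted, the model is used like any other regressor; a short usage sketch, assuming the same validation split as above:

from sklearn.metrics import mean_absolute_error

# Predict on the validation set and evaluate with mean absolute error
predictions = my_model.predict(X_valid)
print("MAE:", mean_absolute_error(y_valid, predictions))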
Data leakage
Data leakage comes in two main forms: target leakage and train-test contamination.
Target leakage: a feature is updated after (or because of) the value of the target. For example, when predicting pneumonia, a feature recording whether the patient took antibiotic medicine is largely determined by whether they already had pneumonia, since patients usually take the medicine after falling ill; the feature effectively reveals the label we are trying to predict.
Train-test contamination: leakage introduced when splitting the dataset into training and validation/test sets, for example by running preprocessing (imputing missing values, encoding categorical data) before the split. Validation results can then look very good, yet the model may perform poorly on real data, because real data was not preprocessed together with the training set.
Therefore, during data processing, watch for ways leakage could occur and consider remedies such as dropping the suspect feature; the measured accuracy may drop slightly, but the model will generalize better.
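As a minimal sketch of avoiding train-test contamination (assuming a feature table X and target y), split the data first and fit the preprocessing on the training part only:

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Split first, then fit preprocessing on the training data only
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train)  # fit and transform on the training set
X_valid_imputed = imputer.transform(X_valid)      # transform only; no refitting on validation data
# Wrapping preprocessing and model in a Pipeline (as above) achieves the same effect
# automatically, including inside cross_val_score.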