Recording the details of a Kaggle competition: time series splitting, feature construction, and missing value handling.
Preface
This time I took part in a credit card fraud detection competition on Kaggle. I want to record and share some of the details of the competition before they are forgotten.
Competition Address: https://www.kaggle.com/c/ieee-fraud-detection
Data processing
A brief description of the data in this competition:
1. Around 400 anonymized features.
2. Some features have large numbers of missing values.
3. The distribution of missing values differs considerably between the training and test data.
4. A feature suspected to be a time increment (to be confirmed).
5. Two kinds of anonymized data are given: one table with card identity information, the other with card transaction information.
```python
import gc
import pandas as pd

# Read data
train_id = pd.read_csv('data1/train_identity.csv')
test_id = pd.read_csv('data1/test_identity.csv')
train_data = pd.read_csv('data1/train_transaction.csv')
test_data = pd.read_csv('data1/test_transaction.csv')

# Merge the two kinds of data and free memory as we go
train = train_data.merge(train_id, how='left', on='TransactionID')
test = test_data.merge(test_id, how='left', on='TransactionID')
del train_id; gc.collect()
del test_id; gc.collect()
del train_data; gc.collect()
del test_data; gc.collect()

train.describe()
train.isnull().sum()
```
```
TransactionID          0
isFraud                0
TransactionDT          0
TransactionAmt         0
ProductCD              0
card1                  0
card2               8933
card3               1565
card4               1577
card5               4259
card6               1571
addr1              65706
addr2              65706
dist1             352271
dist2             552913
P_emaildomain      94456
R_emaildomain     453249
C1                     0
C2                     0
C3                     0
C4                     0
C5                     0
C6                     0
C7                     0
C8                     0
C9                     0
C10                    0
C11                    0
C12                    0
C13                    0
...
id_11             449562
id_12             446307
id_13             463220
id_14             510496
id_15             449555
id_16             461200
id_17             451171
id_18             545427
id_19             451222
id_20             451279
id_21             585381
id_22             585371
id_23             585371
id_24             585793
id_25             585408
id_26             585377
id_27             585371
id_28             449562
id_29             449562
id_30             512975
id_31             450258
id_32             512954
id_33             517251
id_34             512735
id_35             449555
id_36             449555
id_37             449555
id_38             449555
DeviceType        449730
DeviceInfo        471874
Length: 434, dtype: int64
```
Comparison of missing values between train and test:
```python
# Fraction of missing values per column in train and test
training_missing = train.isna().sum(axis=0) / train.shape[0]
test_missing = test.isna().sum(axis=0) / test.shape[0]
change = (training_missing / test_missing).sort_values(ascending=False)
change = change[change < 1e6]  # remove the divide-by-zero entries

# Columns missing far more often in train than in test, and vice versa
train_more = change[change > 4].reset_index()
train_more.columns = ['train_more_id', 'rate']
test_more = change[change < 0.4].reset_index()
test_more.columns = ['test_more_id', 'rate']
train_more_id = train_more['train_more_id'].values
train[train_more_id].head()
```
Taking V80 as an example, with missing values filled in as -999 for plotting, it is obvious that train has far more missing values than the part of test with TransactionDT > 2.5e7.
```python
import matplotlib.pyplot as plt

fig, axs = plt.subplots(ncols=2)
train_vals = train["V80"].fillna(-999)
test_vals = test[test["TransactionDT"] > 2.5e7]["V80"].fillna(-999)  # values following the shift
# note: `normed=True` was removed in newer matplotlib; `density=True` is the equivalent
axs[0].hist(train_vals, alpha=0.5, density=True, bins=25)
axs[1].hist(test_vals, alpha=0.5, density=True, bins=25)
fig.set_size_inches(7, 3)
plt.tight_layout()
```
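To put a number on such train/test distribution differences, one option (a sketch I'm adding, not from the original post) is a two-sample Kolmogorov-Smirnov test per column via `scipy.stats.ks_2samp`:

```python
from scipy.stats import ks_2samp

# Rough check of train/test covariate shift on a single column (sketch).
# A small p-value suggests the two distributions differ.
stat, p_value = ks_2samp(train["V80"].dropna(), test["V80"].dropna())
print(f"KS statistic={stat:.4f}, p-value={p_value:.4g}")
```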
Next, we make a simple analysis of the data.
- First, it can be observed in the data that many fraudulent transactions use protonmail.com as the 'P_emaildomain'. This is worth splitting out as a separate feature, and the email domains as a whole can also be grouped into classes.
- As mentioned earlier, TransactionDT looks like a time increment. Subtracting the minimum from the maximum and dividing by 60*60*24 gives 364, i.e. the data covers roughly one year, with the training set spanning 182 days and the test set the other 182 (a quick sketch of this calculation follows the list).
- Clearly the whole data set is a clean sequential split, with no overlapping timestamps. So when TransactionDT is converted with datetime, only features like hour and weekend help model training; a feature such as month may yield good training scores, but the test months never appear in training, which hurts the accuracy of the model.
- Consequently, TimeSeriesSplit is used to divide training and validation when training the model, which should be a key point of this competition.
- As mentioned earlier, the training and test sets are distributed very differently where TransactionDT is greater than 2.5e7; handling that is also key to a good ranking.
- Looking at the distribution of isFraud by hour, the fraud rate from 1 to 6 o'clock is significantly higher than at other times, so a 'night' feature is constructed (and since the data covers only one year, season and holiday features are not helpful).
- For a single transaction, the number of missing features is also related to isFraud (the more anonymous, the safer?), so an 'isna' count feature is added.
- The class distribution is severely imbalanced. Under-sampling, over-sampling, and SMOTE were all tried, with fairly unremarkable results; under-sampling in particular did worse than I expected. The final choice was 1:5 under-sampling (after all, it runs fast, haha) plus a class penalty weight; the weight is needed because after under-sampling the model otherwise classifies too many 0s as 1.
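A minimal sketch of the TransactionDT span calculation mentioned above (my illustration; the exact cell is not shown in the original post). It must run before TransactionDT is converted to a datetime in the next block:

```python
# TransactionDT appears to be an offset in seconds; 60*60*24 seconds per day.
# Run this before the datetime conversion below.
total_days = (test['TransactionDT'].max() - train['TransactionDT'].min()) / (60 * 60 * 24)
train_days = (train['TransactionDT'].max() - train['TransactionDT'].min()) / (60 * 60 * 24)
test_days = (test['TransactionDT'].max() - test['TransactionDT'].min()) / (60 * 60 * 24)
print(total_days, train_days, test_days)  # roughly 364, 182, 182
```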
```python
# Count of missing features per row, added as the 'isna' feature
train['isna'] = train.isna().sum(axis=1)
test['isna'] = test.isna().sum(axis=1)

import datetime

# Convert the TransactionDT offset (in seconds) to a datetime
START_DATE = '2017-12-01'
startdate = datetime.datetime.strptime(START_DATE, '%Y-%m-%d')
train['TransactionDT'] = train['TransactionDT'].apply(lambda x: startdate + datetime.timedelta(seconds=x))
test['TransactionDT'] = test['TransactionDT'].apply(lambda x: startdate + datetime.timedelta(seconds=x))
train['hour'] = train['TransactionDT'].dt.hour
test['hour'] = test['TransactionDT'].dt.hour
train['month'] = train['TransactionDT'].dt.month
test['month'] = test['TransactionDT'].dt.month
# print(train['TransactionDT'])

# 'night' feature: 1 if the transaction happens before 7:00 or after 19:00
train['night'] = ((train['hour'] < 7) | (train['hour'] > 19)).astype(int)
test['night'] = ((test['hour'] < 7) | (test['hour'] > 19)).astype(int)

# 'isprotonmail' feature: flag purchaser emails from protonmail.com
# (kept as 'protonmail.com' / 0 as in the original; a plain 0/1 flag would also work)
train['isprotonmail'] = train['P_emaildomain'][train['P_emaildomain'] == 'protonmail.com']
test['isprotonmail'] = test['P_emaildomain'][test['P_emaildomain'] == 'protonmail.com']
train.loc[train.isprotonmail.isnull(), 'isprotonmail'] = 0
test.loc[test.isprotonmail.isnull(), 'isprotonmail'] = 0
train['isprotonmail'].describe(include="all")
test['isprotonmail'].describe(include="all")

# Group email domains by provider, and extract the domain suffix
emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 'scranton.edu': 'other',
          'optonline.net': 'other', 'hotmail.co.uk': 'microsoft', 'comcast.net': 'other',
          'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo', 'yahoo.es': 'yahoo', 'charter.net': 'spectrum',
          'live.com': 'microsoft', 'aim.com': 'aol', 'hotmail.de': 'microsoft',
          'centurylink.net': 'centurylink', 'gmail.com': 'google', 'me.com': 'apple',
          'earthlink.net': 'other', 'gmx.de': 'other', 'web.de': 'other', 'cfl.rr.com': 'other',
          'hotmail.com': 'microsoft', 'protonmail.com': 'other', 'hotmail.fr': 'microsoft',
          'windstream.net': 'other', 'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo',
          'yahoo.de': 'yahoo', 'servicios-ta.com': 'other', 'netzero.net': 'other',
          'suddenlink.net': 'other', 'roadrunner.com': 'other', 'sc.rr.com': 'other',
          'live.fr': 'microsoft', 'verizon.net': 'yahoo', 'msn.com': 'microsoft',
          'q.com': 'centurylink', 'prodigy.net.mx': 'att', 'frontier.com': 'yahoo',
          'anonymous.com': 'other', 'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att',
          'frontiernet.net': 'yahoo', 'ymail.com': 'yahoo', 'outlook.com': 'microsoft',
          'mail.com': 'other', 'bellsouth.net': 'other', 'embarqmail.com': 'centurylink',
          'cableone.net': 'other', 'hotmail.es': 'microsoft', 'mac.com': 'apple',
          'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 'yahoo.com': 'yahoo',
          'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other',
          'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}
us_emails = ['gmail', 'net', 'edu']
for c in ['P_emaildomain', 'R_emaildomain']:
    train[c + '_bin'] = train[c].map(emails)
    test[c + '_bin'] = test[c].map(emails)
    train[c + '_suffix'] = train[c].map(lambda x: str(x).split('.')[-1])
    test[c + '_suffix'] = test[c].map(lambda x: str(x).split('.')[-1])
    train[c + '_suffix'] = train[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
    test[c + '_suffix'] = test[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')

# Most-frequent imputer (Imputer belongs to older scikit-learn;
# it was removed in 0.22 in favour of sklearn.impute.SimpleImputer)
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
```
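As a quick sanity check of the 'night' observation above (a sketch I'm adding, not from the original post), one can look at the average fraud rate per hour:

```python
# Average isFraud per hour of day; hours 1-6 should stand out if the
# 'night' observation holds (assumes the 'hour' column created above).
fraud_rate_by_hour = train.groupby('hour')['isFraud'].mean()
print(fraud_rate_by_hour)
fraud_rate_by_hour.plot(kind='bar', title='Fraud rate by hour')
```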
Treatment of missing values
```python
cols = ['dist1', 'dist2', 'C1', 'C2', 'C4', 'C5', 'C6', 'C7',
        'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14']
test[cols] = imp.fit_transform(test[cols])
train[cols] = imp.fit_transform(train[cols])

# Impute the columns that go missing in the shifted region; use .loc so the
# assignment actually writes back (chained indexing would modify a copy).
# Note: this comparison assumes TransactionDT is still the raw seconds offset,
# so the mask should be computed before the datetime conversion above.
mask = test.TransactionDT > 2.5e7
test.loc[mask, train_more_id] = imp.fit_transform(test.loc[mask, train_more_id])
train[train_more_id] = imp.fit_transform(train[train_more_id])
```
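On newer scikit-learn (0.22+), where `Imputer` no longer exists, the same step would use `SimpleImputer`. A minimal sketch under that assumption, reusing the `cols` list from above and fitting on train only so that train and test are filled consistently:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Modern equivalent of the most-frequent Imputer above (sketch)
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
train[cols] = imp.fit_transform(train[cols])
test[cols] = imp.transform(test[cols])  # reuse the statistics fitted on train
```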
This part handles the features with only a few missing values; some features that are heavily missing but important are then filled with a random forest. Only one piece of that code is shown here.
```python
from sklearn.ensemble import RandomForestRegressor

def set_missing_card2(df):
    # Take the existing numeric features and feed them into a RandomForestRegressor
    give_df = df[['card2', 'TransactionID', 'card1', 'TransactionAmt',
                  'C1', 'C2', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9',
                  'C10', 'C11', 'C12', 'C13']]
    # Split rows into those where card2 is known and those where it is missing
    # (other candidates: 'card3', 'card5', 'addr1', 'addr2')
    known_card2 = give_df[give_df.card2.notnull()].values
    unknown_card2 = give_df[give_df.card2.isnull()].values
    # y is the target card2 value
    y = known_card2[:, 0]
    # X holds the remaining feature values
    X = known_card2[:, 1:]
    # Fit the random forest regressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=10, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown card2 values with the fitted model
    predictedcard2 = rfr.predict(unknown_card2[:, 1:])
    # Fill in the missing entries with the predictions
    df.loc[df.card2.isnull(), 'card2'] = predictedcard2
    return df, rfr

train, rfr = set_missing_card2(train)
test, rfr = set_missing_card2(test)
```
Next, the remaining missing feature values are handled and the character-valued (object) features are dictionary-encoded.
```python
from sklearn import preprocessing

# 'labels' is a pre-built dict mapping lowercase category strings to integer
# codes; its definition is omitted in the original post
for c1, c2 in train.dtypes.reset_index().values:
    if c2 == 'O':
        train[c1] = train[c1].map(lambda x: labels[str(x).lower()])
        test[c1] = test[c1].map(lambda x: labels[str(x).lower()])
train.fillna(-999, inplace=True)
test.fillna(-999, inplace=True)

# Iterate over test's columns: train has extra columns (the target to predict),
# which must not be encoded away, so use test's columns directly.
# (After the mapping above, most object columns are already numeric;
# the encoder below catches any that remain.)
for f in test.columns:
    if train[f].dtype == 'object' or test[f].dtype == 'object':
        print(f)
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[f].values) + list(test[f].values))
        train[f] = lbl.transform(list(train[f].values))
        test[f] = lbl.transform(list(test[f].values))
```
The original idea here was to split the data by month and then under-/over-sample within each month, training on four months and validating on two. Later it proved more convenient to split the data directly with TimeSeriesSplit.
```python
columns_list = list(train)

# Split the data set by month
train_12 = train[train.month == 12]
train_1 = train[train.month == 1]
train_2 = train[train.month == 2]
train_3 = train[train.month == 3]
train_4 = train[train.month == 4]
train_5 = train[train.month == 5]
train_6 = train[train.month == 6]

from imblearn.under_sampling import RandomUnderSampler

# return_indices/fit_sample belong to older imbalanced-learn versions;
# newer ones use fit_resample and the sampler's sample_indices_ attribute
ran = RandomUnderSampler(return_indices=True)  # also return indices of dropped rows
train_12, y_train12, dropped12 = ran.fit_sample(train_12, train_12['isFraud'])
train_1, y_train1, dropped1 = ran.fit_sample(train_1, train_1['isFraud'])
train_2, y_train2, dropped2 = ran.fit_sample(train_2, train_2['isFraud'])
train_3, y_train3, dropped3 = ran.fit_sample(train_3, train_3['isFraud'])
train_4, y_train4, dropped4 = ran.fit_sample(train_4, train_4['isFraud'])
train_5, y_train5, dropped5 = ran.fit_sample(train_5, train_5['isFraud'])
train_6, y_train6, dropped6 = ran.fit_sample(train_6, train_6['isFraud'])

del train
del y_train12
del y_train1
del y_train2
del y_train3
del y_train4
del y_train5
del y_train6

'''# Random oversampling
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
train_data, y_train = ros.fit_sample(train, y_train)
del train; gc.collect()'''

'''# SMOTE oversampling
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy=0.1, random_state=1)
train_data, y_train = smote.fit_sample(train, y_train)'''

'''# Random undersampling on the whole training set
from imblearn.under_sampling import RandomUnderSampler
ran = RandomUnderSampler(return_indices=True)  # also return indices of dropped rows
train_data, y_train, dropped = ran.fit_sample(train, y_train)
del train; gc.collect()'''
```
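As a quick illustration of why TimeSeriesSplit suits this data (a sketch I'm adding; the indices are toy values), each fold trains on an expanding window of the past and validates on the block that follows it, so there is no temporal leakage:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy example: 12 time-ordered samples, 3 folds.
X_demo = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, valid_idx in tscv.split(X_demo):
    print("train:", train_idx, "valid:", valid_idx)
# train: [0 1 2]             valid: [3 4 5]
# train: [0 1 2 3 4 5]       valid: [6 7 8]
# train: [0 1 2 3 4 5 6 7 8] valid: [9 10 11]
```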
Model construction and training
This time we did not play the competition with SVM; we used the "artifact" XGBoost family of gradient boosting models (the training helper below supports XGBoost, LightGBM, and CatBoost, and the final run here uses LightGBM). In a couple of days I plan to work through XGBoost by hand and post some tuning notes, so I will not go into detail here. Straight to the code:
```python
import time
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from catboost import CatBoostClassifier


def fast_auc(y_true, y_prob):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_prob)]
    nfalse = 0
    auc = 0
    n = len(y_true)
    for i in range(n):
        y_i = y_true[i]
        nfalse += (1 - y_i)
        auc += y_i * nfalse
    auc /= (nfalse * (n - nfalse))
    return auc


def eval_auc(y_true, y_pred):
    return 'auc', fast_auc(y_true, y_pred), True


def group_mean_log_mae(y_true, y_pred, types, floor=1e-9):
    # regression metric carried over from the original kernel; unused here
    maes = (y_true - y_pred).abs().groupby(types).mean()
    return np.log(maes.map(lambda x: max(x, floor))).mean()


def train_model_classification(X, X_test, y, params, folds, model_type='lgb', eval_metric='auc',
                               columns=None, plot_feature_importance=False, model=None, verbose=10000,
                               early_stopping_rounds=200, n_estimators=50000, splits=None, n_folds=3,
                               averaging='usual', n_jobs=-1):
    """
    A function to train a variety of classification models.
    Returns a dictionary with oof predictions, test predictions, scores and,
    if necessary, feature importances.

    :params: X - training data, can be pd.DataFrame or np.ndarray (after normalizing)
    :params: X_test - test data, can be pd.DataFrame or np.ndarray (after normalizing)
    :params: y - target
    :params: folds - folds to split data
    :params: model_type - type of model to use
    :params: eval_metric - metric to use
    :params: columns - columns to use. If None - use all columns
    :params: plot_feature_importance - whether to plot feature importance of LGB
    :params: model - sklearn model, works only for "sklearn" model type
    """
    columns = X.columns if columns is None else columns
    n_splits = folds.n_splits if splits is None else n_folds
    X_test = X_test[columns]

    # set up scoring parameters
    metrics_dict = {'auc': {'lgb_metric_name': eval_auc,
                            'catboost_metric_name': 'AUC',
                            'sklearn_scoring_function': metrics.roc_auc_score},
                    }

    result_dict = {}
    if averaging == 'usual':
        # out-of-fold predictions on train data
        oof = np.zeros((len(X), 1))
        # averaged predictions on test data
        prediction = np.zeros((len(X_test), 1))
    elif averaging == 'rank':
        oof = np.zeros((len(X), 1))
        prediction = np.zeros((len(X_test), 1))

    # list of scores on folds
    scores = []
    feature_importance = pd.DataFrame()

    # split and train on folds
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X)):
        print(f'Fold {fold_n + 1} started at {time.ctime()}')
        if type(X) == np.ndarray:
            X_train, X_valid = X[columns][train_index], X[columns][valid_index]
            y_train, y_valid = y[train_index], y[valid_index]
        else:
            X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

        if model_type == 'lgb':
            model = lgb.LGBMClassifier(**params, n_estimators=n_estimators, n_jobs=n_jobs)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
                      verbose=verbose, early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)[:, 1]
            y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)[:, 1]

        if model_type == 'xgb':
            train_data = xgb.DMatrix(data=X_train, label=y_train, feature_names=X.columns)
            valid_data = xgb.DMatrix(data=X_valid, label=y_valid, feature_names=X.columns)
            watchlist = [(train_data, 'train'), (valid_data, 'valid_data')]
            model = xgb.train(dtrain=train_data, num_boost_round=n_estimators, evals=watchlist,
                              early_stopping_rounds=early_stopping_rounds, verbose_eval=verbose,
                              params=params)
            y_pred_valid = model.predict(xgb.DMatrix(X_valid, feature_names=X.columns),
                                         ntree_limit=model.best_ntree_limit)
            y_pred = model.predict(xgb.DMatrix(X_test, feature_names=X.columns),
                                   ntree_limit=model.best_ntree_limit)

        if model_type == 'sklearn':
            model = model
            model.fit(X_train, y_train)
            y_pred_valid = model.predict(X_valid).reshape(-1,)
            score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
            print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
            print('')
            y_pred = model.predict_proba(X_test)

        if model_type == 'cat':
            model = CatBoostClassifier(iterations=n_estimators,
                                       eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
                                       **params,
                                       loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[],
                      use_best_model=True, verbose=False)
            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test)

        if averaging == 'usual':
            oof[valid_index] = y_pred_valid.reshape(-1, 1)
            scores.append(metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid))
            prediction += y_pred.reshape(-1, 1)
        elif averaging == 'rank':
            oof[valid_index] = y_pred_valid.reshape(-1, 1)
            scores.append(metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid))
            prediction += pd.Series(y_pred).rank().values.reshape(-1, 1)

        if model_type == 'lgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

    prediction /= n_splits
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

    result_dict['oof'] = oof
    result_dict['prediction'] = prediction
    result_dict['scores'] = scores

    if model_type == 'lgb':
        if plot_feature_importance:
            feature_importance["importance"] /= n_splits
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index
            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]
            plt.figure(figsize=(16, 12))
            sns.barplot(x="importance", y="feature",
                        data=best_features.sort_values(by="importance", ascending=False))
            plt.title('LGB Features (avg over folds)')
            result_dict['feature_importance'] = feature_importance
            result_dict['top_columns'] = cols

    return result_dict
```
Model parameters
TimeSeriesSplit is used here for the split.

```python
from sklearn.model_selection import TimeSeriesSplit

n_fold = 7
folds = TimeSeriesSplit(n_splits=n_fold)
# folds = KFold(n_splits=5)  # plain KFold was also tried; left disabled here
#                            # so the TimeSeriesSplit above takes effect

params = {'num_leaves': 256,
          'min_child_samples': 79,
          'objective': 'binary',
          'max_depth': 13,
          'learning_rate': 0.03,
          "boosting_type": "gbdt",
          "subsample_freq": 3,
          "subsample": 0.9,
          "bagging_seed": 11,
          "metric": 'auc',
          "verbosity": -1,
          'reg_alpha': 0.3,
          'reg_lambda': 0.3,
          'colsample_bytree': 0.9,
          # 'categorical_feature': cat_cols
          }

# X_test and y_train are prepared from the resampled data above (cells not shown)
result_dict_lgb = train_model_classification(X=train, X_test=X_test, y=y_train, params=params,
                                             folds=folds, model_type='lgb', eval_metric='auc',
                                             plot_feature_importance=True, verbose=500,
                                             early_stopping_rounds=200, n_estimators=5000,
                                             averaging='usual', n_jobs=-1)
```
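The class penalty weight mentioned earlier does not appear in these params. In LightGBM it could be expressed with `scale_pos_weight` (or `is_unbalance`); a hedged sketch, since the exact value the author used is not shown in the post:

```python
# After 1:5 under-sampling, down-weighting the positive class counteracts the
# model's tendency to classify too many 0s as 1 (sketch; the value below is
# hypothetical and would need tuning).
params['scale_pos_weight'] = 0.2
```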
Training results and feature importance ranking
The final submission gave me a pleasant surprise of 0.9430. I wondered a little why the score was higher than the local validation accuracy: are some features unhelpful in local training but effective at prediction time? I will leave it for now. If I drop out of the top 10%, I am considering stacking several models (a method that is hardly usable in production but is a big weapon in competitions); still, I would rather find more features and improve the score that way.
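If it came to that, a minimal blending sketch (my illustration, not the author's code) could rank-average the test predictions of several trained models. `result_dict_lgb` comes from the run above; `result_dict_xgb` is a hypothetical second run of `train_model_classification` with `model_type='xgb'`:

```python
import pandas as pd

# Rank-average blending of two models' test predictions (sketch).
# Converting scores to ranks makes differently calibrated models comparable
# before averaging, which is all AUC cares about.
pred_lgb = pd.Series(result_dict_lgb['prediction'].ravel()).rank(pct=True)
pred_xgb = pd.Series(result_dict_xgb['prediction'].ravel()).rank(pct=True)  # hypothetical second model
blend = 0.5 * pred_lgb + 0.5 * pred_xgb
```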