PyTorch deep learning practice, introductory notes 9 exercise: multi-class classification on Kaggle's Otto dataset

Posted by saint4 on Fri, 28 Jan 2022 06:24:41 +0100

In the article "Introduction to PyTorch deep learning practice notes 9 - SoftMax classifier", Mr. Liu set an after-class exercise: download Kaggle's Otto dataset and use it for multi-class classification.

0 Overview

Let's first look at the background given on the official website.

The Otto Group is one of the world's biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). We are selling millions of products worldwide every day, with several thousand products being added to our product line.

A consistent analysis of the performance of our products is crucial. However, due to our diverse global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights we can generate about our product range.

For this competition, we have provided a dataset with 93 features for more than 200,000 products. The objective is to build a predictive model which is able to distinguish between our main product categories. The winning models will be open sourced.

1 Data acquisition

The data can be downloaded from the official competition page, Otto Group Product Classification Challenge | Kaggle.

2 view data

Read the data first, and then check the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#1. Read data
otto_data = pd.read_csv("./otto/train.csv")
otto_data.describe()  #8 rows × 94 columns(id  feat_1 ... feat_93)

otto_data.shape

The training set has 61878 rows and 95 columns: id, the 93 features feat_1 to feat_93, and target. The summary statistics displayed by describe() above cover the 94 numeric columns (id plus the 93 features).

Since target is a string column, describe() skips it, so we plot its class distribution instead. The code is as follows:

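import seaborn as sns
# Pass the column by keyword; recent seaborn versions treat the first positional argument as data
sns.countplot(x="target", data=otto_data)
plt.show()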
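
If a numeric view of the same distribution is useful, pandas can count the labels directly. A minimal sketch on a small made-up frame (the rows below are invented for illustration and are not taken from train.csv):

```python
import pandas as pd

# Tiny stand-in for otto_data; the values are invented for illustration.
toy = pd.DataFrame({"target": ["Class_2", "Class_1", "Class_2", "Class_9", "Class_2"]})

# Count how many rows fall into each class, most frequent first.
counts = toy["target"].value_counts()
print(counts)

# Relative frequencies make class imbalance easier to spot.
fractions = toy["target"].value_counts(normalize=True)
print(fractions)
```

On the real data the same two calls would be applied to otto_data["target"].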

Target has 9 categories in total (Class_1 to Class_9). Because the labels are strings, we define a function that converts each label into an integer index, which is the form the cross-entropy loss needs later. The code is as follows:

def target2idx(targets):
    # Map 'Class_1' ... 'Class_9' to the indices 0 ... 8
    target_idx = []
    target_labels = ['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6', 'Class_7', 'Class_8', 'Class_9']
    for target in targets:
        target_idx.append(target_labels.index(target))
    return target_idx
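
The list lookup above scans the label list once per row. A dictionary gives the same mapping with a constant-time lookup; the sketch below is a hypothetical variant (the names TARGET_LABELS, LABEL_TO_IDX and target2idx_dict are not from the original exercise):

```python
# Build the label -> index table once; each lookup is then a dictionary access.
TARGET_LABELS = [f"Class_{i}" for i in range(1, 10)]  # Class_1 ... Class_9
LABEL_TO_IDX = {label: idx for idx, label in enumerate(TARGET_LABELS)}


def target2idx_dict(targets):
    """Hypothetical variant of target2idx that uses a dictionary lookup."""
    return [LABEL_TO_IDX[t] for t in targets]


print(target2idx_dict(["Class_1", "Class_5", "Class_9"]))  # [0, 4, 8]
```

An unknown label raises a KeyError here, just as list.index raises a ValueError in the original function.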

3 build model

3.1 reading data

import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import torch
import torch.optim as optim

#1. Read data
class OttoDataset(Dataset):
    def __init__(self, filepath):
        data = pd.read_csv(filepath)
        labels = data['target']
        self.len = data.shape[0]

        # Drop the first column (id) and the last column (target); keep the 93 features
        self.X_data = torch.tensor(np.array(data)[:, 1:-1].astype(float))
        # Convert the string labels to integer class indices (0-8)
        self.y_data = target2idx(labels)

    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]

    def __len__(self):
        return self.len

otto_dataset1 = OttoDataset('./otto/train.csv')
# File name kept as in the original post. OttoDataset reads a 'target' column,
# so this file must contain one.
otto_dataset2 = OttoDataset('./otto/testn.csv')
train_loader = DataLoader(dataset=otto_dataset1, batch_size=64,
                          shuffle=True, num_workers=2)
test_loader = DataLoader(dataset=otto_dataset2, batch_size=64,
                         shuffle=False, num_workers=2)
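
Nothing in this setup measures accuracy on labelled data the model has not seen, so holding out part of the training set is one way to check progress locally. A minimal sketch, assuming an 80/20 split and using a synthetic TensorDataset in place of OttoDataset (the split ratio, the seed and the variable names are assumptions, not part of the original exercise):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Synthetic stand-in for the Otto training set: 100 rows, 93 features, 9 classes.
features = torch.rand(100, 93)
labels = torch.randint(0, 9, (100,))
full_dataset = TensorDataset(features, labels)

# 80/20 split; the fixed generator seed makes the split reproducible.
n_val = len(full_dataset) // 5
n_train = len(full_dataset) - n_val
train_set, val_set = random_split(
    full_dataset, [n_train, n_val], generator=torch.Generator().manual_seed(0)
)

train_loader_demo = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader_demo = DataLoader(val_set, batch_size=64, shuffle=False)

# Each batch is a (features, labels) pair, as with OttoDataset.
x_batch, y_batch = next(iter(train_loader_demo))
print(x_batch.shape, y_batch.shape)
```

The same random_split call should work on an OttoDataset instance, since it only relies on __len__ and __getitem__.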

3.2 building models

#2. Build model
class OttoNet(torch.nn.Module):
    def __init__(self):
        super(OttoNet, self).__init__()
        # 93 input features -> 9 class scores
        self.linear1 = torch.nn.Linear(93, 64)
        self.linear2 = torch.nn.Linear(64, 32)
        self.linear3 = torch.nn.Linear(32, 16)
        self.linear4 = torch.nn.Linear(16, 9)
        self.relu = torch.nn.ReLU()
        self.dropout = torch.nn.Dropout(p=0.1)

    def forward(self, x):
        x = x.view(-1, 93)
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        x = self.dropout(x)
        x = self.relu(self.linear3(x))
        # Return raw scores (logits): CrossEntropyLoss applies log-softmax itself,
        # so no Softmax layer is added here
        x = self.linear4(x)
        return x


ottomodel = OttoNet()
ottomodel

Output:

OttoNet(
  (linear1): Linear(in_features=93, out_features=64, bias=True)
  (linear2): Linear(in_features=64, out_features=32, bias=True)
  (linear3): Linear(in_features=32, out_features=16, bias=True)
  (linear4): Linear(in_features=16, out_features=9, bias=True)
  (relu): ReLU()
  (dropout): Dropout(p=0.1, inplace=False)
)

3.3 construct loss and optimizer

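#3. Loss and optimizer
# CrossEntropyLoss combines log-softmax and negative log-likelihood,
# so it expects the raw scores (logits) of the network and integer class indices
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(ottomodel.parameters(), lr=0.01, momentum=0.56)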
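
Because CrossEntropyLoss already applies log-softmax, feeding it probabilities that have gone through a softmax puts a floor under the loss. The sketch below illustrates this on a single hand-made sample with 9 classes; the tensor values are chosen for illustration only:

```python
import math
import torch
import torch.nn.functional as F

# One sample whose raw scores strongly favour class 0 out of 9 classes.
logits = torch.tensor([[20.0] + [0.0] * 8])
target = torch.tensor([0])

# Cross-entropy on raw logits: a confident, correct prediction gives a loss near zero.
loss_on_logits = F.cross_entropy(logits, target).item()

# Softmax applied first: the loss is computed on values in [0, 1],
# so it cannot drop below log(1 + 8/e), however confident the model is.
loss_on_probs = F.cross_entropy(F.softmax(logits, dim=1), target).item()

floor = math.log(1 + 8 / math.e)
print(loss_on_logits, loss_on_probs, floor)
```

A training loss that stalls well above zero is therefore worth checking against this effect before tuning the learning rate or the architecture.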

3.4 training model

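if __name__ == '__main__':
    ottomodel.train()  # make sure dropout is active during training
    for epoch in range(10):
        running_loss = 0.0
        for batch, data in enumerate(train_loader):
            inputs, target = data
            optimizer.zero_grad()
            outputs = ottomodel(inputs.float())
            loss = criterion(outputs, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            # Report the mean loss over the last 500 batches
            if batch % 500 == 499:
                print('[%d, %5d] loss: %.3f' % (epoch+1, batch+1, running_loss/500))
                running_loss = 0.0

Output (recorded from the original run, which divided the running loss by 300 rather than 500, so each value is 5/3 of the mean batch loss):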

[1,   500] loss: 3.591
[2,   500] loss: 3.011
[3,   500] loss: 2.957
[4,   500] loss: 2.940
[5,   500] loss: 2.902
[6,   500] loss: 2.881
[7,   500] loss: 2.873
[8,   500] loss: 2.800
[9,   500] loss: 2.789
[10,   500] loss: 2.779

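3.5 prediction

ottomodel.eval()  # switch off dropout for inference
with torch.no_grad():
    output = []
    for data in test_loader:
        inputs, labels = data
        # Index of the largest score in each row = predicted class (0-8)
        outputs = torch.max(ottomodel(inputs.float()), 1)[1]
        output.extend(outputs.numpy().tolist())

Save the results and submit them to Kaggle.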

submission = pd.read_csv('./otto/sampleSubmission.csv')#(144368, 10)
submission['target'] = output
submission.to_csv('./otto/submission_result1.csv', index=False)

The submission failed because the data format was incorrect. I am still looking into the reason; if you have run into the same problem, please let me know what caused it. Thank you.
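
One likely cause, judging from the (144368, 10) shape of sampleSubmission.csv: the sample file has an id column plus one column per class, and the competition is scored on predicted class probabilities, so a single target column of class indices does not match that layout. A minimal sketch of the expected shape, assuming the columns are named id and Class_1 to Class_9 and using random scores in place of the model output (check the header of sampleSubmission.csv before relying on these names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_classes = 5, 9  # the real test set has 144368 rows

# Stand-in for the network's raw scores (logits) on the test rows.
logits = rng.normal(size=(n_rows, n_classes))

# Row-wise softmax; subtracting the row maximum keeps exp() numerically stable.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# One probability column per class, preceded by the row id (assumed to start at 1).
class_columns = [f"Class_{i}" for i in range(1, n_classes + 1)]
submission_demo = pd.DataFrame(probs, columns=class_columns)
submission_demo.insert(0, "id", np.arange(1, n_rows + 1))

print(submission_demo.shape)  # (5, 10)
```

With the real model, probs would come from applying a softmax to the network's outputs for each test batch, and to_csv(..., index=False) would write the file as before.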

Note: these are study notes. If you find any mistakes, please point them out! Writing an article takes effort, so please contact me before reprinting.

Topics: Python Machine Learning neural networks Pytorch Deep Learning