# Li Mu took you to learn AI for the sake of a small paper

Posted by elite_prodigy on Fri, 12 Nov 2021 15:50:34 +0100

# Today, let's test the softmax image classification problem

Note: I have left out the animation display and the accuracy-calculation code.
torchvision.transforms is PyTorch's image preprocessing package. It contains many functions for transforming image data, which are essential when reading image data.

• torchvision.transforms.Compose(): combine multiple transforms together

• transforms.ToTensor() converts an image into a torch.FloatTensor

• transforms.Resize(resize) resizes the image to the given resolution

• A note on the role of keepdim:

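```
import numpy as np
import torch

a = torch.ones((2, 2))

b = np.array([[1, 2, 3], [1, 1, 1]])
c = torch.from_numpy(b)

# Sum over dim=0 and keep that dimension: the result has shape (1, 3)
interval_0 = torch.sum(c, dim=0, keepdim=True)

# Sum over dim=1 without keepdim: the result has shape (2,)
interval_1 = torch.sum(c, dim=1)
```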

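keepdim=True does two things. First, it stops the result from collapsing into a flat one-dimensional vector. Second, it keeps the summed dimension (with size 1) instead of squeezing it away, so the result can still be broadcast against the original tensor.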
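To make the keepdim point concrete, here is a small sketch of my own (the names kept, squeezed and row_fractions are made up for illustration). It compares the two shapes and shows why the kept dimension matters for broadcasting:

```python
import torch

c = torch.tensor([[1, 2, 3], [1, 1, 1]])

kept = c.sum(dim=1, keepdim=True)   # shape (2, 1): one sum per row, still 2-D
squeezed = c.sum(dim=1)             # shape (2,): the summed dimension is gone

print(kept.shape, squeezed.shape)

# Dividing by `kept` broadcasts each row sum across its own row
row_fractions = c / kept
print(row_fractions.sum(dim=1))     # every row now sums to 1
```

Dividing c by squeezed instead would fail, because a (2,) tensor cannot be broadcast against a (2, 3) tensor.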

```
import torch
from torchvision import transforms

batch_size = 256

def get_dataloader_workers():
    """Number of worker processes used to read data."""
    return 2

# Now we combine the components into the transform pipeline used to load the dataset
resize = None  # Set to a size such as 64 to change the image resolution
trans = [transforms.ToTensor()]  # A list of the transforms to apply to each PIL image
if resize:
    trans.insert(0, transforms.Resize(resize))  # If resize is set, the resolution is changed here
trans = transforms.Compose(trans)  # Combine these operations

# Initialize the parameters of the model
num_inputs = 784   # Each 28 * 28 image is flattened into a 1 * 784 vector, which loses spatial information
num_outputs = 10   # The prediction covers 10 categories

# Define the softmax operation
def softmax(X):
    X_exp = torch.exp(X)  # Exponentiate every element
    partition = X_exp.sum(1, keepdim=True)  # One normalizing sum per row
    return X_exp / partition  # Note that broadcasting is applied here
```

A few words about softmax.
X = torch.normal(0, 1, (2, 5)) creates a matrix with 2 rows and 5 columns whose entries are drawn from a normal distribution with mean 0 and standard deviation 1, for example:
tensor([[ 1.3066, 0.0417, 0.6489, -0.0553, -0.9866],[-1.7921, -0.4884, 1.7815, 2.2112, -0.6010]])
X_exp = torch.exp(X) raises e to the power of each element.
partition = X_exp.sum(1, keepdim=True) then sums along dim=1, so each row collapses into a single value and the result is a column. For the matrix above:
partition.shape
torch.Size([2, 1])
Why a column? Because each row is one sample, and we want a probability distribution over the entries within each row, so every row needs its own normalizing sum. Tip: the final return statement relies on broadcasting. In X_exp / partition, partition is expanded to a [2 * 5] matrix in which every column is a copy of the original single column.
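As a quick check, here is a sketch that runs the softmax described here on the example matrix. The variable name probs is my own. It confirms that every row of the result is a probability distribution:

```python
import torch

def softmax(X):
    X_exp = torch.exp(X)                     # element-wise exponential
    partition = X_exp.sum(1, keepdim=True)   # shape (2, 1): one sum per row
    return X_exp / partition                 # partition is broadcast across the columns

X = torch.tensor([[ 1.3066,  0.0417, 0.6489, -0.0553, -0.9866],
                  [-1.7921, -0.4884, 1.7815,  2.2112, -0.6010]])
probs = softmax(X)

print(probs.shape)        # torch.Size([2, 5])
print(probs.sum(dim=1))   # each row sums to 1, up to rounding
```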

Supplement
tensor.reshape(-1): when we reshape a tensor, the total number of elements has to stay the same. A -1 marks one dimension as unknown, and its size is worked out from the total element count and the other dimensions.
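A small sketch of the -1 rule, using a zero tensor shaped like one batch of Fashion-MNIST images (the batch is made up, only the shapes matter):

```python
import torch

X = torch.zeros(256, 1, 28, 28)   # 256 * 1 * 28 * 28 = 200704 elements

flat = X.reshape(-1, 784)   # -1 is inferred as 200704 / 784 = 256
print(flat.shape)           # torch.Size([256, 784])

vec = X.reshape(-1)         # a lone -1 flattens everything into one dimension
print(vec.shape)            # torch.Size([200704])
```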

Let me briefly describe how the cross entropy is calculated

## Start again

```
%matplotlib inline
import torch
import torchvision
from IPython import display
from torch.utils import data
from torchvision import transforms
from d2l import torch as d2l
```

These are the libraries we need to import. I only now realized that d2l is a library Mr. Li wrote specially to make studying easier for us. Comrades, let your tears flow on the public screen.

```
def get_fashion_mnist_labels(labels):
    """Return the text labels of the Fashion-MNIST dataset."""
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]
```

This function takes a list of integer labels, such as labels = [1, 2, 3], and returns the corresponding text labels.
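For instance, a quick usage sketch with the function repeated so the snippet runs on its own:

```python
def get_fashion_mnist_labels(labels):
    """Return the text labels of the Fashion-MNIST dataset."""
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

names = get_fashion_mnist_labels([1, 2, 3])
print(names)   # ['trouser', 'pullover', 'dress']
```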

```
def get_dataloader_workers():
    """Use four processes to read the data."""
    return 4
```

The data is read in parallel. Turn this number up or down to suit your computer.

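```
def load_data_fashion_mnist(batch_size, resize=None):
    trans = [transforms.ToTensor()]  # The transforms to apply to each image
    if resize:  # If resize is set, change the image resolution first
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)  # Chain the transforms together
    # root="../data" is an assumed download directory; the original call was cut off
    mnist_train = torchvision.datasets.FashionMNIST(  # Training set
        root="../data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(  # Test set
        root="../data", train=False, transform=trans, download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=get_dataloader_workers()),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=get_dataloader_workers()))
```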

Note that we need to apply a series of operations to the data, and trans can be understood as the collection of those operations.
resize changes the image resolution.
In PyTorch, all our data needs to be wrapped in a DataLoader. Given a dataset, we specify the batch size, whether to shuffle, and how many processes read the data. This is used a lot, so I won't say more.
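To see those three DataLoader settings without downloading anything, here is a sketch on made-up data. The tensors features and labels are stand-ins I invented, not the real Fashion-MNIST dataset:

```python
import torch
from torch.utils import data

# Synthetic stand-in for Fashion-MNIST: 1000 blank 1x28x28 images with labels 0-9
features = torch.zeros(1000, 1, 28, 28)
labels = torch.arange(1000) % 10
dataset = data.TensorDataset(features, labels)

# num_workers=0 loads in the main process; raise it to read in parallel
loader = data.DataLoader(dataset, batch_size=256, shuffle=True, num_workers=0)

X, y = next(iter(loader))
print(X.shape, y.shape)   # torch.Size([256, 1, 28, 28]) torch.Size([256])
print(len(loader))        # 4 batches: 256 + 256 + 256 + 232
```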

```
def softmax(X):
    X_exp = torch.exp(X)  # First, exponentiate every element of the input matrix X
    # Sum along dim=1 (across the columns). Why? Each row is a sample,
    # so we need row sums rather than column sums
    partition = X_exp.sum(1, keepdim=True)
    return X_exp / partition  # Broadcasting is applied here
```

Let's take an example.
Our input X is a 256 * 784 matrix and W is a 784 * 10 matrix, so XW is a 256 * 10 matrix. We apply softmax to this 256 * 10 matrix.
In our example each row is one sample, so we need row sums, not column sums.
Finally, broadcasting is applied. The summation gives a 256 * 1 matrix, which is expanded to a [256 * 10] matrix for the division.

```
def net(X):
    # X arrives as 256 * 1 * 28 * 28 and is flattened to 256 * 784 here;
    # the -1 in reshape means that size is inferred from the other dimensions
    return softmax(torch.matmul(X.reshape((-1, W.shape[0])), W) + b)
```

Here we define our network. It can be understood simply as a linear mapping followed by a softmax.
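To follow the shapes through the network step by step, here is a sketch on a random fake batch. The intermediate names flat, logits and partition are my own, and the batch is random noise rather than real images:

```python
import torch

torch.manual_seed(0)
num_inputs, num_outputs = 784, 10
W = torch.normal(0, 0.01, size=(num_inputs, num_outputs))
b = torch.zeros(num_outputs)

X = torch.rand(256, 1, 28, 28)                        # a fake batch of images
flat = X.reshape((-1, W.shape[0]))                    # 256 x 784
logits = torch.matmul(flat, W) + b                    # 256 x 10; b is broadcast over the rows
partition = torch.exp(logits).sum(1, keepdim=True)    # 256 x 1
y_hat = torch.exp(logits) / partition                 # 256 x 10; partition is broadcast over the columns

print(flat.shape, logits.shape, partition.shape, y_hat.shape)
```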

```
def cross_entropy(y_hat, y):
    # For each sample, pick the predicted probability of its true class
    # and take the negative log
    return -torch.log(y_hat[range(len(y_hat)), y])
```

We define our loss function using cross entropy. For classification problems, it is the loss function we use most of the time. We will go through the specific formula later.
For now, all I will say is:

y_hat is a 256 * 10 matrix
y is a vector of 256 class labels, one per sample
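The indexing trick is easier to see on a tiny case. In this sketch the probabilities and labels are invented: two samples, three classes. picked holds the probability each sample assigned to its true class:

```python
import torch

def cross_entropy(y_hat, y):
    return -torch.log(y_hat[range(len(y_hat)), y])

y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])   # predicted probabilities, one row per sample
y = torch.tensor([0, 2])                  # true class of each sample

picked = y_hat[range(len(y_hat)), y]      # y_hat[0, 0] and y_hat[1, 2]
print(picked)                             # tensor([0.1000, 0.5000])
print(cross_entropy(y_hat, y))            # tensor([2.3026, 0.6931])
```

The loss is large (2.3026) where the true class got only 0.1 probability and smaller (0.6931) where it got 0.5.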

```
def train_epoch_ch3(net, train_iter, loss, updater):  # The function that actually does the training
    """Train the model for one epoch (defined in Chapter 3)."""
    # net is the network we created: here, a fully connected layer plus softmax
    # train_iter iterates over the training dataset
    # loss is the loss function we defined
    # updater is the SGD optimizer: each parameter takes a step of size lr against its gradient
    # Accumulator and accuracy belong to the omitted accuracy code
    # (available as d2l.Accumulator and d2l.accuracy)
    if isinstance(net, torch.nn.Module):
        net.train()  # Put the network in training mode
    # Sum of training loss, number of correct predictions, number of samples
    metric = Accumulator(3)
    for X, y in train_iter:
        # Each batch holds 256 images and their labels:
        # X has shape 256 * 1 * 28 * 28 and y holds 256 labels
        y_hat = net(X)      # The predictions: a 256 * 10 matrix
        l = loss(y_hat, y)  # The loss, using cross entropy
        if isinstance(updater, torch.optim.Optimizer):
            # Built-in PyTorch optimizer and loss function
            updater.zero_grad()
            l.mean().backward()
            updater.step()  # Update the model with the computed gradients
        else:
            # Custom optimizer and loss function
            l.sum().backward()
            updater(X.shape[0])  # Update the parameters
        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
    # Return the average training loss and the training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]
```

Remember the order of the steps: first compute the loss, then call backward() on that loss to backpropagate and get the gradients, and only then update the model. It must be done in this order.
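Here is that order on a toy problem. This is a sketch with invented data and a plain squared-error loss, not the softmax model from this post:

```python
import torch

w = torch.zeros(3, requires_grad=True)
X = torch.tensor([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]])
y = torch.tensor([1.0, 0.0])
lr = 0.1

# 1. Forward pass: compute the loss
loss = ((X @ w - y) ** 2).mean()
# 2. Backpropagation: fills in w.grad
loss.backward()
# 3. Update the parameter, outside of gradient tracking
with torch.no_grad():
    w -= lr * w.grad
    w.grad.zero_()   # clear the gradient before the next step

new_loss = ((X @ w - y) ** 2).mean()
print(loss.item(), new_loss.item())   # the loss goes down after one step
```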

```
def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
    """Train the model (defined in Chapter 3)."""
    # Animator draws the training curves; we can set it aside for now
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
                        legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):  # One pass over the data per training epoch
        # Same arguments as before: the network, the training data,
        # the loss function, and the SGD updater
        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
        test_acc = evaluate_accuracy(net, test_iter)
        animator.add(epoch + 1, train_metrics + (test_acc,))
    train_loss, train_acc = train_metrics
    assert train_loss < 0.5, train_loss
    assert train_acc <= 1 and train_acc > 0.7, train_acc
    assert test_acc <= 1 and test_acc > 0.7, test_acc
```
```
lr = 0.1  # Learning rate (an assumed value; lr was not defined in the original snippet)

def updater(batch_size):
    # SGD update: move each parameter by lr against its gradient
    return d2l.sgd([W, b], lr, batch_size)

batch_size = 256  # The batch size
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)  # Get the training set and the test set
for X, y in train_iter:
    print(X.reshape(-1, 784).shape)
    break

num_inputs = 784
num_outputs = 10

# Weights drawn from a normal distribution with mean 0 and standard deviation 0.01:
# a layer with 784 inputs and 10 outputs, with gradient recording turned on
W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)
b = torch.zeros(num_outputs, requires_grad=True)  # The bias, also with gradient recording

num_epochs = 10
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)  # Now training can start
```
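For a feel of what an updater like d2l.sgd does, here is a minimal sketch of minibatch SGD. The body is my own assumption about the idea, not a copy of the library source. The division by batch_size averages a gradient that was computed from a summed loss:

```python
import torch

def sgd(params, lr, batch_size):
    """Minibatch SGD sketch: step each parameter against its gradient."""
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()   # reset so gradients do not accumulate

w = torch.tensor([1.0, 2.0], requires_grad=True)
(w ** 2).sum().backward()        # the gradient is 2 * w = [2, 4]
sgd([w], lr=0.1, batch_size=1)
print(w)                         # tensor([0.8000, 1.6000], requires_grad=True)
```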

## What if you use PyTorch's API?

```
from torch import nn

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)  # The data still has to be loaded by hand, baby

# nn.Flatten flattens the input. By default it starts from dimension 1, so the batch
# dimension is kept, unlike torch.flatten(), which starts from dimension 0
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)  # Define how the parameters are initialized

net.apply(init_weights)  # Apply the initializer to every layer

loss = nn.CrossEntropyLoss()  # The loss function, straight from the API

trainer = torch.optim.SGD(net.parameters(), lr=0.1)  # This is our optimizer

num_epochs = 10  # Number of training epochs
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)  # So simple and elegant: start training
```

From this we can see that the essence of the so-called Linear layer is matrix multiplication. The so-called neural network is just one such linear transformation after another, which is not difficult.
I also want to point out what each piece does. The loss specifies our loss function, backward() is backpropagation, which computes the gradients, and the optimizer decides how to update the parameters once the gradients and the learning rate are known. The full process runs like this: the forward pass computes the loss (the operations are recorded because the parameters require gradients), backward() then computes and stores the gradients, and finally the optimizer updates the parameters. That's it!!!! A classic process.
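Both claims, that Linear is a matrix multiplication and that nn.Flatten differs from torch.flatten(), can be checked in a few lines. This is a sketch on a random fake batch, and the name manual is my own:

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.rand(4, 1, 28, 28)

# nn.Flatten keeps the batch dimension; torch.flatten flattens everything by default
print(nn.Flatten()(X).shape)    # torch.Size([4, 784])
print(torch.flatten(X).shape)   # torch.Size([3136])

linear = nn.Linear(784, 10)
flat = nn.Flatten()(X)
# The weight is stored as (out_features, in_features), hence the transpose
manual = flat @ linear.weight.T + linear.bias
print(torch.allclose(linear(flat), manual, atol=1e-5))
```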

Topics: AI Pytorch Deep Learning