Today, let's work through the softmax image classification problem.
!!!!! Note: I omitted the animation part that displays the accuracy as training progresses.
torchvision.transforms is the image preprocessing package in PyTorch. It contains many functions for transforming image data, which are essential when reading image data (see the sketch after this list):

- torchvision.transforms.Compose(): combines multiple transforms together
- transforms.ToTensor(): converts an image into a torch.FloatTensor
- transforms.Resize(resize): resizes the image to the given size
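To make these three pieces concrete, here is a minimal sketch of how they fit together; the dummy PIL image and the target size 64 are my own illustrative choices, not part of the original notes:

```python
from torchvision import transforms
from PIL import Image

trans = transforms.Compose([
    transforms.Resize(64),   # resize the (shorter side of the) image to 64 pixels
    transforms.ToTensor(),   # PIL image -> torch.FloatTensor in [0, 1], shape (C, H, W)
])

img = Image.new("L", (28, 28))  # a dummy 28x28 grayscale image, just for illustration
x = trans(img)
print(x.dtype, x.shape)         # torch.float32 torch.Size([1, 64, 64])
```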
A quick note on the role of keepdim:
```python
import torch
import numpy as np

a = torch.ones((2, 2))
b = np.array([[1, 2, 3], [1, 1, 1]])
c = torch.from_numpy(b)                          # build a tensor from the numpy array
interval_0 = torch.sum(c, dim=0, keepdim=True)   # sum over dim 0, keep it -> shape [1, 3]
interval_1 = torch.sum(c, dim=1)                 # sum over dim 1, squeezed -> shape [2]
```
keepdim=True keeps the reduced dimension as a size-1 dimension instead of squeezing it away, so the summed result does not collapse into a flat row of numbers. Because the result stays 2-D, it can still be broadcast against the original tensor (for example when dividing by a row sum).
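A quick self-contained check of what this means for the shapes (reusing the same toy matrix):

```python
import torch

c = torch.tensor([[1, 2, 3], [1, 1, 1]])
print(torch.sum(c, dim=0, keepdim=True).shape)  # torch.Size([1, 3]) -- the reduced dim is kept with size 1
print(torch.sum(c, dim=1).shape)                # torch.Size([2])    -- the reduced dim is squeezed away
print(c / torch.sum(c, dim=1, keepdim=True))    # broadcasting works because [2, 1] lines up with [2, 3]
```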
```python
batch_size = 256

def get_dataloader_workers():
    """Use two processes to read the data."""
    return 2

# Now we put all the components together and define a function to get the dataset
def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and load it into memory."""
    trans = [transforms.ToTensor()]  # the list of transformations to apply to the PIL image data
    if resize:
        trans.insert(0, transforms.Resize(resize))  # if resize is given, the resolution is reset here
    trans = transforms.Compose(trans)  # combine these operations
    mnist_train = torchvision.datasets.FashionMNIST(   # load the training data into memory
        root="./data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(    # load the test data into memory
        root="./data", train=False, transform=trans, download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=get_dataloader_workers()),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=get_dataloader_workers()))

batch_size = 256
train_iter, test_iter = load_data_fashion_mnist(batch_size)

# Initialize the model parameters
num_inputs = 784   # we flatten each 28 * 28 image into a 1 * 784 vector, even though this loses spatial information
num_outputs = 10   # in the end our prediction has only 10 categories

W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)  # our layer's weights, with gradient tracking
b = torch.zeros(num_outputs, requires_grad=True)                               # our bias, with gradient tracking

# Define the softmax operation
def softmax(X):
    X_exp = torch.exp(X)                    # exponentiate every element
    partition = X_exp.sum(1, keepdim=True)  # row sums, shape [n, 1]
    return X_exp / partition                # note that the broadcasting mechanism is applied here
```
I have something to say about Softmax here
```python
X = torch.normal(0, 1, (2, 5))  # a random matrix with mean 0 and variance 1, with 2 rows and 5 columns
X
```
```
tensor([[ 1.3066,  0.0417,  0.6489, -0.0553, -0.9866],
        [-1.7921, -0.4884,  1.7815,  2.2112, -0.6010]])
```
X_exp = torch.exp(X) raises e to the power of each element.
`partition = X_exp.sum(1, keepdim=True)`

Summing with dim=1 collapses each row into a single value, so our data ends up as a column. Why a column? For example, here it becomes:
```python
partition = X_exp.sum(1, keepdim=True)
partition.shape
```
```
torch.Size([2, 1])
```
Because each row represents one sample, the distribution we want to predict lives within each row, so we sum along the rows. Tip: the broadcasting mechanism is applied in the final return. There, the [2, 1] partition is expanded to a [2, 5] matrix in which every column is a copy of that single column, and X_exp is divided by it elementwise.
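Putting the broadcast and the row sums together in one small self-contained check (same toy shapes as above):

```python
import torch

X = torch.normal(0, 1, (2, 5))
X_exp = torch.exp(X)
partition = X_exp.sum(1, keepdim=True)  # shape [2, 1]
probs = X_exp / partition               # broadcast: the [2, 1] column is repeated to match [2, 5]
print(probs.shape, probs.sum(1))        # torch.Size([2, 5]), and each row sums to (approximately) 1
```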
Supplement:
tensor.reshape(-1, ...): we all know that when we reshape a tensor, the total number of elements must stay the same. The -1 stands for an unknown size; that dimension is computed automatically from the sizes of the other dimensions.
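A tiny illustration of how the -1 is inferred:

```python
import torch

a = torch.arange(12)
print(a.reshape(-1, 4).shape)  # torch.Size([3, 4]) -- the -1 is inferred as 12 / 4 = 3
print(a.reshape(3, -1).shape)  # torch.Size([3, 4]) -- here the -1 is inferred as 12 / 3 = 4
```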
Let me briefly describe how the cross entropy is calculated
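Since the exact formula never shows up again in these notes, here it is for reference: for a one-hot label vector y over q classes and a predicted distribution ŷ,

$$\ell(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j = -\log \hat{y}_{\text{true class}},$$

which is exactly why the cross_entropy code later only needs to index out the predicted probability of the true class and take its negative log.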
Start again
```python
%matplotlib inline
import torch
import torchvision
from IPython import display
from torch.utils import data
from torchvision import transforms
from d2l import torch as d2l
```
These are the libraries we need to import. I just realized that the d2l library is actually a library written specially by Mr. Li to make studying easier for us. Comrades, post your tears of gratitude on the public screen.
```python
def get_fashion_mnist_labels(labels):
    """Return the text labels of the Fashion-MNIST dataset."""
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]
```
This function takes a list of integer labels, e.g. labels = [1, 2, 3], and returns the corresponding text labels.
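For example (the indices here are my own pick):

```python
print(get_fashion_mnist_labels([0, 1, 9]))  # ['t-shirt', 'trouser', 'ankle boot']
```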
```python
def get_dataloader_workers():
    """Use four processes to read the data."""
    return 4
```
This controls how many processes read the data in parallel; you can turn it up or down depending on your machine.
```python
def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and load it into memory."""
    trans = [transforms.ToTensor()]                 # here we specify how the data will be transformed
    if resize:                                      # if resize is given, reset the resolution
        trans.insert(0, transforms.Resize(resize))  # the Resize transform changes the resolution
    trans = transforms.Compose(trans)               # put this series of operations together
    mnist_train = torchvision.datasets.FashionMNIST(   # get the training dataset
        root="./data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(    # get the test dataset
        root="./data", train=False, transform=trans, download=True)
    # In PyTorch, wrapping a dataset in a DataLoader is the standard way to batch it
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=get_dataloader_workers()),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=get_dataloader_workers()))
```
Note that we need to perform a series of operations on the data; trans can be understood as the collection of those operations.
resize redefines the image resolution.
In PyTorch, all our data needs to be wrapped in a DataLoader: given a dataset, you specify the batch size, whether to shuffle, and how many worker processes read it. This is used everywhere, so I won't say more (see the check below).
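As a concrete check of what comes out of these loaders, here is roughly what one batch looks like, assuming batch_size = 256 and no resize:

```python
train_iter, test_iter = load_data_fashion_mnist(256)
for X, y in train_iter:
    print(X.shape, X.dtype, y.shape, y.dtype)
    # torch.Size([256, 1, 28, 28]) torch.float32 torch.Size([256]) torch.int64
    break
```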
```python
def softmax(X):
    X_exp = torch.exp(X)                    # X may be a vector or a matrix; exponentiate every element
    partition = X_exp.sum(1, keepdim=True)  # sum along dim 1: each row is a sample, so we need row sums, not column sums
    return X_exp / partition                # the broadcasting mechanism is applied here
```
Let's take this example. Our input X is a [256, 784] matrix and W is a [784, 10] matrix, so XW + b is a [256, 10] matrix, and we perform the softmax operation on that [256, 10] matrix.

In our example each row represents one sample, so we need the row sum, not the column sum.

Finally the broadcasting mechanism is applied: the row-wise sum gives a [256, 1] matrix, which is expanded to a [256, 10] matrix before the division.
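A minimal shape check of that walkthrough, with random tensors standing in for a real batch (the sizes are the ones from this chapter; the check itself is my own):

```python
import torch

X = torch.normal(0, 1, (256, 784))    # a flattened batch of 256 images
W = torch.normal(0, 0.01, (784, 10))
b = torch.zeros(10)
O = torch.matmul(X, W) + b            # [256, 784] @ [784, 10] + [10] -> [256, 10]
P = torch.exp(O) / torch.exp(O).sum(1, keepdim=True)  # the [256, 1] row sums broadcast back over [256, 10]
print(O.shape, P.shape)               # torch.Size([256, 10]) torch.Size([256, 10])
```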
```python
def net(X):
    # X arrives as 256 * 1 * 28 * 28; the reshape stretches it out to 256 * 784.
    # The -1 means that the size of this position is computed from the other dimensions.
    return softmax(torch.matmul(X.reshape((-1, W.shape[0])), W) + b)
```
Here we define our network layer. It can simply be understood as a linear mapping followed by a softmax.
```python
def cross_entropy(y_hat, y):
    # For classification problems we use the cross-entropy loss most of the time.
    # For each sample, pick out the predicted probability of its true class and take -log of it.
    return - torch.log(y_hat[range(len(y_hat)), y])
```

We define our loss function using cross entropy; we will work through the exact formula later. For now, just keep in mind that y_hat is a [256, 10] matrix of predicted distributions and y holds the 256 true class labels, one integer per sample.
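Before moving on, here is the usual toy example (2 samples, 3 classes; the probabilities are made up) of what the y_hat[range(len(y_hat)), y] indexing picks out:

```python
import torch

y = torch.tensor([0, 2])                        # the true class of each of the two samples
y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])         # predicted distributions
print(y_hat[range(len(y_hat)), y])              # tensor([0.1000, 0.5000]) -- probability assigned to the true class
print(-torch.log(y_hat[range(len(y_hat)), y]))  # tensor([2.3026, 0.6931]) -- the per-sample cross-entropy loss
```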
```python
def train_epoch_ch3(net, train_iter, loss, updater):
    """Train the model for one epoch (defined in Chapter 3)."""
    # net: the network we created; here it is a fully connected layer followed by softmax
    # train_iter: our training data set
    # loss: the loss function we defined
    # updater: the sgd optimizer, which steps each parameter by lr in the gradient direction
    if isinstance(net, torch.nn.Module):
        net.train()  # set the network to training mode, i.e. enable gradient recording and computation
    # Accumulator for: total training loss, total training accuracy, number of samples
    metric = Accumulator(3)
    for X, y in train_iter:
        # By our definition, each iteration reads 256 images and their labels.
        # X has shape 256 * 1 * 28 * 28, y has shape 256 * 1.
        y_hat = net(X)      # compute predictions; this is in fact a 256 * 10 matrix
        l = loss(y_hat, y)  # compute the loss; cross entropy is used as the loss function
        if isinstance(updater, torch.optim.Optimizer):
            # Using PyTorch's built-in optimizer and loss function
            updater.zero_grad()  # clear the gradients, otherwise they accumulate on top of the old ones
            l.backward()         # compute the gradients
            updater.step()       # update the model with the computed gradients
            metric.add(float(l) * len(y), accuracy(y_hat, y), y.size().numel())
        else:
            # Using our custom optimizer and loss function: a different case
            l.sum().backward()   # our own loss returns a vector, so sum it before backward
            updater(X.shape[0])  # update the parameters
            metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
    # Return training loss and training accuracy
    return metric[0] / metric[2], metric[1] / metric[2]
```
Remember, the steps must always be: compute the loss first, then call backward() to backpropagate from that loss and compute the gradients, and only then update the model. It has to happen in that order.
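In the PyTorch-optimizer branch this boils down to the same few lines every time. A tiny self-contained sketch with a made-up 4-feature, 2-class model, just to pin down the order:

```python
import torch
from torch import nn

net = nn.Linear(4, 2)
loss = nn.CrossEntropyLoss()
trainer = torch.optim.SGD(net.parameters(), lr=0.1)

X = torch.randn(8, 4)          # a fake batch of 8 samples
y = torch.randint(0, 2, (8,))  # fake class labels

trainer.zero_grad()            # clear the old gradients first
l = loss(net(X), y)            # 1. compute the loss
l.backward()                   # 2. backpropagate to get the gradients
trainer.step()                 # 3. only now update the parameters
```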
```python
def train_ch3(net, train_iter, test_iter, loss, num_epochs, updater):
    """Train a model (defined in Chapter 3)."""
    # This part is the animation helper; we can set it aside for now
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0.3, 0.9],
                        legend=['train loss', 'train acc', 'test acc'])
    for epoch in range(num_epochs):  # iterate for the specified number of training epochs
        # net: the network we created (a fully connected layer followed by softmax)
        # train_iter: our training data set
        # loss: the loss function we defined
        # updater: the sgd optimizer, which steps each parameter by lr in the gradient direction
        train_metrics = train_epoch_ch3(net, train_iter, loss, updater)
        test_acc = evaluate_accuracy(net, test_iter)
        animator.add(epoch + 1, train_metrics + (test_acc,))
    train_loss, train_acc = train_metrics
    assert train_loss < 0.5, train_loss
    assert train_acc <= 1 and train_acc > 0.7, train_acc
    assert test_acc <= 1 and test_acc > 0.7, test_acc
```
```python
lr = 0.1  # learning rate (not defined in the original snippet; 0.1 is the value the book uses)

def updater(batch_size):
    # sgd is the optimization update: each parameter takes a step of size lr in the gradient direction
    return d2l.sgd([W, b], lr, batch_size)

batch_size = 256  # initialize the batch size
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)  # first get our training set and our test set

for X, y in train_iter:
    print(X.reshape(-1, 784).shape)
    break

num_inputs = 784
num_outputs = 10
# Create weights with mean 0 and standard deviation 0.01; this can be understood as a network
# with a 784-dimensional input and a 10-dimensional output, with gradient recording turned on
W = torch.normal(0, 0.01, size=(num_inputs, num_outputs), requires_grad=True)
b = torch.zeros(num_outputs, requires_grad=True)  # create our bias and turn on gradient recording

num_epochs = 10
train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, updater)  # in theory, you can start running now
```
What if we use PyTorch's high-level API instead?
```python
from torch import nn  # needed here; it was not in the imports at the top

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)  # the data still has to be loaded by myself, baby

# nn.Flatten flattens the data starting from dimension 1 by default, which is different from torch.flatten()
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)  # define how our parameters are initialized

net.apply(init_weights)  # initialize our network

loss = nn.CrossEntropyLoss()  # from now on just call the API directly; this is our loss function
trainer = torch.optim.SGD(net.parameters(), lr=0.1)  # this is our optimizer
num_epochs = 10  # the number of training epochs
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)  # so simple and elegant; start training
```
From here we can see that the essence of the so-called Linear layer is matrix multiplication; the so-called neural network is just one such linear transformation after another, which is not difficult.
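A quick check of that claim; the 784 and 10 are the shapes from this chapter, and the manual computation is my own re-derivation of what nn.Linear computes:

```python
import torch
from torch import nn

layer = nn.Linear(784, 10)
X = torch.randn(3, 784)
manual = X @ layer.weight.T + layer.bias  # a Linear layer is just a matrix multiplication plus a bias
print(torch.allclose(layer(X), manual))   # True
```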
I also want to point out the division of labor: the optimizer only updates the parameters once the gradients and the learning rate are determined; backward() is backpropagation, which computes the gradients; and loss specifies our loss function. So the full process is: compute the loss first (and, because training mode is on, the operations are recorded), then backward() computes and stores the gradients, and finally the optimizer updates the parameters. That's it!!!! A classic workflow.