Use of the torch.optim optimizer in PyTorch

Posted by ir4z0r on Wed, 05 Jan 2022 09:14:29 +0100

I. Basic usage of the optimizer

  1. Create an optimizer instance
  2. In the training loop:
    1. Zero the gradients
    2. Forward propagation
    3. Compute the loss
    4. Back propagation
    5. Update the parameters

Example:

from torch import optim

# net, criterion, labels and the input batch are assumed to be defined elsewhere
input = ...
optimizer = optim.SGD(params=net.parameters(), lr=1)  # create optimizer instance
optimizer.zero_grad()               # zero the gradients
output = net(input)                 # forward propagation
loss = criterion(output, labels)    # compute the loss
loss.backward()                     # back propagation
optimizer.step()                    # update the parameters

II. Optimizer

PyTorch provides torch.optim.lr_scheduler to help users change the learning rate during training. Let's start with the Optimizer class to see how it works.

Why start with Optimizer? Because both Adam and SGD inherit from this class, and the scheduler serves all optimizers, so the methods we will need are defined in this base class. We will only look at its attributes here; the full code is in the docs (link).

The first thing to look at is the initialization method, def __init__(self, params, defaults). Its params argument is the set of network parameters we pass in when creating the optimizer, e.g. alexnet.parameters(); all remaining keyword arguments (lr, momentum, and so on) are collected into a dict and passed in as defaults.
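As a rough illustration (a minimal sketch, not the real source), a subclass such as SGD simply packs its keyword arguments into the defaults dict before handing everything to the base class:

import torch

class MySGD(torch.optim.Optimizer):       # hypothetical subclass for illustration
    def __init__(self, params, lr=0.01, momentum=0.0):
        defaults = dict(lr=lr, momentum=momentum)   # extra kwargs -> defaults dict
        super().__init__(params, defaults)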
Take a look at what is stored in alexnet.parameters():

import torchvision
alexnet = torchvision.models.alexnet()   # e.g. an (untrained) torchvision AlexNet
for p in alexnet.parameters():
    print(p.shape)


You can see that the parameters of the whole network are stored here.
There are two ways to define an optimizer:

The first method:

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

With this form of initialization, the parameter iterable is wrapped into [{'params': alexnet.parameters()}], i.e. a list of length 1. The list is then processed and the entries from defaults are added to each group. Using alexnet as an example, it looks like the following:

optimizer = torch.optim.Adam(alexnet.parameters(), lr=0.001)
print([group.keys() for group in optimizer.param_groups])
# [dict_keys(['params', 'lr', 'betas', 'eps', 'weight_decay', 'amsgrad'])]
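Roughly speaking, the base class normalizes this input as follows (a simplified sketch, not the actual torch.optim source): wrap the bare iterable into a single group and merge in the defaults with setdefault, so per-group settings are never overwritten:

params = list(alexnet.parameters())
param_groups = [{'params': params}]       # a bare iterable becomes a single group
defaults = dict(lr=0.001)                 # collected from the keyword arguments
for group in param_groups:
    for key, value in defaults.items():
        group.setdefault(key, value)      # fill in only what the group did not set itself
print([g.keys() for g in param_groups])   # [dict_keys(['params', 'lr'])]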

The second method: sometimes different learning rates need to be assigned to different layers during training. In that case you can pass a list of dicts and work with the optimizer's param_groups:

optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)

Since the input is already a list of dicts, the optimizer only needs to fill in the missing entries from defaults for each group. Let's look at the result directly:

optimizer = torch.optim.SGD([
    {'params': alexnet.features.parameters()},
    {'params': alexnet.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
print([group.keys() for group in optimizer.param_groups])
# [dict_keys(['params', 'lr', 'momentum', 'dampening', 'weight_decay', 'nesterov']),
# dict_keys(['params', 'lr', 'momentum', 'dampening', 'weight_decay', 'nesterov'])]

This time the list has two elements. Note that their keys differ from those in the Adam example above, which is to be expected: different optimizers need different hyperparameters. For setting different lr values for different layers, the official docs give an example (link).

Either way the result is similar: every group ends up with at least 'params' and 'lr', which is all we need for what follows.
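For example, you can read (or manually overwrite) the effective learning rate of each group directly through param_groups:

for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'])   # per-group learning rate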

III. LRScheduler

All classes that dynamically modify the lr inherit from this class, so let's see what methods it contains. Source code link.

The initialization method, def __init__(self, optimizer, last_epoch=-1), takes two arguments. The first is any of the Optimizer subclasses mentioned above; the second indicates which epoch we are currently at. When we do not specify it, the default is -1, but step() is called once inside __init__, which sets it to 0.
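You can verify this right after constructing a scheduler (here reusing the optimizer from the example above):

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
print(scheduler.last_epoch)   # 0: __init__ has already called step() once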

Note that in PyTorch 1.1.0 and later the recommended order is: train first (optimizer.step()), then call scheduler.step().

After initialization, the scheduler also adds a field to the optimizer's param groups. Take a look:

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
print([group.keys() for group in optimizer.param_groups])
# [dict_keys(['params', 'lr', 'betas', 'eps', 'weight_decay', 
# 'amsgrad', 'initial_lr'])]

The new initial_lr field stores the original lr.

In the def step(self, epoch=None) method we usually do not need to pass the epoch argument, because it is incremented by 1 on every call. Inside this function a method that subclasses must override, get_lr(), is called; the new lr values it returns are written back into the optimizer on every call.
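Conceptually, step() does something like the following (a simplified sketch; the real implementation also handles closed-form lr computation and warnings):

def step(self, epoch=None):
    self.last_epoch = self.last_epoch + 1 if epoch is None else epoch
    for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
        param_group['lr'] = lr   # write the new learning rates back into the optimizer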

I had always wondered how the scheduler's step relates to the optimizer's step. Looking at the source code, the answer is that the two functions have nothing to do with each other: the scheduler's step only modifies lr, and both need to be called.

Let's compare the get_lr() of two schedulers. First, StepLR:

def get_lr(self):
    if (self.last_epoch == 0) or (self.last_epoch % self.step_size != 0):
        return [group['lr'] for group in self.optimizer.param_groups]
    return [group['lr'] * self.gamma
            for group in self.optimizer.param_groups]

This leaves the lr unchanged most of the time and multiplies it by gamma whenever the epoch counter reaches a nonzero multiple of step_size.
ExponentialLR multiplies by gamma at the end of every epoch, so the decay really is exponential:

def get_lr(self):
    if self.last_epoch == 0:
        return self.base_lrs
    return [group['lr'] * self.gamma
            for group in self.optimizer.param_groups]
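To make the StepLR behaviour concrete, here is a small self-contained check with a toy model (hypothetical setup, only used to print the schedule): with lr=0.1, step_size=5 and gamma=0.1, epochs 0-4 use 0.1, epochs 5-9 use 0.01, and so on.

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(2, 2)                       # toy model, for illustration only
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(12):
    # ... forward / backward would go here ...
    optimizer.step()                          # update parameters first (>= 1.1.0 order)
    print(epoch, optimizer.param_groups[0]['lr'])
    scheduler.step()                          # then let the scheduler adjust the lr
# prints 0.1 for epochs 0-4, 0.01 for 5-9, 0.001 for 10-11 (up to float rounding)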

Demo

from torch.optim.lr_scheduler import StepLR
import torch.utils.data as Data

scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
train_loader = Data.DataLoader(
        dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True, pin_memory=True)
for epoch in range(100):
    for X, y in train_loader:
        ...                        # forward pass and loss computation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()           # update parameters every batch
    scheduler.step()               # adjust the lr once per epoch, after training

IV. Dynamically adjusting the learning rate

For a detailed explanation and examples of PyTorch's methods for dynamically adjusting the learning rate, see the CSDN blog post "Detailed explanation and examples of PyTorch's methods for dynamically adjusting the learning rate" by "returning O on a snowy night".

There is also a less commonly used approach:

The code below reduces the learning rate to 10% of its previous value every 20 epochs by writing into param_groups directly; the decay is computed from the initial lr:

initial_lr = 0.1
optimizer = optim.SGD(gan.parameters(),
                      lr=initial_lr,
                      momentum=0.9,
                      weight_decay=0.0005)

# inside the epoch loop: decay from the *initial* lr, not the current one,
# otherwise the decay would compound every time this code runs
lr = initial_lr * (0.1 ** (epoch // 20))
for param_group in optimizer.param_groups:
    param_group['lr'] = lr
print(optimizer.param_groups[0]['lr'])
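For reference, the same schedule (lr multiplied by 0.1 every 20 epochs) can also be expressed with the built-in StepLR scheduler, assuming scheduler.step() is called once per epoch:

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)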

Topics: PyTorch Deep Learning CNN