[deep learning pytorch] regularization

Posted by Sk~ on Thu, 27 Jan 2022 04:18:46 +0100

The basic concepts of regularization have been covered in an earlier post. Here is only a brief look at how regularization is implemented.

 

weight decay

Complexity of a model -- how do we measure the distance between a function and zero? -- the Lp norm

A linear model with L2 regularization constitutes the classical ridge regression algorithm; linear regression with an L1 penalty is usually called lasso regression. The L2 norm is the one most often used in practice.

One reason to use the L2 norm is that it imposes an outsized penalty on the large components of the weight vector, which biases the learning algorithm toward models that distribute weight evenly across a large number of features.

In practice, this can make the model more robust to observation error in any single variable. By contrast, an L1 penalty leads the model to concentrate weight on a small set of features and drives the other weights to zero.
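
A quick illustrative comparison of the two norms in PyTorch (the values below are chosen arbitrarily):

import torch

w = torch.tensor([1.0, -2.0, 3.0])
print(torch.norm(w))       # L2 norm: sqrt(1 + 4 + 9) ≈ 3.74
print(torch.abs(w).sum())  # L1 norm: 1 + 2 + 3 = 6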

 

The loss function of the L2-regularized linear model, balancing fitting the training data against keeping the weights small:

L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2

where L(\mathbf{w}, b) is the unregularized training loss and \lambda \ge 0 is the regularization hyperparameter.

The minibatch stochastic gradient descent update for L2-regularized regression is as follows:

\mathbf{w} \leftarrow (1 - \eta\lambda)\, \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)
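
The (1 - ηλ) factor is why this is called weight decay: at every step the weights are shrunk by a constant multiplicative factor before the usual gradient update. A tiny sketch with made-up numbers:

# Made-up values, only to show the multiplicative shrinkage of the weights
lr, lambd = 0.1, 3.0
w = 1.0
for step in range(5):
    w *= (1 - lr * lambd)     # data-loss gradient term omitted here
    print(step, round(w, 4))  # 0.7, 0.49, 0.343, ...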

 

Implementation from scratch

For the from-scratch linear model, only the loss function needs to be modified, by adding the L2 penalty term; see the sketch below.
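
A minimal sketch of that modification (assuming the squared_loss, linreg and sgd helpers from the from-scratch linear regression, and a weight decay hyperparameter lambd):

import torch

def l2_penalty(w):
    # Half the squared L2 norm of the weight vector
    return torch.sum(w.pow(2)) / 2

# Inside the training loop, the penalty is simply added to the data loss:
#   l = squared_loss(linreg(X, w, b), y) + lambd * l2_penalty(w)
#   l.sum().backward()
#   sgd([w, b], lr, batch_size)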

 

Concise implementation

For the concise implementation, only the optimizer of the linear model needs to be modified:

trainer = torch.optim.SGD([
    {"params": net[0].weight, "weight_decay": wd},
    {"params": net[0].bias}], lr=lr)

When instantiating the optimizer, the weight_decay argument specifies the weight decay hyperparameter directly.

By default, PyTorch decays both weights and biases. Here weight_decay is set only for the weight, so the bias parameter b is not decayed.
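
For contrast, a minimal sketch of the default behaviour (using the same net, lr and wd as above): passing weight_decay directly to the optimizer applies the decay to every parameter, bias included.

# Decays all parameters, including the bias b
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)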

 

dropout method

Another aspect of model simplicity is smoothness, that is, the function should not be sensitive to small changes in its input.

The idea is to inject noise into each layer of the network during training, before computing the subsequent layer. When training a deep network with many layers, the injected noise enforces smoothness on the input-output mapping. This idea is called dropout.

During forward propagation, dropout computes each internal layer while injecting noise into it, and this has become a standard technique for training neural networks. The method is called dropout because, on the surface, we drop out some neurons during training. In each iteration, standard dropout sets a fraction of the nodes in the current layer to zero before computing the next layer.

h' = \begin{cases} 0 & \text{with probability } p \\ \dfrac{h}{1-p} & \text{otherwise} \end{cases}

so that the expectation stays unchanged: E[h'] = h.

 

Implementation from scratch

import torch
from torch import nn
from d2l import torch as d2l


def dropout_layer(X, dropout):
    # dropout is the probability of zeroing an element
    assert 0 <= dropout <= 1
    # Drop everything
    if dropout == 1:
        return torch.zeros_like(X)
    # Keep everything
    if dropout == 0:
        return X
    # Keep each element with probability 1 - dropout, then rescale the
    # survivors so that the expected value of each element is unchanged
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)
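
# Quick sanity check (illustrative, not part of the original post): with
# dropout=0.5 roughly half of the entries are zeroed and the survivors are
# scaled by 1/(1-0.5)=2, so each element keeps its expected value.
X_demo = torch.arange(16, dtype=torch.float32).reshape((2, 8))
print(dropout_layer(X_demo, 0.0))   # unchanged
print(dropout_layer(X_demo, 0.5))   # about half zeroed, rest doubled
print(dropout_layer(X_demo, 1.0))   # all zeros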

num_inputs, num_outputs, num_hiddens1, num_hiddens2 = 784, 10, 256, 256
dropout1, dropout2 = 0.2, 0.5
class Net(nn.Module):
    def __init__(self, num_inputs, num_outputs, num_hiddens1, num_hiddens2,
                 is_training=True):
        super(Net, self).__init__()
        self.num_inputs = num_inputs
        self.training = is_training
        self.lin1 = nn.Linear(num_inputs, num_hiddens1)
        self.lin2 = nn.Linear(num_hiddens1, num_hiddens2)
        self.lin3 = nn.Linear(num_hiddens2, num_outputs)
        self.relu = nn.ReLU()

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((-1, self.num_inputs))))
        if self.training == True:
            H1 = dropout_layer(H1, dropout1)
        H2 = self.relu(self.lin2(H1))
        if self.training == True:
            H2 = dropout_layer(H2, dropout2)
        out = self.lin3(H2)
        return out

net = Net(num_inputs, num_outputs, num_hiddens1, num_hiddens2)
num_epochs, lr, batch_size = 10, 0.5, 256
loss = nn.CrossEntropyLoss(reduction='none')
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)

Concise implementation

net = nn.Sequential(nn.Flatten(),
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(dropout1),
        nn.Linear(256, 256),
        nn.ReLU(),
        nn.Dropout(dropout2),
        nn.Linear(256, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)

trainer = torch.optim.SGD(net.parameters(), lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
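
One detail worth noting (not covered above): nn.Dropout is only active while the model is in training mode. Calling net.eval() turns the dropout layers into no-ops, so predictions at inference time are deterministic. A minimal sketch:

net.eval()                   # evaluation mode: dropout disabled
X_sample = torch.rand(1, 784)
print(torch.allclose(net(X_sample), net(X_sample)))  # True: no dropout noise
net.train()                  # back to training mode: dropout active again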

 

Topics: Deep Learning