Where is the video memory used?
Generally, when training a neural network, video memory is mainly occupied by the network model and by intermediate variables.
- In the network model, the parameters of convolution layers, fully connected layers and normalization layers occupy video memory. Activation layers and pooling layers have no parameters, so they add essentially nothing to the model's own footprint.
- Intermediate variables include feature maps and optimizer state, which consume the most video memory.
- PyTorch itself also occupies some video memory, but not much.
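To make the gap between parameters and feature maps concrete, here is a rough back-of-the-envelope sketch. The layer shape, batch size and image size are hypothetical values chosen for illustration, and 4 bytes per value assumes float32 storage.

```python
BYTES_PER_FLOAT32 = 4  # assumption: everything is stored as float32


def conv2d_param_bytes(in_channels, out_channels, kernel_size):
    """Bytes needed for the weights and biases of one convolution layer."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    biases = out_channels
    return (weights + biases) * BYTES_PER_FLOAT32


def feature_map_bytes(batch, channels, height, width):
    """Bytes needed to store one output feature map for a whole batch."""
    return batch * channels * height * width * BYTES_PER_FLOAT32


# Hypothetical first layer: 3 -> 64 channels, 3x3 kernel, batch of 8 images at 224x224
param_bytes = conv2d_param_bytes(3, 64, 3)
activation_bytes = feature_map_bytes(8, 64, 224, 224)

print(param_bytes)       # 7168
print(activation_bytes)  # 102760448
```

In this example the output feature map needs about 100 MB while the layer's parameters need about 7 KB, which is why most of the tips below target intermediate variables.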
Tip 1: use in-place operations
An in-place operation literally operates on a variable in place. In PyTorch, this means modifying a tensor in its existing memory without allocating new memory. There are three ways to do this:
- Use an activation layer with the inplace argument set to True, such as nn.ReLU(inplace=True)
- Use PyTorch methods that operate in place. These generally have a trailing underscore "_" in the name, such as tensor.add_(), tensor.scatter_(), F.relu_()
- Use in-place operators, such as y += x, y *= x
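The three styles can be checked with a small sketch. It compares the tensor's storage address before and after each operation; the tensor values are arbitrary and the example runs on the CPU.

```python
import torch
import torch.nn.functional as F

x = torch.ones(4)
address = x.data_ptr()          # address of the memory that holds x

x.add_(1)                       # in-place method (trailing underscore)
x *= 3                          # in-place operator
F.relu_(x)                      # in-place functional activation

in_place_kept_memory = x.data_ptr() == address

y = x + 1                       # out-of-place: allocates a new tensor
out_of_place_new_memory = y.data_ptr() != address

print(in_place_kept_memory, out_of_place_new_memory)
```

Keep in mind that autograd raises an error if an in-place operation overwrites a value it still needs for the backward pass, so in-place variants cannot be used everywhere.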
Tip 2: avoid intermediate variables
In the forward function, avoid creating too many intermediate variables and try to reuse memory that has already been allocated. For example, the following code uses too many intermediate variables and occupies video memory unnecessarily:
def forward(self, x):
    x0 = self.conv0(x)  # Input layer
    x1 = F.relu_(self.conv1(x0) + x0)
    x2 = F.relu_(self.conv2(x1) + x1)
    x3 = F.relu_(self.conv3(x2) + x2)
    x4 = F.relu_(self.conv4(x3) + x3)
    x5 = F.relu_(self.conv5(x4) + x4)
    x6 = self.conv(x5)  # Output layer
    return x6
In order to reduce the occupation of video memory, the forward function can be modified as follows:
def forward(self, x):
    x = self.conv0(x)  # Input layer
    x = F.relu_(self.conv1(x) + x)
    x = F.relu_(self.conv2(x) + x)
    x = F.relu_(self.conv3(x) + x)
    x = F.relu_(self.conv4(x) + x)
    x = F.relu_(self.conv5(x) + x)
    x = self.conv(x)  # Output layer
    return x
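The two versions compute the same result, but their video memory usage can differ considerably. The author reports that the second version saves nearly 90% of the video memory used by the first. The actual saving depends on the model and on which tensors autograd must keep for the backward pass.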
Tip 3: optimize the network model
The video memory occupied by the network model mainly comes from the parameters of convolution layers, fully connected layers and normalization layers. Optimization methods include but are not limited to:
- Reduce the number of convolution kernels (that is, reduce the number of channels of the output feature map)
- Replace the fully connected layer nn.Linear() with global pooling, nn.AdaptiveAvgPool2d()
- Do not use normalization layers
- Avoid skip connections that span too many layers (this avoids keeping intermediate variables alive)
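As an illustration of the global pooling item, the sketch below counts the parameters of two classification heads. The (64, 7, 7) feature map and the 10 classes are hypothetical, and the second head keeps a small nn.Linear after the pooling, which is one common variant rather than necessarily the exact design meant above.

```python
import torch
import torch.nn as nn


def count_parameters(module):
    """Total number of trainable values in a module."""
    return sum(p.numel() for p in module.parameters())


# Head A: flatten the whole feature map and feed it to a fully connected layer
fc_head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, 10))

# Head B: global average pooling reduces each channel to one value first
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))

features = torch.randn(2, 64, 7, 7)      # hypothetical backbone output
out_fc, out_gap = fc_head(features), gap_head(features)

fc_params = count_parameters(fc_head)    # 31370
gap_params = count_parameters(gap_head)  # 650
print(fc_params, gap_params)
```

Both heads produce an output of shape (2, 10), but the pooled head needs roughly 48 times fewer parameters in this example.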
Tip 4: reduce BATCH_SIZE
- When training convolutional neural networks, epoch is the number of times the whole dataset is trained on, and batch_size is the number of samples in each of the batches that an epoch is split into for training.
- Changing batch_size is a common technique. If video memory runs out during training, reducing batch_size is generally the first thing to try, but batch_size cannot be made arbitrarily small: too small a value makes training unstable and can prevent the network from converging. Earlier reports suggest that values between 18 and 32 work best. In general, a batch_size greater than 18 should be fine.
Tip 5: split BATCH
Splitting a batch is different in nature from reducing batch_size in Tip 4. Splitting a batch can be understood as three steps. For example, with an original batch_size=64:
- Split the batch into two small batches with batch_size=32
- Feed each small batch through the network, combine the results, and then calculate the loss against the target values
- Back-propagate the loss
Splitting a batch can also be understood as adding up the losses of two forward passes and then back-propagating once, whereas reducing batch_size means back-propagating once for every forward pass.
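The three steps can be sketched as follows. The single linear layer is a hypothetical stand-in for a real network, and joining the two outputs with torch.cat is an assumption about how the results are combined. The last line runs the unsplit batch only to confirm that both routes give the same loss.

```python
import torch
from torch.nn.functional import mse_loss

torch.manual_seed(0)
net = torch.nn.Linear(100, 5)                      # hypothetical stand-in network
x, y = torch.randn(64, 100), torch.randn(64, 5)    # one batch with batch_size=64

# Step 1: split the batch into two small batches of 32 samples each
x_parts = torch.chunk(x, 2, dim=0)

# Step 2: run each small batch through the network, join the outputs,
# and compute the loss against the full target
y_pred = torch.cat([net(part) for part in x_parts], dim=0)
loss_split = mse_loss(y_pred, y)

# Step 3: back-propagate once for the whole batch
loss_split.backward()

# Reference: the same batch in a single forward pass
loss_full = mse_loss(net(x), y)
print(loss_split.item(), loss_full.item())
```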
Tip 6: reduce PATCH_SIZE
- In convolutional neural network training, patch_size is the size of the image fed into the network, i.e. (H*W).
- The size of the input patch strongly affects the size of all subsequent feature maps. Patches such as [64 * 64] or [128 * 128] may be used during training. If video memory is insufficient, the patch size can be reduced further, for example to [32 * 32] or [16 * 16].
- However, this method has drawbacks: it may greatly harm the generalization ability of the network, so it is generally not recommended. If you do crop, make sure to crop at random positions in the original image.
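A minimal sketch of random cropping, assuming images are stored as (C, H, W) tensors. The helper name random_crop and the image size are hypothetical.

```python
import torch


def random_crop(image, patch_size):
    """Cut a random patch_size x patch_size patch out of a (C, H, W) image tensor."""
    _, height, width = image.shape
    # Pick the top-left corner so that the patch stays inside the image
    top = torch.randint(0, height - patch_size + 1, (1,)).item()
    left = torch.randint(0, width - patch_size + 1, (1,)).item()
    return image[:, top:top + patch_size, left:left + patch_size]


image = torch.randn(3, 128, 128)   # hypothetical training image
patch = random_crop(image, 32)
print(patch.shape)                 # torch.Size([3, 32, 32])
```

Calling the helper again for every training step gives each patch a different position, which is the random cutting the last item asks for.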
Tip 7: optimize loss summation
After each batch is trained, a loss value is obtained. To calculate the loss of an epoch, you need to accumulate the losses of all its batches. However, each batch loss is a tensor on the GPU, and if these tensors are summed directly, the accumulated epoch loss also stays on the GPU and occupies video memory. You can avoid this by accumulating a plain Python number instead:
epoch_loss += batch_loss.detach().item()  # Accumulate the epoch loss as a Python float
Tip 8: adjust training precision
- Reduce training precision
PyTorch uses 32-bit floating-point numbers by default. Networks that do not require high precision do not need this, so they can be trained with 16-bit floating-point data instead. Note that the data and the network model must both be converted to 16-bit floating point, otherwise an error is raised. The implementation is simple. However, an error may be reported if the Adam optimizer is selected, whereas the SGD optimizer works without errors. The specific steps are as follows:
model.cuda().half()  # Convert the network model to half precision
# Convert the network input and target to half precision
# (plain tensors are enough; the Variable wrapper has been unnecessary since PyTorch 0.4)
x, y = x.cuda().half(), y.cuda().half()
- Mixed precision training
Mixed precision training means that, when a GPU is used to train the network, the relevant data are stored and multiplied in half precision to speed up computation, and accumulated in full precision to avoid rounding errors. This can cut training time roughly in half. Before PyTorch 1.6, the apex library provided by NVIDIA was used for mixed precision training. Since then, the amp module built into PyTorch is used. The example code is as follows:
import torch
from torch.nn.functional import mse_loss
from torch.cuda.amp import autocast, GradScaler

EPOCH = 10  # Number of training iterations
LEARNING_RATE = 1e-3  # Learning rate

x, y = torch.randn(3, 100).cuda(), torch.randn(3, 5).cuda()  # Define network input and target
myNet = torch.nn.Linear(100, 5).cuda()  # Instantiate the network: one fully connected layer
optimizer = torch.optim.SGD(myNet.parameters(), lr=LEARNING_RATE)  # Define optimizer
scaler = GradScaler()  # Gradient scaler

for i in range(EPOCH):  # Training loop
    optimizer.zero_grad()  # Clear gradients left over from the previous iteration
    with autocast():  # Run the forward pass in mixed precision
        y_pred = myNet(x)
        loss = mse_loss(y_pred, y)
    scaler.scale(loss).backward()  # Multiply the loss by the scale factor and back-propagate
    scaler.step(optimizer)  # Unscale the gradients, then run optimizer.step() if they are finite
    scaler.update()  # Update the scale factor
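To see the storage difference between the two precisions directly, the following sketch compares the bytes needed by the same tensor in float32 and float16. The tensor size is arbitrary and the example runs on the CPU.

```python
import torch

full = torch.zeros(1024, 1024)     # float32 by default
half = full.half()                 # same values stored as float16

# element_size() is the number of bytes per value, nelement() the number of values
full_bytes = full.element_size() * full.nelement()
half_bytes = half.element_size() * half.nelement()

print(full_bytes, half_bytes)      # 4194304 2097152
```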
Tip 9: split the training process
If the network being trained is very deep (resnet101 is one example), training it directly requires a great deal of video memory, and the whole network may not fit in one pass. In this case, the network can be divided into several smaller sub-networks that are processed one after another. checkpoint is PyTorch's solution for video memory shortages that trades time for space: the intermediate activations inside each checkpointed segment are discarded after the forward pass and recomputed during back propagation, so fewer intermediate variables are held at once. The following is example code.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Helper that builds a convolution + batch normalization + ReLU block
def conv(inplanes, outplanes, kernel_size, stride, padding):
    return nn.Sequential(
        nn.Conv2d(inplanes, outplanes, kernel_size, stride, padding),
        nn.BatchNorm2d(outplanes),
        nn.ReLU(),
    )

class Net(nn.Module):  # Custom network structure, divided into three sub-networks
    def __init__(self):
        super().__init__()
        self.conv0 = conv(3, 32, 3, 1, 1)
        self.conv1 = conv(32, 32, 3, 1, 1)
        self.conv2 = conv(32, 64, 3, 1, 1)
        self.conv3 = conv(64, 64, 3, 1, 1)
        self.conv4 = nn.Linear(64, 10)  # Fully connected layer

    def segment0(self, x):  # Sub-network 1
        x = self.conv0(x)
        return x

    def segment1(self, x):  # Sub-network 2
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        return x

    def segment2(self, x):  # Sub-network 3
        x = x.mean(dim=(2, 3))  # Average over H and W so the linear layer receives (N, 64)
        x = self.conv4(x)
        return x

    def forward(self, x):
        # use_reentrant=False (PyTorch 1.11+) lets gradients reach the parameters
        # even when the input x itself does not require gradients
        x = checkpoint(self.segment0, x, use_reentrant=False)  # Using checkpoint
        x = checkpoint(self.segment1, x, use_reentrant=False)
        x = checkpoint(self.segment2, x, use_reentrant=False)
        return x
In this code, one network structure is divided into three sub-networks for training. It works the same way whether or not nn.Sequential() is used; a custom sub-network simply contains a few more layers.
Tip 10: clean up memory garbage
- In general, Python does not always release the resources held by a variable as soon as it is no longer used. At the beginning of each training cycle, you can use the following code to collect memory garbage.
import gc
gc.collect()  # Clean up memory
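One case where memory is not released right away is a reference cycle, which only the garbage collector can free. The sketch below shows this with a hypothetical Node class; the automatic collector is switched off for the duration so that the effect of the explicit gc.collect() call is visible.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.other = None


gc.disable()                    # keep the automatic collector out of the demonstration
a, b = Node(), Node()
a.other, b.other = b, a         # reference cycle: a -> b -> a
probe = weakref.ref(a)          # lets us check whether the object still exists
del a, b                        # the names are gone, but the cycle keeps both objects alive

alive_before = probe() is not None
gc.collect()                    # the cycle collector finds and frees the pair
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)   # True False
```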
Tip 11: use gradient accumulation
- Because video memory is limited, a large batch_size cannot be used when training large network models, yet a larger batch_size generally makes the network model converge faster.
- Gradient accumulation scales the loss of each of several batches, back-propagates each one so that the gradients add up, and updates the parameters only once at the end. It is similar to the idea of splitting batches in Tip 5 (but Tip 5 splits a large batch and still trains on the large batch, while gradient accumulation trains on small batches).
- Gradient accumulation can therefore be used to simulate the effect of a large batch_size. The implementation is as follows:
output = myNet(input_)  # Feed the input to the network
loss = mse_loss(output, target)  # Calculate loss
loss = loss / 4  # Scale the loss because gradients are accumulated over 4 batches
loss.backward()  # Back propagation; gradients add up across calls
if step % 4 == 0:  # Every 4th step (assumes step is counted from 1)
    optimizer.step()  # Update network parameters
    optimizer.zero_grad()  # Zero the gradients
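The claim that accumulation simulates a large batch_size can be checked numerically. In this sketch a hypothetical linear layer stands in for the network; it compares the gradient of one batch of 8 samples with the gradient accumulated over 4 small batches of 2 samples, each loss divided by 4.

```python
import torch
from torch.nn.functional import mse_loss

torch.manual_seed(0)
net = torch.nn.Linear(100, 5)                     # hypothetical stand-in network
x, y = torch.randn(8, 100), torch.randn(8, 5)
ACCUM_STEPS = 4                                   # number of small batches per update

# Reference: gradient of one large batch of 8 samples
net.zero_grad()
mse_loss(net(x), y).backward()
grad_large = net.weight.grad.clone()

# Gradient accumulation: 4 small batches of 2 samples, loss scaled by 1/4
net.zero_grad()
for x_small, y_small in zip(x.chunk(ACCUM_STEPS), y.chunk(ACCUM_STEPS)):
    loss = mse_loss(net(x_small), y_small) / ACCUM_STEPS
    loss.backward()                               # gradients add up in .grad
grad_accumulated = net.weight.grad.clone()

gradients_match = torch.allclose(grad_large, grad_accumulated, atol=1e-6)
print(gradients_match)
```

The match is exact up to floating-point error here because the small batches have equal size and the loss is a mean. Layers that depend on batch statistics, such as batch normalization, would still see the small batches.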
Tip 12: remove unnecessary gradients
Gradient-related operations are not needed when running test code, so unnecessary gradients can be removed to save video memory. This includes but is not limited to the following:
- Call model.eval() to put the model in evaluation mode, so that normalization layers stop updating their statistics and dropout (random deactivation of neurons) is disabled.
- Put the test code inside the context manager with torch.no_grad(): so that no computation graph is built.
- Clear the gradients at the beginning of each training or test cycle
myNet.zero_grad()  # Clear the gradients of the model parameters
optimizer.zero_grad()  # Clear the gradients of the parameters managed by the optimizer
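The first two items can be sketched as below. The tiny network with a dropout layer is hypothetical and the example runs on the CPU; it shows that an output computed under torch.no_grad() carries no graph, while the same call in training mode does.

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(100, 5), torch.nn.Dropout(0.5))
x = torch.randn(3, 100)

net.eval()                      # evaluation mode: dropout becomes a no-op
with torch.no_grad():           # no computation graph is recorded in this block
    y_eval = net(x)

net.train()                     # back to training mode
y_train = net(x)                # graph is recorded for back propagation

print(y_eval.requires_grad, y_train.requires_grad)   # False True
```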
Tip 13: periodically clean up video memory
- Similarly, you can call the function provided by PyTorch at the beginning of each training cycle to release video memory that is cached but unused.
torch.cuda.empty_cache() # Release video memory
PyTorch's caching allocator keeps the memory of freed tensors for reuse, so nvidia-smi continues to show that memory as used until this statement hands it back to the driver. In principle, PyTorch releases a variable automatically once it is no longer referenced and reuses the cached memory for new tensors, so this statement may not help the training process itself, but I think it is somewhat useful.
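The difference between memory used by live tensors and memory held in the cache can be observed with a short sketch. The helper name cuda_memory_mib is hypothetical, and the GPU part only runs when CUDA is available.

```python
import torch


def cuda_memory_mib():
    """Return (allocated, reserved) video memory in MiB, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib


if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")   # about 4 MiB of float32 values
    del x                                        # tensor freed, block stays in PyTorch's cache
    before = cuda_memory_mib()
    torch.cuda.empty_cache()                     # hand cached blocks back to the driver
    after = cuda_memory_mib()
    print("reserved before:", before[1], "MiB, after:", after[1], "MiB")
```

Here allocated counts memory used by live tensors, and reserved also includes the cached blocks, which is roughly what nvidia-smi attributes to the process on top of the CUDA context.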
Tip 14: use downsampling more often
Downsampling is implemented much like pooling, but it is not limited to pooling: a convolution with a stride greater than 1 can replace pooling. The feature map produced by downsampling is smaller, so the layers that follow naturally need less video memory. It can be implemented in the following two ways:
nn.Conv2d(32, 32, 3, 2, 1)  # Downsampling with a stride greater than 1

nn.Conv2d(32, 32, 3, 1, 1)  # Convolution followed by pooling for downsampling
nn.MaxPool2d(2, 2)
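A quick shape check of both variants, using a hypothetical 64x64 input with 32 channels on the CPU:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 64, 64)                      # hypothetical input feature map

strided = nn.Conv2d(32, 32, 3, 2, 1)                # stride 2 halves H and W
pooled = nn.Sequential(nn.Conv2d(32, 32, 3, 1, 1),  # stride 1 keeps H and W
                       nn.MaxPool2d(2, 2))          # pooling then halves them

out_strided = strided(x)
out_pooled = pooled(x)
print(out_strided.shape, out_pooled.shape)          # both torch.Size([1, 32, 32, 32])
```

Both routes end with a feature map a quarter of the original size. The strided convolution gets there without ever materializing the full-resolution convolution output.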
Tip 15: use the del statement
The del statement deletes a variable completely; the variable must be recreated before it can be used again. What is deleted is the variable name, not necessarily the data in memory, because the data may still be referenced by other variables. Usage is very simple, for example:
def forward(self, x):
    input_ = x
    x = F.relu_(self.conv1(x) + input_)
    x = F.relu_(self.conv2(x) + input_)
    x = F.relu_(self.conv3(x) + input_)
    del input_  # Delete variable input_
    x = self.conv4(x)  # Output layer
    return x