Detailed explanation and practice of GPU operation in PyTorch models

What is GPU?
A GPU (Graphics Processing Unit) is a single-chip processor used mainly to manage and accelerate video and graphics rendering. GPU-accelerated computing means using the GPU together with the CPU to speed up applications.
Why use GPU?
Deep learning involves many vector and matrix operations, such as matrix multiplication, matrix addition, and matrix-vector multiplication. Deep learning algorithms such as backpropagation (BP), autoencoders, and CNNs can be written as matrix operations without explicit loops. When executed on a single-core CPU, however, each matrix operation is expanded into loops and is still, in essence, executed serially. The many-core architecture of a GPU contains thousands of stream processors, which can run the matrix operation in parallel and greatly shorten the computing time.
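To make the "matrix form versus loop form" point concrete, here is a small sketch (the function name matvec_loop is made up for illustration). It computes the same matrix-vector product once with explicit loops and once as a single matrix operation, which is the form a GPU can parallelize.

```python
import torch

def matvec_loop(W, x):
    """Matrix-vector product written as explicit (serial) Python loops."""
    rows, cols = W.shape
    out = torch.zeros(rows, dtype=W.dtype)
    for i in range(rows):
        for j in range(cols):
            out[i] += W[i, j] * x[j]
    return out

torch.manual_seed(0)
W = torch.randn(4, 3)
x = torch.randn(3)

y_loop = matvec_loop(W, x)   # element by element
y_mat = W @ x                # one matrix operation

# Both forms give the same result up to floating-point rounding.
print(torch.allclose(y_loop, y_mat, atol=1e-5))
```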
How to use GPU?
At present, many deep learning frameworks support GPU computation with only simple configuration. PyTorch supports GPUs: you can transfer data from host memory to GPU memory with the to(device) function, and if there are several GPUs you can choose which one (or which ones) to use. PyTorch generally applies the GPU to data structures such as tensors and models (both the network models under torch.nn and models you define yourself).

1, About the function interface of CUDA

1.1 torch.cuda

How do you view the GPU configuration of the platform? Enter the command nvidia-smi on the command line (this works on both Linux and Windows).

The first time you use this command on Windows, it may display:
'nvidia-smi' is not recognized as an internal or external command, operable program or batch file.
In that case we only need to add its directory to the PATH environment variable. The directory is shown below:

C:\Program Files\NVIDIA Corporation\NVSMI

After adding the environment variable, nvidia-smi displays the GPU configuration of the machine. GPU information can also be queried from PyTorch through torch.cuda, for example:

import torch

print(torch.cuda.is_available())      # Whether a GPU can be used; often used to check that the GPU build of PyTorch is installed
print(torch.cuda.current_device())    # Index of the current device
print(torch.cuda.get_device_name(0))  # Name of device 0
print(torch.cuda.device_count())      # Number of GPUs that can be used
print(torch.cuda.memory_allocated(device="cuda:0"))  # GPU memory currently allocated on device 0, in bytes
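Some of the calls above, such as current_device() and get_device_name(0), raise an error on a machine without a usable GPU. A guarded variant could look like the sketch below. The helper name describe_gpus and the dictionary keys are invented for illustration; the torch.cuda calls themselves are standard.

```python
import torch

def describe_gpus():
    """Return one dict per visible CUDA device (empty list on CPU-only machines)."""
    info = []
    if not torch.cuda.is_available():
        return info
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        info.append({
            "index": idx,
            "name": props.name,
            "total_memory_bytes": props.total_memory,
            "allocated_bytes": torch.cuda.memory_allocated(idx),
        })
    return info

gpus = describe_gpus()
for gpu in gpus:
    print(gpu)
if not gpus:
    print("No CUDA device available; running on CPU.")
```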

1.2 torch.device

torch.device is an attribute of every Tensor. It covers two device types, cpu and cuda, and is usually created in one of two ways:

# 1. Create from a string
torch.device('cuda')  # Current cuda device

# 2. Create from a string plus a device index
torch.device('cuda', 0)
torch.device('cpu', 0)

# There are several common ways to create Tensor objects on a GPU device:
torch.randn((2, 3), device=torch.device('cuda:0'))
torch.randn((2, 3), device='cuda:0')
torch.randn((2, 3), device=0)  # Legacy practice; only GPUs are supported

# Transfer data from host memory to the GPU, generally for tensors (the data we need) and models.
# For tensors (FloatTensor, LongTensor, etc.), call .to(device) or .cuda() directly.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Or: device = torch.device("cuda:0")
device1 = torch.device("cuda:1")
for batch_idx, (img, label) in enumerate(train_loader):
    img = img.to(device)
    label = label.to(device)
# For the model, the same .to(device) or .cuda() call puts the network into GPU memory.
# Instantiate the network
model = Net()
model.to(device)        # Use the GPU with index 0
# or model.to(device1)  # Use the GPU with index 1
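One detail worth checking when moving things to a device: Tensor.to returns a tensor on the target device and leaves the original where it was, while Module.to moves the module's parameters in place. The sketch below uses nn.Linear as a stand-in for the Net() above and falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(2, 3)      # created in host memory
x_dev = x.to(device)       # tensor on `device`; x itself stays on the CPU

model = nn.Linear(3, 1)    # stand-in for the Net() used above
model.to(device)           # moves the parameters in place

print(x.device, x_dev.device, next(model.parameters()).device)

y = model(x_dev)           # input and parameters are on the same device
print(y.shape)
```

Passing x instead of x_dev to a model that lives on the GPU raises a device-mismatch error, which is why the batch is moved inside the training loop.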

1.3 .to()

Moving objects between devices with .to() is another common way to select the GPU.
The form I personally prefer is:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

You can usually call:

to(device=None, dtype=None, non_blocking=False)
# The first parameter sets the target device, such as device = torch.device('cuda:0')
# or device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# The second is the data type, such as torch.float, torch.int, torch.double
# If the third parameter is True and the source tensor is stored in pinned memory, the copy to the GPU is asynchronous with respect to the host. Otherwise, this parameter has no effect
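As a rough illustration of the three parameters, the sketch below changes device and dtype in one call and uses non_blocking only where it can matter, that is, for pinned host memory copied to a CUDA device. The tensor sizes are arbitrary.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.arange(6).reshape(2, 3)    # int64 tensor on the CPU

# Change device and dtype in one call.
x_float = x.to(device=device, dtype=torch.float32)

if device.type == "cuda":
    pinned = torch.randn(1024, 1024).pin_memory()
    on_dev = pinned.to(device, non_blocking=True)   # copy may overlap with host work
    torch.cuda.synchronize()                        # wait until the copy has finished
else:
    # Without a GPU the flag is accepted but has no effect.
    on_dev = torch.randn(4, 4).to(device, non_blocking=True)

print(x_float.dtype, x_float.device, on_dev.device)
```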

1.4 use the specified GPU

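By default PyTorch uses the GPU with index 0. There are usually two ways to select a specific GPU: the CUDA_VISIBLE_DEVICES environment variable, set in the terminal or in code, and the torch.cuda.set_device function.

Terminal settings:  CUDA_VISIBLE_DEVICES=1,2  python    (for instance)
Set in code through the environment variable (before CUDA is initialized):
			import os
			os.environ["CUDA_VISIBLE_DEVICES"] = '1,2'
Set in code through set_device:
			import torch
			torch.cuda.set_device(1)    # the index 1 is only an illustration

However, the official recommendation is to use CUDA_VISIBLE_DEVICES rather than the set_device function.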
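A small sketch of the environment-variable approach follows. The value "0" is only a placeholder. The variable has to be set before CUDA is initialized, so the safest place is before importing torch. The devices that remain visible are renumbered from 0 inside the process.

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
# "0" is a placeholder; use the indices of the GPUs you want to expose.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

visible = os.environ["CUDA_VISIBLE_DEVICES"]
count = torch.cuda.device_count()
print("CUDA_VISIBLE_DEVICES =", visible)
print("devices visible to PyTorch:", count)
# The first exposed GPU is addressed as "cuda:0", whatever its physical index.
```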

1.5 multi GPU training

A machine often has several GPUs, and training can then be run in parallel to improve speed. Parallel training can be divided into data parallelism and model parallelism.
Data parallelism means that every GPU holds the same model and the data is distributed evenly across the GPUs for training;
Model parallelism means that each GPU trains a different part of the model, and all of them work on the same batch of data.
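Data parallelism is demonstrated with nn.DataParallel in the example section. For model parallelism, a minimal sketch could look like the following. The class name TwoDeviceNet and the layer sizes are hypothetical, and both halves fall back to a single device when two GPUs are not available.

```python
import torch
import torch.nn as nn

# Use two GPUs when present; otherwise both halves share one device.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
elif torch.cuda.is_available():
    dev0 = dev1 = torch.device("cuda:0")
else:
    dev0 = dev1 = torch.device("cpu")

class TwoDeviceNet(nn.Module):
    """Hypothetical model whose two halves live on different devices."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(13, 16).to(dev0)
        self.part2 = nn.Linear(16, 1).to(dev1)

    def forward(self, x):
        h = torch.relu(self.part1(x.to(dev0)))
        return self.part2(h.to(dev1))   # hand the activations to the second device

net = TwoDeviceNet()
out = net(torch.randn(8, 13))
print(out.shape, out.device)
```

The whole batch passes through both parts, so the GPUs work one after the other unless the batch is additionally pipelined.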

2, Training example code display

2.1 data parallel processing

This code takes the Boston house price data as an example, with a total of 506 samples and 13 features. The data is divided into a training set and a test set, and data.DataLoader then converts it into a batch-loadable form. Parallelism is provided by nn.DataParallel. The environment has two GPUs. Of course, the amount of data is very small, so there is no real need for nn.DataParallel; it is used here only to illustrate how it works.

from sklearn import datasets
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F

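# Load data
# Note: load_boston was removed in scikit-learn 1.2, so this requires an older version
boston = datasets.load_boston()
X, y = (boston.data, boston.target)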
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Combined training data and labels
myset = list(zip(X_train, y_train))

# Convert the data to batch loading mode with a batch size of 128, and shuffle the data
from torch.utils import data
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dtype = torch.FloatTensor
train_loader = data.DataLoader(myset,batch_size=128,shuffle=True)

# Define network
class Net1(nn.Module):
    """Build the network with Sequential(), which groups the layers of the network together."""

    def __init__(self, in_dim, n_hidden_1, n_hidden_2, out_dim):
        super(Net1, self).__init__()
        self.layer1 = torch.nn.Sequential(nn.Linear(in_dim, n_hidden_1))
        self.layer2 = torch.nn.Sequential(nn.Linear(n_hidden_1, n_hidden_2))
        self.layer3 = torch.nn.Sequential(nn.Linear(n_hidden_2, out_dim))

    def forward(self, x):
        x1 = F.relu(self.layer1(x))
        x1 = F.relu(self.layer2(x1))
        x2 = self.layer3(x1)
        # Displays the data size allocated for each GPU
        print("\tIn Model: input size", x.size(), "output size", x2.size())
        return x2

if __name__ == '__main__':
    # Convert the model to multi-GPU parallel processing format
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    # Instantiate the network
    model = Net1(13, 16, 32, 1)
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs")
        # dim = 0 [128, xxx] -> [64, ...], [64, ...] on 2 GPUs
        model = nn.DataParallel(model)
    # Move the (possibly wrapped) model to the device and show its structure
    model.to(device)
    print(model)
    # Select optimizer and loss function
    optimizer_orig = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_func = torch.nn.MSELoss()
    # Model training and visualization of loss values
    # from torch.utils.tensorboard import SummaryWriter
    # writer = SummaryWriter(log_dir='logs')
    for epoch in range(100):
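        for features, label in train_loader:
            inputs = features.type(dtype).to(device)
            # Reshape labels to [N, 1] so they match the model output
            label = label.type(dtype).to(device).unsqueeze(1)
            output = model(inputs)
            loss = loss_func(output, label)
            # Back propagation
            optimizer_orig.zero_grad()
            loss.backward()
            optimizer_orig.step()
            print("Outside: input size", inputs.size(), "output_size", output.size())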
        # writer.add_scalar('train_loss_paral', loss, epoch)

Operation results:

Let's use 2 GPUs
DataParallel(
  (module): Net1(
    (layer1): Sequential(
      (0): Linear(in_features=13, out_features=16, bias=True)
    )
    (layer2): Sequential(
      (0): Linear(in_features=16, out_features=32, bias=True)
    )
    (layer3): Sequential(
      (0): Linear(in_features=32, out_features=1, bias=True)
    )
  )
)

It can be seen from the running results that each batch (batch size = 128) is split into two parts of size 64, which are placed on different GPUs.

In Model: input size torch.Size([64, 13]) output size torch.Size([64, 1])
In Model: input size torch.Size([64, 13]) output size torch.Size([64, 1])
Outside: input size torch.Size([128, 13]) output_size torch.Size([128, 1])
In Model: input size torch.Size([64, 13]) output size torch.Size([64, 1])
In Model: input size torch.Size([64, 13]) output size torch.Size([64, 1])
Outside: input size torch.Size([128, 13]) output_size torch.Size([128, 1])
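The sizes in this output can be reproduced without any GPU. The sketch below only approximates the scatter and gather steps with torch.chunk and torch.cat; it is not DataParallel's internal code, and the sum over features merely stands in for the forward pass.

```python
import torch

batch = torch.randn(128, 13)   # one batch, as seen "Outside" the model
n_gpus = 2                     # assumed number of GPUs

# Split along dim 0, the batch dimension.
chunks = torch.chunk(batch, n_gpus, dim=0)
for i, chunk in enumerate(chunks):
    print(f"replica {i}: input size {tuple(chunk.shape)}")

# The per-replica outputs are concatenated back along dim 0.
outputs = [c.sum(dim=1, keepdim=True) for c in chunks]   # stand-in for the forward pass
gathered = torch.cat(outputs, dim=0)
print("gathered output size", tuple(gathered.shape))
```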


The large fluctuations in the loss curve are due to mini-batch training and the lack of data preprocessing. With standardized data the curve should be smoother. You can try it.
DistributedDataParallel is mostly used for distributed training, but it can also be used for multi-GPU training on a single machine. Its configuration is a little more troublesome than that of nn.DataParallel, but the training speed and results are better. The specific configuration is:
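For readers who want to try the standardization, one possible sketch is shown below. It uses synthetic tensors as stand-ins for X_train and X_test (404 and 102 rows, 13 columns on very different scales) and takes the mean and standard deviation from the training set only.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in for the real features: 13 columns with very different scales.
scales = torch.logspace(-2, 2, steps=13)
X_train = torch.randn(404, 13) * scales + scales
X_test = torch.randn(102, 13) * scales + scales

# Statistics come from the training set only and are reused for the test set.
mean = X_train.mean(dim=0, keepdim=True)
std = X_train.std(dim=0, keepdim=True).clamp_min(1e-8)   # guard against zero variance

X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print(X_train_std.mean(dim=0).abs().max())   # close to 0
print(X_train_std.std(dim=0))                # close to 1 for every column
```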

#Initialize the backend using nccl
#Model parallelization
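The two steps named in these comments could look roughly like the sketch below. Several things here are assumptions made so the example can run as a single process: the file-based init_method, rank=0, world_size=1, the gloo fallback when no GPU is present, and nn.Linear as a stand-in for the real network.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

use_cuda = torch.cuda.is_available()
backend = "nccl" if use_cuda else "gloo"   # gloo is the CPU fallback
device = torch.device("cuda:0" if use_cuda else "cpu")

# Initialize the backend. Single-process rendezvous through a temporary
# file is an illustrative setup only.
init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
dist.init_process_group(
    backend=backend,
    init_method=f"file://{init_file}",
    rank=0,
    world_size=1,
)
world_size = dist.get_world_size()

# Model parallelization: move the model to its device, then wrap it.
model = nn.Linear(13, 1).to(device)        # stand-in for the real network
ddp_model = DDP(model, device_ids=[0] if use_cuda else None)

out = ddp_model(torch.randn(8, 13, device=device))
out.sum().backward()                       # gradients are synchronized across processes
print(out.shape, world_size)

dist.destroy_process_group()
```

With more than one process, a launcher such as torchrun normally supplies the rank and world size through environment variables, in which case the explicit init_method, rank, and world_size arguments would be dropped.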

When running on a single machine, start the program as follows:

