A brief introduction to distributed training with DDP in PyTorch (with examples, quick start)

Posted by sendoh07 on Tue, 08 Mar 2022 16:16:02 +0100

DDP principle

DistributedDataParallel (DDP) is PyTorch's native support for multi-machine, multi-GPU distributed training. This post briefly summarizes how to use DDP, tests it on multiple GPUs, and walks through actual code.

voxceleb_trainer: an open-source speaker (voiceprint) recognition toolkit, simple and easy to use, well suited to researchers.

An intuitive understanding:

  1. In DDP mode, n processes are started, and each process loads the model onto one GPU. The models are identical (n copies on n GPUs), which alleviates the limitation of the GIL.
  2. During training, each process exchanges its gradients with the other processes using the ring-reduce method.
  3. Each process updates its parameters with the averaged gradient. Since every process starts from the same initial parameters and applies the same gradient, the updated models stay identical (see the sketch below).
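
To make point 3 concrete, here is a rough sketch of one training step written by hand with all_reduce. This is only an illustration of the idea, not DDP's real internals (DDP overlaps gradient communication with the backward pass), and it assumes the process group has already been initialized as shown later in this post.

# Conceptual sketch only: average gradients across processes by hand.
import torch.distributed as dist

def manual_ddp_step(model, loss):
    loss.backward()                              # each process computes its own gradients
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum gradients over all processes
            p.grad /= world_size                            # then average
    # Every process now holds the same averaged gradient, so identical
    # optimizer steps keep all model replicas in sync.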

DP mode came earlier. It supports single-machine multi-GPU training and is used like this:

model = torch.nn.DataParallel(model)

DP mode has only one process, so it is easily limited by the GIL. The master GPU acts like a parameter server: it broadcasts the parameters to the other cards; after back-propagation each card sends its gradient back to the master, which averages the gradients, updates the parameters, and then sends the updated parameters back to the other cards.

Clearly, this concentrates both computation and communication on the master node, which causes congestion and slows training down.

DDP is strongly recommended

What is GIL? Why is DDP faster?

GIL (Global Interpreter Lock): its main drawback is that a Python process can effectively use only one CPU core, which makes it unsuitable for compute-intensive tasks. Using multiple processes makes effective use of multi-core resources; DDP starts multiple processes, which largely avoids this limitation.

Ring-reduce gradient merging: each process computes its gradient independently, then passes it to the next process in the ring while receiving the gradient from the previous process and passing that one along as well. After circling the ring n times (n = number of processes), every process has all the gradients.

Why it is fast: each process only communicates with its immediate upstream and downstream neighbours, which greatly relieves the communication bottleneck of a parameter server.
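
As a toy illustration of the ring idea (plain Python, no real networking, and not the chunked algorithm real frameworks use): each "process" keeps handing a value to its downstream neighbour and adding what arrives from upstream; after n-1 rounds every process holds the full sum.

def ring_allreduce_sim(grads):
    # grads[i] is the gradient "held" by process i
    n = len(grads)
    totals = list(grads)                # running sum on each process
    sending = list(grads)               # value each process forwards this round
    for _ in range(n - 1):
        received = [sending[(i - 1) % n] for i in range(n)]   # take from upstream neighbour
        totals = [t + r for t, r in zip(totals, received)]
        sending = received              # pass along what was just received
    return [t / n for t in totals]      # averaged gradient, identical everywhere

print(ring_allreduce_sim([1.0, 2.0, 3.0, 4.0]))   # [2.5, 2.5, 2.5, 2.5]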

Generally speaking, there are three kinds of neural network parallelism:

  • Data parallelism: each card gets a different slice of the data, which indirectly increases the effective batch_size. DP and DDP both work in this mode.
  • Model parallelism: different parts of the model are placed on different cards and computed in parallel. Whether it actually speeds things up depends on the communication efficiency.
  • Workload partitioning: different parts of the model are placed on different cards, but the computation is serial, so there is no speedup (a minimal sketch of splitting a model across cards follows this list).
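
A minimal sketch of splitting one model across two cards, as in the last two bullets (this assumes a machine with two GPUs; the forward pass here is serial, i.e. the workload-partitioning case):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(10, 10).to("cuda:0")   # first half on GPU 0
        self.part2 = nn.Linear(10, 10).to("cuda:1")   # second half on GPU 1

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))   # activations are copied to the second card
        return x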

DDP usage in PyTorch

DDP recommends one process per GPU, i.e. one model replica per card.

You can also use a single process with multiple cards. There are three possible layouts:

  • One card per process (the officially recommended best practice).
  • Multiple cards per process, replication mode: the model is copied onto each card, so each process effectively runs DP. This is slower than one card per process and is generally not used.
  • Multiple cards per process, parallel mode: different parts of one model live on different cards. This is used when the model is so large that a single card cannot even fit batch_size=1.

This article only covers the one-card-one-process case. (In practice I have never run into a model too big to fit on one card; ours is just a small, scrappy lab ε=ε=ε= ┏(゜ロ゜;) ┛)

Related concepts

First understand the following related concepts:

  • Group: the process group. By default there is only one group.

  • world_size: the total number of processes taking part in the job.

    torch.distributed.get_world_size()

  • rank: the index of the current process, used for inter-process communication. It starts from 0, and the process with rank=0 is the master process.

    torch.distributed.get_rank()

  • local_rank: the index of the process on its own machine.

    In general, local_rank is used to choose which GPU of the current machine the model runs on.

    Note: there is no torch.distributed.local_rank() function. The launcher supplies local_rank, either as a --local_rank command-line argument or via the LOCAL_RANK environment variable (see the snippet below).
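
A quick way to inspect these values inside a running process (a small sketch that assumes the process group has already been initialized by one of the launch methods described below; whether LOCAL_RANK appears as an environment variable or as a --local_rank argument depends on how you launch):

import os
import torch.distributed as dist

world_size = dist.get_world_size()                 # total number of processes
rank = dist.get_rank()                             # global index of this process
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # index on this machine, if the launcher sets the env var
print(f"rank {rank}/{world_size}, local_rank {local_rank}")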

Usage workflow

It is easy to use; the key line to add is:

model = DDP(model, device_ids=[local_rank], output_device=local_rank)

The original model is a plain PyTorch model; the new model is a DDP-wrapped model.

https://zhuanlan.zhihu.com/p/178402798

## main.py file
import argparse

import torch
import torch.nn as nn

# New 1: DDP dependencies
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# New 2: read the local_rank argument supplied by the launcher. torch.distributed.launch
#        passes it automatically, as described later -- just copy this part.
#        argparse is a standard Python library for handling command-line arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
FLAGS = parser.parse_args()
local_rank = FLAGS.local_rank

# New 3: DDP backend initialization
#   a. use local_rank to choose which GPU this process runs on
torch.cuda.set_device(local_rank)
#   b. initialize DDP with the default backend (nccl). For CPU-only training, choose another backend (e.g. gloo).
dist.init_process_group(backend='nccl')

# New 4: define the model and place it on this process's GPU. This must happen before `model = DDP(model)`.
#        If you need to load a checkpoint, do it here as well.
device = torch.device("cuda", local_rank)
model = nn.Linear(10, 10).to(device)
# possibly load a checkpoint here

# New 5: then wrap it as a DDP model
model = DDP(model, device_ids=[local_rank], output_device=local_rank)

Besides the model, the other key part is data distribution. In short, the dataset is split evenly across the cards so that each card sees different data (if every card read the whole dataset, the computation would be redundant).

In PyTorch, torch.utils.data.distributed.DistributedSampler implements this data distribution.

import torchvision
import torch.optim as optim

my_trainset = torchvision.datasets.CIFAR10(root='./data', train=True)
# New 1: use DistributedSampler. DDP encapsulates all the details for us -- just use it.
#       How the sampler works will be introduced later.
train_sampler = torch.utils.data.distributed.DistributedSampler(my_trainset)
# Note: batch_size here is the per-process batch_size, so the total batch size is this batch_size multiplied by world_size.
trainloader = torch.utils.data.DataLoader(my_trainset, batch_size=batch_size, sampler=train_sampler)

# Build the optimizer once, outside the training loop (loss_fn, batch_size and num_epochs are assumed to be defined as in ordinary single-GPU training).
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(num_epochs):
    # New 2: set the sampler's epoch; DistributedSampler needs this to keep the shuffling seed consistent across processes.
    trainloader.sampler.set_epoch(epoch)
    # The rest is exactly the same as ordinary single-GPU training.
    for data, label in trainloader:
        data, label = data.to(device), label.to(device)
        optimizer.zero_grad()
        prediction = model(data)
        loss = loss_fn(prediction, label)
        loss.backward()
        optimizer.step()

Once the two parts above (model and data) are done, you can basically run multi-GPU training.

Saving the model:

# 1. When saving, as in DP mode, note one thing: save model.module rather than model,
#    because `model` is now the DDP wrapper created by `model = DDP(model)`.
# 2. Save only on process 0, to avoid writing the same thing multiple times.
if dist.get_rank() == 0:
    torch.save(model.module, "saved_model.ckpt")
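
A companion sketch for loading (my own suggestion rather than part of the original snippet, reusing the local_rank and model names from above): saving the state_dict is also common, and when resuming you load it into the bare model before wrapping with DDP, mapped to this process's own GPU so every rank does not pile onto cuda:0.

# Alternative save: only the parameters, still on rank 0 only
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "saved_model_state.ckpt")

# Loading when resuming: do this on the bare (pre-DDP) model
state_dict = torch.load("saved_model_state.ckpt",
                        map_location=torch.device("cuda", local_rank))
model.load_state_dict(state_dict)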

Note:

  1. Theoretically, if there are no buffers (such as BN statistics), DDP's result is exactly the same as single-card gradient accumulation: 8 GPUs in parallel is equivalent to a single card with a gradient accumulation step of 8 (see the sketch below).
  2. In terms of speed, DDP is faster than single-card gradient accumulation.
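
For reference, a single-card gradient accumulation sketch with an accumulation step of 8 (reusing model, loss_fn, optimizer and trainloader from the snippets above; dividing the loss by the number of accumulation steps mimics the gradient averaging that DDP does across 8 cards):

accum_steps = 8
optimizer.zero_grad()
for i, (data, label) in enumerate(trainloader):
    data, label = data.to(device), label.to(device)
    loss = loss_fn(model(data), label) / accum_steps
    loss.backward()                      # gradients accumulate in .grad over 8 mini-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                 # one update using the accumulated (averaged) gradient
        optimizer.zero_grad()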

How to start

There are two ways to launch: 1. torch.distributed.launch  2. torch.multiprocessing.spawn

torch.distributed.launch

Its main arguments:

  • --nnodes: how many machines (nodes) there are
  • --node_rank: the index of the current machine
  • --nproc_per_node: how many processes to start on each machine

How it works: run torch.distributed.launch once on each machine. Each launch starts n processes and passes each of them a --local_rank=i argument.

Single-machine mode:

## Bash operation
# Suppose we only run on one machine and the number of cards available is 8
python -m torch.distributed.launch --nproc_per_node 8 main.py

Multi machine mode:

--master_addr: the network address of the master process, 127.0.0.1 by default (which only works for a single machine).

--master_port: a port on the master process, 29500 by default. Check that the port is not occupied by another program before using it.

## Bash operation
# Suppose we run on 2 machines, and the number of cards available for each machine is 8
#    Machine 1:
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node 8 \
  --master_addr $my_address --master_port $my_port main.py
#    Machine 2:
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node 8 \
  --master_addr $my_address --master_port $my_port main.py

The spawn launch mode

Here is a demo:

https://zhuanlan.zhihu.com/p/178402798

import torch.distributed as dist
import torch.multiprocessing as mp

def demo_fn(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # lots of code.
    ...

def run_demo(demo_fn, world_size):
    # mp.spawn starts world_size processes and passes each one its rank
    # as the first argument of demo_fn.
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

Compared with launch, spawn is a little more involved to call, but it is better encapsulated, which makes it convenient for others to use your code directly.
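
One detail worth noting (my addition, not from the original demo): with the default env:// initialization used above, MASTER_ADDR and MASTER_PORT have to be set before init_process_group is called. A minimal single-machine driver might look like this:

import os
import torch

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"            # any free port
    run_demo(demo_fn, world_size=torch.cuda.device_count())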

For the principles and implementation details of DDP, see: https://zhuanlan.zhihu.com/p/187610959

Example: multi-GPU usage in voxceleb_trainer

voxceleb_trainer is an open-source speaker (voiceprint) recognition toolkit with simple code. It implements multi-GPU training based on the spawn launch mode. Let's take a brief look.

It is similar to the workflow described above:

First set the master address and port and initialize the process group; then place the model on a single card and wrap it as a DDP model.

if args.distributed:
    # Single machine, multiple cards
    # Set the master address (local, since everything runs on one machine)
    os.environ['MASTER_ADDR'] = 'localhost'
    # and the master port
    os.environ['MASTER_PORT'] = args.port
    # Initialize the process group
    dist.init_process_group(backend='nccl', world_size=ngpus_per_node, rank=args.gpu)
    torch.cuda.set_device(args.gpu)
    # Move the model (s) to this process's GPU
    s.cuda(args.gpu)

    # Synchronize BN statistics across processes if requested
    if args.syncBN:
        s = torch.nn.SyncBatchNorm.convert_sync_batchnorm(s)
        print('----syncBN----')
    s = torch.nn.parallel.DistributedDataParallel(s, device_ids=[args.gpu], find_unused_parameters=False)

    print('Loaded the model on GPU {:d}'.format(args.gpu))

The above is wrapped in a main_worker function. The data loading then becomes:

Datasets->DistributedSampler->BatchSampler->DataLoader

train_dataset = train_dataset_loader(**vars(args))

train_sampler = train_dataset_sampler(train_dataset, **vars(args))
# Total batch_size = args.batch_size * n_gpu (number of graphics cards)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    num_workers=args.nDataLoaderThread,
    sampler=train_sampler,
    pin_memory=False,
    worker_init_fn=worker_init_fn,
    drop_last=True,
)

spawn start:

torch.multiprocessing.spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method='spawn')

  • fn: the function to run, defined as fn(rank, *args). The first parameter must be rank; spawn automatically fills in the process index, which tells the function which GPU it is currently responsible for.
  • args: the arguments passed to fn, as a tuple.
  • nprocs: the number of processes to start.
  • join: whether to perform a blocking join, i.e. wait for all processes to finish.

For example:

mp.spawn(main_worker, nprocs=n_gpus, args=(n_gpus, args))
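
For orientation, a minimal sketch of what such a main_worker could look like (this is my own simplified version, not voxceleb_trainer's actual code; build_model is a hypothetical helper, and os / torch / dist are imported as in the snippets above):

def main_worker(rank, n_gpus, args):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = args.port
    dist.init_process_group(backend='nccl', world_size=n_gpus, rank=rank)
    torch.cuda.set_device(rank)

    model = build_model(args).cuda(rank)          # hypothetical model builder
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    # ...then build the DistributedSampler / DataLoader and train as shown above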

If this post helped, please give it a like and follow~

To be updated later: more details of DDP

Topics: Python Pytorch Deep Learning