The Distributed Barrier Operation in PyTorch
Original document: https://www.yuque.com/lart/ug...
On the concept of barrier
For the concept of a barrier, see the introduction on Wikipedia: a synchronization barrier is a synchronization method in parallel computing. For a group of processes or threads, a barrier in the program means that every thread/process must stop at that point and may not proceed until all threads/processes have reached it.
Note that the barrier is not unique to PyTorch; it is a basic concept in parallel computing and shows up in other parallel-computing settings as well. This article focuses on how it behaves in PyTorch.
torch.distributed.barrier(group=None, async_op=False, device_ids=None)

Synchronizes all processes. This collective blocks processes until the whole group enters this function, if async_op is False, or if async work handle is called on wait().

Parameters:
- group (ProcessGroup, optional) – The process group to work on. If None, the default process group will be used.
- async_op (bool, optional) – Whether this op should be an async op.
- device_ids ([int], optional) – List of device/GPU ids. Valid only for NCCL backend.

Returns: Async work handle, if async_op is set to True. None, if not async_op or if not part of the group.
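To make the async_op parameter concrete, here is a minimal sketch, assuming a two-process group on the CPU-only gloo backend (the worker function name and the port are arbitrary choices, not anything prescribed by PyTorch). It contrasts the blocking call with the work handle returned when async_op=True:

```python
import os

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # rendezvous settings; address/port are arbitrary local values
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    dist.barrier()  # blocking form: returns only once every rank has arrived

    handle = dist.barrier(async_op=True)  # async form: returns a work handle immediately
    # ... other rank-local work could overlap here ...
    handle.wait()  # block now, until every rank has reached this barrier

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```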
In multi-GPU training, each GPU is usually driven by its own process. Sometimes we want a single process to perform some task alone while holding back the progress of the other processes, and that is exactly what barrier is for.
A practical scenario is dataset preparation: only process 0 needs to do the processing; the other processes do not, but their subsequent work depends on the prepared data. So while process 0 is working, the other processes should be blocked and kept waiting, and only released together once the preparation is done.
For this requirement, a typical setup based on a context manager looks like this:
```python
# https://github.com/ultralytics/yolov5/blob/7d56d451241e94cd9dbe4fcb9bfba0e92c6e0e23/utils/torch_utils.py#L29-L38
from contextlib import contextmanager

import torch.distributed as dist


@contextmanager
def torch_distributed_zero_first(local_rank: int):
    """
    Decorator to make all processes in distributed training wait for each local_master to do something.
    """
    if local_rank not in [-1, 0]:
        # every non-master rank blocks here until rank 0 has finished the body
        dist.barrier(device_ids=[local_rank])
    yield
    if local_rank == 0:
        # rank 0 reaches its own barrier only after the body, releasing the others
        dist.barrier(device_ids=[0])
```
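A hypothetical usage sketch follows (prepare_data and load_dataset are placeholder names, not part of the YOLOv5 code): every rank enters the context manager, but only rank 0 actually runs the expensive preparation, while the other ranks wait at the barrier and then reuse the result.

```python
def prepare_data(cache_dir: str) -> str:
    # placeholder for any expensive, one-time work: downloading, caching, pre-processing, ...
    return cache_dir


def load_dataset(local_rank: int, cache_dir: str):
    # all ranks call this; ranks other than 0 block inside the context manager
    # until rank 0 has left the `with` body and executed its own barrier
    with torch_distributed_zero_first(local_rank):
        cache = prepare_data(cache_dir)
    # from here on, every rank can safely read the prepared cache
    return cache
```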
Details about barrier
```python
# -*- coding: utf-8 -*-
import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def ddp_test_v0(local_rank, word_size):
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    dist.init_process_group(backend="nccl", world_size=word_size, rank=local_rank)
    print("first before barrier{}\n".format(local_rank))
    if local_rank != 0:
        dist.barrier()
    print("first after barrier{}\n".format(local_rank))

    print("inter {}".format(local_rank))

    print("second before barrier{}\n".format(local_rank))
    if local_rank == 0:
        dist.barrier()
    print("second after barrier{}\n".format(local_rank))

    print("{} exit".format(local_rank))


def ddp_test_v1(local_rank, word_size):
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    dist.init_process_group(backend="nccl", world_size=word_size, rank=local_rank)
    if local_rank != 0:
        print("1 before barrier{}\n".format(local_rank))
        start = time.time()
        time.sleep(5)
        dist.barrier()
        print(time.time() - start)
        print("1 after barrier{}\n".format(local_rank))
        dist.barrier()
        print("1 after barrier{}\n".format(local_rank))
    else:
        print("0 before barrier{}\n".format(local_rank))
        start = time.time()
        dist.barrier()
        print(time.time() - start)
        print("0 after barrier{}\n".format(local_rank))
        print("0 after barrier{}\n".format(local_rank))
        dist.barrier()
        print("0 after barrier{}\n".format(local_rank))

    print("{} exit".format(local_rank))


def main():
    world_size = 2
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    mp.spawn(ddp_test_v0, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()
```
Two examples are shown here. Beyond what the official documentation of dist.barrier describes, they demonstrate an important property of this operation: the barrier calls in different processes must match. A process blocked at a barrier is released only after every process in the group has executed a barrier, so each process has to execute the same number of barrier calls for all of them to move from blocking back to normal execution.
Let's look at the first example:
```python
def ddp_test(local_rank, word_size):
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    dist.init_process_group(backend="nccl", world_size=word_size, rank=local_rank)
    print("first before barrier{}\n".format(local_rank))
    if local_rank != 0:
        dist.barrier()
    print("first after barrier{}\n".format(local_rank))

    print("inter {}".format(local_rank))

    print("second before barrier{}\n".format(local_rank))
    if local_rank == 0:
        dist.barrier()
    print("second after barrier{}\n".format(local_rank))

    print("{} exit".format(local_rank))
```
Its output is:
```
first before barrier1
first before barrier0
first after barrier0
inter 0
second before barrier0
second after barrier0
0 exit
first after barrier1
inter 1
second before barrier1
second after barrier1
1 exit

Process finished with exit code 0
```
As you can see, there are several details:
- Before reaching a barrier, every process prints its output normally.
- Because process 0 (local_rank=0) does not hit a barrier until the second one, it gets to print several lines, whereas process 1 (local_rank=1) blocks at the first barrier and therefore only prints first before barrier1.
- After second before barrier0 is printed, process 0 executes its own barrier. This matches the barrier that process 1 has been blocked on, so process 1 is released and resumes. Because of the time spent on the intermediate operations, process 0 first prints its second after barrier0 and exits, and only then does process 1 print the rest of its output.
It is worth noting that the barrier calls of different processes correspond to one another: all processes must execute a barrier before any of them is released and can move forward again.
For the second example:
```python
def ddp_test_v1(local_rank, word_size):
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    dist.init_process_group(backend="nccl", world_size=word_size, rank=local_rank)
    if local_rank != 0:
        print("1 before barrier{}\n".format(local_rank))
        start = time.time()
        time.sleep(5)
        dist.barrier()
        print(time.time() - start)
        print("1 after barrier{}\n".format(local_rank))
        dist.barrier()
        print("1 after barrier{}\n".format(local_rank))
    else:
        print("0 before barrier{}\n".format(local_rank))
        start = time.time()
        dist.barrier()
        print(time.time() - start)
        print("0 after barrier{}\n".format(local_rank))
        print("0 after barrier{}\n".format(local_rank))
        dist.barrier()
        print("0 after barrier{}\n".format(local_rank))

    print("{} exit".format(local_rank))
```
The output is:
```
1 before barrier1
0 before barrier0
5.002117395401001
5.002126216888428
1 after barrier1
0 after barrier0
0 after barrier0
0 after barrier0
0 exit
1 after barrier1
1 exit

Process finished with exit code 0
```
An important point can be seen here: the two print(time.time() - start) outputs are essentially identical. No matter how long the delay before one side's barrier is, the measured time runs until the last process reaches and executes the barrier, because only then is the barrier lifted for everyone. This again shows how the barriers of different processes constrain each other. Once process 0 reaches its second barrier, process 1 is allowed to run again, but process 0 still finishes first.
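The same effect can be reproduced without GPUs. Below is a minimal sketch under the assumption of the gloo backend (so it runs on CPU; the port and function name are arbitrary), where only rank 1 sleeps before the barrier. Rank 0 arrives immediately but still measures a wait of roughly five seconds, because the barrier is lifted only when the last rank arrives:

```python
import os
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def timed_barrier(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29502"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    time.sleep(5 if rank == 1 else 0)  # rank 1 is the slow rank
    start = time.time()
    dist.barrier()
    # rank 0 prints ~5s (it waited for rank 1); rank 1 prints ~0s
    print("rank {} waited {:.3f}s at the barrier".format(rank, time.time() - start))

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(timed_barrier, args=(2,), nprocs=2, join=True)
```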
In addition, it can be verified that if one of the two barrier calls is removed from the branch of one rank, the other rank will fall into an infinite wait at its unmatched barrier.
For example:
```python
def ddp_test_v1(local_rank, word_size):
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    dist.init_process_group(backend="nccl", world_size=word_size, rank=local_rank)
    if local_rank != 0:
        print("1 before barrier{}\n".format(local_rank))
        start = time.time()
        time.sleep(5)
        dist.barrier()
        print(time.time() - start)
        print("1 after barrier{}\n".format(local_rank))
        # dist.barrier()
        print("1 after barrier{}\n".format(local_rank))
    else:
        print("0 before barrier{}\n".format(local_rank))
        start = time.time()
        time.sleep(3)
        dist.barrier()
        print(time.time() - start)
        print("0 after barrier{}\n".format(local_rank))
        print("0 after barrier{}\n".format(local_rank))
        dist.barrier()
        print("0 after barrier{}\n".format(local_rank))

    print("{} exit".format(local_rank))
```
Output:
```
0 before barrier0
1 before barrier1
5.002458572387695
1 after barrier1
1 after barrier1
1 exit
5.002473831176758
0 after barrier0
0 after barrier0
Traceback (most recent call last):
  File "/home/lart/Coding/SODBetterProj/tools/dist_experiment_test.py", line 67, in <module>
    main()
  File "/home/lart/Coding/SODBetterProj/tools/dist_experiment_test.py", line 63, in main
    mp.spawn(ddp_test_v1, args=(world_size,), nprocs=world_size, join=True)
  File "/home/lart/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/lart/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/lart/miniconda3/envs/pt17/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 75, in join
    ready = multiprocessing.connection.wait(
  File "/home/lart/miniconda3/envs/pt17/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/home/lart/miniconda3/envs/pt17/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
```
Process 0 waits indefinitely at its second barrier, until the run is killed by hand.
This behavior is also described in this Stack Overflow answer:
- when a process encounters a barrier it will block
- the position of the barrier is not important (not all processes have to enter the same if-statement, for instance)
- a process is blocked by a barrier until all processes have encountered a barrier, upon which the barrier is lifted for all processes
https://stackoverflow.com/a/59766443
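If a mismatch like the one above is a concern in practice, one way to fail fast instead of hanging forever is to pass a timeout when initializing the process group. The sketch below assumes the gloo backend, where the timeout applies directly to collectives such as barrier(); with NCCL it only takes effect in combination with blocking-wait settings (for example the NCCL_BLOCKING_WAIT environment variable), so treat this as an illustration rather than a universal recipe:

```python
import os
import time
from datetime import timedelta

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29503"
    dist.init_process_group(
        "gloo", rank=rank, world_size=world_size, timeout=timedelta(seconds=10)
    )
    if rank == 0:
        # rank 1 is still alive but never calls barrier, so instead of blocking
        # forever this call is expected to raise once the 10-second timeout expires
        dist.barrier()
    else:
        time.sleep(30)  # simulate a rank that never reaches the barrier
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```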