Distributed training with Huawei MindSpore


Technical background


Distributed and parallel computing are very important concepts in computing. Laypeople often assume this is simple work, but if we look at the hardware history of computers, from CPU to GPU, to TPU and Huawei's Ascend (NPU), and even to today's much-hyped quantum computers (QPU), it is really a history of distributed and parallel computing: from simple data parallelism, to algorithm parallelism, to graph parallelism, and finally to the physical parallelism brought by quantum superposition. Whether distributed and parallel techniques are handled well therefore largely determines the upper limit of a tool's performance. In this article, let's study MindSpore's distributed training method.

Environment deployment

In previous blog posts, we discussed installing the CPU and GPU versions of MindSpore with Docker and Singularity containers; interested readers can look back at those posts. In this article we assume a MindSpore GPU environment is already installed locally, and on that premise we discuss how to use MindSpore's distributed functionality on a single machine with multiple GPU cards. For a more complete introduction, please refer to the official documentation, which covers installation, deployment and usage in full. Here we only introduce the basic installation and usage steps for a local Ubuntu environment.

Install OpenMPI

Two pieces of software need to be installed here in total. OpenMPI is built from source: first, download the source code of the version indicated by MindSpore from the download link it gives:

According to that version, we download the openmpi-4.0.3.tar.gz compressed package.

After downloading, you can install from source by roughly following the instructions given there:
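
For reference, the whole procedure boils down to something like the sketch below. It assumes the openmpi-4.0.3.tar.gz archive sits in the current directory; adjust the version and the --prefix to whatever the guideline you follow actually specifies.

# unpack, configure, build and install OpenMPI from source (paths and version are examples)
tar -xzf openmpi-4.0.3.tar.gz
cd openmpi-4.0.3
./configure --prefix=/usr/local
make -j4            # or make -j$(nproc)
sudo make install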

If the installation is successful, running mpirun --version will print the version number. Two points to note: 1. the decompression command given in that guideline is not mandatory, you can use your own; 2. if you follow the --prefix from that guideline, you will later need to add /usr/local/lib/ to the LD_LIBRARY_PATH environment variable, otherwise some .so libraries will not be found.
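
For the second of those points, a minimal sketch of the line to append to ~/.bashrc, assuming the libraries ended up under /usr/local/lib/:

# let the runtime linker find the MPI shared libraries (*.so)
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH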

Install NCCL

MindSpore uses the NCCL tool for distributed communication on GPUs. Similarly, go to the download link given in the documentation to find the package to download and install.

The Local installation scheme is recommended here.

After finding the package that matches your system version, you can install it step by step according to the guidelines given there:
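
A rough sketch of that Local-installer route on Ubuntu is shown below; the package and directory names are placeholders for whichever files match your Ubuntu and CUDA versions.

# register the downloaded local repository, then install the NCCL runtime and headers
sudo dpkg -i <nccl-local-repo-package>.deb          # the .deb you downloaded for your Ubuntu/CUDA version
sudo apt-key add /var/<nccl-repo-directory>/*.pub   # exact directory name depends on the NCCL release
sudo apt update
sudo apt install libnccl2 libnccl-dev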

If the apt-key step fails the first time, just run it once or twice more.

Environment test

Once both OpenMPI and NCCL are installed successfully, you can use the following initialization example to test the environment deployment:

# test-init.py

from mindspore import context
from mindspore.communication.management import init

if __name__ == "__main__":
    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
    init("nccl")

If the execution result is as follows and there is no error message, the environment deployment is successful:

dechin@ubuntu2004:~/projects/mindspore/test$ mindspore test-init.py 
dechin@ubuntu2004:~/projects/mindspore/test$
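
If you also want to check that several processes can initialize NCCL together, the same script can be launched through mpirun with the full Singularity command (the same launch form used in the next section); a sketch, assuming two visible GPU cards:

# launch the init test in two processes; no error output means both initialized successfully
mpirun -n 2 singularity exec --nv /home/dechin/tools/singularity/mindspore-gpu_1.2.0.sif python test-init.py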

MindSpore distributed training

First of all, some readers may wonder: what is the mindspore command used in the previous section? It is actually a shortcut for the GPU version of MindSpore that I configured locally as an alias:

dechin@ubuntu2004:~/projects/$ cat ~/.bashrc | grep mindspore
alias mindspore='singularity exec --nv /home/dechin/tools/singularity/mindspore-gpu_1.2.0.sif python'

If you find this convenient, you can also set up an alias as a shortcut command according to your own preferences.
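
For example, since later in this article the training is launched with mpirun plus the full Singularity command, that long command can also be wrapped in a second alias. The sketch below is only a hypothetical convenience (the name mindspore-mpi and the fixed -n 2 are my own choices, and the image path is the one from my machine):

# appended to ~/.bashrc next to the existing 'mindspore' alias
alias mindspore-mpi='mpirun -n 2 singularity exec --nv /home/dechin/tools/singularity/mindspore-gpu_1.2.0.sif python'

After re-sourcing ~/.bashrc, running mindspore-mpi test_nonlinear.py would then launch the two-process run described below.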

Next, let's look at an example. This code comes from a previous blog post; on top of it we add the init() initialization introduced in the previous section. The complete code is as follows:

# test_nonlinear.py

from mindspore import context
from mindspore.communication.management import init
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
init()
import numpy as np
from mindspore import dataset as ds
from mindspore import nn, Tensor, Model
import time
from mindspore.train.callback import Callback, LossMonitor

def get_data(num, a=2.0, b=3.0, c=5.0):
    for _ in range(num):
        x = np.random.uniform(-1.0, 1.0)
        y = np.random.uniform(-1.0, 1.0)
        noise = np.random.normal(0, 0.03)
        z = a * x ** 2 + b * y ** 3 + c + noise
        yield np.array([[x**2], [y**3]],dtype=np.float32).reshape(1,2), np.array([z]).astype(np.float32)

def create_dataset(num_data, batch_size=16, repeat_size=1):
    input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['xy','z'])
    input_data = input_data.batch(batch_size)
    input_data = input_data.repeat(repeat_size)
    return input_data

data_number = 160
batch_number = 16
repeat_number = 200

ds_train = create_dataset(data_number, batch_size=batch_number, repeat_size=repeat_number)
dict_datasets = next(ds_train.create_dict_iterator())

class LinearNet(nn.Cell):
    def __init__(self):
        super(LinearNet, self).__init__()
        self.fc = nn.Dense(2, 1, 0.02, 0.02)

    def construct(self, x):
        x = self.fc(x)
        return x

net = LinearNet()
model_params = net.trainable_params()
print ('Param Shape is: {}'.format(len(model_params)))
for net_param in net.trainable_params():
    print(net_param, net_param.asnumpy())
net_loss = nn.loss.MSELoss()

optim = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.6)
model = Model(net, net_loss, optim)

epoch = 1
model.train(epoch, ds_train, callbacks=[LossMonitor(10)], dataset_sink_mode=False)

for net_param in net.trainable_params():
    print(net_param, net_param.asnumpy())

At this point, we need to run this code with mpirun instead:

dechin@ubuntu2004:~/projects/gitlab/dechin/src/mindspore$ mpirun -n 2 singularity exec --nv /home/dechin/tools/singularity/mindspore-gpu_1.2.0.sif python test_nonlinear.py

While it runs, we can see two python tasks occupying two different GPU cards.
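
A simple way to keep an eye on both cards while the job runs is to refresh nvidia-smi once per second:

# refresh the GPU status every second; press Ctrl+C to stop watching
watch -n 1 nvidia-smi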

This monitoring is just the nvidia-smi command wrapped in watch -n 1 to refresh the status every second, as shown above. Two points to note here: 1. this code can also be run directly on a single GPU card with the mindspore alias, but if you want to run it with mpirun you cannot use that alias; you have to write out the full command by hand (unless you add a new alias for it); 2. because the code above only performs initialization, although it runs on two cards, the two training processes never communicate with each other: they are two independent tasks. To build a truly automatic distributed training, we also need to add context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL) and some checkpoint configuration. The new complete code is as follows:

# test_nonlinear.py

from mindspore import context
from mindspore.communication.management import init
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
init()

import numpy as np
from mindspore import dataset as ds
from mindspore import nn, Tensor, Model
import time
from mindspore.train.callback import Callback, LossMonitor, ModelCheckpoint, CheckpointConfig
from mindspore.context import ParallelMode
import mindspore as ms
ms.common.set_seed(0)

start_time = time.time()
def get_data(num, a=2.0, b=3.0, c=5.0):
    for _ in range(num):
        x = np.random.uniform(-1.0, 1.0)
        y = np.random.uniform(-1.0, 1.0)
        noise = np.random.normal(0, 0.03)
        z = a * x ** 2 + b * y ** 3 + c + noise
        yield np.array([[x**2], [y**3]],dtype=np.float32).reshape(1,2), np.array([z]).astype(np.float32)

def create_dataset(num_data, batch_size=16, repeat_size=1):
    input_data = ds.GeneratorDataset(list(get_data(num_data)), column_names=['xy','z'])
    input_data = input_data.batch(batch_size)
    input_data = input_data.repeat(repeat_size)
    return input_data

data_number = 160
batch_number = 16
repeat_number = 20

context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL)
ds_train = create_dataset(data_number, batch_size=batch_number, repeat_size=repeat_number)
dict_datasets = next(ds_train.create_dict_iterator())

class LinearNet(nn.Cell):
    def __init__(self):
        super(LinearNet, self).__init__()
        self.fc = nn.Dense(2, 1, 0.02, 0.02)

    def construct(self, x):
        x = self.fc(x)
        return x

net = LinearNet()
model_params = net.trainable_params()
print ('Param Shape is: {}'.format(len(model_params)))
for net_param in net.trainable_params():
    print(net_param, net_param.asnumpy())
net_loss = nn.loss.MSELoss()

optim = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.6)
ckpt_config = CheckpointConfig()
ckpt_callback = ModelCheckpoint(prefix='data_parallel', config=ckpt_config)

model = Model(net, net_loss, optim)

epoch = 10
model.train(epoch, ds_train, callbacks=[ckpt_callback], dataset_sink_mode=True)

for net_param in net.trainable_params():
    print(net_param, net_param.asnumpy())

print ('The total time cost is: {}s'.format(time.time() - start_time))

The running result of this code is as follows:

dechin@ubuntu2004:~/projects/gitlab/dechin/src/mindspore$ mpirun -n 2 singularity exec --nv /home/dechin/tools/singularity/mindspore-gpu_1.2.0.sif python test_nonlinear.py 
Param Shape is: 2
Parameter (name=fc.weight, shape=(1, 2), dtype=Float32, requires_grad=True) [[0.02 0.02]]
Parameter (name=fc.bias, shape=(1,), dtype=Float32, requires_grad=True) [0.02]
Param Shape is: 2
Parameter (name=fc.weight, shape=(1, 2), dtype=Float32, requires_grad=True) [[0.02 0.02]]
Parameter (name=fc.bias, shape=(1,), dtype=Float32, requires_grad=True) [0.02]
[WARNING] ME(2528801:139843698521024,MainProcess):2021-06-10-09:45:37.603.010 [mindspore/train/callback/_checkpoint.py:428] OSError, failed to remove the older ckpt file /home/dechin/projects/gitlab/dechin/src/mindspore/data_parallel-1_200.ckpt.
[WARNING] ME(2528799:139709496722368,MainProcess):2021-06-10-09:45:37.713.232 [mindspore/train/callback/_checkpoint.py:428] OSError, failed to remove the older ckpt file /home/dechin/projects/gitlab/dechin/src/mindspore/data_parallel-2_200.ckpt.
[WARNING] ME(2528799:139709496722368,MainProcess):2021-06-10-09:45:37.824.271 [mindspore/train/callback/_checkpoint.py:428] OSError, failed to remove the older ckpt file /home/dechin/projects/gitlab/dechin/src/mindspore/data_parallel-3_200.ckpt.
[WARNING] ME(2528799:139709496722368,MainProcess):2021-06-10-09:45:37.943.749 [mindspore/train/callback/_checkpoint.py:428] OSError, failed to remove the older ckpt file /home/dechin/projects/gitlab/dechin/src/mindspore/data_parallel-4_200.ckpt.
[WARNING] ME(2528801:139843698521024,MainProcess):2021-06-10-09:45:38.433.85 [mindspore/train/callback/_checkpoint.py:428] OSError, failed to remove the older ckpt file /home/dechin/projects/gitlab/dechin/src/mindspore/data_parallel-5_200.ckpt.
Parameter (name=fc.weight, shape=(1, 2), dtype=Float32, requires_grad=True) [[0.12186428 0.21167319]]
Parameter (name=fc.bias, shape=(1,), dtype=Float32, requires_grad=True) [5.561276]
The total time cost is: 8.412446737289429s
Parameter (name=fc.weight, shape=(1, 2), dtype=Float32, requires_grad=True) [[0.12186428 0.21167319]]
Parameter (name=fc.bias, shape=(1,), dtype=Float32, requires_grad=True) [5.561276]
The total time cost is: 8.439369916915894s

Although the run succeeds, note that the distributed case takes about 8.44 s, while single-card training takes about 8.15 s: the distributed training is actually slower than the single card here. We explain this phenomenon in the summary.

Summary

In this article we mainly discussed how to deploy a distributed training environment for the MindSpore framework. Once the MindSpore environment is configured, we only need to install OpenMPI and NCCL to enable distributed training, and we have given the corresponding examples. Although the purpose of distributed and parallel techniques is to improve performance, that does not mean every scenario is accelerated; the example in this article, for instance, shows no speed-up. This is because the use case is too simple: throughout the training the GPU utilization stays below 10%, and in that situation the communication overhead must also be paid, so it is naturally not faster than training on a single card. The lesson is that before reaching for distributed and parallel computing, we must first evaluate whether the problem itself is suitable for parallel processing; otherwise the expected acceleration will not materialize.

Copyright notice

This article was first published at: https://www.cnblogs.com/dechinphy/p/dms.html

Author ID: DechinPhy

For more original articles, please refer to: https://www.cnblogs.com/dechinphy/

Dedicated rewards (tip) link: https://www.cnblogs.com/dechinphy/gallery/image/379634.html

Tencent cloud column synchronization: https://cloud.tencent.com/developer/column/91958