Accelerating PyTorch computation on the Raspberry Pi 3B embedded system through its GPU
General idea
- The cross-compilation environment is a Raspberry Pi 3B Docker image
- The open-source GPU library QPULib on GitHub is used
- PyTorch is installed on the Raspberry Pi, with the build accelerated through cross compilation
- C++ programs are compiled and registered through the extension interface provided by PyTorch
- PyTorch is run directly from the Docker image's file system mounted over NFS
Concrete implementation
1. Install the cross-compilation environment
The specific process was inspired by this blog; the way I found that blog was as follows
- Searching for a precompiled Raspberry Pi 3B Python wheel turned up this post on the Python forum
- In that post, a user named choonkiatlee mentioned his compiled wheel and Docker image
- The blog above was then found through that user's GitHub pages
Install Docker for Windows on my host computer (a Windows machine), and configure image acceleration and a hosted registry
Hosted registry service
- Sign in to Alibaba Cloud and open the Container Image Service
- Select a personal instance
- Create a personal namespace under repository management; in my case it is ja1zhou
- Create an access credential and choose a fixed password; this is the password used to verify the local login
- Start the Docker engine on the host and enter the following in a terminal:

```bash
sudo docker login --username=your_username registry.cn-beijing.aliyuncs.com
# then enter the password set above
```

  Once this succeeds, you are logged in to Alibaba Cloud
- Create a new repository in the image repository list, myrasp in my case
Image acceleration
- Configure the registry mirror according to the Alibaba Cloud official documentation
- Directly modify the daemon.json file of Docker for Windows, adding:

```json
{ "registry-mirrors": ["https://your_server_name.mirror.aliyuncs.com"] }
```
Pull the working image, configure a shared volume, and set up the working environment
- Pull the image, create a shared volume from the Downloads folder, and start the container:

```bash
docker pull choonkiatlee/raspbian:build
# The purpose of the volume is to share files from the Downloads folder, such as the wheel and QPULib, directly with Docker
docker volume create --driver local -o o=bind -o type=none -o device="C:\Users\MyUsername\Downloads\rasp_docker" rasp_docker
# torch-1.4.0a0+7f73f1d-cp37-cp37m-linux_armv7l.whl has been downloaded and placed in the created volume
docker run -it --name rasp -v rasp_docker:/root/rasp_docker choonkiatlee/raspbian:build
```
- After entering the Docker container, the following commands are needed for configuration; they supplement the blog above:

```bash
# The Python version in this Docker image is 3.7.3
apt-get update && apt-get install -y python3-numpy   # torch needs numpy support
cd /root/rasp_docker
pip3 install torch-1.4.0a0+7f73f1d-cp37-cp37m-linux_armv7l.whl
```
- Verify:

```
root@3288e690face:~/rasp_docker# python3
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.4.0a0+7f73f1d'
```
- torch has been installed successfully
- Commit this version of the Docker image as the starting point for subsequent development:

```bash
docker commit rasp registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
docker push registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
```
(Optional) compile PyTorch from source (not yet reproduced successfully)
- Set up a proxy, and allow the proxy software through the intranet and public-network firewalls
- Configure the proxy software to accept connections from the LAN
- In the Docker image, set the environment variable:

```bash
export all_proxy="http://host_ip:host_port"
```
- Download and build:

```bash
cd /root
git clone https://github.com/pytorch/pytorch.git
cd pytorch
git checkout v1.4.0
git submodule sync
git submodule update --init --recursive
apt install -y python3-cffi python3-numpy libatlas-base-dev
pip3 install cython wheel pyyaml pillow
# Choose not to build any of the optional add-ons
export USE_CUDA=0
export USE_CUDNN=0
export USE_MKLDNN=0
export USE_METAL=0
export USE_NCCL=OFF
export USE_NNPACK=0
export USE_QNNPACK=0
export USE_DISTRIBUTED=0
export BUILD_TEST=0
export MAX_JOBS=8
python3 setup.py install
```
- The build error is related to the protobuf submodule. The blog's author ran into the same problem at the time, but it is unclear which protobuf tag he used when compiling; he has filed an issue on GitHub
2. Download QPULib and write code against its API
QPULib code structure
```
QPULib/
  Lib/
    Subdirectories/
      *.cpp
      *.h
    *.cpp
    *.h
  Doc/          (irrelevant here)
  Tests/
    *.cpp
    Makefile
```
QPU acceleration principle
- The QPU is a vector processor developed by Broadcom. Its instructions operate on 16-element vectors of 32-bit integer or floating-point values. For example, given two 16-element vectors

  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

  and

  20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

  the QPU's integer-add instruction computes a third vector

  30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60

  where each element of the output is the sum of the corresponding elements of the inputs.
- Each 16-element vector is composed of four quarter-vector parts.
- A QPU processes one quarter-vector per clock cycle, so an instruction takes four consecutive clock cycles to deliver a full 16-element result vector; this is the origin of the name "QPU" (quad processing unit).
- The Pi contains 12 QPUs in total, each running at 250 MHz. That gives a maximum throughput of 750M vector instructions per second (250M cycles divided by 4 cycles per instruction, multiplied by 12 QPUs), or 12B operations per second (750M instructions multiplied by 16 vector elements). In some cases a QPU instruction can deliver two results at a time, so the Pi's QPUs are often quoted at 24 GFLOPS.
- The QPUs are part of the Raspberry Pi's graphics pipeline. For efficient graphics you probably want OpenGL ES, but if you only want to accelerate the non-graphics parts of a Pi project, QPULib is worth a look.
- To avoid blocking during computation, pipelining can be introduced: while the current batch of data is being processed, the data needed for the next computation is fetched (see the sketch after this list).
- Multi-QPU parallelism can also be introduced, with each QPU working on a different region of the data each time, so the GPU is used efficiently.
- Reading the Makefile shows that, at compile time, the include directory is Lib/, and all the *.cpp files under Lib/ are first compiled into *.o files
- Finally, the *.o files are linked together into the executables
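To make the pipelining and multi-QPU ideas concrete before the full operators, here is a minimal QPULib kernel sketch. It follows the same pattern as the dotadd kernel in dot.cpp below; the name addPipelined is only illustrative.

```cpp
// Sketch of a pipelined, multi-QPU element-wise add (illustrative only;
// the full versions used in this project appear in dot.cpp below).
#include <QPULib.h>

void addPipelined(Int n, Ptr<Float> x, Ptr<Float> y)
{
  Int inc = numQPUs() << 4;                    // stride: 16 elements per QPU per step
  Ptr<Float> p = x + index() + (me() << 4);    // each QPU starts on its own 16-element slice
  Ptr<Float> q = y + index() + (me() << 4);
  gather(p); gather(q);                        // prefetch the first slices

  Float a, b;
  For (Int i = 0, i < n, i = i + inc)
    gather(p + inc); gather(q + inc);          // prefetch the next slices...
    receive(a); receive(b);                    // ...while receiving the current ones
    store(a + b, p);                           // write the current result back
    p = p + inc; q = q + inc;
  End
  receive(a); receive(b);                      // drain the last outstanding prefetch
}
```

Each QPU starts 16 elements apart (offset by me() << 4), and every iteration issues the gather for the next slice before receiving the current one, so memory loads overlap with computation.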
Compiling the dynamic link library
- Following this idea, first compile all the .cpp files under Lib/ into .o files and combine them into a dynamic link library, so that they do not have to be recompiled every time something is registered with PyTorch. The Makefile is rewritten as follows:
```makefile
# Root directory of QPULib repository
ROOT = ../Lib

# Compiler and default flags
CXX = g++
CXX_FLAGS = -fpermissive -Wconversion -std=c++0x -I $(ROOT)

# Object directory
OBJ_DIR = obj

# Debug mode
ifeq ($(DEBUG), 1)
  CXX_FLAGS += -DDEBUG
  OBJ_DIR := $(OBJ_DIR)-debug
endif

# QPU or emulation mode
ifeq ($(QPU), 1)
  CXX_FLAGS += -DQPU_MODE
  OBJ_DIR := $(OBJ_DIR)-qpu
else
  CXX_FLAGS += -DEMULATION_MODE
endif

# Object files
OBJ = \
  Kernel.o \
  Source/Syntax.o \
  Source/Int.o \
  Source/Float.o \
  Source/Stmt.o \
  Source/Pretty.o \
  Source/Translate.o \
  Source/Interpreter.o \
  Source/Gen.o \
  Target/Syntax.o \
  Target/SmallLiteral.o \
  Target/Pretty.o \
  Target/RemoveLabels.o \
  Target/CFG.o \
  Target/Liveness.o \
  Target/RegAlloc.o \
  Target/ReachingDefs.o \
  Target/Subst.o \
  Target/LiveRangeSplit.o \
  Target/Satisfy.o \
  Target/LoadStore.o \
  Target/Emulator.o \
  Target/Encode.o \
  VideoCore/Mailbox.o \
  VideoCore/Invoke.o \
  VideoCore/VideoCore.o

# Top-level targets
.PHONY: top clean

LIB = $(patsubst %,$(OBJ_DIR)/%,$(OBJ))

top: $(LIB)
	@$(CXX) $(CXX_FLAGS) -shared -fPIC $^ -o libqpu.so

clean:
	rm -rf obj obj-debug obj-qpu obj-debug-qpu
	rm -f Tri GCD Print MultiTri AutoTest OET Hello ReqRecv Rot3D ID *.o
	rm -f HeatMap
	rm -f libqpu.so

# Intermediate targets
$(OBJ_DIR)/%.o: $(ROOT)/%.cpp $(OBJ_DIR)
	@echo Compiling $<
	@$(CXX) -c -o $@ $< $(CXX_FLAGS)

%.o: %.cpp
	@echo Compiling $<
	@$(CXX) -c -o $@ $< $(CXX_FLAGS)

$(OBJ_DIR):
	@mkdir -p $(OBJ_DIR)
	@mkdir -p $(OBJ_DIR)/Source
	@mkdir -p $(OBJ_DIR)/Target
	@mkdir -p $(OBJ_DIR)/VideoCore
```
- With this Makefile, all the files under Lib/ are packaged into the dynamic link library libqpu.so
- Add the dynamic link library to the system library path:

```bash
vim /etc/ld.so.conf
# add a new line containing the path that holds libqpu.so: /root/embedded_project/
# :wq to save and exit
ldconfig   # refresh the dynamic library cache
```
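As a quick sanity check that libqpu.so can be found and linked, a small stand-alone program can be built against it. This is my own minimal sketch (the file name, paths, and flags are illustrative assumptions), not one of the project files:

```cpp
// test_libqpu.cpp -- minimal link/run check for libqpu.so (illustrative sketch).
// Build roughly like (paths are assumptions based on the setup above):
//   g++ test_libqpu.cpp -std=c++0x -fpermissive -DQPU_MODE \
//       -I /root/QPULib/Lib -L /root/embedded_project -lqpu -o test_libqpu
#include <QPULib.h>
#include <stdio.h>

// Trivial kernel: element-wise add of two 16-element vectors, result stored over x
void addSimple(Ptr<Float> x, Ptr<Float> y)
{
  gather(x);
  gather(y);
  Float a, b;
  receive(a);
  receive(b);
  store(a + b, x);
}

int main()
{
  // Memory shared between the CPU and the QPUs
  SharedArray<float> a(16), b(16);
  for (int i = 0; i < 16; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

  auto k = compile(addSimple);   // compile the kernel
  k(&a, &b);                     // run it (one QPU is enough for 16 elements)

  for (int i = 0; i < 16; ++i)
    printf("%.1f ", (float)a[i]); // expected: 0.0 3.0 6.0 ...
  printf("\n");
  return 0;
}
```

On the Pi the QPU mailbox lives at /dev/vcio, so this has to run as root (or inside the chroot with /dev bind-mounted, as described in section 4).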
Write C++ code that calls the GPU for parallel computation
- Write a pipelined, multi-QPU C++ program for matrix operations
  - The program implements element-wise multiplication and element-wise addition of equal-sized matrices and returns the result
//"dot.cpp" #include <torch/extension.h> #include <vector> #include <QPULib.h> #include <stdio.h> #include <stdlib.h> #include <sys/time.h> const int NQPUS = 4; //Number of qpu s invoked void dotproduct(Int n, Ptr<Float> x, Ptr<Float> y) { Int inc = numQPUs() << 4; Ptr<Float> p = x + index() + (me() << 4); Ptr<Float> q = y + index() + (me() << 4); gather(p); gather(q); Float xOld, yOld; For (Int i = 0, i < n, i = i+inc) gather(p+inc); gather(q+inc);//Get the data required for the next operation receive(xOld); receive(yOld);//Calculate the data obtained before store(xOld * yOld, p); p = p+inc; q = q+inc; End receive(xOld); receive(yOld); } void dotadd(Int n, Ptr<Float> x, Ptr<Float> y) { Int inc = numQPUs() << 4; Ptr<Float> p = x + index() + (me() << 4); Ptr<Float> q = y + index() + (me() << 4); gather(p); gather(q); Float xOld, yOld; For (Int i = 0, i < n, i = i+inc) gather(p+inc); gather(q+inc); receive(xOld); receive(yOld); store(xOld + yOld, p); p = p+inc; q = q+inc; End receive(xOld); receive(yOld); } torch::Tensor dot_product(torch::Tensor input, torch::Tensor weight) { input = input.to(torch::kFloat32); weight = weight.to(torch::kFloat32); float *input_ptr = (float *)input.data_ptr(); float *weight_ptr = (float *)weight.data_ptr(); int width = weight.numel(); int width_16 = width + (16 - width % 16);//Convert the matrix length to a multiple of 16 SharedArray<float> mapA(width_16), mapB(width_16); for (int i = 0; i < width_16; ++i) { if (i < width) { mapA[i] = input_ptr[i]; mapB[i] = weight_ptr[i]; } else { mapA[i] = 0;//Insufficient zero filling mapB[i] = 0; } } auto k = compile(dotproduct); k.setNumQPUs(NQPUS); k(width, &mapA, &mapB); for (int i = 0; i < width; i++) { input_ptr[i] = mapA[i]; } return input; } torch::Tensor dot_add(torch::Tensor input, torch::Tensor weight) { input = input.to(torch::kFloat32); weight = weight.to(torch::kFloat32); float *input_ptr = (float *)input.data_ptr(); float *weight_ptr = (float *)weight.data_ptr(); int width = weight.numel(); int width_16 = width + (16 - width % 16); SharedArray<float> mapA(width_16), mapB(width_16); for (int i = 0; i < width_16; ++i) { if (i < width) { mapA[i] = input_ptr[i]; mapB[i] = weight_ptr[i]; } else { mapA[i] = 0; mapB[i] = 0; } } auto k = compile(dotadd); k.setNumQPUs(NQPUS); k(width, &mapA, &mapB); for (int i = 0; i < width; i++) { input_ptr[i] = mapA[i]; } return input; } PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { m.def("add", &dot_add, "dot_add"); m.def("product", &dot_product, "dot_product"); }
- Write a C++ program that multiplies a two-dimensional matrix (H x W) by a one-dimensional matrix (W x 1) and returns an (H x 1) matrix
  - This program can be used to compute the score of each sample in deep learning
```cpp
// matrix.cpp
#include <torch/extension.h>
#include <vector>
#include <QPULib.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

const int NQPUS = 4;

struct Cursor
{
  Ptr<Float> addr;
  Float current, next;

  void init(Ptr<Float> p)
  {
    gather(p);
    current = 0;
    addr = p + 16;
  }

  void prime()
  {
    receive(next);
    gather(addr);
  }

  void advance()
  {
    addr = addr + 16;
    gather(addr);
    current = next;
    receive(next);
  }

  void finish()
  {
    receive(next);
  }
};

void step(Ptr<Float> map, Ptr<Float> weight, Ptr<Float> mapOut, Int pitch, Int width, Int height)
{
  Cursor row, cursorofweight;
  map = map + pitch * me() + index();

  For (Int y = me(), y < height, y = y + numQPUs())
    // Location where the result of this row times the weight vector is stored;
    // the result of each row occupies 16 elements
    Ptr<Float> p = mapOut + y * 16;

    // Initialize the cursors
    row.init(map);
    row.prime();
    cursorofweight.init(weight);
    cursorofweight.prime();

    // Compute the result of this row
    Float accumulate = 0;
    For (Int x = 0, x < width, x = x + 16)
      // In each iteration, compute with the current data, store it,
      // and fetch the data needed for the next iteration
      row.advance();
      cursorofweight.advance();
      accumulate = accumulate + row.current * cursorofweight.current;
    End

    // Store the result at p
    store(accumulate, p);

    // Release the cursors
    row.finish();
    cursorofweight.finish();

    map = map + pitch * numQPUs();
  End
}

torch::Tensor accumartix(torch::Tensor input, torch::Tensor weight)
{
  input = input.to(torch::kFloat32);
  weight = weight.to(torch::kFloat32);
  int width = weight.numel();
  int width_16 = width + (16 - width % 16);
  int height = input.numel() / width;
  float *input_ptr = (float *)input.data_ptr();
  float *weight_ptr = (float *)weight.data_ptr();

  // Create the vectors through which the CPU and the QPUs exchange data
  SharedArray<float> mapA(width_16 * height), mapB(width_16), sumofmartix(16 * height);
  for (int i = 0; i < height; ++i)
  {
    for (int j = 0; j < width_16; ++j)
    {
      if (j < width)
        mapA[i * width_16 + j] = input_ptr[i * width + j];
      else
        mapA[i * width_16 + j] = 0;
    }
  }
  for (int j = 0; j < height; ++j)
  {
    for (int i = 0; i < 16; ++i)
    {
      sumofmartix[16 * j + i] = 0;
    }
  }
  for (int j = 0; j < width_16; ++j)
  {
    if (j < width)
      mapB[j] = weight_ptr[j];
    else
      mapB[j] = 0;
  }

  auto k = compile(step);
  k.setNumQPUs(NQPUS);
  k(&mapA, &mapB, &sumofmartix, width_16, width, height);

  torch::Tensor ans = torch::zeros(height);
  float *ans_ptr = (float *)ans.data_ptr();
  for (int j = 0; j < height; ++j)
  {
    for (int i = 0; i < 16; ++i)
    {
      ans_ptr[j] += sumofmartix[16 * j + i];
    }
  }
  return ans;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
  m.def("gpu", &accumartix, "accumartix");
}
```
3. Register the C++ program with PyTorch
PyTorch official documentation
- According to the official PyTorch documentation, two files are required to register C++ code with PyTorch: a .cpp file and a setup.py file
Register the C++ program that calls QPULib
- The setup.py file needs to be adapted to QPULib's compilation flags and dependency paths:

```python
from setuptools import setup, Extension
from torch.utils import cpp_extension
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name='dot_cpp',
    ext_modules=[
        Extension(
            name='dot_cpp',
            sources=['dot.cpp'],
            include_dirs=cpp_extension.include_paths() + [
                "/root/QPULib/Lib",
                "/root/QPULib/Lib/Source",
                "/root/QPULib/Lib/Target",
                "/root/QPULib/Lib/VideoCore",
                "/root/QPULib/Lib/Common"],
            library_dirs=["."],
            libraries=["qpu"],
            language='c++',
            extra_compile_args=["-fpermissive", "-w", "-std=c++0x", "-DQPU_MODE"])
    ],
    cmdclass={
        'build_ext': BuildExtension
    })
```
- The files needed for registration are laid out as follows:

```
embedded_project/
  dot.cpp
  setup.py
  libqpu.so
```
- Execute the following in this folder:

```bash
python3 setup.py install
```
- After a successful build, test importing the module:

```python
import torch
import dot_cpp
```
- Verify on the board that no error is reported
4. Use the Raspberry Pi and NFS to mount the file system from the Docker image and verify that the steps above succeeded
Environment preparation
- Install Docker on the lab host, pull the Docker image, and use the docker cp command to copy the entire file system:

```bash
# Needed so that an x64 processor can emulate an ARM processor
sudo apt update && sudo apt install qemu qemu-user-static binfmt-support
docker pull registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
# A named volume can be used to share files between the container and the host
docker volume create --driver local -o o=bind -o type=none -o device="/home/jay/rasp_docker" rasp_volume
docker run -it --name rasp -v rasp_volume:/root registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
# Use docker cp to copy the file system
sudo docker cp rasp:/ /home/jay/rasp_docker_filesystem
```
- After the Raspberry Pi boots, mount /home/jay/rasp_docker_filesystem over NFS and chroot into it:

```bash
mount 192.168.0.101:/home/jay/rasp_docker_filesystem /mnt -o nolock
cd /mnt
mount --rbind /dev dev/   # bind-mount the original file system's /dev (including the GPU devices) into the chroot path
chroot .
```
- (Optional) proxy all traffic through iptables:

```bash
# Note: when booting through U-Boot, I set the Raspberry Pi's gateway to the host's intranet address,
# so only the host's iptables needs to be configured.
# The following is entered on the host:
iptables -t nat -A POSTROUTING -s 192.168.0.1/255.255.255.0 -j SNAT --to public_ip_of_host
```
Formal verification
- Verify the accuracy of the QPU computation:

```python
# accuracy.py
import torch
import time
import dot_cpp

a = torch.randn(100)
b = torch.randn(100)
c = a * b
print("ans in pytorch:")
print(c)
d = dot_cpp.product(a, b)
print("ans in gpu:")
print(d)
```
- The results are as follows:
ans in pytorch: tensor([-5.9086e-02, -4.3276e+00, -6.5376e-01, 5.0014e-01, -1.2216e-01, 8.5097e-02, -1.4941e+00, 3.5625e+00, 1.2412e-03, 4.9355e-01, -4.8173e-01, 1.3379e-01, 6.8660e-01, -3.0867e-01, 4.1459e-01, 3.8146e-01, 2.6874e-01, -1.0085e-01, -1.9247e-01, -3.8177e-01, -7.2695e-01, -7.9857e-01, 9.2179e-01, -4.4537e-01, 1.2229e+00, -1.9606e+00, 2.1500e+00, 6.2939e-02, -2.9404e-02, -1.6333e-01, 5.8653e-01, -3.0282e-01, 1.7500e+00, -1.9485e+00, 1.0097e+00, -2.9966e-01, 5.1717e-01, 8.6291e-01, 1.4203e+00, 1.5049e-01, 4.0039e-01, -2.1761e-01, -2.7387e-02, -5.7702e-01, 5.4926e-02, -2.1086e-01, -2.1043e-01, -4.2422e-01, 3.1212e-02, -3.5714e-01, 7.3226e-01, 1.7916e+00, -8.3882e-02, 1.7431e+00, 7.5411e-02, 1.4379e-01, -2.1750e+00, 5.3509e-01, 1.9931e+00, -1.0812e+00, 9.5756e-01, -2.2465e-01, -2.7048e-01, -5.4887e-01, 4.8681e-01, -5.7749e-02, 8.6992e-02, -7.8780e-01, 1.3495e+00, -7.5135e-02, 6.2448e-01, -1.1303e-02, -1.0266e-01, -1.4959e+00, -1.6517e+00, 1.1846e-01, 1.5355e+00, -4.2969e-01, 2.9539e-01, -5.9056e-01, 1.0564e+00, -5.7899e-01, 1.7013e-02, 5.1986e-01, -4.7120e-02, -3.4399e-02, -1.4235e-01, -1.4144e+00, 5.1103e-01, 7.2233e-01, -6.0687e-01, -8.2988e-01, -2.7205e-01, 1.0952e+00, -9.7423e-02, 4.9439e-02, -1.7460e-02, 2.0516e-01, -7.8793e-01, -1.8765e+00]) ans in gpu: tensor([-5.9086e-02, -4.3276e+00, -6.5376e-01, 5.0014e-01, -1.2216e-01, 8.5097e-02, -1.4941e+00, 3.5625e+00, 1.2412e-03, 4.9355e-01, -4.8173e-01, 1.3379e-01, 6.8660e-01, -3.0867e-01, 4.1459e-01, 3.8146e-01, 2.6874e-01, -1.0085e-01, -1.9247e-01, -3.8177e-01, -7.2695e-01, -7.9857e-01, 9.2179e-01, -4.4537e-01, 1.2229e+00, -1.9606e+00, 2.1500e+00, 6.2939e-02, -2.9404e-02, -1.6333e-01, 5.8653e-01, -3.0282e-01, 1.7500e+00, -1.9485e+00, 1.0097e+00, -2.9966e-01, 5.1717e-01, 8.6291e-01, 1.4203e+00, 1.5049e-01, 4.0039e-01, -2.1761e-01, -2.7387e-02, -5.7702e-01, 5.4926e-02, -2.1086e-01, -2.1043e-01, -4.2422e-01, 3.1212e-02, -3.5714e-01, 7.3226e-01, 1.7916e+00, -8.3882e-02, 1.7431e+00, 7.5411e-02, 1.4379e-01, -2.1750e+00, 5.3509e-01, 1.9931e+00, -1.0812e+00, 9.5756e-01, -2.2465e-01, -2.7048e-01, -5.4887e-01, 4.8681e-01, -5.7749e-02, 8.6992e-02, -7.8780e-01, 1.3495e+00, -7.5135e-02, 6.2448e-01, -1.1303e-02, -1.0266e-01, -1.4959e+00, -1.6517e+00, 1.1846e-01, 1.5355e+00, -4.2969e-01, 2.9539e-01, -5.9056e-01, 1.0564e+00, -5.7899e-01, 1.7013e-02, 5.1986e-01, -4.7120e-02, -3.4399e-02, -1.4235e-01, -1.4144e+00, 5.1103e-01, 7.2233e-01, -6.0687e-01, -8.2988e-01, -2.7205e-01, 1.0952e+00, -9.7423e-02, 4.9439e-02, -1.7460e-02, 2.0516e-01, -7.8793e-01, -1.8765e+00])
- Verify the speed of the QPU computation
- Here the C++-side operators are rewritten so that the measured computation time is returned to the Python side; a C++ CPU operator is also written for comparison (a sketch of such a timing operator is given after the results below)
- The times compared here cover only the computation on the C++ side; in practice, calling the GPU operator from Python also spends time in the call itself and in the communication between the CPU and the GPU
- This comparison shows the potential performance gain from writing and integrating operators on the C++ side
```python
# time.py
# Only the time spent computing the matrix multiplication is measured;
# it is obtained through the operators' return values
import torch
import time
import cpu_cpp
import matrix_cpp

for i in range(0, 6):
    a = torch.randn(100 * 10**i)
    b = torch.randn(10**i)
    gpu = matrix_cpp.gpu(a, b)
    cpu = cpu_cpp.cpu(a, b)
    print("cpu 100 * 10 ** %d takes %d.%06d" % (i, cpu[0], cpu[1]))
    print("gpu 100 * 10 ** %d takes %d.%06d" % (i, gpu[0], gpu[1]))
```
- The results are as follows:
```
cpu 100 * 10 ** 0 takes 0.000004
gpu 100 * 10 ** 0 takes 0.000164
cpu 100 * 10 ** 1 takes 0.000023
gpu 100 * 10 ** 1 takes 0.000169
cpu 100 * 10 ** 2 takes 0.000206
gpu 100 * 10 ** 2 takes 0.000171
cpu 100 * 10 ** 3 takes 0.002116
gpu 100 * 10 ** 3 takes 0.000388
cpu 100 * 10 ** 4 takes 0.021245
gpu 100 * 10 ** 4 takes 0.003079
cpu 100 * 10 ** 5 takes 0.214486
gpu 100 * 10 ** 5 takes 0.029622
```
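The cpu_cpp and matrix_cpp timing operators themselves are not listed above. For reference, here is a hypothetical sketch of what the CPU-side timing operator could look like; the function name cpu and the {seconds, microseconds} return convention are assumptions inferred from how time.py reads cpu[0] and cpu[1], not the project's actual code.

```cpp
// cpu_timing.cpp -- hypothetical sketch of a CPU comparison operator that
// returns the elapsed computation time as a {seconds, microseconds} tensor,
// matching how time.py prints cpu[0].cpu[1]. Not the project's actual code.
#include <torch/extension.h>
#include <vector>
#include <sys/time.h>

torch::Tensor cpu_matvec_time(torch::Tensor input, torch::Tensor weight)
{
  input = input.to(torch::kFloat32);
  weight = weight.to(torch::kFloat32);
  int width = weight.numel();
  int height = input.numel() / width;
  float *input_ptr = (float *)input.data_ptr();
  float *weight_ptr = (float *)weight.data_ptr();

  struct timeval start, end;
  gettimeofday(&start, NULL);

  // Plain (H x W) * (W x 1) matrix-vector product on the CPU
  std::vector<float> out(height, 0.0f);
  for (int i = 0; i < height; ++i)
    for (int j = 0; j < width; ++j)
      out[i] += input_ptr[i * width + j] * weight_ptr[j];

  gettimeofday(&end, NULL);
  long sec = end.tv_sec - start.tv_sec;
  long usec = end.tv_usec - start.tv_usec;
  if (usec < 0) { sec -= 1; usec += 1000000; }

  // Return {seconds, microseconds} so Python can format them as "%d.%06d"
  torch::Tensor t = torch::zeros(2);
  float *t_ptr = (float *)t.data_ptr();
  t_ptr[0] = (float)sec;
  t_ptr[1] = (float)usec;
  return t;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
  m.def("cpu", &cpu_matvec_time, "CPU matrix-vector product, returns elapsed time");
}
```

The GPU-side timing operator would wrap the accumartix call in matrix.cpp with the same gettimeofday bookkeeping.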
Closing remarks
This project is a course project for the embedded systems class, and the results are preliminary.
Thanks to my group members lxk and cxf~
Thanks to Mr. Yang and Mr. Lv for their guidance and help~