Accelerating PyTorch computation on the Raspberry Pi 3B embedded system through its GPU
General idea
- The cross-compilation environment is a Raspberry Pi 3B Docker image
- The open-source GPU library QPULib on GitHub is used
- PyTorch is installed on the Raspberry Pi, with the build accelerated through cross compilation
- C++ programs are compiled and registered through the extension interface provided by PyTorch
- PyTorch is run directly from the Docker image's file system mounted over NFS
Concrete implementation
1. Install the cross-compilation environment
The specific process was inspired by this blog; the way I found that blog was as follows
- Searching for a precompiled Raspberry Pi 3B Python wheel turned up this post on the Python forum
- In that post, a user named choonkiatlee mentioned his compiled wheel and Docker image
- The blog above was then found through that user's GitHub pages
Install Docker for Windows on my host computer (a Windows machine), and configure image acceleration and a hosted registry
Hosted registry service
- Sign in to Alibaba Cloud and open the Container Image Service
- Select a personal instance
- Create a personal namespace under repository management; in my case it is ja1zhou
- Create an access credential and choose a fixed password; this is the password used to verify the local login
- Start the Docker engine on the host and enter the following in a terminal:

```bash
sudo docker login --username=your_username registry.cn-beijing.aliyuncs.com
# then enter the password set above
```

  Once this succeeds, you are logged in to Alibaba Cloud
- Create a new repository in the image repository list, myrasp in my case
Image acceleration
- Configure the registry mirror according to the Alibaba Cloud official documentation
- Directly modify the daemon.json file of Docker for Windows, adding:

```json
{ "registry-mirrors": ["https://your_server_name.mirror.aliyuncs.com"] }
```
Pull the working image, configure a shared volume, and set up the working environment
- Pull the image, create a shared volume from the Downloads folder, and start the container:

```bash
docker pull choonkiatlee/raspbian:build
# The purpose of the volume is to share files from the Downloads folder, such as the wheel and QPULib, directly with Docker
docker volume create --driver local -o o=bind -o type=none -o device="C:\Users\MyUsername\Downloads\rasp_docker" rasp_docker
# torch-1.4.0a0+7f73f1d-cp37-cp37m-linux_armv7l.whl has been downloaded and placed in the created volume
docker run -it --name rasp -v rasp_docker:/root/rasp_docker choonkiatlee/raspbian:build
```
- After entering the Docker container, the following commands are needed for configuration; they supplement the blog above:

```bash
# The Python version in this Docker image is 3.7.3
apt-get update && apt-get install -y python3-numpy   # torch needs numpy support
cd /root/rasp_docker
pip3 install torch-1.4.0a0+7f73f1d-cp37-cp37m-linux_armv7l.whl
```
- Verify:

```
root@3288e690face:~/rasp_docker# python3
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.4.0a0+7f73f1d'
```
- torch has been installed successfully
- Commit this version of the Docker image as the starting point for subsequent development:

```bash
docker commit rasp registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
docker push registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
```
(Optional) compile PyTorch from source (not yet reproduced successfully)
- Set up a proxy, and allow the proxy software through the intranet and public-network firewalls
- Configure the proxy software to accept connections from the LAN
- In the Docker image, set the environment variable:

```bash
export all_proxy="http://host_ip:host_port"
```
- Download and build:

```bash
cd /root
git clone https://github.com/pytorch/pytorch.git
cd pytorch
git checkout v1.4.0
git submodule sync
git submodule update --init --recursive
apt install -y python3-cffi python3-numpy libatlas-base-dev
pip3 install cython wheel pyyaml pillow
# Choose not to build any of the optional add-ons
export USE_CUDA=0
export USE_CUDNN=0
export USE_MKLDNN=0
export USE_METAL=0
export USE_NCCL=OFF
export USE_NNPACK=0
export USE_QNNPACK=0
export USE_DISTRIBUTED=0
export BUILD_TEST=0
export MAX_JOBS=8
python3 setup.py install
```
- The build error is related to the protobuf submodule. The blog's author ran into the same problem at the time, but it is unclear which protobuf tag he used when compiling; he has filed an issue on GitHub
2. Download QPULib and write code against its API
QPULib code structure
```
QPULib/
  Lib/
    Subdirectories/
      *.cpp
      *.h
    *.cpp
    *.h
  Doc/          (irrelevant here)
  Tests/
    *.cpp
    Makefile
```
QPU acceleration principle
- The QPU is a vector processor developed by Broadcom. Its instructions operate on 16-element vectors of 32-bit integer or floating-point values. For example, given two 16-element vectors

  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

  and

  20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

  the QPU's integer-add instruction computes a third vector

  30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60

  where each element of the output is the sum of the corresponding elements of the inputs.
- Each 16-element vector is composed of four quarter-vector parts.
- A QPU processes one quarter-vector per clock cycle, so an instruction takes four consecutive clock cycles to deliver a full 16-element result vector; this is the origin of the name "QPU" (quad processing unit).
- The Pi contains 12 QPUs in total, each running at 250 MHz. That gives a maximum throughput of 750M vector instructions per second (250M cycles divided by 4 cycles per instruction, multiplied by 12 QPUs), or 12B operations per second (750M instructions multiplied by 16 vector elements). In some cases a QPU instruction can deliver two results at a time, so the Pi's QPUs are often quoted at 24 GFLOPS.
- The QPUs are part of the Raspberry Pi's graphics pipeline. For efficient graphics you probably want OpenGL ES, but if you only want to accelerate the non-graphics parts of a Pi project, QPULib is worth a look.
- To avoid blocking during computation, pipelining can be introduced: while the current batch of data is being processed, the data needed for the next computation is fetched (see the sketch after this list).
- Multi-QPU parallelism can also be introduced, with each QPU working on a different region of the data each time, so the GPU is used efficiently.
- Reading the Makefile shows that, at compile time, the include directory is Lib/, and all the *.cpp files under Lib/ are first compiled into *.o files
- Finally, the *.o files are linked together into the executables
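To make the pipelining and multi-QPU ideas concrete before the full operators, here is a minimal QPULib kernel sketch. It follows the same pattern as the dotadd kernel in dot.cpp below; the name addPipelined is only illustrative.

```cpp
// Sketch of a pipelined, multi-QPU element-wise add (illustrative only;
// the full versions used in this project appear in dot.cpp below).
#include <QPULib.h>

void addPipelined(Int n, Ptr<Float> x, Ptr<Float> y)
{
  Int inc = numQPUs() << 4;                    // stride: 16 elements per QPU per step
  Ptr<Float> p = x + index() + (me() << 4);    // each QPU starts on its own 16-element slice
  Ptr<Float> q = y + index() + (me() << 4);
  gather(p); gather(q);                        // prefetch the first slices

  Float a, b;
  For (Int i = 0, i < n, i = i + inc)
    gather(p + inc); gather(q + inc);          // prefetch the next slices...
    receive(a); receive(b);                    // ...while receiving the current ones
    store(a + b, p);                           // write the current result back
    p = p + inc; q = q + inc;
  End
  receive(a); receive(b);                      // drain the last outstanding prefetch
}
```

Each QPU starts 16 elements apart (offset by me() << 4), and every iteration issues the gather for the next slice before receiving the current one, so memory loads overlap with computation.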
Compiling the dynamic link library
- Following this idea, first compile all the .cpp files under Lib/ into .o files and combine them into a dynamic link library, so that they do not have to be recompiled every time something is registered with PyTorch. The Makefile is rewritten as follows:
```makefile
# Root directory of QPULib repository
ROOT = ../Lib

# Compiler and default flags
CXX = g++
CXX_FLAGS = -fpermissive -Wconversion -std=c++0x -I $(ROOT)

# Object directory
OBJ_DIR = obj

# Debug mode
ifeq ($(DEBUG), 1)
  CXX_FLAGS += -DDEBUG
  OBJ_DIR := $(OBJ_DIR)-debug
endif

# QPU or emulation mode
ifeq ($(QPU), 1)
  CXX_FLAGS += -DQPU_MODE
  OBJ_DIR := $(OBJ_DIR)-qpu
else
  CXX_FLAGS += -DEMULATION_MODE
endif

# Object files
OBJ = \
  Kernel.o \
  Source/Syntax.o \
  Source/Int.o \
  Source/Float.o \
  Source/Stmt.o \
  Source/Pretty.o \
  Source/Translate.o \
  Source/Interpreter.o \
  Source/Gen.o \
  Target/Syntax.o \
  Target/SmallLiteral.o \
  Target/Pretty.o \
  Target/RemoveLabels.o \
  Target/CFG.o \
  Target/Liveness.o \
  Target/RegAlloc.o \
  Target/ReachingDefs.o \
  Target/Subst.o \
  Target/LiveRangeSplit.o \
  Target/Satisfy.o \
  Target/LoadStore.o \
  Target/Emulator.o \
  Target/Encode.o \
  VideoCore/Mailbox.o \
  VideoCore/Invoke.o \
  VideoCore/VideoCore.o

# Top-level targets
.PHONY: top clean

LIB = $(patsubst %,$(OBJ_DIR)/%,$(OBJ))

top: $(LIB)
	@$(CXX) $(CXX_FLAGS) -shared -fPIC $^ -o libqpu.so

clean:
	rm -rf obj obj-debug obj-qpu obj-debug-qpu
	rm -f Tri GCD Print MultiTri AutoTest OET Hello ReqRecv Rot3D ID *.o
	rm -f HeatMap
	rm -f libqpu.so

# Intermediate targets
$(OBJ_DIR)/%.o: $(ROOT)/%.cpp $(OBJ_DIR)
	@echo Compiling $<
	@$(CXX) -c -o $@ $< $(CXX_FLAGS)

%.o: %.cpp
	@echo Compiling $<
	@$(CXX) -c -o $@ $< $(CXX_FLAGS)

$(OBJ_DIR):
	@mkdir -p $(OBJ_DIR)
	@mkdir -p $(OBJ_DIR)/Source
	@mkdir -p $(OBJ_DIR)/Target
	@mkdir -p $(OBJ_DIR)/VideoCore
```
- With this Makefile, all the files under Lib/ are packaged into the dynamic link library libqpu.so
- Add the dynamic link library to the system library path:

```bash
vim /etc/ld.so.conf
# add a new line containing the path that holds libqpu.so: /root/embedded_project/
# :wq to save and exit
ldconfig   # refresh the dynamic library cache
```
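As a quick sanity check that libqpu.so can be found and linked, a small stand-alone program can be built against it. This is my own minimal sketch (the file name, paths, and flags are illustrative assumptions), not one of the project files:

```cpp
// test_libqpu.cpp -- minimal link/run check for libqpu.so (illustrative sketch).
// Build roughly like (paths are assumptions based on the setup above):
//   g++ test_libqpu.cpp -std=c++0x -fpermissive -DQPU_MODE \
//       -I /root/QPULib/Lib -L /root/embedded_project -lqpu -o test_libqpu
#include <QPULib.h>
#include <stdio.h>

// Trivial kernel: element-wise add of two 16-element vectors, result stored over x
void addSimple(Ptr<Float> x, Ptr<Float> y)
{
  gather(x);
  gather(y);
  Float a, b;
  receive(a);
  receive(b);
  store(a + b, x);
}

int main()
{
  // Memory shared between the CPU and the QPUs
  SharedArray<float> a(16), b(16);
  for (int i = 0; i < 16; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

  auto k = compile(addSimple);   // compile the kernel
  k(&a, &b);                     // run it (one QPU is enough for 16 elements)

  for (int i = 0; i < 16; ++i)
    printf("%.1f ", (float)a[i]); // expected: 0.0 3.0 6.0 ...
  printf("\n");
  return 0;
}
```

On the Pi the QPU mailbox lives at /dev/vcio, so this has to run as root (or inside the chroot with /dev bind-mounted, as described in section 4).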
Write C++ code that calls the GPU for parallel computation
- Write a pipelined, multi-QPU C++ program for matrix operations
  - The program implements element-wise multiplication and element-wise addition of equal-sized matrices and returns the result
//"dot.cpp" #include <torch/extension.h> #include <vector> #include <QPULib.h> #include <stdio.h> #include <stdlib.h> #include <sys/time.h> const int NQPUS = 4; //Number of qpu s invoked void dotproduct(Int n, Ptr<Float> x, Ptr<Float> y) { Int inc = numQPUs() << 4; Ptr<Float> p = x + index() + (me() << 4); Ptr<Float> q = y + index() + (me() << 4); gather(p); gather(q); Float xOld, yOld; For (Int i = 0, i < n, i = i+inc) gather(p+inc); gather(q+inc);//Get the data required for the next operation receive(xOld); receive(yOld);//Calculate the data obtained before store(xOld * yOld, p); p = p+inc; q = q+inc; End receive(xOld); receive(yOld); } void dotadd(Int n, Ptr<Float> x, Ptr<Float> y) { Int inc = numQPUs() << 4; Ptr<Float> p = x + index() + (me() << 4); Ptr<Float> q = y + index() + (me() << 4); gather(p); gather(q); Float xOld, yOld; For (Int i = 0, i < n, i = i+inc) gather(p+inc); gather(q+inc); receive(xOld); receive(yOld); store(xOld + yOld, p); p = p+inc; q = q+inc; End receive(xOld); receive(yOld); } torch::Tensor dot_product(torch::Tensor input, torch::Tensor weight) { input = input.to(torch::kFloat32); weight = weight.to(torch::kFloat32); float *input_ptr = (float *)input.data_ptr(); float *weight_ptr = (float *)weight.data_ptr(); int width = weight.numel(); int width_16 = width + (16 - width % 16);//Convert the matrix length to a multiple of 16 SharedArray<float> mapA(width_16), mapB(width_16); for (int i = 0; i < width_16; ++i) { if (i < width) { mapA[i] = input_ptr[i]; mapB[i] = weight_ptr[i]; } else { mapA[i] = 0;//Insufficient zero filling mapB[i] = 0; } } auto k = compile(dotproduct); k.setNumQPUs(NQPUS); k(width, &mapA, &mapB); for (int i = 0; i < width; i++) { input_ptr[i] = mapA[i]; } return input; } torch::Tensor dot_add(torch::Tensor input, torch::Tensor weight) { input = input.to(torch::kFloat32); weight = weight.to(torch::kFloat32); float *input_ptr = (float *)input.data_ptr(); float *weight_ptr = (float *)weight.data_ptr(); int width = weight.numel(); int width_16 = width + (16 - width % 16); SharedArray<float> mapA(width_16), mapB(width_16); for (int i = 0; i < width_16; ++i) { if (i < width) { mapA[i] = input_ptr[i]; mapB[i] = weight_ptr[i]; } else { mapA[i] = 0; mapB[i] = 0; } } auto k = compile(dotadd); k.setNumQPUs(NQPUS); k(width, &mapA, &mapB); for (int i = 0; i < width; i++) { input_ptr[i] = mapA[i]; } return input; } PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) { m.def("add", &dot_add, "dot_add"); m.def("product", &dot_product, "dot_product"); }
- Write a C++ program that multiplies a two-dimensional matrix (H x W) by a one-dimensional matrix (W x 1) and returns an (H x 1) matrix
  - This program can be used to compute the score of each sample in deep learning
```cpp
// matrix.cpp
#include <torch/extension.h>
#include <vector>
#include <QPULib.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

const int NQPUS = 4;

struct Cursor
{
  Ptr<Float> addr;
  Float current, next;

  void init(Ptr<Float> p)
  {
    gather(p);
    current = 0;
    addr = p + 16;
  }

  void prime()
  {
    receive(next);
    gather(addr);
  }

  void advance()
  {
    addr = addr + 16;
    gather(addr);
    current = next;
    receive(next);
  }

  void finish()
  {
    receive(next);
  }
};

void step(Ptr<Float> map, Ptr<Float> weight, Ptr<Float> mapOut, Int pitch, Int width, Int height)
{
  Cursor row, cursorofweight;
  map = map + pitch * me() + index();

  For (Int y = me(), y < height, y = y + numQPUs())
    // Location where the result of this row times the weight vector is stored;
    // the result of each row occupies 16 elements
    Ptr<Float> p = mapOut + y * 16;

    // Initialize the cursors
    row.init(map);
    row.prime();
    cursorofweight.init(weight);
    cursorofweight.prime();

    // Compute the result of this row
    Float accumulate = 0;
    For (Int x = 0, x < width, x = x + 16)
      // In each iteration, compute with the current data, store it,
      // and fetch the data needed for the next iteration
      row.advance();
      cursorofweight.advance();
      accumulate = accumulate + row.current * cursorofweight.current;
    End

    // Store the result at p
    store(accumulate, p);

    // Release the cursors
    row.finish();
    cursorofweight.finish();

    map = map + pitch * numQPUs();
  End
}

torch::Tensor accumartix(torch::Tensor input, torch::Tensor weight)
{
  input = input.to(torch::kFloat32);
  weight = weight.to(torch::kFloat32);
  int width = weight.numel();
  int width_16 = width + (16 - width % 16);
  int height = input.numel() / width;
  float *input_ptr = (float *)input.data_ptr();
  float *weight_ptr = (float *)weight.data_ptr();

  // Create the vectors through which the CPU and the QPUs exchange data
  SharedArray<float> mapA(width_16 * height), mapB(width_16), sumofmartix(16 * height);
  for (int i = 0; i < height; ++i)
  {
    for (int j = 0; j < width_16; ++j)
    {
      if (j < width)
        mapA[i * width_16 + j] = input_ptr[i * width + j];
      else
        mapA[i * width_16 + j] = 0;
    }
  }
  for (int j = 0; j < height; ++j)
  {
    for (int i = 0; i < 16; ++i)
    {
      sumofmartix[16 * j + i] = 0;
    }
  }
  for (int j = 0; j < width_16; ++j)
  {
    if (j < width)
      mapB[j] = weight_ptr[j];
    else
      mapB[j] = 0;
  }

  auto k = compile(step);
  k.setNumQPUs(NQPUS);
  k(&mapA, &mapB, &sumofmartix, width_16, width, height);

  torch::Tensor ans = torch::zeros(height);
  float *ans_ptr = (float *)ans.data_ptr();
  for (int j = 0; j < height; ++j)
  {
    for (int i = 0; i < 16; ++i)
    {
      ans_ptr[j] += sumofmartix[16 * j + i];
    }
  }
  return ans;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
  m.def("gpu", &accumartix, "accumartix");
}
```
3. Register the C++ program with PyTorch
PyTorch official documentation
- According to the official PyTorch documentation, two files are required to register C++ code with PyTorch: a .cpp file and a setup.py file
Register the C++ program that calls QPULib
- The setup.py file needs to be adapted to QPULib's compilation flags and dependency paths:

```python
from setuptools import setup, Extension
from torch.utils import cpp_extension
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name='dot_cpp',
    ext_modules=[
        Extension(
            name='dot_cpp',
            sources=['dot.cpp'],
            include_dirs=cpp_extension.include_paths() + [
                "/root/QPULib/Lib",
                "/root/QPULib/Lib/Source",
                "/root/QPULib/Lib/Target",
                "/root/QPULib/Lib/VideoCore",
                "/root/QPULib/Lib/Common"],
            library_dirs=["."],
            libraries=["qpu"],
            language='c++',
            extra_compile_args=["-fpermissive", "-w", "-std=c++0x", "-DQPU_MODE"])
    ],
    cmdclass={
        'build_ext': BuildExtension
    })
```
- The files needed for registration are laid out as follows:

```
embedded_project/
  dot.cpp
  setup.py
  libqpu.so
```
- Execute the following in this folder:

```bash
python3 setup.py install
```
- After a successful build, test importing the module:

```python
import torch
import dot_cpp
```
- Verify on the board that no error is reported
4. Use the Raspberry Pi and NFS to mount the file system from the Docker image and verify that the steps above succeeded
Environment preparation
- Install Docker on the lab host, pull the Docker image, and use the docker cp command to copy the entire file system:

```bash
# Needed so that an x64 processor can emulate an ARM processor
sudo apt update && sudo apt install qemu qemu-user-static binfmt-support
docker pull registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
# A named volume can be used to share files between the container and the host
docker volume create --driver local -o o=bind -o type=none -o device="/home/jay/rasp_docker" rasp_volume
docker run -it --name rasp -v rasp_volume:/root registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
# Use docker cp to copy the file system
sudo docker cp rasp:/ /home/jay/rasp_docker_filesystem
```
- After the Raspberry Pi boots, mount /home/jay/rasp_docker_filesystem over NFS and chroot into it:

```bash
mount 192.168.0.101:/home/jay/rasp_docker_filesystem /mnt -o nolock
cd /mnt
mount --rbind /dev dev/   # bind-mount the original file system's /dev (including the GPU devices) into the chroot path
chroot .
```
- (Optional) proxy all traffic through iptables:

```bash
# Note: when booting through U-Boot, I set the Raspberry Pi's gateway to the host's intranet address,
# so only the host's iptables needs to be configured.
# The following is entered on the host:
iptables -t nat -A POSTROUTING -s 192.168.0.1/255.255.255.0 -j SNAT --to public_ip_of_host
```
Formal verification
- Verify the accuracy of the QPU computation:

```python
# accuracy.py
import torch
import time
import dot_cpp

a = torch.randn(100)
b = torch.randn(100)
c = a * b
print("ans in pytorch:")
print(c)
d = dot_cpp.product(a, b)
print("ans in gpu:")
print(d)
```
- The results are as follows:
ans in pytorch: tensor([-5.9086e-02, -4.3276e+00, -6.5376e-01, 5.0014e-01, -1.2216e-01, 8.5097e-02, -1.4941e+00, 3.5625e+00, 1.2412e-03, 4.9355e-01, -4.8173e-01, 1.3379e-01, 6.8660e-01, -3.0867e-01, 4.1459e-01, 3.8146e-01, 2.6874e-01, -1.0085e-01, -1.9247e-01, -3.8177e-01, -7.2695e-01, -7.9857e-01, 9.2179e-01, -4.4537e-01, 1.2229e+00, -1.9606e+00, 2.1500e+00, 6.2939e-02, -2.9404e-02, -1.6333e-01, 5.8653e-01, -3.0282e-01, 1.7500e+00, -1.9485e+00, 1.0097e+00, -2.9966e-01, 5.1717e-01, 8.6291e-01, 1.4203e+00, 1.5049e-01, 4.0039e-01, -2.1761e-01, -2.7387e-02, -5.7702e-01, 5.4926e-02, -2.1086e-01, -2.1043e-01, -4.2422e-01, 3.1212e-02, -3.5714e-01, 7.3226e-01, 1.7916e+00, -8.3882e-02, 1.7431e+00, 7.5411e-02, 1.4379e-01, -2.1750e+00, 5.3509e-01, 1.9931e+00, -1.0812e+00, 9.5756e-01, -2.2465e-01, -2.7048e-01, -5.4887e-01, 4.8681e-01, -5.7749e-02, 8.6992e-02, -7.8780e-01, 1.3495e+00, -7.5135e-02, 6.2448e-01, -1.1303e-02, -1.0266e-01, -1.4959e+00, -1.6517e+00, 1.1846e-01, 1.5355e+00, -4.2969e-01, 2.9539e-01, -5.9056e-01, 1.0564e+00, -5.7899e-01, 1.7013e-02, 5.1986e-01, -4.7120e-02, -3.4399e-02, -1.4235e-01, -1.4144e+00, 5.1103e-01, 7.2233e-01, -6.0687e-01, -8.2988e-01, -2.7205e-01, 1.0952e+00, -9.7423e-02, 4.9439e-02, -1.7460e-02, 2.0516e-01, -7.8793e-01, -1.8765e+00]) ans in gpu: tensor([-5.9086e-02, -4.3276e+00, -6.5376e-01, 5.0014e-01, -1.2216e-01, 8.5097e-02, -1.4941e+00, 3.5625e+00, 1.2412e-03, 4.9355e-01, -4.8173e-01, 1.3379e-01, 6.8660e-01, -3.0867e-01, 4.1459e-01, 3.8146e-01, 2.6874e-01, -1.0085e-01, -1.9247e-01, -3.8177e-01, -7.2695e-01, -7.9857e-01, 9.2179e-01, -4.4537e-01, 1.2229e+00, -1.9606e+00, 2.1500e+00, 6.2939e-02, -2.9404e-02, -1.6333e-01, 5.8653e-01, -3.0282e-01, 1.7500e+00, -1.9485e+00, 1.0097e+00, -2.9966e-01, 5.1717e-01, 8.6291e-01, 1.4203e+00, 1.5049e-01, 4.0039e-01, -2.1761e-01, -2.7387e-02, -5.7702e-01, 5.4926e-02, -2.1086e-01, -2.1043e-01, -4.2422e-01, 3.1212e-02, -3.5714e-01, 7.3226e-01, 1.7916e+00, -8.3882e-02, 1.7431e+00, 7.5411e-02, 1.4379e-01, -2.1750e+00, 5.3509e-01, 1.9931e+00, -1.0812e+00, 9.5756e-01, -2.2465e-01, -2.7048e-01, -5.4887e-01, 4.8681e-01, -5.7749e-02, 8.6992e-02, -7.8780e-01, 1.3495e+00, -7.5135e-02, 6.2448e-01, -1.1303e-02, -1.0266e-01, -1.4959e+00, -1.6517e+00, 1.1846e-01, 1.5355e+00, -4.2969e-01, 2.9539e-01, -5.9056e-01, 1.0564e+00, -5.7899e-01, 1.7013e-02, 5.1986e-01, -4.7120e-02, -3.4399e-02, -1.4235e-01, -1.4144e+00, 5.1103e-01, 7.2233e-01, -6.0687e-01, -8.2988e-01, -2.7205e-01, 1.0952e+00, -9.7423e-02, 4.9439e-02, -1.7460e-02, 2.0516e-01, -7.8793e-01, -1.8765e+00])
- Verify the speed of the QPU computation
- Here the C++-side operators are rewritten so that the measured computation time is returned to the Python side; a C++ CPU operator is also written for comparison (a sketch of such a timing operator is given after the results below)
- The times compared here cover only the computation on the C++ side; in practice, calling the GPU operator from Python also spends time in the call itself and in the communication between the CPU and the GPU
- This comparison shows the potential performance gain from writing and integrating operators on the C++ side
```python
# time.py
# Only the time spent computing the matrix multiplication is measured;
# it is obtained through the operators' return values
import torch
import time
import cpu_cpp
import matrix_cpp

for i in range(0, 6):
    a = torch.randn(100 * 10**i)
    b = torch.randn(10**i)
    gpu = matrix_cpp.gpu(a, b)
    cpu = cpu_cpp.cpu(a, b)
    print("cpu 100 * 10 ** %d takes %d.%06d" % (i, cpu[0], cpu[1]))
    print("gpu 100 * 10 ** %d takes %d.%06d" % (i, gpu[0], gpu[1]))
```
- The results are as follows:
```
cpu 100 * 10 ** 0 takes 0.000004
gpu 100 * 10 ** 0 takes 0.000164
cpu 100 * 10 ** 1 takes 0.000023
gpu 100 * 10 ** 1 takes 0.000169
cpu 100 * 10 ** 2 takes 0.000206
gpu 100 * 10 ** 2 takes 0.000171
cpu 100 * 10 ** 3 takes 0.002116
gpu 100 * 10 ** 3 takes 0.000388
cpu 100 * 10 ** 4 takes 0.021245
gpu 100 * 10 ** 4 takes 0.003079
cpu 100 * 10 ** 5 takes 0.214486
gpu 100 * 10 ** 5 takes 0.029622
```
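The cpu_cpp and matrix_cpp timing operators themselves are not listed above. For reference, here is a hypothetical sketch of what the CPU-side timing operator could look like; the function name cpu and the {seconds, microseconds} return convention are assumptions inferred from how time.py reads cpu[0] and cpu[1], not the project's actual code.

```cpp
// cpu_timing.cpp -- hypothetical sketch of a CPU comparison operator that
// returns the elapsed computation time as a {seconds, microseconds} tensor,
// matching how time.py prints cpu[0].cpu[1]. Not the project's actual code.
#include <torch/extension.h>
#include <vector>
#include <sys/time.h>

torch::Tensor cpu_matvec_time(torch::Tensor input, torch::Tensor weight)
{
  input = input.to(torch::kFloat32);
  weight = weight.to(torch::kFloat32);
  int width = weight.numel();
  int height = input.numel() / width;
  float *input_ptr = (float *)input.data_ptr();
  float *weight_ptr = (float *)weight.data_ptr();

  struct timeval start, end;
  gettimeofday(&start, NULL);

  // Plain (H x W) * (W x 1) matrix-vector product on the CPU
  std::vector<float> out(height, 0.0f);
  for (int i = 0; i < height; ++i)
    for (int j = 0; j < width; ++j)
      out[i] += input_ptr[i * width + j] * weight_ptr[j];

  gettimeofday(&end, NULL);
  long sec = end.tv_sec - start.tv_sec;
  long usec = end.tv_usec - start.tv_usec;
  if (usec < 0) { sec -= 1; usec += 1000000; }

  // Return {seconds, microseconds} so Python can format them as "%d.%06d"
  torch::Tensor t = torch::zeros(2);
  float *t_ptr = (float *)t.data_ptr();
  t_ptr[0] = (float)sec;
  t_ptr[1] = (float)usec;
  return t;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
  m.def("cpu", &cpu_matvec_time, "CPU matrix-vector product, returns elapsed time");
}
```

The GPU-side timing operator would wrap the accumartix call in matrix.cpp with the same gettimeofday bookkeeping.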
Closing remarks
This project is a course project for the embedded systems class, and the results are preliminary.
Thanks to my group members lxk and cxf~
Thanks to Mr. Yang and Mr. Lv for their guidance and help~