Raspberry Pi 3B: accelerating PyTorch computation with the GPU

Posted by helbom on Fri, 14 Jan 2022 13:05:14 +0100

Accelerating PyTorch computation through the GPU on the embedded Raspberry Pi 3B

General idea

  • The cross-compilation environment is a Raspberry Pi 3B Docker image
  • The open-source GPU library QPULib on GitHub is used
  • PyTorch is installed on the Raspberry Pi, with cross-compilation speeding up the process
  • C++ programs are compiled and registered through the extension interface provided by PyTorch
  • PyTorch is run directly on the Pi through the Docker image's file system, mounted over NFS

Concrete implementation

1. Install the cross-compilation environment

The specific process was inspired by this blog. The process of finding the blog was as follows:

  • Searched in a search engine for a precompiled Raspberry Pi 3B Python wheel and found this post on the Python forum
  • In that post, a user named choonkiatlee mentioned his compiled wheel and Docker image
  • The blog above was found through that user's GitHub blog

Install Docker for Windows on my host (a Windows machine) and configure image acceleration and hosting

Hosting service
  • Sign in to Alibaba Cloud and click Container Registry (container image service)

  • Select personal instance

  • Create a personal namespace under repository management; in my case it is ja1zhou

  • Create access credentials and choose a fixed password; this is the password used to authenticate when logging in locally

  • Start the Docker engine on the host and enter the following in a terminal

    sudo docker login --username=your_username registry.cn-beijing.aliyuncs.com
    #Then enter the password set above
    

    Successfully logged in to Alibaba Cloud

  • Create a new repository in the image registry, named myrasp in my case

Image acceleration
  • Configure image acceleration according to the Alibaba Cloud official documentation

  • Directly modify the daemon.json file of Docker for Windows, adding

    {
      "registry-mirrors": ["https://your_server_name.mirror.aliyuncs.com"]
    }
    

Pull the working image, configure a shared volume, and set up the working environment

  • docker pull choonkiatlee/raspbian:build
    docker volume create --driver local -o o=bind -o type=none -o device="C:\Users\MyUsername\Downloads\rasp_docker" rasp_docker
    #This volume shares files from the Downloads folder, such as the wheel and QPULib, directly with Docker
    #torch-1.4.0a0+7f73f1d-cp37-cp37m-linux_armv7l.whl is downloaded and placed in the created volume
    docker run -it --name rasp -v rasp_docker:/root/rasp_docker choonkiatlee/raspbian:build
    
  • After entering the Docker container, run the following commands; they supplement the blog above

    #The Python version in the Docker image is 3.7.3
    apt-get update && apt-get install -y python3-numpy
    #torch requires numpy at runtime
    cd /root/rasp_docker
    pip3 install torch-1.4.0a0+7f73f1d-cp37-cp37m-linux_armv7l.whl
    
  • Verify

    root@3288e690face:~/rasp_docker# python3
    Python 3.7.3 (default, Jan 22 2021, 20:04:44) 
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>> torch.__version__
    '1.4.0a0+7f73f1d'
    
  • torch has been successfully installed

  • Commit and push this Docker image as the starting point for subsequent development

    docker commit rasp registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
    docker push registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
    

(Optional) Compile PyTorch from source (not yet successfully reproduced)

  • Set up a proxy so that the proxy software can traverse the intranet and public-network firewalls

  • Configure the proxy software to allow connections from the LAN

  • In the docker image, set the environment variable

    export all_proxy="http://host_ip:host_port"
    
  • Download and prepare

    cd /root
    git clone https://github.com/pytorch/pytorch.git
    cd pytorch
    git checkout v1.4.0
    git submodule sync
    git submodule update --init --recursive
    apt install -y python3-cffi python3-numpy libatlas-base-dev
    pip3 install cython wheel pyyaml pillow
    #Choose not to compile the optional components
    export USE_CUDA=0
    export USE_CUDNN=0
    export USE_MKLDNN=0
    export USE_METAL=0
    export USE_NCCL=OFF
    export USE_NNPACK=0
    export USE_QNNPACK=0
    export USE_DISTRIBUTED=0
    export BUILD_TEST=0
    export MAX_JOBS=8
    python3 setup.py install
    
  • The build error is related to the protobuf submodule. The blog author also encountered this problem at the time, but I don't know which protobuf tag he used when compiling; he has submitted an issue on GitHub

2. Download QPULib and write code against its API

QPULib code structure

QPULib/
  Lib/
  	Subdirectories/
  		*.cpp
  		*.h
  	*.cpp
  	*.h
  Doc/
  	irrelevant
  Tests/
  	*.cpp
  	Makefile

QPU acceleration principle

  • The QPU is a vector processor developed by Broadcom. Its instructions operate on 16-element vectors of 32-bit integers or floating-point values. For example, given two 16-element vectors

    10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

    and

    20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

    QPU's integer addition instruction calculates the third vector

    30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60

    Where each element in the output is the sum of the corresponding two elements in the input.

    • Each 16-element vector is composed of four quarters (4-element sub-vectors).

    • The QPU processes one quarter of a vector per clock cycle, so an instruction takes four consecutive clock cycles to deliver the complete 16-element result vector; this is the origin of the name "QPU" (Quad Processing Unit).

    • The Pi contains 12 QPUs in total, each running at 250 MHz, giving a maximum throughput of 750M vector instructions per second (250M cycles per second divided by 4 cycles per instruction, times 12 QPUs), or 12B operations per second (750M instructions times 16 vector elements). In some cases a QPU instruction can deliver two results at a time, which is why the Pi's QPUs are usually rated at 24 GFLOPS.

    • The QPUs are part of the Raspberry Pi's graphics pipeline. If you want efficient graphics on the Pi, you probably want OpenGL ES; but if you only want to accelerate the non-graphics parts of a Pi project, QPULib is worth a look.

  • To avoid stalling during computation, pipelining can be introduced.

    • While computing on the current data, fetch the data needed for the next computation
  • Multi-QPU parallelism can also be introduced, with different QPUs working on different regions of the data, so that the GPU is used efficiently (a minimal kernel sketch illustrating this programming model follows this list).

  • Reading the Makefile shows that at compile time the include directory is Lib/, and all the *.cpp files under Lib/ are first compiled into *.o files

  • Finally, the *.o files are linked into the executables
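
  • To make the programming model above concrete, here is a minimal sketch of a QPULib kernel that adds two 16-element vectors. It is not part of this project and only uses the QPULib primitives that also appear in the code later in this article (Ptr, index(), gather/receive/store, compile, SharedArray); the file name vecadd.cpp is just an example

    // vecadd.cpp - minimal illustrative sketch, not from the original project
    #include <QPULib.h>
    #include <stdio.h>
    
    // Kernel: the 16 vector lanes each add one element of x to one element of y
    void vecAdd(Ptr<Float> x, Ptr<Float> y)
    {
        gather(x + index());            // request the 16 elements of x
        gather(y + index());            // request the 16 elements of y
        Float a, b;
        receive(a);                     // wait for the gathered x values
        receive(b);                     // wait for the gathered y values
        store(a + b, x + index());      // write the element-wise sum back over x
    }
    
    int main()
    {
        auto k = compile(vecAdd);           // compile the kernel for the QPU
        SharedArray<float> x(16), y(16);    // memory shared between CPU and GPU
        for (int i = 0; i < 16; i++) { x[i] = 10.0f + i; y[i] = 20.0f + i; }
        k(&x, &y);                          // run the kernel
        for (int i = 0; i < 16; i++)
            printf("%.0f ", x[i]);          // expect 30 32 34 ... 60, as in the example above
        printf("\n");
        return 0;
    }
    
  • Built with the QPU=1 flags from the Makefile below (-DQPU_MODE) such a program runs on the physical QPUs; built with -DEMULATION_MODE it runs in QPULib's emulator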

Compiling dynamic link libraries

  • Following this idea, first compile all the .cpp files under Lib/ into .o files and package them into a dynamic link library, so they do not have to be recompiled every time something is registered with PyTorch. The Makefile is rewritten as follows

    # Root directory of QPULib repository
    ROOT = ../Lib
    
    # Compiler and default flags
    CXX = g++
    CXX_FLAGS = -fpermissive -Wconversion -std=c++0x -I $(ROOT)
    
    # Object directory
    OBJ_DIR = obj
    
    # Debug mode
    ifeq ($(DEBUG), 1)
      CXX_FLAGS += -DDEBUG
      OBJ_DIR := $(OBJ_DIR)-debug
    endif
    
    # QPU or emulation mode
    ifeq ($(QPU), 1)
      CXX_FLAGS += -DQPU_MODE
      OBJ_DIR := $(OBJ_DIR)-qpu
    else
      CXX_FLAGS += -DEMULATION_MODE
    endif
    
    # Object files
    OBJ =                         \
      Kernel.o                    \
      Source/Syntax.o             \
      Source/Int.o                \
      Source/Float.o              \
      Source/Stmt.o               \
      Source/Pretty.o             \
      Source/Translate.o          \
      Source/Interpreter.o        \
      Source/Gen.o                \
      Target/Syntax.o             \
      Target/SmallLiteral.o       \
      Target/Pretty.o             \
      Target/RemoveLabels.o       \
      Target/CFG.o                \
      Target/Liveness.o           \
      Target/RegAlloc.o           \
      Target/ReachingDefs.o       \
      Target/Subst.o              \
      Target/LiveRangeSplit.o     \
      Target/Satisfy.o            \
      Target/LoadStore.o          \
      Target/Emulator.o           \
      Target/Encode.o             \
      VideoCore/Mailbox.o         \
      VideoCore/Invoke.o          \
      VideoCore/VideoCore.o
    
    # Top-level targets
    
    .PHONY: top clean
    LIB = $(patsubst %,$(OBJ_DIR)/%,$(OBJ))
    top: $(LIB)
            @$(CXX) $(CXX_FLAGS) -shared -fPIC $^ -o libqpu.so
    
    clean:
            rm -rf obj obj-debug obj-qpu obj-debug-qpu
            rm -f Tri GCD Print MultiTri AutoTest OET Hello ReqRecv Rot3D ID *.o
            rm -f HeatMap
            rm -f libqpu.so
    # Intermediate targets
    
    $(OBJ_DIR)/%.o: $(ROOT)/%.cpp $(OBJ_DIR)
            @echo Compiling $<
            @$(CXX) -c -o $@ $< $(CXX_FLAGS)
    
    %.o: %.cpp
            @echo Compiling $<
            @$(CXX) -c -o $@ $< $(CXX_FLAGS)
    
    $(OBJ_DIR):
            @mkdir -p $(OBJ_DIR)
            @mkdir -p $(OBJ_DIR)/Source
            @mkdir -p $(OBJ_DIR)/Target
            @mkdir -p $(OBJ_DIR)/VideoCore
    
  • With this Makefile, all the files under Lib/ are packaged into the libqpu.so dynamic link library

  • Add the dynamic link library to the system library path

    vim /etc/ld.so.conf
    #Add a new line to the file containing the directory of libqpu.so
    /root/embedded_project/
    #:wq to save and exit
    ldconfig #Refresh the dynamic library cache
    

Write C++ code that calls the GPU for parallel computing

  • Write a pipelined C++ program for multi-QPU parallel element-wise matrix operations

    • The program implements element-wise multiplication and addition of equal-sized matrices and returns the result
    //"dot.cpp"
    #include <torch/extension.h>
    #include <vector>
    #include <QPULib.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    const int NQPUS = 4; // Number of QPUs invoked
    
    void dotproduct(Int n, Ptr<Float> x, Ptr<Float> y)
    {
        Int inc = numQPUs() << 4;
        Ptr<Float> p = x + index() + (me() << 4);
        Ptr<Float> q = y + index() + (me() << 4);
        gather(p); gather(q);
    
        Float xOld, yOld;
        For (Int i = 0, i < n, i = i+inc)
        gather(p+inc); gather(q+inc); // Prefetch the data needed for the next iteration
        receive(xOld); receive(yOld); // Receive the previously fetched data and compute with it
            store(xOld * yOld, p);
            p = p+inc; q = q+inc;
        End
    
        receive(xOld); receive(yOld);
    }
    
    void dotadd(Int n, Ptr<Float> x, Ptr<Float> y)
    {
        Int inc = numQPUs() << 4;
        Ptr<Float> p = x + index() + (me() << 4);
        Ptr<Float> q = y + index() + (me() << 4);
        gather(p); gather(q);
    
        Float xOld, yOld;
        For (Int i = 0, i < n, i = i+inc)
            gather(p+inc); gather(q+inc);
            receive(xOld); receive(yOld);
            store(xOld + yOld, p);
            p = p+inc; q = q+inc;
        End
    
        receive(xOld); receive(yOld);
    }
    
    torch::Tensor dot_product(torch::Tensor input, torch::Tensor weight)
    {
        input = input.to(torch::kFloat32);
        weight = weight.to(torch::kFloat32);
        float *input_ptr = (float *)input.data_ptr();
        float *weight_ptr = (float *)weight.data_ptr();
    
        int width = weight.numel();
        int width_16 = width + (16 - width % 16); // Pad the length up to a multiple of 16
        SharedArray<float> mapA(width_16), mapB(width_16);
    
        for (int i = 0; i < width_16; ++i)
        {
            if (i < width)
            {
                mapA[i] = input_ptr[i];
                mapB[i] = weight_ptr[i];
            }
            else
            {
                mapA[i] = 0; // Zero-pad the remainder
                mapB[i] = 0;
            }
        }
        auto k = compile(dotproduct);
    
        k.setNumQPUs(NQPUS);
    
        k(width, &mapA, &mapB);
    
        for (int i = 0; i < width; i++) {
            input_ptr[i] = mapA[i];
        }
        return input;
    }
    
    torch::Tensor dot_add(torch::Tensor input, torch::Tensor weight)
    {
        input = input.to(torch::kFloat32);
        weight = weight.to(torch::kFloat32);
        float *input_ptr = (float *)input.data_ptr();
        float *weight_ptr = (float *)weight.data_ptr();
    
        int width = weight.numel();
        int width_16 = width + (16 - width % 16);
        SharedArray<float> mapA(width_16), mapB(width_16);
    
        for (int i = 0; i < width_16; ++i)
        {
            if (i < width)
            {
                mapA[i] = input_ptr[i];
                mapB[i] = weight_ptr[i];
            }
            else
            {
                mapA[i] = 0;
                mapB[i] = 0;
            }
        }
        auto k = compile(dotadd);
    
        k.setNumQPUs(NQPUS);
    
        k(width, &mapA, &mapB);
    
        for (int i = 0; i < width; i++) {
            input_ptr[i] = mapA[i];
        }
        
        return input;
    }
    
    
    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("add", &dot_add, "dot_add");
      m.def("product", &dot_product, "dot_product");
    }
    
  • Write a C++ program that multiplies a two-dimensional matrix (H x W) by a one-dimensional vector (W x 1) and returns an (H x 1) result

    • This program can be used to compute the score of each sample in deep learning
    //matrix.cpp
    #include <torch/extension.h>
    #include <vector>
    #include <QPULib.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    const int NQPUS = 4;
    struct Cursor
    {
        Ptr<Float> addr;
        Float current, next;
    
        void init(Ptr<Float> p)
        {
            gather(p);
            current = 0;
            addr = p + 16;
        }
    
        void prime()
        {
            receive(next);
            gather(addr);
        }
    
        void advance()
        {
            addr = addr + 16;
            gather(addr);
            current = next;
            receive(next);
        }
    
        void finish()
        {
            receive(next);
        }
    };
    
    void step(Ptr<Float> map, Ptr<Float> weight, Ptr<Float> mapOut, Int pitch, Int width, Int height)
    {
        Cursor row, cursorofweight;
        map = map + pitch * me() + index();
        For(Int y = me(), y < height, y = y + numQPUs())
    
            // Location where the dot product of this row with the weight vector is stored
            // Each row's result occupies a 16-element slot
            Ptr<Float>
            p = mapOut + y * 16;
    
        // Initialize Cursor class
        row.init(map);
        row.prime();
        cursorofweight.init(weight);
        cursorofweight.prime();
    
        // Calculate the result of this row
        Float accumulate = 0;
        For(Int x = 0, x < width, x = x + 16)
            // In each iteration, the current result is calculated, stored, and the data required for the next calculation is obtained
            row.advance();
       		cursorofweight.advance();
        	accumulate = accumulate + row.current * cursorofweight.current;
    
        End
        // Store calculation results on p
        store(accumulate, p);
        // Release the Cursor
        row.finish();
        cursorofweight.finish();
        map = map + pitch * numQPUs();
    
        End
    }
    torch::Tensor accumartix(torch::Tensor input, torch::Tensor weight)
    {
        input = input.to(torch::kFloat32);
        weight = weight.to(torch::kFloat32);
        int width = weight.numel();
        int width_16 = width + (16 - width % 16);
        int height = input.numel() / width;
    
        float *input_ptr = (float *)input.data_ptr();
        float *weight_ptr = (float *)weight.data_ptr();
        // Create shared arrays for exchanging data between the QPU and the CPU
        SharedArray<float> mapA(width_16 * height), mapB(width_16), sumofmartix(16 * height);
        for (int i = 0; i < height; ++i)
        {
            for (int j = 0; j < width_16; ++j)
            {
                if (j < width)
                    mapA[i * width_16 + j] = input_ptr[i * width + j];
                else
                    mapA[i * width_16 + j] = 0;
            }
        }
    
        for (int j = 0; j < height; ++j)
        {
            for (int i = 0; i < 16; ++i)
            {
                sumofmartix[16 * j + i] = 0;
            }
        }
    
        for (int j = 0; j < width_16; ++j)
        {
            if (j < width)
                mapB[j] = weight_ptr[j];
            else
                mapB[j] = 0;
        }
        auto k = compile(step);
    
        k.setNumQPUs(NQPUS);
    
        k(&mapA, &mapB, &sumofmartix, width_16, width, height);
        torch::Tensor ans = torch::zeros(height);
        float *ans_ptr = (float *)ans.data_ptr();
        for (int j = 0; j < height; ++j)
        {
            for (int i = 0; i < 16; ++i)
            {
                ans_ptr[j] += sumofmartix[16 * j + i];
            }
        }
        return ans; 
    }
    
    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("gpu", &accumartix, "accumartix");
    }
    

3. Register the C++ program with PyTorch

PyTorch official documentation

  • According to the official PyTorch documentation, two files are needed to register C++ code with PyTorch: the .cpp file and a setup.py file

Register the C++ program that calls QPULib

  • The setup.py file needs to be adapted to QPULib's compilation flags and dependency paths

    from setuptools import setup, Extension
    from torch.utils import cpp_extension
    from torch.utils.cpp_extension import BuildExtension, CppExtension
    
    setup(
        name='dot_cpp',
        ext_modules=[
            Extension(
                name='dot_cpp',
                sources=['dot.cpp'],
                include_dirs=cpp_extension.include_paths()+["/root/QPULib/Lib", "/root/QPULib/Lib/Source",
                "/root/QPULib/Lib/Target","/root/QPULib/Lib/VideoCore","/root/QPULib/Lib/Common"],
                library_dirs=["."],
                libraries=["qpu"],
                language='c++',
                extra_compile_args = ["-fpermissive","-w","-std=c++0x","-DQPU_MODE"])
                
        ],
        cmdclass={
            'build_ext': BuildExtension
        })
    
    
  • The files needed for registration are laid out as follows

    embedded_project/
    	dot.cpp
    	setup.py
    	libqpu.so
    
  • Run the following in this folder

    python3 setup.py install
    
  • After a successful build, test importing the module

    import torch
    import dot_cpp
    
  • Verification on the board should report no errors

4. Use the Raspberry Pi and NFS to mount the file system from the Docker image and verify that the above works

Environment preparation

  • Install docker on the lab host, download the docker image, and use the docker cp command to copy the entire file system

    sudo apt update && sudo apt install qemu qemu-user-static binfmt-support
    #This makes it possible to emulate an ARM processor on an x64 host
    docker pull registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
    #You can use named volume to share files between container and host
    docker volume create --driver local -o o=bind -o type=none -o device="/home/jay/rasp_docker" rasp_volume
    docker run -it --name rasp -v rasp_volume:/root registry.cn-beijing.aliyuncs.com/ja1zhou/myrasp:torch
    #Use docker cp to copy the file system
    sudo docker cp rasp:/ /home/jay/rasp_docker_filesystem
    
    
  • After booting, mount /home/jay/rasp_docker_filesystem over NFS and chroot into it

    mount 192.168.0.101:/home/jay/rasp_docker_filesystem /mnt -o nolock
    cd /mnt
    mount --rbind /dev dev/ #Bind the original file system's /dev (including the GPU device) into the chroot path
    chroot .
    
  • (Optional) Proxy all traffic through iptables

    #Note: at U-Boot time I set the Raspberry Pi's gateway to the host's intranet address, so only the host's iptables needs configuring
    #The following command is entered on the host
    iptables -t nat -A POSTROUTING -s 192.168.0.1/255.255.255.0 -j SNAT --to public_ip_of_host
    

Formal verification

  • Verify the correctness of QPU computation

    # accuracy.py
    import torch
    import time
    import dot_cpp
    
    a = torch.randn(100)
    b = torch.randn(100)
    c = a * b
    print("ans in pytorch:")
    print(c)
    d = dot_cpp.product(a, b)
    print("ans in gpu:")
    print(d)
    
    • The results are as follows:
    ans in pytorch:
    tensor([-5.9086e-02, -4.3276e+00, -6.5376e-01,  5.0014e-01, -1.2216e-01,
             8.5097e-02, -1.4941e+00,  3.5625e+00,  1.2412e-03,  4.9355e-01,
            -4.8173e-01,  1.3379e-01,  6.8660e-01, -3.0867e-01,  4.1459e-01,
             3.8146e-01,  2.6874e-01, -1.0085e-01, -1.9247e-01, -3.8177e-01,
            -7.2695e-01, -7.9857e-01,  9.2179e-01, -4.4537e-01,  1.2229e+00,
            -1.9606e+00,  2.1500e+00,  6.2939e-02, -2.9404e-02, -1.6333e-01,
             5.8653e-01, -3.0282e-01,  1.7500e+00, -1.9485e+00,  1.0097e+00,
            -2.9966e-01,  5.1717e-01,  8.6291e-01,  1.4203e+00,  1.5049e-01,
             4.0039e-01, -2.1761e-01, -2.7387e-02, -5.7702e-01,  5.4926e-02,
            -2.1086e-01, -2.1043e-01, -4.2422e-01,  3.1212e-02, -3.5714e-01,
             7.3226e-01,  1.7916e+00, -8.3882e-02,  1.7431e+00,  7.5411e-02,
             1.4379e-01, -2.1750e+00,  5.3509e-01,  1.9931e+00, -1.0812e+00,
             9.5756e-01, -2.2465e-01, -2.7048e-01, -5.4887e-01,  4.8681e-01,
            -5.7749e-02,  8.6992e-02, -7.8780e-01,  1.3495e+00, -7.5135e-02,
             6.2448e-01, -1.1303e-02, -1.0266e-01, -1.4959e+00, -1.6517e+00,
             1.1846e-01,  1.5355e+00, -4.2969e-01,  2.9539e-01, -5.9056e-01,
             1.0564e+00, -5.7899e-01,  1.7013e-02,  5.1986e-01, -4.7120e-02,
            -3.4399e-02, -1.4235e-01, -1.4144e+00,  5.1103e-01,  7.2233e-01,
            -6.0687e-01, -8.2988e-01, -2.7205e-01,  1.0952e+00, -9.7423e-02,
             4.9439e-02, -1.7460e-02,  2.0516e-01, -7.8793e-01, -1.8765e+00])
    ans in gpu:
    tensor([-5.9086e-02, -4.3276e+00, -6.5376e-01,  5.0014e-01, -1.2216e-01,
             8.5097e-02, -1.4941e+00,  3.5625e+00,  1.2412e-03,  4.9355e-01,
            -4.8173e-01,  1.3379e-01,  6.8660e-01, -3.0867e-01,  4.1459e-01,
             3.8146e-01,  2.6874e-01, -1.0085e-01, -1.9247e-01, -3.8177e-01,
            -7.2695e-01, -7.9857e-01,  9.2179e-01, -4.4537e-01,  1.2229e+00,
            -1.9606e+00,  2.1500e+00,  6.2939e-02, -2.9404e-02, -1.6333e-01,
             5.8653e-01, -3.0282e-01,  1.7500e+00, -1.9485e+00,  1.0097e+00,
            -2.9966e-01,  5.1717e-01,  8.6291e-01,  1.4203e+00,  1.5049e-01,
             4.0039e-01, -2.1761e-01, -2.7387e-02, -5.7702e-01,  5.4926e-02,
            -2.1086e-01, -2.1043e-01, -4.2422e-01,  3.1212e-02, -3.5714e-01,
             7.3226e-01,  1.7916e+00, -8.3882e-02,  1.7431e+00,  7.5411e-02,
             1.4379e-01, -2.1750e+00,  5.3509e-01,  1.9931e+00, -1.0812e+00,
             9.5756e-01, -2.2465e-01, -2.7048e-01, -5.4887e-01,  4.8681e-01,
            -5.7749e-02,  8.6992e-02, -7.8780e-01,  1.3495e+00, -7.5135e-02,
             6.2448e-01, -1.1303e-02, -1.0266e-01, -1.4959e+00, -1.6517e+00,
             1.1846e-01,  1.5355e+00, -4.2969e-01,  2.9539e-01, -5.9056e-01,
             1.0564e+00, -5.7899e-01,  1.7013e-02,  5.1986e-01, -4.7120e-02,
            -3.4399e-02, -1.4235e-01, -1.4144e+00,  5.1103e-01,  7.2233e-01,
            -6.0687e-01, -8.2988e-01, -2.7205e-01,  1.0952e+00, -9.7423e-02,
             4.9439e-02, -1.7460e-02,  2.0516e-01, -7.8793e-01, -1.8765e+00])
    
  • Verify QPU computation speed

  • Here the operator on the C++ side is rewritten so that the measured time is returned to the Python side, and a C++ CPU operator is written for comparison (a hedged sketch of such a CPU operator is given after the results below)

  • The times compared here cover only the computation on the C++ side; in practice, calling the GPU operator from Python also spends time on the call itself and on CPU-GPU communication

  • This project demonstrates the potential performance improvement that comes from writing and integrating operators on the C++ side

    # time.py
    # Only the time spent on the matrix multiplication itself is measured,
    # obtained through the returned value
    import torch
    import time
    import cpu_cpp
    import matrix_cpp
    
    for i in range(0, 6):
        a = torch.randn(100 * 10**i)
        b = torch.randn(10**i)
        gpu = matrix_cpp.gpu(a, b)
        cpu = cpu_cpp.cpu(a, b)
        print("cpu 100 * 10 ** %d takes %d.%06d" % (i, cpu[0], cpu[1]))
        print("gpu 100 * 10 ** %d takes %d.%06d" % (i, gpu[0], gpu[1]))
    
    • The results are as follows:
    cpu 100 * 10 ** 0 takes 0.000004
    gpu 100 * 10 ** 0 takes 0.000164
    cpu 100 * 10 ** 1 takes 0.000023
    gpu 100 * 10 ** 1 takes 0.000169
    cpu 100 * 10 ** 2 takes 0.000206
    gpu 100 * 10 ** 2 takes 0.000171
    cpu 100 * 10 ** 3 takes 0.002116
    gpu 100 * 10 ** 3 takes 0.000388
    cpu 100 * 10 ** 4 takes 0.021245
    gpu 100 * 10 ** 4 takes 0.003079
    cpu 100 * 10 ** 5 takes 0.214486
    gpu 100 * 10 ** 5 takes 0.029622
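
  • The cpu_cpp comparison operator used in time.py is not listed in this article. For reference, here is a minimal sketch of what such a CPU-side operator might look like; it is my own assumption, matching only the return convention time.py expects (element 0 = seconds, element 1 = microseconds) and the (H x W) by (W x 1) computation of matrix.cpp

    // cpu.cpp - hypothetical sketch of the CPU comparison operator, not the original code
    #include <torch/extension.h>
    #include <sys/time.h>
    #include <vector>
    
    // Multiply an (H x W) matrix by a (W x 1) vector on the CPU and return
    // the elapsed time as a 2-element tensor: [seconds, microseconds]
    torch::Tensor cpu_time(torch::Tensor input, torch::Tensor weight)
    {
        input = input.to(torch::kFloat32);
        weight = weight.to(torch::kFloat32);
        float *input_ptr = (float *)input.data_ptr();
        float *weight_ptr = (float *)weight.data_ptr();
    
        int width = weight.numel();
        int height = input.numel() / width;
        std::vector<float> out(height, 0.0f);
    
        struct timeval start, end;
        gettimeofday(&start, NULL);
        for (int i = 0; i < height; ++i)
            for (int j = 0; j < width; ++j)
                out[i] += input_ptr[i * width + j] * weight_ptr[j];
        gettimeofday(&end, NULL);
    
        long sec = end.tv_sec - start.tv_sec;
        long usec = end.tv_usec - start.tv_usec;
        if (usec < 0) { sec -= 1; usec += 1000000; }
    
        torch::Tensor t = torch::zeros(2);      // [seconds, microseconds]
        float *t_ptr = (float *)t.data_ptr();
        t_ptr[0] = (float)sec;
        t_ptr[1] = (float)usec;
        return t;
    }
    
    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("cpu", &cpu_time, "matrix-vector multiply on the CPU, returning elapsed time");
    }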
    

Final words

This project was the course project for an embedded systems class, and the results are preliminary.

Thanks to my groupmates lxk and cxf~

Thanks to Mr. Yang and Mr. Lv for their guidance and help~

Topics: Python C++ AI Pytorch Raspberry Pi