Fundamentals of CUDA Programming

Posted by spicey on Sat, 09 Oct 2021 14:12:17 +0200

Chapter I

  • Purpose: accelerate 3D imaging, computational fluid dynamics, molecular simulation, and similar workloads. Most of these applications require a great deal of scientific computing; the GPU's powerful parallel computing capability lets it finish these very large computing tasks, shortening computation time and reducing computation cost;

Chapter II

  • First, install the CUDA Toolkit and add its environment variables;
  • Second, create a project in Visual Studio and add CUDA's bin, include and lib directories to the project;
  • Writing .cu code: kernel functions must start with __global__ and be declared with extern "C"; the same extern "C" declaration is repeated in the .cpp file, which then calls them;

Chapter III

3.1 and 3.2

  • The first CUDA program: first, create a .cu file and add the following code:

    #include "cuda_runtime.h"  
    #include "device_launch_parameters.h"
    __global__ void kernel() {}
    extern "C" void test_kernel(){
        kernel <<<1, 1 >>> ();
    }
    

    Then create the .cpp file and add the following code:

    #include <iostream>
    // test1
    extern "C" void test_kernel();
    
    
    int main()
    {
        //test1
        test_kernel();
        printf("hello, world!\n");
        return 0;
    }
    

    CUDA device code begins with __global__. This qualifier tells the compiler that the function should be compiled to run on the device rather than on the host; in other words, it marks the function kernel() as device code. In this example, kernel() is handed to the compiler that builds device code, while main() is handed to the host compiler. The triple angle brackets in the call pass parameters to the runtime system; these parameters are not passed to the device code, but tell the runtime system how to launch the device code. Arguments for the device code itself are passed inside the ordinary parentheses, just like normal parameter passing in C.

    Note: when writing CUDA code that uses __global__, you first need #include "cuda_runtime.h" and #include "device_launch_parameters.h"; second, in Visual Studio the file's Item Type must be set to CUDA C/C++ in its properties, otherwise the triple angle brackets will not be recognized;

    Note: the programmer must not dereference the pointer returned by cudaMalloc() in host code. Host code may pass this pointer as a parameter, perform arithmetic on it, and cast it to another type, but it must not use it to read or write memory;

  • Restrictions on the use of device pointers (illustrated by the sketch after this list):

    • The pointer allocated by cudaMalloc() can be passed to the function executed on the device;
    • You can use the pointer allocated by cudaMalloc() in the device code to read and write memory;
    • The pointer allocated by cudaMalloc() can be passed to functions executed on the host, but ultimately it should only be handed on to device code;
    • The pointer allocated by cudaMalloc() cannot be used in the host code for reading and writing;
    • You can't use the standard C free() to free the memory allocated by cudaMalloc(). You need to use cudaFree();
    • If you need to access memory on the device, use cudaMemcpy to copy it into host memory first. The last parameter cudaMemcpyDeviceToHost tells the runtime that the source pointer is a device pointer and the destination pointer is a host pointer; cudaMemcpyHostToDevice is the opposite; cudaMemcpyDeviceToDevice indicates that both pointers are on the device;
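
    To make these rules concrete, here is a minimal sketch (not taken from the original post; the kernel fill_value and the variable names are illustrative): the device pointer is allocated with cudaMalloc(), dereferenced only inside the kernel, copied back with cudaMemcpy(), and released with cudaFree():

    #include <iostream>
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    // Illustrative kernel: only device code may dereference the device pointer.
    __global__ void fill_value(int *dev_data, int value) {
        dev_data[blockIdx.x] = value;  // legal: read/write through the pointer in device code
    }

    int main()
    {
        const int n = 8;
        int host_data[n];
        int *dev_data = nullptr;

        cudaMalloc((void**)&dev_data, n * sizeof(int));  // do not dereference dev_data on the host

        fill_value<<<n, 1>>>(dev_data, 42);              // legal: pass the device pointer to a kernel

        // legal: copy device memory back to the host before reading it
        cudaMemcpy(host_data, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i)
            std::cout << host_data[i] << " ";
        std::cout << std::endl;

        cudaFree(dev_data);                              // cudaFree(), not free()
        return 0;
    }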

3.3 Querying devices

  • The number of CUDA devices can be queried with cudaGetDeviceCount(), and the properties of each device can be queried with cudaGetDeviceProperties() given its device index. The code is as follows:

    #include <iostream>
    #include<cuda_runtime.h>
    
    __global__ void kernel(){}
    int main()
    {
    	int count;
    	cudaGetDeviceCount(&count);
    	std::cout << " count = " << count << std::endl;
    
    	cudaDeviceProp prop;
    	for (int i = 0; i < count; ++i) {
    		cudaGetDeviceProperties(&prop, i);
    		std::cout << "prop.name: " << prop.name << std::endl;
    		std::cout << "compute capability: " << prop.major <<" "<< prop.minor << std::endl;
    	}
        //std::cout << "Hello World!\n";
    }
    
  • However, iterating over every device and checking whether it meets the requirements is cumbersome. The CUDA runtime provides an automatic way to perform this selection:

    #include <iostream>
    #include <cstring>   //for memset
    #include <cuda_runtime.h>
    
    __global__ void kernel(){}
    int main()
    {	
    	cudaDeviceProp prop;
    	int dev;
    	cudaGetDevice(&dev);
    	std::cout << "id of current cuda device: " << dev << std::endl;
    	
    	//Fill cudaDeviceProp structure
    	memset(&prop, 0, sizeof(cudaDeviceProp));
    	prop.major = 7;
    	prop.minor = 5;
    
    	//Pass to cudaChooseDevice
    	cudaChooseDevice(&dev, &prop);
    	std::cout << "id of current cuda device closest to 7.5: " << dev << std::endl;
    	cudaSetDevice(dev);
    
        //std::cout << "Hello World!\n";
    }
    

Chapter IV

  • Vector summation

    • Implementation on cpu

      #include <iostream>
      #include<cuda_runtime.h>
      
      #define N 10
      
      void add_cpu(int *a, int *b, int *c) {
      	int tid = 0;
      	while (tid < N) {
      		c[tid] = a[tid] + b[tid];
      		tid += 1;
      	}
      }
      int main()
      {
      	int a[N], b[N], c[N];
      	//Assign values to arrays a and b on the cpu
      	for (int i = 0; i < N; ++i) {
      		a[i] = -i;
      		b[i] = i * i;
      	}
      
      	add_cpu(a, b, c);
      	//result
      	for (int i = 0; i < N; ++i) {
      		std::cout << "a[" << i << "] +b[" << i << "] = " << c[i] << std::endl;
      	}
      	return 0;
      }
      

      If the CPU were used to accelerate this computation, it would have to make full use of multithreading, using different cores to compute the sums at different positions, which makes the program code more complex.

    • Implementation on gpu

      #include <iostream>
      #include<cuda_runtime.h>
      #include<device_launch_parameters.h>
      
      #define N 10
      
      void add_cpu(int *a, int *b, int *c) {
      	int tid = 0;
      	while (tid < N) {
      		c[tid] = a[tid] + b[tid];
      		tid += 1;
      	}
      }
      
      __global__ void add(int *a, int *b, int *c) {
      	int tid = blockIdx.x;
      	if (tid < N) {
      		c[tid] = a[tid] + b[tid];
      	}
      }
      int main()
      {
      	int a[N], b[N], c[N];
      	int *dev_a, *dev_b, *dev_c;
      
      	//Allocate memory on GPU
      	cudaMalloc((void**)&dev_a, N * sizeof(int));
      	cudaMalloc((void**)&dev_b, N * sizeof(int));
      	cudaMalloc((void**)&dev_c, N * sizeof(int));
      
      	//Assign values to arrays a and b on the cpu
      	for (int i = 0; i < N; ++i) {
      		a[i] = -i;
      		b[i] = i * i;
      	}
      
      	//Copy data to gpu
      	cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
      	cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
      
      	add << <N, 1 >> > (dev_a, dev_b, dev_c);
      
      	//dev_c copy from Gpu to cpu
      	cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
      
      	//result
      	for (int i = 0; i < N; ++i) {
      		std::cout << "a[" << i << "] +b[" << i << "] = " << c[i] << std::endl;
      	}
      
      	//Release cuda video memory
      	cudaFree(dev_a);
      	cudaFree(dev_b);
      	cudaFree(dev_c);
      	return 0;
      }
      
      • The implementation on the GPU is roughly the same as on the CPU. Note that memory allocated on the GPU cannot be read or written directly by the CPU, and memory allocated on the CPU cannot be read or written directly by the GPU; this is why there are so many cudaMemcpy operations, and why cudaFree is used to release the GPU memory;

      • Meaning of the parameters in the angle brackets: add<<<N, 1>>>(param1, param2, ...)

        The first parameter N is the number of parallel thread blocks the device uses when executing the kernel function. You can think of the runtime system as creating N copies of the kernel and running them in parallel; each parallel execution environment is called a thread block, so in this example there are N thread blocks;

      • How does the code know which thread block is running? Through blockIdx.x. blockIdx is a built-in variable defined by the CUDA runtime (declared in device_launch_parameters.h). CUDA also supports two-dimensional grids of thread blocks, which is convenient for problems over two-dimensional domains such as image processing;

      • Each thread block sees a different value of blockIdx.x. Note the check tid < N, otherwise it is easy to access memory illegally;

      • Note that the maximum size of each dimension of the grid of thread blocks cannot exceed 65535; a small sketch illustrating blockIdx.x follows below;
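
      • A minimal sketch (not from the original post; the kernel which_block and the array size are illustrative) showing that each block sees its own blockIdx.x:

        #include <iostream>
        #include "cuda_runtime.h"
        #include "device_launch_parameters.h"

        #define N 8

        // One thread per block: each of the N blocks records its own blockIdx.x.
        __global__ void which_block(int *out) {
            int tid = blockIdx.x;
            if (tid < N) {        // guard against illegal memory access
                out[tid] = blockIdx.x;
            }
        }

        int main()
        {
            int host_out[N];
            int *dev_out = nullptr;
            cudaMalloc((void**)&dev_out, N * sizeof(int));

            which_block<<<N, 1>>>(dev_out);

            cudaMemcpy(host_out, dev_out, N * sizeof(int), cudaMemcpyDeviceToHost);
            for (int i = 0; i < N; ++i)
                std::cout << "block " << i << " saw blockIdx.x = " << host_out[i] << std::endl;

            cudaFree(dev_out);
            return 0;
        }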


Chapter 5 Thread Cooperation

  • Decomposing parallel thread blocks: the CUDA runtime splits thread blocks into multiple threads. add<<<N, 1>>> launches N thread blocks (which can run in parallel), each containing a single thread.

  • We can use parallel threads to perform the same vector addition, i.e. add<<<1, N>>>. Although this has no advantage over using multiple thread blocks, threads can do work that thread blocks alone cannot. The code is as follows:

    // CUDATest.cpp: this file contains the "main" function. Program execution will begin and end here.
    //
    
    #include <iostream>
    #include<cuda_runtime.h>
    #include<device_launch_parameters.h>
    
    #define N 10
    
    void add_cpu(int *a, int *b, int *c) {
    	int tid = 0;
    	while (tid < N) {
    		c[tid] = a[tid] + b[tid];
    		tid += 1;
    	}
    }
    
    __global__ void add(int *a, int *b, int *c) {
    	int tid = threadIdx.x;
    	if (tid < N) {
    		c[tid] = a[tid] + b[tid];
    	}
    }
    int main()
    {
    	int a[N], b[N], c[N];
    	int *dev_a, *dev_b, *dev_c;
    
    	//Allocate memory on GPU
    	cudaMalloc((void**)&dev_a, N * sizeof(int));
    	cudaMalloc((void**)&dev_b, N * sizeof(int));
    	cudaMalloc((void**)&dev_c, N * sizeof(int));
    
    	//Assign values to arrays a and b on the cpu
    	for (int i = 0; i < N; ++i) {
    		a[i] = -i;
    		b[i] = i * i;
    	}
    
    	//Copy data to gpu
    	cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    	cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    
    	add << <1, N >> > (dev_a, dev_b, dev_c);
    
    	//dev_c copy from Gpu to cpu
    	cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    
    	//result
    	for (int i = 0; i < N; ++i) {
    		std::cout << "a[" << i << "] +b[" << i << "] = " << c[i] << std::endl;
    	}
    
    	//Release cuda video memory
    	cudaFree(dev_a);
    	cudaFree(dev_b);
    	cudaFree(dev_c);
    	return 0;
    }
    

    It can be seen that there are only two differences: threadIdx.x instead of blockIdx.x, and add<<<1, N>>> instead of add<<<N, 1>>>.

  • The hardware limits the number of thread blocks to 65535 and the number of threads per block to 512 (on the hardware described here; newer GPUs allow 1024). A single thread block therefore cannot add vectors longer than 512 elements, so thread blocks and threads must be combined. Two things need to change: the index computation in the kernel function, and the way the kernel is launched.

  • Multiple thread blocks contain multiple threads

    • Change of indexing: int tid = threadIdx.x + blockIdx.x * blockDim.x. blockDim is the same for all thread blocks and gives the number of threads per dimension in a block;

    • Kernel launch: when summing vectors of arbitrary length, the vector length may exceed the total number of threads (number of thread blocks × threads per block), so another approach is needed. The solution is for each thread to work like a core in a multi-core CPU program, i.e. tid strides forward through the array. The code is as follows:

      __global__ void add(int *a, int *b, int *c){
      	int tid = threadIdx.x + blockIdx.x * blockDim.x;
      	while(tid < N){
      		c[tid] = a[tid] + b[tid];
      		tid += blockDim.x * gridDim.x; //the increment is the total number of running threads in the grid
      	}
      }
      

      This modification means that each increment is the number of threads per block multiplied by the number of blocks per grid: the number of blocks per grid is gridDim.x, and the number of threads per block is blockDim.x. We only need to know each parallel thread's starting index and how to compute the increment, so the thread index and block index are linearized: the block index is scaled by the block size and the thread index is added. The code is:

      int tid = threadIdx.x + blockIdx.x * blockDim.x;
      

      After each thread finishes the work at its current index, it increments the index. The step is the number of running threads in the grid, i.e. the number of threads per block multiplied by the number of blocks in the grid (a complete runnable sketch follows at the end of this list). The increment code is:

      tid += blockDim.x * gridDim.x; //The increment is the number of running threads in the current grid: blockDim.x is the number of threads per block, and gridDim.x is the number of blocks in the grid
      
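    • A complete runnable sketch (assembled here for illustration, not quoted from the original) combining the linearized index with the grid-stride increment; the vector length N and the fixed launch configuration add<<<128, 128>>> are illustrative, chosen so that the vector is longer than the total number of launched threads:

      #include <iostream>
      #include "cuda_runtime.h"
      #include "device_launch_parameters.h"

      #define N (33 * 1024)  // deliberately larger than the number of launched threads

      __global__ void add(int *a, int *b, int *c) {
      	int tid = threadIdx.x + blockIdx.x * blockDim.x;  // linearized thread index
      	while (tid < N) {
      		c[tid] = a[tid] + b[tid];
      		tid += blockDim.x * gridDim.x;                // stride = total threads in the grid
      	}
      }

      int main()
      {
      	int *a = new int[N], *b = new int[N], *c = new int[N];
      	int *dev_a, *dev_b, *dev_c;

      	cudaMalloc((void**)&dev_a, N * sizeof(int));
      	cudaMalloc((void**)&dev_b, N * sizeof(int));
      	cudaMalloc((void**)&dev_c, N * sizeof(int));

      	for (int i = 0; i < N; ++i) {
      		a[i] = -i;
      		b[i] = i * i;
      	}

      	cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
      	cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

      	//128 blocks of 128 threads; the grid-stride loop covers the rest of the vector
      	add<<<128, 128>>>(dev_a, dev_b, dev_c);

      	cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

      	//spot-check the result
      	bool ok = true;
      	for (int i = 0; i < N; ++i) {
      		if (c[i] != a[i] + b[i]) { ok = false; break; }
      	}
      	std::cout << (ok ? "vector add succeeded" : "vector add failed") << std::endl;

      	cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
      	delete[] a; delete[] b; delete[] c;
      	return 0;
      }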

  • Using threads to achieve ripple effect on GPU

      • (omitted)
  • Shared memory and synchronization

    • Splitting thread blocks into threads could be left entirely to the CUDA runtime behind the scenes, but there are other important reasons to decompose a block into multiple threads: the __shared__ keyword lets declared variables reside in shared memory;

    • The shared memory buffer resides physically on the GPU, not in memory outside the GPU. Each time a thread block is launched on the GPU, the CUDA C compiler creates a copy of the shared variable for that block; every thread in the block shares that memory, which lets the threads of a block communicate and cooperate;

    • Where there is sharing, there is contention. As with multithreading on the CPU, shared memory on the GPU needs a strategy for handling contention; in other words, synchronization becomes the problem to solve;

    • Kernel launch: fix the thread block size, i.e. the number of threads in a single block, to a constant such as 128, and launch N/128 thread blocks. In general N/128 must be rounded up so that enough threads are launched when N is not an exact multiple of the block size (integer division would give 0 blocks when N < 128). This is computed as (N+127)/128, and the call is: add<<<(N+127)/128, 128>>>(dev_a, dev_b, dev_c);

      The dot product code using shared memory is:

      // CUDATest.cpp: this file contains the "main" function. Program execution will begin and end here.
      //
      
      #include <iostream>
      #include<cuda_runtime.h>
      #include<device_launch_parameters.h>
      
      #define imin(a, b)(a<b?a:b)
      
      const int N = 33 * 1024;
      const int threadsPerBlock = 256;
      const int blocksPerGrid = imin(32, (N + threadsPerBlock - 1) / threadsPerBlock);
      
      __global__ void dot(float *a, float *b, float *c) {
      	__shared__ float cache[threadsPerBlock];//shared memory buffer, one entry per thread in the block
      	int tid = threadIdx.x + blockIdx.x * blockDim.x; //blockDim.x is the number of threads per block, blockIdx.x is the block index
      	int cacheIndex = threadIdx.x;
      
      	float temp = 0;
      	while (tid < N) {
      		temp += a[tid] * b[tid];
      		tid += blockDim.x * gridDim.x;//threads per block * blocks per grid = number of running threads in the grid
      	}
      	cache[cacheIndex] = temp;
      	__syncthreads(); //Thread synchronization within the block
      
      	//reduction
      	int i = blockDim.x / 2;
      	while (i != 0) {
      		if (cacheIndex < i) {
      			cache[cacheIndex] += cache[cacheIndex + i];
      		}
      		__syncthreads();
      		i /= 2;
      	}
      	if (cacheIndex == 0) {//any one thread may write cache[0] to global memory
      		c[blockIdx.x] = cache[0]; 
      	}
      }
      
      
      int main()
      {
      	float *a, *b, c, *partial_c;
      	float *dev_a, *dev_b, *dev_partial_c;
      
      	//Allocate memory on cpu
      	//a = (float*)malloc(N * sizeof(float));
      	//b = (float*)malloc(N * sizeof(float));
      	//partial_c = (float*)malloc(blocksPerGrid * sizeof(float));
      	a = new float[N];
      	b = new float[N];
      	partial_c = new float[blocksPerGrid];
      
      	cudaMalloc((void**)&dev_a, N * sizeof(float));
      	cudaMalloc((void**)&dev_b, N * sizeof(float));
      	cudaMalloc((void**)&dev_partial_c, blocksPerGrid * sizeof(float));
      
      	//Populate host memory
      	for (int i = 0; i < N; ++i) {
      		a[i] = i;
      		b[i] = i * 2;
      	}
      
      	//Copy data to gpu
      	cudaMemcpy(dev_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
      	cudaMemcpy(dev_b, b, N * sizeof(float), cudaMemcpyHostToDevice);
      
      	dot << <blocksPerGrid, threadsPerBlock >> > (dev_a, dev_b, dev_partial_c);
      
      	//dev_c copy from Gpu to cpu
      	cudaMemcpy(partial_c, dev_partial_c, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);
      
      	//result: finish the dot product by summing the per-block partial sums on the cpu
      	c = 0;
      	for (int i = 0; i < blocksPerGrid; ++i) {
      		c += partial_c[i];
      	}
      	std::cout << "dot product = " << c << std::endl;
      
      	//Release cuda video memory
      	cudaFree(dev_a);
      	cudaFree(dev_b);
      	cudaFree(dev_partial_c);
      	//Free up memory on cpu
      	delete[] a;
      	delete [] b;
      	delete[] partial_c;
      	return 0;
      }
      
    • Shared memory means that threads in the same thread block share one block of memory, declared with __shared__;

    • The shared array is sized to the number of threads in a thread block, and each thread indexes it with its own thread id, i.e. threadIdx.x;

    • Thread synchronization within a thread block uses the __syncthreads() function, which ensures that all writes to the cache array finish before any reads. Once a thread has called __syncthreads(), it is guaranteed that every thread in the block has completed the statements before __syncthreads() before any thread executes the statements after it;

    • Finally, a reduction algorithm is used to add up the numbers in the cache array;


  • Optimization of dot product operation

    • When some threads in a block need to execute an instruction and other threads do not, this is called thread divergence;
    • If __syncthreads() is placed inside an if statement whose condition only some threads satisfy, the threads diverge: the threads that skip the branch never reach __syncthreads(), so the hardware keeps the threads that did reach it waiting forever;
    • The analysis above shows that where __syncthreads() is placed is critical (see the sketch after this list);
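    • A minimal kernel-only sketch (the kernel block_sum is hypothetical, not from the original) showing the correct placement of __syncthreads() in a reduction loop:

      #include "cuda_runtime.h"
      #include "device_launch_parameters.h"

      const int threadsPerBlock = 256;

      // Sums one block's worth of input into out[blockIdx.x].
      __global__ void block_sum(const float *in, float *out) {
      	__shared__ float cache[threadsPerBlock];
      	int cacheIndex = threadIdx.x;
      	cache[cacheIndex] = in[threadIdx.x + blockIdx.x * blockDim.x];
      	__syncthreads();

      	int i = blockDim.x / 2;
      	while (i != 0) {
      		if (cacheIndex < i) {
      			cache[cacheIndex] += cache[cacheIndex + i];
      		}
      		// Correct placement: outside the if, so every thread in the block reaches
      		// the barrier. Moving it inside the if would mean threads with
      		// cacheIndex >= i never call __syncthreads(), and the hardware would
      		// keep the other threads waiting forever.
      		__syncthreads();
      		i /= 2;
      	}
      	if (cacheIndex == 0) {
      		out[blockIdx.x] = cache[0];
      	}
      }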
  • About the relationship between gridDim, blockDim, blockIdx, threadIdx, etc. (see the sketch after this list)

    • gridDim.x and gridDim.y are the number of thread blocks in the horizontal and vertical directions of the grid, respectively;
    • blockDim.x and blockDim.y are the number of threads in the horizontal and vertical directions of a thread block, respectively;
    • blockIdx.x and blockIdx.y give the horizontal and vertical position of the current block within the grid;
    • threadIdx.x and threadIdx.y give the horizontal and vertical position of the current thread within its block;
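    • A minimal sketch (not from the original; WIDTH, HEIGHT and the kernel index_2d are illustrative) showing how these values combine into a global 2D coordinate using a dim3 launch:

      #include <iostream>
      #include "cuda_runtime.h"
      #include "device_launch_parameters.h"

      const int WIDTH = 64;
      const int HEIGHT = 48;

      // Each thread computes its global (x, y) coordinate from the 2D block and
      // thread indices and writes into a flattened 2D array.
      __global__ void index_2d(int *out) {
      	int x = threadIdx.x + blockIdx.x * blockDim.x;  // column
      	int y = threadIdx.y + blockIdx.y * blockDim.y;  // row
      	if (x < WIDTH && y < HEIGHT) {
      		out[y * WIDTH + x] = y * WIDTH + x;
      	}
      }

      int main()
      {
      	int *dev_out;
      	cudaMalloc((void**)&dev_out, WIDTH * HEIGHT * sizeof(int));

      	dim3 threads(16, 16);                               // blockDim.x = blockDim.y = 16
      	dim3 blocks((WIDTH + threads.x - 1) / threads.x,    // gridDim.x
      	            (HEIGHT + threads.y - 1) / threads.y);  // gridDim.y
      	index_2d<<<blocks, threads>>>(dev_out);

      	int *out = new int[WIDTH * HEIGHT];
      	cudaMemcpy(out, dev_out, WIDTH * HEIGHT * sizeof(int), cudaMemcpyDeviceToHost);
      	std::cout << "out[last] = " << out[WIDTH * HEIGHT - 1] << std::endl;

      	delete[] out;
      	cudaFree(dev_out);
      	return 0;
      }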

Chapter 6 Constant Memory and Events

Constant memory is a special kind of memory on the GPU used to speed up applications; events can be used to measure an application's performance and to analyze quantitatively whether a change to the application actually improves it;

  • constant memory

  • CUDA programs support not only global memory and shared memory but also constant memory, which holds data that does not change during kernel execution. It resides in dedicated hardware on the NVIDIA GPU and is 64KB in size;

    • Ray tracing: generating a 2D image from a 3D scene. The principle: an imaginary camera is placed somewhere in the scene; it contains a light sensor that produces the image, so we must determine which rays hit the sensor. Each pixel in the image takes on the same color and intensity as the ray that strikes that point of the sensor. In practice the computation runs in reverse: a ray is assumed to be shot from the pixel into the scene, and when it hits an object, the pixel's color is computed from what that ray sees;
    • Raytracing on GPU:
      • ...
    • The GPU's compute throughput is greater than its memory bandwidth, so memory traffic needs to be reduced;
  • Constant memory stores data that does not change during kernel execution. NVIDIA provides 64KB of constant memory, which the hardware handles differently from global memory; using it in place of global memory can effectively reduce memory bandwidth usage (a minimal sketch follows);
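
    A minimal sketch of constant memory (not from the original post; the coefficient array coeff and the kernel apply_poly are illustrative): the data is declared with __constant__ and copied from the host with cudaMemcpyToSymbol rather than cudaMemcpy:

    #include <iostream>
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    #define N 256

    // Coefficients that never change during kernel execution live in constant memory.
    __constant__ float coeff[4];

    __global__ void apply_poly(const float *x, float *y) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid < N) {
            float v = x[tid];
            // Every thread reads the same coefficients; constant memory broadcasts them.
            y[tid] = coeff[0] + coeff[1] * v + coeff[2] * v * v + coeff[3] * v * v * v;
        }
    }

    int main()
    {
        float host_coeff[4] = { 1.0f, 2.0f, 0.5f, 0.25f };
        // Constant memory is filled from the host with cudaMemcpyToSymbol, not cudaMemcpy.
        cudaMemcpyToSymbol(coeff, host_coeff, sizeof(host_coeff));

        float x[N], y[N];
        for (int i = 0; i < N; ++i) x[i] = i * 0.01f;

        float *dev_x, *dev_y;
        cudaMalloc((void**)&dev_x, N * sizeof(float));
        cudaMalloc((void**)&dev_y, N * sizeof(float));
        cudaMemcpy(dev_x, x, N * sizeof(float), cudaMemcpyHostToDevice);

        apply_poly<<<(N + 127) / 128, 128>>>(dev_x, dev_y);

        cudaMemcpy(y, dev_y, N * sizeof(float), cudaMemcpyDeviceToHost);
        std::cout << "y[10] = " << y[10] << std::endl;

        cudaFree(dev_x);
        cudaFree(dev_y);
        return 0;
    }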

  • Events: used to time sections of a program and to analyze quantitatively whether a change improves performance; a minimal timing sketch follows the reference below.
    Reference: GPU high performance CUDA Programming Practice
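
    A minimal event-timing sketch (not from the original post; the kernel and sizes are illustrative), using cudaEventCreate, cudaEventRecord, cudaEventSynchronize and cudaEventElapsedTime:

    #include <iostream>
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    #define N (1024 * 1024)

    __global__ void add(const int *a, const int *b, int *c) {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (tid < N) {
            c[tid] = a[tid] + b[tid];
            tid += blockDim.x * gridDim.x;
        }
    }

    int main()
    {
        // Device buffers are left uninitialized here; we only care about timing.
        int *dev_a, *dev_b, *dev_c;
        cudaMalloc((void**)&dev_a, N * sizeof(int));
        cudaMalloc((void**)&dev_b, N * sizeof(int));
        cudaMalloc((void**)&dev_c, N * sizeof(int));

        // Create a pair of events to bracket the work we want to time.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);               // record "start" in stream 0
        add<<<(N + 255) / 256, 256>>>(dev_a, dev_b, dev_c);
        cudaEventRecord(stop, 0);                // record "stop" after the kernel
        cudaEventSynchronize(stop);              // wait until the GPU reaches "stop"

        float elapsed_ms = 0.0f;
        cudaEventElapsedTime(&elapsed_ms, start, stop);  // time between the two events
        std::cout << "kernel time: " << elapsed_ms << " ms" << std::endl;

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
        return 0;
    }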

Topics: C C++