OpenCV CUDA acceleration official tutorial 2: Using a cv::cuda::GpuMat with thrust

Posted by JonnySnip3r on Wed, 02 Feb 2022 01:50:00 +0100

Original address

Goal

Thrust is an extremely powerful library of CUDA-accelerated algorithms. However, thrust is designed to work with vectors rather than pitched matrices. The following tutorial discusses how to wrap a cv::cuda::GpuMat into a thrust iterator that can be used with thrust algorithms.
This tutorial will show you how to:

  • Wrap GpuMat in a thrust iterator

  • Fill GpuMat with random numbers

  • Sort the columns of GpuMat

  • Copy values greater than 0 to the new gpu matrix

  • Use streams with thrust

Wrapping a GpuMat into a thrust iterator

The following code generates thrust-compatible begin and end iterators for a GpuMat.

/*
    @Brief GpuMatBeginItr returns a thrust compatible iterator to the beginning of a GPU mat's memory.
    @Param mat is the input matrix
    @Param channel is the channel of the matrix that the iterator is accessing.  If set to -1, the iterator will access every element in sequential order
*/
template<typename T>
thrust::permutation_iterator<thrust::device_ptr<T>, thrust::transform_iterator<step_functor<T>, thrust::counting_iterator<int>>>  GpuMatBeginItr(cv::cuda::GpuMat mat, int channel = 0)
{
    if (channel == -1)
    {
        mat = mat.reshape(1);
        channel = 0;
    }
    CV_Assert(mat.depth() == cv::DataType<T>::depth);
    CV_Assert(channel < mat.channels());
    return thrust::make_permutation_iterator(thrust::device_pointer_cast(mat.ptr<T>(0) + channel),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), step_functor<T>(mat.cols, mat.step / sizeof(T), mat.channels())));
}
/*
@Brief GpuMatEndItr returns a thrust compatible iterator to the end of a GPU mat's memory.
@Param mat is the input matrix
@Param channel is the channel of the matrix that the iterator is accessing.  If set to -1, the iterator will access every element in sequential order
*/
template<typename T>
thrust::permutation_iterator<thrust::device_ptr<T>, thrust::transform_iterator<step_functor<T>, thrust::counting_iterator<int>>>  GpuMatEndItr(cv::cuda::GpuMat mat, int channel = 0)
{
    if (channel == -1)
    {
        mat = mat.reshape(1);
        channel = 0;
    }
    CV_Assert(mat.depth() == cv::DataType<T>::depth);
    CV_Assert(channel < mat.channels());
    return thrust::make_permutation_iterator(thrust::device_pointer_cast(mat.ptr<T>(0) + channel),
        thrust::make_transform_iterator(thrust::make_counting_iterator(mat.rows*mat.cols), step_functor<T>(mat.cols, mat.step / sizeof(T), mat.channels())));
}

Our goal is to have an iterator that starts at the beginning of the matrix and increments correctly to access successive matrix elements. This is trivial for a continuous row, but what about a column of a pitched matrix? To handle that, the iterator needs to know the dimensions and step of the matrix. This information is embedded in step_functor.

template<typename T> struct step_functor : public thrust::unary_function<int, int>
{
    int columns;
    int step;
    int channels;
    __host__ __device__ step_functor(int columns_, int step_, int channels_ = 1) : columns(columns_), step(step_), channels(channels_)  {   };
    __host__ step_functor(cv::cuda::GpuMat& mat)
    {
        CV_Assert(mat.depth() == cv::DataType<T>::depth);
        columns = mat.cols;
        step = mat.step / sizeof(T);
        channels = mat.channels();
    }
    __host__ __device__
        int operator()(int x) const
    {
        int row = x / columns;
        int idx = (row * step) + (x % columns)*channels;
        return idx;
    }
};

The step functor takes in an index value and returns the appropriate offset from the beginning of the matrix. The counting iterator simply increments over the range of pixel elements. Combined into a transform_iterator, we have an iterator that counts from 0 to M*N and increments correctly to account for the pitched memory of a GpuMat. Unfortunately this does not include any memory location information; for that we need a thrust::device_ptr. By combining the device pointer with the transform_iterator we can point thrust at the first element of our matrix and have it step accordingly.
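As a quick sanity check on the index math, here is a minimal host-side sketch (not part of the original tutorial) using the step_functor defined above. It assumes a hypothetical 2x3 single-channel byte matrix whose rows are padded to a pitch of 512 elements, and that the file is compiled with nvcc together with the thrust/OpenCV headers the functor needs.

#include <cassert>

void check_step_functor()
{
    // Hypothetical 2x3 CV_8UC1 matrix with a pitched step of 512 elements per row
    step_functor<unsigned char> f(3 /*columns*/, 512 /*step in elements*/, 1 /*channels*/);
    // Linear indices 0..5 walk the matrix row by row; the functor maps them to pitched offsets
    assert(f(0) == 0);    // row 0, col 0
    assert(f(2) == 2);    // row 0, col 2
    assert(f(3) == 512);  // row 1, col 0 starts at the next pitched row
    assert(f(5) == 514);  // row 1, col 2
}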

Fill a GpuMat with random numbers

Now that we have some convenient functions for generating thrust iterators, let's use them to do something OpenCV can't do. Unfortunately, at the time of writing, OpenCV didn't have any GPU random number generation. Thankfully thrust does, and now interoperability between the two is trivial. Example adapted from http://stackoverflow.com/questions/12614164/generating-a-random-number-vector-between-0-and-1-0-using-thrust
First, we need to write a functor to generate our random values.

struct prg
{
  float a, b;
  __host__ __device__
    prg(float _a = 0.f, float _b = 1.f) : a(_a), b(_b) {};
  __host__ __device__
    float operator()(const unsigned int n) const
  {
    thrust::default_random_engine rng;
    thrust::uniform_real_distribution<float> dist(a, b);
    rng.discard(n);
    return dist(rng);
  }
};

This will take an integer value and output a value between a and b. Now we will fill our matrix with values between -1 and 1 using a thrust transform:

  {
    cv::cuda::GpuMat d_value(1, 100, CV_32F);
    auto valueBegin = GpuMatBeginItr<float>(d_value);
    auto valueEnd = GpuMatEndItr<float>(d_value);
    thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
    cv::Mat h_value(d_value);
  }

Sort a column of a GpuMat in place

Let's fill a matrix with random values and an index channel. Afterwards we will sort the random numbers and the indices together.

  {
    cv::cuda::GpuMat d_data(1, 100, CV_32SC2);
    // Thrust compatible begin and end iterators to channel 1 of this matrix
    auto keyBegin = GpuMatBeginItr<int>(d_data, 1);
    auto keyEnd = GpuMatEndItr<int>(d_data, 1);
    // Thrust compatible begin and end iterators to channel 0 of this matrix
    auto idxBegin = GpuMatBeginItr<int>(d_data, 0);
    auto idxEnd = GpuMatEndItr<int>(d_data, 0);
    // Fill the index channel with a sequence of numbers from 0 to 100
    thrust::sequence(idxBegin, idxEnd);
    // Fill the key channel with random numbers between 0 and 10.  A counting iterator is used here to give an integer value for each location as an input to prg::operator()
    thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_data.cols), keyBegin, prg(0, 10));
    // Sort the key channel and index channel such that the keys and indices stay together
    thrust::sort_by_key(keyBegin, keyEnd, idxBegin);
    cv::Mat h_idx(d_data);
  }
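To inspect the sorted result on the host, one could split the downloaded two-channel matrix into its planes. The sketch below is not part of the original tutorial; it assumes it is placed right after the cv::Mat h_idx(d_data); line inside the block above, with <opencv2/core.hpp>, <vector>, and <iostream> included.

    // Split the CV_32SC2 result: planes[0] is the index channel, planes[1] the sorted keys
    std::vector<cv::Mat> planes;
    cv::split(h_idx, planes);
    for (int i = 0; i < h_idx.cols; ++i)
    {
      std::cout << "key " << planes[1].at<int>(0, i)
                << " was originally at position " << planes[0].at<int>(0, i) << std::endl;
    }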

Copy values greater than 0 to a new gpu matrix while using streams

In this example we'll see how to use cv::cuda::Stream with thrust. Unfortunately, this particular example uses functions that must return results to the CPU, so it is not the optimal use of streams.

  {
    cv::cuda::GpuMat d_value(1, 100, CV_32F);
    auto valueBegin = GpuMatBeginItr<float>(d_value);
    auto valueEnd = GpuMatEndItr<float>(d_value);
    cv::cuda::Stream stream;
    //! [random_gen_stream]
    // Same as the random generation code from before except now the transformation is being performed on a stream
    thrust::transform(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
    //! [random_gen_stream]
    // Count the number of values we are going to copy
    int count = thrust::count_if(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), valueBegin, valueEnd, pred_greater<float>(0.0));
    // Allocate a destination for copied values
    cv::cuda::GpuMat d_valueGreater(1, count, CV_32F);
    // Copy values that satisfy the predicate.
    thrust::copy_if(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), valueBegin, valueEnd, GpuMatBeginItr<float>(d_valueGreater), pred_greater<float>(0.0));
    cv::Mat h_greater(d_valueGreater);
  }

First, we will populate the GPU mat with randomly generated data between -1 and 1 on a stream.

// Same as the random generation code from before except now the transformation is being performed on a stream
    thrust::transform(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));

Notice the use of thrust::system::cuda::par.on(...), which creates an execution policy for executing thrust code on a stream. There is a bug in the version of thrust distributed with the CUDA Toolkit (as of version 7.5 it has not been fixed) that causes code to not execute on the stream. The bug can be fixed by using the newest version of thrust from the git repository (http://github.com/thrust/thrust.git). Next we will determine how many values are greater than 0 by using thrust::count_if with the following predicate:

template<typename T> struct pred_greater
{
  T value;
  __host__ __device__ pred_greater(T value_) : value(value_){}
  __host__ __device__ bool operator()(const T& val) const
  {
    return val > value;
  }
};

We will use the count to create an output buffer for the copied values, then use copy_if with the same predicate to populate the output buffer. Finally, we download the values into a CPU mat for viewing.
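If you would rather keep the final download on the stream instead of using the blocking cv::Mat constructor shown in the block above, a possible variant is sketched below (not from the original tutorial). It reuses the d_valueGreater and stream objects from that block; note that for a truly asynchronous copy the host buffer would need to be page-locked (for example via cv::cuda::HostMem).

    // Enqueue the download on the same stream, then wait before reading h_greater
    cv::Mat h_greater;
    d_valueGreater.download(h_greater, stream);
    stream.waitForCompletion();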

Topics: C++ OpenCV CUDA