
Thrust inside user-written kernels

Question:

I am a newbie to Thrust. I see that all Thrust presentations and examples only show host code.

I would like to know whether I can pass a device_vector to my own kernel, and how. If so, what operations are permitted on it inside kernel/device code?

Answer1:

As it was originally written, Thrust is purely a host side abstraction. It cannot be used inside kernels. You can pass the device memory encapsulated inside a thrust::device_vector to your own kernel like this:

thrust::device_vector<Foo> fooVector;

// Do something thrust-y with fooVector

Foo *fooArray = thrust::raw_pointer_cast(&fooVector[0]);

// Pass raw array and its size to kernel
someKernelCall<<<x, y>>>(fooArray, fooVector.size());
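On the kernel side, the raw pointer is just plain device memory. A minimal sketch of what someKernelCall could look like (the kernel body, the Foo definition, and its value member are illustrative placeholders, not part of the original answer):

// Hypothetical payload type, standing in for whatever Foo really is
struct Foo { float value; };

__global__ void someKernelCall(Foo *fooArray, size_t n)
{
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Index the memory like an ordinary C array; no thrust::device_vector
        // member functions are available in device code.
        fooArray[i].value += 1.f;
    }
}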

You can also use device memory not allocated by Thrust within Thrust algorithms by instantiating a thrust::device_ptr with the bare CUDA device memory pointer.
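Going the other way, here is a minimal sketch of wrapping ordinary cudaMalloc'd memory so that a host-side Thrust algorithm can consume it (the reduction and the buffer size are just illustrative choices):

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 1024;

    float *d_raw;
    cudaMalloc(&d_raw, N * sizeof(float));
    cudaMemset(d_raw, 0, N * sizeof(float));

    // Wrap the bare device pointer so Thrust knows it lives on the device
    thrust::device_ptr<float> d_begin = thrust::device_pointer_cast(d_raw);
    float sum = thrust::reduce(d_begin, d_begin + N);   // runs on the device

    cudaFree(d_raw);
    return 0;
}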

Edited four and a half years later to add that, as per @JackOLantern's answer, Thrust 1.8 adds a sequential execution policy, which means you can run single-threaded versions of Thrust's algorithms on the device. Note that it still isn't possible to pass a Thrust device vector directly to a kernel, and device vectors can't be used directly in device code.

Note that in some cases it is also possible to use the thrust::device execution policy to have parallel Thrust execution launched from a kernel as a child grid. This requires separate compilation/device linkage and hardware which supports dynamic parallelism. I am not certain whether this is supported in all Thrust algorithms, but it certainly works with some.
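As a rough sketch of what that case looks like, assuming a file name of nested_thrust.cu and a GPU of compute capability 3.5 or newer (the flags and names are indicative, not a definitive build recipe):

// Build with separate compilation and the device runtime library, e.g.:
//   nvcc -arch=sm_35 -rdc=true nested_thrust.cu -o nested_thrust -lcudadevrt
#include <cuda_runtime.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

__global__ void nestedReduce(const float * __restrict__ d_in, float *d_out, int N)
{
    // thrust::device may launch a child grid via dynamic parallelism
    *d_out = thrust::reduce(thrust::device, d_in, d_in + N);
}

int main()
{
    const int N = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_in, 0, N * sizeof(float));

    nestedReduce<<<1, 1>>>(d_in, d_out, N);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}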

Answer2:

I would like to provide an updated answer to this question.

Starting from Thrust 1.8, CUDA Thrust primitives can be combined with the thrust::seq execution policy to run sequentially within a single CUDA thread (or sequentially within a single CPU thread). Below, an example is reported.

If you want parallel execution within a thread, then you may consider using CUB (http://nvlabs.github.io/cub/), which provides reduction routines that can be called from within a threadblock, provided that your card enables dynamic parallelism.

Here is the example with Thrust:

#include <stdio.h>

#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

/* Kernel: a single thread performs a sequential Thrust reduction */
__global__ void test(float *d_A, int N)
{
    float sum = thrust::reduce(thrust::seq, d_A, d_A + N);
    printf("Device side result = %f\n", sum);
}

int main()
{
    const int N = 16;

    float *h_A = (float *)malloc(N * sizeof(float));
    float sum = 0.f;
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        sum = sum + h_A[i];
    }
    printf("Host side result = %f\n", sum);

    float *d_A;
    gpuErrchk(cudaMalloc((void**)&d_A, N * sizeof(float)));
    gpuErrchk(cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice));

    test<<<1, 1>>>(d_A, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());   // also flushes the device-side printf

    return 0;
}
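The same thrust::seq policy also runs the algorithm sequentially in the calling CPU thread when used on the host. A minimal sketch (the array and its values are illustrative):

#include <stdio.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

int main()
{
    float a[4] = { 1.f, 2.f, 3.f, 4.f };

    // thrust::seq forces sequential execution in the calling (host) thread
    float s = thrust::reduce(thrust::seq, a, a + 4);
    printf("%f\n", s);   // prints 10.000000

    return 0;
}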

Answer3:

This is an update to my previous answer.

Starting from Thrust 1.8.1, CUDA Thrust primitives can be combined with the thrust::device execution policy to run in parallel from within a single CUDA thread, exploiting CUDA dynamic parallelism. An example is reported below.

#include <stdio.h>

#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

#define BLOCKSIZE_1D    256
#define BLOCKSIZE_2D_X  32
#define BLOCKSIZE_2D_Y  32

/*************************/
/* TEST KERNEL FUNCTIONS */
/*************************/
// Approach nr. 1: each thread reduces one matrix row sequentially (thrust::seq)
__global__ void test1(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols)
{
    const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;

    if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::seq, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);
}

// Approach nr. 2: each thread reduces one matrix row in parallel via a child grid (thrust::device)
__global__ void test2(const float * __restrict__ d_data, float * __restrict__ d_results, const int Nrows, const int Ncols)
{
    const unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;

    if (tid < Nrows) d_results[tid] = thrust::reduce(thrust::device, d_data + tid * Ncols, d_data + (tid + 1) * Ncols);
}

/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 64;
    const int Ncols = 2048;

    gpuErrchk(cudaFree(0));

//    size_t DevQueue;
//    gpuErrchk(cudaDeviceGetLimit(&DevQueue, cudaLimitDevRuntimePendingLaunchCount));
//    DevQueue *= 128;
//    gpuErrchk(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, DevQueue));

    float *h_data     = (float *)malloc(Nrows * Ncols * sizeof(float));
    float *h_results  = (float *)malloc(Nrows *         sizeof(float));
    float *h_results1 = (float *)malloc(Nrows *         sizeof(float));
    float *h_results2 = (float *)malloc(Nrows *         sizeof(float));
    float sum = 0.f;
    for (int i = 0; i < Nrows; i++) {
        h_results[i] = 0.f;
        for (int j = 0; j < Ncols; j++) {
            h_data[i * Ncols + j] = i;
            h_results[i] = h_results[i] + h_data[i * Ncols + j];
        }
    }

    TimingGPU timerGPU;

    float *d_data;     gpuErrchk(cudaMalloc((void**)&d_data,     Nrows * Ncols * sizeof(float)));
    float *d_results1; gpuErrchk(cudaMalloc((void**)&d_results1, Nrows *         sizeof(float)));
    float *d_results2; gpuErrchk(cudaMalloc((void**)&d_results2, Nrows *         sizeof(float)));

    gpuErrchk(cudaMemcpy(d_data, h_data, Nrows * Ncols * sizeof(float), cudaMemcpyHostToDevice));

    timerGPU.StartCounter();
    test1<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing approach nr. 1 = %f\n", timerGPU.GetCounter());

    gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i = 0; i < Nrows; i++) {
        if (h_results1[i] != h_results[i]) {
            printf("Approach nr. 1; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
            return 0;
        }
    }

    timerGPU.StartCounter();
    test2<<<iDivUp(Nrows, BLOCKSIZE_1D), BLOCKSIZE_1D>>>(d_data, d_results1, Nrows, Ncols);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    printf("Timing approach nr. 2 = %f\n", timerGPU.GetCounter());

    gpuErrchk(cudaMemcpy(h_results1, d_results1, Nrows * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i = 0; i < Nrows; i++) {
        if (h_results1[i] != h_results[i]) {
            printf("Approach nr. 2; Error at i = %i; h_results1 = %f; h_results = %f", i, h_results1[i], h_results[i]);
            return 0;
        }
    }

    printf("Test passed!\n");
}

The above example performs reductions of the rows of a matrix in the same sense as Reduce matrix rows with CUDA (https://stackoverflow.com/questions/17862078/reduce-matrix-rows-with-cuda), but it does so differently from that post, namely, by calling CUDA Thrust primitives directly from user-written kernels. The example also serves to compare the performance of the same operation under the two execution policies, thrust::seq and thrust::device. Below are some graphs showing the difference in performance.

[Timings: https://i.stack.imgur.com/Vzc3L.jpg]

[Speedups: https://i.stack.imgur.com/ajYbj.jpg]

The performance has been evaluated on a Kepler K20c and on a Maxwell GeForce GTX 850M.

Answer4:

If you mean to use the data allocated/processed by Thrust, yes you can; just get the raw pointer to the allocated data.

int * raw_ptr = thrust::raw_pointer_cast(dev_ptr);
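For context, a minimal sketch of where such a pointer might come from (the identifiers and the commented-out kernel launch are illustrative):

#include <thrust/device_vector.h>

int main()
{
    thrust::device_vector<int> dev_vec(1024, 0);               // Thrust-managed device allocation
    int *raw_ptr = thrust::raw_pointer_cast(dev_vec.data());   // plain pointer usable in a kernel
    // myKernel<<<grid, block>>>(raw_ptr, dev_vec.size());     // hypothetical kernel launch
    return 0;
}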

If you want to allocate Thrust vectors inside a kernel, I have never tried it, but I don't think it will work; and even if it did, I don't think it would provide any benefit.
