57631

kernel using AoS is faster than using SoA

Question:

I have two versions of a kernel that performs the same task -fill a linked cell list-, the difference between both kernels is the datatype to store particle position, the first one using a float array to store the positions (4 float per particle due to 128bit reads/writes), and the second uses a vec3f structure array to store the positions (a structure which holds 3 floats).

Doing some tests using nvprof, I found that the second kernel (which uses vec3f) ran faster than the first one:

Time(%) Time Calls Avg Min Max Name 42.88 37.26s 2 18.63s 23.97us 37.26s adentu_grid_cuda_filling_kernel(int*, int*, int*, float*, int, _vec3f, _vec3f, _vec3i) 11.00 3.93s 2 1.97s 25.00us 3.93s adentu_grid_cuda_filling_kernel(int*, int*, int*, _vec3f*, int, _vec3f, _vec3f, _vec3i)

The tests are done trying to fill a linked cell list using 256 and 512000 particles.

My question is, what happened here? I supposed that float array should do a better memory access due to coalesced memory, versus the use of vec3f structure array which has unaligned memory. I missunderstood anything?

These are the kernels, the first kernel:

__global__ void adentu_grid_cuda_filling_kernel (int *head, int *linked, int *cellnAtoms, float *pos, int nAtoms, vec3f origin, vec3f h, vec3i nCell) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx >= nAtoms) return; vec3i cell; vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]}; cell.x = floor ((_pos.x - origin.x)/h.x); cell.y = floor ((_pos.y - origin.y)/h.y); cell.z = floor ((_pos.z - origin.z)/h.z); int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x; int i; if (atomicCAS (&head[c], -1, idx) != -1){ i = head[c]; while (atomicCAS (&linked[i], -1, idx) != -1) i = linked[i]; } atomicAdd (&cellnAtoms[c], 1); }

And this is the second kernel:

__global__ void adentu_grid_cuda_filling_kernel (int *head, int *linked, int *cellNAtoms, vec3f *pos, int nAtoms, vec3f origin, vec3f h, vec3i nCell) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx >= nAtoms) return; vec3i cell; vec3f _pos = pos[idx]; cell.x = floor ((_pos.x - origin.x)/h.x); cell.y = floor ((_pos.y - origin.y)/h.y); cell.z = floor ((_pos.z - origin.z)/h.z); int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x; int i; if (atomicCAS (&head[c], -1, idx) != -1){ i = head[c]; while (atomicCAS (&linked[i], -1, idx) != -1) i = linked[i]; } atomicAdd (&cellNAtoms[c], 1); }

This is the vec3f structure:

typedef struct _vec3f {float x, y, z} vec3f;

Answer1:

This is not an example of AoS vs. SoA. Let's look at the important lines of code and the data structures implicit in them.

Your first, "SoA" or "slow" case:

vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]}; ^ ^ ^ | | | These values are stored in *adjacent* memory locations

So an individual thread is accessing successively pos[idx*4] plus the 2 locations right after it. This is how a structure gets stored! What you're calling a structure of Arrays is in fact an array of structures, the way it is stored in memory. To have a valid "SoA" case, your code would need to look something like this:

vec3f _pos = (vec3f){(float)pos1[idx], (float)pos2[idx], (float)pos3[idx]}; ^ | Adjacent threads will read adjacent values for pos1, pos2, and pos3 leading to *coalesced* access.

Your "AoS" or "fast" doesn't really have a different storage format.

Answer2:

To my mind both of your approaches are actually AoS, the only difference is that the first approach is AoS with a structure of four elements while the second approach only uses three elements. This is why your second solution is preferable.

If you really want to have SoA in your first solution you would have to organize the pos array as follows:

vec3f _pos = (vec3f){(float)pos[idx], (float)pos[N + idx], (float)pos[2 * N + idx]};

Recommend

  • how to divide time php
  • Getting Min and Max Value of a column and other column values
  • SQL: Show average and min/max within standard deviations
  • Printing in IE8 Has @href contents inline
  • How can I organize this data into the objects I want with LINQ?
  • How can I use scipy optimization to find the minimum chi-squared for 3 parameters and a list of data
  • Jquery ajax request not able to return data to other function
  • Variant from android-autofittextview library : scaling makes strange behaviour
  • Accessing Windows Azure Queues from client side javascript/jquery
  • C++ accessing vector
  • Shopify API CARTS - Changing line_item line_price for price Override
  • Making Cross Site Asynchronous HTTP Post from GWT Client
  • Skip Characters in Oracle TO_DATE function
  • Refactoring advice: maps to POJOs
  • How to make stdcall from Go
  • Creating NSCollectionView with datasource programatically
  • What is the default HTTP verb in WebApi ? GET or POST?
  • Updating product post meta data in admin meta box field
  • Universal Image Loader reuse images
  • Iron Router: How do I send data to the layout?
  • Open Existing DB in MySQL WorkBench
  • abstracting over a collection
  • TextToSpeech.setEngineByPackageName() triggers NullPointerException
  • Sequential (transactional) API calls in angular 4 with state management
  • D3 get axis values on zoom event
  • WPF ICommand CanExecute(): RaiseCanExecuteChanged() or automatic handling via DispatchTimer?
  • What's the purpose of QString?
  • onBackPressed() not being executed
  • Custom validator control occupying space even though display set to dynamic
  • Admob requires api-13 or later can I not deploy on old API-8 phones?
  • JSON response opens as a file, but I can't access it with JavaScript
  • Unity3D & Android: Difference between “UnityMain” and “main” threads?
  • Does CUDA 5 support STL or THRUST inside the device code?
  • PHP: When would you need the self:: keyword?
  • How get height of the a view with gone visibility and height defined as wrap_content in xml?
  • FormattedException instead of throw new Exception(string.Format(…)) in .NET
  • How to get Windows thread pool to call class member function?
  • apache spark aggregate function using min value
  • Sorting a 2D array using the second column C++
  • java string with new operator and a literal