Efficiency of CUDA scalar and SIMD video instructions

The throughput of the SIMD video instructions is lower than that of 32-bit integer arithmetic. On SM2.0 (which has only the scalar versions of the instructions) it is 2 times lower. On SM3.0 it is 6 times lower.

In what cases is it suitable to use them?


If your data is already packed in a format that a SIMD video instruction handles natively, then processing it with ordinary instructions instead would require multiple extra steps to unpack it first.

Furthermore, the throughput of a SIMD video instruction should also be multiplied by the number of actual operations performed when comparing it with ordinary arithmetic operations.

For example, the instruction vadd4 performs 4 integer adds on a packed 32-bit quantity (four 8-bit integer quantities). To duplicate this behavior with ordinary integer adds, a fairly complicated sequence of instructions would be needed to unpack the data into 4 int quantities, do 4 integer adds, and then re-pack the result. If you attempted to do it with a single integer add, a carry from one result could corrupt the next result. vadd4 also offers clamping and other behavior not available with an integer add.
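To make the carry problem concrete, here is a minimal host-side C++ sketch that emulates a per-byte wrapping add on the CPU. The function names `packed_add_naive` and `packed_add_bytes` are hypothetical, chosen only for illustration. In device code the corresponding operation would be a single SIMD intrinsic such as `__vadd4`; this sketch only mimics the lane-wise semantics and says nothing about how the hardware implements it.

```cpp
#include <cstdint>

// Naive attempt: one ordinary 32-bit add over the whole packed word.
// A carry out of one byte lane spills into the next higher lane.
inline std::uint32_t packed_add_naive(std::uint32_t a, std::uint32_t b) {
    return a + b;
}

// Lane-wise emulation: unpack each byte, add with wrap-around, re-pack.
// This is the multi-step sequence an ordinary integer path would need.
inline std::uint32_t packed_add_bytes(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = 8 * lane;
        const std::uint32_t x = (a >> shift) & 0xFFu;  // unpack lane of a
        const std::uint32_t y = (b >> shift) & 0xFFu;  // unpack lane of b
        const std::uint32_t sum = (x + y) & 0xFFu;     // add, discard carry
        result |= sum << shift;                        // re-pack lane
    }
    return result;
}
```

With a = 0x000000FF and b = 0x00000001, the naive add returns 0x00000100 (the carry has corrupted the second byte), while the lane-wise version returns 0x00000000. When no lane overflows, for example 0x01020304 + 0x10203040, both return 0x11223344.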

In the case of SM2.0, just the ratio of 4 operations performed by the vadd4 vs. the 4 integer adds necessary on unpacked data would make it attractive. In the case of SM3.0, when the unpacking and packing are added to the ordinary integer add routine, the vadd4 looks attractive. The situation becomes even more attractive with compute capability (cc) 5.0.
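The comparison above can be written as a small calculation. This is a rough sketch using only the slowdown factors quoted in the question (2x on SM2.0, 6x on SM3.0). The helper name `relative_lane_throughput` is hypothetical, and the result is normalized so that one ordinary 32-bit integer add per unit time equals 1.0.

```cpp
// Relative throughput counted in lane-operations.
// lanes:    operations performed per SIMD instruction (4 for vadd4)
// slowdown: how many times lower the instruction throughput is
//           compared with an ordinary 32-bit integer add
inline double relative_lane_throughput(int lanes, double slowdown) {
    return static_cast<double>(lanes) / slowdown;
}
```

Under these assumptions, `relative_lane_throughput(4, 2.0)` gives 2.0 for SM2.0, so vadd4 already comes out ahead of four separate integer adds. `relative_lane_throughput(4, 6.0)` gives about 0.67 for SM3.0, which is below 1.0 on its own; the advantage there comes from the unpack and re-pack instructions that the ordinary integer path needs in addition to the adds.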

