
OpenMP parallel spiking

Question:

I'm using OpenMP in Visual Studio 2010 to speed up loops.

I wrote a very simple test to see the performance increase using OpenMP. I use omp parallel for on an empty loop:

    int time_before = clock();

    #pragma omp parallel for
    for(i = 0; i < 4; i++){
    }

    int time_after = clock();

    std::cout << "time elapsed: " << (time_after - time_before) << " milliseconds" << std::endl;

Without the omp pragma it consistently takes 0 milliseconds to complete (as expected), and with the pragma it usually takes 0 as well. The problem is that with the omp pragma it spikes occasionally, anywhere from 10 to 32 milliseconds. Every time I tried parallel with OpenMP I got these random spikes, so I tried this very basic test. Are the spikes an inherent part of OpenMP, or can they be avoided?

The parallel for gives me great speed boosts on some loops, but these random spikes are too big for me to be able to use it.

Answer1:

That's pretty normal behavior. Sometimes your operating system is busy and needs more time to spawn new threads.

Answer2:

I want to complement the answer of kukis: I'd also say that the spikes are due to the additional overhead that comes with OpenMP.

Furthermore, as you are doing performance-sensitive measurements, I hope that you compiled your code with optimizations turned on. In that case, the loop without OpenMP simply gets optimized out by the compiler, so there is no code in between time_before and time_after. With OpenMP, however, at least g++ 4.8.1 (-O3) is unable to optimize the code away: the loop is still there in the assembly, and contains additional statements to manage the work-sharing. (I cannot try it with VS at the moment.)

So, the comparison is not really fair, as the one without OpenMP gets optimized out completely.

Edit: You also have to keep in mind that OpenMP doesn't re-create threads every time. Rather, it uses a thread pool. So, if you execute an omp construct before your loop, the threads will already have been created when it encounters another one:

    // Dummy loop: Spawn the threads.
    #pragma omp parallel for
    for(int i = 0; i < 4; i++){
    }

    int time_before = clock();

    // Do the actual measurement. OpenMP re-uses the threads.
    #pragma omp parallel for
    for(int i = 0; i < 4; i++){
    }

    int time_after = clock();

In this case, the spikes should vanish.
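To check the thread-pool effect on your own machine, a minimal sketch along these lines could time the first region against a later one. It assumes a C++11 compiler (std::chrono is not available in VS2010) and OpenMP enabled at compile time; the names timeEmptyRegionMs, WarmupTimings and compareColdAndWarm are made up for illustration.

```cpp
#include <cassert>
#include <chrono>

// Times one empty "parallel for" region and returns the elapsed wall-clock
// time in milliseconds. Hypothetical helper, not part of OpenMP.
double timeEmptyRegionMs(int tripCount) {
    assert(tripCount >= 0);
    const auto start = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (int i = 0; i < tripCount; ++i) {
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Timings of the first region (which may have to create the threads)
// and of a later region (which can typically reuse them).
struct WarmupTimings {
    double firstMs;
    double laterMs;
};

WarmupTimings compareColdAndWarm(int tripCount) {
    WarmupTimings t;
    t.firstMs = timeEmptyRegionMs(tripCount);  // may include thread creation
    t.laterMs = timeEmptyRegionMs(tripCount);  // threads usually exist already
    return t;
}
```

Call compareColdAndWarm(4) once at program start and print both fields. If the first value is much larger than the second, thread creation is a large part of the spike you see.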

Answer3:

If "OpenMP parallel spiking", which I would call "parallel overhead", is a concern in your loop, this implies you probably don't have enough workload to parallelize. Parallelization yields a speedup only if you have a sufficient problem size. You already showed an extreme example: no work in a parallelized loop. In such a case, you will see highly fluctuating times due to parallel overhead.
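One way to act on this is OpenMP's if clause, which lets a loop fall back to a single thread when the input is small. This is only a sketch: the function scaleByTwo and the cutoff kParallelThreshold are invented for illustration, and a sensible cutoff has to be measured for your own loop body and machine.

```cpp
#include <cassert>
#include <vector>

// Hypothetical cutoff: at or below this many elements the loop runs on one
// thread. The value is an assumption, not a recommendation.
static const int kParallelThreshold = 10000;

// Doubles every element. When the "if" condition is false, the region is
// executed by a single thread, so small inputs avoid most of the
// work-sharing and synchronization cost.
void scaleByTwo(std::vector<double>& data) {
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for if(n > kParallelThreshold)
    for (int i = 0; i < n; ++i) {
        data[i] *= 2.0;
    }
}
```

The result is the same either way; only the number of threads that execute the loop changes.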

The parallel overhead in OpenMP's omp parallel for includes several factors:

  • First, omp parallel for is the combination of omp parallel and omp for.
  • For omp parallel: the overhead of spawning or waking threads (many OpenMP implementations do not create and destroy threads for every omp parallel).
  • For omp for: the overhead of (a) dispatching work to the worker threads and (b) scheduling (especially if dynamic scheduling is used).
  • The overhead of the implicit barrier at the end of omp for (unless nowait is specified) and at the end of omp parallel.
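To make these pieces visible, the combined construct can be written as a separate omp parallel and omp for. The sketch below is illustrative only (the function fillBoth is made up): it pays the thread wake-up once for two loops and drops the barrier after the first loop with nowait, which is safe here only because the second loop does not read what the first one writes.

```cpp
#include <cassert>
#include <vector>

// Fills a[i] = i and b[i] = 2 * i using ONE parallel region that contains
// two work-sharing loops. Hypothetical example function.
void fillBoth(std::vector<int>& a, std::vector<int>& b) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    #pragma omp parallel            // threads are spawned or woken once
    {
        #pragma omp for nowait      // no barrier: the next loop never reads a
        for (int i = 0; i < n; ++i) {
            a[i] = i;
        }
        #pragma omp for             // implicit barrier at the end of this loop
        for (int i = 0; i < n; ++i) {
            b[i] = 2 * i;
        }
    }                               // implicit barrier of the parallel region
}
```

Note that nowait belongs to omp for. The barrier at the end of the parallel region itself is always there.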

FYI, in order to measure OpenMP's parallel overhead, the following would be more effective:

    double measureOverhead(int tripCount)
    {
        static const int TIMES = 10000;

        long long sum = 0;

        // Baseline: the same loops without OpenMP.
        clock_t startTime = clock();
        for (int k = 0; k < TIMES; ++k)
        {
            for (int i = 0; i < tripCount; ++i)
            {
                sum += i;
            }
        }
        clock_t elapsedTime = clock() - startTime;

        // Same work, but every inner loop is an omp parallel for region.
        clock_t startTime2 = clock();
        for (int k = 0; k < TIMES; ++k)
        {
            // reduction keeps sum well defined; with private(sum) each
            // thread would add to an uninitialized copy.
            #pragma omp parallel for reduction(+: sum)
            for (int i = 0; i < tripCount; ++i)
            {
                sum += i;
            }
        }
        clock_t elapsedTime2 = clock() - startTime2;

        // Average extra cost of one region, in clock() ticks
        // (milliseconds with Visual Studio, where CLOCKS_PER_SEC is 1000).
        double parallelOverhead = double(elapsedTime2 - elapsedTime) / double(TIMES);
        return parallelOverhead;
    }

Try to run such small code many times, then take an average. Also, put at least a minimal workload in the loops. In the above code, parallelOverhead is an approximate overhead of OpenMP's omp parallel for construct.
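Since the question is about occasional spikes, it can also help to record the slowest run next to the average. The following is a rough sketch, assuming a C++11 compiler for std::chrono; RegionStats and sampleRegions are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Summary of many timed regions (struct and function names are hypothetical).
struct RegionStats {
    double meanMs;       // average wall-clock time per region
    double maxMs;        // slowest single region, i.e. the worst spike
    long long checksum;  // sum computed by the loops, keeps the work observable
};

RegionStats sampleRegions(int tripCount, int samples) {
    assert(samples > 0);
    typedef std::chrono::steady_clock Clock;
    long long sum = 0;
    double totalMs = 0.0;
    double maxMs = 0.0;
    for (int s = 0; s < samples; ++s) {
        const Clock::time_point start = Clock::now();
        #pragma omp parallel for reduction(+: sum)
        for (int i = 0; i < tripCount; ++i) {
            sum += i;
        }
        const Clock::time_point stop = Clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        totalMs += ms;
        maxMs = std::max(maxMs, ms);
    }
    RegionStats stats;
    stats.meanMs = totalMs / samples;
    stats.maxMs = maxMs;
    stats.checksum = sum;
    return stats;
}
```

A large gap between maxMs and meanMs points to occasional scheduling or thread wake-up delays rather than a steady per-region cost.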
