View Source

h3. Comparing Nested Parallel Regions and Tasking in OpenMP 3.0

_by Yuan Lin, Sun Studio OpenMP Development Team_

Many codes have nested parallelism. Exploiting nested parallelism can be crucial for getting scalable speed up.

In OpenMP 2.5, the only way to exploit nested parallelism is to use nested parallel regions. In OpenMP 3.0, _tasks_ provide another and often more efficient way of exploiting nested parallelism.

h5. Limitations of Using Nested Parallel Regions

* Entering and exiting parallel regions often have higher overhead than executing tasks.
* The number of threads that are used for each parallel region is determined when the parallel region is created. Once a parallel region is created, no threads in the team can leave the parallel region until the end of the parallel region, and no more threads can join the parallel region. This can lead to low resource utilization and poor scalability.
bq. For example, when a parallel region is created, if the runtime system can only supply 2 threads, then only two threads will be used for this parallel region. When at a later point more resources are available and some threads become free, those threads cannot join the parallel region even if there are lots of concurrent jobs in that parallel region.
* The user needs to decide how many threads to use for parallel regions at each level, since threads cannot be shared between different parallel regions. Using either too many or too few can cause waste of resources or poor scalability.

h3. An Example

Here we illustrate the differences by implementing Quick Sort.

Quick Sort uses the first element of an array as the pivot point and partitions the array into a left subarray whose elements are smaller than the pivot point, and a right subarray whose elements are greater than the pivot point. The algorithm then recursively works on the two subarrays. At each level the two subarrays can be processed in parallel, exhibiting nested parallelism.
-----
{panel:title=The source code: [https://wikis.sun.com/download/attachments/38210769/task-nested-7-08.tar.gz] }
{panel}

-----

{code:title=Using a Nested Parallel Region}
void quick_sort (int p, int r, float *data)
{
if (p < r) {
int q = partition (p, r, data);
#pragma omp parallel sections firstprivate(data, p, q, r)
{
#pragma omp section
quick_sort (p, q-1, data, low_limit);
#pragma omp section
quick_sort (q+1, r, data, low_limit);
}
}
}
}

void par_quick_sort (int n, float *data)
{
quick_sort (0, n, data);
}



{code}

-----

{code:title=Using an OpenMP 3.0 Task}
void quick_sort (int p, int r, float *data)
{
if (p < r) {
int q = partition (p, r, data);
#pragma omp task
quick_sort (p, q-1, data, low_limit);
#pragma omp task
quick_sort (q+1, r, data, low_limit);
}
}
}

void par_quick_sort (int n, float *data)
{
#pragma omp parallel
{
#pragma omp single nowait
quick_sort (0, n, data);
}
}
{code}

-----

h5. Experimental Result

First, we set environment variable OMP_THREAD_LIMIT so that the program will use a max of 8 threads.

The experiment was performed on a machine with 8 processors. The number of threads for the parallel region example is set by setting the OMP_NUM_THREADS environment variable.

We obtain these performance numbers:
|| OMP_NUM_THREADS || task || parreg ||
| 2 | 2.6s | 1.8s |
| 4 | 1.7s | 2.1s |
| 8 | 1.2s | 2.6s |
The program written with tasks achieves a good scalable speedup, while the program written with nested parallel regions does not.

In the parreg version, when OMP_NUM_THREADS is set 2, each parallel region will try to get 2 threads. Since there are nested parallel regions, each inner parallel region will also try to get 2 threads. Eventually all 8 threads will be created and used. Although 8 threads are used, the execution time is only comparable to the task version that uses 4 threads, and is longer than that of the task version using all 8 threads. The reason is that the data in the input array is set so that the recursive sort tree is unbalanced. A thread may finish its work early, and then must wait at the barrier even though it could help other threads.

For parreg, setting OMP_NUM_THREADS to 2 is the most efficient way to use threads. Since the degree of parallelism at each parallel region is two in quick sort, setting the OMP_NUM_THREADS to a number higher than 2 will cause more threads to be wasted. For example, when OMP_NUM_THREADS is 4. Only 2 threads at the outer-most level will get real jobs, the other two will be waiting at the barrier. Then at the second level, each of the two active threads will ask for 3 more slave thread. Since there are 4 threads left, one may get 3 threads and the other one get 1 threads, or each of them get 2 threads. No matter which case happens, two more threads will have no real work to do and merely wait at the barrier. Effectively, only 4 of the 8 threads have any real work to do when OMP_NUM_THREADS=4. When OMP_NUM_THREADS=8, then all 8 threads will be at the outermost level, 2 will have real work, while 6 will be wasted.

The task version does not have these problems. Any thread can work on any tasks, and therefore the threads can be used efficiently. There is only one parallel region, so the user can effectively control how many threads to be used by only setting the OMP_NUM_THREADS env variable.

----

The individuals who post here are part of the extended Sun Microsystems community and they might not be employed or in any way formally affiliated with Sun Microsystems. The opinions expressed here are their own, are not necessarily reviewed in advance by anyone but the individual authors, and neither Sun nor any other party necessarily agrees with them.

Copyright 1994-2009 Sun Microsystems, Inc.
Powered by Atlassian Confluence
Sun Guidelines on Public Discourse Privacy Policy Terms of Use Trademarks Site Map Employment Investor Relations Contact