h1. Five Tips, Tricks and Idioms in Using the OpenMP 3.0 Tasking Feature
_by Yuan Lin, Sun Microsystems OpenMP team_
h2. 1. Use {{single}} to start parallel region with a root task
To start a parallel region that has one initial root task, use the following idiom,
{code}
#pragma omp parallel
{
#pragma omp single nowait
{
/* this is the initial root task */
#pragma omp task
{
/* this is first child task */
}
#pragma omp task
{
/* this is second child task */
}
}
}
{code}
h2. 2. Avoiding creating extra tasks by noticing a child task is executed concurrently with its parent
Given the following piece of code and assume that {{foo()}} is called within an OpenMP task. Now you want make {{A()}} and {{B()}} execute concurrently by creating more tasks, how would you do that?
{code}
void foo ()
{
A();
B();
}
{code}
You could do the following,
{code}
void foo ()
{
#pragma omp task
A();
#pragma omp task
B();
}
{code}
But the following is often more efficient,
{code}
void foo ()
{
#pragma omp task
A();
B();
}
{code}
There's no need to put {{B()}} in it's own task because it already is in the task that includes {{foo()}}.
h2. 3. Create your own _taskgroup_ construct
According to the OpenMP 3.0 specifications, "_the {{taskwait}} construct specifies a wait on the completion of child tasks generated since the beginning of the current task_". What to do if you want a point that waits on the completion of a subset of the child tasks generated instead of all child tasks generated so far?
For example, suppose we have the following piece of code.
{code}
/* Compute f2 (A, f1 (B, C))
*/
int foo ()
{
int a, b, c, x, y;
a = A();
b = B();
c = C();
x = f1(b, c);
y = f2(a, x);
return y;
}
{code}
Let's assume that {{A(), B(), C(), f1(), f2()}} can all be executed concurrently. However, because of the data flow, we cannot start executing {{f1()}} until {{B()}} and {{C()}} finish, and cannot start executing {{f2()}} until {{A()}} and {{f1()}} finish, as illustrated below.
{noformat}
A -----------------+
B ------+ |--> f2
|---> f1 --+
C ------*
{noformat}
If we create three tasks (one for {{A(), B()}} and {{C()}} each), then we want to have a point that waits for {{B()}} and {{C()}}, but not for {{A()}}. So we can start {{f1()}} after this point. In otherwords, we want a _taskgroup_ construct that allows us to do the following
{code}
/* Compute f2 (A, f1 (B, C))
*/
void foo ()
{
int a, b, c, x, y;
#pragma omp task shared(a)
a = A();
#pragma omp _taskgroup_
{
#pragma omp task shared(b)
b = B();
#pragma omp task shared(c)
c = C();
}
x = f1 (b, c);
#pragma omp taskwait
y = f2 (a, x);
}
{code}
However, OpenMP 3.0 tasking does not provide such a _taskgroup_ construct. All is not lost, however. We can simulate the _taskgroup_ construct by using a task construct with an {{if (0)}} clause and a {{taskwait}} construct as illustrated below
{code}
#pragma omp _taskgroup_ #pragma omp task if (0)
{ {
... ====> ...
#pragma omp taskwait
} }
{code}
We can now write our code like the following,
{code}
/* Compute f2 (A, f1 (B, C))
*/
void foo ()
{
int a, b, c, x, y;
#pragma omp task shared(a)
a = A();
#pragma omp task if (0) shared (b, c, x)
{
#pragma omp task shared(b)
b = B();
#pragma omp task shared(c)
c = C();
#pragma omp taskwait
}
x = f1 (b, c);
#pragma omp taskwait
y = f2 (a, x);
}
{code}
h2. 4. Do not use tasks where you can use worksharing {{for}} loops
In most programs, each of iteration of an OpenMP worksharing {{for}} loop can be executed concurrently and still gives you the correct result. Therefore, if you convert a worksharing {{for}} loop into a loop with tasks (see the following example), you will probably still get the correct result.
{code}
/* An OpenMP worksharing for loop */
#pragma omp for
for (i=0; i<n; i++) {
foo(i);
}
/* The above loop converted to use tasks */
#pragma omp single nowait
{
for (i=0; i<n; i++) {
#pragma omp task firstprivate(i)
foo(i);
}
}
{code}
But don't do that just because you can. A worksharing {{for}} loop (especially one that uses static scheduling) usually has lower overhead per iteration than a task loop.
Use a task construct for parallel _while_ loops, such as pointer chasing loops, that can not be expressed efficiently using worksharing {{for}} loops. Do not replace worksharing {{for}} loops with task loops.
h2. 5. How to do reductions in tasks?
Say you have a set of items, some of which are 'good' and the others are 'bad'. You want to find out the number of 'good' items. The set is implemented using a linked list. The sequential loop is
{code}
int count_good (item_t *item)
{
int n = 0;
while (item) {
if (is_good(item))
n ++;
item = item->next;
}
return n;
}
{code}
You can use a task loop to parallelize the while loop. How about the increment of {{n}}?
The {{atomic}} construct is probably the most efficient way of doing reductions in tasks.
{code}
int count_good (item_t *item)
{
int n = 0;
#pragma omp parallel
{
#pragma omp single nowait
{
while (item) {
#pragma omp task firstprivate(item)
{
if (is_good(item)) {
#pragma omp atomic
n ++;
}
}
item = item->next;
}
}
}
return n;
}
{code}
Or you can use a thread specific variable to hold the partial sum per-thread and then do a cross-thread reduction to get the total sum.
{code}
int count_good (item_t *item)
{
int n = 0;
int pn[P]; /* P is the number of threads used. */
#pragma omp parallel
{
pn[omp_get_thread_num()] = 0;
#pragma omp single nowait
{
while (item) {
#pragma omp task firstprivate(item)
{
if (is_good(item)) {
pn[omp_get_thread_num()] ++;
}
}
item = item->next;
}
}
#pragma omp atomic
n += pn[omp_get_thread_num()];
}
return n;
}
{code}
----
_by Yuan Lin, Sun Microsystems OpenMP team_
h2. 1. Use {{single}} to start parallel region with a root task
To start a parallel region that has one initial root task, use the following idiom,
{code}
#pragma omp parallel
{
#pragma omp single nowait
{
/* this is the initial root task */
#pragma omp task
{
/* this is first child task */
}
#pragma omp task
{
/* this is second child task */
}
}
}
{code}
h2. 2. Avoiding creating extra tasks by noticing a child task is executed concurrently with its parent
Given the following piece of code and assume that {{foo()}} is called within an OpenMP task. Now you want make {{A()}} and {{B()}} execute concurrently by creating more tasks, how would you do that?
{code}
void foo ()
{
A();
B();
}
{code}
You could do the following,
{code}
void foo ()
{
#pragma omp task
A();
#pragma omp task
B();
}
{code}
But the following is often more efficient,
{code}
void foo ()
{
#pragma omp task
A();
B();
}
{code}
There's no need to put {{B()}} in it's own task because it already is in the task that includes {{foo()}}.
h2. 3. Create your own _taskgroup_ construct
According to the OpenMP 3.0 specifications, "_the {{taskwait}} construct specifies a wait on the completion of child tasks generated since the beginning of the current task_". What to do if you want a point that waits on the completion of a subset of the child tasks generated instead of all child tasks generated so far?
For example, suppose we have the following piece of code.
{code}
/* Compute f2 (A, f1 (B, C))
*/
int foo ()
{
int a, b, c, x, y;
a = A();
b = B();
c = C();
x = f1(b, c);
y = f2(a, x);
return y;
}
{code}
Let's assume that {{A(), B(), C(), f1(), f2()}} can all be executed concurrently. However, because of the data flow, we cannot start executing {{f1()}} until {{B()}} and {{C()}} finish, and cannot start executing {{f2()}} until {{A()}} and {{f1()}} finish, as illustrated below.
{noformat}
A -----------------+
B ------+ |--> f2
|---> f1 --+
C ------*
{noformat}
If we create three tasks (one for {{A(), B()}} and {{C()}} each), then we want to have a point that waits for {{B()}} and {{C()}}, but not for {{A()}}. So we can start {{f1()}} after this point. In otherwords, we want a _taskgroup_ construct that allows us to do the following
{code}
/* Compute f2 (A, f1 (B, C))
*/
void foo ()
{
int a, b, c, x, y;
#pragma omp task shared(a)
a = A();
#pragma omp _taskgroup_
{
#pragma omp task shared(b)
b = B();
#pragma omp task shared(c)
c = C();
}
x = f1 (b, c);
#pragma omp taskwait
y = f2 (a, x);
}
{code}
However, OpenMP 3.0 tasking does not provide such a _taskgroup_ construct. All is not lost, however. We can simulate the _taskgroup_ construct by using a task construct with an {{if (0)}} clause and a {{taskwait}} construct as illustrated below
{code}
#pragma omp _taskgroup_ #pragma omp task if (0)
{ {
... ====> ...
#pragma omp taskwait
} }
{code}
We can now write our code like the following,
{code}
/* Compute f2 (A, f1 (B, C))
*/
void foo ()
{
int a, b, c, x, y;
#pragma omp task shared(a)
a = A();
#pragma omp task if (0) shared (b, c, x)
{
#pragma omp task shared(b)
b = B();
#pragma omp task shared(c)
c = C();
#pragma omp taskwait
}
x = f1 (b, c);
#pragma omp taskwait
y = f2 (a, x);
}
{code}
h2. 4. Do not use tasks where you can use worksharing {{for}} loops
In most programs, each of iteration of an OpenMP worksharing {{for}} loop can be executed concurrently and still gives you the correct result. Therefore, if you convert a worksharing {{for}} loop into a loop with tasks (see the following example), you will probably still get the correct result.
{code}
/* An OpenMP worksharing for loop */
#pragma omp for
for (i=0; i<n; i++) {
foo(i);
}
/* The above loop converted to use tasks */
#pragma omp single nowait
{
for (i=0; i<n; i++) {
#pragma omp task firstprivate(i)
foo(i);
}
}
{code}
But don't do that just because you can. A worksharing {{for}} loop (especially one that uses static scheduling) usually has lower overhead per iteration than a task loop.
Use a task construct for parallel _while_ loops, such as pointer chasing loops, that can not be expressed efficiently using worksharing {{for}} loops. Do not replace worksharing {{for}} loops with task loops.
h2. 5. How to do reductions in tasks?
Say you have a set of items, some of which are 'good' and the others are 'bad'. You want to find out the number of 'good' items. The set is implemented using a linked list. The sequential loop is
{code}
int count_good (item_t *item)
{
int n = 0;
while (item) {
if (is_good(item))
n ++;
item = item->next;
}
return n;
}
{code}
You can use a task loop to parallelize the while loop. How about the increment of {{n}}?
The {{atomic}} construct is probably the most efficient way of doing reductions in tasks.
{code}
int count_good (item_t *item)
{
int n = 0;
#pragma omp parallel
{
#pragma omp single nowait
{
while (item) {
#pragma omp task firstprivate(item)
{
if (is_good(item)) {
#pragma omp atomic
n ++;
}
}
item = item->next;
}
}
}
return n;
}
{code}
Or you can use a thread specific variable to hold the partial sum per-thread and then do a cross-thread reduction to get the total sum.
{code}
int count_good (item_t *item)
{
int n = 0;
int pn[P]; /* P is the number of threads used. */
#pragma omp parallel
{
pn[omp_get_thread_num()] = 0;
#pragma omp single nowait
{
while (item) {
#pragma omp task firstprivate(item)
{
if (is_good(item)) {
pn[omp_get_thread_num()] ++;
}
}
item = item->next;
}
}
#pragma omp atomic
n += pn[omp_get_thread_num()];
}
return n;
}
{code}
----