... h1. Five Tips, Tricks and Idioms in Using the OpenMP 3.0 Tasking Feature _by Yuan Lin, Sun Microsystems OpenMP team_ h2. 1. Use {{single}} to start parallel region with a root task
To start a parallel region that has one initial root task, use the following idiom, {code} #pragma omp parallel { #pragma omp single nowait { /* this is the initial root task */
#pragma omp task { /* this is first child task */ }
#pragma omp task { /* this is second child task */ } } }
{code}
h2. 2. Avoiding creating extra tasks by noticing a child task is executed concurrently with its parent
Given the following piece of code and assume that {{foo()}} is called within an OpenMP task. Now you want make {{A()}} and {{B()}} execute concurrently by creating more tasks, how would you do that? {code} void foo () { A(); B(); } {code} You could do the following, {code} void foo () { #pragma omp task A();
#pragma omp task B(); } {code} But the following is often more efficient, {code} void foo () { #pragma omp task A();
B(); } {code} There's no need to put {{B()}} in it's own task because it already is in the task that includes {{foo()}}.
h2. 3. Create your own _taskgroup_ construct
According to the OpenMP 3.0 specifications, "_the {{taskwait}} construct specifies a wait on the completion of child tasks generated since the beginning of the current task_". What to do if you want a point that waits on the completion of a subset of the child tasks generated instead of all child tasks generated so far?
For example, suppose we have the following piece of code. {code} /* Compute f2 (A, f1 (B, C)) */ int foo () { int a, b, c, x, y;
a = A(); b = B(); c = C(); x = f1(b, c); y = f2(a, x);
return y; } {code} Let's assume that {{A(), B(), C(), f1(), f2()}} can all be executed concurrently. However, because of the data flow, we cannot start executing {{f1()}} until {{B()}} and {{C()}} finish, and cannot start executing {{f2()}} until {{A()}} and {{f1()}} finish, as illustrated below. {noformat} A -----------------+ B ------+ |--> f2 |---> f1 --+ C ------* {noformat} If we create three tasks (one for {{A(), B()}} and {{C()}} each), then we want to have a point that waits for {{B()}} and {{C()}}, but not for {{A()}}. So we can start {{f1()}} after this point. In otherwords, we want a _taskgroup_ construct that allows us to do the following {code} /* Compute f2 (A, f1 (B, C)) */ void foo () { int a, b, c, x, y;
#pragma omp task shared(a) a = A();
#pragma omp _taskgroup_ { #pragma omp task shared(b) b = B();
#pragma omp task shared(c) c = C(); } x = f1 (b, c);
#pragma omp taskwait
y = f2 (a, x); } {code}
However, OpenMP 3.0 tasking does not provide such a _taskgroup_ construct. All is not lost, however. We can simulate the _taskgroup_ construct by using a task construct with an {{if (0)}} clause and a {{taskwait}} construct as illustrated below
{code} #pragma omp _taskgroup_ #pragma omp task if (0) { { ... ====> ... #pragma omp taskwait } } {code}
We can now write our code like the following, {code} /* Compute f2 (A, f1 (B, C)) */ void foo () { int a, b, c, x, y;
#pragma omp task shared(a) a = A();
#pragma omp task if (0) shared (b, c, x) { #pragma omp task shared(b) b = B();
#pragma omp task shared(c) c = C();
#pragma omp taskwait } x = f1 (b, c);
#pragma omp taskwait
y = f2 (a, x); } {code}
h2. 4. Do not use tasks where you can use worksharing {{for}} loops
In most programs, each of iteration of an OpenMP worksharing {{for}} loop can be executed concurrently and still gives you the correct result. Therefore, if you convert a worksharing {{for}} loop into a loop with tasks (see the following example), you will probably still get the correct result. {code} /* An OpenMP worksharing for loop */ #pragma omp for for (i=0; i<n; i++) { foo(i); }
/* The above loop converted to use tasks */ #pragma omp single nowait { for (i=0; i<n; i++) { #pragma omp task firstprivate(i) foo(i); } } {code} But don't do that just because you can. A worksharing {{for}} loop (especially one that uses static scheduling) usually has lower overhead per iteration than a task loop.
Use a task construct for parallel _while_ loops, such as pointer chasing loops, that can not be expressed efficiently using worksharing {{for}} loops. Do not replace worksharing {{for}} loops with task loops.
h2. 5. How to do reductions in tasks?
Say you have a set of items, some of which are 'good' and the others are 'bad'. You want to find out the number of 'good' items. The set is implemented using a linked list. The sequential loop is {code} int count_good (item_t *item) { int n = 0; while (item) { if (is_good(item)) n ++; item = item->next; } return n; } {code} You can use a task loop to parallelize the while loop. How about the increment of {{n}}?
The {{atomic}} construct is probably the most efficient way of doing reductions in tasks. {code} int count_good (item_t *item) { int n = 0; #pragma omp parallel { #pragma omp single nowait { while (item) { #pragma omp task firstprivate(item) { if (is_good(item)) { #pragma omp atomic n ++; } } item = item->next; } } } return n; } {code} Or you can use a thread specific variable to hold the partial sum per-thread and then do a cross-thread reduction to get the total sum. {code} int count_good (item_t *item) { int n = 0; int pn[P]; /* P is the number of threads used. */ #pragma omp parallel { pn[omp_get_thread_num()] = 0; #pragma omp single nowait { while (item) { #pragma omp task firstprivate(item) { if (is_good(item)) { pn[omp_get_thread_num()] ++; } } item = item->next; } |