How are firstprivate and lastprivate different than private clauses in OpenMP

openmp

I've looked at the official definitions, but I'm still quite confused.

firstprivate: Specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value of the variable, because it exists before the parallel construct.

To me, that sounds a lot like private. I've looked for examples, but I don't seem to understand how it's special or how it can be used.

lastprivate: Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (for-loop construct) or last section (#pragma sections).

I feel like I understand this one a bit better because of the following example:

#pragma omp parallel
{
   #pragma omp for lastprivate(i)
      for (i=0; i<n-1; i++)
         a[i] = b[i] + b[i+1];
}
a[i]=b[i];

So, in this example, I understand that lastprivate allows for i to be returned outside of the loop as the last value it was.

I just started learning OpenMP today.

Best Answer

private variables are not initialised, i.e. they start with random values like any other local automatic variable (and they are often implemented using automatic variables on the stack of each thread). Take this simple program as an example:

#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i = 10;

    #pragma omp parallel private(i)
    {
        printf("thread %d: i = %d\n", omp_get_thread_num(), i);
        i = 1000 + omp_get_thread_num();
    }

    printf("i = %d\n", i);

    return 0;
}

With four threads it outputs something like:

thread 0: i = 0
thread 3: i = 32717
thread 1: i = 32717
thread 2: i = 1
i = 10

(another run of the same program)

thread 2: i = 1
thread 1: i = 1
thread 0: i = 0
thread 3: i = 32657
i = 10

This clearly demonstrates that the value of i is random (not initialised) inside the parallel region and that any modifications to it are not visible after the parallel region (i.e. the variable keeps its value from before entering the region).

If i is made firstprivate, then it is initialised with the value that it has before the parallel region:

thread 2: i = 10
thread 0: i = 10
thread 3: i = 10
thread 1: i = 10
i = 10

Still modifications to the value of i inside the parallel region are not visible after it.

You already know about lastprivate (and it is not applicable to the simple demonstration program as it lacks worksharing constructs).

So yes, firstprivate and lastprivate are just special cases of private. The first one results in bringing in values from the outside context into the parallel region while the second one transfers values from the parallel region to the outside context. The rationale behind these data-sharing classes is that inside the parallel region all private variables shadow the ones from the outside context, i.e. it is not possible to use an assignment operation to modify the outside value of i from inside the parallel region.

Related Solutions

How to ensure a dynamically allocated array is private in openmp

I don't think you do - what I did to solve this problem was used a parallel region #pragma omp parallel shared(...) private(...) and allocated the array dynamically inside the parallel region. Try this:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

/* compile with gcc -o test2 -fopenmp test2.c */

int main(int argc, char** argv)
{
    int i = 0;
    int size = 20;
    int* a = (int*) calloc(size, sizeof(int));
    int* b = (int*) calloc(size, sizeof(int));
    int* c;

    for ( i = 0; i < size; i++ )
    {
        a[i] = i;
        b[i] = size-i;
        printf("[BEFORE] At %d: a=%d, b=%d\n", i, a[i], b[i]);
    }

    #pragma omp parallel shared(a,b) private(c,i)
    {
        c = (int*) calloc(3, sizeof(int));

        #pragma omp for
        for ( i = 0; i < size; i++ )
        {
            c[0] = 5*a[i];
            c[1] = 2*b[i];
            c[2] = -2*i;
            a[i] = c[0]+c[1]+c[2];

            c[0] = 4*a[i];
            c[1] = -1*b[i];
            c[2] = i;
            b[i] = c[0]+c[1]+c[2];
        }

        free(c);
    }

    for ( i = 0; i < size; i++ )
    {
        printf("[AFTER] At %d: a=%d, b=%d\n", i, a[i], b[i]);
    }
}

That to me produced the same results as my earlier experiment program:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

/* compile with gcc -o test1 -fopenmp test1.c */

int main(int argc, char** argv)
{
    int i = 0;
    int size = 20;
    int* a = (int*) calloc(size, sizeof(int));
    int* b = (int*) calloc(size, sizeof(int));

    for ( i = 0; i < size; i++ )
    {
        a[i] = i;
        b[i] = size-i;
        printf("[BEFORE] At %d: a=%d, b=%d\n", i, a[i], b[i]);
    }

    #pragma omp parallel for shared(a,b) private(i)
    for ( i = 0; i < size; i++ )
    {
        a[i] = 5*a[i]+2*b[i]-2*i;
        b[i] = 4*a[i]-b[i]+i;
    }

    for ( i = 0; i < size; i++ )
    {
        printf("[AFTER] At %d: a=%d, b=%d\n", i, a[i], b[i]);
    }
}

At a guess I'd say because OpenMP can't deduce the size of the array it can't be private - only compile-time arrays can be done this way. I get segfaults when I try to private a dynamically allocated array, presumably because of access violations. Allocating the array on each thread as if you'd written this using pthreads makes sense and solves the issue.

Trouble with nested loops and openmp

This is a classic example of a race condition. Each of your openmp threads is accessing and updating a shared value at the same time, and there's no guaantee that some of the updates won't get lost (at best) or the resulting answer won't be gibberish (at worst).

The thing with race conditions is that they depend sensitively on the timing; in a smaller case (eg, with smaller NREC and NLIG) you might sometimes miss this, but in a larger case, it'll eventually always come up.

The reason you get wrong answers without the #pragma omp for is that as soon as you enter the parallel region, all of your openmp threads start; and unless you use something like an omp for (a so-called worksharing construct) to split up the work, each thread will do everything in the parallel section - so all the threads will be doing the same entire sum, all updating S2 simultatneously.

You have to be careful with OpenMP threads updating shared variables. OpenMP has atomic operations to allow you to safely modify a shared variable. An example follows (unfortunately, your example is so sensitive to summation order it's hard to see what's going on, so I've changed your sum somewhat:). In the mysumallatomic, each thread updates S2 as before, but this time it's done safely:

#include <omp.h>
#include <math.h>
#include <stdio.h>

double mysumorig() {

    double S2 = 0;
    int a, b;
    for(a=0;a<128;a++){
        for(b=0;b<128;b++){
            S2=S2+a*b;
        }
    }

    return S2;
}


double mysumallatomic() {

    double S2 = 0.;
#pragma omp parallel for shared(S2)
    for(int a=0; a<128; a++){
        for(int b=0; b<128;b++){
            double myterm = (double)a*b;
            #pragma omp atomic
            S2 += myterm;
        }
    }

    return S2;
}


double mysumonceatomic() {

    double S2 = 0.;
#pragma omp parallel shared(S2)
    {
        double mysum = 0.;
        #pragma omp for
        for(int a=0; a<128; a++){
            for(int b=0; b<128;b++){
                mysum += (double)a*b;
            }
        }
        #pragma omp atomic
        S2 += mysum;
    }
    return S2;
}

int main() {
    printf("(Serial)      S2 = %f\n", mysumorig());
    printf("(All Atomic)  S2 = %f\n", mysumallatomic());
    printf("(Atomic Once) S2 = %f\n", mysumonceatomic());
    return 0;
}

However, that atomic operation really hurts parallel performance (after all, the whole point is to prevent parallel operation around the variable S2!) so a better approach is to do the summations and only do the atomic operation after both summations rather than doing it 128*128 times; that's the mysumonceatomic() routine, which only incurs the synchronization overhead once per thread rather than 16k times per thread.

But this is such a common operation that there's no need to implment it yourself. One can use an OpenMP built-in functionality for reduction operations (a reduction is an operation like calculating a sum of a list, finding the min or max of a list, etc, which can be done one element at a time only by looking at the result so far and the next element) as suggested by @ejd. OpenMP will work and is faster (it's optimized implementation is much faster than what you can do on your own with other OpenMP operations).

As you can see, either approach works:

$ ./foo
(Serial)      S2 = 66064384.000000
(All Atomic)  S2 = 66064384.000000
(Atomic Once) S2 = 66064384.00000

Best Answer

Related Solutions

How to ensure a dynamically allocated array is private in openmp

Trouble with nested loops and openmp

Related Topic