How to use omp parallel for and omp simd together

copenmpparallel-processingsimdx86

I want to test #pragma omp parallel for and #pragma omp simd for a simple matrix addition program. When I use each of them separately, I get no error and it seems fine. But, I want to test how much performance can be gained using both of them. If I use #pragma omp parallel for before the outer loop and #pragma omp simd before the inner loop I get no error as well. The error occures when I use both of them before the outer loop. I get an error at runtime not compile time. ICC and GCC return error but Clang doesn't. It might be because Clang regect the parallelization. In my experiments, Clang does not parallelize and run the program with only one thread.

The program is here:

#include <stdio.h>
//#include <x86intrin.h>
#define N 512
#define M N

int __attribute__(( aligned(32))) a[N][M],
    __attribute__(( aligned(32))) b[N][M],
    __attribute__(( aligned(32))) c_result[N][M];

int main()
{
    int i, j;
    #pragma omp parallel for
    #pragma omp simd
    for( i=0;i<N;i++){
        for(j=0;j<M;j++){
            c_result[i][j]= a[i][j] + b[i][j];
        }
    }

    return 0;
}

The error for:
ICC:

IMP1.c(20): error: omp directive is not followed by a parallelizable
for loop #pragma omp parallel for ^

compilation aborted for IMP1.c (code 2)

GCC:

IMP1.c: In function ‘main’:

IMP1.c:21:10: error: for statement
expected before ‘#pragma’ #pragma omp simd

Because in my other testes pragma omp simd for outer loop gets better performance I need to put that there (don't I?).

Platform: Intel Core i7 6700 HQ, Fedora 27

Tested compilers: ICC 18, GCC 7.2, Clang 5

Compiler command line:

icc -O3 -qopenmp -xHOST -no-vec

gcc -O3 -fopenmp -march=native -fno-tree-vectorize -fno-tree-slp-vectorize

clang -O3 -fopenmp=libgomp -march=native -fno-vectorize -fno-slp-vectorize

Best Answer

From OpenMP 4.5 Specification:

2.11.4 Parallel Loop SIMD Construct

The parallel loop SIMD construct is a shortcut for specifying a parallel construct containing one loop SIMD construct and no other statement.

The syntax of the parallel loop SIMD construct is as follows:

#pragma omp parallel for simd ...

You can also write:

#pragma omp parallel
{
   #pragma omp for simd
   for ...
}

Setting a bit

Use the bitwise OR operator (|) to set a bit.

number |= 1UL << n;

That will set the nth bit of number. n should be zero, if you want to set the 1st bit and so on upto n-1, if you want to set the nth bit.

Use 1ULL if number is wider than unsigned long; promotion of 1UL << n doesn't happen until after evaluating 1UL << n where it's undefined behaviour to shift by more than the width of a long. The same applies to all the rest of the examples.

Clearing a bit

Use the bitwise AND operator (&) to clear a bit.

number &= ~(1UL << n);

That will clear the nth bit of number. You must invert the bit string with the bitwise NOT operator (~), then AND it.

Toggling a bit

The XOR operator (^) can be used to toggle a bit.

number ^= 1UL << n;

That will toggle the nth bit of number.

Checking a bit

You didn't ask for this, but I might as well add it.

To check a bit, shift the number n to the right, then bitwise AND it:

bit = (number >> n) & 1U;

That will put the value of the nth bit of number into the variable bit.

Changing the nth bit to x

Setting the nth bit to either 1 or 0 can be achieved with the following on a 2's complement C++ implementation:

number ^= (-x ^ number) & (1UL << n);

Bit n will be set if x is 1, and cleared if x is 0. If x has some other value, you get garbage. x = !!x will booleanize it to 0 or 1.

To make this independent of 2's complement negation behaviour (where -1 has all bits set, unlike on a 1's complement or sign/magnitude C++ implementation), use unsigned negation.

number ^= (-(unsigned long)x ^ number) & (1UL << n);

unsigned long newbit = !!x;    // Also booleanize to force 0 or 1
number ^= (-newbit ^ number) & (1UL << n);

It's generally a good idea to use unsigned types for portable bit manipulation.

number = (number & ~(1UL << n)) | (x << n);

(number & ~(1UL << n)) will clear the nth bit and (x << n) will set the nth bit to x.

It's also generally a good idea to not to copy/paste code in general and so many people use preprocessor macros (like the community wiki answer further down) or some sort of encapsulation.

Omp parallel vs. omp parallel for

These are equivalent.

#pragma omp parallel spawns a group of threads, while #pragma omp for divides loop iterations between the spawned threads. You can do both things at once with the fused #pragma omp parallel for directive.