A simple answer:
Before version 4.0, OpenMP was only used to exploit multiple threads on multiple cores. The new simd extension allows you to explicitly use SIMD instructions on modern CPUs, such as Intel's AVX/SSE and ARM's NEON.
(Note that a SIMD instruction is executed in a single thread on a single core, by design. The meaning of SIMD can be stretched quite a bit for GPGPU, but I don't think you need to consider GPGPU for OpenMP 4.0.)
So, once you know SIMD instructions, you can use this new construct.
In a modern CPU, there are roughly three types of parallelism: (1) instruction-level parallelism (ILP), (2) thread-level parallelism (TLP), and (3) SIMD instructions (we could call this vector-level parallelism).
ILP is exploited automatically by out-of-order CPUs or by compilers. You can exploit TLP using OpenMP's parallel for and other threading libraries. So, what about SIMD? Intrinsics were one way to use SIMD instructions (as was compilers' automatic vectorization). OpenMP's simd is a new way to use SIMD.
Take a very simple example:
for (int i = 0; i < N; ++i)
    A[i] = B[i] + C[i];
The above code computes the sum of two N-dimensional vectors. As you can easily see, there is no (loop-carried) data dependency on the array A[]. This loop is embarrassingly parallel.
There are multiple ways to parallelize this loop. For example, before OpenMP 4.0, it could be parallelized only with the parallel for construct. Each thread then performs roughly N/#threads iterations, spread over multiple cores.
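As a minimal sketch of the thread-level version, assuming float arrays and a hypothetical function name add_parallel (neither is given in the question):

```c
/* Element-wise sum a[i] = b[i] + c[i], with the iterations divided among
   the threads of the team. Build with OpenMP enabled (e.g. -fopenmp for
   GCC or Clang); without it the pragma is ignored and the loop simply
   runs serially with the same result. */
void add_parallel(float *a, const float *b, const float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

With the default schedule, most implementations hand each thread one contiguous chunk of about n divided by the number of threads iterations.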
However, you might think that using multiple threads for such a simple addition is overkill. That is why there is vectorization, which is mostly implemented with SIMD instructions.
Using SIMD, it would look like this:
for (int i = 0; i < N; i += 8)
    VECTOR_ADD(A + i, B + i, C + i);
This code assumes that (1) the SIMD instruction (VECTOR_ADD) is 256-bit, i.e. 8-way (8 * 32 bits); and (2) N is a multiple of 8.
An 8-way SIMD instruction means that 8 items of a vector are processed by a single machine instruction. Note that Intel's AVX provides such 8-way (32-bit * 8 = 256 bits) vector instructions.
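When N is not a multiple of 8, a hand-vectorized loop usually needs a scalar remainder loop. The sketch below illustrates that structure in plain C. The helper vector_add8 is a hypothetical stand-in for VECTOR_ADD and merely emulates one 8-way add with a small loop; it is not a real intrinsic.

```c
#define LANES 8  /* assumed vector width: 8 x 32-bit floats = 256 bits */

/* Hypothetical stand-in for a single 8-way SIMD add. A real version
   would be one vector instruction rather than a loop over the lanes. */
static void vector_add8(float *a, const float *b, const float *c)
{
    for (int lane = 0; lane < LANES; ++lane)
        a[lane] = b[lane] + c[lane];
}

void add_strip_mined(float *a, const float *b, const float *c, int n)
{
    int i = 0;

    /* Main loop: one "vector" operation per block of 8 elements. */
    for (; i + LANES <= n; i += LANES)
        vector_add8(a + i, b + i, c + i);

    /* Remainder loop: the last n % 8 elements, one at a time. */
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}
```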
With SIMD, you still use a single core (again, this applies to conventional CPUs, not GPUs), but you exploit parallelism that is hidden in the hardware: modern CPUs dedicate hardware resources to SIMD instructions, so that all SIMD lanes are processed in parallel.
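With the simd construct, the loop can stay in its scalar form and the compiler is asked to vectorize it. A minimal sketch, assuming float arrays that do not overlap and a hypothetical function name add_simd:

```c
/* Same element-wise sum, executed by a single thread. The simd construct
   tells the compiler that the iterations may be executed concurrently
   in SIMD lanes. Vector width and remainder handling are left to the
   compiler. Build with -fopenmp or -fopenmp-simd on GCC or Clang;
   without either flag the pragma is ignored and the loop stays scalar
   (or is auto-vectorized at the compiler's discretion). */
void add_simd(float *a, const float *b, const float *c, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

Whether SIMD instructions are actually emitted still depends on the compiler and the target architecture flags.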
You can use thread-level parallelism at the same time. The above example can be further parallelized by parallel for.
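OpenMP 4.0 also defines a combined parallel for simd construct for this case. A minimal sketch, again assuming non-overlapping float arrays and a hypothetical function name:

```c
/* The iterations are first divided among threads (parallel for), and
   each thread's chunk may then be executed with SIMD instructions
   (simd). Without OpenMP enabled the pragma is ignored and the loop
   runs serially with the same result. */
void add_parallel_simd(float *a, const float *b, const float *c, int n)
{
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

For a loop this cheap, the threading overhead only pays off when n is large.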
(However, I am not sure how many loops can actually be transformed into SIMD loops. The OpenMP 4.0 specification seems a bit unclear on this, so real performance and practical restrictions will depend on the actual compiler implementations.)
To summarize, the simd construct allows you to use SIMD instructions, so that more parallelism can be exploited on top of thread-level parallelism. However, I think the actual implementations will matter.
Best Answer
I would suggest that you have a look at the OpenMP tutorial from Lawrence Livermore National Laboratory, available here.
Your particular example is one that should not be implemented using OpenMP tasks. The second code creates N times the number of threads tasks (because there is an error in the code besides the missing }; I will come back to it later), and each task performs only a very simple computation. The overhead of tasks would be gigantic, as you can see in my answer to this question.
Besides, the second code is conceptually wrong. Since there is no worksharing directive, all threads would execute all iterations of the loop, and instead of N tasks, N times the number of threads tasks would get created. It should be rewritten in one of the following ways:
Single task producer - common pattern, NUMA unfriendly:
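The question's code is not reproduced here, so the following is only a sketch of the single-producer pattern. The function names and the per-item work (process_item, which just stores 2 * i) are hypothetical placeholders.

```c
/* Hypothetical per-item work; stands in for whatever each task does. */
static void process_item(int *data, int i)
{
    data[i] = 2 * i;
}

void run_single_producer(int *data, int n)
{
    #pragma omp parallel
    {
        /* Only one thread runs the loop and creates the n tasks. */
        #pragma omp single
        {
            for (int i = 0; i < n; ++i) {
                /* Each task gets its own copy of the loop index. */
                #pragma omp task firstprivate(i)
                process_item(data, i);
            }
        }
        /* Implicit barrier at the end of single: the other threads
           wait here and pick up tasks as they become available. */
    }
}
```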
The single directive makes the loop run inside a single thread only. All other threads skip it and hit the implicit barrier at the end of the single construct. As barriers contain implicit task scheduling points, the waiting threads start processing tasks as soon as they become available.
Parallel task producer - more NUMA friendly:
In this case the task creation loop would be shared among the threads.
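A sketch of the parallel-producer pattern under the same caveat: the question's code is not reproduced here, and handle_item (which just stores 2 * i) is a hypothetical placeholder for the real work.

```c
/* Hypothetical per-item work; stands in for whatever each task does. */
static void handle_item(int *data, int i)
{
    data[i] = 2 * i;
}

void run_parallel_producer(int *data, int n)
{
    #pragma omp parallel
    {
        /* The worksharing for construct divides the iterations among
           the threads, so each thread creates only its share of the
           n tasks instead of every thread creating all of them. */
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            #pragma omp task firstprivate(i)
            handle_item(data, i);
        }
        /* Implicit barrier at the end of the for construct: all tasks
           created by the team are complete once it is passed. */
    }
}
```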
If you do not know what NUMA is, ignore the comments about NUMA friendliness.