In practice, the difference is in the location where the preprocessor searches for the included file.
For #include <filename>
the preprocessor searches in an implementation dependent manner, normally in search directories pre-designated by the compiler/IDE. This method is normally used to include standard library header files.
For #include "filename"
the preprocessor searches first in the same directory as the file containing the directive, and then follows the search path used for the #include <filename>
form. This method is normally used to include programmer-defined header files.
A more complete description is available in the GCC documentation on search paths.
Setting a bit
Use the bitwise OR operator (|
) to set a bit.
number |= 1UL << n;
That will set the n
th bit of number
. n
should be zero, if you want to set the 1
st bit and so on upto n-1
, if you want to set the n
th bit.
Use 1ULL
if number
is wider than unsigned long
; promotion of 1UL << n
doesn't happen until after evaluating 1UL << n
where it's undefined behaviour to shift by more than the width of a long
. The same applies to all the rest of the examples.
Clearing a bit
Use the bitwise AND operator (&
) to clear a bit.
number &= ~(1UL << n);
That will clear the n
th bit of number
. You must invert the bit string with the bitwise NOT operator (~
), then AND it.
Toggling a bit
The XOR operator (^
) can be used to toggle a bit.
number ^= 1UL << n;
That will toggle the n
th bit of number
.
Checking a bit
You didn't ask for this, but I might as well add it.
To check a bit, shift the number n to the right, then bitwise AND it:
bit = (number >> n) & 1U;
That will put the value of the n
th bit of number
into the variable bit
.
Changing the nth bit to x
Setting the n
th bit to either 1
or 0
can be achieved with the following on a 2's complement C++ implementation:
number ^= (-x ^ number) & (1UL << n);
Bit n
will be set if x
is 1
, and cleared if x
is 0
. If x
has some other value, you get garbage. x = !!x
will booleanize it to 0 or 1.
To make this independent of 2's complement negation behaviour (where -1
has all bits set, unlike on a 1's complement or sign/magnitude C++ implementation), use unsigned negation.
number ^= (-(unsigned long)x ^ number) & (1UL << n);
or
unsigned long newbit = !!x; // Also booleanize to force 0 or 1
number ^= (-newbit ^ number) & (1UL << n);
It's generally a good idea to use unsigned types for portable bit manipulation.
or
number = (number & ~(1UL << n)) | (x << n);
(number & ~(1UL << n))
will clear the n
th bit and (x << n)
will set the n
th bit to x
.
It's also generally a good idea to not to copy/paste code in general and so many people use preprocessor macros (like the community wiki answer further down) or some sort of encapsulation.
Best Answer
This is a classic example of a race condition. Each of your openmp threads is accessing and updating a shared value at the same time, and there's no guaantee that some of the updates won't get lost (at best) or the resulting answer won't be gibberish (at worst).
The thing with race conditions is that they depend sensitively on the timing; in a smaller case (eg, with smaller NREC and NLIG) you might sometimes miss this, but in a larger case, it'll eventually always come up.
The reason you get wrong answers without the
#pragma omp for
is that as soon as you enter the parallel region, all of your openmp threads start; and unless you use something like anomp for
(a so-called worksharing construct) to split up the work, each thread will do everything in the parallel section - so all the threads will be doing the same entire sum, all updatingS2
simultatneously.You have to be careful with OpenMP threads updating shared variables. OpenMP has
atomic
operations to allow you to safely modify a shared variable. An example follows (unfortunately, your example is so sensitive to summation order it's hard to see what's going on, so I've changed your sum somewhat:). In themysumallatomic
, each thread updatesS2
as before, but this time it's done safely:However, that atomic operation really hurts parallel performance (after all, the whole point is to prevent parallel operation around the variable
S2
!) so a better approach is to do the summations and only do the atomic operation after both summations rather than doing it 128*128 times; that's themysumonceatomic()
routine, which only incurs the synchronization overhead once per thread rather than 16k times per thread.But this is such a common operation that there's no need to implment it yourself. One can use an OpenMP built-in functionality for reduction operations (a reduction is an operation like calculating a sum of a list, finding the min or max of a list, etc, which can be done one element at a time only by looking at the result so far and the next element) as suggested by @ejd. OpenMP will work and is faster (it's optimized implementation is much faster than what you can do on your own with other OpenMP operations).
As you can see, either approach works: