Extracting bits with a single multiplication

bit-manipulationcmultiplication

I saw an interesting technique used in an answer to another question, and would like to understand it a little better.

We're given an unsigned 64-bit integer, and we are interested in the following bits:

1.......2.......3.......4.......5.......6.......7.......8.......

Specifically, we'd like to move them to the top eight positions, like so:

12345678........................................................

We don't care about the value of the bits indicated by ., and they don't have to be preserved.

The solution was to mask out the unwanted bits, and multiply the result by 0x2040810204081. This, as it turns out, does the trick.

How general is this method? Can this technique be used to extract any subset of bits? If not, how does one figure out whether or not the method works for a particular set of bits?

Finally, how would one go about finding the (a?) correct multiplier to extract the given bits?

Best Answer

Very interesting question, and clever trick.

Let's look at a simple example of getting a single byte manipulated. Using unsigned 8 bit for simplicity. Imagine your number is xxaxxbxx and you want ab000000.

The solution consisted of two steps: a bit masking, followed by multiplication. The bit mask is a simple AND operation that turns uninteresting bits to zeros. In the above case, your mask would be 00100100 and the result 00a00b00.

Now the hard part: turning that into ab.......

A multiplication is a bunch of shift-and-add operations. The key is to allow overflow to "shift away" the bits we don't need and put the ones we want in the right place.

Multiplication by 4 (00000100) would shift everything left by 2 and get you to a00b0000 . To get the b to move up we need to multiply by 1 (to keep the a in the right place) + 4 (to move the b up). This sum is 5, and combined with the earlier 4 we get a magic number of 20, or 00010100. The original was 00a00b00 after masking; the multiplication gives:

000000a00b000000
00000000a00b0000 +
----------------
000000a0ab0b0000
xxxxxxxxab......

From this approach you can extend to larger numbers and more bits.

One of the questions you asked was "can this be done with any number of bits?" I think the answer is "no", unless you allow several masking operations, or several multiplications. The problem is the issue of "collisions" - for example, the "stray b" in the problem above. Imagine we need to do this to a number like xaxxbxxcx. Following the earlier approach, you would think we need {x 2, x {1 + 4 + 16}} = x 42 (oooh - the answer to everything!). Result:

00000000a00b00c00
000000a00b00c0000
0000a00b00c000000
-----------------
0000a0ababcbc0c00
xxxxxxxxabc......

As you can see, it still works, but "only just". They key here is that there is "enough space" between the bits we want that we can squeeze everything up. I could not add a fourth bit d right after c, because I would get instances where I get c+d, bits might carry, ...

So without formal proof, I would answer the more interesting parts of your question as follows: "No, this will not work for any number of bits. To extract N bits, you need (N-1) spaces between the bits you want to extract, or have additional mask-multiply steps."

The only exception I can think of for the "must have (N-1) zeros between bits" rule is this: if you want to extract two bits that are adjacent to each other in the original, AND you want to keep them in the same order, then you can still do it. And for the purpose of the (N-1) rule they count as two bits.

There is another insight - inspired by the answer of @Ternary below (see my comment there). For each interesting bit, you only need as many zeros to the right of it as you need space for bits that need to go there. But also, it needs as many bits to the left as it has result-bits to the left. So if a bit b ends up in position m of n, then it needs to have m-1 zeros to its left, and n-m zeros to its right. Especially when the bits are not in the same order in the original number as they will be after the re-ordering, this is an important improvement to the original criteria. This means, for example, that a 16 bit word

a...e.b...d..c..

Can be shifted into

abcde...........

even though there is only one space between e and b, two between d and c, three between the others. Whatever happened to N-1?? In this case, a...e becomes "one block" - they are multiplied by 1 to end up in the right place, and so "we got e for free". The same is true for b and d (b needs three spaces to the right, d needs the same three to its left). So when we compute the magic number, we find there are duplicates:

a: << 0  ( x 1    )
b: << 5  ( x 32   )
c: << 11 ( x 2048 )
d: << 5  ( x 32   )  !! duplicate
e: << 0  ( x 1    )  !! duplicate

Clearly, if you wanted these numbers in a different order, you would have to space them further. We can reformulate the (N-1) rule: "It will always work if there are at least (N-1) spaces between bits; or, if the order of bits in the final result is known, then if a bit b ends up in position m of n, it needs to have m-1 zeros to its left, and n-m zeros to its right."

@Ternary pointed out that this rule doesn't quite work, as there can be a carry from bits adding "just to the right of the target area" - namely, when the bits we're looking for are all ones. Continuing the example I gave above with the five tightly packed bits in a 16 bit word: if we start with

a...e.b...d..c..

For simplicity, I will name the bit positions ABCDEFGHIJKLMNOP

The math we were going to do was

ABCDEFGHIJKLMNOP

a000e0b000d00c00
0b000d00c0000000
000d00c000000000
00c0000000000000 +
----------------
abcded(b+c)0c0d00c00

Until now, we thought anything below abcde (positions ABCDE) would not matter, but in fact, as @Ternary pointed out, if b=1, c=1, d=1 then (b+c) in position G will cause a bit to carry to position F, which means that (d+1) in position F will carry a bit into E - and our result is spoilt. Note that space to the right of the least significant bit of interest (c in this example) doesn't matter, since the multiplication will cause padding with zeros from beyone the least significant bit.

So we need to modify our (m-1)/(n-m) rule. If there is more than one bit that has "exactly (n-m) unused bits to the right (not counting the last bit in the pattern - "c" in the example above), then we need to strengthen the rule - and we have to do so iteratively!

We have to look not only at the number of bits that meet the (n-m) criterion, but also the ones that are at (n-m+1), etc. Let's call their number Q0 (exactly n-m to next bit), Q1 (n-m+1), up to Q(N-1) (n-1). Then we risk carry if

Q0 > 1
Q0 == 1 && Q1 >= 2
Q0 == 0 && Q1 >= 4
Q0 == 1 && Q1 > 1 && Q2 >=2
...

If you look at this, you can see that if you write a simple mathematical expression

W = N * Q0 + (N - 1) * Q1 + ... + Q(N-1)

and the result is W > 2 * N, then you need to increase the RHS criterion by one bit to (n-m+1). At this point, the operation is safe as long as W < 4; if that doesn't work, increase the criterion one more, etc.

I think that following the above will get you a long way to your answer...

Setting a bit

Use the bitwise OR operator (|) to set a bit.

number |= 1UL << n;

That will set the nth bit of number. n should be zero, if you want to set the 1st bit and so on upto n-1, if you want to set the nth bit.

Use 1ULL if number is wider than unsigned long; promotion of 1UL << n doesn't happen until after evaluating 1UL << n where it's undefined behaviour to shift by more than the width of a long. The same applies to all the rest of the examples.

Clearing a bit

Use the bitwise AND operator (&) to clear a bit.

number &= ~(1UL << n);

That will clear the nth bit of number. You must invert the bit string with the bitwise NOT operator (~), then AND it.

Toggling a bit

The XOR operator (^) can be used to toggle a bit.

number ^= 1UL << n;

That will toggle the nth bit of number.

Checking a bit

You didn't ask for this, but I might as well add it.

To check a bit, shift the number n to the right, then bitwise AND it:

bit = (number >> n) & 1U;

That will put the value of the nth bit of number into the variable bit.

Changing the nth bit to x

Setting the nth bit to either 1 or 0 can be achieved with the following on a 2's complement C++ implementation:

number ^= (-x ^ number) & (1UL << n);

Bit n will be set if x is 1, and cleared if x is 0. If x has some other value, you get garbage. x = !!x will booleanize it to 0 or 1.

To make this independent of 2's complement negation behaviour (where -1 has all bits set, unlike on a 1's complement or sign/magnitude C++ implementation), use unsigned negation.

number ^= (-(unsigned long)x ^ number) & (1UL << n);

unsigned long newbit = !!x;    // Also booleanize to force 0 or 1
number ^= (-newbit ^ number) & (1UL << n);

It's generally a good idea to use unsigned types for portable bit manipulation.

number = (number & ~(1UL << n)) | (x << n);

(number & ~(1UL << n)) will clear the nth bit and (x << n) will set the nth bit to x.

It's also generally a good idea to not to copy/paste code in general and so many people use preprocessor macros (like the community wiki answer further down) or some sort of encapsulation.

How to count the number of set bits in a 32-bit integer

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86's popcnt (on CPUs where it's supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed - hardware popcount is normally fast if it exists at all.).

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Your compiler may know how to do something that's good for the specific CPU you're compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler's choice of fallback for target CPUs that don't have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.

Portable algorithms that don't need (or benefit from) any HW support

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.

If you know that your bytes will be mostly 0's or mostly 1's then there are efficient algorithms for these scenarios, e.g. clearing the lowest set with a bithack in a loop until it becomes zero.

I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

GCC10 and clang 10.0 can recognize this pattern / idiom and compile it to a hardware popcnt or equivalent instruction when available, giving you the best of both worlds. (https://godbolt.org/z/qGdh1dvKK)

int numberOfSetBits(uint32_t i)
{
     // Java: use int, and use >>> instead of >>. Or use Integer.bitCount()
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
     i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
     return (i * 0x01010101) >> 24;          // horizontal sum of bytes
}

For JavaScript: coerce to integer with |0 for performance: change the first line to i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn't get any faster with "simple" inputs, but it's still pretty decent.)

References:

How this SWAR bithack works:

i = i - ((i >> 1) & 0x55555555);

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like (i & 0x55555555) + ((i>>1) & 0x55555555).

The next step takes the odd/even eight of those 16x 2-bit accumulators and adds again, producing 8x 4-bit sums. The i - ... optimization isn't possible this time so it does just mask before / after shifting. Using the same 0x33... constant both times instead of 0xccc... before shifting is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4, if all 4 bits of the corresponding input bits were set. 4+4 = 8 which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).

So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:

Once we have few enough "elements", a multiply with a magic constant can sum all the elements into the top element. In this case byte elements. Multiply is done by left-shifting and adding, so a multiply of x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and holding small enough counts) that this doesn't produce carry into that top 8 bits.

A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn't take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn't enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.

With full SIMD for wider vectors (e.g. counting a whole array)

This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

https://github.com/WojciechMula/sse-popcount state-of-the-art x86 SIMD popcount for SSSE3, AVX2, AVX512BW, AVX512VBMI, or AVX512 VPOPCNT. Using Harley-Seal across vectors to defer popcount within an element. (Also ARM NEON)
Counting 1 bits (population count) on large data using AVX-512 or AVX-2
related: https://github.com/mklarqvist/positional-popcount - separate counts for each bit-position of multiple 8, 16, 32, or 64-bit integers. (Again, x86 SIMD including AVX-512 which is really good at this, with vpternlogd making Harley-Seal very good.)