C++ – Is masking before unsigned left shift in C/C++ too paranoid?

c, integer-arithmetic, language-lawyer, undefined-behavior

This question is motivated by me implementing cryptographic algorithms (e.g. SHA-1) in C/C++, writing portable platform-agnostic code, and thoroughly avoiding undefined behavior.

Suppose that a standardized crypto algorithm asks you to implement this:

b = (a << 31) & 0xFFFFFFFF

where a and b are unsigned 32-bit integers. Notice that in the result, we discard any bits above the least significant 32 bits.


As a first naive approximation, we might assume that int is 32 bits wide on most platforms, so we would write:

unsigned int a = (...);
unsigned int b = a << 31;

We know this code won't work everywhere because int is 16 bits wide on some systems, 64 bits on others, and possibly even 36 bits. But using stdint.h, we can improve this code with the uint32_t type:

uint32_t a = (...);
uint32_t b = a << 31;

So we are done, right? That's what I thought for years. … Not quite. Suppose that on a certain platform, we have:

// stdint.h
typedef unsigned short uint32_t;

The integer promotion rule in C/C++ is that if an operand's type (such as short) has lower rank than int, it is promoted to int if int can represent all values of the original type, or to unsigned int otherwise.

Let's say that the compiler defines short as 32 bits (signed) and int as 48 bits (signed). Then these lines of code:

uint32_t a = (...);
uint32_t b = a << 31;

will effectively mean:

unsigned short a = (...);
unsigned short b = (unsigned short)((int)a << 31);

Note that a is promoted to int because all of ushort (i.e. uint32) fits into int (i.e. int48).

But now we have a problem: shifting non-zero bits left into the sign bit of a signed integer type is undefined behavior. This problem happened because our uint32 was promoted to int48 – instead of being promoted to uint48 (where left-shifting would be okay).
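
For concreteness, here is a minimal fragment showing where the undefined behavior would occur on that hypothetical platform (the input value 0x00010000 is just an illustrative choice of mine):

uint32_t a = 0x00010000;  // bit 16 set
uint32_t b = a << 31;     // a is promoted to the signed 48-bit int; bit 16
                          // would land on bit 47, the sign bit, so this
                          // shift is undefined behavior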


Here are my questions:

  1. Is my reasoning correct, and is this a legitimate problem in theory?

  2. Is this problem safe to ignore because on every platform the next integer type is double the width?

  3. Is it a good idea to correctly defend against this pathological situation by pre-masking the input like this: b = (a & 1) << 31;? (This will necessarily be correct on every platform, but it could make a speed-critical crypto algorithm slower than necessary. A generalized sketch of this pre-masking follows this list.)
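
For what it's worth, the pre-mask generalizes to an arbitrary shift count roughly like this (my own sketch; the helper name shl32 and the variable shift count are illustrative, not part of any standardized algorithm):

#include <stdint.h>

// Pre-mask so that no set bit can reach the sign bit of whatever signed
// type 'a' might be promoted to. Assumes 0 <= s <= 31.
static uint32_t shl32(uint32_t a, unsigned s) {
    return (uint32_t)((a & (0xFFFFFFFFu >> s)) << s);
}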

Clarifications/edits:

  • I'll accept answers for C or C++ or both. I want to know the answer for at least one of the languages.

  • The pre-masking logic may hurt bit rotation. For example, GCC will compile b = (a << 31) | (a >> 1); to a 32-bit rotate instruction in assembly language. But if we pre-mask the left shift, the new logic might not be translated into a bit rotation, which means 4 operations are performed instead of 1. (Both variants are sketched just after this list.)
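
For reference, here are the two variants side by side (a sketch only; whether a particular compiler still recognizes the masked form as a rotate would have to be checked against that compiler's output):

#include <stdint.h>

// Plain rotate-left-by-31 idiom; this is the form GCC turns into a rotate
// instruction, but it is also the form with the theoretical promotion
// problem described above.
static uint32_t rotl31_plain(uint32_t a) {
    return (a << 31) | (a >> 1);
}

// Pre-masked variant: free of the promotion problem, but the extra AND may
// cost the single-instruction rotation.
static uint32_t rotl31_masked(uint32_t a) {
    return ((a & 1) << 31) | (a >> 1);
}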

Best Answer

Speaking to the C side of the problem,

  1. Is my reasoning correct, and is this a legitimate problem in theory?

It is a problem that I had not considered before, but I agree with your analysis. C defines the behavior of the << operator in terms of the type of the promoted left operand, and it is conceivable that the integer promotions result in that being (signed) int when the original type of that operand is uint32_t. I don't expect to see that in practice on any modern machine, but I'm all for programming to the actual standard as opposed to my personal expectations.

  2. Is this problem safe to ignore because on every platform the next integer type is double the width?

C does not require such a relationship between integer types, though it is ubiquitous in practice. If you are determined to rely only on the standard, however -- that is, if you are taking pains to write strictly conforming code -- then you cannot rely on such a relationship.

  3. Is it a good idea to correctly defend against this pathological situation by pre-masking the input like this: b = (a & 1) << 31;? (This will necessarily be correct on every platform, but it could make a speed-critical crypto algorithm slower than necessary.)

The type unsigned long is guaranteed to have at least 32 value bits, and it is not subject to promotion to any other type under the integer promotions. On many common platforms it has exactly the same representation as uint32_t, and may even be the same type. Thus, I would be inclined to write the expression like this:

uint32_t a = (...);
uint32_t b = (unsigned long) a << 31;

Or if you need a only as an intermediate value in the computation of b, then declare it as an unsigned long to begin with.
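
If the goal is ultimately a rotation, the same cast can be folded into the rotate idiom from the question; a minimal sketch (the helper name rotl31 is mine, not part of the answer above):

#include <stdint.h>

// Rotate left by 31, with the left operand widened to unsigned long so the
// shift is always performed in an unsigned type.
static uint32_t rotl31(uint32_t a) {
    return (uint32_t)(((unsigned long) a << 31) | (a >> 1));
}

Whether a given compiler still reduces this to a single rotate instruction is worth verifying with the optimizer you actually use.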