Floating point rounding when truncating

cfloating pointfpux86

This is probably a question for an x86 FPU expert:

I am trying to write a function which generates a random floating point value in the range [min,max]. The problem is that my generator algorithm (the floating-point Mersenne Twister, if you're curious) only returns values in the range [1,2) – ie, I want an inclusive upper bound, but my "source" generated value is from an exclusive upper bound. The catch here is that the underlying generator returns an 8-byte double, but I only want a 4-byte float, and I am using the default FPU rounding mode of Nearest.

What I want to know is whether the truncation itself in this case will result in my return value being inclusive of max when the FPU internal 80-bit value is sufficiently close, or whether I should increment the significand of my max value before multiplying it by the intermediary random in [1,2), or whether I should change FPU modes. Or any other ideas, of course.

Here's the code I am currently using, and I did verify that 1.0f resolves to 0x3f800000:

float MersenneFloat( float min, float max )
{
    //genrand returns a double in [1,2)
    const float random = (float)genrand_close1_open2(); 
    //return in desired range
    return min + ( random - 1.0f ) * (max - min);
}

If it makes a difference, this needs to work on both Win32 MSVC++ and Linux gcc. Also, will using any versions of the SSE optimizations change the answer to this?

Edit: The answer is yes, truncation in this case from double to float is sufficient to cause the result to be inclusive of max. See Crashworks' answer for more.

Best Answer

The SSE ops will subtly change the behavior of this algorithm because they don't have the intermediate 80-bit representation -- the math truly is done in 32 or 64 bits. The good news is that you can easily test it and see if it changes your results by simply specifying the /ARCH:SSE2 command line option to MSVC, which will cause it to use the SSE scalar ops instead of x87 FPU instructions for ordinary floating point math.

I'm not sure offhand of what the exact rounding behavior is around the integer boundaries, but you can test to see what'll happen when 1.999.. gets rounded from 64 to 32 bits by eg

static uint64 OnePointNineRepeating = 0x3FF FFFFF FFFF FFFF // exponent 0 (biased to 1023), all 1 bits in mantissa
double asDouble = *(double *)(&OnePointNineRepeating);
float asFloat = asDouble;
return asFloat;

Edit, result: original poster ran this test and found that with truncation, the 1.99999 will round up to 2 both with and without /arch:SSE2.