For 32-bit floating point, the maximum value is shown in Table III:
approximately 0.99999988 x 2^127, represented in hex as: mantissa=7FFFFF, exponent=7F.
We can decompose the mantissa/exponent into a (close) decimal value as follows:
7FFFFF <base-16> = 8,388,607 <base-10>.
There are 23 bits of significance, so we divide 8,388,607 by 2^23.
8,388,607 / 2^23 = 0.99999988079071044921875 (see Table III)
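If you want to check that figure on a machine, here is a minimal Python sketch (Python chosen only because its decimal module gives exact results here; the division terminates after 23 decimal digits):

```python
from decimal import Decimal, getcontext

# 8,388,607 / 2^23 has a terminating decimal expansion, so this is exact
getcontext().prec = 30
print(Decimal(0x7FFFFF) / (Decimal(2) ** 23))
# 0.99999988079071044921875
```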
As for the exponent:
7F <base-16> = 127 <base-10>
Now we multiply the mantissa by 2^127 (the exponent):
8,388,607 / 2^23 * 2^127 =
8,388,607 * 2^104 = 1.7014116317805962808001687976863 * 10^38
This is the largest 32-bit floating point value because the largest mantissa is used and the largest exponent.
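As a sanity check, the same number falls out of exact integer arithmetic; a one-line Python sketch (again, Python just for its arbitrary-precision integers):

```python
# Largest 32-bit value: full 23-bit mantissa, max exponent
# (2^23 - 1) * 2^(127 - 23) = 8,388,607 * 2^104
max32 = (2**23 - 1) * 2**104
print(max32)
# 170141163178059628080016879768632819712, i.e. ~1.7014116e38
```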
The 48-bit floating point format adds 16 bits of lesser significance to the mantissa but leaves the exponent the same size. Thus, the max value would be represented in hex as
mantissa=7FFFFFFFFF, exponent=7F.
Again, we can compute:
7FFFFFFFFF <base-16> = 549,755,813,887 <base-10>
The max exponent is still 127, but now there are 23+16=39 mantissa bits, so we divide by 2^39.
Since 127-39=88, we can simply multiply by 2^88:
549,755,813,887 * 2^88 =
1.7014118346015974672186595864716 * 10^38
This is the largest 48-bit floating point value because we used the largest possible mantissa and largest possible exponent.
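The same Python one-liner verifies the 48-bit result:

```python
# Largest 48-bit value: full 39-bit mantissa, max exponent
# (2^39 - 1) * 2^(127 - 39) = 549,755,813,887 * 2^88
max48 = (2**39 - 1) * 2**88
print(max48)
# 170141183460159746721865958647159324672, i.e. ~1.7014118e38
```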
So, the max values are:
1.7014116317805962808001687976863 * 10^38, for 32-bit, and,
1.7014118346015974672186595864716 * 10^38, for 48-bit
The max value for 48-bit is just slightly larger than for 32-bit, which stands to reason since a few bits are added to the end of the mantissa.
(To be exact, the maximum number for the 48-bit format can be expressed as a binary number that consists of 39 1's followed by 88 0's.)
(The smallest value is just the negative of this. The closest to zero without being zero can also easily be computed as above: use the smallest possible (positive) mantissa, 000001, and the smallest possible exponent, 80 in hex, or -128 in decimal.)
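For instance, a sketch of that closest-to-zero computation using exact rationals in Python (the 2^-128 exponent and 23-bit mantissa follow the format described above):

```python
from fractions import Fraction

# Smallest positive value: minimal mantissa (0x000001), minimal exponent (-128)
tiny = Fraction(0x000001, 2**23) * Fraction(1, 2**128)
print(tiny == Fraction(1, 2**151))  # True: the value is 2^-151
print(float(tiny))                  # ~3.5e-46 (still representable as a 64-bit double)
```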
FYI: Some floating point formats use an unrepresented hidden 1 bit in the mantissa (this allows for one extra bit of precision, as follows: the first binary digit of all numbers, except 0 or denormals, see below, is a 1; therefore we don't have to store that 1, and we gain an extra bit of precision). This particular format doesn't seem to do this.
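For contrast, IEEE 754 single precision does use the hidden bit, and a small Python sketch makes it visible (the bit layout here is IEEE 754's, not the format discussed above):

```python
import struct

# Decode 1.5 as an IEEE 754 single: sign(1) | exponent(8) | fraction(23)
bits = struct.unpack('>I', struct.pack('>f', 1.5))[0]
fraction = bits & 0x7FFFFF
exponent = ((bits >> 23) & 0xFF) - 127
# Stored fraction is 0x400000 (= .1 binary); the leading 1 is implied,
# so the value is (1 + 0.5) * 2^0 = 1.5
print(hex(fraction), exponent)   # 0x400000 0
```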
Other floating point formats allow a denormalized mantissa, which permits representing (positive) numbers smaller than the smallest exponent alone would allow, by trading bits of precision for additional (negative) powers of 2. This is easy to support if the format doesn't also have the hidden one bit, a bit harder if it does.
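Again for contrast, IEEE 754 doubles do support denormals, which is easy to poke at from Python:

```python
import sys

# Smallest *normal* double, then the smallest denormal: 52 fewer bits
# of precision, but 52 additional negative powers of 2 in range
print(sys.float_info.min)          # 2.2250738585072014e-308
print(sys.float_info.min / 2**52)  # 5e-324 (only one significant bit left)
```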
8,388,607 / 2^23 is the value you'd get with mantissa=0x7FFFFF and exponent=0x00. It is not the single-bit value but rather the value with a full mantissa and a neutral, or more specifically, a zero exponent.
The reason this value is not directly 8,388,607, and requires division by 2^23 (and hence is less than you might expect), is that the implied radix point is in front of the mantissa, rather than after it. So, think +/-.11111111111111111111111 (a sign bit, followed by a radix point, followed by twenty-three 1-bits) for the mantissa, and +/-1111111 (no radix point here, just an integer, in this case 127) for the exponent.
mantissa = 0x7FFFFF with exponent = 0x7F is the largest value which corresponds to 8388607 * 2 ^ 104, where the 104 comes from 127-23: again, subtracting 23 powers of two because the mantissa has the radix point at the beginning. If the radix point were at the end, then the largest value (0x7FFFFF,0x7F) would indeed be 8,388,607 * 2 ^ 127.
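A tiny Python sketch of the two radix-point conventions (the variable names are just illustrative):

```python
mantissa, exponent = 0x7FFFFF, 127

# Radix point in front of the mantissa (this format): divide by 2^23 first
front = mantissa * 2**(exponent - 23)   # = 8,388,607 * 2^104

# Radix point after the mantissa (hypothetical variant): no division
after = mantissa * 2**exponent          # = 8,388,607 * 2^127

print(after == front * 2**23)           # True: exactly 23 powers of 2 apart
```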
There are several possible ways to consider a single-bit value for the mantissa. One is mantissa=0x400000, and another is mantissa=0x000001. Without considering the radix point or the exponent, the former is 4,194,304 and the latter is 1. With a zero exponent, and considering the radix point, the former is 0.5 (decimal) and the latter is 0.00000011920928955078125. With a maximum (or minimum) exponent, we can compute max and min single-bit values.
(Note that the latter form, where the mantissa has leading zeros, would be considered denormalized in some number formats; its normalized representation would be 0x400000 with an exponent of -22, since 0.5 * 2^-22 = 2^-23.)
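The equivalence of the two representations is easy to confirm with exact rationals in Python (the -22 falls out of the radix-point-in-front convention used here):

```python
from fractions import Fraction

# Denormalized: mantissa 0x000001, exponent 0  ->  1/2^23
denorm = Fraction(0x000001, 2**23)

# Normalized: mantissa 0x400000 (0.1 in binary), exponent -22  ->  0.5 * 2^-22
norm = Fraction(0x400000, 2**23) / 2**22

print(denorm == norm)  # True: both are 2^-23
```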
Best Answer
From the comments we exchanged, it seems you are not just unfamiliar with the maths, but also with basic numeric computing discipline.
First, for god's sake, don't automatically pick an epsilon that makes your tests "pass". If you fudge the epsilons until the error is below epsilon, your tests don't test anything at all, and you may ignore really bad precision problems (perhaps even a wrong algorithm, if it happens to produce similar results in your test cases).
Instead, pick reasonable epsilons a priori and stick to them. If you get greater errors than you need or expect, that means you need to fix your code, not lower your expectations. Sadly, what expectations are "reasonable" depends on the application and the algorithm, but by default I'd expect, say, a relative error of less than 1e-12 if the floats are 64 bits.
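As a concrete (hypothetical) illustration in Python, with TOL and rel_error being names I've made up for the sketch:

```python
import math

TOL = 1e-12   # chosen a priori; do not adjust this to make tests pass

def rel_error(expected, actual):
    # Naive relative error; fine when `expected` is well away from zero
    return abs(actual - expected) / abs(expected)

# Hypothetical test: summing pi/1000 a thousand times should give pi back
computed = sum([math.pi / 1000] * 1000)
print(rel_error(math.pi, computed))        # tiny, well below TOL
assert rel_error(math.pi, computed) < TOL
```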
Second, the keyword is relative error. Absolute error is, as you've seen, very sensitive to the magnitude of the values, and thus often useless. Moreover, due to how floating point works, there is a fixed relative error that you can't beat unless you happen to produce the exact same bit pattern (for 64-bit floats, this is about 1e-16, i.e. 2^-53). Thus, if the numbers are large enough, you will find that your absolute error is either zero (unlikely) or quite large, even though the calculation may only be off by one unit in the last place (ULP).
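To see why absolute error misleads at large magnitudes, consider one ULP at 1e20 (a Python sketch; math.nextafter needs Python 3.9+):

```python
import math

a = 1e20
b = math.nextafter(a, math.inf)   # the very next representable double

print(abs(a - b))       # 16384.0 -- a "huge" absolute error...
print(abs(a - b) / a)   # ~1.6e-16 -- ...yet only one ULP of relative error
```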
For this, you should also force MATLAB to output more digits (17 is the maximum number that makes sense for 64-bit floats). Oh, and relative error calculations can be quite tricky, as @CodeInChaos also points out, so you may want to rely on existing algorithms that handle edge cases better than the naive approaches, such as http://floating-point-gui.de/errors/comparison/
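If it's useful, here is a rough Python port of the nearlyEqual routine described at that link (treat it as a sketch, not a drop-in library; the function name and default epsilon are mine):

```python
import sys

def nearly_equal(a, b, eps=1e-12):
    # Port of the comparison sketch from floating-point-gui.de
    if a == b:
        return True                     # handles exact matches and infinities
    diff = abs(a - b)
    norm = abs(a) + abs(b)
    if a == 0.0 or b == 0.0 or norm < sys.float_info.min:
        # a or b is zero, or both are extremely close to it:
        # relative error is meaningless here, so fall back to absolute
        return diff < eps * sys.float_info.min
    # General case: relative error, guarding against overflow in the sum
    return diff / min(norm, sys.float_info.max) < eps
```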