For 32-bit floating point, the maximum value is shown in Table III:
approximately 0.99999988 x 2^127, represented in hex as: mantissa=7FFFFF, exponent=7F.
We can decompose the mantissa/exponent into a (close) decimal value as follows:
7FFFFF <base-16> = 8,388,607 <base-10>.
There are 23 bits of significance, so we divide 8,388,607 by 2^23.
8,388,607 / 2^23 = 0.99999988079071044921875 (see Table III)
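If you want to check that figure on a machine, here is a minimal Python sketch (Python chosen only because its decimal module gives exact results here; the division terminates after 23 decimal digits):

```python
from decimal import Decimal, getcontext

# 8,388,607 / 2^23 has a terminating decimal expansion, so this is exact
getcontext().prec = 30
print(Decimal(0x7FFFFF) / (Decimal(2) ** 23))
# 0.99999988079071044921875
```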
As for the exponent:
7F <base-16> = 127 <base-10>
Now we multiply the mantissa by 2^127 (the exponent):
8,388,607 / 2^23 * 2^127 =
8,388,607 * 2^104 = 1.7014116317805962808001687976863 * 10^38
This is the largest 32-bit floating point value because the largest mantissa is used and the largest exponent.
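As a sanity check, the same number falls out of exact integer arithmetic; a one-line Python sketch (again, Python just for its arbitrary-precision integers):

```python
# Largest 32-bit value: full 23-bit mantissa, max exponent
# (2^23 - 1) * 2^(127 - 23) = 8,388,607 * 2^104
max32 = (2**23 - 1) * 2**104
print(max32)
# 170141163178059628080016879768632819712, i.e. ~1.7014116e38
```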
The 48-bit floating point format adds 16 bits of lesser significance to the mantissa but leaves the exponent the same size. Thus, the max value would be represented in hex as
mantissa=7FFFFFFFFF, exponent=7F.
Again, we can compute:
7FFFFFFFFF <base-16> = 549,755,813,887 <base-10>
The max exponent is still 127, but now there are 23+16=39 mantissa bits, so we divide by 2^39.
Since 127-39=88, we can simply multiply by 2^88:
549,755,813,887 * 2^88 =
1.7014118346015974672186595864716 * 10^38
This is the largest 48-bit floating point value because we used the largest possible mantissa and largest possible exponent.
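The same Python one-liner verifies the 48-bit result:

```python
# Largest 48-bit value: full 39-bit mantissa, max exponent
# (2^39 - 1) * 2^(127 - 39) = 549,755,813,887 * 2^88
max48 = (2**39 - 1) * 2**88
print(max48)
# 170141183460159746721865958647159324672, i.e. ~1.7014118e38
```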
So, the max values are:
1.7014116317805962808001687976863 * 10^38, for 32-bit, and,
1.7014118346015974672186595864716 * 10^38, for 48-bit
The max value for 48-bit is just slightly larger than for 32-bit, which stands to reason since a few bits are added to the end of the mantissa.
(To be exact, the maximum number for the 48-bit format can be expressed as a binary number that consists of 39 1's followed by 88 0's.)
(The smallest value is just the negative of this. The closest to zero without being zero can also easily be computed as above: use the smallest possible (positive) mantissa, 000001, and the smallest possible exponent, 80 in hex, or -128 in decimal.)
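For instance, a sketch of that closest-to-zero computation using exact rationals in Python (the 2^-128 exponent and 23-bit mantissa follow the format described above):

```python
from fractions import Fraction

# Smallest positive value: minimal mantissa (0x000001), minimal exponent (-128)
tiny = Fraction(0x000001, 2**23) * Fraction(1, 2**128)
print(tiny == Fraction(1, 2**151))  # True: the value is 2^-151
print(float(tiny))                  # ~3.5e-46 (still representable as a 64-bit double)
```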
FYI: Some floating point formats use an unrepresented hidden 1 bit in the mantissa (this allows for one extra bit of precision, as follows: the first binary digit of all numbers, except 0 or denormals, see below, is a 1; therefore we don't have to store that 1, and we gain an extra bit of precision). This particular format doesn't seem to do this.
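For contrast, IEEE 754 single precision does use the hidden bit, and a small Python sketch makes it visible (the bit layout here is IEEE 754's, not the format discussed above):

```python
import struct

# Decode 1.5 as an IEEE 754 single: sign(1) | exponent(8) | fraction(23)
bits = struct.unpack('>I', struct.pack('>f', 1.5))[0]
fraction = bits & 0x7FFFFF
exponent = ((bits >> 23) & 0xFF) - 127
# Stored fraction is 0x400000 (= .1 binary); the leading 1 is implied,
# so the value is (1 + 0.5) * 2^0 = 1.5
print(hex(fraction), exponent)   # 0x400000 0
```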
Other floating point formats allow a denormalized mantissa, which permits representing (positive) numbers smaller than the smallest exponent alone would allow, by trading bits of precision for additional (negative) powers of 2. This is easy to support if the format doesn't also have the hidden one bit, a bit harder if it does.
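Again for contrast, IEEE 754 doubles do support denormals, which is easy to poke at from Python:

```python
import sys

# Smallest *normal* double, then the smallest denormal: 52 fewer bits
# of precision, but 52 additional negative powers of 2 in range
print(sys.float_info.min)          # 2.2250738585072014e-308
print(sys.float_info.min / 2**52)  # 5e-324 (only one significant bit left)
```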
8,388,607 / 2^23 is the value you'd get with mantissa=0x7FFFFF and exponent=0x00. It is not the single-bit value but rather the value with a full mantissa and a neutral, or more specifically, a zero exponent.
The reason this value is not directly 8,388,607, and requires division by 2^23 (and hence is less than you might expect), is that the implied radix point is in front of the mantissa, rather than after it. So, think +/-.11111111111111111111111 (a sign bit, followed by a radix point, followed by twenty-three 1-bits) for the mantissa, and +/-1111111 (no radix point here, just an integer, in this case 127) for the exponent.
mantissa = 0x7FFFFF with exponent = 0x7F is the largest value which corresponds to 8388607 * 2 ^ 104, where the 104 comes from 127-23: again, subtracting 23 powers of two because the mantissa has the radix point at the beginning. If the radix point were at the end, then the largest value (0x7FFFFF,0x7F) would indeed be 8,388,607 * 2 ^ 127.
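A tiny Python sketch of the two radix-point conventions (the variable names are just illustrative):

```python
mantissa, exponent = 0x7FFFFF, 127

# Radix point in front of the mantissa (this format): divide by 2^23 first
front = mantissa * 2**(exponent - 23)   # = 8,388,607 * 2^104

# Radix point after the mantissa (hypothetical variant): no division
after = mantissa * 2**exponent          # = 8,388,607 * 2^127

print(after == front * 2**23)           # True: exactly 23 powers of 2 apart
```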
There are several possible ways to consider a single-bit value for the mantissa. One is mantissa=0x400000, and another is mantissa=0x000001. Without considering the radix point or the exponent, the former is 4,194,304 and the latter is 1. With a zero exponent, and considering the radix point, the former is 0.5 (decimal) and the latter is 0.00000011920928955078125. With a maximum (or minimum) exponent, we can compute max and min single-bit values.
(Note that the latter form, where the mantissa has leading zeros, would be considered denormalized in some number formats; its normalized representation would be 0x400000 with an exponent of -22, since 0.5 * 2^-22 = 2^-23.)
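The equivalence of the two representations is easy to confirm with exact rationals in Python (the -22 falls out of the radix-point-in-front convention used here):

```python
from fractions import Fraction

# Denormalized: mantissa 0x000001, exponent 0  ->  1/2^23
denorm = Fraction(0x000001, 2**23)

# Normalized: mantissa 0x400000 (0.1 in binary), exponent -22  ->  0.5 * 2^-22
norm = Fraction(0x400000, 2**23) / 2**22

print(denorm == norm)  # True: both are 2^-23
```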
Best Answer
From the comments we exchanged, it seems you are not just unfamiliar with the maths, but also with basic numeric computing discipline.
First, for god's sake, don't automatically pick an epsilon that makes your tests "pass". If you fudge the epsilons until the error is below epsilon, your tests don't test anything at all, and you may ignore really bad precision problems (perhaps even a wrong algorithm, if it happens to produce similar results in your test cases).
Instead, pick reasonable epsilons a priori and stick to them. If you get greater errors than you need or expect, that means you need to fix your code, not lower your expectations. Sadly, what expectations are "reasonable" depends on the application and the algorithm, but by default I'd expect, say, a relative error of less than 1e-12 if the floats are 64 bits.
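As a concrete (hypothetical) illustration in Python, with TOL and rel_error being names I've made up for the sketch:

```python
import math

TOL = 1e-12   # chosen a priori; do not adjust this to make tests pass

def rel_error(expected, actual):
    # Naive relative error; fine when `expected` is well away from zero
    return abs(actual - expected) / abs(expected)

# Hypothetical test: summing pi/1000 a thousand times should give pi back
computed = sum([math.pi / 1000] * 1000)
print(rel_error(math.pi, computed))        # tiny, well below TOL
assert rel_error(math.pi, computed) < TOL
```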
Second, the keyword is relative error. Absolute error is, as you've seen, very sensitive to the magnitude of the values, and thus often useless. Moreover, due to how floating point works, there is a fixed relative error that you can't beat unless you happen to produce the exact same bit pattern (for 64-bit floats, this is about 1e-16, i.e. 2^-53). Thus, if the numbers are large enough, you will find that your absolute error is either zero (unlikely) or quite large, even though the calculation may only be off by one unit in the last place (ULP).
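To see why absolute error misleads at large magnitudes, consider one ULP at 1e20 (a Python sketch; math.nextafter needs Python 3.9+):

```python
import math

a = 1e20
b = math.nextafter(a, math.inf)   # the very next representable double

print(abs(a - b))       # 16384.0 -- a "huge" absolute error...
print(abs(a - b) / a)   # ~1.6e-16 -- ...yet only one ULP of relative error
```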
For this, you should also force MATLAB to output more digits (17 is the maximum number that makes sense for 64-bit floats). Oh, and relative error calculations can be quite tricky, as @CodeInChaos also points out, so you may want to rely on existing algorithms that handle edge cases better than the naive approaches, such as http://floating-point-gui.de/errors/comparison/
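If it's useful, here is a rough Python port of the nearlyEqual routine described at that link (treat it as a sketch, not a drop-in library; the function name and default epsilon are mine):

```python
import sys

def nearly_equal(a, b, eps=1e-12):
    # Port of the comparison sketch from floating-point-gui.de
    if a == b:
        return True                     # handles exact matches and infinities
    diff = abs(a - b)
    norm = abs(a) + abs(b)
    if a == 0.0 or b == 0.0 or norm < sys.float_info.min:
        # a or b is zero, or both are extremely close to it:
        # relative error is meaningless here, so fall back to absolute
        return diff < eps * sys.float_info.min
    # General case: relative error, guarding against overflow in the sum
    return diff / min(norm, sys.float_info.max) < eps
```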