Electronic – How we got exponent as $2^{127-64}$

floating point

Information: Consider a 16-bit register of the following format used to store a floating-point number. The mantissa (M) is a normalized signed-magnitude fraction, the exponent (E) is expressed in excess-64 form, and the base of the system is 2.

If we work out the bit allocation, the sign gets 1 bit, the exponent is allotted 7 bits, and the mantissa is allotted 8 bits.

Therefore, the largest number that can be represented in this format is as follows:

| 0 | 1 1 1 1 1 1 1 | 1 1 1 1 1 1 1 1 |

i.e. filling every exponent and mantissa bit with 1; and since we are looking for the largest number, the sign bit is 0.

What is the value of the largest number that can be represented in base 10?

We will use the following formula: \$(-1)^S \times 1.M \times 2^{E-B}\$, i.e. implicit normalization with biasing.
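That is, substituting the all-ones pattern: the 7-bit exponent field read as an unsigned integer is \$E = 1111111_2 = 127\$, the bias is \$B = 64\$, and the mantissa is \$1.11111111_2\$, so the value is \$(-1)^0 \times 1.11111111_2 \times 2^{127-64} = (2 - 2^{-8}) \times 2^{63}\$.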

I don't understand the exponent part of the number

How did we get the exponent as \$2^{127-64}\$? Why are we subtracting the bias 64 from the exponent 127?

Can someone explain, with a proper derivation, how we arrived at \$2^{127-64}\$? Please explain it as if to a naive person.

I am missing something very obvious!

Waiting for explanation!

Best Answer

The exponent field is 7 bits wide, so it stores an unsigned value E from 0 to 127. In excess-64 form the true exponent is E − 64, which maps the stored codes 0..127 onto true exponents −64..+63; that is why the largest number's scale factor is \$2^{127-64} = 2^{63}\$. The exponent is biased this way so that the format can also represent fractional numbers between 0 and 1. It turns out that values between 0 and 1 are quite important in most floating-point calculations, more important than representing bigger magnitudes, so sacrificing half of the upward exponent range is a reasonable trade-off.

But there's another, more important reason for using bias (as opposed to 2’s complement) that I'll get to later, a reason that goes back to the very beginnings of floating point.

Anyway, in this format you basically have these key values and ranges:

 - zero                                 (sign, exp 0x00-64, mant 0.0x00)
 - denormals                            (sign, exp 0x00-64, mant 0.0x01 ~ 0.0xff)
 - smallest normalized less than one    (sign, exp 0x01-64, mant 1.0x00)
 - largest normalized less than one     (sign, exp 0x3f-64, mant 1.0xff)
 - one                                  (sign, exp 0x40-64, mant 1.0x00)
 - smallest normalized greater than one (sign, exp 0x40-64, mant 1.0x01)
 - largest normalized greater than one  (sign, exp 0x7f-64, mant 1.0xff)
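To make those entries concrete, here is a minimal Python sketch of a decoder for this 16-bit format. The function name `decode` is my own, and the denormal scale of \$2^{0-64}\$ follows the table above, not any official spec:

```python
def decode(word):
    """Decode a 16-bit word: 1 sign bit, 7-bit excess-64 exponent,
    8-bit mantissa fraction. A sketch of the format above, not IEEE 754."""
    sign = (word >> 15) & 0x1
    exp = (word >> 8) & 0x7f     # unsigned exponent field, bias 64
    mant = word & 0xff           # fraction bits
    if exp == 0:                 # zero and denormals: no implicit leading 1
        value = (mant / 256) * 2.0 ** (0 - 64)
    else:                        # normalized: implicit leading 1
        value = (1 + mant / 256) * 2.0 ** (exp - 64)
    return -value if sign else value

print(decode(0x4000))  # one: exp 0x40, mant 0x00 -> 1.0
print(decode(0x7FFF))  # largest: 1.99609375 * 2**63, about 1.84e19
```

Plugging in the all-ones pattern from the question, the largest value comes out as \$(2 - 2^{-8}) \times 2^{63} \approx 1.84 \times 10^{19}\$.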

Some fine points:

  • For all the cases except zero and denormal, the mantissa value is 1.mant, which gives a range of 1 to just 1 mantissa LSB less than 2 (that is, 1 + 0/256 to 1 + 255/256).
  • Because of the way sign is handled, there are two representations of zero: +0 and -0.

This example format is similar to what IEEE 754 does. IEEE 754 also reserves special values for -infinity, +infinity, and not-a-number (NaN). Play around with it here: https://www.h-schmidt.net/FloatConverter/IEEE754.html


And now, the buried lede: Why use bias at all? Because it avoids needing to use 2's complement in the exponent, which would make simple greater- and less-than comparisons between float values harder.

With a bias, you can do a magnitude compare with a single integer compare of the combined exponent-and-mantissa fields (the sign bit is masked off and handled separately). That's not possible if 2's complement were used for the exponent, as negative exponents would look like large integer values to an integer compare, giving a wrong result.

In other words, a biased exponent makes the bit pattern an always-increasing integer as the value goes from zero toward positive infinity. (Try it in the converter linked above.)
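You can check that monotonicity on real IEEE 754 single-precision numbers from Python; this small sketch uses only the standard `struct` module:

```python
import struct

def bits(x):
    """Raw 32-bit pattern of an IEEE 754 single-precision float."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

# For non-negative floats, a larger value always has a larger raw bit
# pattern, because the biased exponent sits above the mantissa as a
# plain unsigned integer. This holds even across the denormal boundary
# (1e-40 is denormal in single precision).
values = [0.0, 1e-40, 0.5, 1.0, 1.5, 2.0, 1e20]
patterns = [bits(v) for v in values]
assert patterns == sorted(patterns)
```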

The side effect of using a bias is that it slightly complicates float-to-fixed and fixed-to-float conversion, but that is usually a rare operation and is in any event handled efficiently by the FPU.

And I mentioned a history of bias. The IBM 709 used biased exponents, way back in 1957, as did its predecessor, the 704, in 1954.