Normalized and denormalized floating point numbers

Regarding the normalized and denormalized representations of binary floating-point numbers (as described in the book by Patterson), I want to know how a denormalized number is actually represented.

See the table below, which says that \$e=0\$, \$f\neq0\$ denotes a denormalized number.

Floating Point Numbers - Special Numbers in IEEE 754 Standard:

    Exponent          Fraction   Represents
    e = 0             f = 0      +/- 0
    e = 0             f != 0     denormalized number
    0 < e < max       any        normalized number
    e = max (all 1s)  f = 0      +/- infinity
    e = max (all 1s)  f != 0     NaN

Consider the number \$1.10\cdot2^0\$. Isn't that normalized?

Best Answer

What it means to be normalized depends on the particular floating point format. Some formats have no way of expressing unnormalized values.

Decimal example

I'll illustrate normalization using decimal. Suppose you store floating point values as 6 signed digits with a signed 2-digit power-of-10 exponent. For example, 123456 07 means \$123456\times10^7\$. The 6 digits are called the mantissa, and the 2 digits the exponent.
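As a quick sketch in Python (the function name is made up for illustration), the stored pair simply means the mantissa times a power of 10:

    def value(mantissa, exponent):
        # The encoded number is mantissa x 10^exponent.
        return mantissa * 10 ** exponent

    print(value(123456, 7))   # -> 1234560000000, i.e. 123456 07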

To get the most precision, you use the minimum exponent such that the number still fits into the 6 digits. Another way of saying this is that you shift the mantissa left, adjusting the exponent to compensate, until the left-most digit is not zero (without shifting so far that digits fall off the left end). For example, if you were trying to represent 12.34, then you'd encode it as 123400 -04. This is called "normalized". In this case, since the lower two digits are zero, you could have expressed the value equivalently as 012340 -03 or 001234 -02. That would be called "denormalized".
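Here is a minimal Python sketch of that normalization step, assuming the 6-digit mantissa is held as a plain non-negative integer; the function name is made up for illustration:

    def normalize6(mantissa, exponent):
        # Shift the mantissa left until its leading (6th) digit is
        # non-zero, decrementing the exponent to compensate.
        if mantissa == 0:
            return 0, 0            # zero has no normalized form
        while mantissa < 100000:   # leading digit is still zero
            mantissa *= 10
            exponent -= 1
        return mantissa, exponent

    # 12.34 stored denormalized as 001234 -02:
    print(normalize6(1234, -2))    # -> (123400, -4), i.e. 123400 -04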

In general, you want all the numbers to be normalized because it maximizes the precision. It can also simplify computations and minimize roundoff loss in intermediate calculations if you know where the left-most non-zero digit is.

Sometimes the term "normalized" is used a little differently when performing operations on these numbers. For example, when adding you have to "normalize" both numbers to the same exponent, which will be the larger of the two. That causes the digits of both numbers to line up so that you can add them properly. It may cause digits to be lost off the right end of the smaller number. That's life with floating point numbers, although hardware often carries intermediate calculations to more digits to minimize accumulated roundoff.
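Here is a sketch of that alignment step under the same assumptions as before; note how digits shifted off the right end of the smaller number are simply lost:

    def add6(m1, e1, m2, e2):
        # Add two (mantissa, exponent) pairs by aligning both to the
        # larger exponent first.
        if e1 < e2:
            (m1, e1), (m2, e2) = (m2, e2), (m1, e1)
        m2 //= 10 ** (e1 - e2)     # shift smaller number right, losing digits
        return m1 + m2, e1         # sum may need re-normalizing afterward

    # 12.34 + 0.5678, stored as 123400 -04 and 567800 -06:
    print(add6(123400, -4, 567800, -6))   # -> (129078, -4), i.e. 12.9078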

Binary

Most hardware implementations use binary because this makes the hardware simpler. As before, normalized means that the left-most mantissa digit must not be 0. Since it can only be 0 or 1, that means it must be 1. If you're clever, you're now thinking "But if it's always 1, then what's the point of actually storing it?". Good question. You can get away without storing it. This is often done and is called the vestigial one (also known as the hidden or implicit leading bit).

Let's use an 8-bit mantissa and 4-bit exponent format as an example. Say you want to represent 7.5, which is 111.1 in binary. To normalize this into 8 bits with vestigial one, we shift the value left until the first 1 falls off the end: (1)11100000. This needs to be shifted right 6 bits to recover the original value, so the exponent is -6, which is 1010 in 4-bit two's complement binary. The final floating point binary representation of 7.5 is therefore 11100000 1010.
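Here is a Python sketch of that encoding, assuming the input is positive and exactly representable so no rounding is needed; the function name is made up for illustration:

    def encode_tc(value):
        # Shift until the leading 1 sits in the vestigial position,
        # i.e. the value is a 9-bit integer in [256, 511].
        exponent = 0
        while value < 256:
            value *= 2
            exponent -= 1
        while value >= 512:
            value /= 2
            exponent += 1
        mantissa = int(value) - 256      # drop the vestigial one
        return mantissa, exponent & 0xF  # 4-bit two's complement exponent

    m, e = encode_tc(7.5)
    print(f"{m:08b} {e:04b}")            # -> 11100000 1010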

Note that when using vestigial one, it is not possible to have unnormalized numbers. Shifting the mantissa right one and incrementing the exponent by one to compensate doesn't work, because the vestigial one would then stand in for a bit that is supposed to be zero.

There are two more wrinkles often used in real binary floating point formats having to do with how the signs of the mantissa and exponent are handled.

For various internal reasons, it is often more convenient to store the mantissa in sign/magnitude format, as opposed to the common two's complement format used for integers. The absolute value is stored in the mantissa, and a separate bit indicates negative.

The exponent is often offset so that it is never negative. Instead of storing it in the common two's complement form, the half-way value is added to the exponent, and the result is stored as an unsigned integer. In the case of a 4-bit exponent as in the example above, you would add 7 to the actual exponent to get the 0-15 unsigned raw exponent value. This is called "excess-7" notation. For example, if the actual exponent was -3, you'd add 7 to that to get 4, so the final 4-bit exponent would be 0100.
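In code, the excess-7 conversion is just an add on the way in and a subtract on the way out; a two-line sketch:

    def to_excess7(exponent):
        return exponent + 7        # maps -7..8 onto 0..15

    def from_excess7(raw):
        return raw - 7

    print(f"{to_excess7(-3):04b}") # -> 0100, as in the example above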

It turns out that if you store the sign, the exponent in excess notation, and the mantissa with vestigial one, in order from most to least significant, then non-negative values sort correctly when the resulting bit strings are compared as plain unsigned integers. (Negative values come out in reverse order among themselves, since this layout is effectively sign/magnitude rather than two's complement.)

Binary example

Here is an example to illustrate the above. We'll use a binary floating point format with a sign bit, a 4-bit exponent in excess-7 notation, and an 8-bit mantissa with vestigial one. These fields are in most to least significant bit order. The binary point is assumed to be immediately to the left of the mantissa, which is also immediately to the right of the vestigial 1.

Let's figure out the encoding for Pi, which is 11.0010010000111111... in binary. The mantissa will store the part in brackets: 1[1.0010010]000111111. When the binary point is moved just to the left of the mantissa, the result needs to be multiplied by \$2^1\$ to recover the original value. The exponent is therefore 1. In excess-7 notation, the exponent field will have a value of 8, which is 1000 in binary. Since the overall number is not negative, the sign bit is 0. Putting this all together yields:

  0 1000 10010010

Let's convert this back to decimal to verify it and see the level of rounding. Restoring the vestigial one, the mantissa represents 1.10010010. To make the arithmetic simpler, we can say this is \$110010010\times2^{-8}\$. The exponent field is 1000, which is 8 in decimal. Converting from excess-7 notation by subtracting 7 yields 1, meaning the mantissa is multiplied by \$2^1\$. The mantissa and exponent together therefore represent \$110010010\times2^{-7}\$ = 402 / 128 = 3.140625. Since the sign bit is 0, we don't negate this. The overall value is 3.140625, which is as close to Pi as this floating point representation can get.
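To check this arithmetic end to end, here is a Python sketch of an encoder and decoder for this toy format; the function names are made up, rounding is by truncation as in the worked example, and zero is not handled:

    import math

    def encode(value):
        # Sign bit, 4-bit excess-7 exponent, 8-bit mantissa with
        # vestigial one, in most to least significant order.
        sign = "1" if value < 0 else "0"
        value = abs(value)                 # zero would loop forever here
        exponent = 0
        while value < 1.0:                 # normalize to 1 <= value < 2
            value *= 2
            exponent -= 1
        while value >= 2.0:
            value /= 2
            exponent += 1
        mantissa = int(value * 256) - 256  # keep 8 bits, drop vestigial one
        return f"{sign} {exponent + 7:04b} {mantissa:08b}"

    def decode(bits):
        sign, raw_exp, mantissa = bits.split()
        exponent = int(raw_exp, 2) - 7     # undo excess-7
        # Restore the vestigial one: the field encodes 1.mmmmmmmm.
        value = (256 + int(mantissa, 2)) * 2.0 ** (exponent - 8)
        return -value if sign == "1" else value

    print(encode(math.pi))             # -> 0 1000 10010010
    print(decode("0 1000 10010010"))   # -> 3.140625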