Binary floating point math is like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1
, which is 1/10
) whose denominator is not a power of two cannot be exactly represented.
For 0.1
in the standard binary64
format, the representation can be written exactly as
0.1000000000000000055511151231257827021181583404541015625
in decimal, or
0x1.999999999999ap-4
in C99 hexfloat notation.
In contrast, the rational number 0.1
, which is 1/10
, can be written exactly as
0.1
in decimal, or
0x1.99999999999999...p-4
in an analogue of C99 hexfloat notation, where the ...
represents an unending sequence of 9's.
The constants 0.2
and 0.3
in your program will also be approximations to their true values. It happens that the closest double
to 0.2
is larger than the rational number 0.2
but that the closest double
to 0.3
is smaller than the rational number 0.3
. The sum of 0.1
and 0.2
winds up being larger than the rational number 0.3
and hence disagreeing with the constant in your code.
A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.
Side Note: All positional (base-N) number systems share this problem with precision
Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...
You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system, but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001)** - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.
** Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation, since 5/7 can't be represented exactly with any decimal number).
So no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)
Side Side Note: Working with Floats in Programming
In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:
Do not do if (x == y) { ... }
Instead do if (abs(x - y) < myToleranceValue) { ... }
.
where abs
is the absolute value. myToleranceValue
needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These are not to be used as tolerance values.
Best Answer
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now to some meat.
The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an
unsigned int
in the range 0 < value < 224. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question.IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:
The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127.
(Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, the IEEE-754 number 0x40000000 is interpreted as follows:
So the value is 1.0 x 21 = 2.0.
To convert an
unsigned int
in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps:reinterpret_cast
, converts the resulting bit-pattern to afloat
. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing aunion
. Some platforms provide an intrinsic operation (such as_itof
) to make this reinterpretation less ugly.There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:
You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like
clz
for "count leading zeros", ornorm
for "normalize".)You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.
For integers >= 224, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss.
You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 224, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.