What’s the difference between a single precision and double precision floating point operation

floating pointoperationsprecisionprocessor

What is the difference between a single precision floating point operation and double precision floating operation?

I'm especially interested in practical terms in relation to video game consoles. For example, does the Nintendo 64 have a 64 bit processor and if it does then would that mean it was capable of double precision floating point operations? Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision and in general use is the double precision capabilities made use of (if they exist?).

Best Answer

Note: the Nintendo 64 does have a 64-bit processor, however:

Many games took advantage of the chip's 32-bit processing mode as the greater data precision available with 64-bit data types is not typically required by 3D games, as well as the fact that processing 64-bit data uses twice as much RAM, cache, and bandwidth, thereby reducing the overall system performance.

From Webopedia:

The term double precision is something of a misnomer because the precision is not really double.
The word double derives from the fact that a double-precision number uses twice as many bits as a regular floating-point number.
For example, if a single-precision number requires 32 bits, its double-precision counterpart will be 64 bits long.

The extra bits increase not only the precision but also the range of magnitudes that can be represented.
The exact amount by which the precision and range of magnitudes are increased depends on what format the program is using to represent floating-point values.
Most computers use a standard format known as the IEEE floating-point format.

The IEEE double-precision format actually has more than twice as many bits of precision as the single-precision format, as well as a much greater range.

From the IEEE standard for floating point arithmetic

Single Precision

The IEEE single precision floating point standard representation requires a 32 bit word, which may be represented as numbered from 0 to 31, left to right.

The first bit is the sign bit, S,
the next eight bits are the exponent bits, 'E', and

the final 23 bits are the fraction 'F':

S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                    31

The value V represented by the word may be determined as follows:

If E=255 and F is nonzero, then V=NaN ("Not a number")
If E=255 and F is zero and S is 1, then V=-Infinity
If E=255 and F is zero and S is 0, then V=Infinity
If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are "unnormalized" values.
If E=0 and F is zero and S is 1, then V=-0
If E=0 and F is zero and S is 0, then V=0

In particular,

0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0

0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity

0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN

0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) 
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 
                                     0.00000000000000000000001 = 
                                     2**(-149)  (Smallest positive value)

Double Precision

The IEEE double precision floating point standard representation requires a 64 bit word, which may be represented as numbered from 0 to 63, left to right.

The first bit is the sign bit, S,
the next eleven bits are the exponent bits, 'E', and

the final 52 bits are the fraction 'F':

S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 1        11 12                                                63

The value V represented by the word may be determined as follows:

If E=2047 and F is nonzero, then V=NaN ("Not a number")
If E=2047 and F is zero and S is 1, then V=-Infinity
If E=2047 and F is zero and S is 0, then V=Infinity
If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F) These are "unnormalized" values.
If E=0 and F is zero and S is 1, then V=-0
If E=0 and F is zero and S is 0, then V=0

Reference:
ANSI/IEEE Standard 754-1985,
Standard for Binary Floating Point Arithmetic.

Related Solutions

.net – Difference between decimal, float and double in .NET

float and double are floating binary point types. In other words, they represent a number like this:

10001.10010110011

The binary number and the location of the binary point are both encoded within the value.

decimal is a floating decimal point type. In other words, they represent a number like this:

12345.65789

Again, the number and the location of the decimal point are both encoded within the value – that's what makes decimal still a floating point type instead of a fixed point type.

The important thing to note is that humans are used to representing non-integers in a decimal form, and expect exact results in decimal representations; not all decimal numbers are exactly representable in binary floating point – 0.1, for example – so if you use a binary floating point value you'll actually get an approximation to 0.1. You'll still get approximations when using a floating decimal point as well – the result of dividing 1 by 3 can't be exactly represented, for example.

As for what to use when:

For values which are "naturally exact decimals" it's good to use decimal. This is usually suitable for any concepts invented by humans: financial values are the most obvious example, but there are others too. Consider the score given to divers or ice skaters, for example.
For values which are more artefacts of nature which can't really be measured exactly anyway, float/double are more appropriate. For example, scientific data would usually be represented in this form. Here, the original values won't be "decimally accurate" to start with, so it's not important for the expected results to maintain the "decimal accuracy". Floating binary point types are much faster to work with than decimals.

Javascript – How to deal with floating point number precision in JavaScript

From the Floating-Point Guide:

What can I do to avoid this problem?

That depends on what kind of calculations you’re doing.

If you really need your results to add up exactly, especially when you work with money: use a special decimal datatype.

If you just don’t want to see all those extra decimal places: simply format your result rounded to a fixed number of decimal places when displaying it.

If you have no decimal datatype available, an alternative is to work with integers, e.g. do money calculations entirely in cents. But this is more work and has some drawbacks.

Note that the first point only applies if you really need specific precise decimal behaviour. Most people don't need that, they're just irritated that their programs don't work correctly with numbers like 1/10 without realizing that they wouldn't even blink at the same error if it occurred with 1/3.

If the first point really applies to you, use BigDecimal for JavaScript, which is not elegant at all, but actually solves the problem rather than providing an imperfect workaround.

Best Answer

Related Solutions

.net – Difference between decimal, float and double in .NET

Javascript – How to deal with floating point number precision in JavaScript

Related Topic