Fabulous Adventures In Coding

Eric Lippert's Blog

Floating Point Arithmetic, Part One

A month ago I was discussing some of the issues in integer arithmetic, and I said that issues in floating point arithmetic were a good subject for another day. Over the weekend I got some questions from a reader about floating point arithmetic, so this seems like as good a time as any to address them.

Before I talk about some of the things that can go terribly wrong with floating point arithmetic, it's helpful (and character building) to understand how exactly a floating point number is represented internally.

To distinguish between decimal and binary numbers, I'm going to do all binary numbers in blue fixed-width.

Here's how floating point numbers work.  The floats I'm talking about here are double-precision: 64 bits.  Of that, one bit represents the sign: 0 is positive, 1 is negative.

Eleven bits represent the exponent.  To determine the exponent value, treat the exponent field as an eleven-bit unsigned integer, then subtract 1023.  However, note that the exponent fields 00000000000 and 11111111111 have special meaning, which we'll come to later.

The remaining 52 bits represent the mantissa.

To compute the value of a float, here's what you do.  You take the mantissa, and you stick a "1." onto its left hand side.  Then you compute that value as a 53 bit fraction with 52 fractional places.  Then you multiply that by two to the power of the given exponent value, and sign it appropriately.

So for example, the number -5.5 is represented like this: (sign, exponent, mantissa)

(1, 10000000001, 0110000000000000000000000000000000000000000000000000)

The sign is 1, so it's a negative number.  The exponent is 1025 - 1023 = 2.  Put a 1. on the front of the mantissa and you get 1.0110000000000000000000000000000000000000000000000000 = 1.375 and sure enough, -1.375 x 2^2 = -5.5
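If you'd like to check that decoding yourself, here is a minimal sketch in Python. The language and the helper names decode_double, fields_as_text and value_from_fields are my own choices for illustration, not anything from a library. It assumes the platform's Python float is an IEEE 754 double, which is true on essentially every machine you're likely to run it on.

```python
import struct

def decode_double(x):
    """Split a Python float (assumed to be an IEEE 754 double) into its three fields."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # top bit
    exponent = (bits >> 52) & 0x7FF       # next 11 bits, still biased by 1023
    mantissa = bits & ((1 << 52) - 1)     # low 52 bits
    return sign, exponent, mantissa

def fields_as_text(x):
    """Format the fields the same way as the (sign, exponent, mantissa) triples above."""
    sign, exponent, mantissa = decode_double(x)
    return "({}, {:011b}, {:052b})".format(sign, exponent, mantissa)

def value_from_fields(sign, exponent, mantissa):
    """Rebuild a normalized value: put "1." on the front, scale by 2^exponent, sign it."""
    significand = 1 + mantissa / 2**52    # exact, because mantissa < 2**52
    magnitude = significand * 2.0 ** (exponent - 1023)
    return -magnitude if sign else magnitude

print(fields_as_text(-5.5))
print(value_from_fields(*decode_double(-5.5)))
```

Note that value_from_fields only handles the ordinary case; it deliberately ignores the two special exponent fields.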

This system is nice because it means that every normalized number a float can represent has exactly one representation, so no bits are wasted on duplicates.

However, you might be wondering how zero is represented, since every bit pattern has 1. plunked onto the beginning.  That's where the special values for the exponent come in.  If the exponent is 00000000000, then the float is considered a "denormal".  It gets 0. plunked onto the beginning, not 1., and the exponent is assumed to be -1022.  This has the nice property that if all bits in the float are zero, it's representing zero. Note that this lets you represent smaller numbers than you would be able to otherwise, as we'll see, though you pay the price of lower precision.  Essentially, denormals exist so that the chip can do "graceful underflow" -- represent tiny values without having to go straight to zero.

If the exponent is 11111111111 and the mantissa is all zeros, that's Infinity (positive or negative, according to the sign bit).  If the exponent is 11111111111 and the mantissa is not all zeros, that's considered to be Not A Number -- this is a bit pattern reserved for errors.
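The rules for the two special exponent fields can be sketched as a small classifier. This is an illustration in Python under the same assumption as before (that a Python float is an IEEE 754 double); classify is a made-up helper name, not a standard function.

```python
import math
import struct

def classify(x):
    """Name the kind of double x is, looking only at the exponent and mantissa fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0:
        # All-zero exponent: zero if the mantissa is also zero, otherwise a denormal.
        return "zero" if mantissa == 0 else "denormal"
    if exponent == 0x7FF:
        # All-one exponent: infinity if the mantissa is zero, otherwise Not A Number.
        return "infinity" if mantissa == 0 else "nan"
    return "normal"

for sample in (0.0, -0.0, 5e-324, 1.0, math.inf, -math.inf, math.nan):
    print(repr(sample), classify(sample))
```

The sign bit plays no part in the classification, which is why both zeros and both infinities land in the same bucket.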

So the biggest and smallest positive normalized floats are

(0, 11111111110, 1111111111111111111111111111111111111111111111111111)

which is 1.1111111111111111111111111111111111111111111111111111 x 2^1023, and

(0, 00000000001, 0000000000000000000000000000000000000000000000000000)

which is 1.000 x 2^-1022
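As a sanity check, these two extremes can be built from the bit-level description and compared with what the runtime reports. A sketch in Python, assuming its float is an IEEE 754 double; sys.float_info is part of the standard library, and the variable names are just mine.

```python
import sys

# Biggest normalized double: 53 one-bits with the leading 1 worth 2^1023,
# that is (2^53 - 1) x 2^-52 x 2^1023 = (2^53 - 1) x 2^971. The integer is
# exactly representable, so float() introduces no rounding here.
biggest_normal = float((2**53 - 1) * 2**971)

# Smallest positive normalized double: 1.000... x 2^-1022.
smallest_normal = 2.0 ** -1022

print(biggest_normal == sys.float_info.max)
print(smallest_normal == sys.float_info.min)
```

Note that sys.float_info.min is the smallest positive normalized value, not the smallest positive value overall.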

The smallest and biggest positive denormalized floats are

(0, 00000000000, 0000000000000000000000000000000000000000000000000001)

which is 0.0000000000000000000000000000000000000000000000000001 x 2^-1022 = 2^-1074, and

(0, 00000000000, 1111111111111111111111111111111111111111111111111111)

which is 0.1111111111111111111111111111111111111111111111111111 x 2^-1022
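The denormal extremes, and the "graceful underflow" they provide, can be poked at the same way. This is a sketch in Python, assuming its float is an IEEE 754 double and that the process has not been put into a flush-to-zero mode (some numeric libraries and compiler settings do that); the variable names are mine.

```python
import math
import sys

smallest_normal = sys.float_info.min            # 1.000... x 2^-1022
smallest_denormal = math.ldexp(1.0, -1074)      # 0.000...1 x 2^-1022 = 2^-1074
# One step below the smallest normal is the biggest denormal, 0.111...1 x 2^-1022.
biggest_denormal = smallest_normal - smallest_denormal

print(smallest_denormal, biggest_denormal)

# Halving the smallest normal lands on a denormal rather than dropping to zero...
print(smallest_normal / 2 > 0.0)
# ...but halving the smallest denormal has nowhere left to go and rounds to zero.
print(smallest_denormal / 2 == 0.0)
```

The subtraction is exact because both operands are whole multiples of 2^-1074, and so is every denormal.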

Next time: floating point math is nothing like real number math.

Published Monday, January 10, 2005 11:34 AM by Eric Lippert


Comments

 

Raymond Chen said:

No discussion of floating point is complete without a link to "What every computer scientist should know about floating-point arithmetic".

http://docs.sun.com/source/806-3568/ncg_goldberg.html
January 10, 2005 12:18 PM
 

RichB said:

Perhaps a future entry could also discuss situations when today's developers should be using floating point and when they should be using something like a Decimal/Currency class. I regularly see developers writing financial software and using single/double datatypes.
January 10, 2005 1:24 PM
 

mschaef said:

William Kahan also has written a bunch on floating point math.

http://www.cs.berkeley.edu/~wkahan/MathSand.pdf

He was involved in a lot of the work HP put into its calculators in the 70's and 80's, and IIRC was on the IEEE-754/854 committee.
January 10, 2005 2:26 PM
 

Mat Hall said:

A useless bit of knowledge for you: 21963283741 is the only number that produces the same bit pattern when represented as an integer or float on a PDP-10...
January 10, 2005 3:22 PM
 

Norman Diamond said:

1/10/2005 3:22 PM Mat Hall

> 21963283741 is the only number that produces
> the same bit pattern when represented as an
> integer or float on a PDP-10...

But then Mr. Lippert would have to write a Part Zero, explaining that there still exist machines whose arithmetic was spec'ed out prior to IEEE 754, and there still exist databases containing floats and doubles in those formats.

Meanwhile on IBM mainframes 0 was a number with the same bit pattern when represented as an integer or float.

1/10/2005 1:24 PM RichB

> I regularly see developers writing financial
> software and using single/double datatypes.

Argh. Meanwhile, have you seen any using floating datatypes for timestamps?
January 10, 2005 5:04 PM
 

Nicholas Allen said:

A very large number of developers use floating point datatypes for their timestamps, even if they don't know it.

http://weblogs.asp.net/ericlippert/archive/2003/09/16/53013.aspx
January 10, 2005 5:42 PM
 

Centaur said:

Borland Delphi and C++Builder have this:

type TDateTime = type Double;

where the integer part is the date and the fractional part is the time.

Thus, it is easiest to use floats in the database if your application needs dates, times and timestamps.

The documentation states that Double has 15 to 16 significant digits (assuming decimal), and 2005-01-11 is day 38363, so we have 10 decimal digits accuracy for the time, which amounts to 8.64 microseconds.

Of course, if one is sloppy and defines a table column as, for example, FLOAT in Interbase (7 decimals), this will be accurate only to ~5 minutes. Which is why some applications store the date part and the time part in separate columns.
January 10, 2005 9:51 PM
 

sch said:

How do you get those "-1074" and "-1022" in the biggest and smallest positive denormalized floats? Shouldn't it be "-1023"?
January 11, 2005 6:04 AM
 

Eric Lippert said:

Sorry, you're right, I should have mentioned that. Denormals are automatically assumed to have an exponent of -1022, rather than subtracting 1023 from 0. Otherwise you end up being unable to represent anything between 2^-1023 and 2^-1022 by either a normal or denormal float. Since the whole point of denormals is to degrade into an underflow situation gracefully, it would be bad if there were such a gap.

January 11, 2005 8:48 AM
 

Mike said:

Reading this brings back memories of my maths teacher who taught us 6502 assembly when I was 10.
We wrote a floating point extension to the BASIC on the Acorn Atom in assembly as a "class project".

He made it seem so easy, too!
January 12, 2005 2:18 AM
 

James said:

Argh - I've just realised that I understand things even less than I thought I did. Is epsilon the smallest value greater than zero, the smallest normalised value greater than zero, the smallest value greater than one, or none of these? Should I care? Can zero and infinity be both positive and negative? Help.
January 12, 2005 3:50 PM
 

Eric Lippert said:

"Machine Epsilon" is the smallest value such that 1+e > 1

Floating point math does admit both +/- infinity and +/- 0, though of course +0 always equals -0.

January 12, 2005 4:26 PM
 

TheCPUWizard said:

Mat Hall said: "...on a PDP-10..."

Do you have a PDP-10 to show proof of that????

If you are interested in old DEC hardware [I have a PDP-8, a few PDP-11s and Vaxen], please drop me a note: david.corbin@dynconcepts.com

June 12, 2009 10:25 AM


About Eric Lippert

Eric Lippert is a senior developer on the Microsoft C# compiler team. Before that he worked on the framework of Visual Studio Tools For Office. Before that, he worked on the compilers, runtimes and tools for VBScript, JScript, Windows Script Host and other Microsoft Scripting technologies. He lives in Seattle and spends his free time editing books about programming languages, playing the piano, and trying to keep his tiny sailboat upright in Puget Sound.
