Some very common questions I get from customers regarding floating point are:

 

-          I get different results when compiling with optimizations vs without optimizations!

-          My == checks are failing when the expressions are the same!

 

The answer for this question is most of the time ‘This is by design’, but it does seem that we’re not doing a good job explaining why it is by design or why things work this way. I’ll try to cover the most common causes for this behavior in this post.

 

X87 FPU

 

X86’s old CPU is based on the following

 

- 8 registers laid out in the form of a stack. The registers are 80 bit wide, although the precision at which operations happen (significant bits of the mantissa) can be modified via a control register. The range cannot be changed though (the exponent will always be 15 bit wide)

- Some control registers that control the precision, exception modes, etc of the FPU.

- Set of instructions to operate with stack registers and/or memory. Instructions that operate with memory can only use a register operand if the register is the top of the stack. Operations that involve 2 FP registers can access any element of the stack. This aspect of the FPU is what makes code generation for it more ‘interesting’

 

Most of the non intuitive behavior people encounter comes from the fact that the registers are 80 bit wide. Precision is set by default in VC++ and CLR apps to ‘double precision’, which means that if you are operating with operands of type float, results of operations done with floats actually exist in the x87 stack as if there were of type double. In fact, it’s even weirder than that. They will have the mantissa of a double, but the range (exponent) of an extended double (80 bit).

 

That means that a sequence of operations like the following

 

-          Load float A

-          Load float B

-          Multiply loaded arguments

-          Store float result  in memory

-          load float A

-          load float B

-          Multiply loaded arguments

-          Compare with stored result in memory

 

May result in a result of ‘not equal’. Why? If the result of the first multiplication was not representable exactly as a float, it lost bits of precision when stored in memory, thus, when we compared what we stored in memory with what was in the FP stack, we see that  the numbers are similar, but not equal, resulting in a surprising ‘not equal’ result.

 

Even more subtle

 

-          load double MAX_DOUBLE

-          load double MAX_DOUBLE

-          add

-          store double A

-          load double A

-          load double MAX_DOUBLE

-          Substract

 

Is different than

 

-          load double MAX_DOUBLE

-          load double MAX_DOUBLE

-          add

-          load double MAX_DOUBLE

-          substract

 

 

 

 

At the end of this sequence we have +Infinity in the floating point stack, because 2*MAX_DOUBLE is bigger than MAX_DOUBLE, which is the biggest number that can be represented in a double. Substracting MAX_DOUBLE from Inifinity still results in infinity. However, in the second example, everything happened in the floating point stack, which is operating in double extended range, and can represent 2*MAX_DOUBLE, so when we substract, we get MAX_DOUBLE, instead of Infinity.

 

What CLR has to say

 

How does the CLR play with this weird HW? Pulled out from the ECMA spec:

 

Storage locations for floating-point numbers (statics, array elements, and fields of classes) are of fixed size. The supported storage sizes are float32 and float64. Everywhere else (on the evaluation stack, as arguments, as return types, and as local variables) floating-point numbers are represented using an internal floating-point type. In each such instance, the nominal type of the variable or expression is either R4 or R8, but its value can be represented internally with additional range and/or precision.  The size of the internal floating-point representation is implementation-dependent, can vary, and shall have precision at least as great as that of the variable or expression being represented. An implicit widening conversion to the internal representation from float32 or float64 is performed when those types are loaded from storage. The internal representation is typically the native size for the hardware, or as required for efficient implementation of an operation.  The internal representation shall have the following characteristics:

·              The internal representation shall have precision and range greater than or equal to the nominal type.

·              Conversions to and from the internal representation shall preserve value.

[Note: This implies that an implicit widening conversion from float32 (or float64) to the internal representation, followed by an explicit conversion from the internal representation to float32 (or float64), will result in a value that is identical to the original float32 (or float64) value. end note]

[Rationale: This design allows the CLI to choose a platform-specific high-performance representation for floating-point numbers until they are placed in storage locations.  For example, it might be able to leave floating-point variables in hardware registers that provide more precision than a user has requested.  At the same time, CIL generators can force operations to respect language-specific rules for representations through the use of conversion instructions. end rationale]

When a floating-point value whose internal representation has greater range and/or precision than its nominal type is put in a storage location, it is automatically coerced to the type of the storage location.  This can involve a loss of precision or the creation of an out-of-range value (NaN, +infinity, or ‑infinity). However, the value might be retained in the internal representation for future use, if it is reloaded from the storage location without having been modified.  It is the responsibility of the compiler to ensure that the retained value is still valid at the time of a subsequent load, taking into account the effects of aliasing and other execution threads (see memory model section).  This freedom to carry extra precision is not permitted, however, following the execution of an explicit conversion (conv.r4 or conv.r8), at which time the internal representation must be exactly representable in the associated type.

[Note: To detect values that cannot be converted to a particular storage type, a conversion instruction (conv.r4, or conv.r8) can be used, followed by a check for a non-finite value using ckfinite. Underflow can be detected  by converting to a particular storage type, comparing to zero before and after the conversion. end note]

[Note: The use of an internal representation that is wider than float32 or float64 can cause differences in computational results when a developer makes seemingly unrelated modifications to their code, the result of which can be that a value is spilled from the internal representation (e.g., in a register) to a location on the stack. end note]

 

This spec clearly had in mind the x87 FPU. The spec is basically saying that a CLR implementation is allowed to use an internal representation (in our case, the x87 80 bit representation) as long as there is no explicit storage to a coerced location (a class or valuet type field), that forces narrowing. Also, at any point, the IL stream may have conv.r4 and conv.r8 instructions, which will force the narrowing to happen.

 

Why did the spec implementers decide to go down this route? Imagine they hadn’t, and said that the precision/range of FP results would always have to be of the type of their operands. In x87 it would mean having to spill to memory (in order to narrow down to the operand precision/range) after every operation. This can become a very high price to pay for that extra predictability, which is not really cross platform, and that not everybody needs.

 

Typical problems: It’s a language choice

 

Note that with current spec, it’s still a language choice to give ‘predictability’. The language may insert conv.r4 or conv.r8 instructions after every FP operation to get a ‘predictable’ behavior. Obviously, this is really expensive, and different languages have different compromises. C#, for example, does nothing, if you want narrowing, you will have to insert (float) and (double) casts by hand. On the other hand, VC++ Whidbey has a new floating point model, which by default will do narrowing on assignment boundaries (it’s more complex than that, see http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/floapoint.asp for more details).

 

Let’s look what can happen with current C# semantics in a recent question we got:

 

 

int Compare(Point x, Point y)

{

float thetaX = Worker(pivot, x);

      float thetaY = Worker (pivot, y);

 

      if (thetaX == thetaY)

      {

            return (0);

      }

      else

      {

            return (-1);

      }

}

 

Worker was a ‘pure’ function that returned a float value, as in it depended only of it’s input for it’s result, and produced no side effects. The surprise was coming from the fact that if you called Compare with an x == y, the function would still result in a -1. Let’s look at the disassembly to see what may have happened:

 

 

            …. Prolog

            lea     EAX, bword ptr [ESI+4]

            cmp     ECX, dword ptr [EAX]

            sub     ESP, 8

            movq    XMM0, qword ptr [EAX]

            movq    qword ptr [ESP], XMM0

            lea     EAX, bword ptr [ESP+14H+08H]

            sub     ESP, 8

            movq    XMM0, qword ptr [EAX]

            movq    qword ptr [ESP], XMM0

            call    [Worker(struct,struct):float]

            fstp    dword ptr [ESP]

            add     ESI, 4

            sub     ESP, 8

            movq    XMM0, qword ptr [ESI]

            movq    qword ptr [ESP], XMM0

            lea     EAX, bword ptr [ESP+0CH+08H]

            sub     ESP, 8

            movq    XMM0, qword ptr [EAX]

            movq    qword ptr [ESP], XMM0

            call    [Worker(struct,struct):float]

            fld     dword ptr [ESP]

            fcomip  ST(0), ST(1)

            fstp    ST(0)

            jpe     SHORT G_M004_IG03

            jne     SHORT G_M004_IG03

            xor     EAX, EAX

            …. Epilog

 

 

 

What is happening is that Worker returned a value in the ST(0) register of the FP stack, which was narrowed down to float precision in the fstp (memory store) following the first call. But then, for doing the comparison, it’s comparing ST(0) returned by the second call with the narrowed down to float result of the first call. So if what Worker did was not exactly representable in a float, the comparison (fcomip) would result in false.

 

How could this be solved? By narrowing by hand the result of the Worker() calls:

 

int Compare(Point x, Point y)

{

float thetaX = (float) Worker(pivot, x);

      float thetaY = (float) Worker (pivot, y);

 

      if (thetaX == thetaY)

      {

            return (0);

      }

      else

      {

            return (-1);

      }

}

 

Which generates:

 

            …prolog

            lea     EAX, bword ptr [ESI+4]

            cmp     ECX, dword ptr [EAX]

            sub     ESP, 8

            movq    XMM0, qword ptr [EAX]

            movq    qword ptr [ESP], XMM0

            lea     EAX, bword ptr [ESP+18H+08H]

            sub     ESP, 8

            movq    XMM0, qword ptr [EAX]

            movq    qword ptr [ESP], XMM0

            call    [ComputationalGeometryUtils.theta(struct,struct):float]

            fstp    dword ptr [ESP]

            add     ESI, 4

            sub     ESP, 8

            movq    XMM0, qword ptr [ESI]

            movq    qword ptr [ESP], XMM0

            lea     EAX, bword ptr [ESP+10H+08H]

            sub     ESP, 8

            movq    XMM0, qword ptr [EAX]

            movq    qword ptr [ESP], XMM0

            call    [ComputationalGeometryUtils.theta(struct,struct):float]

            fstp    dword ptr [ESP+04H] ; Narrow down

            fld     dword ptr [ESP+04H] ; and reload result of call

            fld     dword ptr [ESP] ; result of first call

            fcomip  ST(0), ST(1) ; comparison is now done on narrowed down results

            fstp    ST(0)

            jpe     SHORT G_M004_IG03

            jne     SHORT G_M004_IG03

            xor     EAX, EAX

           .epilog

 

This will now be more in line with what programmer expects. All this that is happening is in 100% compliance with CLR ECMA specification. I would agree that it is a behavior that is not 100% intuitive, but the problem is that being intuitive will result in a performance penalty, which all users may not be happy with. Maybe the solution is for C# to adopt FP models just like C++ has now, which, although don’t solve the problem completely, do help in the most common situations.