Hello ARM: Exploring Undefined, Unspecified, and Implementation-defined Behavior in C++

Michael from Redmond

October 25th, 2012

With the introduction of Windows RT for ARM devices, many Windows software developers will be encountering ARM processors for the first time. For the native C++ developer this means the potential for running afoul of undefined, unspecified, or implementation-defined behavior–as defined by the C++ language–that is expressed differently on the ARM architecture than on the x86 or x64 architectures that most Windows developers are familiar with.

In C++, the results of certain operations, under certain circumstances, are intentionally ambiguous according to the language specification. In the specification this is known as “undefined behavior” and, simply put, means “anything can happen, or nothing at all. You’re on your own”. Having this kind of ambiguity in a programming language might seem undesirable at first, but it’s actually what allows C++ to be supported on so many different hardware platforms without sacrificing performance. By not requiring a specific behavior in these circumstances, compiler vendors are free to do whatever makes reasonable sense. Usually that means whatever is easiest for the compiler or most efficient for the target hardware platform. However, because undefined behavior can’t be anticipated by the language, it’s an error for a program to rely on the expressed behavior of a particular platform. As far as the C++ specification is concerned, it’s perfectly standards-conforming for the behavior to be expressed differently on another processor architecture, to change between microprocessor generations, to be affected by compiler optimization settings, or for any other reason at all.

Another category of behavior, known as “implementation-defined behavior”, is similar to undefined behavior in that the C++ language specification doesn’t prescribe a particular expression, but is different in that the specification requires the compiler vendor to define and document how their implementation behaves. This gives the vendor freedom to implement the behavior as they see fit, while also giving users of their implementation a guarantee that the behavior can be relied upon, even if the behavior might be non-portable or can change in the next version of the vendor’s compiler.

Finally, there is “unspecified behavior” which is just implementation-defined behavior without the requirement for the compiler vendor to document it.
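To make the three categories a little more concrete, here is a minimal sketch. The function names are hypothetical and chosen only for illustration. The first two functions report implementation-defined properties that a compiler must document; the third shows one way to avoid undefined behavior (signed integer overflow) by checking before the operation rather than relying on what a particular processor happens to do.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Implementation-defined: whether plain 'char' is signed.
// The compiler vendor must document the choice.
bool plain_char_is_signed()
{
    return std::numeric_limits<char>::is_signed;
}

// Implementation-defined: the width of 'long' (hypothetical helper).
int long_width_bits()
{
    return static_cast<int>(sizeof(long) * CHAR_BIT);
}

// Undefined: signed integer overflow. Instead of performing a + b and
// inspecting the result, test whether the addition would overflow first.
bool add_would_overflow(int a, int b)
{
    return (b > 0 && a > INT_MAX - b) ||
           (b < 0 && a < INT_MIN - b);
}
```

Unspecified behavior is harder to show in isolation because nothing requires the result to be documented or even consistent; the argument evaluation order discussed later in this post is a typical case.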

All of this undefined, unspecified, and implementation-defined behavior can make porting code from one platform to another a tricky proposition. Even writing new code for an unfamiliar platform might seem at times like you’ve been dropped into a parallel universe that’s just slightly askew from the one you know. Some developers might routinely cite the C++ language specification from memory, but for the rest of us the border between portable C++ and undefined, unspecified, or implementation-defined behavior might not always be at the forefront of our minds. It can be easy to write code that relies on undefined, unspecified or implementation-defined behavior without even realizing it.

To help you make a smooth transition to Windows RT and ARM development, we’ve compiled some of the most common ways that developers might encounter (or stumble into) undefined, unspecified, or implementation-defined behavior in “working” code–complete with examples of how the behavior is expressed on ARM, x86 and x64 platforms using the Visual C++ tool chain. The list below is by no means exhaustive, and although the specific behaviors cited in these examples can be demonstrated on particular platforms, the behaviors themselves should not be relied upon in your own code. We include the observed behaviors only because this information might help you recognize how your own code might rely on them.

Floating point to integer conversions

On the ARM architecture, when a floating-point value is converted to a 32-bit integer, it saturates; that is, if the floating-point value is outside the range of the integer, it converts to the nearest value that the integer can represent. For example, when converting to an unsigned integer, a negative floating-point value always converts to 0, and a value that is too large for an unsigned integer to represent converts to 4294967295. When converting to a signed integer, the floating-point value converts to -2147483648 if it is too small for a signed integer to represent, or to 2147483647 if it is too large. On x86 and x64 architectures, floating-point conversion does not saturate; instead, the result wraps around if the target type is unsigned, or is set to -2147483648 if the target type is signed.

The differences are even more pronounced for integer types smaller than 32 bits. None of the architectures discussed have direct support for converting a floating-point value to integer types smaller than 32 bits, so the conversion is performed as if the target type is 32 bits wide, and the result is then truncated to the correct number of bits. Here you can see the result of converting +/- 5 billion (5e009) to various signed and unsigned integer types on each platform:

Results of converting +5e+009 to different signed and unsigned integer sizes
+5e+009	ARM 32-bit	ARM 16-bit	ARM 8-bit	x86/x64 32-bit	x86/x64 16-bit	x86/x64 8-bit
unsigned	4294967295	65535	255	705032704	0	0
signed	+2147483647	-1	-1	-2147483648	0	0

Results of converting -5e+009 to different signed and unsigned integer sizes
-5e+009	ARM 32-bit	ARM 16-bit	ARM 8-bit	x86/x64 32-bit	x86/x64 16-bit	x86/x64 8-bit
unsigned	0	0	0	3589934592	0	0
signed	-2147483648	0	0	-2147483648	0	0

As you can see, there’s no simple pattern to what’s going on because saturation doesn’t take place in all cases, and because truncation doesn’t preserve the sign of a value.

Still other values introduce more arbitrary conversions. On ARM, when you convert a NaN (Not-a-Number) floating point value to an integer type, the result is 0x00000000. On x86 and x64, the result is 0x80000000.

The bottom line for floating point conversion is that you can’t rely on a consistent result unless you know that the value fits within a range that the target integer type can represent.
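One way to get a consistent result is to make the out-of-range policy explicit in your own code. The following is only a sketch: saturate_cast is a hypothetical helper name, and both the choice to saturate and the choice to map NaN to 0 are assumptions made for illustration, not something the compiler or the standard library provides.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: converts a double to an integer type with explicit,
// platform-independent saturation. The cast itself is only performed when
// the value is known to fit, so no undefined behavior is involved.
template <typename To>
To saturate_cast(double value)
{
    static_assert(std::numeric_limits<To>::is_integer,
                  "To must be an integer type");

    if (std::isnan(value)) {
        return To(0);  // assumption: NaN maps to 0; pick what suits your program
    }

    const double lo = static_cast<double>(std::numeric_limits<To>::min());
    const double hi = static_cast<double>(std::numeric_limits<To>::max());

    if (value <= lo) return std::numeric_limits<To>::min();
    if (value >= hi) return std::numeric_limits<To>::max();

    return static_cast<To>(value);  // in range: truncates toward zero
}
```

With a helper like this, saturate_cast<std::uint16_t>(5e9) yields 65535 and saturate_cast<std::int8_t>(-5e9) yields -128 regardless of the target architecture, because the range check happens before the conversion.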

Shift operators

On the ARM architecture, the shift operators always behave as if they take place in a 256-bit pattern space, regardless of the operand size–that is, the pattern repeats, or “wraps around”, only every 256 positions. Another way of thinking of this is that the pattern is shifted the specified number of positions modulo 256. Then, of course, the result contains just the least-significant bits of the pattern space.

On the x86 and x64 architectures, the behavior of the shift operator depends both on the size of the operand and on whether the code is compiled for x86 or x64. On both x86 and x64, operands that are 32 bits in size or smaller behave the same–the pattern space repeats every 32 positions. However, operands that are larger than 32 bits in size behave differently when compiled for the x86 and x64 architectures. Because the x64 architecture has an instruction for shifting 64-bit values, the compiler emits this instruction to carry out the shift; but the x86 architecture doesn’t have a 64-bit shift instruction, so the compiler emits a small software routine to shift the 64-bit operand instead. The pattern space of this routine repeats every 256 positions. As a result, the x86 platform behaves less like its x64 sibling and more like ARM when shifting 64-bit operands.

Widths of the pattern spaces on each architecture:
Variable size	ARM	x86	x64
8	256	32	32
16	256	32	32
32	256	32	32
64	256	256	64

Let’s look at some examples. Notice that in the first table the x86 and x64 columns are identical, while in the second table it’s the x86 and ARM columns that match.

Given a 32-bit integer with a value of 1:
Shift amount	ARM	x86	x64
0	1	1	1
16	65536	65536	65536
32	0	1	1
48	0	65536	65536
64	0	1	1
96	0	1	1
128	0	1	1
256	1	1	1

Given a 64-bit integer with a value of 1:
Shift amount	ARM	x86	x64
0	1	1	1
16	65536	65536	65536
32	4294967296	4294967296	4294967296
48	2^48	2^48	2^48
64	0	0	1
96	0	0	4294967296
128	0	0	1
256	1	1	1

To help you avoid this error, the compiler emits warning C4293 to let you know that your code uses shifts that are too large (or negative) to be safe, but only if the shift amount is a constant or literal value.
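When the shift amount is only known at run time, the compiler can’t warn you, so the check has to live in your code. Here is a minimal sketch for 32-bit operands; shift_left_checked and shift_left_wrapped are hypothetical helper names, and which of the two policies you want is a decision for your program, not something the language decides for you.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: left shift with a defined result for any count.
// Counts of 32 or more shift every bit out, so the result is 0.
std::uint32_t shift_left_checked(std::uint32_t value, unsigned count)
{
    return (count < 32u) ? (value << count) : 0u;
}

// Hypothetical helper: if the program really wants the pattern to repeat
// every 32 positions, say so explicitly by masking the count.
std::uint32_t shift_left_wrapped(std::uint32_t value, unsigned count)
{
    return value << (count & 31u);
}
```

Either way, the count that reaches the << operator is always smaller than the operand width, so the result no longer depends on the width of the pattern space of the underlying hardware.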

‘volatile’ behavior

On the ARM architecture, the memory model is weakly-ordered. This means that a thread observes its own writes to memory in-order, but that writes to memory by other threads can be observed in any order unless additional measures are taken to synchronize the threads. The x86 and x64 architectures, on the other hand, have a strongly-ordered memory model. This means that a thread observes both its own memory writes, and the memory writes of other threads, in the order that the writes are made. In other words, a strongly-ordered architecture guarantees that if a thread, B, writes a value to location X, and then writes again to location Y, then another thread, A, will not see the update to Y before it sees the update to X. Weakly-ordered memory models make no such guarantee.

Where this intersects with the behavior of volatile variables is that, combined with the strongly-ordered memory model of x86 and x64, it was possible to (ab)use volatile variables for certain kinds of inter-thread communication in the past. This is the traditional semantics of the volatile keyword in Microsoft’s compiler, and a lot of software exists that relies on those semantics to function. However, the C++11 language specification does not require that such memory accesses be strongly-ordered across threads, so it is an error to rely on this behavior in portable, standards-conforming code.

For this reason, the Microsoft C++ compiler now supports two different interpretations of the volatile storage qualifier that you can choose between by using a compiler switch. /volatile:iso selects the strict C++ standard volatile semantics that do not guarantee strong ordering. /volatile:ms selects the Microsoft extended volatile semantics that do guarantee strong ordering.

Because /volatile:iso implements the C++ standard volatile semantics and can open the door for greater optimization, it’s a best practice to use /volatile:iso whenever possible, combined with explicit thread synchronization primitives where required. /volatile:ms is only necessary when the program depends upon the extended, strongly-ordered semantics.

Here’s where things get interesting.

On the ARM architecture, the default is /volatile:iso because ARM software doesn’t have a legacy of relying on the extended semantics. However, on the x86 and x64 architectures the default is /volatile:ms because a lot of the x86 and x64 software written using Microsoft’s compiler in the past relies on the extended semantics. Changing the default to /volatile:iso for x86 and x64 would silently break that software in subtle and unexpected ways.

Still, it’s sometimes convenient or even necessary to compile ARM software using the /volatile:ms semantics–for example, it might be too costly to rewrite a program to use explicit synchronization primitives. But take note that in order to achieve the extended /volatile:ms semantics within the weakly-ordered memory model of the ARM architecture, the compiler has to insert explicit memory barriers into the program which can add significant runtime overhead.

Likewise, x86 and x64 code that doesn’t rely on the extended semantics should be compiled with /volatile:iso in order to ensure greater portability and free the compiler to perform more aggressive optimization.
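As a rough illustration of what “explicit synchronization” can look like in place of a volatile flag, here is a sketch that uses the C++11 std::atomic type with release and acquire ordering. All of the names (g_payload, g_ready, producer, consumer, run_handoff) are hypothetical, and this is only one of several possible approaches; a mutex or another synchronization primitive may suit your program better.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int g_payload = 0;                    // ordinary data, written before the flag
std::atomic<bool> g_ready(false);     // takes the place of a 'volatile bool' flag

void producer()
{
    g_payload = 42;                                  // write the data first
    g_ready.store(true, std::memory_order_release);  // then publish the flag
}

int consumer()
{
    // The acquire load pairs with the release store above, so once the flag
    // reads true, the write to g_payload is guaranteed to be visible.
    while (!g_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return g_payload;
}

// Runs the hand-off across two threads and returns what the reader saw.
int run_handoff()
{
    g_payload = 0;
    g_ready.store(false);

    int seen = 0;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    return seen;
}
```

Because the ordering requirement is stated in the code itself, the same source expresses the same guarantee on ARM, x86, and x64, independent of which /volatile switch is in effect.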

Argument evaluation order

Code that relies on function call arguments being evaluated in a specific order is faulty on any architecture because the C++ standard says that the order in which function arguments are evaluated is unspecified. This means that, for a given function call F(A, B), it’s impossible to know whether A or B will be evaluated first. In fact, even when targeting the same architecture with the same compiler, things like calling convention and optimization settings can influence the order of evaluation.

While the standard leaves this behavior unspecified, in practice, the evaluation order is determined by the compiler based on properties of the target architecture, calling convention, optimization settings, and other factors. When these factors remain stable, it’s possible that code which inadvertently relies on a specific evaluation order can go unnoticed for quite some time. But migrate that same code to ARM, and you might shake things up enough to change the evaluation order, causing it to break.

Fortunately, many developers are already aware that argument evaluation order is unspecified and are careful not to rely on it. Even still, it can creep into code in some unintuitive places, such as member functions or overloaded operators. Both of these constructs are translated by the compiler into regular function calls, complete with unspecified evaluation order. Take the following code example:

 Foo foo;
 
 foo->bar(*p);

This looks well defined, but what if -> and * are actually overloaded operators? Then, this code expands to something like this:

 Foo::bar(operator->(foo), operator*(p));

Thus, if operator->(foo) and operator*(p) interact in some way, this code example might rely on a specific evaluation order, even though it would appear at first glance that bar() has only one argument.
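A simple way to remove the dependency is to evaluate the interacting expressions in separate statements, which are always executed in order. The sketch below uses hypothetical functions (next_id, describe, fragile, portable) invented purely for illustration.

```cpp
#include <cassert>
#include <string>

// Has a side effect: each call increments the counter.
int next_id(int& counter)
{
    return ++counter;
}

std::string describe(int first, int second)
{
    return std::to_string(first) + "," + std::to_string(second);
}

// Fragile: both arguments modify the same counter, and the order in which
// they are evaluated is unspecified. Starting from 0, the result may be
// "1,2" or "2,1" depending on compiler, target, and settings.
std::string fragile(int& counter)
{
    return describe(next_id(counter), next_id(counter));
}

// Portable: separate statements are sequenced, so the order is fixed.
std::string portable(int& counter)
{
    const int first = next_id(counter);
    const int second = next_id(counter);
    return describe(first, second);
}
```

The same approach applies to the overloaded-operator case: storing the results of the two operator calls in named locals before making the member function call pins down the order.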

Variable arguments

On the ARM architecture, all loads and stores are aligned. Even variables that are on the stack are subject to alignment. This is different than on x86 and x64, where there is no alignment requirement and variables pack tightly onto the stack. For local variables and regular parameters, the developer is well-insulated from this detail by the type system. But for variadic functions–those that take a variable number of arguments–the additional arguments are effectively typeless, and the developer is no longer insulated from the details of alignment.

The code example below contains a bug, regardless of platform. What makes it interesting in this discussion is that the way the x86 and x64 architectures express the behavior happens to make the code function as the developer probably intended for a subset of potential values, while the same code running on the ARM architecture always produces the wrong result. The example uses the cstdio function printf:

 // note that a 64-bit integer is being passed to the function, but '%d' is being used to read it.
 // on x86 and x64, this may work for small values since %d will "parse" the lower 32 bits of the argument.
 // on ARM, the stack is padded to align the 64-bit value and the code below will print whatever value
 // was previously stored in the padded position.
 printf("%d\n", 1LL);

In this case, the bug can be corrected by making sure that the correct format specification is used, which ensures that the alignment of the argument is considered. The following code is correct:

 // CORRECT: use %I64d for 64 bit integers
 printf("%I64d\n", 1LL);
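Note that %I64d is specific to Microsoft’s C runtime. If the code also needs to build with other toolchains, the standard %lld specifier is the matching format for a long long argument. The following sketch assumes a C++11 standard library that provides std::snprintf; format_i64 is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a 64-bit value using the standard %lld
// specifier, which matches the 'long long' argument actually passed.
std::string format_i64(long long value)
{
    char buffer[32];  // large enough for any 64-bit value, sign, and terminator
    std::snprintf(buffer, sizeof(buffer), "%lld", value);
    return std::string(buffer);
}
```

Because the format specification and the argument type agree, the variadic function reads the full 64-bit value from the correct, properly aligned position on every platform.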

Conclusion

Windows RT, powered by ARM processors, is an exciting new platform for Windows developers. Hopefully this blog post has exposed some of the subtle portability gotchas that might be lurking in your own code, and has made it easier for you to bring your code to Windows RT and the Windows Store.

Do you have questions, feedback, or perhaps your own portability tips? Leave a comment!