
Job change

It's been a long time since I've posted (six months!), but I thought I'd share a little news.

Effective next Monday (April 2), I will be leaving my current position on the C++ tools and beginning a new position working on the HLSL compiler here at Microsoft. HLSL is the High Level Shader Language used with the Direct3D APIs.

After the move, I'm hoping to get back to posting here regularly, only this time it will be on topics related to HLSL, Direct3D, graphics, and perhaps some posts related to the compiler itself.

It's a tremendously exciting move for me, after spending nearly 12 years working on C++ tools.

I'll be working on a very small team with some great talent, and learning a lot every day (which I love). I've already started cracking the books again, reading about graphics, HLSL, Direct3D, etc. I'll be starting through the SDK samples soon enough.

Okay, time to get back to work and finish up fixing those last few issues before moving on... :) 

Posted by mlacey | 0 Comments

VS2005 SP1 Beta

Okay, I haven’t exactly kept up with the posting lately, but I did want to post a link to the announcement of the VS2005 SP1 Beta release. The C++ team has fixed quite a few bugs for the service pack, and it’s worth checking out!

Posted by mlacey | 1 Comments

Anatomy of a code generation bug

As I mentioned in a recent post, from time to time, bugs slip past us and make it into the compilers that we ship.

Code generation bugs are especially difficult to live with, because they have the opportunity to affect every component that is built with our compiler, including Microsoft products, and products of our partners, ISVs, etc.

When reported to us, these bugs are generally treated as must-fix right up until the last few weeks before shipping a product. We call them “silent bad code” because you have no indication bad things are happening until you actually run across the bad code in your compiled application (which may not be until after you’ve shipped).

Most of these bugs happen with optimizations enabled (e.g. /O2, /O1, /Ox, /Oxs), and don’t happen otherwise.  However, sometimes we generate bad code even without optimizations enabled.

Such a bug was the topic of not one, but two posts on Kernel Mustard, after appearing as a post on OSR’s NTDEV forum.

This particular bug causes certain members of the Interlocked family of intrinsics to fail to operate as expected when the result is used in a comparison. For example:

#include <stdio.h>

long Value = 1;

#define InterlockedAnd _InterlockedAnd
extern "C" long _InterlockedAnd(long volatile *, long);
#pragma intrinsic(_InterlockedAnd)

int
main()
{
    if (InterlockedAnd(&Value,1)) {
        puts("pass");
        return 0;
    } else {
        puts("fail");
        return 1;
    }
}

 

A compiler with the bug generates code that, when run, always prints “fail” and returns 1.

The code generated should test the old value of Value and branch based on that. Instead, it branches based on the result of the CMPXCHG instruction that was generated as part of the intrinsic. Since the intrinsic is generated such that it loops until the CMPXCHG compares equal, the branch-on-equal (which is a synonym for branch-if-zero) that takes us to the fail path is always taken.

Here’s the bad code:

      mov   ecx, 1
      mov   edx, OFFSET ?Value@@3JA             ; Value
      mov   eax, DWORD PTR [edx]
$LN5@main:
      mov   esi, eax
      and   esi, ecx
      lock  cmpxchg DWORD PTR [edx], esi
      jne   SHORT $LN5@main
      pop   esi
      je    SHORT $LN2@main

 

There should be a TEST EAX, EAX before the JE.

 

It turns out we fixed this bug in the C/C++ compiler shipped in Visual Studio 2005.

So what was the issue?

Well, there is a machine-specific optimization that attempts to remove compares that it decides are not needed. I won’t go into all the details of how it does that and which cases it manages to handle, but I will describe what the bug was in this case.

But wait, I thought optimizations weren’t enabled?

Well, they aren’t, but there are a few code generation tweaks like this one that happen even without optimizations enabled. Not many, but a few…perhaps a few too many? Certainly in this case…

The way the intrinsic is “modeled” in the code generator is as a special instruction which sets a return value and updates the condition code, e.g.:

temp, CC = InterlockedAnd &Value, 1

In this case the optimization made the erroneous assumption that if the value being tested by the CMP was set by the same instruction which previously set the condition code, and we’re comparing against zero, then the condition code was set by that previous instruction in the same way that the CMP would set it, and the CMP isn’t needed (this is vastly oversimplifying pages of code).  However, in this case the condition code (CC) is set not based on the value of temp, but rather on whether Value has changed since it was first read from memory.
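To make the two results concrete, here is a minimal sketch of the loop the intrinsic expands to, written against the portable std::atomic API rather than the real intrinsic. The names interlocked_and_sketch and run_example are made up for this illustration, and the real code generator emits the loop inline rather than calling a function. The bool returned by compare_exchange_weak plays the role of the condition code (CC), and the returned old value plays the role of temp.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for _InterlockedAnd, built from a compare-exchange
// loop. Returns the value the target held before the AND.
long interlocked_and_sketch(std::atomic<long>& target, long mask)
{
    long old_value = target.load();
    // compare_exchange_weak returns true only when the swap succeeded; on
    // failure it reloads old_value and we try again. That bool corresponds
    // to the condition code left behind by LOCK CMPXCHG.
    while (!target.compare_exchange_weak(old_value, old_value & mask)) {
        // retry with the freshly loaded old_value
    }
    // The loop only exits on success, so "did the exchange succeed?" is
    // always true at this point. The caller must branch on old_value.
    return old_value;
}

// Mirrors the test program above: returns 0 for "pass", 1 for "fail".
int run_example()
{
    std::atomic<long> value{1};
    long old_value = interlocked_and_sketch(value, 1);
    assert(value.load() == 1);  // 1 & 1 leaves the stored value unchanged
    return old_value != 0 ? 0 : 1;
}
```

Branching on the loop's exit condition instead of on old_value is exactly the mistake described above: the exit condition is the same on every run, so the branch always goes the same way.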

I believe the simplest and most fool-proof work-around for the issue would be to create your own wrappers for these intrinsics, and declare them __declspec(noinline), or perhaps create a function that just returns the value passed to it, and declare that __declspec(noinline), and then call that with the intrinsic as the argument. More explicitly:

Work-around 1:

long __declspec(noinline) MyInterlockedAnd(long volatile *p, long v)
{
    return _InterlockedAnd(p, v);
}

 

Work-around 2:

long __declspec(noinline) MyValue(long v)
{
    return v;
}

if (MyValue(_InterlockedAnd(&Value,1))) {
        ...
}

So what intrinsics did this bug affect in Visual C++ 7.1 (aka Visual Studio 2003, aka cl.exe version 13.1)?

From browsing through the code and from a little bit of my own testing, it looks like the bug happens with:

· _InterlockedAnd()
· _InterlockedOr()
· _InterlockedXor()

If you have found other functions which are affected with VC++ 7.1, please let me know and I’ll update this page.

If you find similar issues with Visual Studio 2005, please let us know by following the instructions in that previous post.

 

Design for efficiency, code for clarity. Measure twice, optimize once. Repeat.

Rico Mariani’s blog has a link today to an article in ACM Ubiquity called The Fallacy of Premature Optimization.

It’s a good read – I recommend it.

My motto has always been:

Design for efficiency, code for clarity.

That isn’t to say that you shouldn’t consider efficiency while coding, but rather that  your primary concern while coding should be clarity.

The corollary, inspired by a presentation (by RicoM) at a class I attended recently:

Measure twice, optimize once. Repeat.

When you decide you have a performance problem, make sure you understand where it really is, and then make the changes you suspect will fix it. Measure again, make sure your changes had an effect, and then identify the next area of concern.

 

Posted by mlacey | 2 Comments

"Hello, world!" - The Details - Part 2

Previously, I showed a portion of the code generated when compiling a variation of “Hello, world!” using the x64-targeting compiler from Visual Studio 2005.

I observed that some instructions, like JMP and JNE, had offsets directly encoded into the instructions, whereas others, like LEA, CALL, and CMP, didn’t.

The reason behind this is that at compile time we know the relative positions that the JMP and JNE are targeting, because they are within the same function, and we are generating the code at that time, so we know exactly where they will end up.

In theory, when compiling without /Gy (which if you’ll recall instructed the compiler to put each function in its own .text section), we could emit the displacements for any code that was being generated during the same compile and going into the same .text section.  We don’t, but in theory we could.

To refresh your memory, here is what the generated code for main looked like:

main  PROC                                ; COMDAT
; Line 10
$LN5:
  00000     48 83 ec 28 sub  rsp, 40                ; 00000028H
; Line 11
  00004     48 8d 0d 00 00
      00 00       lea  rcx, OFFSET FLAT:$SG2147
  0000b     e8 00 00 00 00    call puts
; Line 12
  00010     83 3d 00 00 00
      00 00       cmp  DWORD PTR i, 0
  00017     75 09       jne  SHORT $LN2@main
; Line 13
  00019     e8 00 00 00 00    call pass
  0001e     eb 07       jmp  SHORT $LN3@main
; Line 14
  00020     eb 05       jmp  SHORT $LN1@main
$LN2@main:
; Line 15
  00022     e8 00 00 00 00    call fail
$LN1@main:
$LN3@main:
; Line 17
  00027     48 83 c4 28 add  rsp, 40                ; 00000028H
  0002b     c3          ret  0
main  ENDP

Looking first at LEA, it’s loading the address of a symbol called $SG2147, which appears in our .cod listing and represents the “Hello, world!” string:

$SG2147     DB    'Hello, world!', 00H

The LEA instruction is currently encoded with a relative displacement of zero. The way that gets “fixed up” is by the linker coming back and patching the instruction once it knows where both the LEA instruction and $SG2147 are going to end up in the generated image.

How does the linker know what needs to be “fixed up”?

Well, the compiler emits this information when it emits the code. Specifically, it emits relocations, often called fix-ups, which the linker reads and uses to patch the code while it generates the final image.

By doing:

dumpbin /disasm /relocations /section:.text /symbols hello.obj > hello.obj.d

we can examine the relocations generated for the .text sections of hello.obj.

Here is what the relocations for main look like:

RELOCATIONS #C
                                                Symbol    Symbol
 Offset    Type              Applied To         Index     Name
 --------  ----------------  -----------------  --------  ------
 00000007  REL32                      00000000         A  $SG2147
 0000000C  REL32                      00000000        16  puts
 00000012  REL32_1                    00000000         5  i
 0000001A  REL32                      00000000         F  pass
 00000023  REL32                      00000000        1C  fail

 

Note that #C refers to the section that these relocations were generated for.  The offset specified is the code offset from the beginning of the section to which this relocation will be applied. The relocation type, in this case always REL32 or REL32_1, refers to how the final value that is patched into the machine code is going to be calculated. The “Applied To” column simply repeats the information that is currently encoded into the instruction, and is displayed for convenience. This is not in the relocation table stored in the object file. In this case it is always zero, and we will ignore this column. The symbol index is an index into the symbol table that is stored in the object file. The symbol name is displayed for convenience only. Like the “Applied To” information, it is not stored in the relocation table.

So what does this table tell the linker?

Well, in each of the REL32 cases, it tells the linker to emit the displacement from the current instruction pointer to the final location of the symbol that is referred to in the relocation. Well, that cleared things up, or maybe not…let’s look at an example.

In the case of the LEA of $SG2147, assuming that $SG2147 ends up 0x0001EFC5 bytes after this instruction in the final image, the linker would replace the zeros at offset 7 with that displacement (0x0001EFC5). Note that in the case of REL32, these displacements are signed 32-bit values, so if $SG2147 ended up before the code in the image, we would have a negative displacement.

Recall from the last installment that by “current instruction pointer” or “after this instruction” we really mean the value the instruction pointer has after fetching this instruction, so the address of the first byte after the LEA, not the address of the LEA, or the address at the beginning of the displacement.

So what about REL32_1?

Well, that is a variant of REL32. This other relocation type simplifies the job of the linker by letting it know how many bytes follow the fixed-up data, but are still a part of the same instruction.  Otherwise, the linker would need to know where the start of the instruction is, and would need to decode the instruction and determine how long the instruction is.

In this case, the REL32_1 is being applied to the cmp instruction, which is comparing a value against zero, which is encoded as the final byte of the instruction. Because of this one byte zero that follows the fixed-up displacement in the instruction, we need to use the REL32_1 relocation type.  The relocations that the linker supports are documented in the Microsoft Portable Executable and Common Object File Format Specification.
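The arithmetic behind both relocation types can be sketched in a few lines. This is only an illustration of the calculation described above, not linker source: rel32_displacement and trailing_bytes are hypothetical helpers, the section base of 0x1000 is an assumed value, and the addend already stored in the field is ignored because it is zero in this listing. The two type codes are the values the PE/COFF specification assigns for x64.

```cpp
#include <cstdint>

// x64 relocation type codes from the PE/COFF specification.
constexpr std::uint16_t kRel32   = 0x0004;  // IMAGE_REL_AMD64_REL32
constexpr std::uint16_t kRel32_1 = 0x0005;  // IMAGE_REL_AMD64_REL32_1

// Number of instruction bytes that follow the 4-byte displacement field:
// 0 for REL32, 1 for REL32_1.
constexpr std::int64_t trailing_bytes(std::uint16_t type)
{
    return static_cast<std::int64_t>(type) - kRel32;
}

// Hypothetical helper: the value to patch into the displacement field.
// fixup_address is where the 4-byte field ends up in the image, and
// target_address is where the symbol ends up.
constexpr std::int32_t rel32_displacement(std::uint64_t fixup_address,
                                          std::uint16_t type,
                                          std::uint64_t target_address)
{
    return static_cast<std::int32_t>(
        static_cast<std::int64_t>(target_address) -
        (static_cast<std::int64_t>(fixup_address) + 4 + trailing_bytes(type)));
}

// Assume, purely for illustration, that main's section lands at 0x1000.
// LEA: field at offset 7, instruction ends at offset 0xB.
static_assert(rel32_displacement(0x1007, kRel32, 0x100B + 0x1EFC5) == 0x1EFC5,
              "LEA displacement");
// CMP: field at offset 0x12, one immediate byte follows, so the next
// instruction is at offset 0x17.
static_assert(rel32_displacement(0x1012, kRel32_1, 0x1017 + 0x100) == 0x100,
              "CMP displacement");
// A symbol placed before the code gives a negative displacement.
static_assert(rel32_displacement(0x100C, kRel32, 0x1000) == -0x10,
              "backward reference");
```

The only difference between the two types is the extra byte added for REL32_1, which is what lets the linker find the end of the CMP without decoding it.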

I intend to come back to this simple program to illustrate more aspects of the object file format and the x64 ABI, so stay tuned.

 

The ins and outs of importing and exporting functions of a DLL

In case you've missed it and are interested, Raymond Chen has started a series on how DLL imports/exports work. I was planning on doing a similar write-up, but he beat me to it - I will say, though, I wasn't about to go talking about 16-bit Windows...
Posted by mlacey | 0 Comments

Reporting compiler & linker bugs

Despite the significant testing that goes into the C++ compiler & linker that are shipped in Visual Studio, from time to time users stumble on bugs that got past us.

Depending on whether the bug is in the compiler front-end (aka parser), back-end (aka code generator/optimizer), or linker, we need different information in order to proceed with reproducing and fixing the issue.

We call the combination of files and other information that you pass onto us the “repro case”.

For compiler front-end issues the repro case that we are expecting typically consists of a pre-processed file, as well as the command-line options you are using, and an explanation of the issue you are seeing (e.g., is it a compiler crash, syntax that we accept but shouldn’t, syntax that we reject but should accept, etc.). Pre-processed files are typically easy to generate by using the /P command-line option. The resulting file has the extension .i replacing the original extension (e.g., myfile.cpp becomes myfile.i).

For compiler back-end issues, we need everything mentioned above for the front-end issues, and sometimes more. If it’s a crash, the items specified for the front-end are sufficient (of course generally end-users don’t know whether it’s the front-end or the back-end that is crashing, so…).

If you suspect that we are generating incorrect machine code, we really need you to point out exactly what about the code is wrong.

Why? Well, sending us your project and saying, “It crashes when I compile with optimization on.”, puts us in the position where we now have to sit down and understand your code before we can even make any progress understanding what’s wrong with ours. It’s also often the case that the crash you’re seeing only when you’ve turned on the optimizer is actually due to a bug in your code that was only “exposed” with the optimizer enabled, and if that is the case you will realize this as you narrow down the cause of the crash.

It’s ideal in the case of suspected code generation bugs for you to provide us with a small, executable test case, and a clear explanation of what you think is wrong with the generated code. If it’s not possible to reproduce the issue you are seeing in a small test case, the next best thing is usually a preprocessed source file as mentioned above.

There is one important caveat here. If you are using whole-program optimization via the /GL compiler option and linker /LTCG option, you need to follow the steps below for reporting a linker bug when creating the repro case. Since profile-guided optimization is only available when using whole-program optimization, you will need to follow the linker repro case steps when using profile-guided optimization as well.

For linker issues, we have a nice little linker command-line option (which is also available in the form of an environment variable) which allows you to create a repro case by just linking as you normally would.

The name of the linker option is /LINKREPRO, and the environment variable is LINK_REPRO. The way you use them is to first create a directory where you want the repro case to go, e.g. c:\repro, and then either specify that directory with the command-line option, or set the environment variable to point to that location. For example,

                link /linkrepro:c:\repro …

or

                set LINK_REPRO=c:\repro
                link …

If you use the environment variable, then after you link you want to make sure you clear it before you continue working.

Now, what does this do?

It takes all of the objects and libraries that are referenced when you link, copies them to the directory that you specify, and creates a file called link.rsp which has all of the command-line options that you specified while linking.  You can now give us those files and we should be able to reproduce the issue you are seeing. Of course once again, you need to tell us what behavior you’re seeing that you believe is a bug.

Before reporting the issue to us, it is important to make sure that you can reproduce the problem using the repro case that you are going to provide. For the compiler issues, you can compile the preprocessed file using /TP or /TC depending on whether it is C++ or C respectively, additionally specifying the other command-line options you used to reproduce the issue. For linker issues, you can do:

                link @link.rsp

in the link-repro directory.

The place to report these issues is https://connect.microsoft.com/VisualStudio/Feedback. It changes from time to time, so this link might be dead when you come across this page. Hopefully that doesn’t deter you from reporting the issue.

In summary, when reporting compiler/linker bugs, here’s what we need from you:

Front-end

Compile with /P, give us the resulting .i file with an explanation of what you believe the problem is, and the command-line options you used when compiling.

Back-end

If you are reporting a code-generation/optimization issue we need an explanation of where you think the bad code is being generated. Ideally, you provide us with a small executable test case that demonstrates the issue, but we won’t be counting on that.

If you are not using whole-program optimization or profile-guided optimization, we need the same information as specified above for a front-end bug.

If you are using whole-program optimization or profile-guided optimization, we need the same information as specified below for a linker bug.

Linker

Generate a link repro case using either the /linkrepro command-line option or the LINK_REPRO environment variable, and send us the files that are copied into the target directory. It’s best if you pack these files up into a well-known compressed archive format like .ZIP so it is easy to provide to us.

 

Variadic functions and portability

I tracked down a bug in one of the tests for our code generator today. It’s a great example of why not to use variadic functions (also known as varargs functions) if you can possibly avoid them, and why, if you do use them, you should be very careful and consider the future portability of your code.

The test code was something like this:

#include <stdarg.h>
#include <memory.h>

int global;

int
varargs(int i, ...)
{
    va_list args;

    va_start(args, i);

    for (;;) {
        void *p = va_arg(args, void *);
        if (p == 0) {
            break;
        }
        size_t l = va_arg(args, size_t);
        memset(p, 0, l);
    }

    return global;
}

int
test()
{
    char *buf = new char[global];
    int i, j;
    return varargs(1, &i, sizeof(i), &j, sizeof(j), buf, global, 0);
}

int
main()
{
    global = 16;
    if (test() == 16) {
        return 0;
    } else {
        return 1;
    }
}

 

Spot the defect?

Well, global is an int, but varargs() is expecting pairs of arguments of type void * and size_t after the first argument. We end up passing a pointer and an int rather than a pointer and a size_t. On 64-bit Windows, int has a size of four bytes and size_t has a size of eight bytes.

As a result, when writing global to memory while pushing arguments, we decide to only write four bytes, but when retrieving it via va_arg we attempt to load eight bytes. The upper four bytes of those eight bytes hold whatever garbage happened to be there before we got to this point in the code.
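One way to repair the call, sketched below, is to give every variadic argument exactly the type that va_arg reads it as. This is an illustration rather than the exact change made to the test. Note that the terminating 0 in the original has the same problem as global: it is passed as an int but read as a void *, so the sketch passes a null pointer of pointer type instead. It also adds the va_end and delete[] that the original left out. The names varargs_fixed, test_fixed and run_example are made up for this example.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstring>

int global;

// Zeroes each (pointer, length) pair until a null pointer is read.
int varargs_fixed(int i, ...)
{
    va_list args;
    va_start(args, i);
    for (;;) {
        void *p = va_arg(args, void *);
        if (p == nullptr) {
            break;
        }
        std::size_t l = va_arg(args, std::size_t);
        std::memset(p, 0, l);
    }
    va_end(args);
    return global;
}

int test_fixed()
{
    char *buf = new char[global];
    int i = -1, j = -1;
    // sizeof already yields size_t; global is an int and needs a cast.
    int result = varargs_fixed(1,
        static_cast<void *>(&i), sizeof(i),
        static_cast<void *>(&j), sizeof(j),
        static_cast<void *>(buf), static_cast<std::size_t>(global),
        static_cast<void *>(nullptr));
    assert(i == 0 && j == 0);  // both ints were cleared by memset
    delete[] buf;
    return result;
}

// Hypothetical driver standing in for main(): 0 on success, 1 on failure.
int run_example()
{
    global = 16;
    return test_fixed() == 16 ? 0 : 1;
}
```

The casts look noisy, which is rather the point: with a variadic function the compiler cannot check or convert the arguments for you, so the caller has to spell out every type.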

 

Posted by mlacey | 0 Comments

The Soaring Price of Wine

As I sit at home drinking a glass of relatively inexpensive 2001 Castelnau de Suduiraut Sauternes (a sweet, white, Bordeaux wine for the uninitiated), I can’t help but consider the significant increase in prices seen in Bordeaux futures over the last five years. For that matter, wine from all over the world seems to be increasing in price at a rate that is much higher than the rate of inflation.

 

Note that I said “relatively inexpensive”. I believe most people would consider spending $30, or about $1.18 an ounce, on a beverage very expensive, but compared to the $420, or $16.56/oz. that you might pay for 2001 Chateau d’Yquem, it seems like a bargain. Not that there is any comparison – Yquem is truly a singular experience from the few times I’ve had it.

 

You can buy many Bordeaux wines in a futures market that allows you to purchase wines two years in advance of release (one year after the vintage). Sometimes it would seem that this is an excellent opportunity – you purchase a wine at one price, and when it’s released it’s released at a higher price, and you’ve saved some money (well, perhaps, if you consider inflation).

 

Sometimes, though, you are effectively paying the same, or perhaps even a larger amount, ahead of time, and then the wine is actually released at or below what you paid two years prior. It’s a risk – you have to decide if you want to take it.

 

For the much-heralded 2000 and 2005 Bordeaux releases, I decided to take it, but in both cases I didn’t purchase any of the high scoring, top priced wines. I purchased lower-end wines knowing that even if the price didn’t go up, it guaranteed that I’d get my hands on some bottles that I could lay down for a couple decades, at a price that I thought was reasonable.

 

In 2000, you could have purchased some of the top wines, like Lafite-Rothschild or Margaux, for perhaps $300/bottle in futures (I think I saw more like $330, but I’ve heard people swear that they saw $279, so let’s split the difference, more or less). For 2005, you could find these for $550. This is two years before you can even lay your hands on it, and perhaps 20+ years before you can even enjoy it (2005 Bordeaux are expected to be very long lived, but also very tannic, which means that they could be difficult to really enjoy for several years to come).

 

What motivates this 83% price increase in five years? This is around 13% per year increase in price.

 

Apparently, they are selling well, so the demand is out there. There are rumors that this is primarily coming from emerging markets like Russia and Asian countries, which have seen their incomes grow significantly in the last several years.

 

This isn’t even the top of the line, either. One of the most sought-after wines in the world, produced in very small quantities, is going for $2400 per bottle. In 1982, another wonderful year, it was $50 a bottle, which at the time I’m certain also seemed insane for a beverage that you might be enjoying now, 24 years later, over a meal.

 

What else has increased 48x in price over 24 years?

 

Posted by mlacey | 1 Comments

"Hello, world!" - The Details - Part 1

Today let’s take a peek into a variation on “Hello, world!” to see what is going on under the covers of this simple program with the goal of understanding a little about the process of compiling and linking code as well as what the machine is really doing when it runs this program. I will use x64 to illustrate the example. Even if you barely know what a register is, I think it shouldn’t be too hard to follow this example.

First, we can look at the source code:

#include <stdio.h>

int pass() { puts("pass"); return 0; }
int fail() { puts("fail"); return 1; }

int i;

int
main()
{
    puts("Hello, world!");
    if (i == 0) {
        return pass();
    } else {
        return fail();
    }
}

 

I’ve chosen to add a few embellishments to the “standard” version of this program to help illustrate some parts of the code generation and linking process.

I will save this file as a .c file and use the C compiler in Visual Studio 2005 to compile this via the command line: cl /Gy /FAc /Fm /Zi hello.c

Compiling as C rather than C++ is really arbitrary in this case.

First let’s look at the .cod file produced by using the /FAc option. We will focus on the body of the code for main().

main  PROC                                ; COMDAT
; Line 10
$LN5:
  00000     48 83 ec 28 sub  rsp, 40                ; 00000028H
; Line 11
  00004     48 8d 0d 00 00
      00 00       lea  rcx, OFFSET FLAT:$SG2147
  0000b     e8 00 00 00 00    call puts
; Line 12
  00010     83 3d 00 00 00
      00 00       cmp  DWORD PTR i, 0
  00017     75 09       jne  SHORT $LN2@main
; Line 13
  00019     e8 00 00 00 00    call pass
  0001e     eb 07       jmp  SHORT $LN3@main
; Line 14
  00020     eb 05       jmp  SHORT $LN1@main
$LN2@main:
; Line 15
  00022     e8 00 00 00 00    call fail
$LN1@main:
$LN3@main:
; Line 17
  00027     48 83 c4 28 add  rsp, 40                ; 00000028H
  0002b     c3          ret  0
main  ENDP

 

On the lines that have instructions, the numbers to the far left are the offset (in hex) of that instruction from the beginning of this section. In this case I’ve used the /Gy compiler option which puts each function in its own .text section, so the offset will start at 0.

The numbers to the right of this (and sometimes spilling onto the next line) are the instruction encoding. Following that is the text form of the “opcode” as well as its arguments. Some lines have a semicolon followed by text – these are comments.

The first generated instruction in main (subtracting 40 from RSP) is part of the “prologue” of the function – code that follows an agreed-upon convention and sets up state for the remainder of the function.  For x64 on Windows 64, the agreed upon conventions can be found here. This first instruction just reserves some space on the stack which is pointed to by the register RSP, the stack pointer. The add of 40 back to RSP and final RET are the “epilogue” of this function.

Explanations of the instructions and instruction encodings for x64 (also known as AMD64 or EM64T) can be found on both the AMD and Intel sites.

The interesting thing to note in this example (and the motivating factor for modifying “Hello, world!” from its standard form) is that the JMPs and JNE here have relative offsets encoded into the instructions whereas the CALLs don’t.

What do I mean by that? Well…

  00017     75 09       jne  SHORT $LN2@main

is encoded as the JNE instruction (0x75) followed by a relative offset of 9, meaning that if the jump is taken, we want the processor to add 9 to the value of the instruction pointer (the RIP register).  That raises the question of exactly when the 9 is added. It’s added after the instruction pointer has been updated to account for the JNE having been fetched, so this is really adding 9 to 0x19 (the offset of the instruction after the JNE), resulting in 0x22, the code offset of the $LN2@main label.
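The same arithmetic works for every short jump in the listing, which makes for a quick sanity check. The helper below is a made-up name for this illustration. It adds the 8-bit displacement to the offset of the next instruction, which is the jump's own offset plus its two-byte length.

```cpp
#include <cstdint>

// Hypothetical helper: section offset that a short (8-bit relative) jump
// lands on when taken. The displacement is applied to the offset of the
// *next* instruction, which is this instruction's offset plus its length.
constexpr std::uint32_t short_jump_target(std::uint32_t instruction_offset,
                                          std::uint32_t instruction_length,
                                          std::int8_t displacement)
{
    return static_cast<std::uint32_t>(
        static_cast<std::int64_t>(instruction_offset) + instruction_length +
        displacement);
}

// The three short jumps from the listing, checked at compile time.
static_assert(short_jump_target(0x17, 2, 0x09) == 0x22, "jne -> $LN2@main");
static_assert(short_jump_target(0x1e, 2, 0x07) == 0x27, "jmp -> $LN3@main");
static_assert(short_jump_target(0x20, 2, 0x05) == 0x27, "jmp -> $LN1@main");
```

Both JMPs land on offset 0x27, which matches the listing: $LN1@main and $LN3@main label the same instruction.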

In the next part, we’ll look at why the CALLs and CMP do not specify an address, and how we end up with addresses encoded into those instructions before the final image is generated.

Let’s get transactional

Windows Vista has some exciting new technologies that aid developers in creating robust applications which isolate the impact that one failure can have on other parts of the application as well as making the “clean-up” paths of an application simpler.

The technologies I’m thinking of are the Transactional File System (TxF), Transactional Registry (TxR), and the technology those are built on, the Kernel Transaction Manager (KTM). A closely related technology is the Common Log File System (CLFS), which is also at the foundation of TxF and TxR.

MSDN Magazine has a nice article which discusses these technologies.  There is also a Channel 9 video on TxF, as well as an MSDN webcast. The team that produced this technology has a developer blog which you might want to follow if topics like this interest you.

The MSDN Magazine article gives a straightforward example that illustrates how this technology is useful. The gist of the example is that if you have two dependent actions, like updating the registry to point to the location of a file and moving or perhaps creating a file, you need to make sure that either both happen or neither happens, and that nobody can see an intermediate state where one change or the other has happened, but not both.

The way this is achieved is by creating a transaction, making this transaction active for the current thread, performing the operations (which are now effectively queued rather than being performed at the time of the API call), and then if all is successful, committing the transaction. If any action fails during the transaction, it can be rolled back with one API call. The sorts of failures you might imagine could include a failed API call, a crash in your code, a crash in code that you’re calling (e.g. a DLL that is provided to you from an external vendor), a catastrophic event like hardware failure or power loss, etc. You can now focus on the work you are trying to perform rather than focusing on all of the failure logic.
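The shape of that pattern can be shown with a toy, in-memory model. To be clear, this is not the KTM, TxF or TxR API: every name below is made up, the "registry" and "file" are just keys in a std::map, and the toy commit gives none of the crash or power-loss guarantees the real technology provides. It only illustrates the idea of queuing changes and then either applying all of them or none of them.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Store = std::map<std::string, std::string>;

// Toy model of the pattern only; all names are hypothetical.
class ToyTransaction {
public:
    explicit ToyTransaction(Store& store) : store_(store) {}

    // Queue a change instead of applying it right away.
    void set(const std::string& key, const std::string& value)
    {
        pending_.emplace_back(key, value);
    }

    // Apply every queued change.
    void commit()
    {
        for (const auto& change : pending_) {
            store_[change.first] = change.second;
        }
        pending_.clear();
    }

    // Discard every queued change; the store is left untouched.
    void rollback() { pending_.clear(); }

private:
    Store& store_;
    std::vector<std::pair<std::string, std::string>> pending_;
};

// Updates a "registry" entry and a "file" entry together. Returns true if
// both changes were committed, false if the transaction was rolled back.
bool move_file_sketch(Store& store, const std::string& new_path,
                      bool simulate_failure)
{
    ToyTransaction tx(store);
    tx.set("registry:FileLocation", new_path);
    if (simulate_failure) {
        tx.rollback();  // one call undoes everything queued so far
        return false;
    }
    tx.set("file:" + new_path, "contents");
    tx.commit();
    return true;
}

// A failed attempt leaves no trace; a successful one shows both changes.
int run_example()
{
    Store store;
    assert(!move_file_sketch(store, "c:\\data\\a.txt", true));
    assert(store.empty());
    assert(move_file_sketch(store, "c:\\data\\a.txt", false));
    assert(store.size() == 2);
    return 0;
}
```

Notice that move_file_sketch has a single failure path no matter how many changes were queued before the failure, which is the simplification of clean-up code that the paragraph above describes.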

Of course this is a somewhat simplified view of things – this applies only to file and reg