Welcome to MSDN Blogs Sign in | Join | Help

Runtime Code Patching - Not for the Faint of Heart

I have been involved in several conversations recently that have revolved around the joys of runtime code patching. I am always shocked to hear people say that they are ok with this idea of code patching at runtime. Moreover – it shocks me that they think it is easy to get right! I do think that code patching can have its place in a system if it is implemented correctly and if its intent and semantics are fully disclosed to any potential users.

So what exactly is code patching? There are tons of examples of it on the web (mostly hacker sites! :D ) and many books (again – mostly hacker stuff) that describe it in detail – just “Live Search” “runtime code patching” , but basically patching is the process of replacing a set of op codes in memory with a different set of op codes – at runtime. This seems simple enough but the devil is in the details. Let’s dig in!

Consider these op codes from a simple and mostly useless function:

   01001175 8bff             mov     edi,edi

   01001177 53               push    ebx

   01001178 50               push    eax

   01001179 53               push    ebx

   0100117a 51               push    ecx

   0100117b 52               push    edx

   0100117c 5a               pop     edx

   0100117d 59               pop     ecx

   0100117e 5b               pop     ebx

   0100117f 58               pop     eax

   01001180 5b               pop     ebx

   01001181 c3               ret

 

This code is kind of silly – but it has nice properties that expose the problems with patching, more on that in a bit. It is also made up of very commonly generated op codes. The format of the assembly listing is:

 [address] [op codes] [mnemonics for op codes]

 

The typical reason for installing a patch is to either bypass or modify the behavior of an existing function. Therefore the canonical patch is “op codes for a jmp to an absolute address”, written over the beginning of an existing routine like so:

 

   01001175 ea871100011b00   jmp     001b:01001187

   0100117c 5a               pop     edx

   0100117d 59               pop     ecx

   0100117e 5b               pop     ebx

   0100117f 58               pop     eax

   01001180 5b               pop     ebx

   01001181 c3               ret

 

Here we replaced the first few instructions of our function with an absolute jump to some other code, presumably code that we wrote and loaded into the system in some way.

 

So now that we know what a patch is – what is the problem? This seems like it will work? What exactly is it that we need to worry about?  Well, the most obvious thing we have to worry about is another thread running the code that we are patching, while we are patching it. The reason we need to worry about this is because if we modify the code that another thread is running, while it is running it – it will crash, or at the very least do very strange things. We need to make sure that any thread running the code we are patching either executes all of the old op codes or just the new one – but never a combination of the two. You can imagine a scenario like this:

 

1.       Thread T1 executes the first instruction from the old op codes at address 0x01001175

2.       The old op codes are overwritten with the new "jump" op code, by thread T2

3.       Thread T1 executes the op code at 0x01001177, which is now an address in the middle of the jump instruction that T2 wrote over old op codes

 

This would be really bad! J

 

So what to do? The next logical progression in analyzing the issue goes like this:

 

“Well, the problem is that we have a race with other threads running the code we are patching,  so I’ll just make sure that no other threads are in that code when I do the patch!”.

 

 Perfect! Well, not quite. Although we can corral all the other running processors (using DPCs, IPIs, etc.) and make sure that none of the running threads are in the code we are going to patch. But believe it or not this still isn’t enough to make our patch solid! To clarify, let’s restate the problem with our patching approach: we need to synchronize with all of the other threads currently running the code we are tying to patch. However, the latest incarnation of methodology, will only cause us to synchronize with all the threads currently running on other processors. What about the threads that aren’t running? Is there any synchronization required there? Actually and unfortunately – yes. The problem is – we can have threads that aren’t currently running but that have the address of the old op codes, where our new instruction is now residing, saved away for future use. This can occur if a thread was context swapped while executing the code we want to patch (when a thread is context swapped, the current instruction pointer is saved away so that it can be restored when the thread runs again). Other things that can cause the instruction pointer to be saved away are: exceptions, interrupts, etc..

 

So what is the moral of the story here? Don’t patch code? Well that may be a little extreme, but the moral is at least to never patch multiple instructions. Working in the Windows Online Crash Analysis (OCA ) data for a long time now, I can tell you that I have seen many failed attempts at doing this correctly that have ended in a BSOD. J

 

In fact, if you look at post Server 2003 Windows binaries, you will notice that the generated code follows this format almost exclusively:

 

7731a321 90               nop

7731a322 90               nop

7731a323 90               nop

7731a324 90               nop

7731a325 90               nop 

7731a326 8bff             mov     edi,edi

7731a328 55               push    ebp

7731a329 8bec             mov     ebp,esp

… <more op codes here>

 

This type of code generation allows for a patch to be installed safely at runtime. The 2 bytes at the start of the function (mov edi, edi) is enough space to hold the op code for a “relative short jump”, which be crafted to jump to the address of the 5 bytes of nop op codes. These bytes would have been previously overwritten with a “jmp ADDRESS” op code. This was a conscious change that was made to allow servicing of existing binaries on long running machines, without having to reboot to replace the binaries. The same rules apply though – the other processors must be corralled in order to make sure no thread is active in the code we want to patch at the time the patch is applied. Again – the reason this methodology works is because the op code boundaries match. In other words, either code will execute the “mov edi, edi” or the “short relative jump”. If it executes the former, then the routine runs normally through the code. If it executes the latter, then it will jump to the patch installed over the nop op codes. No problem here. There are issues with the instruction caches on processors as well, but those issues can be managed within the corralling mechanism. One last point is, we don’t have to worry about the “normal path” code running the patch that was written over the  nops because there is no path to that code except via the “short relative jump” that we wrote over the first instruction. Nice huh?!

 

Well, I hope this has been somewhat interesting or useful in some way. Please let me know if there are any other things that would be interesting to talk about.

Posted by theelvez | 10 Comments

Getting the Crashing Stack From a Bugcheck

Sorry for the long delay on posting - I have been slammed lately. I decided to write a post about debugging and take a short break from the bare metal stuff we have been discussing as of late. :)


When a bugcheck occurs in Windows, the following basic sequence of events occurs:
 
1. An exception ("exception" is used in the loosest sense of the term here) occurs that causes the system to call KeBugCheck()
 
2. The KeBugCheck routine runs on the thread that called KeBugCheck()
 
3. The BSOD shows up on the screen
 
This seems like totally useless information so far - and it is. However, for anyone that has ever had to debug a system crash - this sequence can lead to frustration. Why? Because sometimes you need to see what happened as the bugcheck occurred and you need to examine the stack to find valuable clues to aid your debug efforts. Unfortunately you can't see what the stack held because KeBugCheck() ran and overwrote the stack and all its juicy details. This is a real pain! After encountering this problem for the 11th time I added some code to the bugcheck path for Vista and forward that tries to rectify this problem. The additional code saves the current stack off to a separate area in non-paged pool before the bugcheck code runs. This is really useful for debugging in many situations as the bugcheck code pretty much re-uses the whole stack. This is especially useful for finding out what calls were active at the time of the bugcheck or for digging out stack values.
 
So how do you use it? Well - it is very easy actually. There is a global public symbol in ntoskrnl.exe (et al) called KiPreBugcheckStackSaveArea.
 
From the debugger you can just do this:
 
kd> dds KiPreBugcheckStackSaveArea KiPreBugcheckStackSaveArea+3000
 
This will dump the stack as dwords (you will need to use "dqs" for 64 bit and it will dump qwords) and resolve symbols if the dwords are within the range of a loaded module. The "+3000" comes from the public header value KERNEL_STACK_SIZE. This allows us to dump the entire stack. Here is a snippet from a dump from my machine:
 
...
8192bc64  807c70cb storport!RaidpAdapterContinueScatterGather
8192bc68  ad5df8b0
8192bc6c  818b9c58 nt!KeQueryCurrentStackInformation+0xb7
8192bc70  87087030
8192bc74  00000008
8192bc78  ad5e0000
8192bc7c  ad5dd000
8192bc80  ad5dfdb8
8192bc84  ad5e0000
8192bc88  00000000
8192bc8c  00000000
8192bc90  ad5df8d0
8192bc94  818d98e1 nt!KiSavePreBugcheckStack+0x66
8192bc98  819293e0 nt!KiPreBugcheckStackSaveArea
8192bc9c  ad5dd000
8192bca0  00003000
8192bca4  ad5e0000
8192bca8  00000003
8192bcac  ad5dd000
8192bcb0  ad5dfc78
8192bcb4  818d8c2d nt!KeBugCheck2+0x7a
8192bcb8  00000000
...
 
Notice that you can see the place on the stack where KeBugCheck was called and KiPreBugcheckStackSaveArea was pushed onto the stack. You can look from this address on the stack and higher to see what was running when the bugcheck occurred (remember the stack grows down to lower addresses). There are a few caveats (aren't there always!). One is that if there are parallel bugchecks occurring - the stack save area only records the first one. I am hoping that the probability of parallel bugchecks is low - although they can occur especially in the face of a machine check. The other caveat is that the addresses of the save area are not the same as the stack. This may seem really obvious but it is an important point to call out as it has bitten me several times! :)
 
Well - I hope that this is useful information for at least one person. Thanks for reading!

Posted by theelvez | 3 Comments

It Goes to Eleven and ... to the NT Insider!

Well - for anyone bored enough to track such things, I have been pretty slammed lately and haven't blogged anything. I have a bunch of stuff queued up though. Upcoming posts (in the next few days hopefully) are:

1. The Anatomy of a Context Switch

2. The Anatomy of an Interrupt

3. The Anatomy of a Rootkit

Also – the guys over at OSR were nice enough to ask me to write an article for them. It is going to be in the March/April edition of the Insider.

I am also looking for topics of interest to the community at large if anyone has suggestions – please send them my way. Thanks!

Posted by theelvez | 3 Comments

Why Your User Mode Pointer Captures Are Probably Broken

There is a problem that I suspect is pretty widespread in the majority of driver code. The problem is the improper capturing of user mode pointers. I decided to write a blog about it and try to get a feel for if I am right or not. J  I figure that if people comment with “of course we knew that you moron!” then I’ll assume that I am totally wrong. If not then I hope this will help at least one person.

User mode pointers passed to kernel mode must point to data that is wholly contained in user mode address space. Checking this property of a user mode pointer is called probing. Contrary to popular belief a probe does not touch the memory pointed to by the pointer, it just does an address range calculation on it. The calculation is basically “(pointer + LengthOfDataPointedTo) must be less than the highest legal user-mode (UM) address”. The reason we need to probe is to make sure that a UM component can’t write or read kernel space. When a user passes a pointer into a kernel mode component, the pointer is copied onto the kernel stack as part of the calling mechanism (sometimes called a “system call”, sysenter, a “trap”, etc.). The user has no way to change the value of that variable once it is passed in – so we can validate the pointer with confidence – since we know that its value won’t change underneath us. However, if that pointer is a pointer to a structure with embedded pointers, those internal pointers can be changing asynchronously from another thread. This is problematic because we need to validate all of the embedded pointers in a passed in structure. This is where capturing comes in to play. We capture the embedded pointers by storing their value in kernel mode space – usually the stack – by reading the embedded pointers through the already validated pointer. Once the embedded pointer is captured – we probe it, lather – rinse – repeat for the entire depth of the embedded pointers tree.

Consider this code:

typedef struct _USER_DATA {

    PULONG_PTR Data1;

    PULONG_PTR Data2;

} USER_DATA, *PUSER_DATA;

 

NTSTATUS

Foo(

    PUSER_DATA Data

    )

{

    PULONG_PTR CapturedData1;

    ULONG_PTR Data1Value;

    PULONG_PTR CapturedData2;

    ULONG_PTR Data2Value;

   

 

    //

    // See if this is user mode –in a driver PreviousMode

    // would normally be read from a field in the IRP and the

    // pointer would come from the Type3InputBuffer field

    //

 

    if (ExGetPreviousMode() != KernelMode) {

   

        try {

 

            //

            // Probe the passed in structure

            //

 

            ProbeForRead(Data,

                         sizeof(USER_DATA),

                         __alignof(USER_DATA));

 

 

            //

            // Capture the embedded pointers

            //

 

            CapturedData1 = Data->Data1;

            CapturedData2 = Data->Data2;

 

 

            //

            // Probe the first captured pointer

            //

           

            ProbeForRead(CapturedData1,

                         sizeof(ULONG_PTR),

                         __alignof(ULONG_PTR));

           

 

            //

            // Probe the second captured pointer

            //

           

            ProbeForRead(CapturedData2,

                         sizeof(ULONG_PTR),

                         __alignof(ULONG_PTR));

 

 

            //

            // Read the first embedded pointer

            //

 

            Data1Value = *CapturedData1;

 

 

            //

            // Read the second embedded pointer

            //

 

            Data2Value = *CapturedData2;

 

 

            //

            // More of your code here that does really cool stuff…

            //

 

        } except (EXCEPTION_EXECUTE_HANDLER) {

            return GetExceptionCode();

        }

    }

 

    return STATUS_SUCCESS;

}

Everything seems to be OK with this code. We probe the structure pointer, capture the embedded pointers to local variables and then probe them. But wait - let’s think about our ever important capture code a little deeper. The most important attribute of our capture code is that it stores the embedded pointer in a location where the user can’t modify it. If it didn’t, then we would be in really bad shape. So the question is – does our capture code in fact guarantee that the embedded pointers will be in a location such that they can’t be modified from user mode?  Unfortunately, the answer is NO! How is this possible? We told the compiler that we wanted to store the pointers locally. However – we didn’t do anything to tell the compiler that it was critical that they were stored locally. So the compiler can freely skip the local storage of the embedded pointers and just refetch them from user mode through the original pointer upon each reference, if it so chooses. This is potentially disastrous for our kernel mode code and not at all what we expected or intended.

So what can we do to get the behavior we expect? Well – we have to tell the compiler the truth about the code that we are writing. That’s right – the truth. We are lying in our code. Our code has implicitly told the compiler that our embedded pointers can’t change asynchronously. This is a lie. They can change. So in order for our code to be correct, we need to change it to a truthful representation of itself. How do we tell the compiler that our pointed to structure can change? Well – there are a couple of ways. The most straightforward way is to mark the passed in parameter with the keyword volatile. volatile when applied to a parameter or variable definition, tells the compiler that that memory location’s contents can change asynchronously – so all reads and writes to it must really happen and in the order they are specified in the code. This facility was put into the C language to deal with code that reads and writes memory that can change in a different scope (i.e. hardware device registers, device memory, shared memory) and we can take advantage of its semantics for our user mode pointer captures. With hardware - a reordered, omitted or combined read or write could lead to real life disasters. For hardware, reads as well as writes have side effects; this is completely analogous to our code. A read can have the side effect of violating our security mechanism.

 

 

So we can fix our code by changing our routine like so:

NTSTATUS

Foo(

    volatile USER_DATA* Data

    );

 

By changing the pointer Data, to be a pointer to a volatile structure we will force all reads and writes to the structure to really happen in our code (bonus points for explaining why we can't use "volatile PUSER_DATA" as our parameter type instead of "volatile USER_DATA*" - aren't they the same thing? :D ). However, if we have existing interfaces that we must maintain - it prevents us from being able to do this. What to do? Well – there is another way to get the behavior we want. We can cast at the capture site. This technique is called using “volatile glasses”. Here is an example:

 

     //

     // Capture the embedded pointers

     //

 

     CapturedData1 = ((volatile USER_DATA*)Data)->Data1;

     CapturedData2 = ((volatile USER_DATA*)Data)->Data2;

 

This will cause the compiler to perform the capture as if the variable Data had been declared volatile. Using this technique prevents the compiler from re-fetching from the passed in pointer because we told the compiler the truth. We said “Hey compiler – this thing that Data points to can change asynchronously, so you’d better not be playing any funny games with it”. And the compiler will honor that. It has to if it honors the volatile keyword. We would then have to do the same thing for the internal reads as well:

 

     //

     // Read the first embedded pointer

     //

 

     Data1Value = *(volatile ULONG_PTR*)CapturedData1;

 

 

     //

     // Read the second embedded pointer

     //

 

     Data2Value = *(volatile ULONG_PTR*)CapturedData2;

 

Again, we are telling the compiler the truth here – that the ULONG_PTR value can change asynchronously and it needs to really capture it locally.

 

This is a really esoteric topic – but very important IMHO. Please let me know your thoughts on this – I am highly interested. Thanks!

Posted by theelvez | 9 Comments

How Does KeMemoryBarrier Work?

KeMemoryBarrier is a kernel DDK support macro. There is also a WIN32 macro called MemoryBarrier that is implemented identically (there is an observance test hidden here!) - so we will just talk about KeMemoryBarrier here, but everything we say about it also applies to MemoryBarrier.

If you read the doc for KeMemoryBarrier, it says this:

"The KeMemoryBarrier routine creates a barrier at its position in the code—across which the compiler and the processor cannot move any operations."

Sounds good! However, I like to see where the bodies are buried so let's just look and see exactly how KeMemoryBarrier does this. Here is its definition for the x86 platform - from the public wdk header file wdm.h:

FORCEINLINE
VOID
KeMemoryBarrier (
    VOID
    )
{
    LONG Barrier;
    __asm {
        xchg Barrier, eax
    }
}

Hmm - this looks a little worrying. I don't immediately see anything that will block either compiler reordering or processor reordering! What gives? Well - first off we need to look at other constructs that block compiler and processor reordering and see if we can infer something from them. KeMemoryBarrierWithoutFence is documented to block compiler reordering. It is defined as:

#define KeMemoryBarrierWithoutFence() _ReadWriteBarrier()

Ok - that doesn't help much since we don't know what _ReadWriteBarrier does. So let's look at it's definition:

VOID
_ReadWriteBarrier(
    VOID
    );

#pragma intrinsic(_ReadWriteBarrier)

Aha! _ReadWriteBarrier is an intrinsic - is that why it blocks compiler reordering?  The "Compiler Intrinsics" topic on MSDN includes this:

"Some intrinsics, such as __assume and __ReadWriteBarrier, provide information to the compiler, which affects the behavior of the optimizer."

Great! But wait - although this is interesting knowledge - it helps us exactly zero in our quest to dissect KeMemoryBarrier as KeMemoryBarrier is not intrinsic or a re-#define of an intrinsic. So what next? Well - maybe the fact that it has inline assembler means something? Let's go ask MSDN:

"The presence of an __asm block in a function affects optimization in several ways. First, the compiler doesn't try to optimize the __asm block itself."

Ok - still a little ambiguous - but it is looking like the fact that it is inline assembler is the reason that the compiler won't reorder around it - and in fact that is what is happening here. But that still leaves us with the looming question of how KeMemoryBarrier prevents processor reordering. KeMemoryBarrier only has one instruction in it:

xchg Barrier, eax

Is that enough to prevent processor reordering? Well - normally a locked operation or a serializing instruction on the processor is required to prevent processor reordering, but this isn't a locked operation or a serializating instruction - or is it? Let's go to the processor manuals. The Intel IA-32 instruction set manual has this to say:

If a memory operand is referenced, the processor’s locking protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or absence of the LOCK prefix or of the value of the IOPL. (See the LOCK prefix description in this chapter for more information on the locking protocol.) This instruction is useful for implementing semaphores or similar data structures for process synchronization. (See “Bus Locking” in Chapter 7 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information on bus locking.)

So that is the answer to the other half of the mystery. Although the lock prefix is not present in the code - it is implicit in the instruction itself because we are referencing memory. In my opinion it should be explicitly called out in the code as well for clarity but that is just me.

Another point worth noting is that the compiler defined processor barrier mnemonics (i.e. __mf, __mb) guarantee to follow the semantics defined in the manuals for that processor and also prevent compiler reordering. That explains why the definition of KeMemoryBarrier on the IA64 platform is simply:

#define KeMemoryBarrier() __mf()

So what have we learned? Well we have learned that KeMemoryBarrier (and MemoryBarrier) does exactly what it says it does - it prevents both compiler and processor reordering. We have also learned that at first glance (and even second glance for that matter) it doesn't appear to do either!

I hope that this topic has helped at least one person in some way shape or form. Thanks for reading.

Posted by theelvez | 4 Comments

The Joys of Compiler and Processor Reordering: Why You Technically Need Read-Side Barriers

In a previous post on compiler and processor reordering, I said that for multi-threaded, lock-free code to be totally correct - you need a barrier on the read sides of the code - but that it was pretty complicated and wasn't required on any processors that Windows currently runs on. I also said that if there were equally as sick-minded individuals as myself that wanted to dive into the specifics of why - that I would post about it. There are - as it turns out - a lot of sickos out there and this is that post. :)

Consider this code:

//

// Global data definition and initialization

//

 

ULONG x = 1;

ULONG* p = &x;

ULONG y = 0;

 

VOID

T1(

    VOID

    )

{

    //

    // Set y to 1

    //

   

    y = 1;

 

 

    //

    // Compiler and processor barrier that prevents reodering

    // and forces an “invalidate” message to be sent to all

    // processors caching locations that have been written to by

    // this processor since the last barrier and before this

    // one – in other words the variable y

    //

 

    MemoryBarrier();

 

 

    //

    // Set p to point to y

    //

 

    p = &y;

}

 

 

VOID

T2(

    VOID

    )

{

    ULONG i = 1;

 

    //

    // Read the contents of the memory location pointed

    // to by p and store its contents in i

    //

   

    i = *p;

 

    ASSERT(i != 0);

}

 

 

In this code T1() is the producer – it populates p with the pointer to some variable. T2() consumes p. From reading this code we can observe the following facts about this code:

 

- p will only ever point to x or y

- x will always have a value of 1

- y will initially be set to 0, then set to 1

- a memory barrier is issued by T1() after y is set to 1 and before p is set to point to y

 

Given these facts we should be able reason that we could never see i with the value of 0. However, there exists at least one processor where this is possible – the Alpha 21264. It is a narrow windowed race condition – but it is real – and if you wrote code that ran on that processor you would have to code within its model and worry about this exact problem. So how exactly could this occur on this processor? Let’s explore.

 

First - before T1() or T2() start to run, x, y and p are all initialized. Then T1() starts running on processor P1 and T2() starts running on processor P2. At this point any dereference of p will access the location containing the variable x. T1() then sets y to the value 1, but this does not mean that the outside world knows about this update to y yet. That comes when we issue the call to MemoryBarrier(). When MemoryBarrier() is called a message is sent to all the other processors’ probe queues. A probe queue is a per processor queue of addresses that have been invalidated by other processors. Each processor has its own queue and processes it independently of other processors. So in this case – when T1() calls MemoryBarrier() – a message is sent to P2 indicating that the memory location that holds the contents of the variable y, is no longer valid – it contains stale data and should not be used. Excellent! This is exactly what we want. The next thing T1() does is set p to hold the address of y. This is perfectly correct from the point of view of T1() and P1. y is fully initialized before it is published via p. This is good.

 

So how does T2() ever see i as 0? Let’s follow the flow of T2() (a very important piece of information is that P2 currently has y cached in its processor cache with the value 0). The first thing that T2() does is define and initialize i to the value of 1. Next it dereferences the address contained in p. What address will that be? We don’t know – it is a race condition. What we do know is that it will either be the address of x or the address of y. First let’s look at the case where p is pointing to x. In this case we know the value of i will be 1 because x is and has always been 1 since it was initialized. Not too interesting of a case. So let’s consider the more interesting case where p is pointing to y. If p is pointing to y then we know the following: y was initialized to 0, was then changed to 1 by T1() before T1() issued a MemoryBarrier() call. Now we need to know what a MemoryBarrier() call does on this processor. Here is what the manual says:

 

If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox waits until the integer queue is

empty and then performs the following actions:

a. Sends all pending MAF and IOWB entries to the system port.

b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed

events. When the counter decrements from one to zero, the Cbox marks the

youngest probe queue entry.

c. Waits until the MAF contains no more Dstream references and the SQ, LQ, and

IOWB are empty.

 

When all of the above have occurred and a probe response has been sent to the system

for the marked probe queue entry, instruction execution continues with the

instruction after the MB.

 

The most important parts are highlighted – basically that text says “When a processor issues a memory barrier – it stalls instruction execution on the local processor, sends a message to all other processors letting them know to flush their caches because they now have invalid entries in them, flushes its own probe queue and continues executing instructions.”. Armed with that knowledge we can finally complete this long, weird, memory system - trivial pursuit sojourn. When T2() reads *p it is a 2 step process. First it loads the value of p – in this case the address of y - and then reads from that address. Because P2 has a cache entry for y – the read gets a cache hit. WAIT! T1() issued a MemoryBarrier() before setting p to hold the address of y, so how could P2 get a cache hit on y when T1() running on P1 just invalidated it?! Well – as it turns out – on the 21264 processor - local cache hits for reads are allowed to bypass the probe queue on the local processor! That’s right – the cache hit for the read of y on P2 bypasses the probe queue message telling P2 that y is not valid! Crazy-ness abounds!! So why would you ever design a processor this way? Because it is efficient, fast and beautiful - no blocking on cached reads. And any real programmer would know to put memory barriers before shared reads! Duh!! J

 

It would be an outright nightmare to program to this model. I have never had to do it – thank GOD! But assuming you did have to – how would you fix the code so that it worked correctly? Like this:

 

VOID

T2(

    VOID

    )

{

    ULONG i = 1;

 

    //

    // Read the contents of the memory location pointed

    // to by p and store its contents in i

    //

   

    MemoryBarrier();   

 

    i = *p;

 

    ASSERT(i != 0);

}

 

The MemoryBarrier() in T2() forces P2 to drain its probe queue before continuing to execute instructions. This fixes the problem by not allowing the cache hit for the read to progress. This is also really weird because the memory barriers on this processor have local effects that are different that their global effects. And in some cases local memory barriers depend on remote barriers to complete their logical semantics. Like the local (local to P1) barrier – that depends on the remote barrier (local to P2 – but remote to P1) to complete the invalidate enforcement.

 

Well – I hope that I did a decent job explaining this problem and hope that it convinces you how hard it is to write correct lock-free code. Especially if it has to be portable to new processors that may have different memory barrier semantics than the processors you currently write code for. If I didn’t – or it didn’t – I hope it at least gave you something horrible and weird to think about for a few minutes. :D

Posted by theelvez | 29 Comments

The Joys of Compiler and Processor Reordering

So I thought that a good first technical blog entry would be one about a common – but “hardly thought about by most programmers” problem called “reordering”. It is a subtle problem but very important to understand if you write lock-less multithreaded code or write code that directly reads and writes mapped hardware registers.

 

First we need to define reordering in concrete terms. I define it this way: reordering occurs when a program executes load and/or store instructions (load = read and store = write) in an order that is different than the order specified in the program’s source code. There are 2 flavors of reordering: compiler reordering and processor reordering. Reordering problems are specific to multi-threaded code running on single-processor or multi-processor systems. Reordering is not a problem for single threaded code. Also, reordering is not specific to kernel mode code although it has some interesting ramifications for kernel mode developers that I'll talk about in  a future post. Let’s look at some code:

 

int

AddEm(

    void

    )

{

    int a, b, c, z;

 

    a = 1; // (1)

    b = 2; // (2)

    c = 3; // (3)

 

    z = a + b + c; // (4)

 

    return z;

}

 

 

In what order do the operations happen in this code (notice that I numbered the operations in the code)? The fact of the matter is – we don’t know! It could be 1,2,3,4 or 3,2,1,4 or 1,3,2,4 etc.. These are all legal optimizations that the compiler or processor can make. Why? Well – because to a single thread they are unobservable changes. A single thread can’t observe the intermediate stages of the values of a, b and c – it can only see their values at the point it sums them and stores the result in z – operation 4. The only thing that we know about the order of the operations in the above code is that the sequence will end with operation 4. If the sequence didn’t end with operation 4, then we would be able to observe the reordering in a single thread of execution – that would be really bad. Why would a compiler or processor want to do reordering anyway? There are many reasons that optimizations occur, however, let’s not focus on the reasons for the optimizations and focus on how to deal with them in our code (one reason for optimization might be combining memory accesses that are to the same processor cache line yet disparate in the source code). Now let's look at some multi-threaded code:

 

 

typedef struct _FOO {
    ULONG A;
    ULONG B;
    ULONG C;

    ULONG D;

    BOOLEAN Initialized;

} FOO;

 

VOID

InitAFoo(FOO* f)

{

    f->A = 4;

    f->B = 8;

    f->C = 15;

    f->D = 16;

    f->Initialized = TRUE;

}

 

//

// A global variable of type FOO

//

 

FOO f;

 

VOID

T1(VOID)

{

    //

    // Init the global FOO

    //     

 

    InitAFoo(&f);

 

    return;

}

 

 

VOID

T2(VOID)

{

    ULONG AVal;

 

    //

    // See if our FOO is initialized

    //

 

    if (f.Initialized) {

        AVal = f.A;

 

        // Use AVal here

    }

 

    return;

}

 

This code is a little more interesting as we have 2 threads of execution that both use the same global data. This means that they can observe intermediate states in the global data as the other execution path modifies it. Let’s assume that the routine InitAFoo() gets reordered so that its execution order is like this:

 

VOID

InitAFoo(FOO* f)

{

    f->A = 4; // (4)

    f->B = 8; // (3)

    f->C = 15; // (1)

    f->D = 16; // (5)

    f->Initialized = TRUE; // (2)

}

 

If T1() and T2() run in parallel then there exists a window of time where T2() can see f.Initialized equal to TRUE before f.A has been initialized. This is really bad if f.A is a pointer or anything that we use in any way for that matter.

 

I should point out here that T1() and T2() are sharing data without a lock. This type of progrmming raises the "difficult to get right" bar quite a bit. In fact if you use locks around your shared data then you never have to worry about reordering problems (more on why in a moment). T1() and T2() are (sort of) an example of lock free code. They are not technically “lock free” under the strictest definition – but the same principles apply – they modify global data without a shared locking mechanism. Lock-free code counts on state transitions for synchronization. In this case – the intended synchronizing state transition is when f.Initialized gets set to TRUE. The idea is that T1() is a producer (it produces initialized FOOs). T2() is a consumer of the FOOs. T2() knows that a FOO is ready for consumption when its Initialized field is set to TRUE.

 

The easiest way to fix this code is to associate a lock with f – and then acquire the lock every time the code reads or writes f. In other words – T1() would call InitAFoo() with the lock held and T2() would read f under the lock like so:

 

VOID

T1(VOID)

{

    //

    // Init the global FOO

    //     

 

    Lock(fLock);

 

    InitAFoo(&f);

 

    Unlock(fLock);

 

    return;

}

 

 

VOID

T2(VOID)

{

    ULONG AVal;

 

    //

    // See if our FOO is initialized

    //

 

    Lock(fLock);

 

    if (f.Initialized) {

        AVal = f.A;

       

        // Use AVal here

    }

 

    Unlock(fLock);

 

    return;

}

 

Life is so much easier when you use locks. But just for the sake of this topic – let’s say we have a really good reason to - and decide not to use locks and to continue down this lock-less path.

 

So what is it about the code that messes us up here? It is the fact that there is no reliable order of initialization of f as observed by T1(). What we really need is a way to make sure that f.Initialized is always set to TRUE after the other fields of f have been initialized. How do we do that? Well – on most platforms there is a thing called a “memory barrier”. In the generic sense a memory barrier is a construct that prevents a compiler or processor from reordering instructions around it. Windows has a routine called MemoryBarrier() that does just this. But where exactly would we put this in our code? Well – where do we need the strict order of operations to occur? After the initialization of the fields of f and before f.Initialized. Basically stated – we don’t care what order the fields of f are initialized in - as long as they are all initialized before f.Initialized is set to TRUE. So we could modify our code to be:

 

VOID

InitAFoo(FOO* f)

{

    f->A = 4;

    f->B = 8;

    f->C = 15;

    f->D = 16;

 

    MemoryBarrier();

 

    f->Initialized = TRUE;

}

 

By changing the code in this way we force the order of operations as observed by T2(). With the code written as above, T2() can always know that if f.Initialized is TRUE then the fields of f are initialized. Note that MemoryBarrier() acts as a compiler and processor memory barrier. There are many flavors of barriers and they vary by platform. Some barriers only prevent writes, some only prevent reads. MemoryBarrier() is what is called a “full barrier” – it blocks both reads and writes from moving around it. There are also variable decorations like "volatile" that have barrier implications. If the specific barrier types and volatiles are interesting I can cover them in another post.

 

Note: For this code to be explicitly correct we would also have to do this in T2():

 

VOID

T2(VOID)

{

    ULONG AVal;

 

    //

    // See if our FOO is initialized

    //

      

        MemoryBarrier();

 

    if (f.Initialized) {

        AVal = f.A;

 

        // Use AVal here

    }

 

    return;

} 

 

The reason we need this barrier on the "read side" is very complicated and will not happen on any current processor that Windows supports. However, if you are a sick individual and want the whole story let me know and I'll post about it.

 

So why do locks prevent you from worrying about this? You guessed it – all locks will have an embedded memory barrier! If they didn’t they wouldn’t be a lock. My personal advice is to always write your code with locks. If you can prove to yourself that the locked paths are too slow – make sure you are only locking around the required elements of your code (don’t lock around a whole function if you just need to lock around a single assignment). If you perf is still suffering – then and only then should you put yourself through the torture of writing correct lock-free code. By the way – did I mention that every compiler and platform have different rules about how things can get reordered. More to come on this later.

 

Well – I hope I haven’t bored you to tears with this. I it is not interesting – please let me know and I will start an entry about Lost!

Posted by theelvez | 9 Comments

It Goes to Eleven

I am brand new to the world of blogs so I apologize in advance to any one that reads this blog. Please let me know if I am doing something rude or ignorant. :)

"Allow myself to .. introduce ... myself ..."

My name is Jonathan Morrison and I work on the Kernel OCA/Reliability team at Microsoft. I spend most of my time either talking to people about how various parts of the kernel and operating system work, debugging kernel crashes or fixing bugs in / adding robustness to the Windows operating system. My purpose in writing this blog is to interact with other low-level geeks that love to talk about cache line sizes, asynchrony and other interesting things like the television show Lost as well as help people interested in learning about any of the above.

I will try to write about stuff that fits into one of 3 categories:

1. Stuff that confused me when I was learning about systems level programming

2. Stuff that I have seen others struggle with over my years in this business

3. Anything to do with Lost

So that said. Here are my main areas of professional/hobby level interest:

1. Operating Systems

2. Small devices (i.e. microcontroller based boards) and robotics

3. late 80's / early 90's heavy metal

4. Dark matter and dark energy

5. Lost

Well that's it. Let me know if there is anything specific you would like to start with and I will give it a go. I am thinking of something light like - compiler and processor instruction reordering!!

Posted by theelvez | 1 Comments

Hello World!

Well - you have to do it don't you?
Posted by theelvez | 1 Comments
 
Page view tracker