The Interlocked family of functions perform atomic operations on memory. How do they do it?

It depends on the underlying CPU architecture. For some CPUs, it's easy: The x86, for example, has direct support for many interlocked operations by means of the LOCK prefix (with the bonus feature that LOCK is implied for the XCHG opcode.) The ia64 and x64 also have direct support for atomic load-modify-store operations.

Most other architectures break the operation into two parts, known as Load-link/store-conditional. The first part (load-link) reads a value from memory and instructions the processor to monitor the memory address to see if any other processors modify that same memory. The second part (store-conditional) stores a value to memory provided that no other processors have written to the memory in the meantime. An atomic load-modify-store operation is therefore performed by reading the value via load-link, performing the desired computation, then attempting a store-conditional. If the store-conditional fails, then start all over again.

LONG InterlockedIncrement(LONG volatile *value)
{
  LONG lOriginal, lNewValue;
  do {
    // Read the current value via load-link so we will know if
    // somebody has modified it while we weren't looking.
    lOriginal = load_link(value);

    // Calculate the new value
    lNewValue = lOriginal + 1;

    // Store the value conditionally. This will fail if somebody
    // has updated the value in the meantime.
  } while (!store_conditional(value, lNewValue));
  return lNewValue;
}

(If this looks familiar, it should. You've seen this pattern before.)

Now, asking the CPU to monitor a memory address comes with its own gotchas. For one thing, the CPU can monitor only one memory address at a time, and its memory is very short-term. If your code gets pre-empted or if a hardware interrupt comes in after your load_link, then your store_conditional will fail because the CPU got distracted by the shiny object known as hardware interrupt and totally forgot about that memory address it was supposed to be monitoring. (Even if it managed to remember it, it won't remember it for long, because the hardware interrupt will almost certainly execute its own load_link instruction, thereby replacing the monitored address with its own.)

Furthermore, the CPU might be a little sloppy in its monitoring and monitor not the address itself but the cache line. If somebody modifies a different memory location which happens to reside in the same cache line, the store_conditional might fail even though you would expect it to succeed. The ARM architecture allows a processor to be so sloppy that any write in the same block of 2048 bytes can cause a store_conditional to fail.

What this means for you, the assembly-language coder who is implementing an interlocked operation, is that you need to minimize the number of instructions between the load_link and store_conditional. For example, InterlockedIncrement merely adds 1 to the value. The more instructions you insert between the load_link and store_conditional, the greater the chance that your store_conditional will fail and you will have to retry. And if you put too much code in between, your store_conditional will never succeed. As an extreme example, if you put code that takes five seconds to calculate the new value, you will certainly receive several hardware interrupts during those five seconds, and your store_conditional will always fail.

Bonus reading: Why did InterlockedIncrement/Decrement only return the sign of the result?