One of the suggestions for a blog entry was the managed memory model. This is timely, because we’ve just been
revising our overall approach to this confusing topic. For the most part, I write about product
decisions that have already been made and shipped. In this note, I’m talking about future directions. So what is a memory model? It’s the
abstraction that makes the reality of today’s exotic hardware comprehensible to software developers. The reality is that CPUs are renaming registers, performing speculative
and out-of-order execution, and fixing up the world during retirement. Memory state is cached at various levels
in the system (L0 thru L3 on modern X86 boxes, presumably with more levels on
the way). Some levels of cache are
shared between particular CPUs but not others. For example, L0 is typically per-CPU but
a hyper-threaded CPU may share L0
between the logical CPUs of a single physical CPU. Or an 8-way box may split the system into two
hemispheres with cache controllers performing an elaborate coherency protocol
between these separate hemispheres.
If you consider caching effects, at some level all MP (multi-processor)
computers are NUMA (non-uniform memory access). But there’s enough magic going on that
even a Unisys 32-way can generally be considered as UMA (uniform memory access) by most software. It’s reasonable for the CLR to know as much as possible about the cache architecture
of your hardware so that it can exploit any imbalances. For example, the developers on our
performance team have experimented with a scalable rendezvous for phases of the
GC. The idea was that each CPU
establishes a rendezvous with the CPU that is “closest” to it in the cache hierarchy, and then one of each pair cascades up a tree to its closest
neighbor until we reach a single root CPU.
At that point, the rendezvous is complete. I think the jury is still out on this
particular technique, but they have found some other techniques that really pay
off on the larger systems.
Of course, it’s absolutely unreasonable for any managed developer (or 99.99% of
unmanaged developers) to ever concern themselves with these imbalances. Instead, software developers want to
treat all computers as equivalent.
For managed developers, the CLR is the computer, and it had better work
consistently regardless of the underlying machine.
Although managed developers shouldn’t have to know the difference between a 4-way
AMD server and an Intel P4 hyper-threaded dual proc, they still need to face the
realities of today’s hardware.
Today, I think the penalty of a CPU cache miss that goes all the way to
main memory is about 1/10th the penalty of a memory miss that goes
all the way to disk. And the trend is for the relative cost of a trip to main memory to keep growing. If you wanted good performance on a virtual memory system, you’ve always been
responsible for relieving the paging system by getting good page density and
locality in your data structures and access patterns.
In a similar vein, if you want good performance on today’s hardware, where
accessing main memory is a small disaster, you must pack your data into cache
lines and limit indirections. If
you are building shared data structures, consider separating any data that’s
subject to false sharing.
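As a rough illustration, here is a minimal C# sketch of keeping hot per-thread data on separate cache lines. The type names and the 64-byte line size are my assumptions, not anything the CLR mandates:

using System;
using System.Runtime.InteropServices;

// Hypothetical per-thread counter, padded out to a full cache line so that
// two threads incrementing adjacent slots don't fight over the same line.
[StructLayout(LayoutKind.Explicit, Size = 64)]   // 64 bytes is a common line size
struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

class PerThreadCounters
{
    // One padded slot per worker; without the padding, neighboring slots
    // would share a cache line and suffer false sharing.
    private PaddedCounter[] slots = new PaddedCounter[Environment.ProcessorCount];

    public void Increment(int slot)
    {
        slots[slot].Value++;
    }
}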
To some extent, the CLR can help you here.
On MP machines, we use lock-free allocators which (statistically)
guarantee locality for each thread’s allocations. Any compaction will (statistically)
preserve that locality. Looking
into the very far future – perhaps after our sun explodes – you could imagine a
CLR that can reorganize your data structures to achieve even better locality. The practical upshot today is that if you are writing single-threaded managed code to
process a server request, and if you can avoid writing to any shared state, you
are probably going to be pretty scalable without even trying.
Getting back to memory models, what is the abstraction that will make sense of current
hardware? It’s a simplifying model
where all the cache levels disappear.
We pretend that all the CPUs are attached to a single shared memory. Now we just need to know whether all the
CPUs see the same state in that memory, or if it’s possible for some of them to
see reordering in the loads and stores that occur on other CPUs.
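To make the question concrete, here is a hedged sketch of the classic litmus test (the class and field names are mine). On a machine with a single serialized memory, at least one thread must observe the other’s store, so r1 and r2 cannot both end up zero; on hardware that lets a store be delayed past a subsequent load, the (0, 0) outcome can occur.

using System;
using System.Threading;

class ReorderingLitmus
{
    static int x, y, r1, r2;

    static void Main()
    {
        for (int i = 0; i < 100000; i++)
        {
            x = 0; y = 0;
            Thread t1 = new Thread(delegate() { x = 1; r1 = y; });
            Thread t2 = new Thread(delegate() { y = 1; r2 = x; });
            t1.Start(); t2.Start();
            t1.Join(); t2.Join();

            // Impossible under a single serialized memory; observable on
            // hardware that lets the store drift past the following load.
            if (r1 == 0 && r2 == 0)
                Console.WriteLine("Saw (0, 0) on iteration {0}", i);
        }
    }
}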
At one extreme, we have a world where all the CPUs see a single consistent memory. All the loads and stores expressed in
programs are performed in a serialized manner and nobody perceives a particular
thread’s loads or stores being reordered.
That’s a wonderfully sane model which is easy for software developers to
comprehend and program to.
Unfortunately, it is far too slow and non-scalable. Nobody builds this.
At the other extreme, we have a world where CPUs operate almost entirely out of private
cache. If another CPU ever sees
anything my CPU is doing, it’s a total accident of timing. Because loads and stores can propagate
to other CPUs in any random order, performance and scaling are great. But it is impossible for humans to
program to this model.
In between those extremes are a lot of different possibilities. Those possibilities are usually explained in
terms of acquire and release semantics: a load with acquire semantics keeps later loads and stores from being hoisted above it, a store with release semantics keeps earlier loads and stores from sinking below it, and a full fence prevents movement across it in either direction. The
super-strong extreme model puts a full fence after every load or store. The super-weak extreme model uses normal
loads and stores everywhere, with no fencing.
The most familiar model is X86. It’s a
relatively strong model. Stores are
never reordered with respect to other stores. But, in the absence of data dependence,
loads can be reordered with respect to other loads and stores. Many X86 developers don’t realize that
this reordering is possible, though it can lead to some nasty failures under
stress on big MP machines.
Because of the above, the memory model for X86 can be described as: all stores effectively have release semantics, while loads are normal loads that remain free to reorder.
Historically, Windows NT has run on Alpha and MIPS computers.
Going forwards, Microsoft has announced that Windows will support Intel’s IA64 and
AMD’s AMD64 processors. Eventually,
we need to port the CLR to wherever Windows runs. You can draw an obvious conclusion from this: IA64 and AMD64 are in our future. AMD64
has the same memory model as X86. IA64, however,
specifies a weaker memory model than X86.
Specifically, all loads and stores are normal loads and stores. The application must use special ld.acq
and st.rel instructions to achieve acquire and release semantics. There’s also a full fence instruction,
though I can’t remember the opcode (mf?).
Be especially skeptical when you read the next paragraph:
There is some reason to believe that current IA64 hardware actually implements a stronger
model than is specified. Based on
informed hearsay and lots of experimental evidence, it looks like normal store
instructions on current IA64 hardware are retired in order with release semantics. If that
is indeed the case, why would Intel specify something weaker than what they have
built? Presumably they would do
this to leave the door open for a weaker (i.e. faster and more scalable)
implementation in the future.
In fact, the CLR has done exactly the same thing.
Section 12.6 of Partition I of the ECMA CLI specification describes our
memory model. It covers the
alignment rules, byte ordering, the atomicity of loads and stores, volatile
semantics, locking behavior, etc.
According to that specification, an application must use volatile loads
and volatile stores to achieve acquire and release semantics. Normal loads and stores can be freely
reordered, as seen by other CPUs.
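As a small illustration of what volatile buys you under that specification, here is a hedged publish/consume sketch (the type and field names are mine). The volatile store acts as the release and the volatile load as the acquire:

class Publication
{
    private int data;
    private volatile bool ready;   // volatile store = release, volatile load = acquire

    public void Produce()
    {
        data = 42;      // ordinary store
        ready = true;   // volatile store: the store to 'data' cannot drift below it
    }

    public bool TryConsume(out int value)
    {
        if (ready)      // volatile load: the load of 'data' below cannot drift above it
        {
            value = data;
            return true;
        }
        value = 0;
        return false;
    }
}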
What is the practical implication of this?
Consider the standard double-locking protocol:
if (a == null)
{
    lock (lockObject)
    {
        if (a == null)
            a = new A();
    }
}
This is a common technique for avoiding taking a lock on the read of ‘a’ in the typical
case. It works just fine on
X86. But it would be broken by a
legal but weak implementation of the ECMA CLI spec. It’s true that, according to the ECMA
spec, acquiring a lock has acquire semantics and releasing a lock has release semantics. But
we have to assume that a series of stores has taken place during construction
of ‘a’. Those stores can be
arbitrarily reordered, including the possibility of delaying them until after
the publishing store which assigns the new object to ‘a’. At that point, there is a small window
before the store.release implied by leaving the lock. Inside that window, other CPUs can
navigate through the reference ‘a’ and see a partially constructed instance.
We could fix this code in various ways. For
example, we could insert a memory barrier of some sort after construction and
before assignment to ‘a’. Or – if
construction of ‘a’ has no side effects – we could move the assignment outside
the lock, and use an Interlocked.CompareExchange to ensure that assignment only
happens once. The GC would collect
any extra ‘A’ instances created by this race.
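Here is a hedged sketch of both fixes. The class, field, and lock-object names are mine, and Thread.MemoryBarrier is a full fence, which is stronger than the store.release we strictly need before the publishing store:

using System.Threading;

class LazyA
{
    static A a;
    static readonly object lockObject = new object();

    // Fix 1: a barrier between construction and the publishing store.
    public static A GetWithBarrier()
    {
        if (a == null)
        {
            lock (lockObject)
            {
                if (a == null)
                {
                    A temp = new A();        // the constructor's stores happen here
                    Thread.MemoryBarrier();  // keep them from drifting past the publish
                    a = temp;
                }
            }
        }
        return a;
    }

    // Fix 2: if constructing A has no side effects, move the assignment outside
    // the lock; a race may create extra instances, but only one assignment wins
    // and the GC collects the losers.
    public static A GetWithCompareExchange()
    {
        if (a == null)
        {
            A temp = new A();
            Interlocked.CompareExchange(ref a, temp, null);
        }
        return a;
    }
}

class A { }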
I hope that this example has convinced you that you don’t want to try writing reliable
code against the documented CLI model.
I wrote a fair amount of “clever” lock-free thread-safe code in version 1 of the
CLR. This included techniques like
lock-free synchronization between the class loader, the prestub (which traps
first calls on methods so it can generate code for them), and AppDomain
unloading so that I could back-patch MethodTable slots efficiently. But I have no desire to write any kind
of code on a system that’s as weak as the ECMA CLI spec. Even if
I tried to write code that is robust under that memory model, I have no hardware
that I could test it on. X86, AMD64
and (presumably) IA64 are stronger than what we specified.
In my opinion, we screwed up when we specified the ECMA memory model. That model is unreasonable.
So what would make a sensible memory model for the CLR?
First, we would want to have a consistent model across all CLI
implementations. This would include
the CLR, Rotor, the Compact Frameworks, SPOT, and – ideally – non-Microsoft
implementations like Mono. So
putting a common memory model into an ECMA spec was definitely a good idea. It goes
without saying that this model should be consistent across all possible
CPUs. We’re in big trouble if
everyone is testing on X86 but then deploying on Alpha (which had a notoriously weak memory model).
We would also want to have a consistent model between the native code generator (JIT or
NGEN) and the CPU. It doesn’t make
sense to constrain the JIT or NGEN to order stores, but then allow the CPU to
reorder those stores. Or vice versa.
Ideally, the IL generator would also follow the same model. In other words, your C# compiler should
be allowed to reorder whatever the native code generator and CPU are allowed to
reorder. There’s some debate
whether the converse is true.
Arguably, it is okay for an IL generator to apply more aggressive
optimizations than the native code generator and CPU are permitted, because IL
generation occurs on the developer’s box and is subject to testing.
Ultimately, that last point is a language decision rather than a CLR
decision. Some IL generators, like
ILASM, will rigorously emit IL in the sequence specified by the source
code. Other IL generators, like
Managed C++, might pursue aggressive reordering based on their own language
rules and compiler optimization switches.
If I had to guess, IL generators like the Microsoft compilers for C# and
VB.NET would decide to respect the CLR’s memory model.
We’ve spent a lot of time thinking about what the correct memory model for the CLR
should be. If I had to guess, we’re
going to switch from the ECMA model to something stronger, where all stores have release semantics (stores are never visibly reordered with respect to other stores) while ordinary loads remain free to reorder. I think that we will try to persuade
other CLI implementations to adopt this same model, and that we will try to
change the ECMA specification to reflect this.
If you’re thinking this looks an awful lot like X86, AMD64 and (presumably) IA64,
you are right. We also think it
hits the sweet spots for compilers.
Reordering loads is much more important for enabling optimizations than reordering stores. So what
happens in 10 years when these architectures are gone and we’re all using
futuristic Starbucks computers with an ultra-weak model? Well, hopefully I’ll be living the good
life in retirement on Maui. But the CLR’s native code generators
will generate whatever instructions are necessary to keep stores ordered when
executing your existing programs.
Obviously this will sacrifice some performance.
The trade-off between developer productivity and computer performance is really an
economic one. If there’s sufficient
incentive to write code to a weak memory model so it can execute efficiently on
future computers, then developers will do so. At that point, we will allow them to
mark their assemblies (or individual methods) to indicate that they are “weak
model clean”. This will permit the
native code generator to emit normal stores rather than store.release
instructions. You’ll be able to
achieve high performance on weak machines, but this will always be “opt
in”. And we won’t build this
capability until there’s a real demand for it.
I personally believe that for mainstream computing, weak memory models will never
catch on with human developers.
Human productivity and software reliability are more important than the
increment of performance and scaling these models provide.
I think the person asking about memory models was really interested in where he
should use volatile and fences in his code. Here’s my advice: