Modern processors run at incredibly high clock rates and can often execute multiple instructions concurrently. When things are going perfectly these machines plow through instructions at multiple billions per second, yet these maximums are rarely achieved even by the very best programmers.
Continuing in the spirit of my approximately correct articles I thought it would be useful to talk about some of the things that can cause performance problems at a low level. These issues tend to vary from processor to processor; even one brand of processors might have different characteristics from version to version across otherwise highly compatible processor types. Different word sizes and memory architectures also change the relative importance of these issues but the issues below still tend to be the more important ones.
Naturally a complete discussion of these issues would fill a massive tome so this is necessarily incomplete and certainly only approximately correct for any given processor.
Issue #1: Cache Misses
Nothing kills your throughput faster than a ton of cache misses. The bulk of your code and data accesses simply must come out of the cache or you’ll likely be giving up something like a factor of 100 in execution speed for hitting main memory all the time. If your processor has multiple levels of cache then the hottest code and data must fit in the “nearest” cache to the processor, which is naturally the smallest. Let me break the root causes of the misses into three big buckets:
Cache Misses Due To Raw Size
If you have too much raw code or raw data it simply isn’t going to fit. 10M of code can’t fit into 512k of cache no matter how you slice it.
Managed Code Considerations: The JIT compiler tends to produce code that’s bigger than a raw native compiler would (see Qualitative Code Differences in Managed Code for some reasons why) and managed objects have some per-object overhead associated with them. These can inflate the raw size of your code or data somewhat.
Cache Misses Due To Poor Density
The processor does not fill its caches with memory on a byte-by-byte basis but rather loads whole cache lines which might be more like say 64 bytes. Sometimes the code and/or data could fit but it’s badly organized so that it doesn’t fit. For example you might need to access only 64k of data for some algorithm but the 64k you need is scattered within a 512k data structure. That sounds like an implausible situation but let me make it more concrete.
Example 1: low density due to large objects (internal waste): Suppose you have a linked list of 64 byte objects perfectly cache aligned and you need to walk the list and perform a sum of one of the fields. You will access two dwords (8 bytes) of the 64 to do the job. You’ll be using only 1/8th of the memory that was fetched – the rest of the memory was useless to you. That’s precisely the ratio that leads to 64k of needed data requiring 512k of cache.
Example 2: low density due to mixing (external waste): Suppose you have that same linked list of objects except that they are now exactly 8 bytes each, however during the allocation process other related and unrelated objects were allocated at the same time and they are side-by-side with the list you need to walk. You have the same problem now because each cache line of data you read has the objects you need plus some other objects not needed for your calculation. You can easily get a density of 1/8 as in example #1.
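The 1/8th arithmetic behind both examples can be sketched in a few lines. This is just illustrative bookkeeping in Java (standing in for any managed language); the `CacheDensity` name is invented and the 64-byte line size is an assumption, since real line sizes vary:

```java
// Hypothetical sketch of the density math from Examples 1 and 2.
// Assumes 64-byte cache lines; the class and method names are illustrative.
public class CacheDensity {
    static final int LINE_SIZE = 64;

    // Example 1 (internal waste): a line-aligned 64-byte node where the walk
    // touches only 8 bytes (one field plus the next "pointer").
    static double internalDensity(int objectSize, int bytesUsed) {
        // Each node occupies whole cache lines once fetched.
        int lines = (objectSize + LINE_SIZE - 1) / LINE_SIZE;
        return (double) bytesUsed / (lines * LINE_SIZE);
    }

    // Example 2 (external waste): 8-byte nodes, but each fetched line also
    // carries unrelated neighbors; only `nodesOfOursPerLine` of it is ours.
    static double externalDensity(int nodeSize, int nodesOfOursPerLine) {
        return (double) (nodeSize * nodesOfOursPerLine) / LINE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(internalDensity(64, 8)); // 0.125: 1/8 of fetched bytes useful
        System.out.println(externalDensity(8, 1));  // 0.125: the same 1/8 via mixing
    }
}
```

Either way you arrive at the same place: 64k of needed data dragging 512k of memory through the cache.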
Poor density is the leading cause of cache misses. Sometimes we talk about this in terms of “locality” but good “locality” merely refers to the tendency of the code to access memory in contiguous chunks rather than haphazardly. That is of course just another way of saying that there is a high density of “needed” bytes in any cache line.
Managed Code Considerations: Per object overhead in managed objects makes the object slightly bigger and therefore can cause lower density of data that you need (like in Example 1) but in practice this isn’t the dominant effect density-wise. For data, that overhead tends to be dwarfed by locality benefits that come from using a garbage collected memory system. Objects allocated close together in time tend to end up close together in space, and temporary allocations tend to get squeezed out over time. Those are both very good for locality. Said another way, benefits from “Example 2” patterns more than erase losses in “Example 1” patterns.
See Locality of reference and performance for more details.
Cache Misses Due To Conflict
This is both the most exotic situation and one where managed code doesn’t have any special issues, so I’ll treat it the most briefly.
While any memory can be stored in the cache, normal caches are not able to store any piece of main memory at any place in the cache. Generally there are only a small number of places (like say 4) into which any given piece of main memory can be copied. Because of this you might be so unfortunate that even though you only need a small amount of memory, all the pieces you need happen to be coming from areas of main memory that all map to the same cache locations. In that case you couldn’t store it all in cache despite its small size.
In practice you don’t see a lot of this in real systems because there is such a variety of data being accessed that there is no pattern dominant enough to actually affect the performance. Where you do see this is in small benchmarks which try to magnify a particular code path or access pattern. This is one of the more painful cases to try to diagnose – a fluke conflict in a benchmark that looks like a real performance problem but is really just bad luck in a test case.
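As a rough sketch of the mapping behind such conflicts: in a set-associative cache, the address picks the set, and only a handful of “ways” are available within it. The geometry below (32KB cache, 4 ways, 64-byte lines) is illustrative only; real processors vary:

```java
// Sketch of how a set-associative cache maps addresses to sets.
// Illustrative geometry: 32KB cache, 64-byte lines, 4 ways -> 128 sets.
public class CacheSets {
    static final int LINE = 64, WAYS = 4, CACHE = 32 * 1024;
    static final int SETS = CACHE / (LINE * WAYS); // 128 sets

    static int setIndex(long addr) {
        // The line number modulo the set count selects the set.
        return (int) ((addr / LINE) % SETS);
    }

    public static void main(String[] args) {
        // Addresses a multiple of SETS * LINE (here 8KB) apart all land
        // in the same set, so a 5th such line evicts one of the 4 resident
        // lines even though only 5 * 64 = 320 bytes are actually "needed".
        long stride = (long) SETS * LINE;
        for (int i = 0; i < 5; i++)
            System.out.println(setIndex(0x10000 + i * stride)); // same set each time
    }
}
```

A benchmark whose buffers happen to be allocated at exactly such strides can look memory-bound when the real program never would be.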
Issue #2: Bad Predictions
In order to keep all the components of the CPU “well fed” processors like to try to predict the future at least a little bit. A classic example of this is guessing about the outcomes of branch instructions. By taking a guess the processor can begin (partly) decoding instructions on the more likely path so that execution of those instructions is more economical when their turn comes. Similarly processors might like to predict the outcome of indirect “calls”, “jumps”, and even “returns.” Where prediction is either impossible or incorrect there are delays associated with restarting the processor’s pipeline. Processors often have limited internal memory used to give useful prediction hints: things like the previous outcomes of branches, recent call instructions, or even more exotic criteria might be recorded there.
Managed Code Considerations: It’s hard to say if managed code is better or worse than unmanaged code in terms of bad predictions. In favor of managed code is consistent use of exception handling for error cases – this effectively removes a lot of branches from mainstream code, and sometimes/often in code that is very “hot.” Working against managed code is the large number of virtual functions and the associated indirect calls that are made to them, compounded by the fact that managed code writers tend to write smaller functions than their unmanaged counterparts.
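To make the cost of bad predictions concrete, here is the classic sorted-versus-unsorted experiment, sketched in Java as a stand-in for any managed language. The data-dependent branch is the same in both passes and the answers are identical; only its predictability differs, and on most hardware the sorted pass typically runs noticeably faster (the sketch does not assert the timing, only the equivalence):

```java
import java.util.Arrays;
import java.util.Random;

// Same work, same branch, different predictability.
public class BranchDemo {
    static long sumAbove(int[] data, int threshold) {
        long sum = 0;
        for (int v : data)
            if (v >= threshold)   // predictable on sorted data, near-random otherwise
                sum += v;
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000, 0, 256).toArray();
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        // Identical results; on real hardware the sorted pass tends to be faster.
        System.out.println(sumAbove(data, 128) == sumAbove(sorted, 128)); // true
    }
}
```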
Issue #3: Virtual Memory
Of course using more memory is generally slower than using less memory and I’ve already covered those sorts of issues under “Raw Size” above. But let’s talk about the mechanisms of virtual memory specifically as they apply to the processor.
Virtual memory addresses have to be mapped into physical memory addresses. The operating system keeps certain tables up to date to indicate to the processor how that is to be done. However the processor doesn’t want to be looking up those tables on every memory access. To avoid this lookup processors use a “Translation Lookaside Buffer” (TLB) – which is basically yet another cache. The TLB keeps track of recent virtual addresses and their physical mappings and, like everything else on the processor, is a limited resource. If the processor is unable to find the virtual address mapping it needs in the TLB then it will have to go to memory (hopefully via the caches) to get those mappings, but this converts a normally free operation into a potentially costly one.
The reasons for TLB misses tend to mirror the ones for cache misses with density being the key problem, only now we’re talking about density of virtual memory pages and not density of cache lines. The “internal” (example 1 above) density problems become much more significant and “external” (example 2 above) density issues can be disastrous.
When considering TLB misses, good code density takes on an importance equal to, if not greater than, that of data density. Recall that, conversely, for cache problems data density is more likely to dominate than code density.
Managed Code Considerations: Jitting code can be a great benefit to your TLBs because related code from different functions is jitted together: code that is called together is jitted together. But conversely that same tendency to write smaller functions, and small function wrappers, means that sometimes the instruction pointer will dance through various different assemblies, visiting tiny helpers and moving on to something closer to where the work will be done. A dancing instruction pointer by definition makes only passing use of the code on any given virtual memory page – more pages are visited, requiring more TLB entries for good performance. Large precompiled assemblies need to be well tuned to get good density or they will suffer from TLB misses (not to mention extra loading time).
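A back-of-the-envelope sketch of the page-density point, using an assumed 4KB page size and invented addresses: 64k of objects packed contiguously spans 16 pages, while the same objects scattered through a 512k region can touch 8 times as many, demanding 8 times the TLB entries.

```java
import java.util.HashSet;
import java.util.Set;

// Count how many distinct 4KB pages a set of addresses touches.
public class TlbPages {
    static final int PAGE = 4096; // assumed page size

    static int pagesTouched(long[] addrs) {
        Set<Long> pages = new HashSet<>();
        for (long a : addrs) pages.add(a / PAGE); // page number of each access
        return pages.size();
    }

    public static void main(String[] args) {
        int n = 8192; // 8192 objects x 8 bytes = 64k of needed data
        long[] packed = new long[n], scattered = new long[n];
        for (int i = 0; i < n; i++) {
            packed[i] = i * 8L;      // contiguous: 64k span
            scattered[i] = i * 64L;  // spread through a 512k span
        }
        System.out.println(pagesTouched(packed));    // 16 pages
        System.out.println(pagesTouched(scattered)); // 128 pages
    }
}
```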
Issue #4: Memory Traffic
If you think about the processor’s main memory in pretty much the same way as you think about a disk drive you are likely to be well served.
When the processor is going to read memory it is very useful if it can begin fetching that memory long before it is needed – prediction is a great thing. However processors have limits as to how many fetches they can have outstanding at any given moment. The operating system uses similar strategies to try to pre-fetch disk sectors it thinks you are highly likely to need soon.
When the processor is going to write to memory it is useful to defer those stores to a time when the memory system is less busy, maybe even idle. However processors have limited internal memory to keep track of these pending stores. The operating system uses a similar approach by deferring writes to your files so that they can be done in the most economical order – or perhaps not done at all (e.g. temporary files might be deleted before they are ever fully written to disk).
Code patterns that do a lot of consecutive (especially dependent) reads perform much worse than code patterns which read a little, think a little, read a little, etc. Likewise for writes – scheduling processor work between writes gives it time to “catch up” effectively making the writes much cheaper.
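The “consecutive dependent reads” case is exactly what a pointer chase produces: the address of each load comes out of the previous load, so the memory system cannot run ahead. The sketch below contrasts that with an independent scan over the same data; both compute the same total (the sketch asserts the equivalence, not the timing, which depends on the hardware):

```java
// Dependent reads vs. independent reads over the same data.
public class ReadChains {
    // Each iteration's load address depends on the previous load,
    // so fetches serialize: this is the pattern to avoid.
    static long chase(int[] next, int steps) {
        long visited = 0;
        int i = 0;
        for (int s = 0; s < steps; s++) {
            i = next[i];     // next address comes from this load
            visited += i;
        }
        return visited;
    }

    // Addresses are known up front, so the processor can have
    // many fetches outstanding at once.
    static long scan(int[] next, int steps) {
        long sum = 0;
        for (int s = 0; s < steps; s++)
            sum += next[s];
        return sum;
    }

    public static void main(String[] args) {
        int n = 1 << 16;
        int[] next = new int[n];
        for (int i = 0; i < n; i++) next[i] = (i + 1) % n; // one big cycle
        // Same bytes, same answer, very different memory behavior.
        System.out.println(chase(next, n) == scan(next, n)); // true
    }
}
```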
Managed Code Considerations: Managed code is object-oriented and object-oriented code generally tends to have more indirections than “functional” code. So there’s a penalty to be expected from the outset. Looking past that basic penalty, managed code has things like Write Barriers (see Garbage Collector Basics and Performance Hints), and extra operations of any kind consume those limited read and write resources. Offsetting this is the presence of good primitive data types that are readily useable. These data types encourage developers to make use of better data structures, so you see fewer linked lists and more ArrayLists – that’s good for traffic.
Issue #5: Restrictions Due to Concurrency
If the other sections were brief this section is ultra brief but I want to at least mention this topic to foster some discussion and because it is important.
Anytime you are creating data structures that will be used by multiple threads things get complicated. Considering only the most basic things that must happen from the processor’s perspective, concurrency support in data structures turns into operations that require the memory subsystem to interact with the caches to stay “current.” These take the form of reads that must observe a write that potentially happened on another processor, and writes that must be flushed to all processors immediately. Both of those operations are giant monkey-wrenches in the processor’s normal order of business.
Managed Code Considerations: Again, staying ultra brief in this discussion, many if not all of the issues managed code faces are also faced by unmanaged code. If there is any advantage to be had it is with the widespread availability of synchronization objects that are well implemented and easy to use (like Monitor.Enter via the lock construct), a highly scalable memory allocator (the GC), and less reliance on concurrency death knells like Apartment-based threading models. All of which can be totally destroyed by one piece of code that uses InterlockedIncrement with wild abandon…
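Monitor.Enter and InterlockedIncrement are .NET/Win32 constructs; as a language-neutral sketch of that last point, here is a Java analogue using AtomicLong. A shared interlocked counter forces its cache line to bounce between processors on every increment, while per-thread counters combined at the end touch shared memory once per thread. Both versions produce the same count (thread counts and iteration counts below are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SharedCounter {
    static void joinAll(Thread[] ts) {
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
    }

    // Every increment is an interlocked operation on one shared cache line.
    static long contended(int threads, int perThread) {
        AtomicLong shared = new AtomicLong();
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) shared.incrementAndGet();
            });
            ts[t].start();
        }
        joinAll(ts);
        return shared.get();
    }

    // Each thread counts privately and publishes a single result at the end.
    static long combined(int threads, int perThread) {
        long[] partial = new long[threads];
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            ts[t] = new Thread(() -> {
                long mine = 0;
                for (int i = 0; i < perThread; i++) mine++;
                partial[id] = mine; // the only shared write this thread makes
            });
            ts[t].start();
        }
        joinAll(ts);
        long sum = 0;
        for (long v : partial) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(contended(4, 100_000)); // 400000
        System.out.println(combined(4, 100_000));  // 400000, with far less cache traffic
    }
}
```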
And I think that’s all I have to say on this topic at least for now. I think I’d like to end again by apologizing for the incompleteness of this all. There is much more to say in all of these areas so please don’t make giant conclusions on the basis of this limited primer but rather use it as food for thought.