When refactoring a serial application into a parallel application the minimum bar of acceptable performance is that the parallel application must run from beginning to end in less wall clock time than the serial application. If it doesn’t run faster, there’s no point in doing the refactoring. I was working on making a certain application parallel and I was at the point of measuring the performance. The parallel version had a problem; it was about 3x slower than the serial. Something which added to my confusion was that the parallel version made excellent use of the CPU. At first glance there was a natural bottleneck in the refactored application.

I profiled the application with the Concurrency Visualizer and nothing jumped out to me as being the problem. There wasn’t any unexpected Synchronization or Sleep blocks and the swim lanes in the threads view were predominantly green (Execution). So why would an application which is doing work on all of its threads taking way longer than the serial version? I opened up the Execution Profile Report (by clicking on the word Execution under Visible Timeline Profile) and expanded the tree node with the name of my executable.

One of the children tree nodes stood out immediately. The Execution Report was showing that one of the applications methods was taking up about 86% of the execution time (inclusively), but that was certainly not expected because it was a small method which shouldn’t have been doing most of the work. Digging deeper I saw that most of the time spent in that method was to different calls to a System.Random object. Random.Next() is a pretty fast method, and even though it was being called frequently it should not have been taking up all of the execution time that it was. Thus far in the refactoring, all of the threads which were executing were accessing the same Random object. Since they were all accessing the same object, and because Random changes internal state every time it’s called, there was probably a lot of processor cache swapping going on.

To help determine if cache swapping was the culprit I started a new Performance Session in the Visual Studio Performance Explorer. In its properties I changed the Sample event to Performance counter and then selected Platform Events -> L2 Cache -> L2 Modified Lines In (there may be a better counter, this is the one I picked). I select the current project as the target and started profiling. The analyzed report lined up with what I suspected; 66% percent of the samples happened as part of Random.Next().

With this information in hand I attributed the declaration of the Random object with the ThreadStaticAttribute, and put some if(null) statements in the different methods which referred to the object to make sure that each thread got its own copy of a Random object. In addition I wrapped an array of read only objects in a ThreadLocal field, because accessing it was the second highest offender of L2 Modified Lines In samples.

With these changes made I ran again and now the parallel version of the application was 3x faster than the serial. Not quite the linear speedup I was hoping to achieve, but that cache swapping was certainly a big step in the right direction. At least now the parallel application was faster than the serial. So if the Execution Profile report is reporting that a method is taking up too much time, you might want to consider investigating if cache swapping is the culprit.

Some good reading about cache swapping and false cache sharing can be found at False sharing is no fun and .Net Matters: False Sharing.