In this article, I describe the first of a series of Performance Patterns that you can identify using the VSTS concurrency visualization tool.  I thought that it would be appropriate to start with a simple lock contention scenario.  The code that I will use is a naive parallel matrix multiplication of two SIZE x SIZE matrices, A and B, where each thread executes the following loop:

for (i = myid*PerProcessorChunk; i < (myid+1)*PerProcessorChunk; i++)
{
    EnterCriticalSection(&mmlock);
    for (j = 0; j < SIZE; j++)
    {
        for (k = 0; k < SIZE; k++)
        {
            C[i+j*SIZE] += A[i+k*SIZE]*B[k+j*SIZE];
        }
    }
    LeaveCriticalSection(&mmlock);
}

Each thread in this application has a unique id from 0 to number_of_threads-1, and the problem is partitioned by allocating a block of rows of C to each thread to compute.  If you examine the code, you will notice that there is no sharing of the elements of matrix C (we can chat about issues such as false sharing later if there's interest), so there's actually no need for the critical section (lock) used here.  I will use this example to show you how such code shows up in our tool and how we can help you identify the root cause in your application's source code so that you can fix the problem.
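The post doesn't show the setup code in matmult2, but for readers who want to reproduce the scenario, here is a minimal sketch of how the threads and the row partitioning might be wired up.  The NUM_THREADS constant, the use of CreateThread, and the wmain body are my assumptions; only SIZE, PerProcessorChunk, myid, mmlock, MatMult, and the inner loop come from the original.

#include <windows.h>

#define SIZE        1024
#define NUM_THREADS 8                               // matches the 8 threads seen later in the tool

double A[SIZE*SIZE], B[SIZE*SIZE], C[SIZE*SIZE];    // SIZE x SIZE matrices (initialization of A and B omitted)
CRITICAL_SECTION mmlock;
int PerProcessorChunk = SIZE / NUM_THREADS;         // block of rows of C assigned to each thread

DWORD WINAPI MatMult(LPVOID arg)                    // worker thread: runs the loop shown above
{
    int myid = (int)(INT_PTR)arg;                   // unique id: 0 .. NUM_THREADS-1
    for (int i = myid*PerProcessorChunk; i < (myid+1)*PerProcessorChunk; i++)
    {
        EnterCriticalSection(&mmlock);              // the unnecessary lock discussed above
        for (int j = 0; j < SIZE; j++)
        {
            for (int k = 0; k < SIZE; k++)
            {
                C[i+j*SIZE] += A[i+k*SIZE]*B[k+j*SIZE];
            }
        }
        LeaveCriticalSection(&mmlock);
    }
    return 0;
}

int wmain()
{
    HANDLE threads[NUM_THREADS];
    InitializeCriticalSection(&mmlock);
    for (int t = 0; t < NUM_THREADS; t++)           // hand each thread its id as the thread argument
        threads[t] = CreateThread(NULL, 0, MatMult, (LPVOID)(INT_PTR)t, 0, NULL);
    WaitForMultipleObjects(NUM_THREADS, threads, TRUE, INFINITE);
    for (int t = 0; t < NUM_THREADS; t++)
        CloseHandle(threads[t]);
    DeleteCriticalSection(&mmlock);
    return 0;
}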

Before using the tool, I have to make sure that I have a good symbol path defined.  Otherwise, the call stacks in the tool will not be very useful.  I have a habit of setting a system symbol path that points to the Microsoft public symbol server that looks something like this: 

set _NT_SYMBOL_PATH=srv*c:\symcache*http://msdl.microsoft.com/download/symbols

Notice that I've also set up a symbol cache so I don't have to keep retrieving symbol files from the network.

Let's take a look at the application's behavior in our tool.  We start by looking at the CPU utilization view:

You'll notice that the CPU utilization view shows that my application (the green area) is only consuming the equivalent of a single logical core in the system, even though I've parallelized it.  My next work item is to figure out why my application's execution seems to be serialized.  To do that, you can switch to the Thread Blocking view, which looks like the following:

First you'll notice that your application is indeed creating 8 threads, which is what you intended.  You will also notice that the threads' execution is serialized (the green execution regions for the threads never overlap).  You can also see a lot of red between the green segments, and the legend shows that red corresponds to synchronization delays.  Now you want to know what you did in your application that resulted in this behavior.  All you have to do is click on a red region.  Here I show a zoomed-in view and what happened after I clicked on a segment in the red region of a thread's state.

A few things happened.  First, the synchronization segment was highlighted.  Second, the "Selection" tab was activated in the bottom frame.  Inside the Selection tab, we show the category of delay, which is user synchronization in this case.  We also show an API in this case: EnterCriticalSection.  Next, we show the call stack that resulted in the thread blocking.  There's a bug in VSTS 2010 Beta 1 that results in an extraneous stack frame after the application (matmult2) stack frame, but you can see that we show the frame containing the call to EnterCriticalSection in ntdll.dll.  Incidentally, if I didn't have a good symbol path set up, the tool would not have been able to show me useful information in the stacks, and its value would have been considerably reduced.  Now, I can keep clicking on red segments, but that is not very productive.  I would like to know how expensive this particular call stack is so that I can prioritize my performance optimization effort.  You can click on the User Synchronization tab in the lower frame.  What you get there is a call tree report that summarizes, for every synchronization call stack, how often the threads that are enabled in the current view blocked on that call stack during the displayed time period (you should click updatestats in the top toolbar to make sure that the statistics are updated for the currently displayed view).  In this case, the report looks like:

I've expanded the two call stacks in this report to show the details.  The report says that there were 113 synchronization blocking events: one call stack was responsible for 75 and the other for 38 blocking events.  You can also see that one of the call stacks involves the main thread, since it contains the matmult2!_wmain() frame, and the other seems to involve the slave threads that start execution at the function matmult2!MatMult().  If I want to examine the code that resulted in the most significant blocking call stack, I right-click on the stack frame of interest, which brings up a context menu with two options: "View Source", which takes you to the source code of the function specified in the frame (in the example below it is disabled because we don't have source file data for ntdll.dll), and "View Call Sites", which takes you to the place in the calling function where the call to this function was made.  Here's what the UI looks like for this feature (notice that I'm compensating for the bug in Beta 1 by right-clicking on the frame right after the matmult2 frame):

When you choose "View Call Sites" in the above menu, you will be taken to an editor window with the matmult2.cpp file open and the cursor near the call site where you made the call to EnterCriticalSection().  Now you have to spend some time determining whether this lock is needed and what you can do to reduce contention on it.  Of course, in this example, the lock is not needed at all.  The editor screenshot is below (the cursor was at the line following the call because we deal with return addresses when collecting call stacks):

 
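Since each thread writes only to its own block of rows of C, one way to fix the problem in this example is simply to delete the lock.  Here's a minimal sketch of what the corrected loop might look like:

for (i = myid*PerProcessorChunk; i < (myid+1)*PerProcessorChunk; i++)
{
    // No synchronization needed: each thread updates a disjoint block of rows of C
    for (j = 0; j < SIZE; j++)
    {
        for (k = 0; k < SIZE; k++)
        {
            C[i+j*SIZE] += A[i+k*SIZE]*B[k+j*SIZE];
        }
    }
}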

Now, there's something curious about the behavior of my application.  In the above screenshots, I ran the application with SIZE=1024, so when I look at the code, I would expect each thread's work to appear as many short execution segments, one per outer-loop iteration with a lock acquisition and release around each, rather than the few long execution segments that I ended up with.  If you're not familiar with the details of critical sections, you might not be aware that they do not enforce a FIFO order on the threads waiting for the lock (i.e., they are not fair).  The advantage of this behavior is that the first thread to find the lock in a free state is allowed to acquire it, even if other threads are already waiting on the lock.  The disadvantage is the lack of fairness.  This feature is also referred to as anti-convoy support.  From a performance perspective, the critical section is faster because the lock doesn't have to be handed to a sleeping waiter that must first be woken up, which removes context switch overheads from the critical path, so in this example the application will probably be a little faster because of it.  If you want to enforce fairness and FIFO ordering, you can use a synchronization primitive that does not have these anti-convoy features, such as the Win32 Mutex.  Here's how you can modify the code to use a Mutex:

for (i = myid*PerProcessorChunk; i < (myid+1)*PerProcessorChunk; i++)
{
    WaitForSingleObject(hMutex, INFINITE);
    for (j = 0; j < SIZE; j++)
    {
        for (k = 0; k < SIZE; k++)
        {
            C[i+j*SIZE] += A[i+k*SIZE]*B[k+j*SIZE];
        }
    }
    ReleaseMutex(hMutex);
}
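The loop above assumes that hMutex was created earlier; that part isn't shown in the original code, but a minimal setup sketch might look like this:

// Hypothetical setup; the original doesn't show where hMutex is created.
HANDLE hMutex = CreateMutex(NULL, FALSE, NULL);     // unnamed mutex, not initially owned
if (hMutex == NULL)
{
    // creation failed; report GetLastError() and exit
    return 1;
}

// ... create the worker threads and wait for them, as before ...

CloseHandle(hMutex);                                // release the kernel object when done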

When I recompiled the application with this change and collected a new profile, I could clearly observe the convoy behavior in the thread executions in the Thread Blocking view, as shown below:

I hope that you had fun with this article. Go pick up VSTS 2010 Beta 1 and play with our tool!