Hazim Shafi's Blog

Concurrency Visualizer: Parallel Performance Tools for Windows
VS2010 Concurrency Visualizer: Parallel Performance Demystified!

Hi,

In my previous post, I mentioned the "Demystify" feature of our tool that isn't quite working in the VS2010 Beta 2 release (Premium and Ultimate versions).  Our team has now placed a web-based preview of this feature on our Team Blog.  Demystify is a great way of learning about our tool's features and it will be in the final release.  Give it a go and use it as a valuable resource while you're trying out our tool.  Please keep an eye on both blogs for more information about our tool.

 Enjoy!

 -Hazim

VS2010 Beta 2 Concurrency Visualizer Parallel Performance Tool Improvements

Hi,

I'm very excited about the release of Visual Studio 2010 Beta 2 that is going to be available to MSDN subscribers today and to the general public on 10/21.  This release includes significant improvements in many areas that I'm sure you'll love.  But, as the Architect of the Concurrency Visualizer tool in the VS2010 profiler, I'm extremely thrilled to share with you the huge improvements in the user interface and usability of our tool.  Our team has done an outstanding job in listening to feedback and making innovative enhancements that I'm sure will please you, our customers.  Here is a brief overview of some of the improvements that we've made:

Before we start, I'll remind you again that the tool that I'm describing here is the "visualize the behavior of a multithreaded application" option under the concurrency option in the Performance Wizard accessible through the Analyze Menu.  I've described how the tool can be run in a previous post.  Here's a screenshot of the performance wizard with the proper selection to use our tool:

Beta 2 Performance Wizard

Ok, now let's start going over the changes.  First, we've slightly changed the names of the views for our tool.  We now have "CPU Utilization", "Threads", and "Cores" views.  These views can be accessed either through the profiler toolbar's Current View pull-down menu, through bitmap buttons at the top of the summary page, or through links in our views as you'll see in the top left of the next screenshot.

Beta 2 Threads View

You'll notice that the user interface has gone through some refinement since Beta 1 (see my earlier posts for a comparison).  Let's go over the features quickly:

1. We've added an active legend in the lower left.  The active legend has multiple features.  First, for every thread state category, you can click on the legend entry to get a callstack-based report in the Profile Report tab summarizing where blocking events occurred in your application.  For the execution category, you get a sample profile that tells you what work your application performed.  As usual, all of the reports are filtered by the time range that you're viewing and the threads that are enabled in the view.  You can change this by zooming in or out and by disabling threads in the view to focus your attention on certain areas.  The legend also provides a summary of where time was spent as percentages shown next to the categories.

2. When you select an area in a thread's state, the "Current Stack" tab shows where your thread's execution stopped for blocking categories, or the nearest execution sample callstack within +/- 1ms of where you clicked for green segments.

3. When you select a blocking category, we also try to draw a link (dark line shown in the screenshot) to the thread that resulted in unblocking your thread whenever we are able to make that determination.  In addition, the Unblocking Stack tab shows you what the unblocking thread was doing by displaying its callstack when it unblocked your thread.  This is a great mechanism to understand thread-to-thread dependencies.

4. We've also improved the File Operations summary report that is accessible from the active legend by listing file operations performed by the System process.  Some of those accesses are actually triggered on behalf of your application, so we list them but clearly mark them as System accesses.  Others may not be related to your application at all.

5. The Per Thread Summary report is the same bar graph breakdown of where each thread's time was spent that used to show up by default in Beta 1, but can now be accessed from the active legend.  This report is a guide that helps you understand improvements/regressions from one run to another and serves as a guide to help focus your attention on the threads and types of delay that are most important in your run.  This is valuable for filtering threads/time and prioritizing your tuning effort.

6. The profile reports now have two additional features.  By default, we now filter out the callstacks that contribute < 2% of blocking time (or samples for execution reports) to minimize noise.  You can change the noise reduction percentage yourself.  We also allow you to remove stack frames that are outside your application from the profile reports.  This can be valuable in certain cases, but it is left off by default because blocking events usually do not occur in your code, so filtering that stuff out may not help you figure out what's going on.

7. We added significant help content to the tool.  You'll notice the new Hints tab, which includes instructions about features of the view as well as links to two important help items.  One is a link to our Demystify feature, which is a graphical way to get contextual help.  It is also accessible through the button on the top right-hand side of the view.  Unfortunately, the link isn't working in Beta 2, but we are working on hosting an equivalent web-based version to assist you and to get feedback before the release is finalized.  I'll communicate this information in a subsequent post.  The other link is to a repository of graphical signatures for common performance problems.  This can be an awesome way of building a community of users and leveraging the experiences of other users and our team to help you identify potential problems.

8. The UI has been improved to preserve details when you zoom out by allowing multiple colors to reside within a thread execution region when the same pixel in the view corresponds to multiple thread states.  This was the mechanism that we chose to always report the truth and give the users a hint that they need to zoom in to get more accurate information.

The next screenshot shows you a significantly overhauled "Cores" view:

The Cores view has the same functionality as before; namely, it helps you understand how your application threads were scheduled on the logical cores in your system.  The view leverages a new compression scheme to avoid loss of data when the view is zoomed out.  It has a legend that was missing in Beta 1.  It also has clearer statistics for each thread:  the total number of context switches, the number of context switches resulting in core migration, and the percentage of total context switches resulting in migration.  This can be very valuable when tuning to reduce context switches or cache/NUMA memory latency effects.  In addition, the visualization can easily illustrate thread serialization on cores that may result from inappropriate use of thread affinity.

This is just a short list of the improvements that we've made.  I will be returning soon with another post about new Beta 2 features, so please visit again and don't be shy to give me your feedback and ask any questions that you may have.

Cheers!

-Hazim

 

Performance Pattern 2: Disk I/O

A common source of performance bugs is disk I/O.  In this post, I'm going to give an overview of the features in our tool that help developers understand the sources of I/O stalls, including determining latency, identifying the files involved, and relating behavior to application source code.

The first screenshot, shown above, illustrates how I/O stalls are depicted in our tool.  In the Thread Blocking view, I/O delays have a dedicated category (dark purple in VSTS Beta 1).  You can identify the threads that spend a significant fraction of their lifetime blocked on I/O using the Execution Breakdown tab.  Using this view, you can focus your attention on the thread(s) that are relevant and you may choose to eliminate other threads from the view by selecting the thread channels on the left and then clicking the "Hide selected threads" icon in the toolbar.  In the screenshot, I've hidden all but the main thread (id = 5968) from the view and you can observe from the breakdown graph that this thread spends most of its time executing, except for some synchronization and I/O delays.  In the timeline, you can observe that the I/O stalls occur at the tail end of execution.

The screenshot above shows what happens when you zoom in on the phase with lots of I/O stalls and click on one of the I/O blocking segments.  You will notice that the "Selection" tab now shows details about this blocking event.  If it weren't for a bug in Beta 1, it would show something like the following text:

Category = I/O

API = WriteFile

kernel32!_WriteFile@
MatMult2!_wmain:myprogram.cpp:81
MatMult2!__tmainCRTStartup:crtexe.c:371

Delay = 70.3215 ms

Now, what we'd like to know is which file was being written at this time.  You can find out by moving the thread next to the Disk write channels and clicking the write segment (if available) that most closely corresponds to the I/O blocking event.  I've zoomed in on the blocking event that I selected above, and you can now see what happens when I click on the Disk 0 Write channel segment right above the blocking event.  The Selection tab now shows the number of physical disk I/O writes that were occurring at that moment, with a list of the filenames involved.  In this case, the file is "MatMult2.out".

At this point, you are most likely interested in the aggregate delay incurred writing to this file and where in your application these delays manifest themselves.  There are two tools for that.  First, when you click on the "Disk I/O Stats" tab, you get a summary of the files being read or written in the current view, along with the number and type of accesses, the number of bytes, and the I/O latency.  Second, if you select the "Other Blocking Reasons" tab, you get a summary of blocking delays, collated by callstack, so you can identify the number of blocking events on the specific WriteFile call in the current view.  Make sure to click the "UpdateStats" button in the toolbar so that the reports are updated for the current view and the selected threads.  This reporting capability will be improved to filter on specific blocking categories in the future.

From this report, you can also right-click on a callstack frame (e.g., WriteFile), which allows you to view either the function body or the call sites for the current function.  The next screenshot shows how the source file containing the call to WriteFile opens up in this case, with the cursor in the general vicinity of the call.

 A couple of quick notes are in order about I/O and interactions with the operating system in our tool:

  1. We show "physical" disk I/O in our tool, so disk I/O that is buffered will not show up.  We made the decision because physical I/O is more important from a performance perspective if it is synchronous.  This brings up an interesting topic of system buffer caches that are used to hide latency.  If you run an app that reads a file twice in a row and profile it each time, it is not uncommon to see many I/O blocking events reading the file in the first run, but not the second.  You should keep such interactions with the operating system in mind when doing performance analysis.
  2. Sometimes, I/O that is initiated by the application cannot be attributed directly back to it.  This kind of I/O is often attributed to the System process.  We therefore show all physical disk I/O initiated both by your application and the System process.  You should keep this in mind when you are analyzing results because there might be I/O from other processes in the system.  Although this can be "noise" from a user's perspective, it can also help you understand system-level physical disk contention.
  3. The Disk I/O Stats view as well as callstack reports are going to be significantly enhanced in Beta 2 to improve I/O investigations.
  4. Application startup performance is often limited by DLL load times.  Our tool can be a great aid in analyzing such scenarios.  DLL loads, although they require I/O, can be manifested as "memory management" in our tool due to paging activity.  You should be aware of this.  I will likely write a dedicated performance pattern article on this in the near future.

Now go play with our tool!

Performance Pattern 1: Identifying Lock Contention

In this article, I describe the first of a series of Performance Patterns that you can identify with the VSTS concurrency visualization tool.  I thought that it would be appropriate to start with a simple lock contention scenario.  The code that I will use is a naive parallel matrix multiplication of two SIZE x SIZE matrices, A and B, where each thread executes the following loop:

for (i = myid*PerProcessorChunk; i < (myid+1)*PerProcessorChunk; i++)
{
    // Each row of C is computed while holding the critical section.
    EnterCriticalSection(&mmlock);
    for (j = 0; j < SIZE; j++)
    {
        for (k = 0; k < SIZE; k++)
        {
            C[i+j*SIZE] += A[i+k*SIZE]*B[k+j*SIZE];
        }
    }
    LeaveCriticalSection(&mmlock);
}

Each thread in this application has a unique id from 0 to the number_of_threads-1 and the problem is partitioned by allocating a block of rows in C to each thread to compute.  If you examine the code, you will notice that there is no sharing on the elements of matrix C (we can chat about issues such as false sharing later if there's interest), so there's actually no need for the critical section (lock) that is used here.  I will use this as an example of showing you how this code will show up in our tool and how we can help you identify the root cause in your application's source code so that you can fix the problem. 

Before using the tool, I have to make sure that I have a good symbol path defined.  Otherwise, the call stacks in the tool will not be very useful.  I have a habit of setting a system symbol path that points to the Microsoft public symbol server that looks something like this: 

set _NT_SYMBOL_PATH=srv*c:\symcache*http://msdl.microsoft.com/download/symbols

Notice that I've also set up a symbol cache so I don't have to keep retrieving symbol files from the network.

Let's take a look at the application's behavior in our tool.  We start by looking at the CPU utilization view:

 You'll notice that the CPU utilization view shows that my application (the green area) is only consuming the equivalent of a single logical core in the system, even though I've parallelized it.  My next work item is to figure out why my application's execution seems to be serialized.  In order to do that, you can switch to the Thread Blocking view, which looks like the following:

First you'll notice that your application is indeed creating 8 threads, which is what you intended.  You will also notice that the threads' execution is serialized (the green execution regions for the threads never overlap).  You can also see a lot of red between the greens, and the legend shows that red corresponds to synchronization delays.  Now you want to know what you did in your application that resulted in this behavior.  All you have to do is to click on a red region.  Here I show a zoomed in view and what happened after I clicked on a segment in the red region of a thread's state. 

A few things happened.  First, the synchronization segment was highlighted.  Second, the "Selection" tab was activated in the bottom frame.  Inside the Selection tab, we show the category of delay, which is user synchronization in this case.  We also show an API in this case: EnterCriticalSection.  Next, we show the call stack that resulted in the thread blocking.  There's a bug in VSTS 2010 Beta 1 that results in an extraneous stack frame after the application (matmult2) stack frame, but you can see that we show the frame showing the call to EnterCriticalSection in ntdll.dll.  Incidentally, if I didn't have a good symbol path set up, the tool would not have been able to show me useful information in the stacks and its value would be considerably reduced.  Now, I can keep clicking on red segments, but that is not very productive.  I would like to know how expensive this particular call stack is so that I can prioritize my performance optimization effort.  You can click on the User Synchronization tab in the lower frame.  What you get there is a call tree report that summarizes, for every synchronization call stack, how often the threads that are enabled in the current view blocked on that call stack during the displayed time period (you should click UpdateStats in the top toolbar to make sure that the stats are updated for the current viewpoint).  In this case, the report looks like:

I've expanded the two call stacks in this report to show the details, but the report says that there were 113 synchronization blocking events.  One call stack was responsible for 75 and the other for 38 blocking events.  You can also see that one of the call stacks involves the main thread since it contains the matmult2!_wmain() frame and the other seems to involve the slave threads that start execution at the function matmult2!MatMult().  If I want to examine the code that resulted in the most significant blocking callstack, I right-click on the stack frame of interest, which brings up a context menu with two options:  "View Source", which takes you to the source code of the function specified in the frame (in the example below that is disabled because we don't have source file data for ntdll.dll), and "View Call Sites", which brings up the place in the previous function where you made the call to this function.  Here's what the UI looks like for this feature (notice that I'm compensating for the bug in Beta 1 by right-clicking on the frame right after the matmult2 frame):

When you choose "View Call Sites" in the above menu, you will be taken to an editor window with the matmult2.cpp file open and the cursor near the call site where you made the call to EnterCriticalSection().  Now you have to spend some time determining whether this lock is needed and what you can do to reduce contention on it.  Of course, in this example, the lock is not needed at all.  The editor screenshot is below (the cursor was at the line following the call because we deal with return addresses when collecting call stacks):

 

Now, there's something curious about the behavior of my application.  In the above screenshots, I ran the application with SIZE=1024, so when I look at the code, I would expect 1024 short execution segments in total (one lock acquisition per row of C) rather than the few long execution segments that I ended up with.  If you're not familiar with the details of critical sections, you might not be aware that they do not enforce a FIFO order on threads waiting for the lock (i.e., they are not fair).  The advantage of this feature is that the first thread to find the lock in a free state is allowed to acquire it even if other threads are waiting on the lock.  The disadvantage is the lack of fairness.  This feature is also referred to as anti-convoy support.  From a performance perspective, the critical section is faster because the lock never has to wait for a blocked thread to wake up, which removes the context switch overhead from the critical path, so in this example, the application will probably be a little faster as a result.  If you want to enforce fairness and FIFO order, you can use a synchronization primitive that does not have anti-convoy features, like the Win32 Mutex.  Here's how you can modify the code to use a Mutex:

for (i = myid*PerProcessorChunk; i < (myid+1)*PerProcessorChunk; i++)
{
    // Acquire the Mutex instead of the critical section.
    WaitForSingleObject(hMutex, INFINITE);
    for (j = 0; j < SIZE; j++)
    {
        for (k = 0; k < SIZE; k++)
        {
            C[i+j*SIZE] += A[i+k*SIZE]*B[k+j*SIZE];
        }
    }
    ReleaseMutex(hMutex);
}

When I compiled the application and collected a profile, I was able to clearly observe the convoy behavior in my thread executions in the Thread Blocking view, as shown below:

I hope that you had fun with this article. Go pick up VSTS Beta 1 and play with our tool!

 

VS2010: How to use the Parallel Performance Analysis Tools

This is the second post in the series about the parallel performance tools that my team is shipping in VS2010.  In the previous post, I gave a quick overview of the features of our tools.  In this post, I will demonstrate how you can start analyzing your multithreaded application's performance using the VS2010 Beta 1 release as a guide.

For the purposes of this tutorial, I'll assume that you have a solution of interest loaded and built in VS.  Although that is not strictly needed to use our tool, we'll concentrate on this scenario for now.  When you're ready to analyze your application, you should open the Analyze menu (if you can't find it, you're probably not using the VS Team System Beta 1, so see my previous post for a link).

Choose the "Launch Performance Wizard" option, which will present you with the following:

 

Select the "Concurrency" profiling method and select the second check box for our tool.  The "resource contention" option is another cool tool that you should use, but for this series, we'll assume that the first option is turned off.  Click "Next" and you'll be presented with a dialog to choose your application.  In this case, my current solution shows up by default:

Since this is fine for my purposes, I just click "Next".  Now, I'm presented with an option to launch the profiler at the end of the performance wizard.  For this walkthrough, I'll assume that you chose the default, which is to launch the profiler:

Click "Finish" and the profiler will launch your application and collect data.  This should also bring up the "Performance Explorer" window in Visual Studio:

 

The performance wizard has created a "Performance Session" named "MatMult2-2" in this case, the default profiling method "Concurrency" is shown in the pull-down menu, my solution's executable is listed under the "Targets" folder, and a "Reports" folder is shown.  There are multiple buttons on the explorer's toolbar. From left to right, these buttons are used to launch the Performance Wizard, create a new Performance Session, run a profiling session (using launch), stop profiler data collection, and then attach/detach if you'd like to profile a running application.  In our walkthrough, collection will stop if you click the stop button (in the performance explorer window's toolbar) or when the process terminates (whether normally or by user action).  Once data collection completes, the profiler will generate a profiling report with a .vsp extension and add it to the reports folder of the associated performance session.  By default, the profiler will immediately open the profile report after collection completes.  To access the views described in my previous post, you can choose the appropriate view from the "Current View" pulldown menu in the profile report toolbar:

 Next, I will post some usage examples to illustrate how you may use our tool to understand and fix performance issues.

 Enjoy!

Visual Studio 2010 Beta 1: Parallel Performance Tools Overview

Today, Microsoft’s Developer Division released Visual Studio 2010 Beta 1 for general download.  VS2010 is a fully installable release that you can use to preview the great features that we have been working on.  I’m especially excited about the beta release of the parallel performance analysis tools that my team has been working hard on.  As you'll notice from the screenshots below, our tool has come a long way since my PDC 2008 talk.  I believe that we’re giving our developers something special that will make it easier for them to understand many aspects of the behavior of multithreaded applications on Windows.

In my first blog about our Beta 1 release, I’m going to give you an overview of some of the features of our tool.  I will follow up with a series of other articles over the next weeks on how the tool may be used to pinpoint issues and address them.  Just to be clear, the tool described here is shipping with Visual Studio Team System, so make sure to install that version to get your hands on it!

 

CPU Utilization / Concurrency Analysis

This is the main starting point for our tool.  What you will see here is a graph of the number of “logical” cores (remember that physical cores with hyperthreading will appear as multiple logical cores) in the system on which you collected the trace shown on the y-axis and time shown on the x-axis.  Your process’ consumption of cores is shown in a green area curve at the bottom of the graph.  We also show cores that are free in the grey area, cores that are used by the System process in a red area, and cores that are used by “other” processes that were running on the system when you collected the trace in an orange area.  The legend on the right hand side of the graph is a good reminder. 

The main purpose of this view is to help the user focus her attention on a period of execution that is of interest.  A user might be doing analysis for many reasons, depending on the phase of the development cycle that they are in.  For example, someone who is interested in parallelizing an existing application might be interested in CPU-bound regions or periods where there does not seem to be much CPU activity, which could indicate stalls due to I/O.  Another user might have parallelized an application, but he is not seeing the speedup that he expected and wants to confirm whether he is seeing the level of concurrency that he expected.  Using this view, the user can visually identify this area of interest, zoom in on it by clicking and dragging the mouse, and then switch to the thread blocking analysis view.  Here’s a snapshot of the CPU utilization view:

CPU Utilization View

Thread Blocking Analysis

This is the main view of our analysis tool.  Its purpose is to analyze the execution of each thread in the process of interest to identify blocking events that may indicate performance bottlenecks.  Each blocking event is mapped to a category, such as synchronization, or I/O.  The user can then analyze the reason for the blocking event by using interactive callstacks or callstack-based summary reports to understand the root cause of the problem.  Because the tool is integrated in the IDE, from the summary reports, the user can also view the source code in their project that may be the root cause of a delay.  There are also graphs that summarize where threads were spending their time (e.g., running or blocked), as well as many features to hide/sort threads in order to minimize noise in the reports.  In addition to threads, we also show physical disk I/O from the user application or the System process during trace collection.  This helps users identify the causes of I/O delays, or even page faults (e.g., loading a DLL, or paging).  Further, it is often hard to identify inter-thread dependencies, so we have a special feature that can help identify threads that wait on others and what the latter were doing when they released a blocked thread.  This is a great way of identifying work dependencies in your application.  Finally, when threads are executing, we provide a way of sampling the execution callstacks.  That can be very valuable in correlating the visualization with what code was running at a given period of time.  Here’s a snapshot of the thread blocking view:

Thread Blocking View

 

Core Execution / Thread Migration

The third view in our tool shows how application threads were scheduled on the logical cores in the system.  Using this view, you can identify excessive thread migrations (a thread being moved to another core as a result of a context switch), which can reduce performance due to caching effects.  You can also use this view to understand the impact of thread affinity settings on an execution.  Threads are associated with different colors that are displayed in time along the x-axis corresponding to the logical core that they were scheduled on.  Once you’ve identified a behavior of interest, you can zoom in on that time segment and switch to the Thread Blocking view for more in-depth analysis (e.g., what caused thread blocking events that resulted in thread migration?).  Here’s a snapshot of the Core Execution view:

 Core Execution

 

MVP Summit 2009 Presentations on Visual Studio Parallel Performance Tools

I'm excited to be presenting our work on parallel performance tools in Visual Studio 2010 to Microsoft MVPs today.  I will be covering the features of our tools and improvements that we have made since PDC for native and managed developers.  I'm looking forward to a healthy discussion and any follow-ups here.

Basic Process for Parallelizing an Application

Many of our customers are asking us for guidance on what they can do to make use of multicore systems.  This is a good time to be thinking about this problem since single thread performance will remain relatively flat for the foreseeable future.  As developers throw more features into applications, there's a potential for performance regressions.  So, the question here is really about the process to follow in order to parallelize applications.  Below, I will outline a process that I've been advocating internally in my parallel development class as a practical approach to the problem.  I can provide more details upon request.

Step 1: Identifying Your Goals 

Before we even start discussing a parallelization process, we need to be aware of a primary principle: parallelism is about performance, and the performance that you should care about is a function of the specific application and usage scenarios.  So, the first step in any performance improvement effort needs to include the following:

  • Identify the application phases and/or usage scenarios that are important from a performance perspective.
  • For each such phase or scenario, decide on a performance goal.  This step is critical because it drives the "completion" criterion of any performance effort.  For some usage scenarios, especially for interactive applications, setting the goals may be relatively straightforward.  For example, usability experts typically set a limit of 200ms on any operation that is expected to be instantaneous by users.  For other, more time-consuming scenarios, the challenge is often making perceived response time short, so hiding latency is often an important feature of a responsive application.

Step 2: Measuring Your Baseline Performance

Now that you've identified your scenarios, you have to evaluate your current state compared to your ultimate goals.  The important steps here are:

  • Create a performance test harness for each scenario identified in step 1.
  • Run these tests on every build to track regressions.
  • Identify scenarios that are missing their goals.
  • For each failing scenario, identify the components of your application that contribute the most to execution time.  This step requires two important pieces of work.  First, instrument the code to identify where time is spent so that improvements/regressions may be narrowed down easily.  Second, use profiling tools.  For this last step, you need to be careful about your choice of tools.  I can offer more help here in another post if people are interested.

Step 3: Traditional Tuning Comes First

If a scenario fails, you should attempt traditional performance tuning before considering parallelism.  Why?  Unnecessarily introducing parallelism can become a burden on your development and test resources.  With parallelism, you start introducing nondeterminism, race conditions, synchronization overheads, etc. and these are issues that just increase development and testing overheads.  So, it's best to avoid them. 

The main culprits for single thread performance are I/O and cache behavior (memory latency really).  This is a large topic that we can also get into, but here are some key points that you should keep in mind:

  • Reduce unnecessary I/O.
  • Hide I/O latency when it cannot be avoided.
  • Exploit available bandwidth to reduce latency by coalescing I/O requests if possible.
  • Improve the temporal and spatial locality of your code to reduce cache misses and associated memory latency.
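
To make the coalescing point concrete, here is a small Python sketch under stated assumptions: the function names and the 64 KB batch size are illustrative choices, not recommendations.  It writes the same records either with one write call per record or with one write call per accumulated batch, and returns the number of calls issued so the difference is visible.

```python
import os
import tempfile

def _write_all(fd, data):
    """Write every byte of data, retrying if the OS accepts only part of it."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]

def write_individually(fd, records):
    """Issue one write request per record; returns the number of requests."""
    calls = 0
    for record in records:
        _write_all(fd, record)
        calls += 1
    return calls

def write_coalesced(fd, records, batch_bytes=64 * 1024):
    """Accumulate records in memory and issue one write request per batch."""
    calls = 0
    buffer = bytearray()
    for record in records:
        buffer += record
        if len(buffer) >= batch_bytes:
            _write_all(fd, bytes(buffer))
            calls += 1
            buffer.clear()
    if buffer:  # flush whatever is left over
        _write_all(fd, bytes(buffer))
        calls += 1
    return calls

if __name__ == "__main__":
    records = [b"x" * 100 for _ in range(1000)]
    fd, path = tempfile.mkstemp()
    try:
        print("requests issued:", write_coalesced(fd, records))
    finally:
        os.close(fd)
        os.remove(path)
```

Whether fewer, larger requests actually help depends on the device and on how much buffering the runtime and OS already do, so this is something to measure with the harness from step 2.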

Step 4: Identify Opportunities for Parallelism

If you're here, then you've attempted step 3 for failing scenarios and you have reached the conclusion that you cannot meet your performance goals unless you parallelize.  There are two classes of parallelism that you can usually exploit: hiding latency from your user (or overlapping work) and reducing the latency of computation.  Notice that the first is often related to I/O, but can also be used to hide CPU-bound work.  The second is almost always a technique to speed up CPU-bound work.  The main forms of parallelism that you're likely to encounter fall under one or more of these categories:

  • Task-Level Parallelism: Multiple time-consuming independent tasks on the critical path can be executed in parallel.  Note that the word "task" has been overloaded significantly in our industry.  Here, I mean the dictionary (not CS-specific) definition.
  • Pipelining: A task may be broken down into independent stages that can execute in parallel.  If many such tasks exist, this can significantly improve the throughput of task handling at a (hopefully small) latency penalty due to handshaking at stage boundaries.
  • Data Parallelism: This is an area that I predict will see the most use for parallelism in the near future.  This form of parallelism exists when the application performs effectively the same work on independent pieces of data.  This can be exploited by dividing the data/work across multiple CPUs as well as possibly using SIMD (Single Instruction Multiple Data) instruction sets (à la MMX/SSE).
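
The data-parallel case can be sketched in a few lines of Python.  This is only an illustration of the structure (split the data, do the same work on each piece, combine the partial results); the names chunks, sum_of_squares, and parallel_sum_of_squares are hypothetical.  Note the assumption: in CPython, threads will not speed up CPU-bound work like this because of the global interpreter lock, so a process pool or a native library such as the Parallel Pattern Library would be needed for a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(data, n):
    """Split data into n nearly equal contiguous slices."""
    size, extra = divmod(len(data), n)
    start = 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        yield data[start:end]
        start = end

def sum_of_squares(chunk):
    """The same work applied independently to each piece of data."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4):
    """Divide the data across workers, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, chunks(data, workers))
        return sum(partials)

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

The key property is that no chunk depends on another, so the only coordination needed is the final combine.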

Step 5: Decide on Phase(s) to Parallelize

The decision should be based on some estimates of potential improvement.  You can optimistically assume linear speedup (equal to the number of cores working on a problem) for now.  Using the performance instrumentation that you've been relying on so far, estimate what fraction of the target scenario's execution time will be improved, and plug this data into something like Amdahl's Law to get an estimate of potential speedup for the whole scenario.  This is a critical step to avoid investing time in work that will not achieve your goals.
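
As a worked example of this estimate, Amdahl's Law says that if a fraction p of the scenario's time is parallelized with linear speedup on n cores, the overall speedup is 1 / ((1 - p) + p / n).  The Python sketch below just evaluates that formula; the function name and the sample numbers (60% of the time parallelizable, 4 cores) are illustrative assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Overall speedup when parallel_fraction of the time scales linearly across cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

if __name__ == "__main__":
    # Hypothetical scenario: 60% of the time is in the phase we plan to parallelize.
    for cores in (2, 4, 8):
        print(cores, "cores:", round(amdahl_speedup(0.6, cores), 2))
```

With p = 0.6 the speedup can never exceed 1 / 0.4 = 2.5 no matter how many cores are used, so a scenario that needs to run 3x faster cannot reach its goal by parallelizing this phase alone.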

Step 6: Implement the Parallel Algorithm

This is a large topic, but I offer some guidance here:

  • There are usually well known coding patterns for exploiting the main types of parallelism (e.g., thread pools for task-level parallelism, parallel for loops such as the ones provided by our Task Parallel Library and Parallel Pattern Library).  You should become familiar with the common tools at your disposal.
  • Reduce sharing (especially write sharing) of data across parallelism boundaries to avoid cache coherence overheads with their associated latencies.
  • This is related to the above, but reducing synchronization overheads is critical.  If you come across a solution that requires a lock in an inner loop, that's probably a bad idea!
  • Remember that the best sequential algorithm may not be the best parallel algorithm.  So, be open minded to potentially changing your basic implementation.
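
The points about sharing and inner-loop locks can be illustrated with a small Python sketch.  Both functions below count matching items across several chunks using threads; the names are hypothetical and the example is about structure, not about measured speed in CPython.  The first version takes a lock for every match inside the inner loop; the second accumulates into a thread-private variable and takes the lock once per thread.

```python
import threading

def _run_threads(worker, chunks):
    """Start one thread per chunk and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(chunk,)) for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def count_matches_shared(chunks, predicate):
    """Pattern to avoid: the shared counter is locked inside the inner loop."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        for item in chunk:
            if predicate(item):
                with lock:  # contended on every match
                    total += 1

    _run_threads(worker, chunks)
    return total

def count_matches_private(chunks, predicate):
    """Preferred: accumulate privately, then synchronize once per thread."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        local = 0  # thread-private, no sharing in the loop
        for item in chunk:
            if predicate(item):
                local += 1
        with lock:  # one acquisition per thread
            total += local

    _run_threads(worker, chunks)
    return total
```

Both versions return the same answer; the second simply moves the write sharing and the synchronization out of the hot loop, which is the general shape of a parallel reduction.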

Step 7: Tune and Iterate

Repeat the measurement step, and don't forget about tuning the cache behavior of your parallel implementation.  The tools that my team is shipping in Visual Studio 2010 will be very valuable for parallel tuning.  For example, you'll have to reduce synchronization overheads and identify why UI threads block and contribute to UI hangs, and our tools will be a great help with both.

 I think that I've included enough here to give you a feel for what the process might look like.  There are many points in this post that can be significantly elaborated on, but I'll leave the choice to you folks.  Let me know what your questions/concerns are and I'll try to answer them.

 Take care.

Come Test Drive our Technology at Supercomputing 2008
This week, a few members of the Parallel Computing Platform team are at Supercomputing 2008 in Austin.  We will be demonstrating the parallel programming tools (e.g., Parallel Pattern Library and Task Parallel Library) as well as the parallel debugging and performance analysis tools that will ship in Visual Studio 2010.  Please stop by if you're attending.  Our demo station is located in the Windows HPC Server 2008 booth area, which is hard to miss. 
Visual Studio Concurrency Analysis Performance Tools Announced at PDC

Hi,

Yesterday, the Parallel Computing Platform team at Microsoft announced the parallel performance tools shipping in Visual Studio 2010 during my presentation at PDC2008.  If you missed it, you can view the presentation, including demos here.  The team is really excited to share some of the new features in Visual Studio 2010 that are going to make it easier to express (i.e., code) parallelism, debug, and tune parallel applications.  Although it was my pleasure to give the presentation, I want you all to know that there's a fantastic team of devs, PMs, and testers at Microsoft that's making all this technology happen.  Feel free to share your questions & comments here and I promise to respond. 

Introduction

Hi,

My name is Hazim Shafi and I'm an Architect in the Parallel Computing Platform team at Microsoft.  My primary responsibility is parallel performance analysis tools.  I received a BSEE from Santa Clara and MS and PhD degrees from Rice University.  For the past 15 years, I've been primarily working on various aspects of parallel and distributed processing.  Most of my experience has been in efforts to improve the performance of shared-memory multiprocessor systems, primarily through techniques to reduce or tolerate memory latency (think of memory/cache hierarchy design, cache coherence, and cooperative hardware/software techniques in this area).  In the process, I gained a lot of experience in building parallel performance analysis tools and general performance tuning techniques for parallel applications.  Before joining Microsoft, I was a Research Staff Member at IBM Research where I worked on the Sony PlayStation 3 (Cell Architecture), the design and performance evaluation of a petaflop high performance computing system (DARPA HPCS program), and power-aware computing.

I'm thrilled to be participating in the upcoming PDC2008 conference later this month (http://www.microsoftpdc.com), where I will be showing off some really cool technology.  My session is on Monday 10/27 at 1:45pm and is titled "Microsoft Visual Studio: Bringing out the Best in Multicore Systems".  Stay tuned here after the conference starts for an exchange about what's coming.  In the meantime, you can see a preview of some of the cool tools that we'll demo at PDC in our article "Improved Support For Parallelism In The Next Version Of Visual Studio" in the October issue of MSDN Magazine. 

I hope that this blog will be a good forum to get to know our customers, understand their needs, and hopefully help them be more successful.

 
