<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/atom.xsl" media="screen"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US"><title type="html">Hazim Shafi's Blog</title><subtitle type="html">Concurrency Visualizer: Parallel Performance Tools for Windows</subtitle><id>http://blogs.msdn.com/hshafi/atom.xml</id><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/default.aspx" /><link rel="self" type="application/atom+xml" href="http://blogs.msdn.com/hshafi/atom.xml" /><generator uri="http://communityserver.org" version="2.1.61025.2">Community Server</generator><updated>2008-10-19T20:59:00Z</updated><entry><title>Concurrency Visualizer: Avoiding Interference During Profile Collection</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/12/29/concurrency-visualizer-avoiding-interference-during-profile-collection.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/12/29/concurrency-visualizer-avoiding-interference-during-profile-collection.aspx</id><published>2009-12-29T18:43:00Z</published><updated>2009-12-29T18:43:00Z</updated><content type="html">&lt;P&gt;Those of you who are used to doing performance analysis can appreciate the value of reducing interference between your application and other applications and services running on the system under study.&amp;nbsp; So far, I've been using the Visual Studio IDE to show you how you can collect and analyze a profile.&amp;nbsp; Since Visual Studio itself can be a resource intensive application, it is sometimes desirable to collect a profile without the IDE's assistance.&amp;nbsp; Further, it is sometimes desirable to collect a profile on a system that does not have Visual Studio installed.&amp;nbsp; For these purposes, the Concurrency Visualizer comes with support in the Visual Studio profiler command-line tools.&amp;nbsp; The command-line tools allow both launch and attach profile collection.&amp;nbsp; Here's how 
you can accomplish this:&lt;/P&gt;
&lt;P&gt;&lt;U&gt;Launch Scenario:&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;1. Open a Visual Studio Command Prompt window as an Administrator (remember, ETW-based collection requires high privileges).&amp;nbsp; The tool is usually found at Start-&amp;gt;All Programs-&amp;gt;Microsoft Visual Studio 2010-&amp;gt;Visual Studio Tools-&amp;gt;Visual Studio Command Prompt (2010).&amp;nbsp; The console will have the appropriate paths set up for the tools that we'll be using.&lt;/P&gt;
&lt;P&gt;2. Start profiling and launch the application of interest using the following command (I usually do this from the directory containing the program of interest):&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vsperfcmd /start:CONCURRENCY,THREADONLY /launch:fullpathtoprogram.exe /output:profilefilename&lt;/P&gt;
&lt;P&gt;3. Now the application should be launched and you can perform your test.&amp;nbsp; When finished, if you terminated the application, you can just run the following command to complete the profile collection:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vsperfcmd /shutdown&lt;/P&gt;
&lt;P&gt;4. When the above command completes, you will find the profile file "profilefilename.vsp" in the current directory.&amp;nbsp; All you need to do now is to open this .vsp file in Visual Studio (Ultimate or Premium) using the File-&amp;gt;Open-&amp;gt;File menu option.&amp;nbsp; Just so that you know, two other files in the same directory also contain profile data: profilefilename.app.ctl and profilefilename.krn.ctl.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;U&gt;Attach Scenario:&lt;/U&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;1. Find the process id (PID) of the application that you're interested in.&amp;nbsp; You can use Task Manager to do that by enabling the PID column in the Processes tab using the View-&amp;gt;Select Columns option.&lt;/P&gt;
&lt;P mce_keep="true"&gt;2. Attach to the process for analysis using the following command:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vsperfcmd /start:CONCURRENCY,THREADONLY /attach:&amp;lt;pid of the process&amp;gt; /output:profilefilename &lt;/P&gt;
&lt;P mce_keep="true"&gt;3. When you're done profiling the usage pattern that you're interested in, run the following commands:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vsperfcmd /detach&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; vsperfcmd /shutdown&lt;/P&gt;
&lt;P mce_keep="true"&gt;4. When the above commands complete, you will find the profile file "profilefilename.vsp" in the current directory.&amp;nbsp; All you need to do now is to open this .vsp file in Visual Studio (Ultimate or Premium) using the File-&amp;gt;Open-&amp;gt;File menu option.&lt;/P&gt;
&lt;P mce_keep="true"&gt;If you'd like to collect on a system that does not have Visual Studio installed, you will need to install the Standalone Profiler tools.&amp;nbsp; There's a directory on the Visual Studio DVD containing the installer for these tools.&amp;nbsp; You will need to run this as an admin because it installs a driver.&amp;nbsp; In addition, the command-line tools require .NET 4.&lt;/P&gt;
&lt;P mce_keep="true"&gt;That's all you need to collect profiles without the overhead of Visual Studio.&amp;nbsp; Now go give these a try!&lt;/P&gt;
&lt;P mce_keep="true"&gt;-Hazim&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9942005" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>Concurrency Visualizer: Linking Visualizations to Application Phases</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/11/13/concurrency-visualizer-linking-visualizations-to-application-phases.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/11/13/concurrency-visualizer-linking-visualizations-to-application-phases.aspx</id><published>2009-11-13T16:13:00Z</published><updated>2009-11-13T16:13:00Z</updated><content type="html">&lt;P&gt;In my PDC 2008 presentation, I showed how the Concurrency Visualizer in Visual Studio 2010 allows users the option of instrumenting their code in order to link the visualizations with application constructs or phases of execution.&amp;nbsp; The Concurrency Visualizer does not require any instrumentation to function, but for some complex application scenarios, it is often difficult to identify the regions of execution that are of interest to us.&amp;nbsp; This is because a common performance investigation is usually focused on a certain "problem" that manifests itself during a portion of an application's execution.&amp;nbsp; &lt;/P&gt;
&lt;P&gt;For VS2010 Beta 2, we have released a simple API that can be used for this purpose.&amp;nbsp; This API is called Scenario and is available for download for free from &lt;A href="http://code.msdn.microsoft.com/Scenario" mce_href="http://code.msdn.microsoft.com/Scenario"&gt;http://code.msdn.microsoft.com/Scenario&lt;/A&gt;.&amp;nbsp; There are both native and managed implementations of this API, depending on the application that you are dealing with.&amp;nbsp; The Scenario API includes many features that may be of interest to the user, so I urge you to read the documentation to learn about it.&amp;nbsp; For our purposes, the Scenario API encapsulates the work necessary to generate ETW events that are consumed by the Concurrency Visualizer.&amp;nbsp; In order to use it, you need to instantiate a Scenario object, and then mark the phases that are important to you by invoking the Begin and End APIs.&amp;nbsp; When you do so, the Concurrency Visualizer will mark these regions with vertical markers in the CPU Utilization view, or rectangular regions in the Threads and Cores views.&amp;nbsp; Each Scenario has an associated string describing it and these strings are shown in tooltips in the views.&amp;nbsp; In the CPU Utilization view, the strings are shown when you hover on the vertical markers.&amp;nbsp; In the other views, they show up when you hover on the horizontal connectors of the Scenario rectangles.&amp;nbsp; If you're interested in analyzing work that happens in one of these regions, you can just zoom in on it and then switch among the various views and examine reports or interact with the UI to get your work done.&amp;nbsp; You can also use the measurement tool in the Threads view toolbar to measure the time it takes to execute the various phases/scenarios.&amp;nbsp; Here's a simple example that you can use to try out this functionality in VS2010 Beta 2 after downloading the appropriate bits from the above website.&amp;nbsp; &lt;/P&gt;
&lt;P mce_keep="true"&gt;#include "stdafx.h"&lt;BR&gt;#include "Scenario.h"&lt;/P&gt;
&lt;P mce_keep="true"&gt;int _tmain(int argc, _TCHAR* argv[])&lt;BR&gt;{&lt;BR&gt;&amp;nbsp;double *testarray = (double *) malloc(100000000*sizeof(double));&lt;BR&gt;&amp;nbsp;// Initialize the Scenario object that we will use to mark phases&lt;BR&gt;&amp;nbsp;Scenario *myScenario = new Scenario(0, L"Scenario Example", (LONG) 0);&lt;BR&gt;&amp;nbsp;// Mark the start of initialization phase&lt;BR&gt;&amp;nbsp;myScenario-&amp;gt;Begin(0, TEXT("Initialization"));&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;// Initialization&lt;BR&gt;&amp;nbsp;for (int i=0; i&amp;lt;100000000; i++)&lt;BR&gt;&amp;nbsp;&amp;nbsp;testarray[i] = 0.0;&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;// Mark end of initialization&lt;BR&gt;&amp;nbsp;myScenario-&amp;gt;End(0, TEXT("Initialization"));&lt;BR&gt;&amp;nbsp;// Mark start of work phase&lt;BR&gt;&amp;nbsp;myScenario-&amp;gt;Begin(0, TEXT("Work Phase"));&lt;/P&gt;
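&lt;P mce_keep="true"&gt;&amp;nbsp;// Work&lt;BR&gt;&amp;nbsp;srand(31);&lt;BR&gt;&amp;nbsp;for (int i=0; i&amp;lt;100000000; i++)&lt;BR&gt;&amp;nbsp;&amp;nbsp;// Cast first: rand()/RAND_MAX is integer division and is almost always 0&lt;BR&gt;&amp;nbsp;&amp;nbsp;testarray[i] = (double) rand()/RAND_MAX * 10000.0;&lt;/P&gt;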
&lt;P mce_keep="true"&gt;&amp;nbsp;// Mark end of work phase&lt;BR&gt;&amp;nbsp;myScenario-&amp;gt;End(0, TEXT("Work Phase"));&lt;BR&gt;&amp;nbsp;exit(0);&lt;BR&gt;}&lt;/P&gt;
&lt;P&gt;When you&amp;nbsp;profile this&amp;nbsp;app, you'll notice rectangular regions such as the ones depicted below that correspond to each Scenario Begin/End pair in the application.&amp;nbsp; If you hover the mouse on the horizontal bars, you'll get a tooltip containing the text string that you associated with the Scenario phases.&amp;nbsp; You can now&amp;nbsp;zoom in on a phase that interests you, make timing measurements with the measurement tool etc.&amp;nbsp; The Cores view has a similar UI as shown, but the CPU Utilization view shows vertical bars for the begin and end markers instead of rectangular regions.&amp;nbsp; Unfortunately, in Beta 2 the bars use a shade of grey that's hard to see.&amp;nbsp; You can also hover on those vertical markers to get the Scenario text.&amp;nbsp; &lt;/P&gt;
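&lt;P&gt;One way to keep every Begin paired with its End, even on an early return, is to wrap the pair in a small scope guard.&amp;nbsp; What follows is only a sketch and is not part of the Scenario download: PhaseGuard and RecordingMarker are hypothetical names, and RecordingMarker is a stand-in that just records the calls so that the sample builds without Scenario.h.&amp;nbsp; It assumes nothing about the real class beyond the Begin(0, text) and End(0, text) calls used in the example above.&lt;/P&gt;

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the Scenario object: it only records the calls,
// so this sketch compiles without Scenario.h.
struct RecordingMarker {
    std::vector<std::wstring> calls;
    void Begin(int /*id*/, const wchar_t* text) { calls.push_back(std::wstring(L"Begin:") + text); }
    void End(int /*id*/, const wchar_t* text)   { calls.push_back(std::wstring(L"End:") + text); }
};

// Hypothetical scope guard: calls Begin in the constructor and the matching
// End in the destructor, so a phase is closed on every exit path.
template <typename Marker>
class PhaseGuard {
public:
    PhaseGuard(Marker& marker, const wchar_t* text) : marker_(marker), text_(text) {
        assert(text_ != nullptr);
        marker_.Begin(0, text_);
    }
    ~PhaseGuard() { marker_.End(0, text_); }

    // Copying would produce a second End for the same Begin, so forbid it.
    PhaseGuard(const PhaseGuard&) = delete;
    PhaseGuard& operator=(const PhaseGuard&) = delete;

private:
    Marker& marker_;
    const wchar_t* text_;
};

// Runs two phases back to back and returns the recorded call order.
std::vector<std::wstring> RunTwoPhases() {
    RecordingMarker marker;
    {
        PhaseGuard<RecordingMarker> phase(marker, L"Initialization");
        // ... initialization work goes here ...
    }
    {
        PhaseGuard<RecordingMarker> phase(marker, L"Work Phase");
        // ... main work goes here ...
    }
    return marker.calls;
}
```

&lt;P&gt;With the real API in a Unicode build, the guard would presumably be declared as PhaseGuard&amp;lt;Scenario&amp;gt; phase(*myScenario, L"Work Phase"); I haven't verified that against the shipped header, so treat it as an assumption.&lt;/P&gt;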
&lt;P&gt;Now go give this a try!&amp;nbsp;&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG style="WIDTH: 476px; HEIGHT: 234px" src="http://blogs.msdn.com/photos/hshafi/images/9935360/original.aspx" width=610 height=277 mce_src="http://blogs.msdn.com/photos/hshafi/images/9935360/original.aspx"&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9922036" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>VS2010 Concurrency Visualizer: Parallel Performance Demystified!</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/10/22/vs2010-parallel-performance-demystified.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/10/22/vs2010-parallel-performance-demystified.aspx</id><published>2009-10-22T17:54:00Z</published><updated>2009-10-22T17:54:00Z</updated><content type="html">&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;In my previous post, I mentioned the "Demystify" feature of our tool that isn't quite working in the VS2010 Beta 2 release (Premium and Ultimate versions).&amp;nbsp; Our team has now placed a web-based preview of this feature on our &lt;A href="http://blogs.msdn.com/visualizeparallel/" mce_href="http://blogs.msdn.com/visualizeparallel/"&gt;Team Blog&lt;/A&gt;.&amp;nbsp; Demystify is a great way of learning about our tool's features and it will be in the final release.&amp;nbsp; Give it a go and use it as a valuable resource while you're trying out our tool.&amp;nbsp; Please keep an eye on both blogs for more information about our tool.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;Enjoy!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;-Hazim&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9911568" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>VS2010 Beta 2 Concurrency Visualizer Parallel Performance Tool Improvements</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/10/17/vs2010-beta-2-parallel-performance-tool-improvements.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/10/17/vs2010-beta-2-parallel-performance-tool-improvements.aspx</id><published>2009-10-17T16:57:00Z</published><updated>2009-10-17T16:57:00Z</updated><content type="html">&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I'm very excited about&amp;nbsp;the release of &lt;A title="Visual Studio 2010 Beta 2" href="http://go.microsoft.com/fwlink/?LinkID=151797" mce_href="http://go.microsoft.com/fwlink/?LinkID=151797"&gt;Visual Studio 2010 Beta 2&lt;/A&gt;&amp;nbsp;that is going to be available to MSDN subscribers today and to the general public on 10/21.&amp;nbsp; This release includes significant improvements in many areas that I'm sure you'll love.&amp;nbsp; But, as the Architect of the Concurrency&amp;nbsp;Visualizer tool in the VS2010 profiler, I'm extremely thrilled to share with you the huge improvements in the user interface and usability of our tool.&amp;nbsp; Our team has done an outstanding job in listening to feedback and&amp;nbsp;making innovative enhancements that I'm sure will please you,&amp;nbsp;our customers.&amp;nbsp; Here is a brief overview of some of the improvements that we've made:&lt;/P&gt;
&lt;P mce_keep="true"&gt;Before we start, I'll remind you again that the tool that I'm describing here is the "visualize the behavior of a multithreaded application" option under the concurrency option in the Performance Wizard accessible through the Analyze Menu.&amp;nbsp; I've described how the tool can be run in a previous post.&amp;nbsp; Here's a screenshot of the performance wizard with the proper selection to use our tool:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG style="WIDTH: 384px; HEIGHT: 330px" title="Beta 2 Performance Wizard" alt="Beta 2 Performance Wizard" align=left src="http://blogs.msdn.com/photos/hshafi/images/9908583/original.aspx" width=384 height=330 mce_src="http://blogs.msdn.com/photos/hshafi/images/9908583/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;Ok, now let's start going over the changes.&amp;nbsp; First, we've slightly changed the names of the views for our tool.&amp;nbsp; We now have "CPU Utilization", "Threads", and "Cores" views.&amp;nbsp; These views can be accessed either through the profiler toolbar's &lt;STRONG&gt;Current View&lt;/STRONG&gt; pull-down menu,&amp;nbsp;through bitmap buttons at the top of the summary page, or through links in our views as you'll see in the top left of the next screenshot.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG style="WIDTH: 600px; HEIGHT: 450px" title=bets2threadsview alt=bets2threadsview src="http://blogs.msdn.com/photos/hshafi/images/9908616/original.aspx" width=600 height=450 mce_src="http://blogs.msdn.com/photos/hshafi/images/9908616/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;You'll notice that the user interface has gone through some refinement since Beta 1 (see&amp;nbsp;my earlier posts for a comparison).&amp;nbsp; Let's go over the features quickly:&lt;/P&gt;
&lt;P mce_keep="true"&gt;1. We've added an active legend in the lower left.&amp;nbsp; The active legend has multiple features.&amp;nbsp; First, for every thread&amp;nbsp;state category, you can click on the legend entry to get a callstack-based report in the&amp;nbsp;&lt;STRONG&gt;Profile Report&lt;/STRONG&gt; tab&amp;nbsp;summarizing where blocking events occurred in your application.&amp;nbsp; For the execution category, you get a sample profile that tells you what work your application performed.&amp;nbsp; As usual, all of the reports are filtered by the time range that you're viewing and the threads that are enabled in the view.&amp;nbsp; You can change this by zooming in or out and by disabling threads in the view to focus your attention on certain areas.&amp;nbsp; The legend also provides a summary of where time was spent as percentages shown next to the categories.&lt;/P&gt;
&lt;P mce_keep="true"&gt;2. When you select an area in a thread's state, the "&lt;STRONG&gt;Current Stack&lt;/STRONG&gt;" tab shows where your thread's execution stopped for blocking categories, or the nearest execution sample callstack within +/- 1ms of&amp;nbsp;where you clicked for&amp;nbsp;green segments.&lt;/P&gt;
&lt;P mce_keep="true"&gt;3.&amp;nbsp;When you select a blocking category, we also try to draw a link (dark line shown in the screenshot) to the thread that resulted in unblocking your thread whenever we are able to make that determination.&amp;nbsp; In addition, the &lt;STRONG&gt;Unblocking Stack&lt;/STRONG&gt; tab shows you what the unblocking thread was doing by displaying its callstack when it unblocked your thread.&amp;nbsp; This is a great mechanism to understand thread-to-thread dependencies.&lt;/P&gt;
&lt;P mce_keep="true"&gt;4. We've also improved the &lt;STRONG&gt;File Operations&lt;/STRONG&gt; summary report that is accessible from the active legend by also listing file operations performed by the System process.&amp;nbsp; Some of those accesses are actually triggered on behalf of your application, so we list them but clearly mark them as System accesses.&amp;nbsp; Some of those accesses may not be related to your application.&lt;/P&gt;
&lt;P mce_keep="true"&gt;5. The &lt;STRONG&gt;Per Thread Summary&lt;/STRONG&gt; report is the same bar graph breakdown of where each thread's time was spent that used to show up by default in Beta 1; it can now be accessed from the active legend.&amp;nbsp; This report helps you understand improvements/regressions from one run to another and focuses your attention on the threads and types of delay that are most important in your run.&amp;nbsp; This is valuable for filtering threads/time and prioritizing your tuning effort.&lt;/P&gt;
&lt;P mce_keep="true"&gt;6. The profile reports now have two additional features.&amp;nbsp; By default, we now filter out the callstacks that contribute &amp;lt; 2% of blocking time (or samples for execution reports) to minimize noise.&amp;nbsp; You can change the noise reduction percentage yourself.&amp;nbsp; We also allow you to remove stack frames that are outside your application from the profile reports.&amp;nbsp; This can be valuable in certain cases, but it is left off by default because blocking events usually do not occur in your code, so filtering that stuff out may not help you figure out what's going on.&lt;/P&gt;
&lt;P mce_keep="true"&gt;7. We added significant help content to the tool.&amp;nbsp; You'll notice the &lt;STRONG&gt;Hints&lt;/STRONG&gt; tab that was added; it includes instructions about features of the view as well as links to two important help items.&amp;nbsp; One is a link to our &lt;STRONG&gt;Demystify&lt;/STRONG&gt; feature, which is a graphical way to get contextual help.&amp;nbsp; This is also accessible through the button on the top right hand side of the view.&amp;nbsp; Unfortunately, the link isn't working in Beta 2, but we are working on hosting an equivalent web-based version to assist you and to get feedback before the release is finalized.&amp;nbsp; I'll communicate this information in a subsequent post.&amp;nbsp; The other link is to a repository of graphical signatures for common performance problems.&amp;nbsp; This can be an awesome way of building a community of users and leveraging the experiences of other users and our team to help you identify potential problems.&amp;nbsp; &lt;/P&gt;
&lt;P mce_keep="true"&gt;8. The UI has been improved to preserve details when you zoom out by allowing multiple colors to reside within a thread execution region when the same pixel in the view corresponds to multiple thread states.&amp;nbsp; This was the mechanism that we chose to always report the truth and give the users a hint that they need to zoom in to get more accurate information.&lt;/P&gt;
&lt;P mce_keep="true"&gt;The next screenshot shows you a significantly overhauled "Cores" view:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG style="WIDTH: 600px; HEIGHT: 450px" src="http://blogs.msdn.com/photos/hshafi/images/9908818/original.aspx" width=600 height=450 mce_src="http://blogs.msdn.com/photos/hshafi/images/9908818/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;The Cores view serves the same purpose as before;&amp;nbsp;namely, showing how your application threads were scheduled on the logical cores in your system.&amp;nbsp; The view leverages a new compression scheme to avoid loss of data when the view is zoomed out.&amp;nbsp; It has a legend that was missing in Beta 1.&amp;nbsp; It also has clearer statistics for each thread:&amp;nbsp; the total number of context switches, the number of context switches resulting in core migration, and the percentage of total context switches resulting in migration.&amp;nbsp; This can be very valuable when tuning&amp;nbsp;to reduce context switches or&amp;nbsp;cache/NUMA memory latency effects.&amp;nbsp; In addition, the visualization can easily illustrate thread serialization on cores that may result from inappropriate use of thread affinity.&amp;nbsp;&lt;/P&gt;
&lt;P mce_keep="true"&gt;This is just a short list of the improvements that we've made.&amp;nbsp; I will be returning soon with another post about new Beta 2 features, so please visit again and don't be shy&amp;nbsp;to give me your feedback and ask any questions that you may have.&lt;/P&gt;
&lt;P mce_keep="true"&gt;Cheers!&lt;/P&gt;
&lt;P mce_keep="true"&gt;-Hazim&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9908582" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>Performance Pattern 2: Disk I/O</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/07/29/performance-pattern-2-disk-i-o.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/07/29/performance-pattern-2-disk-i-o.aspx</id><published>2009-07-29T23:42:00Z</published><updated>2009-07-29T23:42:00Z</updated><content type="html">&lt;P&gt;A common source of performance bugs is disk I/O.&amp;nbsp; In this post, I'm going to give an overview of the features in our tool that help developers understand the sources of I/O stalls, including determining latency, identifying the files involved, and relating behavior to application source code.&lt;/P&gt;
&lt;P&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9852604/original.aspx" width=567 height=418 mce_src="http://blogs.msdn.com/photos/hshafi/images/9852604/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;The first screenshot, shown&amp;nbsp;above, illustrates how I/O stalls are depicted in our tool.&amp;nbsp; In the Thread Blocking view, I/O delays have a dedicated category (dark purple in VSTS Beta 1).&amp;nbsp; You can identify the threads that are spending a significant fraction of their lifetime blocked on I/O using the Execution Breakdown tab.&amp;nbsp; Using this view, you can focus your attention on the thread(s) that are relevant and you may choose to eliminate other threads from the view by selecting the thread channels on the left and then clicking the "Hide selected threads" icon in the toolbar.&amp;nbsp; In the screenshot, I've hidden all but the main thread (id = 5968) from the view and you can observe from the breakdown graph that this thread spends most of its time executing, except for some synchronization and I/O delays.&amp;nbsp; In the timeline, you can observe that the I/O stalls occur at the tail end of execution.&lt;/P&gt;
&lt;P&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9852600/original.aspx" width=569 height=446 mce_src="http://blogs.msdn.com/photos/hshafi/images/9852600/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;The screenshot above shows what happens when you zoom in on that phase with lots of I/O stalls.&amp;nbsp; More specifically, it shows what happens when you click on one of the I/O blocking segments.&amp;nbsp; You will notice that the "Selection" tab now shows details concerning this blocking event.&amp;nbsp; If it weren't for a bug in Beta 1, it would show something like the following text:&lt;/P&gt;
&lt;P&gt;Category = I/O&lt;/P&gt;
&lt;P&gt;API = WriteFile&lt;/P&gt;
&lt;P&gt;kernel32!_WriteFile@&lt;BR&gt;MatMult2!_wmain:myprogram.cpp:81&lt;BR&gt;MatMult2!__tmainCRTStartup:crtexe.c:371&lt;/P&gt;
&lt;P&gt;Delay = 70.3215 ms&lt;/P&gt;
&lt;P&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9852602/original.aspx" width=509 height=479 mce_src="http://blogs.msdn.com/photos/hshafi/images/9852602/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;Now, what we'd like to know is the file that was being written to at this time. You can find out by moving the thread next to the Disk write channels and clicking on the write segment (if available) that closely corresponds to the I/O blocking event.&amp;nbsp; I've zoomed in on the blocking event that I selected above, and now you can see what happens when I click on the Disk 0 Write channel segment right above the blocking event.&amp;nbsp; The Selection tab content now shows the number of physical disk I/O writes that were occurring at that moment with a list of the filenames involved.&amp;nbsp; In this case, the file is "MatMult2.out".&amp;nbsp; At this point, you are most likely interested in knowing the aggregate delay incurred writing to this file and where in your application these delays manifest themselves.&amp;nbsp; There are two tools for that.&amp;nbsp; First, when you click on the "Disk I/O Stats" tab, you will get a summary of the files being read/written to in the current view as well as the number, type of access, and number of bytes, in addition to the I/O latency.&amp;nbsp; Second, if you select the "Other Blocking Reasons" tab, you will get a summary of blocking delays, collated by callstack so you can identify the number of blocking events on the specific WriteFile call in the current view.&amp;nbsp; Be sure to click the "UpdateStats" button in the toolbar so that the reports are updated for the current view and threads selected.&amp;nbsp; This reporting capability will be improved to filter on specific blocking categories in the future.&amp;nbsp; From this report, you can also right-click on a callstack frame (e.g., WriteFile), which allows you to view either the function body or the callsites for the current function.&amp;nbsp; The next screenshot shows how the source file where the call to WriteFile is made opens up in this case with the cursor in the general vicinity of the call.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG style="WIDTH: 532px; HEIGHT: 62px" src="http://blogs.msdn.com/photos/hshafi/images/9852603/original.aspx" width=532 height=26 mce_src="http://blogs.msdn.com/photos/hshafi/images/9852603/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;A&amp;nbsp;couple of&amp;nbsp;quick notes are in order about I/O and interactions with the operating system in our tool:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;DIV mce_keep="true"&gt;We show "physical" disk I/O in our tool, so disk I/O that is buffered will not show up.&amp;nbsp; We made the decision because physical I/O is more important from a performance perspective if it is synchronous.&amp;nbsp; This brings up an interesting topic of system buffer caches that are used to hide latency.&amp;nbsp; If you run an app that reads a file twice in a row and profile it each time, it is not uncommon to see many I/O blocking events reading the file in the first run, but not the second.&amp;nbsp; You should keep such interactions with the operating system in mind when doing performance analysis.&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV mce_keep="true"&gt;Sometimes, I/O that is initiated by the application cannot be attributed directly back to it.&amp;nbsp; This kind of I/O is often attributed to the system process.&amp;nbsp; We therefore show all physical disk I/O initiated both by your application and the system process.&amp;nbsp; You should keep this in mind when you are analyzing results because there might be I/O from other processes in the system.&amp;nbsp; Although this can be "noise" from a user's perspective, it can also help understand some system-level physical disk contention in your system.&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV mce_keep="true"&gt;The Disk I/O Stats view as well as callstack reports are going to be significantly enhanced in Beta 2 to improve I/O investigations.&lt;/DIV&gt;&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV mce_keep="true"&gt;Application startup performance is often limited by DLL load times.&amp;nbsp; Our tool can be a great aid in analyzing such scenarios.&amp;nbsp; DLL loads, although they require I/O, can be manifested as "memory management" in our tool due to paging activity.&amp;nbsp; You should be aware of this.&amp;nbsp; I will likely write a dedicated performance pattern article on this in the near future.&lt;/DIV&gt;&lt;/LI&gt;&lt;/OL&gt;
&lt;P mce_keep="true"&gt;Now go play with our tool!&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9852530" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>Performance Pattern 1: Identifying Lock Contention</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/06/19/performance-pattern-1-identifying-lock-contention.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/06/19/performance-pattern-1-identifying-lock-contention.aspx</id><published>2009-06-19T05:31:00Z</published><updated>2009-06-19T05:31:00Z</updated><content type="html">&lt;P&gt;In this article, I describe the first of a series of Performance Patterns that you can use the VSTS concurrency visualization tool to identify.&amp;nbsp; I thought that it would be appropriate to start with a simple lock contention scenario.&amp;nbsp; The code that I will use is a naive parallel matrix multiplication of two&amp;nbsp;SIZE x SIZE matrices, A and B where each thread executes the following loop:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;for (i = myid*PerProcessorChunk; i &amp;lt; (myid+1)*PerProcessorChunk; i++)&lt;BR&gt;&amp;nbsp;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; EnterCriticalSection(&amp;amp;mmlock);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (j=0; j&amp;lt;SIZE; j++) &lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (k=0; k&amp;lt;SIZE; k++)&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; C[i+j*SIZE] += A[i+k*SIZE]*B[k+j*SIZE];&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; LeaveCriticalSection(&amp;amp;mmlock);&lt;BR&gt;&amp;nbsp; }&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Each thread in this application has a unique id from 0 to the number_of_threads-1 and the problem is partitioned by allocating a block of rows in&amp;nbsp;C&amp;nbsp;to each thread to compute.&amp;nbsp; If you examine the code, you will notice that there is no sharing on the elements of matrix C (we can chat about issues such as false sharing later if there's interest), so there's actually no need for the critical section (lock) that is used here.&amp;nbsp; I will use this as an example of showing you how this code will show up in our tool and how we can help you identify the root cause in your application's source code so that you can fix the problem.&amp;nbsp; &lt;/P&gt;
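&lt;P&gt;To make the "no sharing on C" argument concrete, here is a minimal sketch of the same row-block partitioning with the lock removed. It is written in Python rather than the Win32 C used in this post, purely to keep it short and self-contained; the names make_matrix, multiply_rows, parallel_multiply and serial_multiply are illustrative and are not part of matmult2, and the tiny SIZE is an assumption so that it runs quickly. Because of CPython's global interpreter lock, this will not actually run faster in parallel; it only shows that each thread writes a disjoint set of rows, so the unlocked result matches a serial run.&lt;/P&gt;

```python
import threading

SIZE = 8          # small matrix so the sketch runs quickly
NUM_THREADS = 4   # assumed to divide SIZE evenly
ROWS_PER_THREAD = SIZE // NUM_THREADS


def make_matrix(seed):
    # Flat list using the post's layout: element (i, j) lives at i + j*SIZE.
    return [(seed * (n + 1)) % 7 for n in range(SIZE * SIZE)]


def multiply_rows(a, b, c, first_row, last_row):
    # Writes only rows first_row .. last_row-1 of c, so threads given
    # disjoint row ranges never touch the same element of c.
    for i in range(first_row, last_row):
        for j in range(SIZE):
            total = 0
            for k in range(SIZE):
                total += a[i + k * SIZE] * b[k + j * SIZE]
            c[i + j * SIZE] = total


def parallel_multiply(a, b):
    # One thread per block of rows, with no lock around the work.
    c = [0] * (SIZE * SIZE)
    threads = []
    for myid in range(NUM_THREADS):
        first = myid * ROWS_PER_THREAD
        last = (myid + 1) * ROWS_PER_THREAD
        t = threading.Thread(target=multiply_rows, args=(a, b, c, first, last))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return c


def serial_multiply(a, b):
    c = [0] * (SIZE * SIZE)
    multiply_rows(a, b, c, 0, SIZE)
    return c
```

&lt;P&gt;Calling parallel_multiply and serial_multiply on the same inputs and comparing the two results is a quick sanity check that removing the lock does not change the answer.&lt;/P&gt;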
&lt;P&gt;Before using the tool, I have to make sure that I have a good symbol path defined.&amp;nbsp; Otherwise, the call stacks in the tool will not be very useful.&amp;nbsp; I have a habit of setting a system symbol path that points to the Microsoft public symbol server that looks something like this:&amp;nbsp; &lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;set _NT_SYMBOL_PATH=srv*c:\symcache*http://msdl.microsoft.com/download/symbols&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Notice that I've also set up a symbol cache so I don't have to keep retrieving symbol files from the network.&lt;/P&gt;
&lt;P&gt;Let's take a look at the application's behavior in our tool.&amp;nbsp; We start by looking at the CPU utilization view:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG style="WIDTH: 483px; HEIGHT: 463px" src="http://blogs.msdn.com/photos/hshafi/images/9799621/original.aspx" width=800 height=600 mce_src="http://blogs.msdn.com/photos/hshafi/images/9799621/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;You'll notice that the CPU utilization view shows that my application (the green area) is only consuming the equivalent of a single logical core in the system, even though I've parallelized it.&amp;nbsp; My next work item is to figure out why my application's execution seems to be serialized.&amp;nbsp; In order to do that, you can switch to the Thread Blocking view, which looks like the following:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG style="WIDTH: 467px; HEIGHT: 161px" src="http://blogs.msdn.com/photos/hshafi/images/9799616/original.aspx" width=800 height=300 mce_src="http://blogs.msdn.com/photos/hshafi/images/9799616/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;First you'll notice that your application is indeed creating 8 threads, which is what you intended.&amp;nbsp; You will also notice that the threads' execution is serialized (the green execution regions for the threads never overlap).&amp;nbsp; You can also see a lot of red between the greens, and the legend shows that red corresponds to synchronization delays.&amp;nbsp; Now you want to know what you did in your application that resulted in this behavior.&amp;nbsp; All you have to do is to click on a red region.&amp;nbsp; Here I show a zoomed in view and what happened after I clicked on a segment in the red region of a thread's state.&amp;nbsp; &lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9799623/original.aspx" mce_src="http://blogs.msdn.com/photos/hshafi/images/9799623/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;A few things happened.&amp;nbsp; First, the synchronization segment was highlighted.&amp;nbsp; Second, the "Selection" tab was activated in the bottom frame.&amp;nbsp; Inside the Selection tab, we show the category of delay, which is user synchronization in this case.&amp;nbsp; We also show an API in this case: EnterCriticalSection.&amp;nbsp; Next, we show the call stack that resulted in the thread blocking.&amp;nbsp; There's a bug in VSTS 2010 Beta 1 that results in an extraneous stack frame after the application (matmult2) stack frame, but you can see that&amp;nbsp;we show the frame showing the call to EnterCriticalSection in ntdll.dll.&amp;nbsp; Incidentally, if I didn't have a good symbol path set up, the tool would not have been able to show me useful information in the stacks and its value would have been considerably reduced.&amp;nbsp; Now, I can keep clicking on red segments, but that is not very productive.&amp;nbsp; I would like to know how expensive this particular call stack is so that I can prioritize my performance optimization effort.&amp;nbsp; You can click on the User Synchronization tab in the lower frame.&amp;nbsp; What you get there is a call tree report that summarizes, for every synchronization call stack,&amp;nbsp;how often the threads that are enabled in the current view blocked on that call stack during the displayed time period (you should click updatestats in the top toolbar to make sure that the stats are updated for the current viewpoint).&amp;nbsp; In this case, the report looks like this:&lt;/P&gt;
&lt;P&gt;&lt;IMG style="WIDTH: 480px; HEIGHT: 166px" src="http://blogs.msdn.com/photos/hshafi/images/9799625/original.aspx" width=545 height=143 mce_src="http://blogs.msdn.com/photos/hshafi/images/9799625/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;I've expanded the two call stacks in this report to show the details, but the report says that there were&amp;nbsp;113 synchronization blocking events.&amp;nbsp; One call stack was responsible for&amp;nbsp;75 and the other for&amp;nbsp;38 blocking events.&amp;nbsp; You can also see that one of the call stacks involves the main thread since it contains the matmult2!_wmain() frame and the other seems to involve the slave threads that start execution at the function matmult2!MatMult().&amp;nbsp; If I want to examine the code that resulted in the most significant blocking callstack, I right click on the stack frame of interest, which brings up a context menu with two options:&amp;nbsp; "View Source" which takes you to the source code of the function specified in the frame (in the example below that is disabled because we don't have source file data for ntdll.dll), and "View Call Sites", which would bring up the place in the previous function where you made the call to this function.&amp;nbsp; Here's what the UI looks like for this feature (notice that I'm compensating for the bug in Beta 1 by right clicking on the frame right after the matmult2 frame):&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9799632/original.aspx" mce_src="http://blogs.msdn.com/photos/hshafi/images/9799632/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;When you choose "View Call Sites" in the above menu, you will be taken to an editor window with the matmult2.cpp file open and the cursor near the call site where you made the call to EnterCriticalSection().&amp;nbsp; Now you have to spend some time determining whether this lock is needed and what you can do to reduce contention on it.&amp;nbsp; Of course, in this example, the lock is not needed at all.&amp;nbsp; The editor screenshot is below (the cursor was at the line following the call because we deal with return addresses when collecting call stacks):&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG style="WIDTH: 504px; HEIGHT: 97px" src="http://blogs.msdn.com/photos/hshafi/images/9799635/original.aspx" width=504 height=59 mce_src="http://blogs.msdn.com/photos/hshafi/images/9799635/original.aspx"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P mce_keep="true"&gt;Now, there's something curious about the behavior of my application.&amp;nbsp; In the above screenshots, I ran the application with SIZE=1024, so when I look at the code, I would expect 1024 executions for each thread rather than the few long execution segments that I ended up with.&amp;nbsp; If you're not familiar with the details of critical sections, you might not be aware that they do not enforce a FIFO order on threads waiting for the lock (i.e., they are not fair). The advantage of this feature is that the first thread to find the lock in a free state is allowed to acquire it even if there were other threads waiting on the lock.&amp;nbsp; The disadvantage is the lack of fairness.&amp;nbsp; This feature is also referred to as anti-convoy support.&amp;nbsp; From a performance perspective, the critical section is faster because we don't have to wait for threads to wake up, removing the context switch overheads from the critical path, so in this example, the application will probably be a little faster due to this.&amp;nbsp; If you want to enforce fairness and FIFO order, you can use a synchronization primitive that does not have anti-convoy features like the Win32 Mutex.&amp;nbsp; Here's how you can modify the code to use a Mutex:&lt;/P&gt;
&lt;P&gt;for (i = myid*PerProcessorChunk; i &amp;lt; (myid+1)*PerProcessorChunk; i++)&lt;BR&gt;&amp;nbsp;{&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;WaitForSingleObject(hMutex, INFINITE);&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (j=0; j&amp;lt;SIZE; j++) &lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (k=0; k&amp;lt;SIZE; k++)&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; C[i+j*SIZE] += A[i+k*SIZE]*B[k+j*SIZE];&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;ReleaseMutex(hMutex);&lt;BR&gt;&amp;nbsp; }&lt;/P&gt;
&lt;P&gt;When I compiled the application and collected a profile, I was able to clearly observe the convoy behavior in my thread executions in the Thread Blocking view, as shown below:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9799644/original.aspx" width=494 height=169 mce_src="http://blogs.msdn.com/photos/hshafi/images/9799644/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;I hope that you had fun with this article. Go pick up VSTS Beta 1 and play with our tool!&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9778507" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>VS2010: How to use the Parallel Performance Analysis Tools</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/06/02/vs2010-how-to-use-the-parallel-performance-analysis-tools.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/06/02/vs2010-how-to-use-the-parallel-performance-analysis-tools.aspx</id><published>2009-06-02T21:17:00Z</published><updated>2009-06-02T21:17:00Z</updated><content type="html">&lt;P&gt;This is a second post in the series about the parallel performance tools that my team is shipping in VS2010. In the previous post, I gave a quick overview of the features of our tools.&amp;nbsp; In this post, I will demonstrate how you can start analyzing your multithreaded application's performance using the VS2010 Beta 1 release as a guide.&lt;/P&gt;
&lt;P&gt;For the purposes of this tutorial, I'll assume that you have a solution of interest loaded and built in VS.&amp;nbsp; Although that is not strictly needed to use our tool, we'll concentrate on this scenario for now.&amp;nbsp; When you're ready to analyze your application, you should open the Analyze menu (if you can't find it, you're probably not using the VS Team System Beta 1, so see my previous post for a link).&lt;/P&gt;
&lt;P&gt;&lt;IMG style="WIDTH: 341px; HEIGHT: 216px" src="http://blogs.msdn.com/photos/hshafi/images/9686380/original.aspx" width=341 height=216 mce_src="http://blogs.msdn.com/photos/hshafi/images/9686380/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;Choose the "Launch Performance Wizard" option, which will present you with the following:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;IMG style="WIDTH: 525px; HEIGHT: 452px" src="http://blogs.msdn.com/photos/hshafi/images/9686384/original.aspx" width=525 height=571 mce_src="http://blogs.msdn.com/photos/hshafi/images/9686384/original.aspx"&gt;&lt;/P&gt;
&lt;P&gt;Select the "Concurrency" profiling method and select the second check box for our tool.&amp;nbsp; The "resource contention" option is another cool tool that you should use, but for this series, we'll assume that the first option is turned off.&amp;nbsp; Click "Next" and you'll be presented with a dialog to choose your application.&amp;nbsp; In this case, my current solution shows up by default:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9686390/original.aspx" width=527 height=515 mce_src="http://blogs.msdn.com/photos/hshafi/images/9686390/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;Since this is fine for my purposes, I just click "Next".&amp;nbsp; Now, I'm presented with an option to launch the profiler at the end of the performance wizard.&amp;nbsp; For this walkthrough, I'll assume that you chose the default, which is to launch the profiler:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9686391/original.aspx" width=532 height=528 mce_src="http://blogs.msdn.com/photos/hshafi/images/9686391/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;Click "Finish" and the profiler will launch your application and collect data.&amp;nbsp; This should also bring up the "Performance Explorer" window in Visual Studio:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9686456/original.aspx" mce_src="http://blogs.msdn.com/photos/hshafi/images/9686456/original.aspx"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P mce_keep="true"&gt;The performance wizard has created a "Performance Session"&amp;nbsp;named "MatMult2-2" in this case, the default profiling method "Concurrency" is shown in the pull-down menu,&amp;nbsp;my solution's executable is listed under the "Targets" folder, and a "Reports" folder&amp;nbsp;is shown.&amp;nbsp; There are multiple buttons on the explorer's toolbar. From left to right, these buttons are used to launch the&amp;nbsp;Performance Wizard,&amp;nbsp;create a new Performance Session, run a profiling session (using launch), stop profiler data collection, and then attach/detach if you'd like to profile a running application.&amp;nbsp; In our walkthrough, collection will stop if you click the&amp;nbsp;stop button (in the performance explorer window's toolbar)&amp;nbsp;or when the process terminates (whether normally or by user action).&amp;nbsp; Once data collection completes, the profiler will generate a profiling report with a .vsp extension and add it to the reports folder of the associated performance session.&amp;nbsp; By default, the profiler will immediately open the profile report after collection completes.&amp;nbsp; To access the views described in my previous post, you can choose the appropriate view from the "Current View" pulldown menu in the profile report toolbar:&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;IMG src="http://blogs.msdn.com/photos/hshafi/images/9686493/original.aspx" mce_src="http://blogs.msdn.com/photos/hshafi/images/9686493/original.aspx"&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;Next, I will post some usage examples to illustrate how you may use our tool to understand and fix performance issues.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;Enjoy!&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9686437" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>Visual Studio 2010 Beta 1: Parallel Performance Tools Overview</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/05/18/visual-studio-2010-beta-1-parallel-performance-tools.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/05/18/visual-studio-2010-beta-1-parallel-performance-tools.aspx</id><published>2009-05-18T23:50:00Z</published><updated>2009-05-18T23:50:00Z</updated><content type="html">&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;Today, Microsoft’s Developer Division released Visual Studio 2010 Beta 1 for general download.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;VS2010 is a fully installable release that you can use to preview the great features that we have been working on.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;I’m especially excited about the beta release of the parallel performance analysis tools that my team has been working hard on.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; As you'll notice from the screenshots below, our tool has come a long way since my &lt;A title="PDC 2008 talk" href="http://channel9.msdn.com/pdc2008/TL19/" mce_href="http://channel9.msdn.com/pdc2008/TL19/"&gt;PDC 2008 talk&lt;/A&gt;.&amp;nbsp; &lt;/SPAN&gt;I believe that we’re giving our developers something special that will make it easier for them to understand many aspects of the behavior of multithreaded applications on Windows.&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;In my first blog about our Beta 1 release, I’m going to give you an overview of some of the features of our tool.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;I style="mso-bidi-font-style: normal"&gt;I will follow up with a series of other articles over the next weeks on how the tool may be used to pinpoint issues and address them&lt;/I&gt;&lt;/B&gt;.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Just to be clear, the tool described here is shipping with Visual Studio Team System, so make sure to &lt;A title="install that version" href="http://go.microsoft.com/fwlink/?LinkId=147407" mce_href="http://go.microsoft.com/fwlink/?LinkId=147407"&gt;install that version&lt;/A&gt; to get your hands on it!&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;FONT size=3 face=Calibri&gt;CPU Utilization / Concurrency Analysis&lt;/FONT&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3&gt;&lt;FONT face=Calibri&gt;This is the main starting point for our tool.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;What you will see here is a graph of the number of “logical” cores (remember that physical cores with hyperthreading will appear as multiple logical cores) in the system on which you collected the trace shown on the y-axis and time shown on the x-axis.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Your process’ consumption of cores is shown in a green area curve at the bottom of the graph.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;We also show cores that are free in the grey area, cores that are used by the System process in a red area, and cores that are used by “other” processes that were running on the system when you collected the trace in an orange area.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;The legend on the right hand side of the graph is a good reminder.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The main purpose of this view is to help the user focus her attention on a period of execution that is of interest.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;A user might be doing analysis for many reasons, depending on the phase of the development cycle that they are in.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;For example, someone who is interested in parallelizing an existing application might be interested in CPU-bound regions or periods where there does not seem to be much CPU activity, which could indicate stalls due to I/O.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Another user might have parallelized an application, by he is not seeing the speed up that he expected and wants to confirm whether he is seeing the level of concurrency that he expected.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Using this view, the user can visually identify this area of interest, zoom in on it by clicking and dragging the mouse, and then switch to the thread blocking analysis view.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Here’s a snapshot of the CPU utilization view:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;&lt;IMG style="WIDTH: 565px; HEIGHT: 392px" title="CPU Utilization View" alt="CPU Utilization View" align=middle src="http://blogs.msdn.com/photos/hshafi/images/9626176/original.aspx" width=630 height=392 mce_src="http://blogs.msdn.com/photos/hshafi/images/9626176/original.aspx"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;?xml:namespace prefix = v ns = "urn:schemas-microsoft-com:vml" /&gt;&lt;v:shapetype id=_x0000_t75 stroked="f" filled="f" path="m@4@5l@4@11@9@11@9@5xe" o:preferrelative="t" o:spt="75" coordsize="21600,21600"&gt;&lt;v:stroke joinstyle="miter"&gt;&lt;/v:stroke&gt;&lt;v:formulas&gt;&lt;v:f eqn="if lineDrawn pixelLineWidth 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @0 1 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum 0 0 @1"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @2 1 2"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @3 21600 pixelWidth"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @3 21600 pixelHeight"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @0 0 1"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @6 1 2"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @7 21600 pixelWidth"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @8 21600 0"&gt;&lt;/v:f&gt;&lt;v:f eqn="prod @7 21600 pixelHeight"&gt;&lt;/v:f&gt;&lt;v:f eqn="sum @10 21600 0"&gt;&lt;/v:f&gt;&lt;/v:formulas&gt;&lt;v:path o:connecttype="rect" gradientshapeok="t" o:extrusionok="f"&gt;&lt;/v:path&gt;&lt;o:lock aspectratio="t" v:ext="edit"&gt;&lt;/o:lock&gt;&lt;/v:shapetype&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;FONT size=3 face=Calibri&gt;Thread Blocking Analysis&lt;/FONT&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;This is the main view of our analysis tool.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Its purpose is to analyze the execution of each thread in the process of interest to identify blocking events that may indicate performance bottlenecks.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Each blocking event is mapped to a category, such as synchronization, or I/O.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;The user can then analyze the reason for the blocking event&amp;nbsp;by using interactive callstacks or callstack-based summary reports to understand the root cause of the problem.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Because the tool is integrated in the IDE, from the summary reports, the user can also view the source code in their project that may be the root cause of a delay.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;There are also graphs that summarize where threads were spending their time (e.g., running or blocked), as well as many features to hide/sort threads in order to minimize noise in the reports.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;In addition to threads, we also show physical disk I/O from the user application or the System process during trace collection.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;This helps users identify the causes of I/O delays, or even page faults (e.g., loading a DLL, or paging).&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Further, it is often hard to identify inter-thread dependencies, so we have a special feature that can help identify threads that wait on others and what the latter were doing when they released a blocked thread.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;This is a great way of identifying work dependencies in your application.&lt;SPAN style="mso-spacerun: 
yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Finally, when threads are executing, we provide a way of sampling the execution callstacks.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;That can be very valuable in correlating the visualization with what code was running at a given period of time.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Here’s a snapshot of the thread blocking view:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;&lt;IMG style="WIDTH: 519px; HEIGHT: 392px" title="Thread Blocking View" alt="Thread Blocking View" align=middle src="http://blogs.msdn.com/photos/hshafi/images/9626180/original.aspx" width=630 height=392 mce_src="http://blogs.msdn.com/photos/hshafi/images/9626180/original.aspx"&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;B style="mso-bidi-font-weight: normal"&gt;&lt;FONT size=3 face=Calibri&gt;Core Execution / Thread Migration:&lt;/FONT&gt;&lt;/B&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;FONT size=3 face=Calibri&gt;The third view in our tool shows how application threads were scheduled on the logical cores in the system.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Using this view, you can identify excessive thread migrations (when a thread is moved to another core as a result of a context switch),&amp;nbsp;that can reduce performance due to caching effects.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;You can also use this view to understand the impact of thread affinity settings on an execution.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&amp;nbsp;T&lt;/SPAN&gt;hreads are associated with different colors that are displayed in time along the x-axis corresponding to the logical core that they were scheduled on.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Once you’ve identified a behavior of interest, you can zoom in on that time segment and switch to the Thread Blocking view for more in depth analysis (e.g., what caused thread blocking events that resulted in thread migration?).&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Here’s a snapshot of the Core Execution view:&lt;/FONT&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;SPAN style="mso-no-proof: yes"&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;IMG style="WIDTH: 482px; HEIGHT: 392px" title="Core Execution" alt="Core Execution" align=middle src="http://blogs.msdn.com/photos/hshafi/images/9626181/original.aspx" width=630 height=392 mce_src="http://blogs.msdn.com/photos/hshafi/images/9626181/original.aspx"&gt;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P style="MARGIN: 0in 0in 10pt" class=MsoNormal&gt;&lt;o:p&gt;&lt;FONT size=3 face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9626005" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>MVP Summit 2009 Presentations on Visual Studio Parallel Performance Tools</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2009/03/03/mvp-summit-2009-presentations-on-visual-studio-parallel-performance-tools.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2009/03/03/mvp-summit-2009-presentations-on-visual-studio-parallel-performance-tools.aspx</id><published>2009-03-03T21:14:00Z</published><updated>2009-03-03T21:14:00Z</updated><content type="html">I'm excited to be presenting our work on parallel performance tools in Visual Studio 2010 to Microsoft MVPs today.&amp;nbsp; I will be covering the features of our tools and improvements that we have made since PDC for native and managed developers.&amp;nbsp; I'm looking forward to a healthy discussion and any follow-ups here.&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9457194" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>Basics Process for Parallelizing an Application</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2008/12/11/basics-process-for-parallelizing-an-application.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2008/12/11/basics-process-for-parallelizing-an-application.aspx</id><published>2008-12-11T22:43:00Z</published><updated>2008-12-11T22:43:00Z</updated><content type="html">&lt;P&gt;Many of our customers are asking us for guidance on what they can do to make use of multicore systems.&amp;nbsp;&amp;nbsp;This is a good time to be thinking about this problem since single thread 
performance will remain relatively&amp;nbsp;flat for the foreseeable future.&amp;nbsp;&amp;nbsp;As developers throw more features into applications, there's a potential for performance regressions.&amp;nbsp; So, the question here is really about the process to follow in order to parallelize applications.&amp;nbsp;&amp;nbsp;Below, I will outline&amp;nbsp;a process that I've been advocating internally in my parallel development class as a practical approach to the problem.&amp;nbsp; I can provide more details upon request.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Step 1: Identifying Your Goals&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Before we even start discussing a parallelization process, we need to be aware of a primary principle: parallelism is about performance, and the performance that&amp;nbsp;you should care&amp;nbsp;about is a function of the specific application and usage scenarios.&amp;nbsp; So, the first step in any performance improvement effort needs to include the following:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Identifying the application phases and/or usage scenarios that are important from a performance perspective.&lt;/LI&gt;
&lt;LI&gt;For each such phase or scenario, decide on a performance goal.&amp;nbsp; This step is critical because it drives the "completion" criterion of any performance effort.&amp;nbsp; For some usage scenarios, especially for interactive applications, setting the goals may be relatively straight-forward.&amp;nbsp; For example, usability experts typically set a limit of 200ms on any operation that is expected to be instantaneous by users.&amp;nbsp; For other, more time-consuming scenarios, the challenge is often making perceived response time short, so hiding latency is often an important feature of a responsive application.&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Step 2: Measuring Your Baseline Performance&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Now that you've identified your scenarios, you have to evaluate your current state compared to your ultimate goals.&amp;nbsp; The important steps here are:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Creating a performance test harness for each scenario identified in step 1.&amp;nbsp; &lt;/LI&gt;
&lt;LI&gt;Run these tests on every build to track regressions.&lt;/LI&gt;
&lt;LI&gt;Identify scenarios that are missing their goals.&lt;/LI&gt;
&lt;LI&gt;For each failing scenario, identify the components of your application that contribute the most to performance.&amp;nbsp; This is a step that requires two important pieces of work.&amp;nbsp; First, instrumenting the code to identify where time is spent so that improvements/regressions may be narrowed down easily.&amp;nbsp; Second, using profiling tools.&amp;nbsp; For this last step, you need to be careful about your choice of tools.&amp;nbsp; I can offer more help here in another post if people are interested.&lt;/LI&gt;&lt;/UL&gt;
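&lt;P&gt;The steps above can be sketched as a very small test harness. This is a Python sketch under assumptions, not a recommendation of a particular framework: measure_scenario and find_failing are hypothetical helpers, a "scenario" is assumed to be any callable that exercises the code path you care about, and taking the best of several runs is just one reasonable way to reduce noise.&lt;/P&gt;

```python
import time


def measure_scenario(name, scenario, goal_seconds, repeats=5):
    """Run scenario() several times and compare the best time to the goal."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        scenario()
        timings.append(time.perf_counter() - start)
    best = min(timings)
    return {
        "name": name,
        "best_seconds": best,
        "goal_seconds": goal_seconds,
        "meets_goal": goal_seconds >= best,
    }


def find_failing(results):
    # Scenarios that miss their goal are the candidates for profiling.
    return [r["name"] for r in results if not r["meets_goal"]]
```

&lt;P&gt;For an operation that users expect to be instantaneous, you might pass goal_seconds=0.2 to match the 200ms limit mentioned in step 1, run the harness on every build, and feed only the names returned by find_failing into the profiling step.&lt;/P&gt;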
&lt;P&gt;&lt;STRONG&gt;Step 3: Traditional Tuning Comes First&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;If a scenario fails, you should attempt traditional performance tuning before considering parallelism.&amp;nbsp; Why?&amp;nbsp; Unnecessarily introducing parallelism can become a burden on your development and test resources.&amp;nbsp; With parallelism, you start introducing nondeterminism, race conditions, synchronization overheads, etc. and these are issues that just increase development and testing overheads.&amp;nbsp; So, it's best to avoid them.&amp;nbsp; &lt;/P&gt;
&lt;P&gt;The main culprits behind poor single-thread performance are I/O and cache behavior (memory latency really).&amp;nbsp; This is a large topic that we can also get into, but here are&amp;nbsp;some key points that you should keep in mind:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Reduce unnecessary I/O.&lt;/LI&gt;
&lt;LI&gt;Hide I/O latency when it cannot be avoided.&lt;/LI&gt;
&lt;LI&gt;Exploit available bandwidth to reduce latency by coalescing I/O requests if possible.&lt;/LI&gt;
&lt;LI&gt;Improve the temporal and spatial locality of your code to reduce cache misses and associated memory latency.&lt;/LI&gt;&lt;/UL&gt;
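&lt;P&gt;As a rough illustration of coalescing I/O requests, the Python sketch below writes the same records twice: once with one request per record, and once through a buffer that gathers them into fewer, larger requests.&amp;nbsp; CountingSink is a hypothetical class that only counts requests; it stands in for a real file or device.&lt;/P&gt;

```python
import io

class CountingSink(io.RawIOBase):
    """Hypothetical sink that counts how many write requests reach it."""

    def __init__(self):
        super().__init__()
        self.requests = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.requests += 1
        self.data += bytes(b)
        return len(b)

records = [b"record-%d\n" % i for i in range(1000)]

# Uncoalesced: every small record becomes its own request.
unbuffered = CountingSink()
for record in records:
    unbuffered.write(record)

# Coalesced: a buffer gathers the records and issues larger requests.
coalesced = CountingSink()
with io.BufferedWriter(coalesced, buffer_size=64 * 1024) as writer:
    for record in records:
        writer.write(record)

print(unbuffered.requests, coalesced.requests)
```

&lt;P&gt;Both sinks receive identical bytes; only the number of requests differs, and that is where the per-request latency is saved.&lt;/P&gt;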
&lt;P&gt;&lt;STRONG&gt;Step 4: Identify Opportunities for Parallelism&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;If you're here, then you've attempted step 3 for failing scenarios and you have reached the conclusion that you cannot meet your performance goals unless you parallelize.&amp;nbsp; There are two classes of parallelism that you can usually exploit: hiding latency from your user (or overlapping work) and reducing the latency of computation.&amp;nbsp; Notice that the first is often related to I/O, but can also be used to hide CPU-bound work.&amp;nbsp; The second is almost always a technique to speed up CPU-bound work.&amp;nbsp; The main forms of parallelism that you're likely to encounter fall under one or more of these categories:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Task-Level Parallelism: Multiple time-consuming independent tasks on the critical path can be executed in parallel.&amp;nbsp; Note that the word "task" has been overloaded significantly in our industry.&amp;nbsp; Here, I mean the dictionary (not CS-specific) definition.&lt;/LI&gt;
&lt;LI&gt;Pipelining: A task may be broken down into independent stages that can execute in parallel.&amp;nbsp; If many such tasks exist, this can significantly improve the throughput of task handling at a (hopefully small) latency penalty due to handshaking at stage boundaries.&lt;/LI&gt;
&lt;LI&gt;Data Parallelism: This is an area that I predict will see the most use for parallelism in the near future.&amp;nbsp; This form of parallelism exists when the application performs effectively the same work on independent pieces of data.&amp;nbsp; This can be exploited by dividing the data/work across multiple CPUs as well as possibly using SIMD (Single Instruction Multiple Data) instruction sets (&amp;agrave; la MMX/SSE).&lt;/LI&gt;&lt;/UL&gt;
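&lt;P&gt;The data-parallel pattern can be sketched as follows in Python: split the data into independent chunks, apply the same function to each chunk, and combine the partial results.&amp;nbsp; The helper names split_into_chunks and sum_of_squares are hypothetical.&amp;nbsp; The sketch uses a thread pool only to show the structure; in standard CPython, threads do not speed up CPU-bound work, so a process pool or native code would be the usual choice there.&lt;/P&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data, parts):
    """Divide data into `parts` nearly equal, independent slices."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if extra > i else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def sum_of_squares(chunk):
    # The same work applied to each independent piece of data.
    return sum(x * x for x in chunk)

data = list(range(1_000))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum_of_squares, split_into_chunks(data, 4)))

# Combine the per-chunk results once, after all workers finish.
total = sum(partials)
print(total)
```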
&lt;P&gt;&lt;STRONG&gt;Step 5: Decide on Phase(s) to Parallelize&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The decision should be based on some estimates of potential improvement.&amp;nbsp; You can optimistically assume linear speedup (equal to the number of cores working on a problem) for now.&amp;nbsp; Estimate, using the performance instrumentation that you've been using so far, what fraction of the target scenario's execution time will be improved, and plug this data into something like Amdahl's Law to get an estimate of potential speedup for the whole scenario.&amp;nbsp; This is a critical step to avoid investing time on work that will not achieve your goals.&lt;/P&gt;
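&lt;P&gt;For reference, Amdahl's Law says that if a fraction p of the scenario's time is parallelized across n cores with linear speedup, the overall speedup is 1 / ((1 - p) + p / n).&amp;nbsp; A small Python sketch of the estimate follows; the function name amdahl_speedup and the 60% figure are assumptions chosen for illustration.&lt;/P&gt;

```python
def amdahl_speedup(parallel_fraction, cores):
    """Whole-scenario speedup when `parallel_fraction` of the time scales linearly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Hypothetical scenario: instrumentation says 60% of the time is parallelizable.
for cores in (2, 4, 8):
    print(cores, round(amdahl_speedup(0.6, cores), 2))
```

&lt;P&gt;With 60% of the time parallelizable, the speedup can never exceed 1 / 0.4 = 2.5 no matter how many cores you add.&amp;nbsp; If your goal requires more than that, this phase alone will not get you there.&lt;/P&gt;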
&lt;P&gt;&lt;STRONG&gt;Step 6: Implement the Parallel Algorithm&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This is a large topic, but I offer some guidance here:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;There are usually well known coding patterns for exploiting the main types of parallelism (e.g., thread pools for task-level parallelism, parallel for loops such as the ones provided by our Task Parallel Library and Parallel Pattern Library).&amp;nbsp; You should become familiar with the common tools at your disposal.&lt;/LI&gt;
&lt;LI&gt;Reduce sharing (especially write sharing) of data across parallelism boundaries to avoid cache coherence overheads with their associated latencies.&lt;/LI&gt;
&lt;LI&gt;This is related to the above, but reducing synchronization overheads is critical.&amp;nbsp; If you come across a solution that requires a lock in an inner loop, that's probably a bad idea!&amp;nbsp; &lt;/LI&gt;
&lt;LI&gt;Remember that the best sequential algorithm may not be the best parallel algorithm.&amp;nbsp; So, be open to potentially changing your basic implementation.&lt;/LI&gt;&lt;/UL&gt;
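&lt;P&gt;The points about sharing and synchronization can be illustrated with a small Python sketch.&amp;nbsp; The first function takes a shared lock inside the inner loop; the second lets each worker accumulate privately and publish a single result at the end.&amp;nbsp; Both function names are hypothetical, and the sketch shows only the structure of the two approaches, not measured performance.&lt;/P&gt;

```python
import threading

def count_matches_locked(chunks, predicate):
    """Anti-pattern: every match takes a shared lock inside the inner loop."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        for item in chunk:
            if predicate(item):
                with lock:  # contended on every matching item
                    total += 1

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def count_matches_local(chunks, predicate):
    """Each worker accumulates privately and publishes one result at the end."""
    results = [0] * len(chunks)

    def worker(index, chunk):
        local = 0
        for item in chunk:
            if predicate(item):
                local += 1
        results[index] = local  # one write per worker, to its own slot

    threads = [threading.Thread(target=worker, args=(i, c))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

chunks = [range(0, 250), range(250, 500), range(500, 750), range(750, 1000)]
is_even = lambda n: n % 2 == 0
print(count_matches_locked(chunks, is_even), count_matches_local(chunks, is_even))
```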
&lt;P&gt;&lt;STRONG&gt;Step 7: Tune and Iterate&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Repeat the measurement step and don't forget about tuning the cache behavior of your parallel implementation.&amp;nbsp; The tools that my team is shipping in Visual Studio 2010 will be very valuable for parallel tuning.&amp;nbsp; For example, you'll have to reduce synchronization overheads and identify why UI threads block (which contributes to UI hangs), and our tools will be a great help with both.&lt;/P&gt;
&lt;P&gt;I think that I've included enough here to give you a feel for what the process might look like.&amp;nbsp; There are many points in this post that can be significantly elaborated on, but I'll leave the choice to you folks.&amp;nbsp; Let me know what your questions/concerns are and I'll try to answer them.&lt;/P&gt;
&lt;P&gt;Take care.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9198076" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author><category term="how to parallelize" scheme="http://blogs.msdn.com/hshafi/archive/tags/how+to+parallelize/default.aspx" /><category term="parallelizing an application" scheme="http://blogs.msdn.com/hshafi/archive/tags/parallelizing+an+application/default.aspx" /><category term="parallel development process" scheme="http://blogs.msdn.com/hshafi/archive/tags/parallel+development+process/default.aspx" /></entry><entry><title>Come Test Drive our Technology at Supercomputing 2008</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2008/11/18/come-test-drive-our-technology-at-supercomputing-2008.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2008/11/18/come-test-drive-our-technology-at-supercomputing-2008.aspx</id><published>2008-11-19T02:26:00Z</published><updated>2008-11-19T02:26:00Z</updated><content type="html">This week, a few members of the Parallel Computing Platform team are at Supercomputing 2008 in Austin.&amp;nbsp; We will be demonstrating the parallel programming tools (e.g., Parallel Pattern Library and&amp;nbsp;Task Parallel Library) as well as the parallel debugging and performance analysis tools that will ship in Visual Studio 2010.&amp;nbsp; Please stop by if you're attending.&amp;nbsp; Our demo station is located in the Windows HPC Server 2008 booth area, which is hard to miss.&amp;nbsp; &lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9120692" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>Visual Studio Concurrency Analysis Performance Tools Announced at PDC</title><link rel="alternate" type="text/html" 
href="http://blogs.msdn.com/hshafi/archive/2008/10/28/visual-studio-concurrency-analysis-performance-tools-announced-at-pdc.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2008/10/28/visual-studio-concurrency-analysis-performance-tools-announced-at-pdc.aspx</id><published>2008-10-29T00:03:00Z</published><updated>2008-10-29T00:03:00Z</updated><content type="html">&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;Yesterday,&amp;nbsp;the Parallel Computing Platform team at Microsoft&amp;nbsp;announced the parallel performance tools shipping in Visual Studio 2010 during my presentation at PDC2008.&amp;nbsp; If you missed it, you can view the presentation, including demos &lt;A class="" href="http://channel9.msdn.com/pdc2008/TL19/" mce_href="http://channel9.msdn.com/pdc2008/TL19/"&gt;here&lt;/A&gt;.&amp;nbsp;&amp;nbsp;The team is&amp;nbsp;really excited to share some of the new features in Visual Studio 2010 that are going to make it easier to express (i.e., code) parallelism, debug, and tune parallel applications.&amp;nbsp; Although it was my pleasure to give the presentation, I want you all to know that there's a fantastic team of devs, PMs, and testers at Microsoft that's making all this technology happen.&amp;nbsp; Feel free to share your questions &amp;amp; comments here and I promise to respond.&amp;nbsp; &lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9020982" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry><entry><title>Introduction</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/hshafi/archive/2008/10/19/introduction.aspx" /><id>http://blogs.msdn.com/hshafi/archive/2008/10/19/introduction.aspx</id><published>2008-10-19T22:59:00Z</published><updated>2008-10-19T22:59:00Z</updated><content type="html">&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;My name is Hazim Shafi and I'm an Architect in the Parallel Computing Platform team at Microsoft.&amp;nbsp; My primary responsibility is parallel performance analysis tools.&amp;nbsp; I received a BSEE from Santa Clara and MS and PhD degrees from Rice University.&amp;nbsp; For the past 15 years, I've been primarily working on various aspects of parallel and distributed processing.&amp;nbsp; Most of my experience has been in efforts to improve the performance of shared-memory multiprocessor systems, primarily through techniques to reduce or tolerate memory latency (think of memory/cache hierarchy design, cache coherence, and cooperative hardware/software techniques in this area).&amp;nbsp; In the process, I gained a lot of experience in building parallel performance analysis tools and general performance tuning techniques for parallel applications.&amp;nbsp; Before joining Microsoft, I was a Research Staff Member at IBM Research where I worked on the Sony Playstation 3 (Cell Architecture), the design and performance evaluation of a petaflop high performance computing system (DARPA HPCS program), and power aware computing.&lt;/P&gt;
&lt;P&gt;I'm thrilled to be participating&amp;nbsp;in the upcoming PDC2008 conference later this month (&lt;A href="http://www.microsoftpdc.com/" mce_href="http://www.microsoftpdc.com/"&gt;http://www.microsoftpdc.com&lt;/A&gt;), where I will be showing off some really cool technology.&amp;nbsp; My session is on Monday 10/27 at 1:45pm and is titled "Microsoft Visual Studio: Bringing out the Best in Multicore Systems".&amp;nbsp; Stay tuned here after the conference starts for an exchange about what's coming.&amp;nbsp;&amp;nbsp;In the meantime, you can see a preview of some of the cool tools that we'll demo at PDC in our article &lt;A class="" title="Improved Support For Parallelism In The Next Version Of Visual Studio" href="http://msdn.microsoft.com/en-us/magazine/cc817396.aspx" mce_href="http://msdn.microsoft.com/en-us/magazine/cc817396.aspx"&gt;"Improved Support For Parallelism In The Next Version Of Visual Studio"&lt;/A&gt; in the October issue of MSDN Magazine.&amp;nbsp; &lt;/P&gt;
&lt;P&gt;I hope that this blog will be a good forum to get to know our customers, understand their needs, and hopefully help them be more successful.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9006611" width="1" height="1"&gt;</content><author><name>hshafi</name><uri>http://blogs.msdn.com/members/hshafi.aspx</uri></author></entry></feed>