<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Parallel Programming in Native Code</title><link>http://blogs.msdn.com/b/nativeconcurrency/</link><description>Parallel programming using C++ AMP, PPL and Agents libraries.</description><dc:language>en-US</dc:language><generator>Telligent Evolution Platform Developer Build (Build: 5.6.50428.7875)</generator><item><title>Image Processing toolkit for .NET framework using C++ AMP</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/05/21/image-processing-toolkit-for-net-framework-using-c-amp.aspx</link><pubDate>Tue, 21 May 2013 18:31:57 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10420377</guid><dc:creator>Boby George (MSFT)</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/05/21/image-processing-toolkit-for-net-framework-using-c-amp.aspx#comments</comments><description>&lt;p&gt;Manipulation of images is an excellent scenario for GPU computation. Apart from &lt;a href="http://blogs.msdn.com/b/nativeconcurrency/archive/2013/04/01/aviary-s-photo-editor-sdk-for-windows-8-is-using-c-amp.aspx"&gt;Aviary&lt;/a&gt;, we were pleased to learn that the PrecisionImage.NET SDK also &lt;a href="http://coreoptical.wordpress.com/2013/03/07/gpu-compute-on-nvidias-gtx-titan/"&gt;uses C++ AMP&lt;/a&gt; for its GPU computation. Their blog post, &lt;a href="http://coreoptical.wordpress.com/2013/03/11/image-processing-with-nvidias-gtx-titan/"&gt;Image Processing with NVidia's GTX Titan&lt;/a&gt;, delineates the benefits of GPU computation by comparing the runtime performance of C++ AMP code on a GTX Titan and a GTX 460 against a six-core AMD Phenom II. 
CoreOptical graciously agreed to document their experience using C++ AMP; their guest post follows (also cross-posted on the &lt;a href="http://blogs.msdn.com/b/vcblog/archive/2013/05/21/image-processing-with-c-amp-and-the-net-framework.aspx"&gt;VC++ team blog&lt;/a&gt;). 
&lt;/p&gt;&lt;p&gt;Image processing is a computational task that lends itself very well to GPU compute scenarios. In many cases the most commonly used algorithms are inherently massively parallel, with each pixel in the image being processed independently from the others. As a result, image processing toolkits have been early adopters of the new GPGPU programming model. Many of these mass-market toolkits, however, can be more accurately described as &lt;em&gt;image manipulation&lt;/em&gt; packages that offer "image-in, image-out" capabilities; in other words, for each operation there is an input image and a resulting output (manipulated) image. In contrast, an &lt;em&gt;image processing&lt;/em&gt; workflow differs from this in that the goal is usually the portrayal or extraction of analytical information which is determined after some multi-step processing workflow. These workflows are commonly employed in scientific and technical industries like medical imaging.
&lt;/p&gt;&lt;p&gt;For the last two years, Core Optical Inc. has been building an image processing toolkit for the .NET framework called&lt;span style="color:#424242; font-family:Segoe UI; font-size:9pt"&gt;
			&lt;a href="http://www.coreoptical.com/"&gt;PrecisionImage.NET&lt;/a&gt;&lt;/span&gt;. Internally it centers around two separate execution branches, one targeting multicore CPUs and the other targeting GPU execution. While the CPU branch is a fully-managed CLS-compliant implementation leaning heavily on the .NET framework's excellent built-in thread pool, the GPU branch is implemented using Microsoft's brand new C++ AMP compiler.
&lt;/p&gt;&lt;p&gt;We had two requirements when choosing the GPGPU tool we would use for that branch of our toolkit. First, the generated code needed to be vendor agnostic so that a decision to use our SDK wouldn't overly restrict our customers' choice of graphics hardware. The current minimum platform for C++ AMP is DirectX 11, a version that will soon be ubiquitous among modern GPUs from Intel, Nvidia and AMD. Second, since we focus on the Microsoft developer stack, we needed something that would play nicely with the .NET framework. C++ AMP is the obvious choice in this regard since it's produced by Microsoft.
&lt;/p&gt;&lt;p&gt;For a v1 product we've found C++ AMP to be both solid and easy to program against. Although Microsoft doesn't produce an official managed wrapper, accessing AMP from .NET was a straightforward matter of P/Invoking from our existing C# code base. To keep the surface area between the two to a minimum, we stuck with our managed code for the CPU fallback and condensed the various operations of the SDK into hundreds of compact AMP kernels compiled in combinations of single/double precision and 32/64-bit implementations. In almost all cases we found the simpler untiled model readily met our speedup goals. When this wasn't the case, we were able to produce a tiled version that met our performance objectives with minimal drama.
&lt;/p&gt;&lt;p&gt;To demonstrate the performance of the GPU branch we decided to&lt;span style="color:#424242; font-family:Segoe UI; font-size:9pt"&gt;
			&lt;a href="http://coreoptical.wordpress.com/"&gt;compare the speed&lt;/a&gt;
		&lt;/span&gt;of two operations running on a 6-core CPU (multithreaded managed code) versus the C++ AMP version running on two different GPUs from Nvidia. The first operation was chosen as an ideal case for GPU implementation and consisted of a 2D convolution with a large kernel, implemented using AMP's simple untiled model. The second was chosen for its unsuitability to GPU processing and was implemented using the tiled model. Even when including the overhead of marshalling arguments from managed to native code, and memory copying to/from the GPU, we saw huge gains (60x) in the first test case. Perhaps more surprising were the gains achieved in the second, less suitable, test case – up to 7x – an indication of the quality of the AMP compiler. Based on our experience to date, if you are a developer considering using AMP from a managed code base we can recommend it without reservation.
&lt;/p&gt;&lt;p&gt;Currently, one aspect of C++ AMP imposes a performance limitation (acknowledged by Microsoft) for our particular use cases: redundant memory copying between CPU and GPU. This is partly imposed by hardware and partly by software. Since our SDK is designed to allow the easy assembly of processing chains, the overhead of these redundant memory copies can add up quickly. Microsoft has stated that this behavior needs improvement, and all our AMP kernels are using the &lt;em&gt;array_view&lt;/em&gt; abstraction to take advantage of the improvement when it arrives. This will be a welcome enhancement to the AMP implementation, especially given AMD's recent announcement of their hUMA architecture initiative. With both the hardware and software pieces falling into place soon, we should see a whole new generation of image processing software with unprecedented power and flexibility.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10420377" width="1" height="1"&gt;</description></item><item><title>Addendum to CMA’s Case Study</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/05/09/addendum-to-cma-s-case-study.aspx</link><pubDate>Thu, 09 May 2013 19:41:18 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10417425</guid><dc:creator>Boby George (MSFT)</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/05/09/addendum-to-cma-s-case-study.aspx#comments</comments><description>&lt;p&gt;We are delighted to announce that CMA wrote an addendum to the C++ AMP case study, detailing their problem and how they used C++ AMP. 
For context, we recommend reading the C++ AMP case study, &lt;a href="http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001354" title="Financial Data Leader Shortens Time-to-Market, Increases Speed with Right Tools"&gt;Financial Data Leader Shortens Time-to-Market, Increases Speed with Right Tools&lt;/a&gt;. It goes without saying that we are very thankful to Moody Hadi and CMA for providing this addendum (quoted verbatim below).
&lt;/p&gt;&lt;p&gt;"&lt;strong&gt;&lt;em&gt;Business Case:    
&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;CMA needs to build multiple interest rate curves per currency that have relationships amongst them within the same currency. In the cases, when market inputs are volatile due to illiquidity in a section of the curve or due to genuine volatility, the analysts have to intervene in order to validate that the moves are genuine  and if not rely on other sources in order to accurately derive that segment of the curve. This process is highly iterative where the analyst needs to try different market inputs and regenerate the curves and then check the impact of the changes they made on the entire set of curves. A change in a market input can have a ripple effect on other curves in the same set. The time window for the analysts to perform this process is short (~20minutes) daily and they have to validate multiple asset classes and currencies as well. Thus, the user experience for every analyst needs to be kept to as close to instantaneous as possible.
&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Technical Solution:
&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;&lt;em&gt;Algorithm:
&lt;/em&gt;&lt;/p&gt;&lt;p&gt;The set of relationships that exist amongst each curve set within the same currency translate to a number of constraints that ensure those relationships exist after deriving the curves from market inputs. In addition, each individual curve has to satisfy a number of constraints to ensure smoothness, some degree of locality and monotonicity. Finally, all the generated curves need to been consistent with all the market inputs used to generate them. The algorithm needs to generate the curves that satisfy these constraints.
&lt;/p&gt;&lt;p&gt;A multidimensional gradient based root searching algorithm is used that has to solve for approximately 20 degrees of freedom and satisfy all the constraints to reach a solution. The algorithm itself is iterative and so is dependent on the input instrument selection. We used C++ AMP to remove the bottlenecks in cases when the instruments require more processing time, the gains we get there are compounded as each iteration requires much less time to converge.
&lt;/p&gt;&lt;p&gt;&lt;em&gt;Hardware:
&lt;/em&gt;&lt;/p&gt;&lt;p&gt;Intel Core i7-860
&lt;/p&gt;&lt;p&gt;16GB RAM
&lt;/p&gt;&lt;p&gt;Nvidia GTX-570
&lt;/p&gt;&lt;p&gt;Windows 8 Professional 64bit
&lt;/p&gt;&lt;p&gt;
 &lt;/p&gt;&lt;p&gt;The machine is used to process multiple currencies and asset classes as well. As such, we are trying to use the same hardware as efficiently as possible. During the analyst validation period there is a high degree of concurrency of requests coming into the machine, so if a certain set of curves have expensive instruments they can delay the processing of other requests.
&lt;/p&gt;&lt;p&gt;
 &lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Improvements:
&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;We were able to reduce the processing time for curve groups from ~20 to 30 seconds (single threaded) to ~10 to 15 seconds (CPU multi threaded) to less than 5 seconds (GPU multi threaded)
&lt;/p&gt;&lt;p&gt;An additional benefit is that with the use of GPU we are able to achieve a good degree of load balancing during critical time periods and not experience delay due to high processing requests whether they are due to incongruent market inputs or due to expensive instruments. Requests that can be processed quickly on the CPU are sent there and expensive requests can be simultaneously sent to the GPU to process thereby giving all users a fast and consistent experience.
&lt;/p&gt;&lt;p&gt;The C++ AMP API was easy to use and allows us to quickly deploy and address bottlenecks without having to resort to more complicated APIs or perform major re-writes to the existing code base. The use of lambdas and STL like library made the entire coding process much easier and thus we were able to deploy our solution quickly, with a good degree of re-usability and without needing proprietary GPU API specific knowledge."
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Moody Hadi &lt;/li&gt;&lt;/ul&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10417425" width="1" height="1"&gt;</description></item><item><title>C++ AMP used for GPU Benchmarking</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/04/23/c-amp-used-for-gpu-benchmarking.aspx</link><pubDate>Tue, 23 Apr 2013 20:43:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10413470</guid><dc:creator>Boby George (MSFT)</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/04/23/c-amp-used-for-gpu-benchmarking.aspx#comments</comments><description>&lt;p&gt;Continuing with our series on how C++ AMP is being used, we are happy to highlight how C++ AMP is used for GPU Benchmarking. SystemCompute is an in-house benchmark developed by Dr. Ian Cutress of &lt;a href="http://www.anandtech.com/"&gt;AnandTech&lt;/a&gt;. Once we came to know that the GPU version of this benchmark was written in C++ AMP, we reached out to Dr. Cutress for a blog post. Please find below the guest blog post detailing his experience.
&lt;/p&gt;&lt;p&gt;"My Experience with C++ AMP
&lt;/p&gt;&lt;p&gt;My first foray as a non-computer scientist but computer enthusiast into multi-threaded programming led to OpenMP.  I was always fascinated by all the different BOINC projects, and how they were able to use the resources around them to brute force compute everything around them – the embarrassingly parallel tasks afforded such methods and were welcomed with open arms.  During my undergraduate I donated a lot of my CPU PC time to helping these projects, then eventually GPU time.  By the time I started my Computational Chemistry PhD, I was neither a competent programmer nor an optimized one; my masters thesis was computational yet written in Visual Basic .NET and thus amazingly slow.  However moving to C/C++ and looking at OpenMP gave the research group a good deal of speedup, especially when all around me people were only using one core of a quad core machine (albeit four simulations running at once).  It was at this point I casually remarked how interesting it would be to program GPUs.  The situation took a lucky turn as one of my friends studying Computational Finance on Facebook linked me up to his supervisor, who was starting the CUDA course at Oxford.  After a crash course lasting a week with no IDE and Linux (I was totally out of my depth), I set forth with CUDA knowledge and pains.  What followed was the three years of my PhD using CUDA &lt;a href="http://compton.chem.ox.ac.uk/index.php?title=papers&amp;amp;search=initials&amp;amp;stxt=IJC"&gt;to publish several papers&lt;/a&gt; and get to grips with what programming on a parallel device really entails – the intricacies of making sure kernels are register light and rearranging operations or formula to use the least amount of cycles and get the results quicker.
&lt;/p&gt;&lt;p&gt;Since leaving academia, I am now a hardware reviewer (which is a nice logical progression from chemistry… not ☺) for &lt;a href="http://www.anandtech.com/tag/mb"&gt;AnandTech.com&lt;/a&gt;, a well known hardware technology review website.  As the Senior Motherboard Editor, I have the chance to dice and tumble with a lot of hardware, especially on the CPU and motherboard side. This means not only the consumer level kit, but also some workstation builds and dual processor setups.  While nothing to do with GPU programming, I enjoyed reading about CUDA developments, and what OpenCL was bringing to the table.  Reading about them put me squarely in the higher-level CUDA camp – OpenCL looked very daunting and somewhat confusing without an instructor telling you what to do!
&lt;/p&gt;&lt;p&gt;During all this time I am an avid competitive overclocker – the dark art of making computer hardware run faster for small amounts of time to prove to the world you can make hardware run faster than anyone else.  I enjoyed some minor success, becoming UK number one in the enthusiast league at HWBot.org for a couple of years, before moving on to the UK Team.  The way to determine whether you are better than someone else (in terms of speed and efficiency) is to run one of the benchmarks supported by the league, such as SuperPi or wPrime for CPUs or 3DMark for GPUs.  The system is always looking for better benchmarks, and the lack of a compute benchmark for GPUs always irked me.  So as part of my reviewing, and also the overclocking scene, I set out to write one.
&lt;/p&gt;&lt;p&gt;I cannot remember explicitly how I came to [hear] about C++ AMP.  In order to write a benchmark for the overclockers it had to be GPU vendor agnostic, so CUDA was out of the picture, and as I mentioned before, OpenCL looked confusingly daunting (even with a step-by-step manual in front of me).  Though I did come across C++ AMP, and the Native Concurrency blog, and started reading about it.  I very quickly picked up the &lt;a href="http://www.gregcons.com/cppamp/"&gt;C++ AMP book by Gregory and Miller&lt;/a&gt; and started to work through it along with the code examples posted on the Native Concurrency blog.
&lt;/p&gt;&lt;p&gt;My first thought was that C++ AMP was &lt;em&gt;incredibly simple&lt;/em&gt;.  I mean amazingly so.  For simple kernels there was no need to worry about allocating memory, although I was initially confused with array and array_view given that I had not performed any matrix mathematics in the past.  I had no prior knowledge of lambdas (in actual fact I doubt I could describe one to you now), but using a combination of monkey see and monkey do, I was able to port a good deal of the basic structure of my PhD research code into C++ AMP.  All in all I probably saved 75%+ lines of code as well.  The only thing that bugs me is making all my code work for multi-GPU setups, as it only works for single GPUs right now.
&lt;/p&gt;&lt;p&gt;My final result was twofold – a benchmark for reviews (GPU version featured in &lt;a href="http://anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3"&gt;Ryan Smith's GTX Titan review&lt;/a&gt;, CPU version featured in my &lt;a href="http://www.anandtech.com/show/6533/gigabyte-ga7pesh1-review-a-dual-processor-motherboard-through-a-scientists-eyes"&gt;Dual Sandy Bridge-E&lt;/a&gt; and &lt;a href="http://anandtech.com/show/6808/westmereep-to-sandy-bridgeep-the-scientist-potential-upgrade"&gt;Upgrading from Westmere&lt;/a&gt; reviews) and a potential benchmark for overclocking.  The GPU version is purely C++ AMP, and the CPU version takes more bites from OpenMP due to efficiency.
&lt;/p&gt;&lt;p&gt;If I was learning how to program on GPUs today, from scratch, I would advise newcomers to look at C++ AMP first.  It seems quicker to get off the ground than CUDA, and not too overly confusing if you are familiar with C++ and a Windows environment.  CUDA does have its advantages, being the more optimized, being the older language and the clout of NVIDIA behind it, but C++ AMP offers that unrestrictive element of AMD GPUs and widens the hardware base of whatever software you are creating.
&lt;/p&gt;&lt;p&gt;The thread for the software can be found over at &lt;a href="http://www.overclock.net/t/1329409/"&gt;overclock.net&lt;/a&gt;. Basic code snippets can also be found in my &lt;a href="http://www.anandtech.com/show/6533/gigabyte-ga7pesh1-review-a-dual-processor-motherboard-through-a-scientists-eyes"&gt;Dual Sandy Bridge-E&lt;/a&gt; review."
&lt;/p&gt;&lt;p&gt;  --Dr Ian Cutress.
&lt;/p&gt;&lt;p&gt;Please do note that we are working with Dr. Cutress to get his software made available to the community. 
&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10413470" width="1" height="1"&gt;</description></item><item><title>Verity Software’s integrating C++ AMP into their GEMStone software</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/04/11/verity-software-s-integrating-c-amp-into-their-gemstone-software.aspx</link><pubDate>Thu, 11 Apr 2013 19:02:42 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10410401</guid><dc:creator>Boby George (MSFT)</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/04/11/verity-software-s-integrating-c-amp-into-their-gemstone-software.aspx#comments</comments><description>&lt;p&gt;&lt;a href="http://www.vsh.com/"&gt;Verity Software&lt;/a&gt; is a company that provides flow cytometry analysis software, and we are pleased to learn that they are currently integrating C++ AMP into their high-end application called &lt;a href="http://www.vsh.com/products/GemStone/index.asp"&gt;GemStone&lt;/a&gt;. You can learn more about how GemStone works by watching &lt;a href="http://www.youtube.com/watch?v=vqvqmbnSbpo"&gt;this YouTube video&lt;/a&gt;, where a GemStone model (used to analyze B-Cell maturation in bone marrow samples acquired with flow cytometry) is reviewed.
&lt;/p&gt;&lt;p&gt;&lt;img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/5811.041113_5F00_1902_5F00_VeritySoftw1.jpg" alt=""/&gt;
	&lt;/p&gt;&lt;p&gt;Since GemStone models are computationally intensive, Verity Software wants to leverage C++ AMP to speed up reviewing the models. So far, the hardest part of integrating C++ AMP has been the need to rework some of the algorithms so that they are naturally parallel. This is in line with our expectation that developers will have to rethink their sequential code to fit the programming model of C++ AMP (or any other parallel technology). Further, many of the data structures in the application also needed to change. As part of the &lt;a href="http://msdn.microsoft.com/en-us/library/dd492418.aspx"&gt;Parallel Patterns Library&lt;/a&gt;, we do offer &lt;a href="http://msdn.microsoft.com/en-us/library/dd504906.aspx"&gt;Parallel Containers and Objects&lt;/a&gt; to alleviate some of the need for concurrent data structures. 
&lt;/p&gt;&lt;p&gt;Some of the comments specific to C++ AMP included, to quote:
&lt;/p&gt;&lt;p&gt;"….As far as the nuts and bolts of [C++] AMP is concerned, it seems very straightforward to me.  I really like the idea of exploiting the lambda expression concept to run code on a GPU and not having a lot of very messy GPU specific code to deal with.  Other than changes to our data structures, the other issue that is taking some time is writing nice looking code that runs on all the platforms that we support.  Since our Mac versions of the software will not work with the amp.h and associated library, it's a bit tricky to have easily maintainable code that works for all these platforms…. I also probably don't need to tell you that it is a pain to have to decide whether you are going to debug the GPU or CPU before launching the program".
&lt;/p&gt;&lt;p&gt;To help our ecosystem partners support non-Microsoft platforms, the C++ AMP team released the &lt;a href="http://download.microsoft.com/download/4/0/E/40EA02D8-23A7-4BD2-AD3A-0BFFFB640F28/CppAMPLanguageAndProgrammingModel.pdf"&gt;open specification&lt;/a&gt;. Support for mixed-mode debugging across CPU and GPU is something we would like to address in future releases. This is the third post in the series detailing C++ AMP customer experiences.
 &lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10410401" width="1" height="1"&gt;</description></item><item><title>Aviary’s Photo Editor SDK for Windows 8 is using C++ AMP</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/04/01/aviary-s-photo-editor-sdk-for-windows-8-is-using-c-amp.aspx</link><pubDate>Mon, 01 Apr 2013 18:12:13 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10406750</guid><dc:creator>Boby George (MSFT)</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/04/01/aviary-s-photo-editor-sdk-for-windows-8-is-using-c-amp.aspx#comments</comments><description>&lt;p&gt;&lt;a href="http://www.aviary.com/"&gt;Aviary&lt;/a&gt; is a photo editing SDK company that "provides developers with a robust, customizable photo editor that can be plugged into applications on iOS, Android, Windows Phone, HTML5 and Windows 8 in minutes". We are delighted that Aviary is using C++ AMP inside their Photo Editor SDK for Windows 8. Needless to say, we are excited that through this SDK, the benefits of C++ AMP will be available to millions of Windows 8 customers. 
&lt;/p&gt;&lt;p&gt;&lt;a href="http://blog.aviary.com/wp-content/uploads/introducing_w8_landing2.png"&gt;&lt;img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/6014.040113_5F00_1812_5F00_AviarysPhot1.png" alt="" border="0"/&gt;&lt;/a&gt;
	&lt;/p&gt;&lt;p&gt;Aviary's simple yet powerful photo editing capabilities are used by more than 35 million monthly Aviary users across 3500+ partners. Apart from releasing the SDK, they also released a Windows Store app, &lt;a href="http://apps.microsoft.com/windows/en-US/app/photo-editor/cdd22d88-c0c4-4fff-a741-fe5ea3692b22"&gt;Aviary Photo Editor&lt;/a&gt;, and a half-dozen of their partners have released apps (&lt;a href="http://apps.microsoft.com/windows/en-us/app/rowi/ad0d96c9-d895-4be6-9967-26cad215d1ee" target="_blank"&gt;Rowi&lt;/a&gt;, &lt;a href="http://apps.microsoft.com/windows/en-us/app/memorylage/269b17da-9475-4339-9786-2131c9880d52" target="_blank"&gt;Memorylage&lt;/a&gt;, &lt;a href="http://apps.microsoft.com/windows/en-us/app/myframes/f17deb97-ddfd-4e98-bb15-de68206f2ffb" target="_blank"&gt;MyFrames&lt;/a&gt;, &lt;a href="http://apps.microsoft.com/windows/en-US/app/photo-annotater/be8cedf5-0e6f-4594-8df9-7d856d906d80" target="_blank"&gt;Photo Annotater&lt;/a&gt;, &lt;a href="http://apps.microsoft.com/windows/en-us/app/selektiv/75fc79b8-d528-44d5-b695-0cd90521967e" target="_blank"&gt;Selektiv&lt;/a&gt;, &lt;a href="http://volevi.com/" target="_blank"&gt;Volet by Volevi&lt;/a&gt;).
&lt;/p&gt;&lt;p&gt;The use of C++ AMP is mentioned in both Aviary's and AMD's blog posts. To quote:
&lt;/p&gt;&lt;p&gt;&lt;a href="http://blog.aviary.com/aviary-launches-windows-8-sdk-with-6-partners/"&gt;Aviary Blog&lt;/a&gt; mentions "….We've achieved significant performance improvements by implementing filters and effects through the new heterogeneous compute language C++ AMP. Computations are performed on the highly parallel graphics processing unit (GPU) cores inside the AMD APU instead of the more serial central processing unit (CPU) cores. Applying CAMP allows for processing of our complete range of filters and effects instantaneously – on average 16x faster than comparable processors, according to &lt;a href="http://blogs.amd.com/fusion/2013/03/18/new-aviary-sdk-and-windows-8-app-optimized-for-amd/"&gt;benchmark studies conducted by AMD&lt;/a&gt;"
&lt;/p&gt;&lt;p&gt;&lt;a href="http://blogs.amd.com/fusion/2013/03/18/new-aviary-sdk-and-windows-8-app-optimized-for-amd/"&gt;AMD's blog&lt;/a&gt; mentions "…Since the target was to design a Windows Store app, we used &lt;a href="http://msdn.microsoft.com/en-us/library/hh265137.aspx"&gt;C++ AMP&lt;/a&gt; to process the filters and effects on the graphics engine where it can be done much faster than processing on the traditional CPU cores. The end result is a super-fast experience that will delight consumers. "
&lt;/p&gt;&lt;p&gt;This is the second post in the series of posts detailing the use of C++ AMP by our customers. As always, we would love to hear how C++ AMP is being used by our customers.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10406750" width="1" height="1"&gt;</description></item><item><title>Kinect Fusion Runs on C++ AMP</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/03/29/kinect-fusion-runs-on-c-amp.aspx</link><pubDate>Fri, 29 Mar 2013 18:51:32 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10406371</guid><dc:creator>Boby George (MSFT)</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/03/29/kinect-fusion-runs-on-c-amp.aspx#comments</comments><description>&lt;p&gt;We are happy to announce that &lt;a href="http://msdn.microsoft.com/en-us/library/dn188670.aspx"&gt;Kinect Fusion&lt;/a&gt;, the newest feature in the &lt;a href="http://www.microsoft.com/en-us/kinectforwindows/develop/developer-downloads.aspx"&gt;Kinect for Windows SDK 1.7&lt;/a&gt;, runs on &lt;a href="http://msdn.microsoft.com/en-us/library/vstudio/hh265137.aspx"&gt;C++ AMP&lt;/a&gt;. &lt;a href="http://msdn.microsoft.com/en-us/library/dn188670.aspx"&gt;Kinect Fusion&lt;/a&gt; is a tool that uses the Kinect for Windows sensor to create highly accurate 3-D renderings of people and objects in real time. 
&lt;/p&gt;&lt;p&gt;Using C++ AMP for GPU-based reconstruction allows Kinect Fusion to provide customers with real-time, interactive frame rates on both NVidia (GeForce GTX560) and AMD (Radeon 6950) graphics cards. The picture below, obtained from the &lt;a href="http://blogs.msdn.com/b/kinectforwindows/archive/2013/03/18/the-latest-kinect-for-windows-sdk-is-here.aspx"&gt;Kinect for Windows product blog&lt;/a&gt;, shows the 3D rendering of a person created using Kinect Fusion.
&lt;/p&gt;&lt;p&gt;&lt;img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/4087.032913_5F00_1851_5F00_KinectFusio1.jpg" alt=""/&gt;
	&lt;/p&gt;&lt;p&gt;The &lt;a href="http://msdn.microsoft.com/en-us/library/dn188670.aspx"&gt;MSDN&lt;/a&gt; documentation on Kinect Fusion shows how, within a few seconds, Kinect Fusion is able to create a 3D reconstruction of a static scene.
&lt;/p&gt;&lt;p&gt;&lt;img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/4571.032913_5F00_1851_5F00_KinectFusio2.png" alt=""/&gt;
	&lt;/p&gt;&lt;p&gt;
 &lt;/p&gt;&lt;p&gt;In the coming weeks, we will posting a series of blog posts listing how some of our customers are using C++ AMP in their products. So stay tuned!&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10406371" width="1" height="1"&gt;</description></item><item><title>C++ AMP CPU fallback support now available on Windows 7</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/03/26/c-amp-cpu-fallback-support-now-available-in-windows-7.aspx</link><pubDate>Tue, 26 Mar 2013 19:40:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10405523</guid><dc:creator>Boby George (MSFT)</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/03/26/c-amp-cpu-fallback-support-now-available-in-windows-7.aspx#comments</comments><description>&lt;p&gt;We recently announced the availability of &lt;a href="http://blogs.msdn.com/b/nativeconcurrency/archive/2013/01/25/c-amp-gpu-debugging-now-available-on-windows-7.aspx"&gt;C++ AMP GPU debugging&lt;/a&gt; on Windows 7 and Windows Server 2008 R2 platforms. Now that the RTM version of Windows 7 &lt;a href="http://www.microsoft.com/en-us/download/details.aspx?id=36805"&gt;platform update&lt;/a&gt; has been released, we are happy to announce the availability of one more feature: C++ AMP CPU fallback support on Windows 7. So what does it mean?&lt;/p&gt;
&lt;p&gt;Previously, on a Windows 7 machine with no C++ AMP capable graphics card, if you tried to enumerate all the accelerators using the accelerator::get_all() function, you would have gotten the following output:&lt;/p&gt;
&lt;p&gt;&lt;img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/5353.032613_5F00_1940_5F00_CAMPCPUfal1.png" alt="" /&gt;&lt;/p&gt;
&lt;p&gt;As you may know, the &lt;a href="http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/10/cpu-accelerator-in-c-amp.aspx"&gt;CPU accelerator&lt;/a&gt; cannot be used for computation (not yet), while the &lt;a href="http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/11/direct3d-ref-accelerator-in-c-amp.aspx"&gt;Software adapter&lt;/a&gt; (or direct3d_ref) is a single-threaded software emulator. While you can run C++ AMP code on the direct3d_ref accelerator, this accelerator corresponds to the DirectX reference rasterizer and is very slow. It is meant to be used for driver validation and for debugging when &lt;a href="http://blogs.msdn.com/b/nativeconcurrency/archive/2013/01/25/c-amp-gpu-debugging-now-available-on-windows-7.aspx"&gt;GPU hardware debugging&lt;/a&gt; is not available. So for all practical purposes, there are no accelerators available on Windows 7 other than hardware GPUs. In fact, if you run the &lt;a href="http://ampbook.codeplex.com/SourceControl/changeset/view/101831"&gt;sample code&lt;/a&gt; provided in the C++ AMP book by Kate Gregory and Ade Miller, you will get the message "No accelerators found that are compatible with C++ AMP".&lt;/p&gt;
&lt;p&gt;When you install the &lt;a href="http://www.microsoft.com/en-us/download/details.aspx?id=36805"&gt;platform update&lt;/a&gt; or &lt;a href="http://windows.microsoft.com/en-US/internet-explorer/downloads/ie-10/worldwide-languages"&gt;Internet Explorer 10&lt;/a&gt;, you get CPU fallback support in addition to &lt;a href="http://blogs.msdn.com/b/nativeconcurrency/archive/2013/01/25/c-amp-gpu-debugging-now-available-on-windows-7.aspx"&gt;debugging support&lt;/a&gt;. To install the platform update, select the appropriate file to download (based on your machine's architecture) and proceed through the series of steps to complete the installation. Restart the computer when prompted. Run the same code again and you will notice a difference:&lt;/p&gt;
&lt;p&gt;&lt;img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/8715.032613_5F00_1940_5F00_CAMPCPUfal2.png" alt="" /&gt;&lt;/p&gt;
&lt;p&gt;Apart from ref, there is now one more software adapter present, direct3d\warp. You can learn more about WARP, which stands for Windows Advanced Rasterization Platform, on this &lt;a href="http://msdn.microsoft.com/en-us/library/gg615082(v=VS.85).aspx"&gt;MSDN page&lt;/a&gt;. With this update, WARP now supports DirectX 11 &lt;a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ff476331(v=vs.85).aspx"&gt;DirectCompute 5.0&lt;/a&gt;, thereby enabling support for running C++ AMP code. Please note that WARP does not support double precision. As always, any feedback or comments are welcome either below or at the &lt;a href="http://social.msdn.microsoft.com/Forums/en-US/parallelcppnative/threads"&gt;forum&lt;/a&gt;.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10405523" width="1" height="1"&gt;</description></item><item><title>Using C++ AMP in financial service industry – A Case Study </title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/03/12/using-c-amp-in-financial-service-industry-a-case-study.aspx</link><pubDate>Tue, 12 Mar 2013 17:13:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10401658</guid><dc:creator>Boby George (MSFT)</dc:creator><slash:comments>5</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/03/12/using-c-amp-in-financial-service-industry-a-case-study.aspx#comments</comments><description>&lt;p&gt;We would like to give a shout out to a C++ AMP case study we released a couple of months ago. The case study talks about how &lt;a title="CMA" href="http://www.cmavision.com/"&gt;CMA&lt;/a&gt;, a company that develops pricing solutions and financial data services, used C++ AMP to deliver a faster pricing mechanism without sacrificing accuracy. CMA is an excellent example of C++ AMP providing a significant advantage to our customers. 
To quote Moody Hadi, Research Director at CMA: &amp;ldquo;By using C++ AMP, we can generate fast, accurate pricing with less strain on our resources, which is a key differentiator in our segment and helps us provide greater value to our clients&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;Please read CMA&amp;rsquo;s fascinating story at Microsoft Case Studies Site. The case study is titled &lt;a title="Financial Data Leader Shortens Time-to-Market, Increases Speed with Right Tools" href="http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001354"&gt;Financial Data Leader Shortens Time-to-Market, Increases Speed with Right Tools&lt;/a&gt;. We will continue to blog about how our customers are using C++ AMP. So stay tuned!&lt;/p&gt;
&lt;p&gt;Update: In response to comments, the CMA folks have written an addendum to this case study, which is published in the &lt;a href="http://blogs.msdn.com/b/nativeconcurrency/archive/2013/05/09/addendum-to-cma-s-case-study.aspx"&gt;Addendum to CMA's case study&lt;/a&gt; blog post.&lt;/p&gt;
&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10401658" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/nativeconcurrency/archive/tags/case+study/">case study</category></item><item><title>Harmless DXGI warnings with C++ AMP</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/02/19/harmless-dxgi-warnings-with-c-amp.aspx</link><pubDate>Tue, 19 Feb 2013 15:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10313800</guid><dc:creator>Łukasz Mendakiewicz</dc:creator><slash:comments>5</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/02/19/harmless-dxgi-warnings-with-c-amp.aspx#comments</comments><description>&lt;p&gt;If you have tried to run a C++ AMP program in Visual Studio 2012 on Windows 8 in the debug configuration, you might have noticed that after a successful execution there are DXGI warnings reported in the output window. Long story short &amp;ndash; &lt;strong&gt;there is nothing to worry about; these warnings may be safely ignored&lt;/strong&gt;. If you are interested in the background, read on for more information.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/4213.DXGI_5F00_warnings_5F00_12F3F222.png"&gt;&lt;img style="display: inline; background-image: none;" title="DXGI_warnings" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/5706.DXGI_5F00_warnings_5F00_thumb_5F00_798BEEE7.png" alt="DXGI_warnings" width="644" height="369" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Fig. 1. Visual Studio 2012 with DXGI warnings visible in the output window.&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;DXGI&lt;/h1&gt;
&lt;p&gt;The &lt;a href="http://msdn.microsoft.com/en-us/library/windows/desktop/bb205075(v=vs.85).aspx"&gt;DXGI&lt;/a&gt; acronym stands for Microsoft DirectX Graphics Infrastructure. It is a low-level layer used by DirectX to communicate with the kernel-mode GPU driver. As you may know, the C++ AMP implementation provided by Microsoft builds on top of the DirectX stack and thus, by extension, on top of DXGI. What is crucial here is that Windows 8 introduces a new version, &lt;a href="http://msdn.microsoft.com/en-us/library/windows/desktop/hh404490(v=vs.85).aspx"&gt;DXGI 1.2&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;DXGI warnings&lt;/h1&gt;
&lt;p&gt;When you are working on Windows 8 and running your program in the debug configuration, even the simplest code using C++ AMP will result in DXGI warnings being reported in the Visual Studio output window. They will look similar to the following:&lt;/p&gt;
&lt;p&gt;&lt;span style="font-family: Consolas;"&gt;DXGI WARNING: Process is terminating. Using simple reporting. Please call ReportLiveObjects() at runtime for standard reporting. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Producer at 0x006C4820, Refcount: 3. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Object at 0x006CBB18, Refcount: 3. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Object at 0x006E9328, Refcount: 1. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Object : 2 [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Process is terminating. Using simple reporting. Please call ReportLiveObjects() at runtime for standard reporting. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Producer at 0x006E9A28, Refcount: 2. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Object at 0x006EBE50, Refcount: 3. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Object : 1 [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Process is terminating. Using simple reporting. Please call ReportLiveObjects() at runtime for standard reporting. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Producer at 0x00716688, Refcount: 2. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Object at 0x00718A00, Refcount: 3. [ STATE_CREATION WARNING #0: ] &lt;br /&gt;DXGI WARNING: Live Object : 1 [ STATE_CREATION WARNING #0: ]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;To explain the background, reporting &amp;ldquo;live objects&amp;rdquo; is part of the debugging capabilities added in DXGI 1.2. You can register to receive these messages yourself using the &lt;a href="http://msdn.microsoft.com/en-us/library/windows/desktop/Hh780371(v=vs.85).aspx"&gt;&lt;em&gt;IDXGIInfoQueue&lt;/em&gt;&lt;/a&gt; interface, or you can rely on them being printed to the Visual Studio 2012 output window, which is automatically enabled when you are using DXGI (and by extension, C++ AMP) in debug mode.&lt;/p&gt;
&lt;p&gt;The reported warnings are last-chance messages printed during DXGI shutdown. They list all DXGI objects that are still alive (meaning not released) at that point in time.&lt;/p&gt;
&lt;p&gt;Does it mean that the C++ AMP runtime is not freeing these objects? Yes. Is it harmful? No, since:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;There is a fixed number of such cached objects (consuming a small amount of resources), and their count never goes up during the application lifetime.&lt;/li&gt;
&lt;li&gt;The resources associated with these objects are automatically reclaimed by the operating system at process exit.&lt;/li&gt;
&lt;/ol&gt;
&lt;h1&gt;Diving into a deeper explanation&lt;/h1&gt;
&lt;p&gt;To go a little deeper, the C++ AMP runtime caches some DXGI objects in its static structures to improve performance. For example, this includes the &lt;a href="http://msdn.microsoft.com/en-us/library/windows/desktop/bb174523(v=vs.85).aspx"&gt;&lt;em&gt;IDXGIAdapter&lt;/em&gt;&lt;/a&gt;&lt;em&gt;s&lt;/em&gt; underlying Direct3D &lt;a href="http://www.danielmoth.com/Blog/concurrencyaccelerator.aspx"&gt;&lt;em&gt;accelerator&lt;/em&gt;&lt;/a&gt; objects. At a higher level, both the C++ AMP runtime and DXGI are dynamic libraries, where the former is a dependency of any C++ AMP program and the latter is a dependency of the runtime DLL. So far so good: static dependencies of DLLs are not the problem. However, the situation gets a little more complicated when we add truly dynamically loaded libraries (i.e. those loaded using &lt;a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms684175(v=vs.85).aspx"&gt;&lt;em&gt;LoadLibrary&lt;/em&gt;&lt;/a&gt;) to the equation. As it happens, the order in which they are unloaded relative to other dynamic libraries is not defined. Such libraries are quite common in C++ AMP programs; for example, all driver DLLs are currently loaded dynamically by DXGI. Going back to the C++ AMP runtime implementation, given that we cannot tell whether our DLL is being unloaded before or after the dynamically loaded driver DLL, we cannot safely release any object that comes from it. But as I explained before, this is not a problem, since the operating system is going to reclaim these resources anyway.&lt;/p&gt;
&lt;p&gt;One solution would be not to cache DXGI objects. That&amp;rsquo;s a big no-no for us because caching buys us performance, and we (and you) like performance. Another workaround would be to suppress these messages. Unfortunately, there is no way to do so selectively, and we do not want to inhibit this mechanism altogether, as some users might be creating DXGI objects themselves for their own use and might benefit from learning about extraordinary leaks.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10313800" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/nativeconcurrency/archive/tags/C_2B002B00_+AMP/">C++ AMP</category></item><item><title>Remote GPU Debugging on Nvidia Hardware</title><link>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/02/06/remote-gpu-debugging-on-nvidia-hardware.aspx</link><pubDate>Wed, 06 Feb 2013 20:04:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:10391656</guid><dc:creator>Paul Maybee - MSFT</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/b/nativeconcurrency/archive/2013/02/06/remote-gpu-debugging-on-nvidia-hardware.aspx#comments</comments><description>&lt;p&gt;The GPU debugger in Visual Studio 2012 (VS2012) can be extended by hardware vendors to debug directly on GPU hardware rather than on a software emulator. Nvidia has built such an extension and has made it publicly available as part of Nsight 3.0. &lt;strong&gt;&lt;em&gt;Nsight Visual Studio Edition 3.0 Release Candidate 1&lt;/em&gt;&lt;/strong&gt; is available &lt;a href="https://developer.nvidia.com/rdp/nsight-visual-studio-edition-early-access"&gt;here&lt;/a&gt;; you must first register and log in to the &lt;strong&gt;&lt;em&gt;Nsight Visual Studio Edition Registered Developer Program&lt;/em&gt;&lt;/strong&gt; &lt;a href="https://developer.nvidia.com/user/login"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once you have downloaded the Nsight 3.0 installer, you can follow the instructions below to begin debugging using Nvidia&amp;rsquo;s GPUs. Debugging on Nvidia hardware requires two machines: one &lt;strong&gt;&lt;em&gt;local&lt;/em&gt;&lt;/strong&gt; machine to run Visual Studio and one &lt;strong&gt;&lt;em&gt;remote&lt;/em&gt;&lt;/strong&gt; machine (running Windows 7 or Windows 8) with an Nvidia (Fermi or Kepler) GPU to use as a debugging target.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;For debugging purposes, because of current limitations in the driver, the remote machine must be configured with only one GPU card enabled.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Local Installation&lt;/h2&gt;
&lt;p&gt;1. Run the Nsight 3.0 installer on a local machine with an existing VS2012 installation. Once you have accepted the license terms you will be presented with a feature installation summary that looks something like this.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/0435.clip_5F00_image001_5F00_7BA6CEA1.png"&gt;&lt;img style="background-image: none; float: none; padding-top: 0px; padding-left: 0px; margin-left: auto; display: block; padding-right: 0px; margin-right: auto; border: 0px;" title="clip_image001" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/5238.clip_5F00_image001_5F00_thumb_5F00_345178AF.png" alt="clip_image001" width="496" height="422" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;2. Select &amp;ldquo;Customize&amp;rdquo; to install the C++ AMP VS extension to VS2012. The custom setup page allows you to individually select those features you want to install. The item you need is &amp;ldquo;Nsight C++ AMP Debugger&amp;rdquo;. Select it, hit Next, and let setup continue.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/8321.clip_5F00_image002_5F00_3B048232.png"&gt;&lt;img style="background-image: none; float: none; padding-top: 0px; padding-left: 0px; margin-left: auto; display: block; padding-right: 0px; margin-right: auto; border: 0px;" title="clip_image002" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/7343.clip_5F00_image002_5F00_thumb_5F00_0FBFEB2B.png" alt="clip_image002" width="503" height="430" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Remote Installation&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;On the remote machine a similar installation process is required. Again you run the installer and select &amp;ldquo;Customize&amp;rdquo;. This time the item to install is &amp;ldquo;Nsight C++ AMP Target Support for MSVSMON&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Disable TDR (Timeout Detection and Recovery) on the remote machine:
&lt;ul&gt;
&lt;li&gt;Under the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers, add TdrLevel (REG_DWORD) and set its value to 0.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Reboot the remote machine so that the new TDR setting takes effect.&lt;/li&gt;
&lt;/ol&gt;
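&lt;p&gt;As an illustration only, the registry change from step 2 could be captured in a .reg file along the following lines. This is a minimal sketch that assumes the standard Registry Editor file format; it is not part of the Nsight installer, and importing it requires administrator rights on the remote machine. To turn TDR back on after debugging, remove the TdrLevel value again and reboot.&lt;/p&gt;
&lt;pre&gt;Windows Registry Editor Version 5.00

; Assumption: illustrative file, created by hand for the remote machine.
; Sets TdrLevel to 0, which turns timeout detection off.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrLevel"=dword:00000000&lt;/pre&gt;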
&lt;h2&gt;Debugging&lt;/h2&gt;
&lt;p&gt;Now that the Nvidia extension is installed you can start debugging an application. The general instructions for remote debugging are in the blog post &lt;a href="http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/19/remote-gpu-debugging-in-visual-studio-11.aspx"&gt;Remote GPU Debugging in Visual Studio 11&lt;/a&gt;. The following changes are required:&lt;/p&gt;
&lt;p&gt;In Step 1, you cannot use a Remote Desktop Connection to log into the remote machine; you must log in at a display that is physically connected to the machine before you run msvsmon.exe.&lt;/p&gt;
&lt;p&gt;In Step 2, you will need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set &amp;ldquo;Debugging Accelerator Type&amp;rdquo; to &amp;ldquo;NVIDIA GPU&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now place your breakpoint, hit F5 and enjoy! Remember that while you are stopped at a breakpoint the GPU is really stopped, so the display will be frozen until you continue.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/6761.nvidiaCapture_5F00_1D923126.png"&gt;&lt;img style="background-image: none; float: none; padding-top: 0px; padding-left: 0px; margin-left: auto; display: block; padding-right: 0px; margin-right: auto; border: 0px;" title="nvidiaCapture" src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-04-99-metablogapi/3122.nvidiaCapture_5F00_thumb_5F00_7657E7F0.png" alt="nvidiaCapture" width="873" height="543" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=10391656" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/b/nativeconcurrency/archive/tags/C_2B002B00_+AMP/">C++ AMP</category></item></channel></rss>