Many of our customers are asking us for guidance on what they can do to make use of multicore systems. This is a good time to be thinking about this problem since single thread performance will remain relatively flat for the forseeable future. As developers throw more features into applications, there's a potential for performance regressions. So, the question here is really about the process to follow in order to parallelize applications. Below, I will outline a process that I've been advocating internally in my parallel development class as a practical approach to the problem. I can provide more details upon request.
Step 1: Identifying Your Goals
Before we even start discussing a parallelization process, we need to be aware of a primary principle: parallelism is about performance, and the performance that you should care care about is a function of the specific application and usage scenarios. So, the first step in any performance improvement effort needs to include the following:
Step 2: Measuring Your Baseline Performance
Now that you've identified your scenarios, you have to evaluate your current state compared to your ultimate goals. The important steps here are:
Step 3: Traditional Tuning Comes First
If a scenario fails, you should attempt traditional performance tuning before considering parallelism. Why? Unnecessarily introducing parallelism can become a burden on your development and test resources. With parallelism, you start introducing nondeterminism, race conditions, synchronization overheads, etc. and these are issues that just increase development and testing overheads. So, it's best to avoid them.
The main culprits for single thread performance are I/O and cache behavior (memory latency really). This is a large topic that we can also get into, but here are some key points that you should keep in mind:
Step 4: Identify Opportunities for Parallelism
If you're here, then you've attempted step 3 for failing scenarios and you have reached the conclusion that you cannot meet your performance goals unless you parallelize. There are two classes of parallelism that you can exploit usually: hiding latency from your user (or overlapping work) or reducing latency of computation. Notice that the first is often related to I/O, but can also be used to hide CPU-bound work. The second is almost always a technique to speed-up CPU-bound work. The main forms of parallelism that you're likely to encounter fall under one or more of these categories:
Step 5: Decide on Phase(s) to Parallelize
The decision should be based on some estimates of potential improvement. You can optimistically assume linear speed up (= the number of cores working on a problem) for now. Estimate, using the performance instrumentation that you've been using so far, what fraction of the target scenario's performance will be improved, and plug-in this data into something like Amdahl's Law to get an estimate of potential speedup for the whole scenario. This is a critical step to avoid investing time on work that will not achieve your goals.
Step 6: Implement the Parallel Algorithm
This is a large topic, but I offer some guidance here:
Step 7: Tune and Iterate
Repeat the measurement step and don't forget about tuning the cache behavior of your parallel implementation. The tools that my team is shipping in Visual Studio 2010 will be very valuable for parallel tuning. You'll have to reduce synchronization overheads for example and identify reasons for blocking UI threads, contributing to UI hangs etc. and our tools will be a great help.
I think that I've included enough here to give you a feel for what the process might look like. There are many points in this post that can be significantly elaborated on, but I'll leave the choice to you folks. Let me know what your questions/concerns are and I'll try to answer them.
Take care.