QUESTION: Does a ParallelActivity start one thread for each branch?  How many threads will it use for processing?

ANSWER: The short answers are "no" and "1".  If you've heard it before, then these answers make sense.  If not, then read on for an explanation of parallelism in WF.

To keep programming simple for both custom activity authors and workflow authors, WF guarantees that only one .NET thread will be executing any portion of a given workflow instance at any given time.  This means that the handler for your CodeActivity, the Execute method for your custom activity, and the sequence in your EventHandlerActivity never have to worry that some other part of the workflow is executing at the same time.  If you think about it a bit, you will see that this greatly simplifies programming WF applications and is one of the main reasons why writing complex custom activities is a surprisingly simple task.

Isn't that bad?

The usual knee-jerk reaction is an outcry about the loss of multi-threaded parallelism.  Isn't that a step backward?  The answer is no.  First, the single-threaded guarantee applies only within a single instance of a workflow.  If you have two instances of the same workflow executing simultaneously, they will exhibit true multi-threaded parallelism.

Second, let's remember the nature of the workflow beast.  Workflows are meant to coordinate tasks, both human and computer, in an event-driven world over an unknown amount of time.  The workflow itself processes in short bursts with long periods of dormancy in between.  For example, a workflow might send an e-mail requesting that a task be performed and then persist to the database.  Only once the task is complete will the workflow come back to life and process some more ... and this processing should be limited to deciding what action should be taken next and delegating that action to some external source, whether that be a human or a service added to the WorkflowRuntime.

This behavior of a single workflow instance means that true parallelism is wholly unnecessary.  If the workflow assigns 10 tasks in parallel, it is highly unlikely that there will be a processing bottleneck when collating the results, which return scattered across time.  While I've got no hard evidence to back this up, it is my opinion that a single thread per instance actually improves the performance of WF rather than hindering it, given the reduced complexity in activity execution code, the lack of necessity for locking constructs, and the burst-processing nature of workflow.

How it works

WF is a scheduled environment.  Abstractly, you can consider that every instance of a workflow has its own scheduler, which is just a queue of delegates.  The scheduler simply loops through a two-step sequence: dequeue the next delegate and call it.  If there are no more items on the queue, the workflow is idle.  If we consider that the scheduler has just one thread and invokes the delegates synchronously, we can see where the single-threaded guarantee comes from.
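
The dequeue-and-call loop described above can be sketched in a few lines.  This is a toy Python model and not the WF API: the names Scheduler, schedule, run_until_idle and demo are made up for illustration, and a real runtime does far more (persistence, fault handling, and so on).

```python
from collections import deque


class Scheduler:
    """Toy per-instance scheduler: a FIFO queue of callables (hypothetical, not WF)."""

    def __init__(self):
        self.queue = deque()
        self.log = []  # names of work items in the order they ran

    def schedule(self, name, work):
        # Work is never run here; it is only appended to the queue.
        self.queue.append((name, work))

    def run_until_idle(self):
        # Step 1: dequeue the next item.  Step 2: call it synchronously.
        while self.queue:
            name, work = self.queue.popleft()
            self.log.append(name)
            work(self)  # the item may schedule more work while it runs
        return "idle"   # nothing left on the queue


def demo():
    s = Scheduler()

    def first(sched):
        # Work scheduled during execution goes to the back of the queue.
        sched.schedule("third", lambda sched: None)

    s.schedule("first", first)
    s.schedule("second", lambda sched: None)
    state = s.run_until_idle()
    return s.log, state
```

Because every item is invoked synchronously by the one loop, no two items can ever overlap, which is all the single-threaded guarantee amounts to in this model.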

During the execution of a delegate (like Activity.Execute) there are several occurrences which can cause new items to be added to the scheduler queue.  ActivityExecutionContext.ExecuteActivity() can be used to schedule a child's execution; throwing an exception will cause the runtime to schedule the HandleFault method for the activity; calling Activity.Invoke<>() will cause the specified delegate to be scheduled; and returning a value of ActivityExecutionStatus.Closed will cause the runtime to schedule the OnClosed method.  These are just some of the triggers which cause new items to be added to the queue.
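
Two of these triggers, a thrown exception and a returned Closed status, can be illustrated with another toy Python model.  Nothing here is the real WF runtime: the function run, the tuple layout of the queue items, and the string labels are assumptions made for the sketch, and the fault handler is simply assumed to close the activity.

```python
from collections import deque

CLOSED = "Closed"  # stands in for ActivityExecutionStatus.Closed


def run(queue):
    """Drain a queue of (label, callable, activity_name) items.

    The outcome of each item may cause follow-up work to be enqueued.
    """
    ran = []
    while queue:
        label, fn, activity = queue.popleft()
        ran.append(label)
        try:
            status = fn()
        except Exception:
            # A thrown exception schedules the activity's fault handler.
            # Assumption: the handler finishes by returning CLOSED.
            queue.append((f"{activity}.HandleFault", lambda: CLOSED, activity))
            continue
        if status == CLOSED:
            # Returning Closed schedules the activity's OnClosed.
            queue.append((f"{activity}.OnClosed", lambda: None, activity))
    return ran


def boom():
    raise ValueError("activity failed")


order = run(deque([
    ("a.Execute", lambda: CLOSED, "a"),
    ("b.Execute", boom, "b"),
]))
# order is:
# ["a.Execute", "b.Execute", "a.OnClosed", "b.HandleFault", "b.OnClosed"]
```

Note that a.OnClosed does not run immediately after a.Execute: it is queued behind b.Execute, which is the interleaving the walkthrough below relies on.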

Extended ParallelActivity walkthrough

The ParallelActivity, when executed, will schedule the Activity.Execute method for each of its direct children and subscribe to the Closed event for each child.  The result is that the scheduler queue will look something like this (first item to be dequeued is on left):
{child1.Execute, child2.Execute, child3.Execute}

Calling child1.Execute might result in a DelayActivity's Execute being added to the queue: {child2.Execute, child3.Execute, delay1.Execute}

Now consider if child2 contains a single CodeActivity and child3 is empty:
{delay1.Execute, code1.Execute, child3.OnClose}

Up to this point we have had purely interleaved execution.  Draw it out and remember that in normal execution an activity will have its Execute scheduled, then it will schedule any work it needs to do, then it will have its OnClose scheduled, and then anyone listening to the Closed event will be scheduled.  Knowing this, you can walk through almost any chain of activities.

Back to our queue: next, the delay disappears from the queue because it has added a timer to the workflow, the code activity executes, and the third child processes its close:
{code1.OnClose, parallel.OnChildClosed(child3)}

Without drawing it out, let's say that the delay is long enough to let child2 close as well before the timer fires.  The ParallelActivity's Closed handler will determine that the parallel cannot yet close because it still has an executing child.  The scheduler will then run out of items in the queue and mark the workflow as idle.  The next exciting thing to happen is that the timer will schedule a callback for the delay:
{delay1.OnTimer}

Again, this will cause child1.OnClose to be scheduled which will cause parallel.OnChildClosed(child1) to be scheduled which, finally, will result in parallel.OnClose being scheduled.

The important thing to notice is that with non-blocking activities we get predictable interleaved execution.  But as soon as we add a blocking activity, the delay in our case, we get execution that approaches real-world workflow scenarios.  Imagine that each branch has an event on which it is waiting ... the execution is no longer just interleaved; whichever branch's event fires first gets executed first.  Unless all of the branches receive their events simultaneously, we get parallel processing with a single thread of execution.
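
For readers who would rather run the walkthrough than draw it, here is a small Python simulation that reproduces the queue snapshots shown above.  It is a sketch under stated assumptions, not WF code: the Activity and Delay classes, the schedule/step/snapshot helpers, and the OnChildClosed labels on the branches are all invented for the model, and each branch holds at most one child so that a single composite class can stand in for both the parallel and its sequence branches.

```python
from collections import deque

queue = deque()  # scheduler queue of (label, callable) pairs
timers = []      # callbacks parked outside the queue until the timer "fires"
trace = []       # labels in the order they actually ran


def schedule(label, fn):
    queue.append((label, fn))


def step(n):
    """Dequeue and run up to n items, one at a time, on the calling thread."""
    for _ in range(n):
        if not queue:
            break
        label, fn = queue.popleft()
        trace.append(label)
        fn()


def snapshot():
    return [label for label, _ in queue]


class Activity:
    """Hypothetical composite/leaf activity; branches hold at most one child."""

    def __init__(self, name, children=()):
        self.name, self.parent = name, None
        self.children, self.open_children = list(children), 0
        for child in self.children:
            child.parent = self

    def execute(self):
        if not self.children:  # leaf or empty branch: nothing to wait for
            self.finish()
            return
        self.open_children = len(self.children)
        for child in self.children:
            schedule(f"{child.name}.Execute", child.execute)

    def finish(self):
        schedule(f"{self.name}.OnClose", self.on_close)

    def on_close(self):
        if self.parent is not None:  # notify whoever listens for Closed
            label = f"{self.parent.name}.OnChildClosed({self.name})"
            schedule(label, self.parent.child_closed)

    def child_closed(self):
        self.open_children -= 1
        if self.open_children == 0:
            self.finish()


class Delay(Activity):
    def execute(self):
        # Blocking activity: register a timer and leave the queue untouched.
        timers.append((f"{self.name}.OnTimer", self.finish))


def walkthrough():
    child1 = Activity("child1", [Delay("delay1")])
    child2 = Activity("child2", [Activity("code1")])
    child3 = Activity("child3")
    parallel = Activity("parallel", [child1, child2, child3])

    shots = []
    parallel.execute()            # as if parallel.Execute had just been dequeued
    shots.append(snapshot())
    for n in (1, 2, 3):           # run child1; then child2 and child3; then the next three
        step(n)
        shots.append(snapshot())
    step(100)                     # drain: child2 closes, parallel still waits on child1
    shots.append(snapshot())      # empty queue, so the workflow is idle
    for label, fn in timers:      # the timer fires while the workflow is idle
        schedule(label, fn)
    timers.clear()
    shots.append(snapshot())
    step(100)                     # delay1, child1 and finally parallel close
    return shots
```

Calling walkthrough() returns six snapshots: the four queues listed in the walkthrough, an empty list for the idle workflow, and finally ["delay1.OnTimer"].  After it returns, the last entry in trace is "parallel.OnClose", all of it produced by one thread.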