This morning I was in an interesting email thread with some of my Microsoft colleagues about the use of Windows Workflow Foundation (WF) on a load balanced server farm. The question that started the whole thing was that a customer had asked their local Microsoft representative if there were any "issues" to think about when using WF on a server farm.
To be honest with you my first response was... why not use BizTalk for that? After all BizTalk has support for enterprise scale solutions with the kind of robust management and monitoring that would likely be required for this type of solution.
And then there is the way that the SqlPersistenceProvider works (yes, you can always write your own persistence provider). My first reaction was that for short and sweet simple things it would probably be OK, but for long-running stuff I wouldn't do it.
But then I went for a run and couldn't stop thinking about it... So I came back and wrote down some thoughts that you might find useful. Here they are.
After thinking a bit more deeply about this issue allow me to add some additional thoughts (this turned out to be much longer than I anticipated but hang in there and let me know if it makes sense).
Any multi-user application has concurrency issues right?
When you write your business logic you have to watch out for things like lost updates, concurrent writes, and race conditions.
There are two basic philosophies for dealing with this. Optimistic and Pessimistic locking.
Optimistic assumes that contention over a shared resource is relatively rare and so developers assume (optimistically) that it will usually not happen. However they code defensively with things like row version identifiers to detect the rare cases when it does happen.
Pessimistic assumes that contention is likely to happen and locks a resource before attempting to work on it. Such locks typically have a lock expiration time span in the event that the locking process crashes or hangs. In the typical case the resource is locked, the work is done and the lock is released in relatively short order.
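To make the pessimistic philosophy concrete, here is a minimal sketch of a lock with an expiration window, using an in-memory dictionary in place of a shared database table. All the names here (LockManager, acquire, release) are illustrative, not any real WF or WCF API.

```python
import time

class LockManager:
    def __init__(self, lock_timeout=30.0):
        self.lock_timeout = lock_timeout
        self.locks = {}  # resource_id -> (owner, expires_at)

    def acquire(self, resource_id, owner):
        now = time.time()
        holder = self.locks.get(resource_id)
        # A lock held past its expiration is treated as abandoned
        # (the locking process may have crashed or hung).
        if holder is not None and holder[1] > now and holder[0] != owner:
            return False  # someone else holds a live lock
        self.locks[resource_id] = (owner, now + self.lock_timeout)
        return True

    def release(self, resource_id, owner):
        holder = self.locks.get(resource_id)
        if holder is not None and holder[0] == owner:
            del self.locks[resource_id]

locks = LockManager(lock_timeout=30.0)
assert locks.acquire("process-42", "node-A")       # node A locks it
assert not locks.acquire("process-42", "node-B")   # node B is rejected
locks.release("process-42", "node-A")
assert locks.acquire("process-42", "node-B")       # now node B can lock it
```

The expiration check is what keeps a crashed node from holding the resource forever: once the timeout passes, any other node can take the lock.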
Let’s imagine a multi-step business process without using WF but instead a WCF service. Suppose you create a Service Contract like this one.
    [ServiceContract]
    public interface IService1
    {
        [OperationContract]
        Guid DoStep1(string initialData);

        [OperationContract]
        int DoStep2(Guid processId, string additionalData);
    }
This contract represents a two-step business process that you want to complete. When you invoke step 1 you are creating a business process that has state that has to be stored and will need additional work to be completed at some later point. You return an instance identifier to the client as a way to connect the second step to the business process created by the first step.
Now to get back to the original question about load balancing
Does load balancing impact this concurrency problem?
Not really, load balancing across a server farm doesn’t make this problem any better or worse. Even with a single server, multiple users can send concurrent update requests which will cause a contention over a shared resource. If the shared resource is accessible by all servers in the farm you have exactly the same problem with one server or 100 servers.
Note: The way to know if load balancing will affect your problem is to consider the use of local resources on a server (memory, files etc.) and the relationship of local state (on the one server) to shared state (available to the farm).
What happens when you start the business process with DoStep1?
When a message to DoStep1 is received by the service endpoint, it creates a new business process and returns an identifier for that process.
Design Issue #1: How to detect and deal with duplicate business processes being created by service consumers
There is a risk that a client might inadvertently send more than one DoStep1 message for the same process. A well designed service would detect that the process (with this data) has already been created and that the second attempt is a duplicate. Does load balancing a server farm make this problem any better or worse? No.
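One way to sketch that duplicate detection: key the store on the initial data, so a repeated DoStep1 hands back the original identifier instead of creating a second process. Everything here is hypothetical and illustrative, not the actual service implementation.

```python
import uuid

processes_by_data = {}   # initialData -> process id
process_state = {}       # process id -> process state

def do_step1(initial_data):
    existing = processes_by_data.get(initial_data)
    if existing is not None:
        # Duplicate create: return the original identifier rather
        # than starting a second copy of the same business process.
        return existing
    process_id = str(uuid.uuid4())
    processes_by_data[initial_data] = process_id
    process_state[process_id] = {"step": 1, "data": initial_data}
    return process_id

first = do_step1("order-1001")
second = do_step1("order-1001")   # client inadvertently resends
assert first == second            # same process, not a new one
```

A real service would more likely use a client-supplied correlation key or a unique constraint in the database, but the idea is the same: the create operation is made idempotent.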
What happens when you complete the business process with DoStep2?
Now the client is ready to complete this multi-step process by sending the DoStep2 message.
You send a message to the load balancer which selects a destination node. At this point you have a concurrency issue. What if someone else is trying to update the same business process at the same time?
The Optimistic solution would say
· This is a relatively rare occurrence
· We can detect this by using some kind of a version identifier in the data and rejecting updates destined for a version that no longer exists because another user updated it.
The Pessimistic solution would say
· I’ll try to lock the business process instance prior to updating it. If another user has it locked I will return an error
· When I lock it I will set a lock expiration time in case I crash or hang during the update; in that event no other node will be able to update this business process until the lock expires
In either case, most of the time the business process will complete successfully but occasionally you will have to deal with concurrency problems that lead to failures either by being unable to lock the business process (within the timeout) or being unable to update the process because the row version has changed.
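The optimistic path for DoStep2 can be sketched like this: each stored process carries a version number, and an update only succeeds if the version the caller read is still current. The names and store are illustrative, not a real WCF/WF API.

```python
# Shared store standing in for a database table with a row-version column.
store = {"process-42": {"version": 1, "data": "step1 done"}}

def do_step2_optimistic(process_id, expected_version, new_data):
    row = store[process_id]
    if row["version"] != expected_version:
        # Another user updated the process first: reject this update.
        return False
    row["data"] = new_data
    row["version"] += 1
    return True

v = store["process-42"]["version"]
assert do_step2_optimistic("process-42", v, "step2 done (user A)")
# User B also read version 1, but A has already moved the row to version 2:
assert not do_step2_optimistic("process-42", v, "step2 done (user B)")
```

User B's failed update is exactly the "rare but must be handled" case of Design Issue #2: the code detects it, and the service has to decide whether to retry, re-read, or report an error.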
Design Issue #2: How to detect and deal with failure to complete a business process due to concurrency issues
Does load balancing affect this issue? No it is the exact same problem regardless of the number of processing nodes.
How does this scenario change with Windows Workflow?
Imagine now our contract is implemented with a workflow that looks like the following
How does this change things?
1. Now there is a WorkflowServiceHost that will receive the message and try to load a business process prior to delivering the message to it
2. The persistence provider is making decisions about optimistic or pessimistic concurrency and lock management
The SqlPersistenceProvider by default uses pessimistic concurrency but it doesn’t have to. You can set the lock timeout to 0 and it will assume an optimistic concurrency mode.
When you do optimistic concurrency with your workflow, you take responsibility for managing concurrency issues. For this reason it makes sense that the persistence provider will default to managing it for you pessimistically.
Now let’s think through those design challenges and ask the question “How should we deal with these issues when using WF?”
The most likely place to deal with this is in the activities that are launched after receiving the message. Just as with your WCF service, you will have to design a way in your code to detect that the business process has already been created. If you do detect this you will (probably) want to log the event and terminate the workflow, which will cause the persistence provider to remove it from the persistence store, and return an error to the client, perhaps along with a reference to the workflow instance ID that is managing that business process.
How does load balancing affect this issue?
It has little or no effect on it. Yes a workflow is created in memory on a particular server but as long as the persistence store is shared across the server farm there is no real affinity with that workflow instance. The same thing would occur with a WCF service that creates an instance of a service class on a per-call basis to process the request. As soon as the workflow instance completes all scheduled activities it will go idle and will be persisted to the shared persistence store. A request destined for that workflow could be addressed by any node in the server farm as long as they are able to load the instance based on the way that the persistence provider handles concurrent requests.
With pessimistic concurrency (the default behaviour of the SqlPersistenceProvider) the WorkflowServiceHost will call GetWorkflow, which invokes the persistence provider. When the provider detects that the instance is locked by another host (on any server in the farm) it will throw an exception, which will be returned to the client.
In the optimistic locking scenario your code will likely throw an exception when it detects that it cannot complete the operation because another user has updated the process before you did. In this case the most likely response is to abandon the request and move on.
What other options are there?
If you don’t want the client apps to have to deal with exceptions due to concurrency problems you are going to have to introduce a server side layer that can deal with them. This layer would likely accept requests from clients, immediately return some kind of identifier to them for future reference, and queue the requests in a durable store of some kind (MSMQ or a database) for processing nodes to deal with. Now the server side components can implement policy on when to retry, when to abandon, and how to manage the results of failures when they occur.
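A toy sketch of that server-side layer: accept a request, hand back a ticket immediately, queue the work durably, and let a processing node apply a retry policy. The deque here stands in for MSMQ or a database table, and all names are illustrative assumptions.

```python
import uuid
from collections import deque

work_queue = deque()   # stand-in for a durable queue (MSMQ, database)
results = {}           # ticket -> ("ok", result) or ("failed", reason)

def submit(request):
    ticket = str(uuid.uuid4())
    work_queue.append({"ticket": ticket, "request": request, "attempts": 0})
    return ticket  # client gets an identifier for future reference

def process_next(handler, max_attempts=3):
    item = work_queue.popleft()
    item["attempts"] += 1
    try:
        results[item["ticket"]] = ("ok", handler(item["request"]))
    except Exception as exc:
        if item["attempts"] < max_attempts:
            work_queue.append(item)   # policy: requeue and retry later
        else:
            results[item["ticket"]] = ("failed", str(exc))

# A handler that fails once (simulating a transient concurrency error).
calls = {"n": 0}
def flaky_handler(request):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient concurrency failure")
    return request.upper()

t = submit("complete step 2")
process_next(flaky_handler)   # first attempt fails, item is requeued
process_next(flaky_handler)   # retry succeeds
assert results[t] == ("ok", "COMPLETE STEP 2")
```

The client never sees the transient failure; it just polls (or is notified) using the ticket. The retry/abandon policy lives entirely in the server-side layer.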
When you think about it this is the sort of thing that BizTalk does.
So how does this affect a recommendation for using WF with long-running workflows on a server farm?
With either strategy optimistic or pessimistic locking you want to keep the work that you are doing short and sweet (and probably under the scope of a transaction). The workflow itself actually represents a series of steps that move the business process forward between persistence points.
In either case you have to deal with the (relatively rare) occurrence of concurrency problems.
Would using WF in a load balanced server farm of “Web Service Processing Nodes” make sense?
After thinking it through more I would have to say yes, it can make sense. Really you have the same problems to deal with whether or not you bring WF into the picture. You just need to understand how WF deals with them (via the persistence provider).
Whew... I hope that helps. Let me know if you think I was unclear or just plain wrong....
Great post. I'd like to add that just by considering Workflow means you can tackle some of those issues, i.e. a series of steps that may have required a long running transaction can be split into smaller, managed, workflow steps. Splitting long running tasks into these smaller tasks makes it easier for a load balanced farm to process the steps faster, or at least with less blocking of the underlying resources. Obviously it means you now have the worry of compensation...
one correction - the SqlPersistenceProvider only provides pessimistic locking if you use a constructor that specifies InstanceOwnershipDuration and this behavior is not the default with other constructors.
This is just an observation, but there's an underlying assumption in your analysis. It's not necessarily a bad one though, since it's a fairly widespread one that not many people challenge. If your state is maintained by an RDBMS, it's reinforced even more.
That assumption is the prioritization of consistency. Amazon has some interesting stuff around CAP, and that's largely what I'm referring to.
In this case, you chose consistency, so you'll have to sacrifice partitionability or availability.
A more interesting scenario would be one where consistency is *eventual*. In your description above, consistency is checked on write (update, delete, etc). If you chose partitionability and availability with *eventual* consistency, you move the consistency checks to the reads.
This would mean, for example, that two inconsistent writes would only be detected on read. It sounded a little crazy the first time I heard about it, but it makes perfect sense. If you have two inconsistent writes on the shopping cart, for example, you can simply merge the two inconsistent versions into a new consistent version (manage the conflict). And, at least for shopping carts, that's a pretty simple process (take all items from both carts and add them to the new consistent version).
Using that new paradigm, you can change your quality attributes to achieve both partitionability and availability.
So, just to reiterate, your underlying assumption above is the prioritization of consistency in the CAP tradeoff.
Just an observation..