Welcome to MSDN Blogs Sign in | Join | Help

News

  • These postings are provided "AS IS" with no warranties and confer no rights. All code and tools presented are done so under the Microsoft Public License.
Feedback requested: Enumerating Concurrent Collections

The June 2008 CTP of Parallel Extensions contained a first look at some of the work we're doing to augment the .NET Framework with a set of additional coordination data structures that aid in the development of highly concurrent applications.  This included two thread-safe collections, ConcurrentQueue<T> and ConcurrentStack<T>, and it's probably not giving away too much to say that we're working diligently on additional collections that aid in key scenarios for parallel programming.  For the most part, these collections are similar to the exsiting non-thread-safe counterparts in the .NET Framework, except with APIs designed with concurrency patterns in mind, and of course implementations that tolerate multithreaded access.  Some of these designs come naturally and, from an implementation standpoint, are easily enabled.  However, there are times when we have to make significant trade-offs in functionality or performance in order to achieve certain designs, and for those situations, it's very important that we have a good understanding of the key scenarios and use cases in which the collections will be used.

One important design point we keep in mind for all of the collections we're working on is how to deal with concurrent modifications while a collection is being enumerated.  With the standard collections in the .NET Framework today, modifications to a collection invalidate any outstanding enumerators.  For example, consider the following code:

var q = new Queue<int>(new[] { 1, 2, 3, 4, 5 });
foreach (var item in q)
{
    if (item <= 5) q.Enqueue(item * 6);
}

This will compile fine, but upon execution, the call to q.Enqueue inside of the foreach loop will throw an InvalidOperationException: "Collection was modified after the enumerator was instantiated."  In contrast, if you change the Queue<int> to a ConcurrentQueue<int>, e.g.:

var q = new ConcurrentQueue<int>(new[] { 1, 2, 3, 4, 5 });
foreach (var item in q)
{
    if (item <= 5) q.Enqueue(item * 6);
}

no exception will be thrown, and upon examination of the queue after the loop, you'll find it to contain 10 elements.  This is an explicit design decision we've made to break the mold from the existing collections, the idea behind such a change in direction being that there are key scenarios where one thread needs to be enumerating while other threads are reading and writing to the collection (of course, as shown in this example, it could be the same thread reading and writing as the one enumerating).

This does raise some questions, of course, especially around the guarantees that can be provided for such an enumeration.  If data is added to the list while another thread is enumerating, must that enumerating thread see the new data?  If data is removed while another thread is enumerating, must that enumerating thread not see the new data?  Is it acceptable for data to be observed twice in the same enumeration or not at all due to the organization of the collection being changed concurrently?  And so forth.  For some collections, like ConcurrentStack<T> and ConcurrentQueue<T>, we're able to provide moment-in-time snapshot semantics, but for other collection types, doing so isn't necessarily possible without incurring serious performance penalties.

Given this, we're very curious to hear your feedback on a few things:

  1. Given the standard array of collections that exist in .NET, if you had thread-safe versions of them, are there scenarios where you would benefit from being able to enumerate concurrently with other threads modifying the same collection?  We already know about some, but we'd love to hear about more.
  2. Assuming you do have scenarios for #1, what are the minimum guarantees you'd need about the data returned in the enumerator for it to be useful?  For example, you could say for a thread-safe dictionary that if there are no concurrent modifications (adds/updates/removals), you'll get back in the enumeration exactly what's in the dictionary, and if there are concurrent accesses, you'll never get back something that wasn't in the dictionary, you may not see concurrent adds or updates, and you may still see items that are concurrently removed.
  3. If we did support enumeration on collections that were thread-safe in the face of concurrent modifications, would you ever want the ability to revert to the "throw on modifications" behavior?
  4. Finally, what are the most important collections that you'd like to see thread-safe and scalable counterparts for?

Thanks in advance for any input you may have!

Posted: Tuesday, August 12, 2008 8:51 AM by toub

Comments

Peter Ritchie said:

1.  I think there's many scenarios where there is benefit to this.  Refreshing a UI based on a collection while a background thread is updating that collection is an example.

2. I think by default enumeration should see all updates.  If the code doesn't want to see these updates it can make a copy of the collection (that it will have to ensure is done thread-safely.).

3. It's important to recognize that someone writing code to use these concurrent collections is using them concurrently.  If they don't want the data to be modified externally they can introduce their own locking or work with a copy (see 2).  In which case I don't see a need to introduce traditional "throw on modification" behaviour to these new collections.

4. I'd have to noodle on it a bit; but off the top of my head Stack and Queue seem to be data structures that would be most used by concurrent algorithms (that concurrently use data structures).

Having just had a conversation about this with someone else, I'm a little leery of calling them "thread-safe" collections.  This gives a programmer a false sense of security that they can use these collections in any way they want with multiple threads without there being any unexpected consequences.  What about describing them as having operations that perform atomically?

# August 12, 2008 5:32 PM

Justin James said:

OK, my thoughts, numbered according to the questions asked:

1. Any kind of token passing system, like an in-code "Token Ring" can make use of this. A while back, some of the folks on Skyscrpr were talking about these kinds of architectures, and they seemed very beneficial in certain circumstances.

2. The only realistic answer is, "it depends". In some scenarios, code will require the enumerator to work upon a "snapshot" of the collection, otherwise chaos will ensue. In other cases, you will want an enumerator to get real-time updates. This should be a toggle or a switch somewhere or an overload or a parameter to the enumberation, or maybe have 2 versions of each collection, one for each behavior (like "ConcurrentListSyncronized<T>" and "ConcurrentList<T>"?).

3. By using the current object. :)

4. List<T>, Dictionary<T> - the only collections I've ever found a need for in typical business scenarios, but a queue and a stack should be included too.

By the way, I love that you guys are asking for feedback here, I think that you should do more of this!

Thanks!

J.Ja

# August 12, 2008 11:02 PM

Omer van Kloeten said:

1. Having a Hashtable implementation (i.e. the generic Dictionary) would help when creating an app-domain level cache. I've long since created those on my own. A great addition to that would be to be able to asynchronously dump that cache from memory to persistent storage at times for backup, which would require safe enumeration. (Actually thinking about it - will serialization be thread-safe as well? What will be the default behavior there?)

2. I want to be able to get the behavior I want. Assume we have an enumerable, thread-safe collection e. I'd like to be able to get the enumeration via e.GetEnumerator(ConcurrentBehavior.Latest), e.GetEnumerator(ConcurrentBehavior.Current), e.GetEnumerator(ConcurrentBehavior.None) or other types. This would mean that I control the behavior and that I can expect to get exactly what I want.

3. I don't see a reason to, but there's no reason not to add it - see the None behavior in #2.

4. List and Dictionary. In 99% of the cases, I use either in my code. (you've pretty much taken care of another 0.9% by including a stack and a queue already :)

Maybe there would be a way to 'wrap' non-thread-safe collections with a sort of AsParallel method to get an IEnumerable that supports this behavior and then we won't have to change existing code's collections to the new collections for backwards compatibility's sake.

# August 13, 2008 1:49 AM

int19h said:

1. Yes - when using an enumerable collection as a lazy data stream, with background thread filling it, and UI rendering it.

2. I would like to have a choice between what would be equivalent to "read uncommitted" and "snapshot" in SQL - so either I see all the changes as they happen, or I see a snapshot of what it was when I started enumerating it, and all changes done on my own thread since then, but not any changes from other threads.

3. Yes, "throw on modification" is often a good safeguard to catch mistakes.

4. List, LinkedList, Dictionary, HashSet.

# August 13, 2008 3:50 AM

David said:

Another use case you may want to keep in mind is the one that most frequently introduces a beginning programmer to the "collection has changed" exception in the first place, the Remove Where Doesn't Belong pattern, where you iterate through (usually a copy of) a collection and remove items that no longer belong to it.

# August 13, 2008 1:24 PM

Thomas Danecker said:

One scenario I ran into multiple times is (as mentioned already) an app-domain wide cache.

For this scenario it would be really nice to have a CAS-like lookup-and-add operation. Or more concret, a way to look up an entry in the (concurrent) dictionary, and (in the case no entry exists) insert a future into the dictionary (all as one atomic operation). Afterwards the future could get initialized (and waited for by other threads).

With this approach objects don't have to be created twice (or synchronices with other techniques) when two threads simultanously don't find an entry for the same key in the cache.

# August 13, 2008 3:19 PM

Rick Brewster said:

One idea I had, but that I never really had a need to experiment around with, was the idea of using continuations for lock/do-stuff/unlock semantics. Especially for asynchronous programming.

Traditionally, to do multiple operations atomically, you lock, then do the stuff, then unlock. This has all the bug hazards we understand now: performance, what if you don't unlock, what if you access another collection and have A-B versus B-A deadlocks, etc.

But what about using a continuation? For example:

class ConcurrentList

{

   ...

   void Lock(Action continuation);

   ...

}

Then you would do:

ConcurrentList myList = ...;

myList.Lock(() => { if (myList.Count == 3) { myList.Add(something); } else { myList.Remove(somethingElse); } });

The idea being that the 'Lock' method automatically surrounds your callback with a lock/unlock type of construct. This way you are also not limited to the atomic operations that the class currently exposes. In fact, maybe instead of Lock() taking an Action, it takes a Func<ConcurrentListData>, and the ConcurrentList class itself doesn't actually have any methods for get, set, enumerate. This way you can ONLY access the class within a locked scope.

Also, if you have a 'myList2' and want to nest these locking operations,

var myList2 = ...;

myList.Lock(() => myList2.Lock(() => { if (myList.Count + myList2.Count > 5) { ... } else { ... } }));

Since you own the code for Lock() you can have better cooperation towards detecting and mitigating deadlocks (perhaps). Anyway like I said, I haven't experimented much with this yet.

It is also easy enough to have a TryLock() method that returns true/false.

The second part of this idea was to have a QueueLock() method. This would call your delegate immediately if it could lock it now, or it would call it 'later' if the lock was already held (perhaps on an appropriate UI thread via ISyncInvoke or a WPF Dispatcher). This of course would have its own bag of problems, but I'd love to see it explored. A naive implementation could be implemented as an extension method using TryLock().

# August 13, 2008 4:42 PM

Jonathan Allen said:

First of all, I would like to say we really, really need a non-broken sorted list. The current one that throws an exception when you have more than one item with the same sort key is useless. The number of times I needed to sort on a unique key is zero.

1. Not really. As long as you have a "Get Snapshot" method, I'm good.

There is a basic assumption that if I'm enumerating a shared collection it isn't a one-time event. I am going to be re-enumerating that collection over and over again, usually on a time.

2. N/A

3. Sure, if you take a parameter on the enumerator:

GetEnumerator(EnumerationOptions)

EnumerationOptions

Snapshot = 0

//Throw on these conditions

InvalidateOnInsert = 1

InvalidateOnMove = 2

InvalidateOnDelete = 4

//Special handling of changes

ShowInsertedItems = 8

HideDeletedItems = 16

ShowMovedItemsInNewLocation = 32 //may cause the same item to be seen twice

But as I said before, I will be using "Snapshot" the vast majority of the time.

4. Dictionaries, Stacks, and Queues are the ones I use the most when doing parallel programming.

I would really like a PriorityQueue and UniqueQueue (ignores attempts to add the same item twice).

# August 14, 2008 12:46 AM

David Larson said:

I agree with Jonathan. I'm consistantly faced with lists with duplicate values that need to be sorted...and stay sorted.

Also, I'm constantly faced with the problem of enumerating the elements and modifying them at the same time.

The above situation creates a problem that .Net does not address...how do you work with sorted listed where the objects change in value? If you use a sorted list your left with removing the item and then adding it back in so that the sort order comes out right. This appears not a fast operation...certainly not as fast as writing your own.

Clearly what is needed is the CHOICE of what sort algorithm to us...as well as support of concurrency.

# August 14, 2008 1:37 AM

David Larson said:

Oh and by the way... except for the concurrancy, the PriorityQueue and sorted list that allows duplicates are available in PowerCollections on the CodePlex site. It would be REALLY nice if MS would add them to .Net.

# August 14, 2008 1:40 AM

Ram Hegde said:

First of all, accept my deep appreciation for the wonderful work your team is doing. TPL is really looks great and I see lot of use in my domain.

Only, thing I am missing in TPL is enough practical code examples in MSDN other than reading your blogs.

I am thoroughly enjoying reengineering my enterprise application code using TPL.  I am still running into many issues but the performance gain I am getting makes the work worth carrying.  I am still a novice in this regard. What I am doing is traversing a tree of tasks in parallel and each branch task giving back more subtree tasks to be carried out level by level. Each task so generated is added to a global task array and which will be postprocessed before taken for processing during next iteration.

Now, I  have couple of requests:

1. I would like to assign tasks to number of threads/processors. But then how to direct the assignment to particular thread or thread which is free. We don't know which threads are finished and available for assignments. In old times, we could use Threadpool. In TPL any code implementation sample would be great.

I see lot of threads ( 20+) shown for my application in the TaskManager->Resource Monitor even if I had expliclty created just two threads.

2. In your example above, when I am using ConcurrentQueue,

foreach (var item in q)

{

   if (item <= 5) q.Enqueue(item * 6);

}

can I  use Enqueue for parallel.For also like below?

Parallel.For (0,q.length, ....

{

   if (item <= 5) q.Enqueue(item * 6);

});

Thanks

# August 15, 2008 5:33 PM

Keith Hill said:

I like the enumerator options idea perhaps added with extension methods to existing collections (List and Dictionary).  I could see AsSnaphot() and AsMutable().  In the AsMutable() case, it would be nice to be able to query the enumerator if there have been adds/deletes within the loop.  In the AsSnaphost() case, there would be bonus points if it delayed any copying until there was actually a modification to the collection.

# August 18, 2008 1:33 AM

toub said:

Peter, Justin, Omer, in19h, David, Thomas, Rick, Jonathan, Ram, and Keith: thank you all for the terrific and well-thought out feedback!  This is exactly the type of feedback we arelooking for, and it will help us to prioritize moving forwards.  Awesome.

# August 23, 2008 4:31 PM

Frank Levine said:

4) A Concurrent Priority Queue.  The sorted list is a poor man's priority queue, but I would love to have a real solid implementation that allows for items of the same priority to live in the queue as well (or groups them into buckets).  It should be more like a .NET 3.5 collection, where you can specify the type of object and a selector function:

var myQ = new ConcurrentPriorityQueue<Widget>( (w) => w.Size );

the above code would create a priority queue that always pushed the smallest widgets to the front.

# August 30, 2008 8:08 AM

Thomas van der Ploeg said:

I have a hard time imagining a scenario where an iterator pattern as such would be truly valuable on a concurrently updated collection. The set of values acquired and processed during the iteration will at best be somewhat similar as the set of values ‘contained’ in the collection. A snapshot variety doesn’t help much.  Since the ‘contents’ of the collection will be blurry at the moment the snapshot is taken the snapshot itself will be blurry as well.

Using an iterator you would want to do one of two things. Either calculate a result or modify some state.  In the first case I would imagine some transformation methods taking a continuation like the ones defined in LINQ. A concurrent ‘Select’ method, for example, would transform the values of the source and its result would be a concurrently updated collection itself. It would effectively shield the developer of the nastiness of concurrency.

When updating state some form of transaction would be required (STM?). A context wherein value sources can be queried and state updated in a safe manner. Such a context could be a required parameter for a new ‘GetEnumerator’ method.

It could be an option to have collection change notifications sent to a client of a snapshot iterator. The code to handle these will be complex and I don’t think this will lead to any convenient form of programming. Possibly such a mechanism could be used to construct higher level functions like the coined concurrent ‘Select’ method.

I think iterators in a truly concurrent environment without some form of transaction mechanism may have some niche uses but would in fact be a bad idea. They inherently have a very poor compatibility with multithreading and are an open invitation to software disasters.

# September 5, 2008 6:06 PM

Softlion said:

Your problem can be compared to the problem resolved (or not) by database builders. In SQL Server this is called ISOLATION LEVEL.

SQL Select statements (for reading a collection) have hints (options), among them "nolock" (reader may see new entries inserted by another thread) and "readpast" (reader will never see new entries) which can fully work as parallelized.

So it may be a good idea to provide either one enumerator with an isolation level option, or at least two enumerators which behave differently: one for a nolock like reading, the other for a readpast like reading.

I hope I'm not off subject !

# September 23, 2008 3:59 PM

toub said:

Frank, thanks for the suggestion.  

Thomas, thanks for the feedback.  There are a multitude of cases where seeing the latest and greatest version of a collection isn't all that important, and for those cases we've now heard quite a bit of feedback that support for concurrent enumeration is crucial.  As Softlion points out, this is very similar in nature to isolation levels in databases, and they have been used successfully for a long time.  Still, I definitely understand your concerns, as such capabilities do make it possible for developers to more easily make mistakes, assuming that the data they're looking at is both a snapshot and consistent with the current state of the collection.

Softlion, thanks for the comparison.  You're not off subject at all.

# September 26, 2008 9:30 AM

BK said:

This (below link) article describes a Concurrent collection in C# - MultiMap (similar to Dictionary)

http://www.codeproject.com/KB/cs/MultiMap_P_2.aspx

# March 11, 2009 1:04 PM

Yurik said:

I second all the requests for the concurrent priority queue implementation. It is badly needed in the framework, and being concurrent is paramount. Thanks!

# March 30, 2009 10:56 AM

supercat said:

The ability to support a "delete what doesn't belong" pattern is very useful, and I see no reason an iEnumerable should not be allowed to support such a thing (even though some styles of collection might have trouble with implementing it, I see no reason for the contract to require an exception in cases where an enumeration would otherwise be able to handle things sensibly).

If one is implementing a new design, I would suggest a "set options" method, along with an "available options" property, which would specify the acceptable/available behaviors for an enumerator in case of insertion and deletion (the bitmap passed to the method would offer a choice of acceptable options.  If none of the acceptable behaviors were available, an exception would be thrown when attempting to set the option.  Even if there were no settable options, the method could do nothing if the calling program's acceptable choices would work, and throw an exception if they would not.

For most scenarios, I would suggest that the preferred handling would be to specify that (1) an object which is inserted during an enumeration may or may not appear in that enumeration, with or without a reset; (2) an object which is deleted during an enumeration may or may not appear even with a reset, but if the object has already been enumerated it will not reappear until a reset; (3) all objects which existed prior to the start of enumeration and which are not deleted will be enumerated once each; (4) if an ordering is specified for the enumerator, items will be output consistent with that order (e.g. if a items are to be enumerated in alphabetical order and "aardvark" is added after "elephant" has been enumerated, "aardvark" should not be included in that enumeration); (5) ordering of items following a reset is arbitrary unless specified otherwise.

If one is designing a new interface for an enumerable type, it may be helpful to add a 64-bit 'ChangeSeed' value which, if sampled before the start of an enumeration and after its completion, will indicate whether anything in the collection may have changed during the enumeration.  There are many times when throwing an exception because something changed would be very nasty behavior, but knowing that something changed would nonetheless be very useful.

# March 31, 2009 11:50 AM

Jerome Paradis said:

Here's a scenario I'm looking at presently:

An highly scalable distributed in-memory architecture, for millions of users. On a single server, you want all operations done in memory. Instead of using a disk based database with caching, the primary source of data is in memory. However, you want to synchronize changes to some data on disk (in a database) as live backup.

On solution I'm looking at (that I'll have to benchmark) is to use queue(s) to flag updated data objects. Then, background process(es) could navigate the queue to make changes to disk while emptying objects from the queue while going through the queue.

During the queue navigation, objects can be added to the queue from other in-memory updates.

In that scenario, it does not really matter if added objects to the queue are visible while navigating the queue. If we reach the end, we can check if new objects where added to the queue. If not, we can sleep a bit and go back to check the queue.

Of course, the example is simplified, but you get the idea. Such system would need to be benchmarked and probably need some multitheading balance to make sure the queues don't grow faster than they can eventually be emptied.

# April 11, 2009 7:12 PM
Leave a Comment

(required) 

(required) 

(optional)

(required) 

  
Enter Code Here: Required

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Page view tracker