Feedback requested: Enumerating Concurrent Collections

Feedback requested: Enumerating Concurrent Collections

  • Comments 24

The June 2008 CTP of Parallel Extensions contained a first look at some of the work we're doing to augment the .NET Framework with a set of additional coordination data structures that aid in the development of highly concurrent applications.  This included two thread-safe collections, ConcurrentQueue<T> and ConcurrentStack<T>, and it's probably not giving away too much to say that we're working diligently on additional collections that aid in key scenarios for parallel programming.  For the most part, these collections are similar to the exsiting non-thread-safe counterparts in the .NET Framework, except with APIs designed with concurrency patterns in mind, and of course implementations that tolerate multithreaded access.  Some of these designs come naturally and, from an implementation standpoint, are easily enabled.  However, there are times when we have to make significant trade-offs in functionality or performance in order to achieve certain designs, and for those situations, it's very important that we have a good understanding of the key scenarios and use cases in which the collections will be used.

One important design point we keep in mind for all of the collections we're working on is how to deal with concurrent modifications while a collection is being enumerated.  With the standard collections in the .NET Framework today, modifications to a collection invalidate any outstanding enumerators.  For example, consider the following code:

var q = new Queue<int>(new[] { 1, 2, 3, 4, 5 });
foreach (var item in q)
{
    if (item <= 5) q.Enqueue(item * 6);
}

This will compile fine, but upon execution, the call to q.Enqueue inside of the foreach loop will throw an InvalidOperationException: "Collection was modified after the enumerator was instantiated."  In contrast, if you change the Queue<int> to a ConcurrentQueue<int>, e.g.:

var q = new ConcurrentQueue<int>(new[] { 1, 2, 3, 4, 5 });
foreach (var item in q)
{
    if (item <= 5) q.Enqueue(item * 6);
}

no exception will be thrown, and upon examination of the queue after the loop, you'll find it to contain 10 elements.  This is an explicit design decision we've made to break the mold from the existing collections, the idea behind such a change in direction being that there are key scenarios where one thread needs to be enumerating while other threads are reading and writing to the collection (of course, as shown in this example, it could be the same thread reading and writing as the one enumerating).

This does raise some questions, of course, especially around the guarantees that can be provided for such an enumeration.  If data is added to the list while another thread is enumerating, must that enumerating thread see the new data?  If data is removed while another thread is enumerating, must that enumerating thread not see the new data?  Is it acceptable for data to be observed twice in the same enumeration or not at all due to the organization of the collection being changed concurrently?  And so forth.  For some collections, like ConcurrentStack<T> and ConcurrentQueue<T>, we're able to provide moment-in-time snapshot semantics, but for other collection types, doing so isn't necessarily possible without incurring serious performance penalties.

Given this, we're very curious to hear your feedback on a few things:

  1. Given the standard array of collections that exist in .NET, if you had thread-safe versions of them, are there scenarios where you would benefit from being able to enumerate concurrently with other threads modifying the same collection?  We already know about some, but we'd love to hear about more.
  2. Assuming you do have scenarios for #1, what are the minimum guarantees you'd need about the data returned in the enumerator for it to be useful?  For example, you could say for a thread-safe dictionary that if there are no concurrent modifications (adds/updates/removals), you'll get back in the enumeration exactly what's in the dictionary, and if there are concurrent accesses, you'll never get back something that wasn't in the dictionary, you may not see concurrent adds or updates, and you may still see items that are concurrently removed.
  3. If we did support enumeration on collections that were thread-safe in the face of concurrent modifications, would you ever want the ability to revert to the "throw on modifications" behavior?
  4. Finally, what are the most important collections that you'd like to see thread-safe and scalable counterparts for?

Thanks in advance for any input you may have!

Leave a Comment
  • Please add 2 and 6 and type the answer here:
  • Post
  • 1.  I think there's many scenarios where there is benefit to this.  Refreshing a UI based on a collection while a background thread is updating that collection is an example.

    2. I think by default enumeration should see all updates.  If the code doesn't want to see these updates it can make a copy of the collection (that it will have to ensure is done thread-safely.).

    3. It's important to recognize that someone writing code to use these concurrent collections is using them concurrently.  If they don't want the data to be modified externally they can introduce their own locking or work with a copy (see 2).  In which case I don't see a need to introduce traditional "throw on modification" behaviour to these new collections.

    4. I'd have to noodle on it a bit; but off the top of my head Stack and Queue seem to be data structures that would be most used by concurrent algorithms (that concurrently use data structures).

    Having just had a conversation about this with someone else, I'm a little leery of calling them "thread-safe" collections.  This gives a programmer a false sense of security that they can use these collections in any way they want with multiple threads without there being any unexpected consequences.  What about describing them as having operations that perform atomically?

  • OK, my thoughts, numbered according to the questions asked:

    1. Any kind of token passing system, like an in-code "Token Ring" can make use of this. A while back, some of the folks on Skyscrpr were talking about these kinds of architectures, and they seemed very beneficial in certain circumstances.

    2. The only realistic answer is, "it depends". In some scenarios, code will require the enumerator to work upon a "snapshot" of the collection, otherwise chaos will ensue. In other cases, you will want an enumerator to get real-time updates. This should be a toggle or a switch somewhere or an overload or a parameter to the enumberation, or maybe have 2 versions of each collection, one for each behavior (like "ConcurrentListSyncronized<T>" and "ConcurrentList<T>"?).

    3. By using the current object. :)

    4. List<T>, Dictionary<T> - the only collections I've ever found a need for in typical business scenarios, but a queue and a stack should be included too.

    By the way, I love that you guys are asking for feedback here, I think that you should do more of this!

    Thanks!

    J.Ja

  • 1. Having a Hashtable implementation (i.e. the generic Dictionary) would help when creating an app-domain level cache. I've long since created those on my own. A great addition to that would be to be able to asynchronously dump that cache from memory to persistent storage at times for backup, which would require safe enumeration. (Actually thinking about it - will serialization be thread-safe as well? What will be the default behavior there?)

    2. I want to be able to get the behavior I want. Assume we have an enumerable, thread-safe collection e. I'd like to be able to get the enumeration via e.GetEnumerator(ConcurrentBehavior.Latest), e.GetEnumerator(ConcurrentBehavior.Current), e.GetEnumerator(ConcurrentBehavior.None) or other types. This would mean that I control the behavior and that I can expect to get exactly what I want.

    3. I don't see a reason to, but there's no reason not to add it - see the None behavior in #2.

    4. List and Dictionary. In 99% of the cases, I use either in my code. (you've pretty much taken care of another 0.9% by including a stack and a queue already :)

    Maybe there would be a way to 'wrap' non-thread-safe collections with a sort of AsParallel method to get an IEnumerable that supports this behavior and then we won't have to change existing code's collections to the new collections for backwards compatibility's sake.

  • 1. Yes - when using an enumerable collection as a lazy data stream, with background thread filling it, and UI rendering it.

    2. I would like to have a choice between what would be equivalent to "read uncommitted" and "snapshot" in SQL - so either I see all the changes as they happen, or I see a snapshot of what it was when I started enumerating it, and all changes done on my own thread since then, but not any changes from other threads.

    3. Yes, "throw on modification" is often a good safeguard to catch mistakes.

    4. List, LinkedList, Dictionary, HashSet.

  • Another use case you may want to keep in mind is the one that most frequently introduces a beginning programmer to the "collection has changed" exception in the first place, the Remove Where Doesn't Belong pattern, where you iterate through (usually a copy of) a collection and remove items that no longer belong to it.

  • One scenario I ran into multiple times is (as mentioned already) an app-domain wide cache.

    For this scenario it would be really nice to have a CAS-like lookup-and-add operation. Or more concret, a way to look up an entry in the (concurrent) dictionary, and (in the case no entry exists) insert a future into the dictionary (all as one atomic operation). Afterwards the future could get initialized (and waited for by other threads).

    With this approach objects don't have to be created twice (or synchronices with other techniques) when two threads simultanously don't find an entry for the same key in the cache.

  • One idea I had, but that I never really had a need to experiment around with, was the idea of using continuations for lock/do-stuff/unlock semantics. Especially for asynchronous programming.

    Traditionally, to do multiple operations atomically, you lock, then do the stuff, then unlock. This has all the bug hazards we understand now: performance, what if you don't unlock, what if you access another collection and have A-B versus B-A deadlocks, etc.

    But what about using a continuation? For example:

    class ConcurrentList

    {

       ...

       void Lock(Action continuation);

       ...

    }

    Then you would do:

    ConcurrentList myList = ...;

    myList.Lock(() => { if (myList.Count == 3) { myList.Add(something); } else { myList.Remove(somethingElse); } });

    The idea being that the 'Lock' method automatically surrounds your callback with a lock/unlock type of construct. This way you are also not limited to the atomic operations that the class currently exposes. In fact, maybe instead of Lock() taking an Action, it takes a Func<ConcurrentListData>, and the ConcurrentList class itself doesn't actually have any methods for get, set, enumerate. This way you can ONLY access the class within a locked scope.

    Also, if you have a 'myList2' and want to nest these locking operations,

    var myList2 = ...;

    myList.Lock(() => myList2.Lock(() => { if (myList.Count + myList2.Count > 5) { ... } else { ... } }));

    Since you own the code for Lock() you can have better cooperation towards detecting and mitigating deadlocks (perhaps). Anyway like I said, I haven't experimented much with this yet.

    It is also easy enough to have a TryLock() method that returns true/false.

    The second part of this idea was to have a QueueLock() method. This would call your delegate immediately if it could lock it now, or it would call it 'later' if the lock was already held (perhaps on an appropriate UI thread via ISyncInvoke or a WPF Dispatcher). This of course would have its own bag of problems, but I'd love to see it explored. A naive implementation could be implemented as an extension method using TryLock().

  • First of all, I would like to say we really, really need a non-broken sorted list. The current one that throws an exception when you have more than one item with the same sort key is useless. The number of times I needed to sort on a unique key is zero.

    1. Not really. As long as you have a "Get Snapshot" method, I'm good.

    There is a basic assumption that if I'm enumerating a shared collection it isn't a one-time event. I am going to be re-enumerating that collection over and over again, usually on a time.

    2. N/A

    3. Sure, if you take a parameter on the enumerator:

    GetEnumerator(EnumerationOptions)

    EnumerationOptions

    Snapshot = 0

    //Throw on these conditions

    InvalidateOnInsert = 1

    InvalidateOnMove = 2

    InvalidateOnDelete = 4

    //Special handling of changes

    ShowInsertedItems = 8

    HideDeletedItems = 16

    ShowMovedItemsInNewLocation = 32 //may cause the same item to be seen twice

    But as I said before, I will be using "Snapshot" the vast majority of the time.

    4. Dictionaries, Stacks, and Queues are the ones I use the most when doing parallel programming.

    I would really like a PriorityQueue and UniqueQueue (ignores attempts to add the same item twice).

  • I agree with Jonathan. I'm consistantly faced with lists with duplicate values that need to be sorted...and stay sorted.

    Also, I'm constantly faced with the problem of enumerating the elements and modifying them at the same time.

    The above situation creates a problem that .Net does not address...how do you work with sorted listed where the objects change in value? If you use a sorted list your left with removing the item and then adding it back in so that the sort order comes out right. This appears not a fast operation...certainly not as fast as writing your own.

    Clearly what is needed is the CHOICE of what sort algorithm to us...as well as support of concurrency.

  • Oh and by the way... except for the concurrancy, the PriorityQueue and sorted list that allows duplicates are available in PowerCollections on the CodePlex site. It would be REALLY nice if MS would add them to .Net.

  • First of all, accept my deep appreciation for the wonderful work your team is doing. TPL is really looks great and I see lot of use in my domain.

    Only, thing I am missing in TPL is enough practical code examples in MSDN other than reading your blogs.

    I am thoroughly enjoying reengineering my enterprise application code using TPL.  I am still running into many issues but the performance gain I am getting makes the work worth carrying.  I am still a novice in this regard. What I am doing is traversing a tree of tasks in parallel and each branch task giving back more subtree tasks to be carried out level by level. Each task so generated is added to a global task array and which will be postprocessed before taken for processing during next iteration.

    Now, I  have couple of requests:

    1. I would like to assign tasks to number of threads/processors. But then how to direct the assignment to particular thread or thread which is free. We don't know which threads are finished and available for assignments. In old times, we could use Threadpool. In TPL any code implementation sample would be great.

    I see lot of threads ( 20+) shown for my application in the TaskManager->Resource Monitor even if I had expliclty created just two threads.

    2. In your example above, when I am using ConcurrentQueue,

    foreach (var item in q)

    {

       if (item <= 5) q.Enqueue(item * 6);

    }

    can I  use Enqueue for parallel.For also like below?

    Parallel.For (0,q.length, ....

    {

       if (item <= 5) q.Enqueue(item * 6);

    });

    Thanks

  • I like the enumerator options idea perhaps added with extension methods to existing collections (List and Dictionary).  I could see AsSnaphot() and AsMutable().  In the AsMutable() case, it would be nice to be able to query the enumerator if there have been adds/deletes within the loop.  In the AsSnaphost() case, there would be bonus points if it delayed any copying until there was actually a modification to the collection.

  • Peter, Justin, Omer, in19h, David, Thomas, Rick, Jonathan, Ram, and Keith: thank you all for the terrific and well-thought out feedback!  This is exactly the type of feedback we arelooking for, and it will help us to prioritize moving forwards.  Awesome.

  • 4) A Concurrent Priority Queue.  The sorted list is a poor man's priority queue, but I would love to have a real solid implementation that allows for items of the same priority to live in the queue as well (or groups them into buckets).  It should be more like a .NET 3.5 collection, where you can specify the type of object and a selector function:

    var myQ = new ConcurrentPriorityQueue<Widget>( (w) => w.Size );

    the above code would create a priority queue that always pushed the smallest widgets to the front.

  • I have a hard time imagining a scenario where an iterator pattern as such would be truly valuable on a concurrently updated collection. The set of values acquired and processed during the iteration will at best be somewhat similar as the set of values ‘contained’ in the collection. A snapshot variety doesn’t help much.  Since the ‘contents’ of the collection will be blurry at the moment the snapshot is taken the snapshot itself will be blurry as well.

    Using an iterator you would want to do one of two things. Either calculate a result or modify some state.  In the first case I would imagine some transformation methods taking a continuation like the ones defined in LINQ. A concurrent ‘Select’ method, for example, would transform the values of the source and its result would be a concurrently updated collection itself. It would effectively shield the developer of the nastiness of concurrency.

    When updating state some form of transaction would be required (STM?). A context wherein value sources can be queried and state updated in a safe manner. Such a context could be a required parameter for a new ‘GetEnumerator’ method.

    It could be an option to have collection change notifications sent to a client of a snapshot iterator. The code to handle these will be complex and I don’t think this will lead to any convenient form of programming. Possibly such a mechanism could be used to construct higher level functions like the coined concurrent ‘Select’ method.

    I think iterators in a truly concurrent environment without some form of transaction mechanism may have some niche uses but would in fact be a bad idea. They inherently have a very poor compatibility with multithreading and are an open invitation to software disasters.

Page 1 of 2 (24 items) 12