PLINQ and Int32.MaxValue

In both .NET 4 and .NET 4.5, PLINQ supports enumerables with up to Int32.MaxValue elements.  Beyond that limit, PLINQ will throw an overflow exception.  LINQ to Objects itself has this limitation with certain query operators (such as the indexed Select operator, which counts the elements processed), but PLINQ has it with more operators.
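
To make the limitation concrete, here's a sketch (hypothetical sample code, not from the post) of a query that would be expected to fail: LongSequence is a made-up helper producing more than Int32.MaxValue elements, and the PLINQ query over it should eventually throw an OverflowException.

// (assumes using System; using System.Collections.Generic; using System.Linq;)
static IEnumerable<long> LongSequence(long count)
{
    // A source deliberately longer than Int32.MaxValue (2,147,483,647) elements.
    for (long i = 0; i < count; i++) yield return i;
}

// Expect this to run for a very long time and then throw an OverflowException
// somewhere past the Int32.MaxValue-th element.
foreach (long value in LongSequence(5000000000L).AsParallel().Select(x => x * 2))
{
    // consume results
}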

This limitation impacts so few scenarios that it’s relatively benign and I rarely hear about it.  That said, it does come up now and again, and I was in fact asked about it earlier this week.  So, to help in case anyone else runs into this…

There is a relatively straightforward workaround you can apply if you do run up against this limit: ensure the enumerable passed to PLINQ has no more than Int32.MaxValue elements.  That answer might sound flippant and impractical, but in reality it's often easily accomplished by batching a longer enumerable into multiple shorter enumerables, where none of the shorter enumerables exceeds the limit.

Consider a basic query like the following:

IEnumerable<Output> outputs =
    from input in inputs.AsParallel()
    where Filter(input)
    select Map(input);

If the inputs enumerable has more than Int32.MaxValue elements, this will result in an overflow.  To work around this, imagine we had a Batch extension method for IEnumerable<T> that partitioned the IEnumerable<T> into multiple sequential IEnumerable<T> instances, each of which had no more than Int32.MaxValue elements.  With that, we could rewrite the query as follows:

IEnumerable<Output> outputs = inputs.Batch(Int32.MaxValue).SelectMany(batch =>
    from input in batch.AsParallel()
    where Filter(input)
    select Map(input)
);

The original query remains intact, except that instead of operating on the original inputs enumerable, it's operating on each batch enumerable provided by the Batch method.  We'll now be parallelizing our processing of each sequential batch, moving on to the next batch to be processed in parallel once we finish the previous one.

Such a Batch method isn't actually built into .NET 4.5, but writing one is relatively straightforward.  Here's one implementation, which you could customize as your needs demand:

static class MyLinqExtensions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(
        this IEnumerable<T> source, int batchSize)
    {
        // A single enumerator is shared across all batches; each MoveNext here
        // positions the enumerator at the first element of the next batch.
        using (var enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return YieldBatchElements(enumerator, batchSize - 1);
    }

    private static IEnumerable<T> YieldBatchElements<T>(
        IEnumerator<T> source, int batchSize)
    {
        // Yield the element the caller already advanced to, then up to
        // batchSize more (the caller passes the original size minus one).
        yield return source.Current;
        for (int i = 0; i < batchSize && source.MoveNext(); i++)
            yield return source.Current;
    }
}
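
To sanity-check the behavior, here's a small usage sketch (hypothetical sample code) with a tiny batch size, which makes the batching easy to observe.  Because all batches share a single underlying enumerator, each batch should be fully consumed before moving on to the next, as both this foreach and the SelectMany-based pattern shown earlier do:

// Prints batches of sizes 3, 3, 3, 3, 3, 3, and 2:
//   1, 2, 3
//   4, 5, 6
//   ...
//   19, 20
foreach (IEnumerable<int> batch in Enumerable.Range(1, 20).Batch(3))
{
    Console.WriteLine(string.Join(", ", batch));
}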

Comments
  • FWIW, Batch *does* exist in the _awesome_ MoreLinq project by Jon Skeet and Aziz Atif.

    code.google.com/.../Batch.cs

    And it looks like Aziz is working on updating the NuGet build for it as well!

    nuget.org/.../morelinq

    There's a ton of other useful things in that project - the *By methods (DistinctBy, MinBy, MaxBy, etc) are incredibly nice to have.  Pipe lets me write simple preprocess logic 'inline', etc.

    Know it, learn it, love it. :)

  • @James: While that's true and a good reference, folks should keep in mind that the implementations differ in a very important way: the MoreLinq solution buffers each batch into an array before yielding any elements.  That has some advantages in some cases, but for this purpose I explicitly opted away from doing that, as otherwise you'd be buffering Int32.MaxValue elements here.  Beyond the delays/pauses in processing that would cause, it would also be much more likely to hit out-of-memory conditions, due to the inability to find contiguous memory of 2GB or greater for the array allocations.
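
    To illustrate the difference, a buffering implementation looks roughly like the following sketch (hypothetical code, not MoreLinq's actual implementation):

        // Sketch of a buffering Batch (an extension method in a static class):
        // each batch is materialized into an array before being yielded.
        // Safe to re-enumerate, but requires O(batchSize) memory per batch,
        // which is prohibitive when batchSize is Int32.MaxValue.
        public static IEnumerable<IEnumerable<T>> BufferedBatch<T>(
            this IEnumerable<T> source, int batchSize)
        {
            var buffer = new List<T>();
            foreach (T item in source)
            {
                buffer.Add(item);
                if (buffer.Count == batchSize)
                {
                    yield return buffer.ToArray();
                    buffer.Clear();
                }
            }
            if (buffer.Count > 0)
                yield return buffer.ToArray();  // final partial batch
        }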

  • I think that is a valid approach. One needs to be careful not to introduce a sequential bottleneck doing this. Amdahl's Law applies.
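
    (Amdahl's Law: with a parallel fraction p of the work and n processors, speedup is bounded by 1 / ((1 - p) + p / n); any sequential hand-off between batches reduces p and caps the achievable speedup.)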

  • You could also look at the recently open sourced IX Extensions -- Extensions to Enumerable from the RX team.

  • @Stephen: The reason Batch in MoreLINQ buffers is for correctness.  The batch your implementation yields is not restartable, with the side effect of potentially producing wrong results or breaking an existing query downstream when a batch is re-enumerated.  For example, you would expect the following:

       var query =
           from b in Enumerable.Range(1, 20).Batch(3)
           select b.Count() + ": " + string.Join(", ", b);

       Array.ForEach(query.ToArray(), Console.WriteLine);

    to print:

       3: 1, 2, 3
       3: 4, 5, 6
       3: 7, 8, 9
       3: 10, 11, 12
       3: 13, 14, 15
       3: 16, 17, 18
       2: 19, 20

    Instead, it prints:

       3: 3, 4, 5
       3: 8, 9, 10
       3: 13, 14, 15
       3: 18, 19, 20

    To get it to work right, you'd have to buffer per batch like this:

       var query =
           from b in Enumerable.Range(1, 20).Batch(3)
           select b.ToArray() into b // buffer
           select b.Count() + ": " + string.Join(", ", b);

       Array.ForEach(query.ToArray(), Console.WriteLine);

    Then again, if you just ToArray on the batches, like this:

       var query =
           from b in Enumerable.Range(1, 20).Batch(3).ToArray()
           select b.ToArray() into b
           select b.Count() + ": " + string.Join(", ", b);

       Array.ForEach(query.ToArray(), Console.WriteLine);

    you get yet another (and more) bizarre output:

       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20
       1: 20

    With MoreLINQ's Batch, all of the above produce the same results.  While buffering has its downsides at the scale you're discussing, correctness becomes increasingly important as your LINQ queries get large and you need to be able to reason about them as you compose them, often without executing them.

    I understand your motivation for your implementation, but it needs a big warning/disclaimer, as the user needs to be more cautious when combining it with different operators.

    I think if I had to work around the same 2GB limitation, I'd start by creating batches of batches with MoreLINQ's Batch. :)  The cost is then diminished significantly to buffering the inner-most batch (ideally sized just small enough to stay off the large object heap), and that too during only one iteration.

  • @raboof.com: Thank you for your comments.  I'm well aware that there can be advantages to the implementations in MoreLINQ, IX, etc., advantages I actually already alluded to in my previous reply.  There are also disadvantages, and for this particular situation, I'm suggesting that for the kinds of queries folks have run into with this PLINQ limit, the buffering hasn't been necessary and the cost of the buffering has been prohibitive.  (I also don't understand your comment about creating batches of batches, as I don't see how that's significantly different from just using a smaller batch size to begin with.)  

    Just to put things in perspective: on my machine, the first MoveNext call on an enumerator from such a large Batch call takes upwards of 30 seconds just to retrieve the first element, due to all of the elements needing to be enumerated in order to buffer them so that the first can be returned (and that's if I didn't OOM first from the large, contiguous buffer being allocated).

    In any event, this comment stream has devolved into a discussion of implementing a Batch extension method, which wasn't the point of this post; the point of the post was that there's a limitation which a user can work around, and I provided one possible way to do so.  If folks would prefer to do so in another way, that's perfectly fine.

  • This might be a nitpick (or I might be overlooking something): Would it not be better to move the "batchSize - 1" down into the YieldBatchElements<T>() method?  That way, YBE could be used on its own with the correct number of elements yielded.

  • @Caleb Bell: You could if you wanted to, though you still probably wouldn't end up using YBE "on its own", given that it does "yield return source.Current" without having called MoveNext() directly, which means it's relying on the caller to have already done that.
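
    For reference, the variant being suggested would look something like this (a sketch of the hypothetical change, not code from the post):

        // Batch would now pass batchSize through unmodified:
        //     yield return YieldBatchElements(enumerator, batchSize);
        private static IEnumerable<T> YieldBatchElements<T>(
            IEnumerator<T> source, int batchSize)
        {
            // Still relies on the caller having already called MoveNext().
            yield return source.Current;
            for (int i = 1; i < batchSize && source.MoveNext(); i++)
                yield return source.Current;
        }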
