Computing a Cartesian Product with LINQ

Computing a Cartesian Product with LINQ

Rate This
  • Comments 18

And here we have yet another post inspired by a question on StackOverflow: how do you compute the Cartesian product of arbitrarily many sequences using LINQ? (*)

UPDATE: Ian Griffiths has an interesting series of articles that approaches this question in considerably more depth than I do; check it out!

First off, let's make sure that we know what we're talking about. I'll notate sequences as ordered sets {a, b, c, d,...}. The Cartesian product of two sequences S1 and S2 is the sequence of all possible two-element sequences where the first element is from S1 and the second element is from S2. So for example, if you have the sequences {a, b} and {x, y, z} then their Cartesian product is the sequence of two-element sequences {{a, x}, {a, y}, {a, z}, {b, x}, {b, y}, {b, z}}.

For simplicity's sake, let's assume that S1 and S2 are sequences of the same element type. We certainly could define a Cartesian product of a sequence of strings with a sequence of ints as a sequence of tuples of (string, int), but then it gets quite difficult to generalize this because the C# generic type system does not handle arbitrarily-sized tuples particularly nicely.

LINQ has an operator specifically for making Cartesian products: in "fluent" syntax it is SelectMany, in "query comprehension" syntax it is a query with two "from" clauses:

var s1 = new[] {a, b};
var s2 = new[] {x, y, z};
var product =
    from first in s1
    from second in s2
    select new[] { first, second };


We can of course generalize the Cartesian product to more than two sequences. The Cartesian product of n sequences {S1, S2, ... Sn} is the sequence of all possible n-element sequences where the first element is from S1, the second element is from S2, and so on.

There's a trivial case missing from that definition of course; what is the Cartesian product of zero sequences? Let's say that the Cartesian product of a sequence containing a single empty sequence, that is, { { } }. (See the comments for a justification of why this is a good idea; I had originally thought to use the empty sequence { }, but this is way better. Thanks to commenter Apollonius for the excellent suggestion.)

Note that this gives a reasonable definition for the Cartesian product of a single sequence. The Cartesian product of a sequence containing one sequence, say, {{a, b}}, is the sequence of all possible one-element sequences where the first (and only) element is from {a, b}. So the Cartesian product of {{a, b}} is {{a}, {b}}.

With LINQ you can make the Cartesian product of any number of sequences easily enough if you know how many sequences there are to begin with:

var product =
    from first in s1
    from second in s2
    from third in s3
    select new[] {first, second, third};


But what do you do if you do not know how many sequences there are at compile time? That is, how do you write the body of

public static IEnumerable<IEnumerable<T>> CartesianProduct<T>(this IEnumerable<IEnumerable<T>> sequences)


?

Well, let's reason using induction; that's almost always a good idea when working on recursively-defined data structures.

If sequences contains zero sequences, we're done; we just return { { } }.

How do we compute the Cartesian product of two sequences, say {a, b} and {x, y, z} again? We start by computing the Cartesian product of the first sequence. Let's make the inductive hypothesis that we have some way to do that, so we know its {{a}, {b}}. How do we combine {{a}, {b}} with {x, y, z} to produce the desired Cartesian product?

Well, suppose we go back to our original definition of the Cartesian product of two sequences to get some inspiration. The Cartesian product of {{a}, {b}} and {x, y, z} is the mess {{{a}, x}, {{a}, y}, {{a}, z}, {{b}, x}, {{b}, y}, {{b},z}} which is tantalizingly close to what we want. We do not want to only compute the Cartesian product of {{a}, {b}} and {x, y, z} by making a sequence containing {a} and x, we want to compute the Cartesian product by appending x to {a} to produce {a, x}! Or, put another way, by concatenating {a} with {x}.

In code: suppose we already have an old Cartesian product, say {{a}, {b}}. We wish to combine it with sequence {x, y, z}:

var newProduct =
    from old in oldProduct
    from item in sequence
    select old.Concat(new[]{item}};


And now we have a successful recursive case. If oldProduct is any Cartesian product then we can compute the combination of it with another sequence to produce a new Cartesian product.

Just to make sure: does this work with the base case? Yes. If we want to take the Cartesian product of { { } } with {a, b} then we concatenate { } with {a} and concatenate { } with {b} to get {{a}, {b}}.

Let's put it all together.

static IEnumerable<IEnumerable<T>> CartesianProduct<T>(this IEnumerable<IEnumerable<T>> sequences)
{
  // base case:
  IEnumerable<IEnumerable<T>> result = new[] { Enumerable.Empty<T>() };
  foreach(var sequence in sequences)
  {
    var s = sequence; // don't close over the loop variable
    // recursive case: use SelectMany to build the new product out of the old one
    result =
      from seq in result
      from item in s
      select seq.Concat(new[] {item}); 
    }
  return result;
}


That's fine, but we could actually be a bit fancier here if we wanted to. We are essentially using an accumulator. Consider a simpler case, say, adding up the total of a list of integers. One way to do that is to say "the accumulator starts at zero. The new accumulator is computed from the old accumulator by adding the current item to the old accumulator." If you have a starting value for an accumulator and some way to make a new accumulator from an old accumulator and the current item in the sequence then you can use the handy Aggregate extension method. It takes the starting value of the accumulator and a function that takes the last value and the current item and returns you the next value for the accumulator. The result is the final value of the accumulator.

In this case we'll start our accumulator off as the empty product, and every time through we'll "add" to it by combining the current sequence with the product so far. At every step of the way, the accumulator will be the Cartesian product of all the sequences seen so far.

static IEnumerable<IEnumerable<T>> CartesianProduct<T>(this IEnumerable<IEnumerable<T>> sequences)
{
  IEnumerable<IEnumerable<T>> emptyProduct = new[] { Enumerable.Empty<T>() };
  return sequences.Aggregate(
    emptyProduct,
    (accumulator, sequence) =>
      from accseq in accumulator
      from item in sequence
      select accseq.Concat(new[] {item}));
}


Now, a word to the wise here. Remember that with LINQ the result of a query expression is a query that can deliver the results when asked, not the results of the query. When we construct this accumulation, we're not actually computing the Cartesian product. We are computing a big complicated query that when executed, results in the Cartesian product. The query will be built eagerly, but executed lazily.


(*) The picky reader will note that strictly speaking there are two minor issues here. First, the Cartesian product on arbitrarily many products is properly called the "n-ary Cartesian product". Second, Cartesian products are defined on sets, not on sequences. In this post we make the obvious restriction of the n-ary Cartesian product from arbitrary, possibly infinite and unordered sets, to finite sequences.

  • I think it is wiser to define the Cartesian product of zero sequences as a sequence containing one empty sequence, i.e. { { } } instead of { }. This eliminates the second base case from your recursion. It also satisfies the mathematical property that the Cartesian product of n sequences with m_1...m_n elements should have m_1*m_2*...*m_n elements - in the case of the Cartesian product of zero sequences this product is the empty product (en.wikipedia.org/.../Empty_product) which is defined to be one.

    Your suggestion has great merit and shall be implemented immediately.

    I don't know how I missed that; thanks! Your way is much much better. - Eric

  • Was my answer-

    stackoverflow.com/.../3094591

    somewhat helpful? Actually this was my first attempt to solve a problem having complexity of this level.

    Well, it wasn't helpful to me, since I didn't read it until just now. But I don't think your aim was to help me; the person you should be asking that question of is probably the person who posted the problem in the first place. - Eric

  • Eric,

    Great post, as usual !

    However there's a bug in your first version of the CartesianProduct method : you're closing over the loop variable, so, due to the deferred execution, it makes the cartesian product of the last sequence with itself. You need to add a temporary local variable inside the foreach loop to make it work (the second version works fine, though).

    Regards,
    Thomas

    Good catch, thanks. I actually wrote this article backwards; I wrote the aggregating version first and then unrolled it into the loop version, which I did not adequately test. - Eric

  • My problem with the final version is that no one will understand till unroll it into the loop version. :) Theese super-compact code-fragments are simply unmaintable for a mid-size or larger team, where the skills konverges to average.

    More worse, this Cartesian problem has many-many, many-many LINQ-ed solution.

    (An interesting question: How can we generate all the C# 3.0 solutions for computing Cartesian Product? :)

  • Eric,

    Thanks for the excellent post.

    May I suggest a follow-up: what this Cartesian thing will look like in C# 4 / PLINQ? Will there be any difference if we make it all parallel?

    @Szindbad,

    Where do you guys find such "compassionate" teams where everyone if forced to dumb themselves down to the "average" level? In 20 years, I've seen nothing but the pirate code: who falls behind is left behind (meaning, fired on the spot, unless he/she is an executive's relative\friend\lover). And I would not have it any other way! :-)

  • I thought the base case for Cartesian proofs was supposed to be "cogito".  I may have that wrong.

  • Great article, where were you yesterday (literally) when I had to learn how use SelectMany to generate a cartesian product on my own! I believe there is one unfortunate limitation to this generic implementation that I don't believe can be overcome (maybe in 4.0?). T is defined to be the same type throughout, so I can't use this to find the cartesian of { 1, 2, 3 } and { a, b, c },

    Real world example: I generated a list of int years { 2007, 2008, 2009, 2010 } and crossed that with DateTimes { 1/1/2010 , 4/1/10, 7/1/10, 10/1/10 } to create a list of all the KeyValuePairs of [string, DateTime] for the quarters for each year (overkill, yes *grin* ). Could I have gotten this to work by creating DateTimes for the years as well, yes, but cartesians of heterogeneous types that are then projected out to a new object seems pretty practical. Any thoughts how this might be accomplished?

  • Thank you v much for the blog post :) I'm really delighted that a question I asked on SO got so much attention and got a blog post on  MSDN :) . seems i need to do some more learning with LINQ :) .

    Answer turned out to be bit more complex than i initially thought  :)

    The way the solution was found was pretty amazing..amazing in thought and process :)

  • gotta love that cartesian product, this reminded me of some code I'd written to find prime numbers using the Sieve of Eratosthenes once. It's not meant to be uber efficent or anything, just call it a programming challenge. I wrote it like so:

    sealed class Program

    {

    static IEnumerable<BigInteger> GetPrimes(IEnumerable<BigInteger> ns)

    {

    var p = ns.First();

    return ns.Take(1).Concat(GetPrimes(ns.Where(n => n % p != 0)));

    }

    static IEnumerable<BigInteger> From(BigInteger n)

    {

    while (true) { yield return n++; }

    }

    static void Main(string[] args)

    {

    var two = new BigInteger(2);

    var primes = GetPrimes(From(two));

    foreach (var p in primes.Take(10)) { Console.WriteLine(p); }

    }

    }

    But it goes into infinite recursion and blows the stack to pieces (how's that for StackOverflow). Anyway, it works just fine in Haskell:

    Prelude> let primes = let sieve (x:xs) = x:sieve [y | y <- xs, y `mod` x /= 0] in sieve [2..]

    Prelude> take 10 primes

    [2,3,5,7,11,13,17,19,23,29]

    Haskell has non-strict semantics (lazy) and C# doesn't but it is supposed to for IEnumerable... best I could conclude is that Haskell is "more lazy" than C#. Anyone have a more concrete explanation?

  • Instead of var s = sequence, you can probably also switch the froms:

     foreach(var sequence in sequences)

     {

       // recursive case: use SelectMany to build the new product out of the old one

       result =  

         from item in sequence

         from seq in result

         select seq.Concat(new[] {item});

     }

  • Article makes perfect sense: excellent functional programming example, I think.

    @Szindbad: If you have unit tests to verify functional behavior and the method is documented nobody should give a rats ass about the implementation and if they can (immediately) understand it, right? Just because you don't understand it (as the "average" team member) doesn't mean the writer didn't understand. Or the team cannot use the code and and benefit?

    @Dave: This works fine as far as I know (I've worked succesfully with infinate return values like your From() function). Aren't you using the whole sequence again in you "return ns.Take(1).Concat(GetPrimes(ns.Where(n => n % p != 0)));"? Shouldn't that second ns be ns.Skip() or something? As in ns = x:xs and you're doing the where clause on ns instead of just on xs?

  • @Dave:  Your GetPrimes method is infinitely recursive and overflows the stack while building the LINQ expression.

    Try using EnumerableEx.Defer() from the Reactive framework.  Something like this might work for you:

    return ns.Take(1).Concat(EnumerableEx.Defer(() => GetPrimes(ns.Where(n => n % p != 0))));

  • @Jarno: Good catch, I changed it to ns.Skip(1).

    @bman654: You're right, that fixes it. I understand now, basically .NET only defers the evaluation of the enumerable elements... it does not defer method application unless explicitly told to do so (with the EnumerableEx.Defer method). So when the sequence defintion is recurisvle defined, it will stack overflow because of the method application (that is, the recurive call). It works in Haskell becuase the opposite is true there, everything is defered unless explicitly told not to do so.

    Thanks.

  • Totally awesome!  Thank you for helping me out bigtime!

  • This blogpost just saved my day.

    Thank you!

Page 1 of 2 (18 items) 12