Zip Me Up

Zip Me Up

Rate This
  • Comments 31

Suppose you’ve got a sequence of Foos and you want to project from that a sequences of Bars. That’s straightforward using LINQ:

IEnumerable<Bars> bars = from foo in foos select MakeBar(foo);

or, without the query sugar:

IEnumerable<Bars> bars = foos.Select(foo=>MakeBar(foo));

But what if you have two sequences that you want to project from? Say you’ve got two sequences of doubles that are the same length, and you want to project a sequences of points.

The operation of “project from two sequences of the same length” is called a “zip join”, because its like doing up a zipper – you need each member of the left sequence to correspond to a member of the right sequence. In the version of the base class library that will ship with C# 4.0, we’ll be adding a Zip sequence operator to the standard extension methods.

The code to do so is pretty trivial; if you happen to need this in C# 3.0, I’ve put the source code below.

We considered adding a query syntax for zip joins. Something like

from foo in foos, bar in bars select MakeBlah(foo, bar)

or

from (foo, bar) in (foos, bars) select MakeBlah(foo, bar)

or some such thing, which would be syntactically transformed into

foos.Zip(bars, (foo, bar)=>MakeBlah(foo, bar))

However, we decided that adding a query syntax for a relatively obscure operation that is not actually supported by most relational databases was not worth the cost of complicating the grammar and the syntactic translation rules.

Here’s the source code:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>
    (this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
{
    if (first == null) throw new ArgumentNullException("first");
    if (second == null) throw new ArgumentNullException("second");
    if (resultSelector == null) throw new ArgumentNullException("resultSelector");
    return ZipIterator(first, second, resultSelector);
}

private static IEnumerable<TResult> ZipIterator<TFirst, TSecond, TResult>
    (IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
{
    using (IEnumerator<TFirst> e1 = first.GetEnumerator())
        using (IEnumerator<TSecond> e2 = second.GetEnumerator())
            while (e1.MoveNext() && e2.MoveNext())
                yield return resultSelector(e1.Current, e2.Current);
}

  • Sounds similar to the  List.zip and List.unzip functions in F#.

    Interesting why F# decided to have this on List instead of any sequence i.e IEnumerable<T>

  • Just a small correction: I believe bar should be bars in

    foos.Zip(bars, (foo, bar)=>MakeBlah(foo, bar))

  • Although I generally like the syntax of extension methods, it seems like an odd fit for the zip method.  To me, it seems like it would be more readable if there were either static method imports (similar to java) or to just put the method on a class and not make it an extension. For example:

    instead of

    foos.Zip(bars, (foo, bar)=>MakeBlah(foo, bar))

    you would have

    Lists.Zip(foos,bars, (foo, bar)=>MakeBlah(foo, bar))

    or just

    Zip(foos,bars, (foo, bar)=>MakeBlah(foo, bar))

    or without the last lambda, and some sort of tuple support

    Zip(foos,bars)

    which could be used like this (also assuming good tuple support):

    foreach( (foo,bar) in Zip(foos,bars){

    //do something

    }

    I like the concept, but the way it sounds when I read it doesn't seem to convey the meaning very well (at least not for how my brain works).

    Either way its still a great addition :D

  • Well, nothing stops you from calling it explicitly as Enumerable.Zip(), right? I often do that with Concat() for precisely the same reasons.

    (by the way, I wish Enumerable.Concat() was made vararg...)

  • Nice... A good part of C#3 IEnumerable<> extensions resemble what exists in Haskell, so I always felt like Zip() was missing. But Unzip() can probably not be added, unless some standard tuples are added too.

    I feel like I could have used the general syntax for zip joins to replace the majority of my 'for' loops (as opposed to 'foreach' loops). The main use of indices is for looking up in other collections in a synchronized way.

    Sure, the database querying analogy doesn't extend to zip joins, but I mostly use linq for functional iterations anyway. :)

  • I like how zip join is defined in Haskell:

    zipWith          :: (a->b->c) -> [a]->[b]->[c]

    zipWith z (a:as) (b:bs)            =  z a b : zipWith z as bs

    zipWith _ _ _                          =  []

    20 lines of code converted into 3! Granted, the Haskell version doesn't check for nulls, but that's because it doesn't have null =).

  • Nice...  Though in my opinion a drawback of the given implementation is that if the enumerables are of different lengths, it silently drops the extra elements.  Would probably be better to have a test that if either enumerable hits its end, the other one hit its end at the same time.  (and of course you want to behave properly if the inputs have infinite length)

    Doing so is problematic for common cases. For example, suppose you want to zip a finite sequence against an infinite sequence of integers to form a finite indexed sequence. With your proposal you'd have to iterate the finite sequence twice, once to count it, and once to zip it with a finite index sequence. By allowing slop on either side we allow one of the sequences to be infinite. -- Eric 

     

    Something like

         using(var e1 = first.GetEnumerator()) {
           using(var e2=second.GetEnumerator()) {
             while(true) {
               var hasMore1=e1.MoveNext();
               var hasMore2=e2.MoveNext();
               if(!hasMore1 || !hasMore2) { //one of them has ended
                 if(hasMore1 || hasMore2) { //one of them has not ended
                   throw new Exception("Enumerables have different lengths?");
                 }
                 break;
               }
               yield return resultSelector(e1.Current, e2.Current);
             }
           }
         }

  • Funny, I actually had implemented something like this already, although I did it a little differently.

    public static IEnumerable<IMulti<T1, T2>> Multi<T1, T2>(IEnumerable<T1> source1, IEnumerable<T2> source2)

    {

    IEnumerator<T1> enumerator1 = source1.GetEnumerator();

    IEnumerator<T2> enumerator2 = source2.GetEnumerator();

    while (enumerator1.MoveNext() && enumerator2.MoveNext())

    yield return new Multi<T1, T2>(enumerator1.Current, enumerator2.Current);

    }

    With:

    public interface IMulti<T1, T2>

    {

    T1 Data1 { get; }

    T2 Data2 { get; }

    }

    class Multi<T1, T2> : IMulti<T1, T2>

    {

    readonly T1 data1;

    readonly T2 data2;

    public T1 Data1 { get { return data1; } }

    public T2 Data2 { get { return data2; } }

    public Multi(T1 data1, T2 data2)

    {

    this.data1 = data1;

    this.data2 = data2;

    }

    }

    That way, it can also be used with a foreach loop to iterate through sequences of the same length. The only drawback is that you get ugly identifiers like Data1 an Data2. I wish there was something like:

    foreach (Foo foo, Bar bar in foos, bars) ;

  • I understand why you've decided against the query comprehension syntax. But it would be great if you could add it to System.Linq.Enumerable so we still have parity with Python.

  • I would also have checked lengths first.

    If you check lengths first, you have to iterate both sequences twice. That potentially doubles the cost to every call. -- Eric

    And would prefer static syntax, but that may just be the influence of F#

    Love you blog, though I only understand 70% of it (but that's better than a few years ago - so either you're dumbing down, or I'm smarting up)

  • This is a good thing, but it seems to be somewhat limited. I would prefer to have a index(foo) pseudo-function in LINQ syntax to achieve maximum flexibility when joining by index.

    Also, the name is confusing -- Zip is a very well known archiving method, and the functional meaning will be unfamiliar to many developers. I would name it something like Merge, which defines the semantics.

  • @Kosak I see IEnumerable<T>'s of different lengths as a feature not a bug..

    The fact that Eric didn't require the the zip to work on bounded sequences, where the upper bound (Length) is known aprori, is a feature not a bug. It will allow the methods use with unbounded generator based streams.

    I for one ask, please don't add any checking to make sure the two IEnumerable<T>'s are the same cardinality.

  • In a slightly more abstract view, Zip and Unzip are really the same function, and that function is Transpose. Zip turns a pair of lists (2xN) into a list of pairs (Nx2). Unzip just transposes the dimensions again.  In APL and derivatives (J, K, etc.) it actually works that way because lists, tuples, and function parameters are either the same thing or interchangeable.

  • I use two variations of a method like this. Zip is as above - it stops once either enumerable expires. ZipEqual throws an exception if one enumerable expires before the other. It doesn't do any sort of checking in advance, it just throws the exception during iteration when one expires and the other doesn't. I tend to use ZipEqual whenever possible because it makes it clear that the sequences are expected to be the same length. I think I got the name and the behaviour from ML. Still, if you're only going to provide one in the standard library I think it should be the more flexible Zip.

  • Thank you for submitting this cool story - Trackback from DotNetShoutout

Page 1 of 3 (31 items) 123