Fabulous Adventures In Coding

Eric Lippert's Blog

Zip Me Up

Suppose you’ve got a sequence of Foos and you want to project from that a sequences of Bars. That’s straightforward using LINQ:

IEnumerable<Bars> bars = from foo in foos select MakeBar(foo);

or, without the query sugar:

IEnumerable<Bars> bars = foos.Select(foo=>MakeBar(foo));

But what if you have two sequences that you want to project from? Say you’ve got two sequences of doubles that are the same length, and you want to project a sequences of points.

The operation of “project from two sequences of the same length” is called a “zip join”, because its like doing up a zipper – you need each member of the left sequence to correspond to a member of the right sequence. In the version of the base class library that will ship with C# 4.0, we’ll be adding a Zip sequence operator to the standard extension methods.

The code to do so is pretty trivial; if you happen to need this in C# 3.0, I’ve put the source code below.

We considered adding a query syntax for zip joins. Something like

from foo in foos, bar in bars select MakeBlah(foo, bar)

or

from (foo, bar) in (foos, bars) select MakeBlah(foo, bar)

or some such thing, which would be syntactically transformed into

foos.Zip(bars, (foo, bar)=>MakeBlah(foo, bar))

However, we decided that adding a query syntax for a relatively obscure operation that is not actually supported by most relational databases was not worth the cost of complicating the grammar and the syntactic translation rules.

Here’s the source code:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>
    (this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
{
    if (first == null) throw new ArgumentNullException("first");
    if (second == null) throw new ArgumentNullException("second");
    if (resultSelector == null) throw new ArgumentNullException("resultSelector");
    return ZipIterator(first, second, resultSelector);
}

private static IEnumerable<TResult> ZipIterator<TFirst, TSecond, TResult>
    (IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
{
    using (IEnumerator<TFirst> e1 = first.GetEnumerator())
        using (IEnumerator<TSecond> e2 = second.GetEnumerator())
            while (e1.MoveNext() && e2.MoveNext())
                yield return resultSelector(e1.Current, e2.Current);
}

Published Thursday, May 07, 2009 7:19 AM by Eric Lippert
Filed under: , ,

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

 

Abhijeet Patel said:

Sounds similar to the  List.zip and List.unzip functions in F#.

Interesting why F# decided to have this on List instead of any sequence i.e IEnumerable<T>

May 7, 2009 2:34 PM
 

DRBlaise said:

Just a small correction: I believe bar should be bars in

foos.Zip(bars, (foo, bar)=>MakeBlah(foo, bar))

May 7, 2009 3:06 PM
 

Dan said:

Although I generally like the syntax of extension methods, it seems like an odd fit for the zip method.  To me, it seems like it would be more readable if there were either static method imports (similar to java) or to just put the method on a class and not make it an extension. For example:

instead of

foos.Zip(bars, (foo, bar)=>MakeBlah(foo, bar))

you would have

Lists.Zip(foos,bars, (foo, bar)=>MakeBlah(foo, bar))

or just

Zip(foos,bars, (foo, bar)=>MakeBlah(foo, bar))

or without the last lambda, and some sort of tuple support

Zip(foos,bars)

which could be used like this (also assuming good tuple support):

foreach( (foo,bar) in Zip(foos,bars){

//do something

}

I like the concept, but the way it sounds when I read it doesn't seem to convey the meaning very well (at least not for how my brain works).

Either way its still a great addition :D

May 7, 2009 5:58 PM
 

Pavel Minaev [MSFT] said:

Well, nothing stops you from calling it explicitly as Enumerable.Zip(), right? I often do that with Concat() for precisely the same reasons.

(by the way, I wish Enumerable.Concat() was made vararg...)

May 7, 2009 6:15 PM
 

Olivier Leclant said:

Nice... A good part of C#3 IEnumerable<> extensions resemble what exists in Haskell, so I always felt like Zip() was missing. But Unzip() can probably not be added, unless some standard tuples are added too.

I feel like I could have used the general syntax for zip joins to replace the majority of my 'for' loops (as opposed to 'foreach' loops). The main use of indices is for looking up in other collections in a synchronized way.

Sure, the database querying analogy doesn't extend to zip joins, but I mostly use linq for functional iterations anyway. :)

May 7, 2009 7:09 PM
 

Bubba blub blop said:

I like how zip join is defined in Haskell:

zipWith          :: (a->b->c) -> [a]->[b]->[c]

zipWith z (a:as) (b:bs)            =  z a b : zipWith z as bs

zipWith _ _ _                          =  []

20 lines of code converted into 3! Granted, the Haskell version doesn't check for nulls, but that's because it doesn't have null =).

May 7, 2009 9:38 PM
 

kosak said:

Nice...  Though in my opinion a drawback of the given implementation is that if the enumerables are of different lengths, it silently drops the extra elements.  Would probably be better to have a test that if either enumerable hits its end, the other one hit its end at the same time.  (and of course you want to behave properly if the inputs have infinite length)

Doing so is problematic for common cases. For example, suppose you want to zip a finite sequence against an infinite sequence of integers to form a finite indexed sequence. With your proposal you'd have to iterate the finite sequence twice, once to count it, and once to zip it with a finite index sequence. By allowing slop on either side we allow one of the sequences to be infinite. -- Eric 

 

Something like

     using(var e1 = first.GetEnumerator()) {
       using(var e2=second.GetEnumerator()) {
         while(true) {
           var hasMore1=e1.MoveNext();
           var hasMore2=e2.MoveNext();
           if(!hasMore1 || !hasMore2) { //one of them has ended
             if(hasMore1 || hasMore2) { //one of them has not ended
               throw new Exception("Enumerables have different lengths?");
             }
             break;
           }
           yield return resultSelector(e1.Current, e2.Current);
         }
       }
     }

May 7, 2009 10:55 PM
 

Opfer said:

Funny, I actually had implemented something like this already, although I did it a little differently.

public static IEnumerable<IMulti<T1, T2>> Multi<T1, T2>(IEnumerable<T1> source1, IEnumerable<T2> source2)

{

IEnumerator<T1> enumerator1 = source1.GetEnumerator();

IEnumerator<T2> enumerator2 = source2.GetEnumerator();

while (enumerator1.MoveNext() && enumerator2.MoveNext())

yield return new Multi<T1, T2>(enumerator1.Current, enumerator2.Current);

}

With:

public interface IMulti<T1, T2>

{

T1 Data1 { get; }

T2 Data2 { get; }

}

class Multi<T1, T2> : IMulti<T1, T2>

{

readonly T1 data1;

readonly T2 data2;

public T1 Data1 { get { return data1; } }

public T2 Data2 { get { return data2; } }

public Multi(T1 data1, T2 data2)

{

this.data1 = data1;

this.data2 = data2;

}

}

That way, it can also be used with a foreach loop to iterate through sequences of the same length. The only drawback is that you get ugly identifiers like Data1 an Data2. I wish there was something like:

foreach (Foo foo, Bar bar in foos, bars) ;

May 8, 2009 1:22 AM
 

RichB said:

I understand why you've decided against the query comprehension syntax. But it would be great if you could add it to System.Linq.Enumerable so we still have parity with Python.

May 8, 2009 1:53 AM
 

Ben said:

I would also have checked lengths first.

If you check lengths first, you have to iterate both sequences twice. That potentially doubles the cost to every call. -- Eric

And would prefer static syntax, but that may just be the influence of F#

Love you blog, though I only understand 70% of it (but that's better than a few years ago - so either you're dumbing down, or I'm smarting up)

May 8, 2009 2:42 AM
 

Andrey Shchekin said:

This is a good thing, but it seems to be somewhat limited. I would prefer to have a index(foo) pseudo-function in LINQ syntax to achieve maximum flexibility when joining by index.

Also, the name is confusing -- Zip is a very well known archiving method, and the functional meaning will be unfamiliar to many developers. I would name it something like Merge, which defines the semantics.

May 8, 2009 4:14 AM
 

Anthony Tarlano said:

@Kosak I see IEnumerable<T>'s of different lengths as a feature not a bug..

The fact that Eric didn't require the the zip to work on bounded sequences, where the upper bound (Length) is known aprori, is a feature not a bug. It will allow the methods use with unbounded generator based streams.

I for one ask, please don't add any checking to make sure the two IEnumerable<T>'s are the same cardinality.

May 8, 2009 6:00 AM
 

Ken Hirsch said:

In a slightly more abstract view, Zip and Unzip are really the same function, and that function is Transpose. Zip turns a pair of lists (2xN) into a list of pairs (Nx2). Unzip just transposes the dimensions again.  In APL and derivatives (J, K, etc.) it actually works that way because lists, tuples, and function parameters are either the same thing or interchangeable.

May 8, 2009 6:47 AM
 

Weeble said:

I use two variations of a method like this. Zip is as above - it stops once either enumerable expires. ZipEqual throws an exception if one enumerable expires before the other. It doesn't do any sort of checking in advance, it just throws the exception during iteration when one expires and the other doesn't. I tend to use ZipEqual whenever possible because it makes it clear that the sequences are expected to be the same length. I think I got the name and the behaviour from ML. Still, if you're only going to provide one in the standard library I think it should be the more flexible Zip.

May 8, 2009 6:54 AM
 

DotNetShoutout said:

Thank you for submitting this cool story - Trackback from DotNetShoutout

May 8, 2009 7:58 AM
 

Joe said:

Clearly the query syntax is pared down to some notion of what is most commonly used, but I'm not sure whether or not an operator is supported by the majority of relational databases is a good criteria to base that decision on.  The query syntax is part of Linq, not part of a specific ORM like LinqToSql or the EF.

May 8, 2009 9:26 AM
 

Jason Haley said:

Interesting Finds: May 8, 2009

May 8, 2009 9:54 AM
 

Sam Saffron said:

I would go with an implementation that matches Ruby for objects:

irb> [1,2,3].zip([4,5])

=> [[1, 4], [2, 5], [3, nil]]

irb> [1,2,3].zip([4,5,6,7])

=> [[1, 4], [2, 5], [3, 6]]

Either exception out or have nil as the second member of the tuple when the zipee is shorter.

Keep the existing behavior for zippor is shorter than the zipee

Also this seems a bit more natural and as @OlivierLeclant said in an earlier comment, it is unzippable,  

public static IEnumerable<Tuple<TFirst, TSecond>> Zip<TFirst, TSecond> (this TFirst first, TSecond second);

May 8, 2009 11:22 AM
 

kosak said:

Anthony, I agree, and the point is that the code I posted works perfectly well on unbounded streams, as should have been obvious from both the code and the parenthetical comment in my explanation.

Nothing in my code requires that the enumerables are bounded.  The only additional requirement that my code imposes is that IN THE CASE THAT the code encounters the end of one stream, it will check that the other stream terrminates there also.

May 8, 2009 11:26 AM
 

Anthony Tarlano said:

@Kosak  : I just don't think it is an Exception to have different lengths. Are you telling me you code doesn't throw on different lengths?

""" throw new Exception("Enumerables have different lengths?"); """

May 8, 2009 12:26 PM
 

Greg said:

Can you include how that would be done without using LINQ and possibly a real world scenario where this could be used?

May 8, 2009 12:40 PM
 

Ed said:

Reviewing what I learned from your excellent post on naming things, I have to admit that Andrey Shchekin's comment is spot on.

May 8, 2009 4:16 PM
 

Lucas said:

Why did you choose a 2-method implementation? What is the need for the private ZipIterator()? Can't you yield return from Zip()?

Good question. I covered that already a while back: http://blogs.msdn.com/ericlippert/archive/tags/Psychic+Debugging/default.aspx

-- Eric

 

May 8, 2009 5:30 PM
 

Lucas said:

Thanks Eric!! I use iterators all the time and was not aware that my null checks (just like in your psychic debugging example) where being deferred... doh!

May 9, 2009 1:04 AM
 

Daniel Lehmann said:

"However, we decided that adding a query syntax for a relatively obscure operation that is not actually supported by most relational databases was not worth the cost of complicating the grammar and the syntactic translation rules."

That's exactly the reason why I think Scala made the right decision to put Linq-similar things in libraries and instead create an expandable language. While Linq surely feels nice it can't be extended by anyone except the C# developers...

May 9, 2009 5:49 AM
 

Pavel Minaev [MSFT] said:

Scala still has syntactic sugar for query comprehensions, no? So it exposes "filter" and "map" already, and in a non-extensible way, too.

Sure, C# goes further than that, with joins and ordering, but fundamentally there's no difference. You still want the common operations to be sugared, the only question is where to draw the dividing line.

May 10, 2009 3:55 AM
 

Daniel Earwicker said:

Eric, will there be an Unfold in version 4?

Not to my knowledge. But it would be trivial to write such a beast yourself. How about this?

static IE<T> Unfold<S, T>(this S state, Func<S, bool> done, Func<S, S> next, Func<S, T> transform)
{
  while (true) 
  {
    yield return transform(state);
    if (done(state)) yield break;
    state = next(state);
  }
}

-- Eric

May 10, 2009 11:11 AM
 

Pop Catalin said:

Daniel Lehmann wrote:

> That's exactly the reason why I think Scala made the right decision to put Linq-similar things in libraries and instead create an expandable language. While Linq surely feels nice it can't be extended by anyone except the C# developers...

Linq is a pattern, anyone can adhere to the Linq pattern, which means that the implementation part of Linq is extensible at the core, the only part of Linq that is not extensible is the comprehension syntax (you can't add new comprehension keywords, but you can make the comprehension syntax translate to entirely different "member" calls, usually methods). Other than adding new language keywords, you can extend Linq in any possible way imaginable (you can completely replace Linq to SQL, or Linq to XML, or Linq to Objects with your implementation, or simply mix other implementations with your own selectively)...

The comprehension syntax in linq is simply a loose pattern matching (that works by translating to appropriate "member" calls (not just methods), using regular member resolution rules), the rest of Linq is straight forward libraries with standard CLR types that adhere to the linq pattern, and which can be implemented by anyone.

May 11, 2009 5:46 AM
 

Daniel Earwicker said:

Re: Unfold, thanks! I already wrote my own in a simpler form where the transform part is left out (just stick a Select after it if you need that). I guess everyone using Linq is writing their own slightly different version of it.

May 14, 2009 6:42 PM
 

Greg said:

The funny thing is that the code to use the zip join is overly expensive to maintain and develop for in production quality commercial software.  The cost is not really in lines of code but in the mental effort needed to understand and verify what is actually happening (easy to verify in a single 10 line chunk but not easy to do when you have 500+ methods using similar coding constructs).

It fails the Monday morning, did not sleep the night before, code takes too long to understand test (even for code you wrote a week ago).

list ret = new list()

int x = max( list1.length, list2.length )

for (int ctr = 0; ctr < x; ctr++)

{

 element o1 = o2 = null

 if (x < list1.length)     o1 = list1[x];

 if (x < list2.length)     o2 = list2[x];

 e = ret.addnew(o1, o2)

}

Concentrating effort on better high level algorithms, data base organization and project organization will yield many times more productivity.

The straight line sequential code is easier to take over and maintain by a developer unfamiliar with the rest of the application's code.

May 19, 2009 6:08 PM

Leave a Comment

(required) 
(optional)
(required) 

  
Enter Code Here: Required
Submit

About Eric Lippert

Eric Lippert is a senior developer on the Microsoft C# compiler team. Before that he worked on the framework of Visual Studio Tools For Office. Before that, he worked on the compilers, runtimes and tools for VBScript, JScript, Windows Script Host and other Microsoft Scripting technologies. He lives in Seattle and spends his free time editing books about programming languages, playing the piano, and trying to keep his tiny sailboat upright in Puget Sound.

This Blog

Syndication


© 2009 Microsoft Corporation. All rights reserved. Terms of Use  |  Trademarks  |  Privacy Statement
Microsoft
Page view tracker