Iterator Blocks, Part Five: Push vs Pull

Iterator Blocks, Part Five: Push vs Pull

Rate This
  • Comments 13

A while back I posted some commentary I did for the Scripting Summer Games where I noted that there is an isomorphism between "pull" sequences and "push" events. Normally you think of events as something that "calls you", pushing the event arguments at you. And normally you think of a sequence as something that you "pull from", asking it for the next value until you're done. But you can treat a stream of event firings as being a sequence of event argument objects. And similarly, you could implement a sequence iterator so that it called a method for every object in the collection. Heck, you could implement all of LINQ on this model if you really wanted.

The implementation of iterator blocks is clearly on the "pull" paradigm. It didn't have to be. We could have gone with a sort of "inversion of control" approach. "Pull" iterators have to simulate coroutines with a little state machine that knows how to get back to the state where the code left off. But we already have a mechanism for code to "get back to the state where it left off" -- that's what you do every time you call a method! You remember what you were doing, call the method, and then pick up where you left off. We could have done the same thing with iterators. We could say that this:

void Integers(int length, IObserver<int> observer)
{
  for (int i = 0; i < length; ++i) yield return i;
}

is a syntactic sugar for

void Integers(int length, IObserver<int> observer)
{
  for (int i = 0; i < length; ++i) observer.Yield(i);
  observer.Break();
}

That is, we "raise an event" by calling the observer every time something yields, and raise another event when we're done.

This would be a much more trivial transformation to make than our current state machine approach, but since most people want sequence iteration on the "pull" model, we do the harder work to make that happen.

I said this had something to do with exception handling. What's the connection?

Did you notice something weird about how we handle finally blocks? Consider what happens with the "push model":

TextReader reader = File.Open(file);
try
{
    blah blah blah yield the lines

finally
{
    reader.Close();
}

If something in the "blah blah blah" is realized as a call to observer.Yield(line), what happens if the code consuming the result throws? Easy. It is just a method call like any other. The call stack unwinds, we find the finally, the file gets closed, everything is good.

Now suppose this is realized as the MoveNext of a "pull" iterator. What happens when the code that is consuming the result throws? If we're consuming the result then we've returned from the call to MoveNext! There is no "try", there is no "finally". And yet usually the finally gets executed anyways! If the code that is consuming the results throws, odds are good that it was in a foreach; when foreach throws, the enumerator is disposed, and when the enumerator is disposed, we figure out which "finallys" were around when we last left MoveNext, and we run a special method that does whatever was in the pending finally blocks.

This is pretty weird. What's going on here is that for finally blocks, "pull" iterators have the same semantics as "push" iterators. When you yield from a try containing a finally, it looks like the finally is still "on the stack" and runs when the consumer throws.

So what if the try block has a catch?

The original designers of C# 2.0 -- and remember, this is long before my time on the team -- had a huge debate over this. A debate which was repeated in miniature when I sent them all an email asking for the justification for this decision. There are three basic positions:

1) Do not allow yield returns in try blocks at all. (Or blocks that are sugars for try blocks, like using blocks.) Use some other mechanism other than "finally" to express the idea of "this is the code that you run when the caller shuts down the enumeration."

2) Allow yield returns in all try blocks.

3) Allow yield returns in try blocks that have finally blocks, but not if they have catch blocks.

The down sides of (1) are that you have to come up with some syntax for representing the "finally" logic that isn't actually a "finally", and that it becomes more difficult to iterate over stuff that requires cleanup, like iterating over the contents of a file. This makes the feature both confusing and weak; we should avoid this option if at all possible.

The down side of (2) is that the exception handling behaviour beomes deeply, weirdly inconsistent between finally blocks and catch blocks. If you have a yield in a try block with a finally, and the consumer of the iteration throws, then the finally runs, as though you were on the "push" model. But if you have a yield in a try block with a catch, and the consumer of the iteration throws, then the catch does not run, because its not on the call stack. Users have a reasonable expectation that finally and catch work more or less the same way when an exception happens; that is, that control will be consistently transferred to a matching catch or the finally.

The down side of (3) is that the rule seems arbitrary and weird -- until you read five unnecessarily prolix blog entries that explain what on earth the design team was thinking.

Obviously, they picked (3), and now you know why.

Next time, we'll finish up this series with a look at unsafe code. Now that you know all this stuff, seeing why there's no unsafe code allowed in an iterator block is pretty straightforward by comparison.

  • Any chance you could add an "afterword" to this series talking about why there's no way to say "I want this bit of code here (usually validation) to be executed when the iterator method is first called, not on the first call to MoveNext()"? Obviously if it's just a case of "the design team didn't realise how useful that would have been" then it's not much of a post, but maybe there are subtler reasons...

    Well, I agree that it would be useful. The question is whether it is useful enough to justify adding new syntax to the language, and I don't think it is. The design team already rejected adding a new syntax for "clean this thing up if the enumeration is disposed early", instead opting to go for our rather goofy "finally that's not on the stack still runs" strategy. The design team tries to only add new syntax if its absolutely necessary for the feature. This is a bit of a "gotcha", I know, but it's not that bad and the workaround is straightforward. Put it this way: if you know to use the new syntax to make some code "eager", then you also know how to do it "manually". If you don't know it, then you'll write the bug whether there's a syntax to fix it or not. So the syntax is unnecessary. -- Eric

    (Oh, and thanks for the shout out. I wondered why there was a comment left on that blog post tonight :)

  • Eric,

    Can you also provide some insight into why "yields" are not allowed inside an anonymous method or lambda expression

    Good question. I would love to have anonymous iterator blocks. It would be totally awesome to be able to build yourself a little sequence generator in-place that closed over local variables. The reason why not is straightforward: the benefits don't outweigh the costs. The awesomeness of making sequence generators in-place is actually pretty small in the grand scheme of things and nominal methods do the job well enough in most scenarios. So the benefits are not that compelling.

    The costs are large. Iterator rewriting is the most complicated transformation in the compiler, and anonymous method rewriting is the second most complicated. Anonymous methods can be inside other anonymous methods, and anonymous methods can be inside iterator blocks. Therefore, what we do is first we rewrite all anonymous methods so that they become methods of a closure class. This is the second-last thing the compiler does before emitting IL for a method. Once that step is done, the iterator rewriter can assume that there are no anonymous methods in the iterator block; they've all be rewritten already. Therefore the iterator rewriter can just concentrate on rewriting the iterator, without worrying that there might be an unrealized anonymous method in there.

    Also, iterator blocks never "nest", unlike anonymous methods. The iterator rewriter can assume that all iterator blocks are "top level".

    If anonymous methods are allowed to contain iterator blocks, then both those assumptions go out the window. You can have an iterator block that contains an anonymous method that contains an anonymous method that contains an iterator block that contains an anonymous method, and... yuck. Now we have to write a rewriting pass that can handle nested iterator blocks and nested anonymous methods at the same time, merging our two most complicated algorithms into one far more complicated algorithm. It would be really hard to design, implement, and test. We are smart enough to do so, I'm sure. We've got a smart team here. But we don't want to take on that large burden for a "nice to have but not necessary" feature. -- Eric

  • http://themechanicalbride.blogspot.com/2009/07/introducing-rx-linq-to-events.html

    Coincidence?

    Yep. -- Eric

  • One nice advantage of the pull model is that you can simply stop pulling when you have enough data.

  • Excerpt from .NET Framework class library documentation.

    -------------------------------------------------------------------------

    bool MoveNext();

    Return Value

    Type: System.Boolean

    true if the enumerator was successfully advanced to the next element;

    false if the enumerator has passed the end of the collection.

    SHOUD BE CORRECTED:

    Return Value

    Type: System.Boolean

    true if the enumerator was successfully advanced to the next element;

    otherwise false.

    -------------------------------------------------------------------------

    First version of the specification requires that method MoveNext()

    somehow must represent whole iteration, effectively disallowing yields

    in various code blocks related to exceptions.

    While second version of the specification deals only with single step

    of the iteration, and effectively allows yields in any code block.

    I think second version is better, because anyway we do not have

    reliable means to tell how successfully we iterated through

    all elements in the sequence or even tell, by using iteration,

    is sequence finite or endless.

  • @Bart - and with the Push model, you can stop listening... this is partly how we did things like Take; once it has enough data, it unsubscribes itself. Note that you can't terminate the entire through, because one of the big plus points of a push model is that you can push data through multiple observers *at the same time*.

  • @Eric: In that case would you just hold onto a type that imlements IEnumerable<T> (such as a List<T>) and then just return the List<T> out (which of course would not be differed execution) or is there a recommended "best practices" way to achieve the same end result?

    Thanks.

    Abhijeet

  • In a sense, option 1 (no yield return in a try block) is actually quite easy to support, even for iterators that have disposeable resources.  In fact, you almost point out the obvious implementation yourself:

    If you forbid all yield statements inside a try block, you can still permit "using" - you simply need to implement it differently.  Rather than being syntactic sugar for a Dispose() call in a finally block, it becomes syntactic sugar for a Dispose() call in a the IEnumerator<>'s Dispose() method.

    In a sense that's actually more intuitive - finally blocks seem to have "stronger" semantics than IDisposable methods.  You expect a finally statement to execute regardless of the rest of the control flow before any code after it does (essentially it executes unless something is terribly wrong or your program is non-terminating - in which case code after the finally block won't execute either).

    The finally block in a generator doesn't have these semantics, and by forcing resource cleanup via IDisposable the true behaviour of the generator would be closer to your intuitive expectations (without deep thought, you can expect an IDisposable to be called asap, but you never know quite when, and furthermore it might never be called if the caller forgets to call IDisposable).

    Not that I'm opposed to the current implementation - but I think option (1) - no yield in try... would have been perfectly reasonable so long as one permits resource disposal in some other fashion - such as via 'using'.

  • You could make enumerable fully orthogonal by allowing light-weight threads to asynchronously co-execute on iteration.

    The attribute could be something like:

    EnableExtremelySlowAsynchroousExecution

    Then you could optimize the trivial cases into your macro-like definition (respectectably profoundly complex macro-like definition), and the remainder could follow nested iterations and the whole functional iteration programming model without changing any syntax. This also enables multithreaded approaches.

    F# has this but recruiting maintainers would be tough and C# is a fantastic mechanism of access to .Net.  Compared with C++ or Java, C# is way ahead.  Poor Sun seems to be stuck with programmers who mostly don't like Java and still want to twiddle bits at absolute locations in memory.

    Microsoft is on the leading edge of the trend of full functional programming beyond the specific example of OOP.

    I'm working on a C# interpreter in C# that would be scripting-level fast and maybe have a simplified syntax allowing only prefix, infix, and postfix operators of varying scope (all theold Prologs have this). If I don't hit a dead end it could be interesting.  First step is a class which localizes all thread complexity and allows COM interop with Office's apartment scheme.

  • "The down side of (2) is that the exception handling behaviour beomes deeply, weirdly inconsistent between finally blocks and catch blocks. If you have a yield in a try block with a finally, and the consumer of the iteration throws, then the finally runs, as though you were on the "push" model. But if you have a yield in a try block with a catch, and the consumer of the iteration throws, then the catch does not run, because its not on the call stack. Users have a reasonable expectation that finally and catch work more or less the same way when an exception happens; that is, that control will be consistently transferred to a matching catch or the finally."

    To think this through, how about:

       public interface IEnumerableExceptionHandler

       {

           void Handle(Exception e);

       }

    A type that supports IEnumerable could optionally also support the above interface.

    The `foreach` statement would look at the static type of the loop collection, and if it supports IEnumerableExceptionHandler, it would generate something like:

       foreach (var item in collection)

       {

           try

           {

                // loop body

           }

           catch (Exception e)

           {

               collection.Handle(e);

           }

       }

    Then the compiler-generated iterator class would implement IEnumerableExceptionHandler, and would store the exception, and then simulate a throw by executing the appropriate catch (and finally) on the next call to MoveNext.

    (I hate the fact this catches NullReferenceException, but never mind that - it's fixed in CLR 4 isn't it?)

    This would implement the "expected" behaviour, but now I look at it, I don't think that would be expected. I wouldn't personally expect exceptions in the loop body to magically wormhole their way to somewhere else. I think this is because catch and finally are very different things. Catch blocks run *instead* of something else, where finally blocks run *in addition* to something else.

    I'd expect yield return itself to never throw, once it has a value to yield. And I'd find it very useful to be able to catch exceptions from the code intermingled with yield return (especially the expression that provides the value to yield return).

    But it sounds like these debates were already had five years ago.

  • Great discussion. Just one question. How do you think it could hit on the overall performance of the process when you implement the "push" pattern ? I did some work and research on different event's oriented patterns and usually I've found that the use of event's has some important impact on performance.

  • Hm.. a very very strange thing you have discovered. What the hell is inside the CLR that have forced the people to handle it in a such way? Dynamically-typed languages like Ruby or Lua doesn't have ANY problems in successfully interleaving return/yield/catch/rescue/finally/etc..

  • By the way

    Why a method that returns an IEnumerable can return directly another IEnumerable, if alone, and cannot return an IEnumerable if it has already returned something?

    Why can I write:

    public IEnumerable<int> AnotherEnumeration()
    {
        ....
    }

    public IEnumerable<int> MyTest()
    {
       return AnotherEnumeration();
    }

    And I cannot write

    public IEnumerable<int> MyTest()
    {
      yield return 0;
      yield return 1;
      return AnotherEnumeration();
    }

    // err: Iterator cannot contain 'return' statement

    But I need to expand the AnotherEnumeration enrolling it in a for/foreach, yielding again the single items?

    Thanks

    This frequently-requested feature we call "yield foreach" because it replaces the "foreach" loop you currently write. It would be awesome to have this feature, but it has never been high enough priority. 

    Here's a post about some of the problems that yield foreach would solve.

    And here's a paper about how this feature was implemented in a research version of C#.

    -- Eric

Page 1 of 1 (13 items)