Closing over the loop variable considered harmful

Closing over the loop variable considered harmful

Rate This

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The "for" loop will not be changed. We return you now to our original article.


I don't know why I haven't blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That's a terrible situation for everyone; we very much wish to design a language which does not have "gotcha" features like this.

But I'm getting ahead of myself. What's the output of this fragment?

var values = new List<int>() { 100, 110, 120 };
var funcs = new List<Func<int>>();
foreach(var v in values)
  funcs.Add( ()=>v );
foreach(var f in funcs)
  Console.WriteLine(f());

Most people expect it to be 100 / 110 / 120.  It is in fact 120 / 120 / 120. Why?

Because ()=>v means "return the current value of variable v", not "return the value v was back when the delegate was created". Closures close over variables, not over values. And when the methods run, clearly the last value that was assigned to v was 120, so it still has that value.

This is very confusing. The correct way to write the code is:

foreach(var v in values)
{
  var v2 = v;
  funcs.Add( ()=>v2 );
}

Now what happens? Every time we re-start the loop body, we logically create a fresh new variable v2. Each closure is closed over a different v2, which is only assigned to once, so it always keeps the correct value.

Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for

  {
    IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
    try
    {
      int m; // OUTSIDE THE ACTUAL LOOP
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }

If we specified that the expansion was

    try
    {
      while(e.MoveNext())
      {
        int m; // INSIDE
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }

then the code would behave as expected.

It's compelling to consider fixing this for a hypothetical future version of C#, and I'd like to hear your feedback on whether we should do so or not. The reasons FOR making the change are clear; this is a big confusing "gotcha" that real people constantly run into, and LINQ, unfortunately, only makes it worse, because it is likely to increase the number of times a customer is going to use a closure in a loop. Also, it seems reasonable that the user of the foreach loop might think of there being a "fresh" loop variable every time, not just a fresh value in the same old variable. Since the foreach loop variable is not mutable by user code, this reinforces the idea that it is a succession of values, one per loop iteration, and not "really" the same variable over and over again. And finally, the change has no effect whatsoever on non-closure semantics. (In fact, in C# 1 the spec was not clear about whether the loop variable went inside or outside, since in a world without closures, it makes no difference.)

But that said, there are some very good reasons for not making this change.

The first reason is that obviously this would be a breaking change, and we hates them, my precious. Any developers who depend on this feature, who require the closed-over variable to contain the last value of the loop variable, would be broken. I can only hope that the number of such people is vanishingly small; this is a strange thing to depend on. Most of the time, people do not expect or depend on this behaviour.

Second, it makes the foreach syntax lexically inconsistent. Consider foreach(int x in M()) The header of the loop has two parts, a declaration int x and a collection expression, M(). The int x is to the left of the M(). Clearly the M() is not inside the body of the loop; that thing only executes once, before the loop starts. So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

Third, it would make the "foreach" semantics inconsistent with "for" semantics. We have this same problem in "for" blocks, but "for" blocks are much looser about what "the loop variable" is; there can be more than one variable declared in the for loop header, it can be incremented in odd ways, and it seems implausible that people would consider each iteration of the "for" loop to contain a fresh crop of variables. When you say for(int i; i < 10; i += 1) it seems dead obvious that the "i += 1" means "increment the loop variable" and that there is one loop variable for the whole loop, not a new fresh variable "i" every time through! We certainly would not make this proposed change apply to "for" loops.

And fourth, though this is a nasty gotcha, there is an easy workaround, and tools like ReSharper detect this pattern and suggest how to fix it. We could take a page from that playbook and simply issue a compiler warning on this pattern. (Though adding new warnings brings up a whole raft of issues of its own, which I might get into in another post.) Though this is vexing, it really doesn't bite that many people that hard, and it's not a big deal to fix, so why go to the trouble and expense of taking a breaking change for something with an easy fix?

Design is, of course, the art of compromise in the face of many competing principles. "Eliminate gotchas" in this case directly opposes other principles like "no breaking changes", and "be consistent with other language features". Any thoughts you have on pros or cons of us taking this breaking change in a hypothetical future version of C# would be greatly appreciated.

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)

 

  • "Closures close over variables, not over values".  It's not a matter of variables vs. values, it's a matter of scope.  Placing this text in bold will not help.

    IIRC, C# _forbids_ the programmer from modifying this variable.  It's no wonder programmers are surprised by this behavior -- one moment the variable is immutable, and the next moment the language pulls the rug out from beneath you and rebinds the unrebindable variable.

    Perhaps you should put something in bold to that effect.

  • "Second, it makes the foreach syntax lexically inconsistent".  Lexical scoping isn't the concern, but rather it's a matter of dynamic extent -- the same storage cell gets re-used in the next loop iteration, getting reassigned.

    This "inconsistency" the present semantics is trying to avoid already exists elsewhere in the language.  Consider "public void foo(int x) { ... }" -- in this case, int x's lifetime is bound to the scope of the _body_ of the function, and does _not_ get re-used the next function invocation.  If you were striving for consistency, the other proposed behavior has more in common with function argument dynamic extent than the present behavior does.

  • I'm completely in agreement with Jon Skeet here - make the change (the current behaviour almost inevitably isn't what folk want their code to do), but make it in some future version.  For the next release, just emit a warning and indicate that this behaviour is to change.  The scenario that Jon gives is a very real one, and having identical code have different semantics in two adjacent versions of the compiler sounds like a bad idea.

  • I much prefer Java’s rule here: only allow closing over variables which are declared `final`. But then, this only works because Java allows local variables to be declared `final`, which C# doesn’t.

    That said, I am all *for* changing this. You’ve already given the rationale: The current behaviour creates a lot of bugs. And from a programmer’s point of view, the loop variable is actually inside immutable inside the loop, and it’s only valid inside the loop. It is extremely unintuitive that this behaviour is not really backed by the language. Consistency and logic would dictate that the loop variable is indeed local to the loop body, immutable, and re-declared/initialized for every round of the loop.

    That this is currently handled differently is – for me – only a historical accident.

    To address your objections to making this change:

    * First, I don’t think that’s an issue. Just *assuming* that it might be is a wild guess. Yes, it’s erring on the safe side but it’s not a basis for an informed decision.

    * Second, I disagree that this is inconsistent. As I’ve said above, I’ve always considered the loop variable to pertain to the loop body. Yes, there are two parts in the head, one which belongs to the loop body and one which doesn’t. But I honestly don’t see a problem in this.

    * Third, “certainly would not make this proposed change apply to "for" loops.” – Whyever not? The `+= 1` in your example doesn’t change the value of `i`, it creates a *new* value and binds it to the symbol of `i`. But `i` has value semantics, right? So I expect the lambda to access the current *value* of `i`, not a reference to it.

  • Duh, I’t like to clarify a point which was incorrect from my previous posting:

    Of course the lambda shouldn’t close over the *value* if a loop variable `i` because that’s precisely what lambdas don’t do. Rather, my point should have been that classical `for` loops are just a big problem when it comes to lambdas and their inevitable functional mindset. A variable which is “recycled” in this manner just doesn’t sit well with functional programming. It would be much cleaner if the conventional `for` loop actually *did* create a new *variable* and initialize it with the value of the old one – plus one.

    So, all things considered, the behaviour of the classical `for` loop probably shouldn’t be changed after all. Much better would be to deprecate the old-style `for` loop (which of course won’t happen) – Python demonstrates that one can live very well without one, even in an imperative language (`foreach (var i in Enumerable.Range(1, 10)` …).

  • My vote would be to make the change and to break consistency between for and foreach in this matter.

    In any case, I don't think for is really all that similar; for iterates "over" an indexer, and logically increments it all the time; whereas foreach hides the indexer (i.e., the enumerator) and shows you only the indexed _value_.

    In short, the index in a typicial for loop is not at all the same as the value in a typical foreach loop.

    Another reason to make the change is that the "correct" version currently looks like duplicated code - and correct code should look correct if at all possible.

  • Issue a warning in C#4, and change it in C#5.

  • ++ to compiler warning. I understand the logic fully but I've still accidentally done it myself.

  • Please don't change it! This forms the basis of an interview question I give to developers. If you change it I'd have to come up with a new question and I'm very lazy...

    Do you find that this is a good interview question? It seems like an odd corner case; I wouldn't expect most developers to know about this off the top of their heads, though I would expect them to be able to figure out what was going on eventually. -- Eric

    A warning along the lines of ReSharper's "access to modified closure" would be good though.

  • What about nonbreaking change:

    foreach (new var iten in sequence) { /* ... */ }

    ---------

    Excerpt from MSDN "from clause (C# Reference)"

    ... The range variable is like an iteration variable in a foreach statement except for one very important difference:

    a range variable never actually stores data from the source. It just a syntactic convenience that enables the query

    to describe what will occur when the query is executed. ...

    Therefore closures close not over range variables but over items in the sequence.

  • I would avoid a breaking change like this, because the behavior change would be so subtle that developers will spend days before understanding why their code suddenly stopped to work.

    Think about this, you have 3 types of developers who are using this scenario:

    1. I wrote my code, I rely on the fact that var v is always the same, I am happy

    2. I wrote my code, it works, by pure luck, because var v is always the same, I am happy

    3. I wrote my code, it doesn't work because var v is not new, I research/fix my code, I am happy (could be happier if foreach worked like I want, but still...)

    The breaking change would make developer 2 really, really, really unhappy :-)

    Have you considered adding a new syntax? Something like this, that would be translated in the alternative code with the declaration inside the loop:

    foreachnew(var v in M())

    foreach(new v in M())

    foreach(new var v in M())

  • I'm with Jon Skeet on this one. It is the way it is regardless of how it should be - changing it now would incur bigger problems. Perhaps instead, slightly new syntax could be used as in:

    foreach (var v from M())
    {
    }

    Note the "from" instead of "in". It's a subtle change but maybe that allows the old and the new to sit side-by-side.

  • I only disagree with the second statement about the lexical inconsistency that the change would cause in the “foreach” command. The terms of the foreach command are stronger that the “left to right precedence” concept.

    I agree with some of the comments above that says that the "each" and the "in" terms in the command implies that the scope of the variable should occur within the enumeration loop.

    The code below makes more sense if you consider that each m is in the enumeration:

         while(e.MoveNext())

         {

           int m; // INSIDE

           m = (int)(int)e.Current;

           …

         }

    It’s not a single variable. It’s several, each one occurring inside the loop.

    So I’m in favor of the change.

    For those who need to maintain the previous behavior I propose a new command:

    iterate({variable} through {collection})

    that command would make clear that a single variable would iterate through the whole collection receiving it’s value.

  • Delegates Are Easy, Closure Requires Craftsmanship.

    If this is true, and I think it is, a compiler or tool warning leaves full control in the hands of your customers. I completely sympathize with the reality that the overwhelming bugged bug report is based on misunderstanding of closure. Warnings seem to be the best you can do in this situation.

    Perhaps mentioned earlier (no time to read every single comment), this "breaking change" would have to be propagated to other "foreach" iterators, no? For example rewriting the code above using List<int>.ForEach would have to work the new way as well.

    I have no fear that you guys would introduce the change properly - and I would learn the special case - but it seems like a lesser option than the warning approach.

  • this should be fixed but just cannot be. i guess 1000s of people depend on this without knowing.

Page 4 of 9 (134 items) «23456»