Closing over the loop variable considered harmful

Closing over the loop variable considered harmful

Rate This

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The "for" loop will not be changed. We return you now to our original article.


I don't know why I haven't blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That's a terrible situation for everyone; we very much wish to design a language which does not have "gotcha" features like this.

But I'm getting ahead of myself. What's the output of this fragment?

var values = new List<int>() { 100, 110, 120 };
var funcs = new List<Func<int>>();
foreach(var v in values)
  funcs.Add( ()=>v );
foreach(var f in funcs)
  Console.WriteLine(f());

Most people expect it to be 100 / 110 / 120.  It is in fact 120 / 120 / 120. Why?

Because ()=>v means "return the current value of variable v", not "return the value v was back when the delegate was created". Closures close over variables, not over values. And when the methods run, clearly the last value that was assigned to v was 120, so it still has that value.

This is very confusing. The correct way to write the code is:

foreach(var v in values)
{
  var v2 = v;
  funcs.Add( ()=>v2 );
}

Now what happens? Every time we re-start the loop body, we logically create a fresh new variable v2. Each closure is closed over a different v2, which is only assigned to once, so it always keeps the correct value.

Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for

  {
    IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
    try
    {
      int m; // OUTSIDE THE ACTUAL LOOP
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }

If we specified that the expansion was

    try
    {
      while(e.MoveNext())
      {
        int m; // INSIDE
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }

then the code would behave as expected.

It's compelling to consider fixing this for a hypothetical future version of C#, and I'd like to hear your feedback on whether we should do so or not. The reasons FOR making the change are clear; this is a big confusing "gotcha" that real people constantly run into, and LINQ, unfortunately, only makes it worse, because it is likely to increase the number of times a customer is going to use a closure in a loop. Also, it seems reasonable that the user of the foreach loop might think of there being a "fresh" loop variable every time, not just a fresh value in the same old variable. Since the foreach loop variable is not mutable by user code, this reinforces the idea that it is a succession of values, one per loop iteration, and not "really" the same variable over and over again. And finally, the change has no effect whatsoever on non-closure semantics. (In fact, in C# 1 the spec was not clear about whether the loop variable went inside or outside, since in a world without closures, it makes no difference.)

But that said, there are some very good reasons for not making this change.

The first reason is that obviously this would be a breaking change, and we hates them, my precious. Any developers who depend on this feature, who require the closed-over variable to contain the last value of the loop variable, would be broken. I can only hope that the number of such people is vanishingly small; this is a strange thing to depend on. Most of the time, people do not expect or depend on this behaviour.

Second, it makes the foreach syntax lexically inconsistent. Consider foreach(int x in M()) The header of the loop has two parts, a declaration int x and a collection expression, M(). The int x is to the left of the M(). Clearly the M() is not inside the body of the loop; that thing only executes once, before the loop starts. So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

Third, it would make the "foreach" semantics inconsistent with "for" semantics. We have this same problem in "for" blocks, but "for" blocks are much looser about what "the loop variable" is; there can be more than one variable declared in the for loop header, it can be incremented in odd ways, and it seems implausible that people would consider each iteration of the "for" loop to contain a fresh crop of variables. When you say for(int i; i < 10; i += 1) it seems dead obvious that the "i += 1" means "increment the loop variable" and that there is one loop variable for the whole loop, not a new fresh variable "i" every time through! We certainly would not make this proposed change apply to "for" loops.

And fourth, though this is a nasty gotcha, there is an easy workaround, and tools like ReSharper detect this pattern and suggest how to fix it. We could take a page from that playbook and simply issue a compiler warning on this pattern. (Though adding new warnings brings up a whole raft of issues of its own, which I might get into in another post.) Though this is vexing, it really doesn't bite that many people that hard, and it's not a big deal to fix, so why go to the trouble and expense of taking a breaking change for something with an easy fix?

Design is, of course, the art of compromise in the face of many competing principles. "Eliminate gotchas" in this case directly opposes other principles like "no breaking changes", and "be consistent with other language features". Any thoughts you have on pros or cons of us taking this breaking change in a hypothetical future version of C# would be greatly appreciated.

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)

 

  • I vote for a compiler warning, but I'm very interested in the issues around adding compiler warnings.

  • At first, I thought having a compiler warning would be the best, but after reading the above comments, I too realized that there are way too many common, legal and wanted situations to close over the loop variable. Note that in those situations, the proposed change would not make a (functional) difference (it might have non-functional implications, e.g. performance - but not against the manual fix we need to do now).

    I say make the change.

    The breaking change argument is valid, but has lost it from other arguments before. I admit in some cases breaking changes were allowed to bring the implementation in line with the spec, but that is not the point. The point is that breaking changes were made before.

    I find the "for each" argument very strong, as well as the consistency argument when applied to the read-only nature of the loop variable. I also agree with the argument that the change is unlikely to break really existing code, since very few would (or even should) rely on the existing behavior. Finally, the readability argument is important: code with the manual fix looks redundant and wrong.

    The statement that the compiler correctly implements the specification is true, but not relevant. The bug is in the specification. Fix it, and adjust the compiler to the improved specification.

  • That's why I like resharper. Resharper kindly warns stupid developers like me about access to modified closure and offers refactorings like introducing a variable as your example states.

    Daniel

  • I vote for behaviour change in C#5.

    I vote for a warning in C#4.

    I think its too late to change the behaviour in C#4, but C#4 should definitely give a warning. This warning should be on by default in C#4 and C#5.  The warning could be off by default in C#6 and later.

    In C#5 the behaviour should be changed to the new, expected way.

    For those cases, where the old behaviour is still needed, I like the syntax Joe suggested above:

    SomeType item;

    foreach (item in collection) { .. } // note: no var before item, so the variable outside the loop is used.

    However, I have failed so far to find a situation that needs the old behaviour ...

    In my humble opinion, the outside expansion

    // item var OUTSIDE loop

    try {

     int m;

     while (e.MoveNext ()) {

       m = (int) (int) e.Current;

       funcs.Add (() => m);

     }

    }

    is just bad code, because it conflicts with a rule I consider good:

    Declare/initialize/allcote a variable as late as possible with the smallest scope possible.  

    Proof that this rule is good: Pulling variable m out of the loop body for no apparent reason introduces weird and unexpected behaviour ... ;-P

    Therefore, I think the inside expansion

    // item var INSIDE loop

    try {

     while (e.MoveNext ()) {

       int m = (int) (int) e.Current;

       funcs.Add (() => m);

     }

    }

    is better code.  I think thats the code that the majority of prgrammers would wrote by hand if they had to replace foreach.

    To guesstimate the pain of this breaking change, it might make sense to feed all NET source plus codeplex and then something through a modified c#4 compiler version ..

  • I say to give the warning only where it would make a difference. That is, if you close over a loop variable and that closure escapes the loop, give a warning. In theory the only thing it should break is code that gets compiled with warnings as errors, and has closures over escaping loop variables.

  • > I say to give the warning only where it would make a difference. That is, if you close over a loop variable and that closure escapes the loop, give a warning.

    The problem, as pointed out earlier, is that a closure may not actually escape the loop, and there's no way the compiler can find out. Consider some hypothetical L2S code:

       foreach (var customer in ctx.Customers) {

           var orders = ctx.Orders.Where(o => o.Customer == customer).ToList();

           ...  

       }

    Now we have a lambda that captures "customer". It doesn't really escape here, because the enumerator in which it is stored is immediately discarded, and cannot be referenced outside the loop. But how is the compiler supposed to know that?

    So it would have to give a warning for the above, perfectly reasonable, code, and also the equivalent LINQ comprehension.

  • + for compiler warning.

    Anyone who stumbles across this issue once, and realizes the mistake - won't do it again. It's not worth introducing breaking changes, and making the language inconsistent. The compiler warning should suffice for all matters.

  • I would prefer that the current foreach semantics not be changed, but rather add another use for the 'new' keyword such that -

    foreach (new Object in Collection)...

    - implements the altered semantics of relocating the declaration.

  • +1 for a compiler warning, I got caught by this pitfall once and gave the weirdest program behaviour ever. Once I found the cause I immediately knew the mistake I'd made, but it is such a subtle semantic one it was difficult to find when reading my code. I think a compiler warning for it is a must. It would have saved me a lot of time tracking down the issue.

  • I give a vote (+1) for the warning. I have struggled with this a couple of times but not enough to want you guys to change the way the compiler behaves.

  • Make the breaking change.  Some seem to think a compiler warning is "more acceptable" than a breaking change, but in reality it just gives many of the negatives of breaking changes without any of the benefits of the actual change.  Many companies set "warnings as errors" making it the equivalent of an error, without the benefit of being able to use the corrected behavior.

    This change would be very beneficial in terms of making the language more closely follow the "principle of least astonishment."  Some have said that once they "discover" the behavior, surely it makes perfect sense.  However, I think something cannot possibly make "perfect sense" yet have to be discovered after getting it wrong.  I think "natural" and "intuitive" should take precedence here.

    Finally, as Pavel has mentioned repeatedly, making the change would make the foreach construct more consistent with Linq's "from item in items".

  • I vote for making the change. I mostly code in VB so at least I get a warning but still, it's annoying and counter-intuitive to have to create a local variable for this.

    If you decide to not make the change, at the very least add the warning in C#. If there wasn't a warning for this in VB I would have been bitten by this several times.

  • I'd suggest to make the breaking change too. I didn't got caught at all by this pitfall but that's only thanks to ReSharper. The first time I saw this warning I was like "but what the hell is resharper talking about?". Two minutes of googling later I understood the problem and immediately thought that this behavior was really counter-intuitive. That being said, I always keep in mind the existence of that behavior now that I know it and r# didn't caught me again.

    As many people stated, I too cannot think of any code that would rely on this behavior. So make the breaking change. If you don't, at least issue a warning so that people get used to it and change the behavior in a future version.

    Some have proposed to use "foreach (new ...)" but IMHO, it's just worse. What does the "new" here, it certainly doesn't construct a new instance like everywhere else in the language: it's not intuitive at all. To understand why 'new' is here, you have to understand the problem it solves. If you do, you don't have any use for this keyword since you're already aware of the modified closure problem. If you don't, you'll get caught by the old behavior. The same goes for the "inside/outside" keyword. Just read this sentence: "for each value outside enumeration" and you'll see what I mean: as stated before, "each" prevails in the understanding of "for each" and there shouldn't be any keyword affecting the loop variable.

    However, I'm totally for the "int item; foreach (item in data) {}" construct for those who need it since it's simple, natural and intuitive.

  • m = (int)(int)e.Current;

    Why the double cast?

  • @Brad: according to the definition of foreach in the language spec, there are two types involved: element type (which is normally the type of Current property of the enumerator), and iteration variable type (which is the type you explicitly specify in foreach, or same as element type if you use "var"). The associated line of the expansion of foreach is this:

           v = (V)(T)e.Current;

    So far as I can see, the (T) cast will almost always be a no-op, because it will correspond to type of Current. Seemingly the only case where this isn't so is when you're iterating over an array (the spec requires the use of IEnumerable over IEnumerable<T> in that case, and element type is derived not from Current, but from array type).

    I wonder why it wasn't simply redefined to use IEnumerable<T> for arrays as well (which would remove the need to special-case that, as well as the cast) - so far as I can see, the change wouldn't be observable to the user...

Page 7 of 9 (134 items) «56789