Closing over the loop variable considered harmful

Closing over the loop variable considered harmful

Rate This

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The "for" loop will not be changed. We return you now to our original article.


I don't know why I haven't blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That's a terrible situation for everyone; we very much wish to design a language which does not have "gotcha" features like this.

But I'm getting ahead of myself. What's the output of this fragment?

var values = new List<int>() { 100, 110, 120 };
var funcs = new List<Func<int>>();
foreach(var v in values)
  funcs.Add( ()=>v );
foreach(var f in funcs)
  Console.WriteLine(f());

Most people expect it to be 100 / 110 / 120.  It is in fact 120 / 120 / 120. Why?

Because ()=>v means "return the current value of variable v", not "return the value v was back when the delegate was created". Closures close over variables, not over values. And when the methods run, clearly the last value that was assigned to v was 120, so it still has that value.

This is very confusing. The correct way to write the code is:

foreach(var v in values)
{
  var v2 = v;
  funcs.Add( ()=>v2 );
}

Now what happens? Every time we re-start the loop body, we logically create a fresh new variable v2. Each closure is closed over a different v2, which is only assigned to once, so it always keeps the correct value.

Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for

  {
    IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
    try
    {
      int m; // OUTSIDE THE ACTUAL LOOP
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }

If we specified that the expansion was

    try
    {
      while(e.MoveNext())
      {
        int m; // INSIDE
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }

then the code would behave as expected.

It's compelling to consider fixing this for a hypothetical future version of C#, and I'd like to hear your feedback on whether we should do so or not. The reasons FOR making the change are clear; this is a big confusing "gotcha" that real people constantly run into, and LINQ, unfortunately, only makes it worse, because it is likely to increase the number of times a customer is going to use a closure in a loop. Also, it seems reasonable that the user of the foreach loop might think of there being a "fresh" loop variable every time, not just a fresh value in the same old variable. Since the foreach loop variable is not mutable by user code, this reinforces the idea that it is a succession of values, one per loop iteration, and not "really" the same variable over and over again. And finally, the change has no effect whatsoever on non-closure semantics. (In fact, in C# 1 the spec was not clear about whether the loop variable went inside or outside, since in a world without closures, it makes no difference.)

But that said, there are some very good reasons for not making this change.

The first reason is that obviously this would be a breaking change, and we hates them, my precious. Any developers who depend on this feature, who require the closed-over variable to contain the last value of the loop variable, would be broken. I can only hope that the number of such people is vanishingly small; this is a strange thing to depend on. Most of the time, people do not expect or depend on this behaviour.

Second, it makes the foreach syntax lexically inconsistent. Consider foreach(int x in M()) The header of the loop has two parts, a declaration int x and a collection expression, M(). The int x is to the left of the M(). Clearly the M() is not inside the body of the loop; that thing only executes once, before the loop starts. So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

Third, it would make the "foreach" semantics inconsistent with "for" semantics. We have this same problem in "for" blocks, but "for" blocks are much looser about what "the loop variable" is; there can be more than one variable declared in the for loop header, it can be incremented in odd ways, and it seems implausible that people would consider each iteration of the "for" loop to contain a fresh crop of variables. When you say for(int i; i < 10; i += 1) it seems dead obvious that the "i += 1" means "increment the loop variable" and that there is one loop variable for the whole loop, not a new fresh variable "i" every time through! We certainly would not make this proposed change apply to "for" loops.

And fourth, though this is a nasty gotcha, there is an easy workaround, and tools like ReSharper detect this pattern and suggest how to fix it. We could take a page from that playbook and simply issue a compiler warning on this pattern. (Though adding new warnings brings up a whole raft of issues of its own, which I might get into in another post.) Though this is vexing, it really doesn't bite that many people that hard, and it's not a big deal to fix, so why go to the trouble and expense of taking a breaking change for something with an easy fix?

Design is, of course, the art of compromise in the face of many competing principles. "Eliminate gotchas" in this case directly opposes other principles like "no breaking changes", and "be consistent with other language features". Any thoughts you have on pros or cons of us taking this breaking change in a hypothetical future version of C# would be greatly appreciated.

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)

 

  • Here's my broad suggestion:

    Leave the existing syntax as is, but add a warning in the same way VB does (yes, even though the lambda doesn't always outlive the iteration). Introduce a new syntax which unifies foreach and LINQ:

       foreach (from x in xs) { ... }

    The expression inside () can be any LINQ comprehension that is allowed today, but without the final "select". All range variables that would be visible in "select" are visible in the body of foreach, and bound for every iteration (consistent with plain LINQ). So, for example, in:

       foreach (from x in xs group x by x%3 into xg where xg.Count() > 1) { ... }

    the body of foreach would be able to reference xg, but not x.

    This has the added beneficial effect of simplifying the existing pattern of iterating directly over simple queries. You'd think you can at least write:

      foreach (var x in from x in xs where x > 0 select x) { ... }

    but this won't work because two "x" will clash, so you need to come up with "x1" or other similarly descriptive name for one of them.

    Getting back to the original problem, it's also why I also propose to make a warning for lambda capturing a variable in old-style foreach(var) loop - any existing code that does it for legitimate reasons (e.g. because lambda doesn't escape) can be trivially rewritten to use foreach(from).

    Thoughts?

  • Yes, please change it. This problem caught me out. The 'each' in the keyword 'foreach' led me to believe I'd get a new variable for each element in the collection. It turns out that the name of that keyword is currently a lie.

    When I fixed the bug this problem caused I felt I had to put lots of comments next to the code to prevent any of my colleagues from trying to helpfully clean up the code by removing the unnecessary looking variable I had to declare to fix the problem.

    The only reason not to change is if a significant number of people can cite existing code bases that would break if the change was made. If people do have code that would break, then we should at least have a warning.

    Keep up the good work.

    Regards,

      Scott

  • I would like a syntax that forces the evaluation of a variable at the time of the lamda's declaration.  It would save the necessity of creating dummy variables (as in your example) and have the secondary effect of allowing the compiler to produce better code.

  • I agree with the calls to change it.

    Some small percentage of users will have written code that relies upon the existing behaviour. Some percentage of this group will be unaware of the change and they will see code that does not execute as expected. A lot of developers don't understand the existing behaviour of foreach and so write code that does not execute as expected today.

    Most people (not all) agree that from the reading of foreach, the current behaviour is counter-intuitive. People having trouble today are being caught by the intuitive way being incorrect. People who would have trouble with the change are people who have adapted to unintuitive behaviour. Following Rico Mariani's "Pit of Success" concept, making the change would lead to the "right" way being the obvious way, which is certainly not true today.

    The number of people suffering from breaking changes will diminish over time and most of the pain will be dealt with by the following (hypothetical) version of C#. If the change is not made, the current pain is not likely to get much better over time. Each successive release of C# will carry the same burden.

    Problems like this collect as baggage in the language. I think C# is really evolving well, but the evolution can't just be about picking up new great things. Taking the pain to change the bad things early keeps the language ... sharp? Oh god that was terrible :P

  • The bugs caused by this common misunderstanding of closures are very easy to catch. I do agree with most of Pavel Minaev's points, but I just don't care strongly whether the foreach expansion is changed. I do feel strongly that something should be done to remedy the situation. For me, the biggest win that closures provide in C# is making my code clearer by eliminating noise like the v2 local var. If you do not change the foreach expansion, please add the warning and add a mechanism to avoid the inner variable declaration. It could be something like this:

    foreach(var v in values)
    {
       funcs.Add( ()=>valueof(v) );
    }

    In this case you could use:

       values.ForEach( v => funcs.Add( () => v ) );

    Why isn't .ForEach() part of IEnumerable?

     

  • I want to emphasize: it should be an error rather than a warning. Because for someone who made this error mistakenly, it's an error. And someone who wrote that code deliberately (for whatever unimaginable reason) better rewrite it as

       int item_global;

       foreach (int item in items) {

           item_global = item;

           /* do whatever you want now */

       }

    Really, making it a warning makes no sense at all. Even people porting legacy code are better served with an error.

  • Change it so new developers don't get bitten. Or at least start giving warnings like VB seems to do according to Pavel Minaev [MSFT]

  • @TheCPUWizard -  In C++ the for loop variable did exist past the end of the block in most implementations up until just before the original C++ standard was approved in 1998 - several months after VC6 was released.  VC7.1 introduced the standard-compliant behavior as an option, with VC8 changing to the standard-compliant behavior as the default and the pre-standard behavior as the option.

  • For the record, after reading many of these reponses, I think the right thing to do would be to issue a warning in the nearest release possible and change the behavior in the release after that.  

    The current behavior, while consistent (and intuitive if you understand enough of the details) is UNuseful - it's likely that any code today that's relying on the current behavior is already broken.  

  • Perhaps *because* I come from a FP background, the behavior of the foreach variable surprised me. I have made this mistake several times! You never see the variable being mutated explicitly, and indeed you're not allowed to change it, so you don't realize you're capturing the variable. I think the behavior should be changed so that the variable was bound freshly for each loop iteration.

  • These weird semantics are the first thing I noticed when I first saw code samples for the then-new closures in C# 3.0.  They go against all of the functional programming semantics that I've ever experienced (OCaml, Haskell, F#, Scheme, ...) and I find it very unnatural.  Closing on a variable reference is a state-heavy concept in the first place, it's contrary to the aspirations of functional programming, and closures are meant to fulfill these aspirations.

    I think it's still time to change.

  • @Pavel - I love your foreach(from x in xs) solution for LINQ syntax, but not really for regular foreach syntax.  It just seems too much of a subtle difference in syntax that does not make the difference obvious.  When there are subtle differences in meaning that has caused a lot of problems in the past, I think it is best to be as blunt and obvious as possible in syntax.  That is why I suggested the 3 foreach syntaxes:

    foreach (var x in xs) -- error or warning (if people insisted) on closure of x

    foreach (var x inside xs) -- x is inside body and is instantiated once per loop (closure allowed)

    foreach (var x outside xs) -- x is outside body and is instantiated once (closure allowed)

    However, if your foreach(from) syntax was the solution chosen, I would not complain ... it would be much better than the current state.

  • Before reading Frank Bakker's comment about valid examples I thought a warning would be nice. But closing over the loop variable inside the loop is quite common, so either make the change or don't do anything (don't add a warning).

  • Pavel, thanks for the VB note.  I was wondering if anyone was going to say that the competing compiler gives this warning already.   I get it each time I do it, and fix it.   I'd hate to think how many of those might have turned into bugs... I didn't realize c# doesnt remove the burden of omniscience like vb did.

    ** I say CHANGE IT.  And change it in VB too, please!  I hear there is some sorta cooperation on things like this now :)

    It always surprises me, and everyone else, that it works the way it does.  It reads as pavel says, and .   If I wanted torture I'd go back to C++ :)  or C# can just become as hard and we can call it C##!

    However, if the backwards compatibility issues are real and not imagined, then it's up to you guys.  We can always code in the workaround variable.

    I don't think that anyone who isn't trying to say some variation of how cool/smart/experienced/rough and tough they are would *expect* it to behave how it does.  Even those who do understand always figured it out after it bit them, never before.  For every weird behavior there is always a chorus of "but you just need to learn the language more!" or my favorite "I'd fire anyone who wrote that code".  

  • Another potentially interesting data point is that Python also mutates range variable (and its lambdas consequently capture it in the same way C# does that):

       fs = []

       for x in range(0, 3):

           fs.append(lambda: x)

       for f in fs:

           print f()

    Output:

       2

       2

       2

    But so far Python is the only other language (apart from C# and VB) in which I can observe this.

    Furthermore, in Python, it is at least consistent with its list comprehensions:

       fs = [(lambda: x) for x in range(0, 3)]

       for f in fs:

           print f()

    Also outputs:

       2

       2

       2

    And Python lets you change range variable in "for" (it will simply be overwritten on the next iteration), and doesn't scope it to the loop body:

       for x in range(0, 3):

           pass

       print x

    Output:

       2

    So at least its behavior, while still (IMO) unintuitive, is at least fully self-consistent: a range variable really is just a repeatedly assigned variable, nothing more. There's nothing magical about it - it's not "semi-readonly", it's not specially scoped, etc.

Page 6 of 9 (134 items) «45678»