Closing over the loop variable considered harmful

Closing over the loop variable considered harmful

Rate This

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The "for" loop will not be changed. We return you now to our original article.


I don't know why I haven't blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That's a terrible situation for everyone; we very much wish to design a language which does not have "gotcha" features like this.

But I'm getting ahead of myself. What's the output of this fragment?

var values = new List<int>() { 100, 110, 120 };
var funcs = new List<Func<int>>();
foreach(var v in values)
  funcs.Add( ()=>v );
foreach(var f in funcs)
  Console.WriteLine(f());

Most people expect it to be 100 / 110 / 120.  It is in fact 120 / 120 / 120. Why?

Because ()=>v means "return the current value of variable v", not "return the value v was back when the delegate was created". Closures close over variables, not over values. And when the methods run, clearly the last value that was assigned to v was 120, so it still has that value.

This is very confusing. The correct way to write the code is:

foreach(var v in values)
{
  var v2 = v;
  funcs.Add( ()=>v2 );
}

Now what happens? Every time we re-start the loop body, we logically create a fresh new variable v2. Each closure is closed over a different v2, which is only assigned to once, so it always keeps the correct value.

Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for

  {
    IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
    try
    {
      int m; // OUTSIDE THE ACTUAL LOOP
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }

If we specified that the expansion was

    try
    {
      while(e.MoveNext())
      {
        int m; // INSIDE
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }

then the code would behave as expected.

It's compelling to consider fixing this for a hypothetical future version of C#, and I'd like to hear your feedback on whether we should do so or not. The reasons FOR making the change are clear; this is a big confusing "gotcha" that real people constantly run into, and LINQ, unfortunately, only makes it worse, because it is likely to increase the number of times a customer is going to use a closure in a loop. Also, it seems reasonable that the user of the foreach loop might think of there being a "fresh" loop variable every time, not just a fresh value in the same old variable. Since the foreach loop variable is not mutable by user code, this reinforces the idea that it is a succession of values, one per loop iteration, and not "really" the same variable over and over again. And finally, the change has no effect whatsoever on non-closure semantics. (In fact, in C# 1 the spec was not clear about whether the loop variable went inside or outside, since in a world without closures, it makes no difference.)

But that said, there are some very good reasons for not making this change.

The first reason is that obviously this would be a breaking change, and we hates them, my precious. Any developers who depend on this feature, who require the closed-over variable to contain the last value of the loop variable, would be broken. I can only hope that the number of such people is vanishingly small; this is a strange thing to depend on. Most of the time, people do not expect or depend on this behaviour.

Second, it makes the foreach syntax lexically inconsistent. Consider foreach(int x in M()) The header of the loop has two parts, a declaration int x and a collection expression, M(). The int x is to the left of the M(). Clearly the M() is not inside the body of the loop; that thing only executes once, before the loop starts. So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

Third, it would make the "foreach" semantics inconsistent with "for" semantics. We have this same problem in "for" blocks, but "for" blocks are much looser about what "the loop variable" is; there can be more than one variable declared in the for loop header, it can be incremented in odd ways, and it seems implausible that people would consider each iteration of the "for" loop to contain a fresh crop of variables. When you say for(int i; i < 10; i += 1) it seems dead obvious that the "i += 1" means "increment the loop variable" and that there is one loop variable for the whole loop, not a new fresh variable "i" every time through! We certainly would not make this proposed change apply to "for" loops.

And fourth, though this is a nasty gotcha, there is an easy workaround, and tools like ReSharper detect this pattern and suggest how to fix it. We could take a page from that playbook and simply issue a compiler warning on this pattern. (Though adding new warnings brings up a whole raft of issues of its own, which I might get into in another post.) Though this is vexing, it really doesn't bite that many people that hard, and it's not a big deal to fix, so why go to the trouble and expense of taking a breaking change for something with an easy fix?

Design is, of course, the art of compromise in the face of many competing principles. "Eliminate gotchas" in this case directly opposes other principles like "no breaking changes", and "be consistent with other language features". Any thoughts you have on pros or cons of us taking this breaking change in a hypothetical future version of C# would be greatly appreciated.

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)

 

  • While reading this blog entry, I was reminded of an earlier entry where you described a breaking change that was made in order to introduce generics:

    public static void Method(bool one, bool two) { }

    public static void Main() {

       int a = 0, b = 0, c = 0, d = 0, e = 0;

       Method(a < b, c > (d + e));  // used to compile in C# 1, now compiler error

       Method(a < b, c > (int)d);    // even worse; compiler errors rather hard to understand

    }

    I was surprised that this kind of breaking change would have been made; it was easy for me to imagine that some amount of code might exist that uses this construct and would now fail to compile.

    With the foreach loop, the danger of the breaking change is greater because the change is not caught by a compiler error; however, my intuition is that the frequency of incidence is vastly lower than in the above case. I can see the above construct used intentionally in meaningful code, but I cannot see the current behaviour of the foreach loop used intentionally at all. More likely it was used accidentally in the belief that the foreach variable is scoped inside the loop, and the bug never noticed; hence, this change would be more of a fixing change than a breaking change.

    I agree that the variable scoping would become inconsistent with that in 'for' loops; however, I find that argument rather weak because it is not the kind of inconsistency that would trip anyone up. You can declare a variable in the head of a 'for' loop but not in a 'while' loop; this is strictly an "inconsistency" but no-one really bothers because it doesn't cause people to write subtle bugs.

    The lexical inconsistency I feel is the weakest argument, because it is inconsistent _either_ way, and so is the 'for' loop. Consider:

    for (var i = 0; i < 10; i++)

     DoSomething(i);

    for (var i = 0; i < 10; i++)

     DoSomethingElse(i);

    If the variable 'i' was scoped outside the 'for' loop, then you shouldn't be able to declare it twice. But you can, and they are separate variables. And yet it is not scoped inside the loop. It seems to be scoped somewhere in the middle; it has a scope all to itself. The same is true of 'foreach' at the moment, by being scoped inside the 'try' but outside the 'while'. To fully appreciate this, the programmer needs to realise the existence of that 'try' and 'while'. This is bad enough in the case of 'foreach' but at least you can find it in the specification and lots of blog entries and websites; but in the case of 'for', it is really rather mysterious.

    My conclusion, therefore, is that the arguments against are weak, and the benefits reaped are large, given the apparent notoriety of this gotcha and the desirable goal of a "pit of success" language.

  • According to the article, int acts as a reference type, not a variable type. Where do I make a mistake?

    I do not understand the question. Ints are of value type, not of reference type. I do not know what "variable type" means. Can you rephrase the question to explain what it is that you do not understand? -- Eric

  • According to the article, int acts as a reference type, not a variable type. Where do I make a mistake?

    [Rephrased]: In C# all instances of all classes are of reference types. All instances of structs are simple "variable" types. So if two references are pointing to one object and you will make some changes in this object via first reference, changes will also be reflected if you examine this object via second reference. This is of course not true for simple types, like int. Sorry for not being clear at the beginning.

    You seem to be confusing variables with value types. There's no such thing as a "variable type". A variable HAS a type, which can be a reference type or a value type. Either way, a variable is a storage location which can store a value or a reference, depending on its type. Closures close over variables, not values. You can of course change the value stored in a variable; that's what "variable" means: that the value can vary. -- Eric

  • C# Reference clearly states that each element in the loop should be treated as a separate entity:

    "The foreach statement repeats a group of embedded statements for each element in an array or an object collection. The foreach statement is used to iterate through the collection to get the desired information, but should not be used to change the contents of the collection to avoid unpredictable side effects."

    As was repeatedly stated above: The syntax "foreach(var x in Y)" basically conveys that you pick out objects "in" the bucket Y and name each object x - not the you place those Y objects in a separate container x.

    Current implementation is not conforming to the above and so should be fixed.

  • Issue a warning in C#4, and change it in C#5.

  • @Konrad: "I much prefer Java’s rule here: only allow closing over variables which are declared `final`. But then, this only works because Java allows local variables to be declared `final`, which C# doesn’t."

    The reason Java supports final variables is exactly to fix this problem.  The designer of inner classes was a Lisp refugee who had been bitten by the DOTIMES issue (see Peter Seibel above), and didn't want to allow the problem into Java.

    This particular barn door is hard to close after the mutable iteration variable has galloped away.

  • I vote for making the change.

    I think adding a warning or an error is the worst possible thing to do - I hate it when the message I get from a system is "you did this wrong, we know what you did wrong, we would rather tell you then do the right thing for you." - that's not the kind of system I enjoy using, be it programming language or business application.

    This one may go even further - "we tricked you into doing the wrong thing, we know what you did was wrong, but we'd rather tell you about it than do the right thing for you."

    All kinds of annoyance there. The VB error message is exactly that. See Alan Cooper for how error messages should be (mainly, don't have them).

  • Python allows both ways:

    >>> f = []

    >>> for i in range(10): f.append(lambda : i)

    >>> f[0]()

    9

    On the other hand:

    >>> f = []

    >>> for i in range(10): f.append(lambda v = i: v)

    >>> f[0]()

    0

    I've called this "v" for clarity. Of course, this also works:

    >>> f = []

    >>> for i in range(10): f.append(lambda i = i: i)

    >>> f[0]()

    0

  • I say change it since few people used it knowingly.

    Other languages should follow suit too in the same VS release.

    The warning is too often overlooked by many peoples.

  • Thanks for the clear explanation!

  • Something has got to change.  Otherwise, the StackOverflow developers are going to have to create a new site called closeoverloopvariable.stackexchange.com dedicated specifically to this problem.  It has to be one of the most asked questions on SO.  I bet there are literally thousands of them lurking there already.

  • Yes, please, Eric, consider changing the scope of the foreach variable so that it isn't shared between iterations.

    C# is generally fantastic at doing just the right thing, without introducing ambiguity and without the developers even fully understanding just what really happens behind the scenes. The foreach variable capturing is an unfortunate example of where this doesn't happen.

  • I just saw it claimed elsewhere that this change would be made in C# 5, and I came here to see if it were true.  It is!  Hooray!  I shall bid a not-so-fond farewell to the bugs this used to lead to.

  • How do you solve this?

    //---

    var actions = new List<Action>();

    for ( int counter = 0; counter < 10; ++counter )

    {

       actions.Add ( () => Console.WriteLine(counter) );

    }

    foreach (var action in actions)

    {

       action();

    }

    //---

    The range-based-for (foreach) sugar is wrong, but you need a capture-by-value lambdas.

    Especially in a language that tries to support value semantics.

Page 9 of 9 (134 items) «56789