Closing over the loop variable considered harmful

Closing over the loop variable considered harmful

Rate This

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The "for" loop will not be changed. We return you now to our original article.


I don't know why I haven't blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That's a terrible situation for everyone; we very much wish to design a language which does not have "gotcha" features like this.

But I'm getting ahead of myself. What's the output of this fragment?

var values = new List<int>() { 100, 110, 120 };
var funcs = new List<Func<int>>();
foreach(var v in values)
  funcs.Add( ()=>v );
foreach(var f in funcs)
  Console.WriteLine(f());

Most people expect it to be 100 / 110 / 120.  It is in fact 120 / 120 / 120. Why?

Because ()=>v means "return the current value of variable v", not "return the value v was back when the delegate was created". Closures close over variables, not over values. And when the methods run, clearly the last value that was assigned to v was 120, so it still has that value.

This is very confusing. The correct way to write the code is:

foreach(var v in values)
{
  var v2 = v;
  funcs.Add( ()=>v2 );
}

Now what happens? Every time we re-start the loop body, we logically create a fresh new variable v2. Each closure is closed over a different v2, which is only assigned to once, so it always keeps the correct value.

Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for

  {
    IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
    try
    {
      int m; // OUTSIDE THE ACTUAL LOOP
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }

If we specified that the expansion was

    try
    {
      while(e.MoveNext())
      {
        int m; // INSIDE
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }

then the code would behave as expected.

It's compelling to consider fixing this for a hypothetical future version of C#, and I'd like to hear your feedback on whether we should do so or not. The reasons FOR making the change are clear; this is a big confusing "gotcha" that real people constantly run into, and LINQ, unfortunately, only makes it worse, because it is likely to increase the number of times a customer is going to use a closure in a loop. Also, it seems reasonable that the user of the foreach loop might think of there being a "fresh" loop variable every time, not just a fresh value in the same old variable. Since the foreach loop variable is not mutable by user code, this reinforces the idea that it is a succession of values, one per loop iteration, and not "really" the same variable over and over again. And finally, the change has no effect whatsoever on non-closure semantics. (In fact, in C# 1 the spec was not clear about whether the loop variable went inside or outside, since in a world without closures, it makes no difference.)

But that said, there are some very good reasons for not making this change.

The first reason is that obviously this would be a breaking change, and we hates them, my precious. Any developers who depend on this feature, who require the closed-over variable to contain the last value of the loop variable, would be broken. I can only hope that the number of such people is vanishingly small; this is a strange thing to depend on. Most of the time, people do not expect or depend on this behaviour.

Second, it makes the foreach syntax lexically inconsistent. Consider foreach(int x in M()) The header of the loop has two parts, a declaration int x and a collection expression, M(). The int x is to the left of the M(). Clearly the M() is not inside the body of the loop; that thing only executes once, before the loop starts. So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

Third, it would make the "foreach" semantics inconsistent with "for" semantics. We have this same problem in "for" blocks, but "for" blocks are much looser about what "the loop variable" is; there can be more than one variable declared in the for loop header, it can be incremented in odd ways, and it seems implausible that people would consider each iteration of the "for" loop to contain a fresh crop of variables. When you say for(int i; i < 10; i += 1) it seems dead obvious that the "i += 1" means "increment the loop variable" and that there is one loop variable for the whole loop, not a new fresh variable "i" every time through! We certainly would not make this proposed change apply to "for" loops.

And fourth, though this is a nasty gotcha, there is an easy workaround, and tools like ReSharper detect this pattern and suggest how to fix it. We could take a page from that playbook and simply issue a compiler warning on this pattern. (Though adding new warnings brings up a whole raft of issues of its own, which I might get into in another post.) Though this is vexing, it really doesn't bite that many people that hard, and it's not a big deal to fix, so why go to the trouble and expense of taking a breaking change for something with an easy fix?

Design is, of course, the art of compromise in the face of many competing principles. "Eliminate gotchas" in this case directly opposes other principles like "no breaking changes", and "be consistent with other language features". Any thoughts you have on pros or cons of us taking this breaking change in a hypothetical future version of C# would be greatly appreciated.

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)

 

  • If you have a good understanding of lambdas and closures, this code does exactly what you would expect it to do (nothing like your foreach (var m in from m in data orderby m select m) from a few days ago).

    It seems like changing it would do far more harm than good.  I think you adequately summed up the arguments in favor of not changing it.

    In particular, if you use re-sharper (and pay attention to it) you'll catch this right away.  Perhaps you should just buy JetBrains and integrate their product?  :P

  • Please do NOT change this, it is 100% correct. If people do not understand "Closures close over variables, not over values" then it is a matter of training [I actually use a similar sample in my assessment screening for midlevel and senior developers]

    Remember the imortal words of Scott Meyers from nearly 20 years ago (has it really been that long?).."Do not Mollycoddle your users (programmers in this case)"

  • You should definitely do something about this.  I see it on StackOverflow all the time.

    I like the "Add a compiler warning" idea. I like it a lot, because it side-steps the breaking change and still throws something in front of the user to help avoid the mistake.

    As for the "lexical" argument, I think the stronger argument is that the declaration of the variable occurs outside of the { } block.  Consider:

    " { int x; } " versus " int x; { } ".  This code feels much more like the latter.  Since you declare the variable outside of the scope block, obviously it should be the same variable.  Changing it to something else would be deeply weird.

  • Initially I was all for making this change. I haven't ever seen a legitimate use of closing over the iteration variable (except as a demonstration of why you shouldn't). However, I hadn't thought about the consistency issue with for loops. Considering that, I would argue that consistency wins. Reducing the cognitive load of understanding a language is highly valuable, and having these constructs behave differently in this regard seems like a recipe for confusion.

    Definitely do make a warning for this though. Surely in upwards of 99% of cases closing over the iteration variable is not going to produce the result the programmer expects.

  • Please do not change this.  The behavior is consistant with the rest of the language.  I can't think of another case where declaring a variable actually declares a bunch of variables that appear to point to the same data but don't.

  • In my opinion, this should at worst be a compiler warning.  The current behavior fits exactly with the delayed execution I would expect from a lambda closure.  For example:

    void Main()
    {
       int i = 0;
       Action foo = () => Console.WriteLine(i);
       i = 6;
       Console.WriteLine(i); // 6, as expected
       foo(); // also 6, not 0
    }

    The foreach is currently behaving consistently with this and this is the behavior I would expect.  If this change was made to foreach, consistency would suggest to me that this program print out 6 and then 0.  That change would mean that lambdas closed over values, not variables which would be a tremendous breaking change.

    I take your point. But let's be clear on this. The proposal is not to make lambdas close over values. The proposal is to make the loop variable of a foreach loop into one variable per iteration, not one variable shared between all iterations. Since a foreach loop variable is logically immutable anyway, it's already a very weak sort of "variable". The question is not whether lambdas close over variables; clearly they do. The question is about how many variables a foreach loop generates, one per loop, or one per loop iteration. -- Eric

  • I'd say change. The strongest argument I can think of is that it is inconsistent with seemingly equivalent LINQ:

      var funcs = new List<Func<int>>();

      funcs.AddRange(from i in Enumerable.Range(0, 10) select(Func<int>)(() => i));

      funcs.ForEach(f => Console.WriteLine(f()));

    This prints 0 to 9, as expected. It's also a counter-argument to this claim:

    > So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

    Because you already do just that in at least one place in the language.

    The other strong argument is that the existing behavior is deeply unintuitive. Vast majority of C# developers get it wrong the first time they use lambda inside a foreach (and many get it wrong repeatedly even after they already know how it actually works). This is a clear sign that it's not a good approach.

    Finally, the existing behavior is inconsistent with regard to pseudo-readonly status of the range variable. For normal readonly fields, I can expect that not only I cannot write to them, but noone else can, either (except during initialization). Foreach range variable is kinda weird - it's readonly to my code, but there's some other magic code there that can change it. IMO, a much more straightforward definition is as a true readonly local, which it can be if it's "auto-declared" inside the loop body.

    > Please do NOT change this, it is 100% correct. If people do not understand "Closures close over variables, not over values" then it is a matter of training

    People do understand "close over variables" just fine. They don't understand the lifetime of the range variable.

    > I can't think of another case where declaring a variable actually declares a bunch of variables that appear to point to the same data but don't.

    Uh, how about any time you explicitly declare a local inside the body of the loop?

  • >Uh, how about any time you explicitly declare a local inside the body of the loop?

    If it's in the body of the loop, you expect that line of code to be executed multiple times, therefore you expect multiple allocations.  If it's not inside the body of the loop, you expect that line to get hit once

  • I agree with Mike Van Til, if you understand how it all works, it only makes sense that it should work that way. I ran into this problem about a month ago, and once I discovered the source of the bug, I immediately realized what was going on and thought it made perfect sense. Although I use resharper with visual studio, at the time I was using monodevelop. If anything is done about this at all, I like the idea of a compiler warning.

  • There's one more argument that I've missed. It is perhaps weak, but nonetheless...

    The problem is with wording. C# foreach statement reads rather naturally:

      foreach (var item in items) ...

    That's literally "for each item in items, do ...". The "each" word here strongly implies that we're taking each _separate_ item in turn, and doing something to it.

    In contrast, with a plain for, the fact that variable is shared is also quite explicit when you "say it aloud":

      for (int i = 0; i < 10; ++i) ...

    That's "for i, initially assigned to 0, incremented by 1, until it reaches 10, do ...". It's very explicit and obvious here that there's a single i for the whole loop, and it gets mutated on each iterator. Not so with "for each".

    For people arguing strongly against the change (and to remind, the change is _not_ "make lambda capture the value"; the change is to the scope of the range variable), I'm genuinely curious: is there any particular reason apart from backwards compatibility? Do you actually ever _relied_ on the fact that there's only one range variable in foreach, and it changes? Can you think of any scenario in which you'd prefer the existing behavior with respect to foreach and lambdas to be the case, because it's more convenient? If not, then what difference would it make to you either way?

  • @Eric,

    I think I poorly phrased a portion of my response.  That is what I get for hurriedly writing responses while at work! Anyway, I really intended to convey that if the foreach behaved as proposed, it would "feel" as if the foreach were closing over the values instead of the variables, which could lead one to expect similar behavior out of the sample program.  I certainly did not intend to suggest that you were proposing that lambdas would close over values.

  • @Pavel - [IMHO] programmers should realize that:

         foreach(var x in y)

    should be "read" as:

         for(var x assign each element from y)

  • > If it's in the body of the loop, you expect that line of code to be executed multiple times, therefore you expect multiple allocations.  If it's not inside the body of the loop, you expect that line to get hit once

    Well, conditions inside while and for loops are not inside the body of the loop, either:

      while (GetFoo() != 0) ...

      for (int i = 0; i < 10; ++i)

    but you expect "that line" - specifically, the call to GetFoo(), and ++i - to be hit repeatedly nonetheless.

  • > "programmers should realize that"

    Why _should_ they?

    I understand what you probably mean - that someone coding in C# should know this particular thing about the language. But we aren't talking about knowing what's there, but rather about the "perfect" state of affairs, and then I think that it's best to stick to most intuitive interpretation of the keyword. It doesn't say "assign" anywhere, or "=", or anything else to imply assignment.

    In any case, the strongest argument is still the existing mismatch in lifetime of range variable between loop "foreach (var x in xs)" and LINQ "from x in xs". Everything else is just the lack of icing on the cake that isn't there.

  • In Javascript, I generally resort to a 'double closure' - this is particularly needed in JS as nested functions are the only way to create a new scope block.

    funcs.Add((function(v2){return function(){return v2};})(v));

    translated to mostly equivalent [except i'd have to know the type of v, using int here] C#

    funcs.Add(((Func<int,Func<int>>)((v2) => () => v2))(v));

Page 1 of 9 (134 items) 12345»