Closing over the loop variable considered harmful

Closing over the loop variable considered harmful

Rate This

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The "for" loop will not be changed. We return you now to our original article.


I don't know why I haven't blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That's a terrible situation for everyone; we very much wish to design a language which does not have "gotcha" features like this.

But I'm getting ahead of myself. What's the output of this fragment?

var values = new List<int>() { 100, 110, 120 };
var funcs = new List<Func<int>>();
foreach(var v in values)
  funcs.Add( ()=>v );
foreach(var f in funcs)
  Console.WriteLine(f());

Most people expect it to be 100 / 110 / 120.  It is in fact 120 / 120 / 120. Why?

Because ()=>v means "return the current value of variable v", not "return the value v was back when the delegate was created". Closures close over variables, not over values. And when the methods run, clearly the last value that was assigned to v was 120, so it still has that value.

This is very confusing. The correct way to write the code is:

foreach(var v in values)
{
  var v2 = v;
  funcs.Add( ()=>v2 );
}

Now what happens? Every time we re-start the loop body, we logically create a fresh new variable v2. Each closure is closed over a different v2, which is only assigned to once, so it always keeps the correct value.

Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for

  {
    IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
    try
    {
      int m; // OUTSIDE THE ACTUAL LOOP
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }

If we specified that the expansion was

    try
    {
      while(e.MoveNext())
      {
        int m; // INSIDE
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }

then the code would behave as expected.

It's compelling to consider fixing this for a hypothetical future version of C#, and I'd like to hear your feedback on whether we should do so or not. The reasons FOR making the change are clear; this is a big confusing "gotcha" that real people constantly run into, and LINQ, unfortunately, only makes it worse, because it is likely to increase the number of times a customer is going to use a closure in a loop. Also, it seems reasonable that the user of the foreach loop might think of there being a "fresh" loop variable every time, not just a fresh value in the same old variable. Since the foreach loop variable is not mutable by user code, this reinforces the idea that it is a succession of values, one per loop iteration, and not "really" the same variable over and over again. And finally, the change has no effect whatsoever on non-closure semantics. (In fact, in C# 1 the spec was not clear about whether the loop variable went inside or outside, since in a world without closures, it makes no difference.)

But that said, there are some very good reasons for not making this change.

The first reason is that obviously this would be a breaking change, and we hates them, my precious. Any developers who depend on this feature, who require the closed-over variable to contain the last value of the loop variable, would be broken. I can only hope that the number of such people is vanishingly small; this is a strange thing to depend on. Most of the time, people do not expect or depend on this behaviour.

Second, it makes the foreach syntax lexically inconsistent. Consider foreach(int x in M()) The header of the loop has two parts, a declaration int x and a collection expression, M(). The int x is to the left of the M(). Clearly the M() is not inside the body of the loop; that thing only executes once, before the loop starts. So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

Third, it would make the "foreach" semantics inconsistent with "for" semantics. We have this same problem in "for" blocks, but "for" blocks are much looser about what "the loop variable" is; there can be more than one variable declared in the for loop header, it can be incremented in odd ways, and it seems implausible that people would consider each iteration of the "for" loop to contain a fresh crop of variables. When you say for(int i; i < 10; i += 1) it seems dead obvious that the "i += 1" means "increment the loop variable" and that there is one loop variable for the whole loop, not a new fresh variable "i" every time through! We certainly would not make this proposed change apply to "for" loops.

And fourth, though this is a nasty gotcha, there is an easy workaround, and tools like ReSharper detect this pattern and suggest how to fix it. We could take a page from that playbook and simply issue a compiler warning on this pattern. (Though adding new warnings brings up a whole raft of issues of its own, which I might get into in another post.) Though this is vexing, it really doesn't bite that many people that hard, and it's not a big deal to fix, so why go to the trouble and expense of taking a breaking change for something with an easy fix?

Design is, of course, the art of compromise in the face of many competing principles. "Eliminate gotchas" in this case directly opposes other principles like "no breaking changes", and "be consistent with other language features". Any thoughts you have on pros or cons of us taking this breaking change in a hypothetical future version of C# would be greatly appreciated.

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)

 

  • I have to agree with Pavel on all his responses, especially the one about the plain meaning of foreach.  It would be fantistic and best if foreach actually meant "for each item in the collection."  I thought Eric's second reason was particularly weak as the foreach statement already has the problem of the right side being evaluated before the left as pointed out in the last post about "foreach (var m in from m in data orderby m select m)".

    That being said, I believe Eric's reason number 1 "breaking change" out ways everything else.  The train has already left the stop, it's too late to get on board.

  • If the lifetime of the range variable in foreach was to be changed, then it would not only conflict with a simple "for" it would conflict with the meaning for foreach in other environments.

    IF consistancy was to be the prime driver [I do NOT believe it should be] then LINQ as the new-comer to the block would be the one to change [I DONT think it should be changed]

    Consider code:

       for (var x=C.First(); x<=C.Last(); x=C.Next) {}

    and

       foreach (var x in C) {}

    If these behaved differently many more [IMHO] people would be confused.

    ps: I DO expect people to study and learn a language over a period of time [months to years], I am not a believer in the idea of picking up a programming language in "21 days" [even though I have been the Technical Editor of such books <grin>]. In general I would expect a person to have about 3 months [ about 225 hours] of study/usage of C# before they ever see LINQ or Lambdas....

  • It seems to me that the responses that run "this is obvious to anyone who understands lambdas and closures" have missed the point. Anyone who understands lambdas and closures can state the behavior of the "desweetened" (soured?) code at a glance, but few people correctly predict this code. I think that Pavel Minaev has divined the reason correctly: the "each" in foreach makes a stronger impression than the "for", and to most people it suggests a per-instance lifetime.

    I am tempted to remind you about that hobgoblin, a foolish consistency, but perhaps you simply can't bear to change the lifetime of foreach variables. I also understand that you must avoid head-to-head competition with partners like Resharper. But perhaps, in cases where the specification defines one syntactic construct as a sweetener for another, it would be permissible to allow Visual Studio to mark such sugar and display the true structure on demand.

  • i don't believe it really does matter,

    i think in both of situations one should  (i mean not intuitively, but  by knowledge) know what he is using, and how it behaves.

    but...

    In the 'change' case you'll have allowed someone (most people by most of your sayings) code and it works fine, but they don't know what is under the hood...  and that's what i'll never appreciate.

    if you say how 'foreach' reads naturally,  (i dont want to be arguing readibility) than i remind its a formal language .

    if you stick on with "i suppose this work by this way"ers to change this,,, uueh , that s another thing.

  • IMHO the current behavior is fine, although a bit confusing the first time you realize it. However a compiler warning would be nice...

  • The foreach variable and the range variable in a from clause are conceptually nearly identical, so I think they should be as similar in behaviour as possible. Declaring the foreach variable inside the generated while loop would be a great step in that direction.

  • Have you considered the possibility that the people who submitted these bug reports:

    A) don't understand lambda expressions

    instead of

    B) don't understand foreach

    No, you're right. Usually its a failure to understand that lambdas close over variables, not values. But who cares? The point is that people are writing broken code because the language semantics do not match people's intuition in some way. My goal is not to write a language which punishes people for failing to understand the nuances of closure semantics; my goal is to write a language that naturally encourages writing correct code, even when the developer has a slightly incorrect mental model. Believe me, plenty of developers have incorrect mental models of how C# works in almost every respect. A language that can *only* be correctly used by experts isn't what we're after here. -- Eric

  • > Have you considered the possibility that the people who submitted these bug reports:

    > A) don't understand lambda expressions

    > instead of

    > B) don't understand foreach

    If people were simply misunderstanding lambdas, we'd see a similar amount of confusion when capturing for-loop counters, but it doesn't seem to be the case (at least judging by the number of SO questions). In fact, I've yet to see anyone being confused about for-loops in a similar case; it's invariably foreach only.

  • Pavel - I believe that is because most people who are using for-loop are not using lambdas and that this accounts for the skew.

    If you don't mind please contact me privately to discuss further...

    My contact info is in my MVP Profiile...

    https://mvp.support.microsoft.com/profile=9CAAC837-1258-4B5F-B6D0-7D9CD800EBBB

  • I'm strongly in favor of making this change. The main reason is not so much that the current behavior is unintuitive - which it is, even when you understand the reasons for it - it's that it's unUSEFUL.

    If there were two possible things you might *want* to do - even if one of them was more common than the other - the argument against breaking changes would clearly be a reason not to ever consider making the change, because you never want to break working code.

    But as it stands, the current behavior, while "consistent" in some senses (and I really don't buy most of the consistency arguments; the variable is readonly so it's inconsistent that it changes, too), is useful for nothing at all. C# had the good sense to not keep the C behavior where loop variables are still in scope after the loop is finished and holding their last value around - because it's confusing and not useful and leads to bugs. This is similar.

    I suggest that if you aren't going to actually change the scope of the loop variable in this case, I'd go one stronger than a warning and make attempts to close over a foreach loop variable actually an error. Someone who wants the weird current behavior for some reason can still get it:

    int sharedValue;

    foreach (var value in values) {

     sharedValue = value;

     doSomething(() => sharedValue);

    }

    Basically, making it an error *forces* the programmer to disambiguate. Making it work the "expected" way would be preferable, as far as I'm concerned, but if there's too much resistance to that, at least require the programmer to spell out what they want if they want something so deeply counterintuitive.

  • I would have to agree with not changing the current behavior.

    The current behavior also matches what happens when you attempt the same thing in javascript:

    var a = [1, 2,3,4]

    var funcs = []

    for (var i in a) {

        funcs.push(function() { return i; });

    }

    for (var f in funcs) {

      console.log( f() );

    }

    will print out "4 4 4 4".

    Its helpful to think as lambdas and closures working on a semantic scope;  A method is a natural definition for semantic scope (I'm just kind of winging that definition; I am not a compiler guy), so in JS the preferred solution is to define another lambda that closes around what you need like this:

    for (var i in a) {

        (function(x) {

             funcs.push(function() { return x; });

        })(i);

    }

    --OR--

    for (var i in a) {

       funcs.push(function(x) {

          return function() { return x; };

      }(i));

    }

    Fiirst class functions are a beautiful thing. :-)

  • > The current behavior also matches what happens when you attempt the same thing in javascript ... a method is a natural definition for semantic scope

    Not in any C-derived language with the notable exception of JavaScript. Natural definition of scope was always block - {} - and never the entire function. In fact, it seems to be one of the most hated things about JS:

       var x = 1;

       if (...) {

         var x = 2; // no error

       }

       // x is now 2

    It's bad enough that they've added "let(x=1) {...}" specifically to support better scoping in JS1.7 (https://developer.mozilla.org/en/New_in_javascript_1.7#let_statement).

  • Avoid writing code like this at all, regardless of how C# handles it. But if you must write code like this to impress your friends, C# appears to be handling things correctly as is.

  • Parallel.ForEach in c# 4.0 behaves like the proposed change to foreach, not like the current implementation.  It also somewhat breaks the scoping for the loop variable.

    In fact, implementing my own Parallel.ForEach (because I'm impatient) was the first time I ran across this unexpected behavior.  The behavior so surprised my coworker that he bet me that wasn't it.  This seems like an argument for changing the behavior, however it's also the kind of thing that if you've seen it once, you'll learn your lesson and not repeat the mistake - and get a better understanding of closure semantics in the process - so I'd favor a warning over any kind of breaking behavior change.

  • And *I* think we should paint the bike shed cornflower blue!

Page 2 of 9 (134 items) 12345»