Closing over the loop variable considered harmful

Closing over the loop variable considered harmful

Rate This

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The "for" loop will not be changed. We return you now to our original article.


I don't know why I haven't blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That's a terrible situation for everyone; we very much wish to design a language which does not have "gotcha" features like this.

But I'm getting ahead of myself. What's the output of this fragment?

var values = new List<int>() { 100, 110, 120 };
var funcs = new List<Func<int>>();
foreach(var v in values)
  funcs.Add( ()=>v );
foreach(var f in funcs)
  Console.WriteLine(f());

Most people expect it to be 100 / 110 / 120.  It is in fact 120 / 120 / 120. Why?

Because ()=>v means "return the current value of variable v", not "return the value v was back when the delegate was created". Closures close over variables, not over values. And when the methods run, clearly the last value that was assigned to v was 120, so it still has that value.

This is very confusing. The correct way to write the code is:

foreach(var v in values)
{
  var v2 = v;
  funcs.Add( ()=>v2 );
}

Now what happens? Every time we re-start the loop body, we logically create a fresh new variable v2. Each closure is closed over a different v2, which is only assigned to once, so it always keeps the correct value.

Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for

  {
    IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
    try
    {
      int m; // OUTSIDE THE ACTUAL LOOP
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }

If we specified that the expansion was

    try
    {
      while(e.MoveNext())
      {
        int m; // INSIDE
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }

then the code would behave as expected.

It's compelling to consider fixing this for a hypothetical future version of C#, and I'd like to hear your feedback on whether we should do so or not. The reasons FOR making the change are clear; this is a big confusing "gotcha" that real people constantly run into, and LINQ, unfortunately, only makes it worse, because it is likely to increase the number of times a customer is going to use a closure in a loop. Also, it seems reasonable that the user of the foreach loop might think of there being a "fresh" loop variable every time, not just a fresh value in the same old variable. Since the foreach loop variable is not mutable by user code, this reinforces the idea that it is a succession of values, one per loop iteration, and not "really" the same variable over and over again. And finally, the change has no effect whatsoever on non-closure semantics. (In fact, in C# 1 the spec was not clear about whether the loop variable went inside or outside, since in a world without closures, it makes no difference.)

But that said, there are some very good reasons for not making this change.

The first reason is that obviously this would be a breaking change, and we hates them, my precious. Any developers who depend on this feature, who require the closed-over variable to contain the last value of the loop variable, would be broken. I can only hope that the number of such people is vanishingly small; this is a strange thing to depend on. Most of the time, people do not expect or depend on this behaviour.

Second, it makes the foreach syntax lexically inconsistent. Consider foreach(int x in M()) The header of the loop has two parts, a declaration int x and a collection expression, M(). The int x is to the left of the M(). Clearly the M() is not inside the body of the loop; that thing only executes once, before the loop starts. So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

Third, it would make the "foreach" semantics inconsistent with "for" semantics. We have this same problem in "for" blocks, but "for" blocks are much looser about what "the loop variable" is; there can be more than one variable declared in the for loop header, it can be incremented in odd ways, and it seems implausible that people would consider each iteration of the "for" loop to contain a fresh crop of variables. When you say for(int i; i < 10; i += 1) it seems dead obvious that the "i += 1" means "increment the loop variable" and that there is one loop variable for the whole loop, not a new fresh variable "i" every time through! We certainly would not make this proposed change apply to "for" loops.

And fourth, though this is a nasty gotcha, there is an easy workaround, and tools like ReSharper detect this pattern and suggest how to fix it. We could take a page from that playbook and simply issue a compiler warning on this pattern. (Though adding new warnings brings up a whole raft of issues of its own, which I might get into in another post.) Though this is vexing, it really doesn't bite that many people that hard, and it's not a big deal to fix, so why go to the trouble and expense of taking a breaking change for something with an easy fix?

Design is, of course, the art of compromise in the face of many competing principles. "Eliminate gotchas" in this case directly opposes other principles like "no breaking changes", and "be consistent with other language features". Any thoughts you have on pros or cons of us taking this breaking change in a hypothetical future version of C# would be greatly appreciated.

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)

 

  • Nice analysis, this is a classic trap to fall into.

  • The way you described how the foreach variable should be scoped inside the foreach loop, seems the right way to be.

    But,

    I'm afraid this has allot of potential to introduce multi targeting bugs, code written with the new foreach rules will not work correctly when re targeted to older compilers (C# 2.0, C# 3.0). There are allot of libraries (we have 3 internal libraries for Desktop and CF version of .Net and 1 for Desktop and Silverlight) that are multi targeted for various versions of .Net framework (Desktop, Compact Framework or Silverlight).

    If this change is put in the new compiler there will be a new multi targeting gotcha, instead of a lambda over foreach variable gotcha.

    I don't think there is a real use for the way closures work right now with foreach, and that there is someone who depends on the feature, but I don't think that's the real issue, I think the real issue is multi targeting.

  • While we're on about counter intuative things with the foreach loop how about the fact that

               List<Animal> AnimalList = new List<Animal>();
               AnimalList.Add(new Dog());
               AnimalList.Add(new Cat());
               foreach (Dog d in AnimalList)
                   d.Bark();

    Produces an InvalidCastException instead of barking the dog but not the cat?  After all, I said for each dog in the animal list.  If I wanted to try to bark every animal I would've written

               List<Animal> AnimalList = new List<Animal>();
               AnimalList.Add(new Dog());
               AnimalList.Add(new Cat());
               foreach (Animal a in AnimalList)

    The foreach loop was designed for a language without generics. In the C# 1 world, most of the time you're iterating over a sequence of objects that all happen to be of the same type, but all you know about them at compile time is that they're "object". Now, suppose you are iterating over what you think is a sequence of dogs, but a cat got in there by accident. Remember, we don't know that it's a list of animals, we know that its a sequence of objects and that's all. Is the right thing to do to ignore the error and keep on trucking, or to inform the developer that their assumptions are wrong, and this in fact is not a list of dogs? C# is not a "muddle on through and hope for the best" language. Rather, a design point of C# is that being brittle is better than being lax. A lax compiler hides your bugs, and we do not want to hide your bugs, we want to bring them to your attention.

    If what you want to do is iterate through only the members of a particular sequence that are of a particular type, we have an extension method for that: 

    foreach (Dog d in AnimalList.OfType<Dog>())

    -- Eric

  • Yeah, you've got me there.  The two situations aren't really analogous at all.  I bow to your superior reasoning.  Although AnimalList.OfType<Dog> still requires extra knowledge about the problem, just like the proposed foreach(new dog d in AnimalList) syntax, I'd say that gets outweighed by the risk of compiling code that could be silently broken.

  • @P.Baughman - one way to avoid your problem is to always use var for the loop variable in foreach. This has the effect of neatly disabling the casting behaviour; then as Eric says you can also use OfType to filter the list.

  • @Pavel

    Thank you! That clears it up :)

  • My vote: change this!

    I agree mostly with Stuart Ballard and Pavel Minaev.

  • I say: change it!

    Like you said, nearly nobody is (or should be) depending on the current behavior, and it is counter intuitive.  

    About consistency, I think a control structure is special anyway, so this perceived inconsistency is not important. Consistency is there to make a language easier to understand, not for it's own sake. The fact that we are writing foreach(var foo in foos) instead of foreach(foos as var foo) is basically a historical accident, with no good reason for one or the other except for what people are used to. [php uses the second variant]

    The standard for loop isn't very consistent in this respect either: in for(int i=0; i<10; i++) the first clause is evaluated once, while the second and third clauses are both evaluated every loop. Sure, they are separated by a semicolon instead of a keyword, but they are still within brackets in the same line in a single control structure. And if the first semicolon means that expressions are evaluated at different times, why doesn't the second mean the same? There is no real consistency to start with.

    So do whatever you can to fix language warts if at all possible. The longer you wait, the harder it becomes to fix.

  • I think is better a warning because that's the purpose of a waring, it isn't?. Also it's important for backward compatibility.

  • I see a lot of comments here about backward compatibility, and that surprises me in this case.

    The set of cases that actually change is minuscule - in fact, I doubt the current behavior is used very often at all right now.  How often do people really make closures including the loop variable inside a foreach loop and then actually use that closure after the current iteration?

    That's got to be a really weird corner case.

    After all, even with this proposal, the closures that are immediately used (say, as part of an immediately executed linq query) would behave identically.

    I mean, when you close over a loop variable, you must be inside the loop, and you thus must be doing so repeatedly as part of the loop.  The whole point of a loop is to execute code for various values - but constructing various (identical) closures over a single variable is inherently an odd  thing to do.

    Does *anybody* actually rely on the current behaviour?  Backwards compatibility is only a reasonable argument if there actually exists code to retain compatibility with.

  • Eamon: It's  unlikely that anybody knowingly relies on the current behavior, but it's quite likely that many people unknowningly rely on the behavior. You can easily write code to rely on any obscure behavior, and never realize it until the behavior changes.

  • I vote for the change.

    I don't think than the behavior of the foreach should match the behavior of the for because it is a higher-level language construct. It feels more natural to me that foreach(var x in y) is implemented as:

    for(int i = 0; i < y.Count; i++)

    {

     const var x = y[i];

     // your foreach body goes here

    }

    (actually it is the way I implement it once as a macro in C code).

    And I think this even can enable more agressive optimization of simple foreach body: use IEnumerable.Current value instead of assigning it first to a variable, for example in loop like this:

    int sum = 0; foreach(int x in y) sum += y;

  • > And I think this even can enable more agressive optimization of simple foreach body: use IEnumerable.Current value instead of assigning it first to a variable

    Why do you think it's an optimization?

  • From a Functional Programmer's perspective: make the change.  Wait, that's what Sebastian Good said!  Perhaps our minds have been warped by excessive exposure to Matthias Felleisen.

    Final comment: grep for the word "clearly". These (in general) represent points where the author is not convinced of his own argument.

    Pavel +1

    Sebastian +1

  • Sorry for getting in the discussion quite late. I understand the delicacy of the issue and I hate inconsistency but  I do lean towards a change. HOWEVER I think this bad can be cured by introducing a compiler flag somewhere to control the behaviour. If you convert your old project to C# 5 you get the "old behaviour" flag if you have some code that would run into the issue (along with some warning). New project = new behaviour flag. As versions of C# progress the compiler flag setting gets more hidden somewhere deep in the GUI :)

    I think warnings are a bad idea. Either you define the construct as a shorthand for the variable being declared outside the loop or inside the loop. Both behaviuors are logically sound and ok for usage. I'd argue that the actual problem is that developers assume one behaviour whereas the language is implemented using the other. I may be cynical but I'd argue that this mismatch can only be remedied by changing the compiler behaviour or prohibiting it (warning/errors). If you continue having a mismatch, people will produce errors. Which is the worst outcome of all. If you have a problem, take the cure now, not later. And - people WILL loathe the C# compiler for having to introduce that weird copying all the time.

    You give four arguments to why change is a bad idea:

    1. breaking changes

    2. Lexical inconsistency since "left" should execute before "right"

    3. Inconsistency with "for" semantics

    4. There is already a good workaround

    As for 1, I agree, but I think the bad things can be overcome (see my first paragraph).

    As for 4, this is not really an argument. If people make errors, then it's bad. And it is a common error.

    As for 2, I don't agree. The syntax "foreach(var x in Y)" basically conveys that you pick out objects "in" the bucket Y and name each object x - not the you place those Y objects in a separate container x. They're suppose to be "in" Y, right? Contrary to your argument I feel that "in" implies that it has to execute after and that the left -> right is a vague language idea. I believe users misunderstanding the current compiler's behaviour is evidence for this.

    As for 3, I don't think it's a fair comparison. In a for loop you declare an index variable that HAS to live throughout the iteration. It's clear. The foreach loop on the other hand is subject to a totally different idea interpretation.

Page 8 of 9 (134 items) «56789