Closing over the loop variable considered harmful

Closing over the loop variable considered harmful

Rate This

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The "for" loop will not be changed. We return you now to our original article.


I don't know why I haven't blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That's a terrible situation for everyone; we very much wish to design a language which does not have "gotcha" features like this.

But I'm getting ahead of myself. What's the output of this fragment?

var values = new List<int>() { 100, 110, 120 };
var funcs = new List<Func<int>>();
foreach(var v in values)
  funcs.Add( ()=>v );
foreach(var f in funcs)
  Console.WriteLine(f());

Most people expect it to be 100 / 110 / 120.  It is in fact 120 / 120 / 120. Why?

Because ()=>v means "return the current value of variable v", not "return the value v was back when the delegate was created". Closures close over variables, not over values. And when the methods run, clearly the last value that was assigned to v was 120, so it still has that value.

This is very confusing. The correct way to write the code is:

foreach(var v in values)
{
  var v2 = v;
  funcs.Add( ()=>v2 );
}

Now what happens? Every time we re-start the loop body, we logically create a fresh new variable v2. Each closure is closed over a different v2, which is only assigned to once, so it always keeps the correct value.

Basically, the problem arises because we specify that the foreach loop is a syntactic sugar for

  {
    IEnumerator<int> e = ((IEnumerable<int>)values).GetEnumerator();
    try
    {
      int m; // OUTSIDE THE ACTUAL LOOP
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }

If we specified that the expansion was

    try
    {
      while(e.MoveNext())
      {
        int m; // INSIDE
        m = (int)(int)e.Current;
        funcs.Add(()=>m);
      }

then the code would behave as expected.

It's compelling to consider fixing this for a hypothetical future version of C#, and I'd like to hear your feedback on whether we should do so or not. The reasons FOR making the change are clear; this is a big confusing "gotcha" that real people constantly run into, and LINQ, unfortunately, only makes it worse, because it is likely to increase the number of times a customer is going to use a closure in a loop. Also, it seems reasonable that the user of the foreach loop might think of there being a "fresh" loop variable every time, not just a fresh value in the same old variable. Since the foreach loop variable is not mutable by user code, this reinforces the idea that it is a succession of values, one per loop iteration, and not "really" the same variable over and over again. And finally, the change has no effect whatsoever on non-closure semantics. (In fact, in C# 1 the spec was not clear about whether the loop variable went inside or outside, since in a world without closures, it makes no difference.)

But that said, there are some very good reasons for not making this change.

The first reason is that obviously this would be a breaking change, and we hates them, my precious. Any developers who depend on this feature, who require the closed-over variable to contain the last value of the loop variable, would be broken. I can only hope that the number of such people is vanishingly small; this is a strange thing to depend on. Most of the time, people do not expect or depend on this behaviour.

Second, it makes the foreach syntax lexically inconsistent. Consider foreach(int x in M()) The header of the loop has two parts, a declaration int x and a collection expression, M(). The int x is to the left of the M(). Clearly the M() is not inside the body of the loop; that thing only executes once, before the loop starts. So why should something to the collection expression's left be inside the loop? This seems inconsistent with our general rule that stuff to the left logically "happens before" stuff to the right. The declaration is lexically NOT in the body of the loop, so why should we treat it as though it were?

Third, it would make the "foreach" semantics inconsistent with "for" semantics. We have this same problem in "for" blocks, but "for" blocks are much looser about what "the loop variable" is; there can be more than one variable declared in the for loop header, it can be incremented in odd ways, and it seems implausible that people would consider each iteration of the "for" loop to contain a fresh crop of variables. When you say for(int i; i < 10; i += 1) it seems dead obvious that the "i += 1" means "increment the loop variable" and that there is one loop variable for the whole loop, not a new fresh variable "i" every time through! We certainly would not make this proposed change apply to "for" loops.

And fourth, though this is a nasty gotcha, there is an easy workaround, and tools like ReSharper detect this pattern and suggest how to fix it. We could take a page from that playbook and simply issue a compiler warning on this pattern. (Though adding new warnings brings up a whole raft of issues of its own, which I might get into in another post.) Though this is vexing, it really doesn't bite that many people that hard, and it's not a big deal to fix, so why go to the trouble and expense of taking a breaking change for something with an easy fix?

Design is, of course, the art of compromise in the face of many competing principles. "Eliminate gotchas" in this case directly opposes other principles like "no breaking changes", and "be consistent with other language features". Any thoughts you have on pros or cons of us taking this breaking change in a hypothetical future version of C# would be greatly appreciated.

(This is part one of a two-part series on the loop-variable-closure problem. Part two is here.)

 

  • Initially I wanted the warning, but now I think .. just change it.

    My reason has nothing to do with consistency or the backward change.

    Apparently iterating through a collection and making a bunch of lambdas is common.  (Otherwise the bug would not come up.)  Everyone seems to agree that what is wanted is not what happens.  If the syntax exists and the desired semantic is clear, just do what I want and be over it.  Why should I have to clutter up my code with an extra local variable just to show you that I want what you already know I wanted?  

    I like the semantic idea that a foreach loop runs once on each element of a collection, not on a variable that changes with each iteration.  This is very different from the semantic of a for loop in which you explicitly create and increment a single variable.  Consistency is not always a virtue -- why should dissimilar ideas be forced to be consistent?

    I have another question about backwards compatability.  I suspect that most of the code using this pattern right now is buggy code that the bugs haven't been found yet.  It is clear that backwards compatability requires correct code to continue tp work correctly -- does it require buggy code that happened to work by chance to continue to do so?

  • I vote for a compiler warning at least, or a breaking change to the language at most. This bug has biten some of our devs, who are by no means C# rookies, but at least mid-level devs.

    A compiler warnings seems like good fit to this problem.

  • @Stuart - FYI: C++ for loop variables existing past the scope of the loop has NEVER been part of the specification. It was a common bug in many compilers [including VC++ upto and including 6.0].

  • After reading the additional responses, I like Stuart Ballard's answer the best.  Instead of making a WARNING message or "fixing" the problem, "make attempts to close over a foreach loop variable actually an error."  This would make people fix buggy code and make a program "breaking change" into a compile time breaking change.  If the error is well documented it should get rid of all the "incorrect bug reports", force the fixing of a potential bug, and educate the developer at the same time.  Sounds like a WIN-WIN-WIN to me.

    By the way, great post and fantastic responses all around.

  • I'm with Pavel: the "each" is stronger. The natural expectation, to someone actually reading the code (as opposed to a commenter trying to be pedantic about the expansion), is that the item variable should be immutable and scoped to that loop iteration. I was shocked the first time ReSharper warned me about this (and it was right, I had exactly the bug it was warning me about). And when I tell ReSharper to use a different color to highlight mutable variables, every foreach variable gets highlighted. This is not reasonable behavior, especially as C# borrows more and more from functional languages where immutability is the norm.

    As to your arguments about the syntax suggesting that the variable declaration happens first... that just means the syntax is wrong. To steal from another language I'm familiar with, Ruby chose a syntax that doesn't suffer from that ambiguity:

    collection.each do |item|

     ...

    end

    I.e., call the collection's "each" method and pass it a lambda. I know it's a little late to rework the syntax of C#'s foreach keyword, but if LINQ shipped an .Each or .ForEach(Action<T>) method, then C# could use the exact same pattern, with the same clarity:

    collection.ForEach(item => {

     ...

    });

    The extra parentheses around the lambda are yucky, but this would have the desired scoping; and the scoping would be clear at a glance, even to someone unfamiliar with the esoteric details of the expansion.

    If the performance overhead of the extra method call was a concern, the compiler could always look for this pattern and inline the same codegen as foreach does today (but with the right variable lifetime). I'm told that's what most Smalltalks do with their "if" syntax -- syntactically it looks a method call that takes two closures, but in real life it's optimized to use native branch instructions just like you'd expect in any other language.

    My vote is to add an .Each or .ForEach method to LINQ, and then add a compiler warning for this pattern. (If it were me, I'd make sure the text of the warning recommends that you use .Each instead for clarity. But that's just me.)

  • +1 on adding a compiler warning, but absolutely not changing the current behaviour.

  • I'm in favor of making the change.

    Here:

    foreach (int x in GetNumbers()) { ... }

    Is x assigned before GetNumbers is called? Of course not. So something on the left doesn't necessarily happen before something on the right. Now, if I were to exapnd it manually (forget error checking and dispose for now) I'd do:

    var iterator = GetNumbers().GetEnumerator();

    while (iterator.MoveNext()) {

       int x = iterator.Current;

       ...

    }

    Most people I believe would do the same thing. I've commented asking why the compiler does this weirdly on your previous post, and your response was (I'm paraphrasing): It only makes a difference in closures, and even than it's for the worse.

    I say make the change. But maybe someone is using that oddity on purpose? I have thought of a theoretical reason to do that, although I'd do this completely differently:

    bool first = true;

    foreach (int x in GetNumbers()) { // returns numbers 1..Max

       if (first) {

           UpdateProgress(x);

           first = false;

       }

       DoLongOperation(x);

    }

    void UpdateProgress(int x) {

       while (x < Max) {

           UpdateProgressBarInUI(x);

           Thread.Sleep(500);

       }

    }

  • I believe it should have originally been specified in the suggested new way - but I'm slightly opposed to it changing now. In terms of my "pro" vote, I vehemently concur with Pavel about the implication of the "each" part. The way it reads implies a new variable every time - and this is reinforced by the fact that the variable is readonly within the body of the loop. It's readonly, and yet the same variable gets a new value each time through the loop? Weird! It's far easier to conceive of a *new* readonly variable for each iteration of the loop.

    However, my "con" hat is the backward compatibility argument - not for breaking current code, but for making future code break under current compilers. Stay with me, it makes sense :)

    Currently I'm building Noda Time, and we have a Visual Studio solution for both VS2008 and VS2010. We're not using any C# 4 features, but it's nice to be able to build in both. If someone *does* try to use the new C# 4 features, their code will fail to compile in VS2008 - I *think* that even the overload resolution issue (where variance introduces makes some previously invalid candidates valid) won't be an issue so long as we target .NET 3.5. (It may cause problems if we have a configuration targeting .NET 4.0 for the purpose of Code Contracts, but I think it's unlikely to actually hit us.)

    Now fast forward a couple of years, and suppose we have a similar situation with VS2010 and VS2012 - but C# 5 has made a change to the scope of foreach iteration variables. A developer writes perfectly valid C# 5 - which is also perfectly valid C# 4, but which has the wrong semantics. No compiler warnings or errors - at best a ReSharper warning which could easily go unnoticed (as the developer with VS2010 may not even be looking at that code). At that point, you'd better hope your unit tests spot the difference...

    One possible solution to this issue is to make closing over a foreach variable (in a way which lets it escape from the loop; if the delegate never leaves the loop, it's fine - how you detect that is tricky though!) issue a warning in C# 4, and then change the behaviour for C# 5. That way you'd have to be in a situation where two people are compiling the same code, one in VS2012 and one in VS2008, to see a silent change in behaviour.

    I hope this makes sense - I haven't had my first coffee of the day yet - but it does mean that the backward compatibility issue is worse than I had originally imagined.

  • There are situations were closing over a loop variable does not create 'unexpected' results in eighter the current or the proposed situation, like in:

    foreach(var item in list)

    {

       Console.WriteLine(otherlist.where(i => i.name == item).Count());

    }

    As long as the lamba is not executed outside of the current loop itteration there is not really a problem. I think this is a pretty common and valid scenarion that should not yield a warning. problem is ofcourse that for the compiler it will be harder to detect if the lamda 'surives' the loop body.

  • Can you just ask Anders? The answer seems clear, just wondering if C# losing its momentum in conversations abstracting out into account its core metaphors and principles.

    Just come back to English: 'for each apple x in the basket'. *Each* x, come on!

    I think the designers of C# should feel more contact with the ground, do not leak simplicity and original C# spirit in favour of obscure synthetic ideas.

  • Since both sides (language purism vs. practical application) have compelling arguments, why not allow for both scenarios?

    SomeType x;

    foreach(x in collection) { ... }

    foreach(SomeType y in collection) { ... }

    If you want the behavior that the iteration variable should be declared exactly once, then you have to pay the price of stating that intent explicitly and live with the fact that your variable is declared in the containing scope.

    If you want to iterate through a collection and have it "just work," then you use the second syntax, which would declare the variable in the loop body since you didn't take the extra step of declaring it yourself elsewhere.  This aligns well with the code you would have otherwise written by hand:

    for(int index = 0; index < upperBound; index++)

    {

     SomeType y = collection.ItemAt(index);

     // do stuff with y and never touch index or upperBound

    }

    using(IEnumerator e = collection.GetEnumerator())

    {

     while(e.MoveNext())

     {

       SomeType y = (SomeType)e.Current;

       // do stuff with y and never touch e

     }

    }

  • This might be opening an entirely new can of worms... but what if we added some sort of syntax to the foreach construct that specified whether the range variable is declared outside (like it is currently) or inside (like the proposed change) the body of the loop. Consider:

    var values = new List<int>() { 100, 110, 120 };
    var outerFuncs = new List<Func<int>>();
    var innerFuncs = new List<Func<int>>();
    foreach(var v in values)
    {
       // default behavior (variable declared outside loop body)
       outerFuncs.Add( ()=>v );
    }

    foreach(var v in values) inner
    {
       // new behavior (variable declared inside loop body)
       innerFuncs.Add( ()=>v );
    }

    foreach(var f in outerFuncs)
    {
       Console.WriteLine(f());
    } // 120 120 120

    foreach(var f in innerFuncs)
    {
       Console.WriteLine(f());
    } // 100 110 120

  • Also note, that the current implementation DEMANDS use of variable names such as 'v1'.

    I am sure there's a lot of people who will argue 'v1' is an excellent and natural thing to have.

  • My immediate thought was to leave it as it, and avoid the breaking change, despite the fact that I view the variable declared in the foreach as immutable.

    However, I tested the following.

    var funcs = new List<int>() { 100, 110, 120 }.Select(v => () => v).ToList();
    foreach (var f in funcs)
    {
       Console.WriteLine(f());
    }

    And found that it prints 100, 110, 120. I don't know much about the internals of the Select extension method, but I would have thought it used something similar to foreach. I guess it uses enumerators somewhat more explicitly with a new variable declared for each item.

    Your code is the moral equivalent of:

    foreach(int x in intList) funcs.Add(MakeMeAFunc(x));
    ...
    Func<int> MakeMeAFunc(int y) { return ()=>y; }

    See why that works? Every time through the loop you create a new variable y by calling MakeMeAFunc. The closure is closed over the parameter, and the parameter is a fresh new variable on each invocation. (Imagine if it wasn't; recursive functions would be nigh-impossible!) -- Eric

    To be honest, knowing what happens with foreach, I really didn't know what to expect from the above code. I'm glad it did what I thought it should do though.

  • My preference would be to allow 'val' as an alternative to 'var' for an implicitly typed local *value*. When used in a foreach loop this would cause the delegates to capture the value of the iteration variable.

    Admittedly this is a much larger change.

Page 3 of 9 (134 items) 12345»