Simple names are not so simple

Simple names are not so simple

Rate This
  • Comments 14

C# has many rules that are designed to prevent some common sources of bugs and encourage good programming practices. So many, in fact, that it is often quite confusing to sort out exactly which rule has been violated. I thought I might spend some time talking about what the different rules are. We'll finish up with a puzzle.

To begin with, it will be vital to understand the difference between scope and declaration space. To refresh your memory of my earlier article: the scope of an entity is the region of text in which that entity may be referred to by its unqualified name. A declaration space is a region of text in which no two things may have the same name (with an exception for methods which differ by signature.) A "local variable declaration space" is a particular kind of declaration space used for declaring local variables; local variable declaration spaces have special rules for determining when they overlap.

The next thing that you have to understand to make any sense o this is what a "simple name" is. A simple name is always either just a plain identifier, like "x", or, in some cases, a plain identifier followed by a type argument list, like "Frob<int, string>".

Lots of things are treated as "simple names" by the compiler: local variable declarations, lambda parameters, and so on, always have the first form of simple name in their declarations. When you say "Console.WriteLine(x);" the "Console" and "x" are simple names but the "WriteLine" is not. Confusingly, there are some textual entities which have the form of simple names, but are not treated as simple names by the compiler. We might talk about some of those situations in later fabulous adventures.

So, without further ado, here are some relevant rules which are frequently confused. It's rules 3 and 4 that people find particularly confusing.

1) It is illegal to refer to a local variable before its declaration. (This seems reasonable I hope.)
2) It is illegal to have two local variables of the same name in the same local variable declaration space or nested local variable declaration spaces.
3) Local variables are in scope throughout the entire block in which the declaration occurs. This is in contrast with C++, in which local variables are in scope in their block only at points after the declaration.
4) For every occurrence of a simple name, whether in a declaration or as part of an expression, all uses of that simple name within the immediately enclosing local variable declaration space must refer to the same entity.

The purpose of all of these rules is to prevent the class of bugs in which the reader/maintainer of the code is tricked into believing they are referring to one entity with a simple name, but are in fact accidentally referring to another entity entirely. These rules are in particular designed to prevent nasty surprises when performing what ought to be safe refactorings.

Consider a world in which we did not have rules 3 and 4. In that world, this code would be legal:

class C
{
  int x;
  void M()
  {
    // 100 lines of code
    x = 20; // means "this.x";
    Console.WriteLine(x); // means "this.x"
    // 100 lines of code
    int x = 10;
    Console.WriteLine(x); // means "local x"
  }
}

This is hard on the person reading the code, who has a reasonable expectation that the two "Console.WriteLine(x)" lines do in fact both print out the contents of the same variable. But it is particularly nasty for the maintenance programmer who wishes to impose a reasonable coding standard upon this body of code. "Local variables are declared at the top of the block where they're used" is a reasonable coding standard in a lot of shops. But changing the code to:

class C
{
  int x;
  void M()
  {
    int x;
    // 100 lines of code
    x = 20; // no longer means "this.x";
    Console.WriteLine(x); // no longer means "this.x"
    // 100 lines of code
    x = 10;
    Console.WriteLine(x); // means "local x"
  }
}

changes the meaning of the code! We wish to discourage authoring of multi-hundred-line methods, but making it harder and more error-prone to refactor them into something cleaner is not a good way to achieve that goal.

Notice that the original version of this program, rule 3 means that the program violates rule 1 -- the first usage of "x" is treated as a reference to the local before it is declared. The fact that it violates rule 1 because of rule 3 is precisely what prevents it from being a violation of rule 4! The meaning of "x" is consistent throughout the block; it always means the local, and therefore is sometimes used before it is declared. If we scrapped rule 3 then this would be a violation of rule 4, because we would then have two inconsistent meanings for the simple name "x" within one block.

Now, these rules do not mean that you can refactor willy-nilly. We can still construct situations in which similar refactorings fail. For example:

class C
{
  int x;
  void M()
  {
    {
      // 100 lines of code
      x = 20; // means "this.x";
      Console.WriteLine(x); // means "this.x"
    }
    {
      // 100 lines of code
      int x = 10;
      Console.WriteLine(x); // means "local x"
    }
  }
}

This is perfectly legal. We have the same simple name being used two different ways in two different blocks, but the immediately enclosing block of each usage does not overlap that of any other usage. The local variable is in scope throughout its immediately enclosing block, but that block does not overlap the block above. In this case, it is safe to refactor the declaration of the local to the top of its block, but not safe to refactor the declaration to the top of the outermost block; that would change the meaning of "x" in the first block. Moving a declaration up is almost always a safe thing to do; moving it out is not necessarily safe.

Now that you know all that, here's a puzzle for you, a puzzle that I got completely wrong the first time I saw it:

using System.Linq;
class Program
{
  static void Main()
  {
    int[] data = { 1, 2, 3, 1, 2, 1 };
    foreach (var m in from m in data orderby m select m)
      System.Console.Write(m);
  }
}

It certainly looks like name "m" is being used multiple times to mean different things. Is this program legal? If yes, why do the rules for not re-using simple names not apply? If no, precisely what rule has been violated? 

[Eric is on vacation; this posting was pre-recorded.]

  • I admit I had to compile this to see whether it would work (and it does). From my reading of the standard, it sounds like "var m" belongs to the declaration space of the body of the foreach and therefore doesn't overlap with the range variable "m" declared in the query expression. I'm assuming -- but I can't find the exact wording that would back me up -- that the query expression creates its own declaration space nested directly inside that of Main.

  • The foreach expression get transformed into (something like) this:

    foreach (var m in data.OrderBy(m=>m))

    The lambda function parameter of the OrderBy() method gets turned into its own method, which has a different declaration space that the Main() method.

    Therefore none of the rules are violated.

  • I think the whole thing opens up once you realize that the foreach can be translated to something like:

    foreach(var m in data.OrderBy(m => m).Select(m => m))

    {

       System.Console.Write(m);

    }

    The only remaining concern is does the 'var m' declaration overlap with the declarations in the lamda expressions.  I have to admit, I was somewhat surprised that they did not at first, but I guess since:

    1. The m's from the lambda expressions cannot be referenced outside of the lamda

    2. The 'var m' cannot be referenced inside the lamda

    there is no overlap in the declaration space.

  • @ JD:

    I believe you are correct about the foreach's declaration space....that is another compiler translation that I missed. (I did not verify with the spec either, though)  You could actually decompose the foreach and the lambda all the way down to something like

    var sorted = data.OrderBy(delegate(int m){ return m; });  // m exists only in the anonymous delegate

    var e = sorted.GetEnumerator();

    while(e.MoveNext())

    {

      int m = e.Current; // exists only in the while loop

      Console.WriteLine(m);

    }

    When you look at it this way, there is no cause for overlap in declaration space.

  • totally valid, but wouldn't be the other way, as in:

    foreach(var m in data.Select(m => m).OrderBy(m => m))

  • @Chris:

    Try compiling this:

      var m = from m in in new[] { 123 } select m;

    or:

      var m = new[] { 123 }.Select(m => m);

    You'll find that "inner" m is conflicting with "outer" m here in both cases. Lambdas are subject to rules Eric listed just as everything else is, and are not treated as a separate and totally distinct methods for their purpose (which makes sense - after all, lambda can refer to variables in outer scope, so they are in scope for it).

    The only reason why Eric's example is different is because of foreach.

  • Yay. I reasoned it would compile/run, and it did. The declaration space of the linq (lambda?) is separate from that of the foreach statement...

  • @ericlippert> We wish to discourage authoring of multi-hundred-line methods

    Banning regions within methods would be a good start towards that goal.

  • @Pavel "because of foreach"

    The compiler generates an anonymous iterator object for the foreach.  Is the difference, therefore, that the "var m" is "outside"the iterator (i.e., in code that consumes it), while the query is compiled before the iterator can iterate, so the scopes are distinct?

  • This is why simple code should be dumbed down to reduce the likelyhood of an error and reduce ongoing support costs.

    The line below is nearing write-only code and would not pass our code review:

       > foreach (var m in from m in data orderby m select m)

  • Foreach introduces a variable declaration space - the 'm' variable defined in the foreach is only accessible inside the loop (not in the definition part).

    Consider this:

    foreach (int m in DoSomething(m)) { }

    Clearly, the m here is inaccessible for DoSomething because that line is before its declaration space.

    For that reason,

    foreach (int m in DoSomethingWithLambda(m => m)) { }

    is valid - the DoSomethingWithLambda(m => m) part comes before the declaration space.

  • > We can still construct situations in which similar refactorings fail. <...>

    ...And the bottom line is: don't ever, ever be too quick to refactor and rewrite a piece of code you don't like because of aesthetic reasons, or because you're too religious about the Coding Conventions (like that, from the capital letter). Some people may not *know* some of the language rules but may have sussed them out intuitively and so write code that does not look pretty but works (and then stops working when you barge in with your holy Coding Conventions and re-write it based on them).

    I have actually known at least one guy who lost his job that way: he had inherited a *very ugly* code (in C) that worked and, in his arrogance, he re-wrote it *beautifully*, only to cause it to stop working at the exactly wrong moment (on the eve of a pre-scheduled demo).

    Seriously, I think those rules (1 through 4) should be somewhere on a highly visible place in the Visual Studio help (I have VS 2010 beta 2 and that one doesn't have much of a help yet , so maybe it's the plan already...)

  • @Pavel: I actually knew that code equivalent to your example wouldn’t compile, so I figured that the same would apply to `foreach` and thus got the answer wrong. It’s funny that most people here found the right answer for the wrong reasons.

    “The only reason why Eric's example is different is because of foreach.”

    I had a hard time figuring out why this is. The secret is that both lambdas and `foreach` start a new scope, and these scopes are oddly enough not conflicting. So the following code won’t compile because `x` is used both in an outer and inner scope:

       var ys = from x in xs orderby x select x;

       var x = ys.First();

    By contrast, the following *will* compile just fine.

       var ys = from x in xs orderby x select x;

       {

           var x = ys.First();

       }

    So there are three declaration spaces here: the outer declaration space of the `Main` method, the declaration space of the lambda, and the declaration space of the `foreach` loop. The latter two don’t overlap.

  • All you need is the definition of foreach from the language spec (8.8.4):

    ... The above steps, if successful, unambiguously produce a collection type C, enumerator type E and element type T. A foreach statement of the form

    foreach (V v in x) embedded-statement

    is then expanded to:

    {

    E e = ((C)(x)).GetEnumerator();

    try {

    V v;

    while (e.MoveNext()) {

    v = (V)(T)e.Current;

    embedded-statement

    }

    }

    finally {

    … // Dispose e

    }

    }

    Note the strategic placement of curly braces.

Page 1 of 1 (14 items)