String interning and String.Empty

Here's a curious program fragment:

object obj = "Int32";
string str1 = "Int32";
string str2 = typeof(int).Name;
Console.WriteLine(obj == str1); // true
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // false !?

Surely if A equals B, and B equals C, then A equals C; that's the transitive property of equality. It appears to have been thoroughly violated here.

Well, first off, though the transitive property is desirable, this is just one of many situations in which equality is intransitive in C#. You shouldn't rely upon transitivity in general, though of course there are many specific cases where it is valid. As an exercise, you might want to see how many other intransitivities you can come up with. Post 'em in the comments; I'd love to see what obscure ones you can come up with. (Incidentally, one of the interview questions I got when applying for this team was to invent a performant algorithm for determining intransitivities in a simplified version of the 'better method' algorithm.)

Second, what's happening here is that we're mixing two different kinds of equality that just happen to use the same operator syntax: reference equality and value equality. Objects are compared by reference; in the first and third comparisons we are testing whether the two object references refer to exactly the same object. In the second comparison we are checking whether the two strings have the same content, regardless of whether they are the same object or not. In fact, the compiler warns you about this situation; the first and third comparisons should each produce a "possible unintended reference comparison" warning.

That might need a bit more explanation. In .NET you can have two strings that have identical content but are different objects. When you compare those strings as strings, they're equal, but when you compare them as objects, they're not.
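Here is a minimal sketch of that distinction. The class and method names are made up for illustration; the only assumption is that a string built from a non-empty char array at runtime is a fresh object rather than the literal:

```csharp
using System;

public static class SameContentDifferentObjects
{
    // Builds "Int32" at runtime, so the result is a new object, not the literal.
    public static string BuildAtRuntime() =>
        new string(new[] { 'I', 'n', 't', '3', '2' });

    public static void Main()
    {
        string literal = "Int32";
        string built = BuildAtRuntime();

        // Both operands are typed as string: value comparison.
        Console.WriteLine(literal == built);                 // True
        // Both operands are typed as object: reference comparison.
        Console.WriteLine((object)literal == (object)built); // False
    }
}
```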

That explains why the second comparison is true -- it's a value comparison -- and why the third comparison is false -- it's a reference comparison. But it doesn't explain why the first and third comparisons are inconsistent with each other.

This is the result of a small optimization. If you have two identical string literals in one compilation unit then the code we generate ensures that only one string object is created by the CLR for all instances of that literal within the assembly. This optimization is called "string interning".

This explains why the first comparison is true: the two literals in fact get turned into the same string object. And it explains why the third comparison is false: the literal and the computed value are turned into different objects.

Knowing that, you can now make an educated guess as to why we have this bizarre behaviour:

object obj = "";
string str1 = "";
string str2 = String.Empty;
Console.WriteLine(obj == str1); // true
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // sometimes true, sometimes false?!

String.Empty is not a constant, it's a read-only field in another assembly. Therefore it is not interned with the empty string in your assembly; those are two different objects.

Some versions of the .NET runtime automatically intern the empty string at runtime, some do not!

But why, you might ask, do we not perform this interning optimization at runtime on every string? Why not aggressively turn all value-equal strings into reference-equal strings? Surely it is wasteful to keep two identical strings around when a single copy would take half as much memory.

The answer is that the TANSTAAFL Principle applies here, big time. That is, There Ain't No Such Thing As A Free Lunch. Interning has two positive effects: it decreases memory consumption, and it decreases the time required to compare two strings, because if all strings are interned at runtime then all string comparisons can be cheap reference comparisons. But those positive effects have a cost: allocating a new string now requires that you search all string objects in memory to see if you already have one that matches. In our existing optimization, the cost is small; we can know at compile time what string literals are in a given assembly and which are identical. With the proposed optimization, that cost is imposed at runtime, and it could be a very large fraction of the time spent allocating strings.

In order to keep the time cost down, you'd have to build a hash table of all strings in memory. That means either computing the hashes frequently, which is itself expensive in time, or storing the hashes somewhere. If we do the latter then suddenly we are increasing the memory burden for strings that are not duplicated. That is, our optimization makes the normal scenario -- the vast majority of pairs of strings are not equal to each other -- take up more memory, so that a rare scenario saves on memory. That seems like a bad bargain; you usually want to optimize for the likely case.
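To make that cost concrete, here is a rough sketch of a hash-table-based pool. InternPool is a hypothetical helper written for this illustration; it is not a framework type and not how the CLR implements its own pool:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a private pool keyed by string content.
public sealed class InternPool
{
    private readonly Dictionary<string, string> pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public int Count => pool.Count;

    // Returns the canonical instance for this content, adding it on first sight.
    public string Intern(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        // Every call pays for hashing the whole string plus a lookup.
        if (pool.TryGetValue(value, out string canonical))
            return canonical;
        // The dictionary holds a strong reference to every pooled string.
        pool.Add(value, value);
        return value;
    }
}

public static class InternPoolDemo
{
    public static void Main()
    {
        var pool = new InternPool();
        string a = pool.Intern(new string(new[] { 'i', 'n', 't' }));
        string b = pool.Intern(new string(new[] { 'i', 'n', 't' }));
        Console.WriteLine(object.ReferenceEquals(a, b)); // True
        Console.WriteLine(pool.Count);                   // 1
    }
}
```

Even in this toy version, each lookup hashes the entire string, and the table itself takes memory whether or not any duplicates ever show up.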

There are also serious lifetime problems with interned strings. When can they be safely garbage collected? What if a new copy of the string is created while the old one is being collected on another thread? The safest thing to do is to make interned strings immortal, which looks like a memory leak. Memory leaks are bad for performance, particularly when the optimization you're doing is an attempt to save memory. TANSTAAFL!

In short, it is in the general case not worth it to intern all strings. However, it might be worth it in some specific cases. For example, if you were building a compiler in C#, odds are good that you are going to be producing a lot of strings that are the same at runtime. Our C# compiler is written in C++, in which we have written our own custom string interning layer so that we can do cheap reference comparisons on all strings in your program. Odds are good that "int" is going to appear tens, hundreds or thousands of times in a given program; it seems silly to allocate the same string over and over again. If you were writing a compiler in C#, or had some other application in which you felt that it was worth your while to ensure that thousands of identical strings do not consume lots of memory, you can force the runtime to intern your string with the String.Intern method.
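A short sketch of String.Intern in use. The Build helper is made up for illustration and exists only to produce distinct objects with the same content:

```csharp
using System;

public static class StringInternDemo
{
    // Builds the text at runtime so that each call yields a new object.
    public static string Build() => new string(new[] { 'i', 'n', 't' });

    public static void Main()
    {
        string first = Build();
        string second = Build();
        Console.WriteLine(object.ReferenceEquals(first, second)); // False

        // String.Intern returns the runtime's pooled instance for this content.
        string internedFirst = string.Intern(first);
        string internedSecond = string.Intern(second);
        Console.WriteLine(object.ReferenceEquals(internedFirst, internedSecond)); // True
    }
}
```

Only the references returned by String.Intern are guaranteed to be the same object; the strings you passed in are unchanged.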

Conversely, if you hate interning with an unreasoning passion, you can ask the runtime to turn off string literal interning in an assembly with the CompilationRelaxations attribute (CompilationRelaxations.NoStringInterning).

Anyway, to come back to the question of transitivity: object reference equality actually is transitive. It's also symmetric (A==B implies B==A) and reflexive (A==A), so it is an equivalence relation. Similarly, string value equality is transitive, symmetric and reflexive, since it uses a straight "character by character" ordinal comparison. But when you mix the two, then equality is no longer transitive. That's weird, but hopefully now understandable.
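As a small sketch of the point, sticking to one kind of equality for all three comparisons gives consistent answers. The names are made up, and the third string is built at runtime so that it is certainly a distinct object:

```csharp
using System;

public static class ConsistentEquality
{
    // Same content as the literal "Int32", but a separate object.
    public static string Distinct() =>
        new string(new[] { 'I', 'n', 't', '3', '2' });

    public static void Main()
    {
        object obj = "Int32";
        string str1 = "Int32";
        string str2 = Distinct();

        // Value equality throughout: Equals dispatches to String.Equals at runtime.
        Console.WriteLine(object.Equals(obj, str1));  // True
        Console.WriteLine(object.Equals(str1, str2)); // True
        Console.WriteLine(object.Equals(obj, str2));  // True

        // Reference equality throughout: str2 is a different object from both.
        Console.WriteLine(object.ReferenceEquals(str1, str2)); // False
        Console.WriteLine(object.ReferenceEquals(obj, str2));  // False
    }
}
```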

  • Not equality, but an intransitive string comparison:

    string s1 = "-0.67:-0.33:0.33";
    string s2 = "0.67:-0.33:0.33";
    string s3 = "-0.67:0.33:-0.33";
    Console.WriteLine(s1.CompareTo(s2));
    Console.WriteLine(s2.CompareTo(s3));
    Console.WriteLine(s1.CompareTo(s3));

  • It might be good to mention another optimization:

    string x = new string(new char[0]);
    string y = new string(new char[0]);
    Console.WriteLine(object.ReferenceEquals(x, y)); // true

    From http://stackoverflow.com/questions/194484/whats-the-strangest-corner-case-youve-seen-in-c-or-net

  • @Brian

    > Are you saying that if assembly A and assembly B both contain the same literal string there will be two copies of this in memory or am I reading this backwards? Because as far as I have been able to observe, that is not the case.

    You're right, and I'm wrong (and I have no idea where I got this notion from). In fact, it's quite obvious now that I think of it - there's only one string pool, so assemblies don't matter.

    Which, obviously, means that my guess at the rationale of String.Empty is entirely wrong, as well. Back to square one.

    @Denis

    > the call to StringBuilder.ToString() defeats the interning, somehow

    That one is actually pretty straightforward (and Eric has already explained it in the post): only literals (including those produced by constant expressions at compile-time, like "a"+"b") are interned by default. The return value of StringBuilder.ToString() is not a literal.

  • As usual Eric, fantastic article.

    Timely as well. Read your article this morning and this afternoon discovered some weird behavior with the equality operator.

    I expected the first 6 to be True (especially #6 as I would think it would be equal for both reference and value equality.)

    What is going on here?

    double d1 = double.NaN;
    double? d2 = double.NaN;
    double d3 = double.NaN;
    double d4 = d3;
    Console.WriteLine(d1 == d2.Value);         // False
    Console.WriteLine(d1 == d3);               // False
    Console.WriteLine(d2.Value == d3);         // False
    Console.WriteLine(d2.Value == double.NaN); // False
    Console.WriteLine(d1 == double.NaN);       // False
    Console.WriteLine(d4 == d3);               // False
    Console.WriteLine(double.IsNaN(d1));       // True
    Console.WriteLine(double.IsNaN(d2.Value)); // True

    First off, reference equality doesn't come into it; you have no reference types at all in this program fragment.

    NaN means "not a number", and NaNs are special. In particular, the floating point standard requires that NaN == NaN be false. Basically, NaN means "the result is unknown or nonsensical."  You have two results which are unknown or nonsensical. Let's suppose the two results are the total sales for October 10th, which are unknown, and the total sales for February 31st, which are nonsensical. You compare them for equality. Does it make any sense to say "why yes, those two figures are equal!" ?  Of course not. So NaNs never equal each other. 

    Note that "null" in VB has this same property; if you compare null to null in VB, you get null, not true or false.

    See the IEEE 754 specification for more details. -- Eric 
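
    A small sketch of those rules, with made-up variable names. Note that the Equals method on double deliberately differs from the == operator here:

    ```csharp
    using System;

    public static class NaNEquality
    {
        public static void Main()
        {
            double a = double.NaN;
            double b = double.NaN;

            // The == operator follows IEEE 754: NaN never equals NaN.
            Console.WriteLine(a == b);          // False
            // IsNaN is the reliable test for NaN.
            Console.WriteLine(double.IsNaN(a)); // True
            // Double.Equals treats two NaN values as equal.
            Console.WriteLine(a.Equals(b));     // True
        }
    }
    ```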

     

  • So why isn't String.Empty a constant? (I know this was asked before, but it seems the only answer given was later invalidated.) I guess that since String.Empty apparently IS interned it probably doesn't make any difference, but I'm interested in the answer.

  • Joren,

    Of course, but apart from the costs, what else could be a reason not to switch to C#?

    Don't get me wrong, I love C++, but it would be *nice* to see the C# compiler being self-hosting

    http://en.wikipedia.org/wiki/Self-hosting

    // Ryan

  • @Ryan Heath - There have been hints that this may happen. It's likely it's something they're currently looking into.

  • @Ryan,

    "Apart from the costs" I don't think there is a reason not to write the C# compiler in C#. But that's like asking, apart from my height and lack of athletic ability, for what other reason can't I be an All-Star professional basketball player? You have to live in reality. Cost is usually the reason that desirable things don't get done, in software and the rest of the world.

    In fact the C# team has talked about exposing the compiler as a managed service to aid metaprogramming, scripting, and other scenarios. Anders himself spoke about it at PDC last year. So I imagine we might see it happen. But it has to make it to the top of the priority list, past a whole lot of other desirable things (as Eric has often spoken of).

  • It seems that string interning happens also between assemblies (VS2008 SP1):

    object obj = new StringContainer().Value; // in another project, returns "Int32" as object
    string str1 = "Int32";
    string str2 = typeof(int).Name;
    Console.WriteLine(obj == str1); // true. I was expecting false
    Console.WriteLine(str1 == str2); // true
    Console.WriteLine(obj == str2); // false

  • If I run this code in a new console application, I get the behavior you indicate (true/true/false). When I look at the generated assembly in Reflector, however, the CompilationRelaxations attribute is present, with string literal interning disabled [CompilationRelaxations(8)]. So if string interning is disabled, why am I getting the behavior that should only occur if string literal interning is enabled?

  • Thanks for the great explanation. I almost completely forgot the idea of string interning from my time with C++ after I had moved to .NET.

    We had a discussion about const fields in this question:

    http://stackoverflow.com/questions/1819117/c-do-const-fields-use-less-memory

    As I understand it, references to const fields are replaced with their actual values at compile time. Const fields need to be of a value type, with string being some sort of exception. Does the interning rule kick in when it detects that a constant is a string?
