String interning and String.Empty

Here's a curious program fragment:

object obj = "Int32";
string str1 = "Int32";
string str2 = typeof(int).Name;
Console.WriteLine(obj == str1); // true
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // false !?

Surely if A equals B, and B equals C, then A equals C; that's the transitive property of equality. It appears to have been thoroughly violated here.

Well, first off, though the transitive property is desirable, this is just one of many situations in which equality is intransitive in C#. You shouldn't rely upon transitivity in general, though of course there are many specific cases where it is valid. As an exercise, you might want to see how many other intransitivities you can come up with. Post 'em in the comments; I'd love to see what obscure ones you can come up with. (Incidentally, one of the interview questions I got when applying for this team was to invent a performant algorithm for determining intransitivities in a simplified version of the 'better method' algorithm.)

Second, what's happening here is we're mixing two different kinds of equality that just happen to use the same operator syntax. We're mixing reference equality with value equality. Objects are compared by reference; in the first and third comparison we are testing if the two object references both refer to exactly the same object. In the second comparison we are checking to see if the two strings have the same content, regardless of whether they are the same object or not. In fact, the compiler warns you about this situation; this should produce a "possible unintended reference comparison" warning.

That might need a bit more explanation. In .NET you can have two strings that have identical content but are different objects. When you compare those strings as strings, they're equal, but when you compare them as objects, they're not.
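To make that concrete, here is a minimal sketch. The class and method names (ValueVsReference, Build, EqualAsStrings, EqualAsObjects) are made up for illustration. The strings are built at runtime from a char array, so each call produces a distinct object; which flavour of == you get depends only on the compile-time types of the operands.

```csharp
using System;

public static class ValueVsReference
{
    // Hypothetical helper: builds a fresh string object on every call.
    public static string Build() => new string(new[] { 'a', 'b', 'c' });

    // Both operands are typed string, so == means string value equality.
    public static bool EqualAsStrings(string x, string y) => x == y;

    // Both operands are typed object, so == means reference equality.
    public static bool EqualAsObjects(object x, object y) => x == y;

    public static void Main()
    {
        string first = Build();
        string second = Build();
        Console.WriteLine(EqualAsStrings(first, second)); // True: same content
        Console.WriteLine(EqualAsObjects(first, second)); // False: two different objects
    }
}
```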

That explains why the second comparison is true -- it's a value comparison -- and why the third comparison is false -- it's a reference comparison. But it doesn't explain why the first and third comparisons are inconsistent with each other.

This is the result of a small optimization. If you have two identical string literals in one compilation unit then the code we generate ensures that only one string object is created by the CLR for all instances of that literal within the assembly. This optimization is called "string interning".

String.Empty is not a constant, it's a read-only field in another assembly. Therefore it is not interned with the empty string in your assembly; those are two different objects.

This explains why the first comparison is true: the two literals in fact get turned into the same string object. And it explains why the third comparison is false: the literal and the computed value are turned into different objects.
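If you'd like to see the two cases side by side without depending on what typeof(int).Name happens to return, something like the following sketch should do. The names (LiteralSharing and its methods) are invented for this example, and I'm assuming both literals live in the same assembly, as in the fragment above. The computed string is built from a char array at runtime, so it is a fresh object.

```csharp
using System;

public static class LiteralSharing
{
    // Two identical literals in one assembly: expected to be one object.
    public static bool TwoLiteralsAreOneObject()
    {
        object first = "Int32";
        object second = "Int32";
        return object.ReferenceEquals(first, second);
    }

    // A literal versus a string computed at runtime: different objects.
    public static bool LiteralAndComputedAreOneObject()
    {
        object literal = "Int32";
        object computed = new string("Int32".ToCharArray());
        return object.ReferenceEquals(literal, computed);
    }

    public static void Main()
    {
        Console.WriteLine(TwoLiteralsAreOneObject());
        Console.WriteLine(LiteralAndComputedAreOneObject());
    }
}
```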

Knowing that, you can now make an educated guess as to why we have this bizarre behaviour:

object obj = "";
string str1 = "";
string str2 = String.Empty;
Console.WriteLine(obj == str1); // true
Console.WriteLine(str1 == str2); // true
Console.WriteLine(obj == str2); // sometimes true, sometimes false?!

Some versions of the .NET runtime automatically intern the empty string at runtime, some do not!

But why, you might ask, do we not perform this interning optimization at runtime on every string? Why not aggressively turn all value-equal strings into reference-equal strings? Surely it is wasteful to keep two identical strings around when a single copy would take half as much memory.

The answer is that the TANSTAAFL Principle applies here, bigtime. That is, There Ain't No Such Thing As A Free Lunch. Interning has two positive effects: it decreases memory consumption and decreases time required to compare two strings. (Because if all strings are interned at runtime then all string comparisons can be cheap reference comparisons.) But those positive effects have a cost: allocating a new string now requires that you do a search of all string objects in memory to see if you have one that matches already. In our existing optimization, the cost is small; we can know at compile time what string literals are in a given assembly and which are identical. With the proposed optimization, that cost is imposed at runtime, and it could be a very large fraction of the time spent allocating strings.

In order to keep the time cost down, you'd have to build a hash table of all strings in memory. That means either computing the hashes frequently, which is itself expensive in time, or storing the hashes somewhere. If we do the latter then suddenly we are increasing the memory burden for strings that are not duplicated. That is, our optimization makes the normal scenario -- the vast majority of pairs of strings are not equal to each other -- take up more memory, so that a rare scenario saves on memory. That seems like a bad bargain; you usually want to optimize for the likely case.
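To give a feel for the bookkeeping involved, here is a rough sketch of a hand-rolled pool built on a hash table. This is not how the CLR implements interning; StringPool is a hypothetical class made up for this example, and it is not thread-safe. Note that every call to Intern has to hash the incoming string and probe the table, and that the table itself holds a reference to every distinct string it has ever seen.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, single-threaded intern pool for illustration only.
public sealed class StringPool
{
    // Maps string content to the one canonical object with that content.
    private readonly Dictionary<string, string> pool =
        new Dictionary<string, string>(StringComparer.Ordinal);

    public int Count => pool.Count;

    public string Intern(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));

        // The lookup hashes the string and compares content on a match.
        if (pool.TryGetValue(value, out string existing))
        {
            return existing;
        }

        // First time we see this content: it becomes the canonical copy.
        pool.Add(value, value);
        return value;
    }
}

public static class StringPoolDemo
{
    public static void Main()
    {
        var pool = new StringPool();
        string first = pool.Intern(new string('x', 3));
        string second = pool.Intern(new string('x', 3));
        Console.WriteLine(object.ReferenceEquals(first, second)); // True
        Console.WriteLine(pool.Count); // 1
    }
}
```

Because the pool keeps every string it has seen alive, it also demonstrates the "immortal strings" problem in miniature.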

There are also serious lifetime problems with interned strings. When can they be safely garbage collected? What if a new copy of the string is created while the old one is being collected on another thread? The safest thing to do is to make interned strings immortal, which looks like a memory leak. Memory leaks are bad for performance, particularly when the optimization you're doing is an attempt to save memory. TANSTAAFL!

In short, it is in the general case not worth it to intern all strings. However, it might be worth it in some specific cases. For example, if you were building a compiler in C#, odds are good that you are going to be producing a lot of strings that are the same at runtime. Our C# compiler is written in C++, in which we have written our own custom string interning layer so that we can do cheap reference comparisons on all strings in your program. Odds are good that "int" is going to appear tens, hundreds or thousands of times in a given program; it seems silly to allocate the same string over and over again. If you were writing a compiler in C#, or had some other application in which you felt that it was worth your while to ensure that thousands of identical strings do not consume lots of memory, you can force the runtime to intern your string with the String.Intern method.
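As a small sketch of what that looks like, assume a hypothetical MakeKeyword helper that produces a fresh "int" string each time, much as a lexer might. Passing each copy through String.Intern hands back one shared object, and String.IsInterned reports whether a string with that content is already in the runtime's pool.

```csharp
using System;

public static class InternDemo
{
    // Hypothetical helper: a new "int" string object on every call.
    public static string MakeKeyword() => new string(new[] { 'i', 'n', 't' });

    // Interning two separate copies yields the same object both times.
    public static bool SameAfterIntern()
    {
        string first = string.Intern(MakeKeyword());
        string second = string.Intern(MakeKeyword());
        return object.ReferenceEquals(first, second);
    }

    public static void Main()
    {
        string raw1 = MakeKeyword();
        string raw2 = MakeKeyword();
        Console.WriteLine(object.ReferenceEquals(raw1, raw2)); // False: two allocations
        Console.WriteLine(SameAfterIntern());                  // True: one pooled object
        Console.WriteLine(string.IsInterned(raw1) != null);    // True: "int" is now in the pool
    }
}
```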

Conversely, if you hate interning with an unreasoning passion, you can force the runtime to turn off all string interning in an assembly with the CompilationRelaxations attribute.
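For reference, applying the attribute looks roughly like the fragment below, placed at the top of any source file in the assembly. The type is System.Runtime.CompilerServices.CompilationRelaxationsAttribute and the flag is CompilationRelaxations.NoStringInterning. Per the documentation, the flag marks the assembly as not requiring string-literal interning, so treat it as a hint to the runtime rather than a guarantee that literals will end up as separate objects.

```csharp
using System.Runtime.CompilerServices;

// Assembly-level attribute: literals in this assembly need not be interned.
[assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)]
```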

Anyway, to come back to the question of transitivity: object reference equality actually is transitive. It's also symmetric (A==B implies B==A) and reflexive (A==A), so it is an equivalence relation. Similarly, string value equality is transitive, symmetric and reflexive, since it uses a straight "character by character" ordinal comparison. But when you mix the two, then equality is no longer transitive. That's weird, but hopefully now understandable.
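If you want all three comparisons to agree, one option is to stop mixing the two kinds of equality and ask for value equality explicitly. The sketch below (ConsistentEquality is a name invented for this example) uses the static object.Equals method, which dispatches to the string's own Equals override and therefore compares content even when an operand is typed as object.

```csharp
using System;

public static class ConsistentEquality
{
    // Returns the three comparisons, all done by value.
    public static bool[] Compare()
    {
        object obj = "ab";
        string str1 = "ab";
        // Built at runtime so that it is not the same object as the literal.
        string str2 = "a" + new string(new[] { 'b' });

        return new[]
        {
            object.Equals(obj, str1),
            object.Equals(str1, str2),
            object.Equals(obj, str2),
        };
    }

    public static void Main()
    {
        foreach (bool result in Compare())
        {
            Console.WriteLine(result); // True each time
        }
    }
}
```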

  • I cp'd your code into VS2008 SP1, and I get 'True' on all 3 cases; which seems to be at odds with your article.

  • I wonder if an empty string is special-cased in some way. No matter what I've tried, it looks like an empty string will always refer to the same instance as String.Empty. I got the expected result using other non-interned strings though:

               object obj = "ab";
               string str1 = "ab";
               string str2 = "a" + new string(new char[] { 'b' }); // prevent compiler from computing "a" + "b"

    @John Kraft: Doesn't really affect the point of the article, but still kind of curious.

    You guys are absolutely right. Some versions of the framework intern string.Empty and some do not! I learned something new today; I've updated the text accordingly. Thanks! -- Eric

  • "Conversely, if you hate interning with an unreasoning passion, you can force the runtime to turn off all string interning in an assembly with the CompilationRelaxation attribute."

    That is not what I read at <>:

       NoStringInterning    Marks an assembly as not requiring string-literal interning.

    "Not requiring" and "Forbidding" are two different things.

  • "That means either computing the hashes frequently...or storing the hashes somewhere"

    Hmm...I'm surprised that the String class doesn't cache the hash value already.  Granted, I never gave it much thought.  But Strings are a common object to use as a key in a hashed structure; I'd think that the nominal overhead would be worthwhile for reasons other than interning, and thus interning could simply take advantage of that.

    The question about object lifetime seems less than a "slam dunk" too.  That is, yes...the simplest, safest implementation would simply lead to a huge increase in memory usage.  But is that really the _only_ implementation?

    The time cost problem seems like a much more important point than these other two.

    I think in the end, it's not so much that there's a clear argument against interning every string at run-time, but simply that there is a vague, general argument against it and no terribly compelling need in favor of it.  That is, it _could_ work given enough effort in the implementation, but in the classic cost/benefit analysis, cost is very high and benefit is very low.

  • This post accidentally reveals a deep dark secret, that I have long suspected.....

    C# was created by the Ringworld Engineers!!!!!!!

    Just think how much this really explains <grin>

    [ps: I know Heinlein used it decades before Niven...]

  • Interesting that you wrote a post on this. This is one of my favorite interview questions to developers and they almost ALWAYS get it wrong!

  • Why not make String.Empty a constant? I believe that String.Empty conforms to the definition of a constant.

  • In Java, comparing two string objects with "==" is always a reference comparison, so string comparison is always done using String.equals(). The same concept of literal pools applies in Java, though.

    Sample this:

     String str1 = "xyz";
     Object obj1 = "xyz";
     String str2 = new String("xyz");
     System.out.println(str1 == obj1); // true
     System.out.println(str1 == str2); // false
     System.out.println(str2 == obj1); // false
     System.out.println(str1.equals(obj1)); // true
     System.out.println(str2.equals(obj1)); // true
     System.out.println(((String) obj1).equals(str1)); // true

    I always thought the same was true for C#. Interesting, now I know... Thanks! :)

  • eh.. 'better method' algorithm? What is that?

  • @Franklin, if String.Empty were a constant (IL "literal"), its value would be inserted into IL at compile time - so it wouldn't be any different from just using "". In particular, it would only be interned once per assembly. But since it's actually a static readonly field, there's just one single instance shared between all code using String.Empty. I'm not sure if this has any distinct advantages, or if it is even the rationale for making it non-constant, but I can't think of any other points of difference.

  • Excellent post. Could you please clarify the following bit for me: "only one string object is created by the CLR for all instances of that literal within the assembly." Are you saying that if assembly A and assembly B both contain the same literal string there will be two copies of this in memory or am I reading this backwards? Because as far as I have been able to observe, that is not the case.

  • I wonder why the C# compiler is still written in C++...

    // Ryan

  • The compiler is probably still written in C++ because at first they had to, and now it'd be a waste to throw out all that perfectly good code.

  • Something is bothering me here.

    object obj = "";
    string str1 = "";
    string str2 = String.Empty;
    Console.WriteLine(obj == str1); // true
    Console.WriteLine(str1 == str2); // true
    Console.WriteLine(obj == str2); // sometimes true, sometimes false?!

    If my understanding is correct, this means that "" is interned, while string.Empty depends on .NET runtime version. Then, wouldn't it be better to always use "" rather than string.Empty ?

    By the way, I always wondered where the "use string.Empty" good practice came from.

  • A slight modification of the code at the beginning of the post provides yet another illustration of the difference between comparison by reference and comparison by value:

               object obj = "Int32";
               StringBuilder sb = new StringBuilder("Int32");
               string str1 = sb.ToString();
               string str2 = typeof(int).Name;
               Console.WriteLine(obj == str1); // False, this time!!!
               Console.WriteLine(str1 == str2); // true
               Console.WriteLine(obj == str2); // false !?

    Well, it's self-explanatory, pretty much: the call to StringBuilder.ToString() defeats the interning, somehow, so that the two "Int32" do not end up being the same object (I've used the VS 2008 SP1, Standard Edition, on 64-bit Windows 7 Ultimate RTM: it may be different on other .NET versions, of course).
