Managed
strings are subject to ‘interning’.
This is the process where the system notices that the same string is used
in several places, so it can fold all the references to the same unique
instance.
Interning happens two ways in the CLR.
1)
It happens when you explicitly
call System.String.Intern().
Obviously the string returned from this service might be different from
the one you pass in, since we might already have an intern’ed instance that has
been handed out to the application.
2)
It happens automatically, when you
load an assembly. All the string
literals in the assembly are intern’ed.
This is expensive and – in retrospect – may have been a mistake. In the future we might consider allowing
individual assemblies to opt-in or opt-out. Note that it is always a mistake to rely
on some other assembly to have implicitly intern’ed the strings it gives you.
Through versioning, that other
assembly might start composing a string rather than using a literal.
One
thing that might not be immediately obvious is that we intern strings across all
AppDomains. That’s because
assemblies can be loaded as domain-neutral. When this happens, we execute the same
code bytes at the same address in all AppDomains into which that assembly has
been loaded. Since we can burn the
addresses of string literals into our native code as immediate data, we clearly
benefit from intern’ing across all AppDomains rather than using per-AppDomain
indirections in the code. However,
this approach does add overhead to intern’ing: we are forced to use
per-AppDomain reference counts into a shared intern’ing table, so that we can
unload intern’ed strings accurately when the last AppDomain using them is itself
unloaded.
Normally, strings should be compared with String.Equals and similar
mechanisms. Note that the String
class defines operator== to be String.Equals. However, if two strings are both known
to have been intern’ed, then they can be compared directly with a faster
reference check. In other words,
you could call Object.operator==() rather than String.operator==(). This is only recommended for highly
performance-sensitive scenarios when you really know what you are
doing.
Of
course, string intern’ing only works if strings are immutable. If they were mutable, then the sharing
of strings that is implicit in intern’ing would corrupt all kinds of application
assumptions – as we will see.
The good
news is that strings are immutable… mostly. And they are immutable for many good
reasons that have nothing to do with intern’ing. For example, immutable strings eliminate
a whole host of multi-threaded race conditions where one thread uses a string
while another string mutates it. In
some cases, those race conditions could be used to mount security attacks. For example, you could satisfy a
FileIOPermission demand with a string pointing to an innocuous section of the
file system, and then use another thread to quickly change the string to point
to a sensitive file before the underlying CreateFile occurs.
So how
can strings be mutated?
Well,
you can certainly use C#’s ‘unsafe’ feature or equivalent unverifiable ILASM or
Managed C++ code to write into a string’s buffer. In those cases, some highly trusted code
is performing some clearly dirty operations. This case isn’t going to happen by
accident.
A more
serious concern comes with marshaling.
Here’s a program that uses PInvoke to accidentally mutate a string. Since the string happens to have been
intern’ed, it has the effect of changing a string literal in an unrelated part
of the application. We pass
‘computerName’ to the PInvoke, but ‘otherString’ gets changed too!
using
System;
using
System.Runtime.InteropServices;
public class
Class1
{
static void Main(string[] args)
{
String computerName = "strings are always
immutable";
String otherString = "strings are always
immutable";
int
len = computerName.Length;
GetComputerName(computerName, ref len);
Console.WriteLine(otherString);
}
[DllImport("kernel32",
CharSet=CharSet.Unicode)]
static extern bool
GetComputerName(
[MarshalAs (UnmanagedType.LPWStr)] string
name,
ref
int len);
}
And
here’s the same program written to avoid this problem:
using
System;
using
System.Runtime.InteropServices;
public class
Class1
{
static void Main(string[] args)
{
String computerName = "strings are always
immutable";
String otherString = "strings are always
immutable";
int
len = computerName.Length;
GetComputerName(ref computerName, ref len);
Console.WriteLine(otherString);
}
[DllImport("kernel32",
CharSet=CharSet.Unicode)]
static extern bool
GetComputerName(
[MarshalAs(UnmanagedType.VBByRefStr)]
ref
string name,
ref int
len);
}
In this second case,
VBByRefStr is used for the marshaling directive. The argument is treated as ‘byref’ on
the managed side, but remains ‘byval’ on the unmanaged side. If the
unmanaged side scribbles into the buffer, it won’t pollute the managed string,
which remains immutable. Instead, a different string is back-propagated to
the managed side, thereby preserving managed string immutability.
If you are coding in VB,
you can pretend that the VBByRefStr is actually byval on the managed side.
The compiler works its magic on your behalf, so you don’t actually realize that
you now have a different string. C# works no such magic, so I had to
explicitly add the ‘ref’ keyword in all the right places.
If
you’re like me, you probably find all the marshaling directives
bewildering. I can’t recommend Adam
Nathan’s book enough. It is “.NET
and COM – The Complete Interoperability Guide”. It truly is the bible for
interop.
Nevertheless, even with the book it’s easy to make a lot of
mistakes. There’s a feature in the
new CLR release called Customer Debug Probes. It makes finding certain kinds of bugs
much easier. Fortunately for all of
us, it’s particularly geared to finding bugs with marshaling and other Interop
issues.