Fabulous Adventures In Coding
Eric Lippert is a principal developer on the C# compiler team. Learn more about Eric.
Todays post is based on a question from StackOverflow; I liked it so much I figured hey, let's just blog it today.
When you look at a string in C#, it looks to you like a collection of characters, end of story. But of course, behind the scenes there is a data structure in memory somewhere to implement that collection of characters. In the .NET CLR, strings are laid out in memory pretty much the same way that BSTRs were implemented in OLE Automation: as a word-aligned memory buffer consisting of a four-byte integer giving the length of the string, followed by the characters of the string in two-byte chunks of UTF-16 data, followed by two zero bytes. (Recall that BSTR originally stood for "BASIC string", because the OLE Automation team was actually part of the Visual Basic team; this is the format that Visual Basic used.)
Using this as the internal implementation of strings has a number of benefits. For example: it only requires one heap allocation per string. The length can be determined without counting characters. The string can contain embedded zero bytes, unlike formats that use zero bytes as end-of-string markers. If you disregard surrogate pairs then the nth individual character can be fetched in O(1) time, unlike in variable-width encodings like UTF-8. If the string is pinned in place and contains no zero characters then the address of the string data can be passed unmodified to unmanaged code that takes a WCHAR*. And so on.
Strings are immutable in .NET, which also has many benefits. As I've discussed many times, immutable data types are easier to reason about, are threadsafe, and are more secure. (*)
One of the benefits of the immutable data types I've talked about here previously is that they are not just immutable, they are also "persistent". By "persistent", I mean an immutable data type such that common operations on that type (like adding a new item to a queue, or removing an item from a tree) can re-use most or all of the memory of an existing data structure. Since it is all immutable, you can re-use its parts without worrying about them changing on you.
Strings in this format are immutable, but they are not persistent, thanks to that pesky single-buffer-with-both-a-prefix-and-suffix layout. Concatenation two strings does not re-use the content of either of the original strings; it creates a new string and copies the two strings into the new string. Taking the substring of a string does not re-use the content of the original string. Again, it just makes a new string of the right size and makes a full copy of the data.
This means that operations on strings such as taking a substring are O(n) in the size of the substring. Concatenations are O(n+m) in the sizes of the two source strings. These operations could be O(1) or O(lg n) if strings were persistent. For example, we could treat strings as catenable deques of chars; there are ways to do very efficient concatenations of catenable deques. (**) Or, we could say that there are two kinds of strings; "regular" strings and "ropes", which are binary trees of strings. Concatenating two strings just allocates a new rope. It becomes a bit more expensive to find the nth char of a rope, and you can't pass them unmodified to managed code, but you always avoid a potentially lengthy copy. Similarly with substring operations: taking the substring of a string could simply allocate a wrapper around the original string that keeps an offset and a length. No copying required.
Why don't we do "persistent" optimizations like this, since we have an immutable data structure already?
The motivation for doing so is based on the incorrect idea that O(1) is always better than O(lg n), is always better than O(n). The asymptotic order of an operation is only relevant if under realistic conditions the size of the problem actually becomes large. If n stays small then every problem is O(1)!
That is the situation we are actually in. Very few developers routinely deal with strings of more than a few hundred or a few thousand characters. If you are dealing with larger strings and doing a lot of concatenation, you can use a StringBuilder; if you're only doing a small number of concatenations, it is very fast to do the copy. If you're taking substrings, odds are very good that the substrings are going to be a few dozen characters out of a thousand character string, not a few hundred thousand characters out of a million-character string. Most line-of-business programmers are not in the business of chopping up strings containing million-character DNA strands; they're in the business of parsing the name of a book out of an XML document or an address out of a comma-separated-value text file. The memory operations for making a relatively small string buffer and copying a few dozen bytes out of it are insanely fast, so fast that there is really little point in optimizing it further.
Unless you have really good reason to believe that the "optimization" is a win, it probably isn't. I spent a summer about twelve years ago rewriting the VBScript internal string manipulations to use "ropes" -- to build up a tree on every string concatenation, and only turn it back into a "real" BSTR when necessary. Our research on real-world code showed that the "optimization" was no optimization at all; most of the time strings were being turned from a rope back into a BSTR after the second or third concatenation, and the added expense of allocating all the tree structures was actually slower than just making a copy of every BSTR every time. (***)
Now, some people are actually manipulating relatively large strings and inserting and removing substrings on a regular basis. My team does that all the time; we write code editors that have to be able to deal with enormous files being rapidly edited. Because operations on strings are O(n) and n is actually big, we do not use big strings; rather, we have immutable persistent data structures that represent edits to the document, and those edits are then fed into an engine that maintains an immutable, persistent model of the lexical and syntactic analysis of the program. We want to be able to re-use as much of the previously-seen text and analysis as possible, to ensure high performance in both memory size and speed.
That was some tricky code to write, and what works for us almost certainly does not work for the people doing DNA strings, or any other big-string-with-lots-of-substrings problem that you care to name. Rather than optimize CLR strings for narrow, rare cases, it's better to just keep it simple, optimize for the common case, and rely on the hardware taking care of making blindingly fast copies of short strings.
(*) Consider an attack where partially-trusted hostile code passes a string containing a "safe" path to File.Open. The security checker verifies that the path is safe for partially-trusted code to open. And then the partially-trusted code mutates the string on another thread before the file is actually opened! If they get the timing right then the security check passes and the wrong file is opened. If strings are immutable, this kind of thing does not happen.
(**) My article on deques does not show how to make them efficiently catenable; see the referenced papers for details.
(***) If you happen to be one of the people with access to the VBScript source code, I think all that gear is still in there, but turned off. Search for FANCY_STRING in the sources.
If recall, Borland Delphi's Object Pascal's strings also worked the same way.
I never had to deal with characters beyond 2 bytes but it seems to me from the documentation that substring and indexing actually work with characters as in the 2 byte char type and not in real Unicode code points. Therefore if you really wanted to handle surrogate pairs as a single character you need to do it yourself. Is that true?
pete.d - "Not if you want the API to remain character-based."
Now your arguments are just getting silly. The traditional CHAR is only one byte. The 2-byte WCHAR was introduced later. Personally, I like UTF-8 better as it is backward compatible with ASCII and doesn't change the definition of character. Also, since I have only worked for American companies, all the business text I have worked with has always been ASCII and I can assume no multi-byte characters. UTF-8 is also the starndard for the web.
Eric has a good argument for UTF-16, in that it is currently rare to have surrogate pairs, but 20 years ago the same thing could be said for UTF-8. What about in 20 more years? An argument can be made that code should just be written to handle multi-character code-points now, and in that case I think UTF-8 wins bits down.
Caching the hash value of a string on creation definitely sounds like a dodgy optimization to me. This disregards the fact that 1) most strings are not keys for hashtables (well, maybe in your app they are, but you probably underestimate the number of strings generated) and 2) keys stored in the hashtable are hashed only once, and keys used for lookup are very likely new strings from external sources, which, again, are typically hashed only once (because they're garbage collected soon afterwards).
Caching hash values makes sense in a lot of cases where interning also makes sense. Interned strings should have their hash values cached. For other strings, it's probably a waste of storage. This is significant, since there are plenty of applications where strings make up the bulk of the heap.
Jeroen: The hashcode for a string would be cached on the first call to GetHashCode, not on the creation of a string. And a hashtable's key's hashcode is recomputed every time the hashtable has to grow or somebody iterating over the keys needs to lookup a value.
No, the real reason it's unnecessary to cache hashcodes for strings is that hashing isn't slow enough to warrant it. My laptop can compute the hashcode of a string with 1M chars in under 1 millisecond, meaning most strings hash in nanoseconds. If it usually takes 10ns to hash a string, why bother caching the result?
"Concatenation two strings" - I'm assuming you meant "Concatenation of two strings" or "Concatenating two strings". FYI.
"Now your arguments are just getting silly. The traditional CHAR is only one byte."
Calling my posts "silly" doesn't help your position any. I certainly disagree with your unfounded mischaracterization.
In any case, what was "traditional" isn't relevant. .NET strings are an API unto themselves, and are Unicode character based.
We aren't talking about some random API elsewhere. We're talking about .NET.
>> In any case, what was "traditional" isn't relevant. .NET strings are an API unto themselves, and are Unicode character based
But they aren't - they are 16-bit-char based, and 16-bit char is not a Unicode character.
He is more or less correct in that the only way in which UTF-16 is better than UTF-8 is that it makes it easier to justify ignoring proper handling of all code points, because most of those that you'll see in practice will not require surrogate pairs to encode, and you can just assume 1 char = 1 Unicode character and pretend that this works for everyone. It's more about hiding the problem by solving the most prominent manifestations and swiping the rest under the carpet, rather than dealing with it in its entirety.
That's in the perfect world, though. In practice, when .NET appeared, Win32 has been using 16-bit chars in all its Unicode-enabled APIs for, what, 7 years already? As it is, you can pass .NET strings to Win32 functions with no conversion at all - P/Invoke marshaler just pins the string and passes the direct pointer to the buffer. That's fast. Returning strings can be handled similarly with StringBuilder, as well. And remember that e.g. WinForms did a _lot_ of P/Invoke!
Even setting performance aside, there's also the issue of learning curve and familiarity for users of existing tools. Win32 already had it that way, and it was not alone - Java is another language which does it this way, and Qt is a framework with a similar choice. If I remember correctly, Cocoa NSString is UTF-16 encoded, as well. The only UI framework that I can think of that uses UTF-8 for strings is Gtk+ - pretty much everyone else is on UTF-16.
It's just one of those things that became a de facto standard back in the day when all we had was BMP and UCS2 was all you needed. For better or worse, it is what we have, and changing it now is a fairly radical, breaking change - while benefits of such a change would be very minor in comparison.
String itself aside, there are some areas related to strings that might merrit some optimization. Even if this is still only relevant in extreme cases (as long as it doesn't come with a penalty). See string.Split: ajdotnet.wordpress.com/.../the-cost-of-string-split
Another example is StringBuilder with long strings (memory going to the LOH) and naive usage (i.e. not properly initializing it with the required size). Some way of at least telling the developer about the anti-pattern (e.g. some trace, or a counter with the number of reallocations) could help.
AlexanderJ: The implementation of strings is a trade-off. If you use the CLR's implementation of strings, then performing a Split requires creating smaller strings that just about add up in size to your original large string. The bonus of that, however, is that the larger string can then be garbage collected and only the smaller strings that you actually use need to stick around.
If you look at an implementation of strings like Java's where strings can reference other strings, then doing a Split is relatively cheap -- you just create an array of references to the original string with offsets and lengths of the substrings. The unfortunate problem with that, however, is that the original large string cannot be garbage collected until every single one of those smaller strings is no longer referenced. Every call to Split is a potential memory leak!
And what's the anti-pattern with StringBuilder? A StringBuilder is actually a linked list, where they try very hard to keep each item in the list out of the large object heap.
Lua's got quite a nice string system - every string is hashed on creation, and interned (importantly with a weak reference - the CLR intern system really fails here).
Raw equality is a simple pointer comparison then, and hash lookups of the interned strings extremely efficient (the hash code is stored, the equality is a pointer compare).
Whenever I'm going dictionary heavy... I'm kind of saddened that the CLR doesn't do the same. Especially as I rarely use strings for data manipulation etc.. prefer to leave that for StringBuilders.
""if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index." Not if you want the API to remain character-based."
If you're ignoring surrogate pairs, your UTF-16 api _isn't_ character-based. You're ALREADY arguing in favor of a non-character-based API for UTF-16. So why are you insisting that the UTF-8 one has to be character-based?
@Gabe: The fact that string.Split produces new strings for the parts was not my point. My point is that string.Split internally wastes twice as much memory as the original string needs.
Regarding StringBuilder you're right, they've changed the implementation in .NET4 (until 3.5 StringBuilder maintained one buffer that was reallocated on demand). My fault.
Here's an interesting fact: SyncBlk for CLR objects _already_ has a hashCode field. It's used by stock (reference-identity) Object.GetHashCode. This begs to be reused for String.GetHashCode. In fact, if CLR ever gets the notion of immutable objects, it would be nice to universally apply this optimization to GetHashCode for all of them.