Strings, immutability and persistence


Today's post is based on a question from StackOverflow; I liked it so much I figured hey, let's just blog it today.

When you look at a string in C#, it looks to you like a collection of characters, end of story. But of course, behind the scenes there is a data structure in memory somewhere to implement that collection of characters. In the .NET CLR, strings are laid out in memory pretty much the same way that BSTRs were implemented in OLE Automation: as a word-aligned memory buffer consisting of a four-byte integer giving the length of the string, followed by the characters of the string in two-byte chunks of UTF-16 data, followed by two zero bytes. (Recall that BSTR originally stood for "BASIC string", because the OLE Automation team was actually part of the Visual Basic team; this is the format that Visual Basic used.)

Using this as the internal implementation of strings has a number of benefits. For example: it only requires one heap allocation per string. The length can be determined without counting characters. The string can contain embedded zero bytes, unlike formats that use zero bytes as end-of-string markers. If you disregard surrogate pairs then the nth individual character can be fetched in O(1) time, unlike in variable-width encodings like UTF-8. If the string is pinned in place and contains no zero characters then the address of the string data can be passed unmodified to unmanaged code that takes a WCHAR*. And so on.
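As a rough illustration (a toy sketch, not the CLR's actual code), the BSTR-style layout described above can be mocked up in a few lines of Python. Note that a BSTR's length prefix counts bytes; the sketch follows that convention.

```python
import struct

def bstr_layout(s: str) -> bytes:
    """Build a byte buffer mimicking the BSTR-style layout described
    above: a four-byte length prefix (counting bytes, as BSTR does),
    the UTF-16 code units, then a two-byte zero terminator."""
    payload = s.encode("utf-16-le")          # two bytes per code unit
    return struct.pack("<I", len(payload)) + payload + b"\x00\x00"

buf = bstr_layout("hi")
# The length can be read without scanning the characters:
length_in_bytes, = struct.unpack_from("<I", buf, 0)
```

Because the length is stored up front, the buffer can also contain embedded zero characters without confusing anything, unlike a zero-terminated format.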

Strings are immutable in .NET, which also has many benefits. As I've discussed many times, immutable data types are easier to reason about, are threadsafe, and are more secure. (*)

One of the benefits of the immutable data types I've talked about here previously is that they are not just immutable, they are also "persistent". By "persistent", I mean an immutable data type such that common operations on that type (like adding a new item to a queue, or removing an item from a tree) can re-use most or all of the memory of an existing data structure. Since it is all immutable, you can re-use its parts without worrying about them changing on you.
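What "persistent" means can be sketched with a hypothetical immutable stack in Python: pushing allocates one new node and re-uses the entire old structure as its tail, copying nothing.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    head: int
    tail: Optional["Node"]

def push(stack: Optional[Node], value: int) -> Node:
    # The new stack re-uses the old one in its entirety as its tail;
    # nothing is copied, because nothing can ever change.
    return Node(value, stack)

s1 = push(push(None, 1), 2)
s2 = push(s1, 3)
# s2 shares s1's nodes rather than copying them:
assert s2.tail is s1
```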

Strings in this format are immutable, but they are not persistent, thanks to that pesky single-buffer-with-both-a-prefix-and-suffix layout. Concatenation two strings does not re-use the content of either of the original strings; it creates a new string and copies the two strings into the new string. Taking the substring of a string does not re-use the content of the original string. Again, it just makes a new string of the right size and makes a full copy of the data.

This means that operations on strings such as taking a substring are O(n) in the size of the substring, and concatenations are O(n+m) in the sizes of the two source strings. These operations could be O(1) or O(lg n) if strings were persistent. For example, we could treat strings as catenable deques of chars; there are ways to do very efficient concatenations of catenable deques. (**) Or, we could say that there are two kinds of strings: "regular" strings and "ropes", which are binary trees of strings. Concatenating two strings then just allocates a new rope node. It becomes a bit more expensive to find the nth char of a rope, and you can't pass one unmodified to unmanaged code, but you always avoid a potentially lengthy copy. Similarly with substring operations: taking the substring of a string could simply allocate a wrapper around the original string that keeps an offset and a length. No copying required.
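Both ideas can be sketched in a few lines of Python; `Rope` and `Slice` here are hypothetical illustrations, not the CLR's types. Concatenation becomes one small allocation, substring becomes a view, and the price is paid when indexing.

```python
class Slice:
    """Hypothetical substring-as-view: wraps the original string with
    an offset and a length instead of copying its characters."""
    def __init__(self, s, offset, length):
        self.s, self.offset, self.length = s, offset, length
    def __getitem__(self, i):
        return self.s[self.offset + i]
    def __len__(self):
        return self.length

class Rope:
    """Hypothetical rope node: concatenation allocates one node and
    shares both children, so it is O(1) instead of O(n + m)."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = len(left) + len(right)
    def __len__(self):
        return self.length
    def __getitem__(self, i):
        # Finding the nth char now costs a tree descent, O(depth).
        if i < len(self.left):
            return self.left[i]
        return self.right[i - len(self.left)]

r = Rope("hello, ", Slice("worldwide", 0, 5))
assert "".join(r[i] for i in range(len(r))) == "hello, world"
```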

Why don't we do "persistent" optimizations like this, since we have an immutable data structure already?

The motivation for doing so is based on the incorrect idea that O(1) is always better than O(lg n), which in turn is always better than O(n). The asymptotic order of an operation is only relevant if, under realistic conditions, the size of the problem actually becomes large. If n stays small then every problem is O(1)!

That is the situation we are actually in. Very few developers routinely deal with strings of more than a few hundred or a few thousand characters. If you are dealing with larger strings and doing a lot of concatenation, you can use a StringBuilder; if you're only doing a small number of concatenations, it is very fast to do the copy. If you're taking substrings, odds are very good that the substrings are going to be a few dozen characters out of a thousand character string, not a few hundred thousand characters out of a million-character string. Most line-of-business programmers are not in the business of chopping up strings containing million-character DNA strands; they're in the business of parsing the name of a book out of an XML document or an address out of a comma-separated-value text file. The memory operations for making a relatively small string buffer and copying a few dozen bytes out of it are insanely fast, so fast that there is really little point in optimizing it further.
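The StringBuilder advice can be illustrated with the analogous pattern in Python, where a list plays the builder's role: appends are cheap, and one final copy produces the result.

```python
def concat_naive(parts):
    s = ""
    for p in parts:
        s += p      # each += may copy everything built so far
    return s

def concat_builder(parts):
    buf = []        # the "builder" appends in amortized O(1)...
    for p in parts:
        buf.append(p)
    return "".join(buf)   # ...and pays for one final O(n) copy

parts = ["ab"] * 1000
assert concat_naive(parts) == concat_builder(parts)
```

For a handful of concatenations of short strings, the naive version is perfectly fine, which is exactly the point of the paragraph above.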

Unless you have really good reason to believe that the "optimization" is a win, it probably isn't. I spent a summer about twelve years ago rewriting the VBScript internal string manipulations to use "ropes" -- to build up a tree on every string concatenation, and only turn it back into a "real" BSTR when necessary. Our research on real-world code showed that the "optimization" was no optimization at all; most of the time strings were being turned from a rope back into a BSTR after the second or third concatenation, and the added expense of allocating all the tree structures was actually slower than just making a copy of every BSTR every time. (***)

Now, some people are actually manipulating relatively large strings and inserting and removing substrings on a regular basis. My team does that all the time; we write code editors that have to be able to deal with enormous files being rapidly edited. Because operations on strings are O(n) and n is actually big, we do not use big strings; rather, we have immutable persistent data structures that represent edits to the document, and those edits are then fed into an engine that maintains an immutable, persistent model of the lexical and syntactic analysis of the program. We want to be able to re-use as much of the previously-seen text and analysis as possible, to ensure high performance in both memory size and speed.

That was some tricky code to write, and what works for us almost certainly does not work for the people doing DNA strings, or any other big-string-with-lots-of-substrings problem that you care to name. Rather than optimize CLR strings for narrow, rare cases, it's better to just keep it simple, optimize for the common case, and rely on the hardware taking care of making blindingly fast copies of short strings.

-------------

(*) Consider an attack where partially-trusted hostile code passes a string containing a "safe" path to File.Open. The security checker verifies that the path is safe for partially-trusted code to open. And then the partially-trusted code mutates the string on another thread before the file is actually opened! If they get the timing right then the security check passes and the wrong file is opened. If strings are immutable, this kind of thing does not happen.
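A toy sketch of that time-of-check/time-of-use problem, using a mutable Python bytearray in place of a mutable string. (The real attack needs a second thread and good timing; this deterministic version just shows why a check proves nothing about the value later used when the buffer is mutable.)

```python
def is_safe_path(path: bytes) -> bool:
    # Stand-in for the security check described above.
    return path.startswith(b"/safe/")

# With a MUTABLE buffer, the value checked need not be the value used:
path = bytearray(b"/safe/data.txt")
assert is_safe_path(bytes(path))     # check passes...
path[:6] = b"/evil/"                 # ...hostile code mutates it...
# ...and the open would now act on b"/evil/data.txt".
```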

(**) My article on deques does not show how to make them efficiently catenable; see the referenced papers for details.

(***) If you happen to be one of the people with access to the VBScript source code, I think all that gear is still in there, but turned off. Search for FANCY_STRING in the sources.

  • "If you disregard surrogate pairs then the nth individual character can be fetched in O(1) time, unlike in variable-width encodings like UTF-8."

    So? If you disregard multibyte sequences in UTF-8, then the nth individual character can be fetched in O(1) time also.

    Sure, but it is far, far more likely for a typical UTF-8 string to contain a multi-byte sequence than for a typical UTF-16 string to contain a surrogate pair. Again, we are optimizing for the common case here. Strings containing modern Japanese, Greek, and so on, are very common; strings containing cuneiform or hieroglyphics are very uncommon. -- Eric

    OTOH, if you disregard surrogate pairs in UTF-16, or multibyte sequences in UTF-8, your code is broken.

  • @Karellen: You could perhaps put a flag somewhere (after the double-zero terminator, maybe, or at a negative offset before the length?) that indicates whether or not the string contains any surrogate pairs. Or possibly even an integer storing the first index of such a pair. Then you could get O(1) time access for all characters prior to the first such sequence...

    I have no idea whether the CLR does anything like this.
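The suggestion above might look like this sketch (hypothetical; as noted, it is not known whether the CLR does anything of the sort):

```python
def first_surrogate_index(code_units):
    """Index of the first UTF-16 surrogate code unit, or the length if
    there is none. A string type could compute and cache this once at
    construction; every index below it is then a true character index,
    accessible in O(1)."""
    for i, u in enumerate(code_units):
        if 0xD800 <= u <= 0xDFFF:
            return i
    return len(code_units)

units = [0x0048, 0x0069, 0xD83D, 0xDE00]   # "Hi" + one surrogate pair
assert first_surrogate_index(units) == 2
```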

  • "OTOH, if you disregard surrogate pairs in UTF-16, or multibyte sequences in UTF-8, your code is broken."

    Maybe, maybe not. Surrogate pairs aren't really the same as UTF-8 multi-byte sequences.

    Think about the usual use cases for using a character index into a string. For example, one might search the string for a specific character or sequence of characters, and then create a new substring based on the resulting index. With a UTF-8 string, to implement the substring logic, you have to scan the string all over again just to translate the character index to a position in the actual data structure's storage. With UTF-16, the presence of surrogate pairs is irrelevant; you just treat each half of the pair as its own character, and the logic works fine.

    Another example might be a line-breaking algorithm that chooses an arbitrary index into the string and then tries to work backwards to find a suitable place to actually break the string (e.g. whitespace). Again, surrogate pairs aren't a problem because it's trivial to identify a pair and whether you're looking at the leading or trailing value (indeed, you're never going to mistake part of a surrogate pair for some other valid encoding point), thus ensuring that the code doesn't inappropriately break the pair apart.

    Fact is, for the purpose of indexing into a string, disregarding surrogate pairs often makes perfect sense with a UTF-16 string, even as that's much less frequently the case when dealing with UTF-8 strings (and in fact, in limited scenarios where one is dealing with _byte_ indexes rather than character indexes, even when working in UTF-8 it's not always a problem).

    You cannot a priori describe code as "broken" unless you know what the code is supposed to do and how it's implemented. Even if it does index into a UTF-16 string using the convenient O(1) time method. There are lots of examples of code that disregards surrogate pairs without being broken.
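The "trivial to identify a pair" claim is easy to demonstrate: the lead and trail surrogate ranges are disjoint from each other and from ordinary code units, so a sketch like this (hypothetical helper names) can avoid splitting a pair when adjusting an arbitrary index, such as the line-break candidate described above.

```python
def is_lead(u):  return 0xD800 <= u <= 0xDBFF
def is_trail(u): return 0xDC00 <= u <= 0xDFFF

def break_index(code_units, i):
    """Nudge an arbitrary index so it never splits a surrogate pair;
    one lookup at position i is enough to tell where we are."""
    if 0 < i < len(code_units) and is_trail(code_units[i]):
        return i - 1   # we'd land inside a pair; step back to the lead
    return i

units = [0x0041, 0xD83D, 0xDE00, 0x0042]   # 'A', surrogate pair, 'B'
assert break_index(units, 2) == 1          # between the pair: adjusted
assert break_index(units, 3) == 3          # ordinary code unit: kept
```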

  • "Now, some people are actually manipulating relatively large strings and inserting and removing substrings on a regular basis."

    I am the author and maintainer of a document editor for a file type that's used to pass electronic banking transactions between banks and network operators (ACH, for those who are paying attention). It's not uncommon for these files to be as large as 200 MB - well over 1 million lines of text and far larger than the typical documents you'll encounter in a code editor. Despite the allure of fancy data structures like ropes, my editor is implemented(*) as a view on top of the simplest of all data structures: a giant array of bytes (the contents are strictly ASCII - or occasionally EBCDIC; yes, it's still around). Inserts, deletes, and moves are all implemented in the most straightforward way: by copying large chunks of memory around. Even with that, today's CPUs are so fast that I could never justify the work to build, test, and maintain a fancy rope-like backing store: my simple array is good enough for any user of the program. Even full undo/redo of whole-file operations that modify a million or more lines of text is fast.

    Much as I like fancy data structures like ropes or deques, I suspect I'll never again use one - it's just not needed unless you're working on something that's a very serious outlier.

    (*) Implemented via an abstract base class. I fully expected to have to retrofit a rope-like data structure once I got to very large files, so I built in a dependency barrier to make the retrofit easy. Fortunately, Moore's law has outpaced e-commerce growth by a good margin since I first started on the app back in 2004.

  • "Again, surrogate pairs aren't a problem because it's trivial to identify a pair and whether you're looking at the leading or trailing value (indeed, you're never going to mistake part of a surrogate pair for some other valid encoding point), thus ensuring that the code doesn't inappropriately break the pair apart."

    But in what way is this different from UTF-8? UTF-8 continuation bytes always have the highest bit set and the second-highest bit unset, so you can't mistake them for something else, either. The only encodings I know of in which the problem you describe can occur are the old MBCS encodings.
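That self-synchronizing property of UTF-8 continuation bytes is easy to check: every continuation byte matches the bit pattern 10xxxxxx.

```python
def is_continuation(b: int) -> bool:
    # UTF-8 continuation bytes match 10xxxxxx: high bit set, next unset.
    return b & 0xC0 == 0x80

data = "café".encode("utf-8")   # b'caf\xc3\xa9': one two-byte sequence
flags = [is_continuation(b) for b in data]
assert flags == [False, False, False, False, True]
```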

  • To try and (fail to) head off the inevitable Unicode argument over which UTF is best: almost any algorithm dealing with Unicode will be wrong for some input data. You can't safely line-break arbitrary text without huge lookup tables for word endings in non-spaced languages, and you can't delete the single character before the cursor without doing normalization, e.g. in the case of U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT, in order to be consistent with U+00E9 LATIN SMALL LETTER E WITH ACUTE. The distinctions between code units and code points pale in comparison to the many-to-many mapping between code points and font glyphs (needed just for basic page layout!), not to mention that there's no real definition of a 'character'.
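The normalization pitfall is concrete; here is a sketch using U+0301 COMBINING ACUTE ACCENT, the character that canonically composes with 'e' to form 'é'.

```python
import unicodedata

decomposed = "e\u0301"            # 'e' + COMBINING ACUTE ACCENT
precomposed = "\u00e9"            # 'é' as a single code point
assert decomposed != precomposed  # different code-point sequences...
assert unicodedata.normalize("NFC", decomposed) == precomposed
# ...so "delete the character before the cursor" must normalize first,
# or it will behave differently on visually identical strings.
```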

    To be more on topic: is it more likely that a cache-oriented string implementation could give performance benefits? Say, a char array for sizes < one page, but turning into a substring array when bigger - hopefully with the substrings being close to a page in size?

    <pre>
    struct STR_PAGED_SLICE {
        wchar_t* Start;
        wchar_t* End;
    };

    struct STR_PAGED {
        int Size;
        union {
            wchar_t Chars[];   // offsetof(Chars[Size]) <= PAGE_SIZE
            struct {           // otherwise
                int Count;
                STR_PAGED_SLICE Items[];
            } Slices;
        };
    };
    </pre>

  • Can you talk a bit about string interning? I would like to know how efficient it is, especially in a high-performance, highly concurrent system.

  • What interests me most about the String design in .NET is how it's very different to the String design in Java which *does* use a persistent approach - each String has a reference to a char[], along with a start index and a length. I don't know how much JVMs "know" about strings - clearly the CLR needs to know rather more about System.String as it's the only type I'm aware of other than arrays where the size of an object isn't fully determined by the type, on a particular CLR instance. (Obviously 64-bit vs 32-bit makes a difference in many types.)

    The other interesting decision Java took was to cache the hash code of a string on first call. I *suspect* that's another dodgy optimization - I would love to know how often it saves time vs the memory required for another integer in *every* string in the system.

  • I am so glad that the decision was made to keep things simple for the common case. Trying to solve for 100% of the cases instead of the 90% always results in problems, and in people developing expertise in something that shouldn't require expertise.

  • Aren't strings in .NET technically mutable with unsafe code? I don't think it would ever be a good idea to actually do it, but it is possible, no?

    Every single bit in the user space of the process, including all of the code that implements the runtime, is mutable in unsafe code. That includes all the strings. -- Eric

  • @Jon:

    >> The other interesting decision Java took was to cache the hash code of a string on first call. I *suspect* that's another dodgy optimization - I would love to know how often it saves time vs the memory required for another integer in *every* string in the system.

    Actually, I've often thought that caching the hash code would be a very nice thing to have, but then I envisioned it being placed into the syncblk - which is already there, and is already a dump of various unrelated "nice to have for some objects, but not worth paying the memory price for all of them" kinds of data - which is precisely what a hash code is.
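The caching strategy under discussion is simple to sketch (a hypothetical class, not Java's or the CLR's implementation): compute the hash on first request, store it, and return the stored value thereafter, at the cost of one extra field in every instance whether it is ever used or not.

```python
class CachedHashString:
    """Lazily caches the hash of an immutable value; safe precisely
    because the underlying string can never change."""
    def __init__(self, value: str):
        self.value = value
        self._hash = None          # sentinel for "not yet computed"

    def cached_hash(self) -> int:
        if self._hash is None:
            self._hash = hash(self.value)   # computed at most once
        return self._hash

s = CachedHashString("hello")
assert s.cached_hash() == hash("hello")
```

(Java's real String.hashCode uses 0 as the "not yet computed" sentinel rather than a separate field, which has its own corner case for strings that genuinely hash to 0.)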

  • "But in what way is this different from UTF-8? UTF-8 suffix bytes always have the highest bit set and the second highest bit unset, so you can't mistake them for something else, either."

    Okay, my oversight. You're right.

    My previous example still holds, and there are others. The fact remains that surrogate pairs aren't necessarily going to be as much of a problem as multi-byte UTF-8 encodings.

  • Can anybody explain why embedded zero characters are only fully supported in the 32-bit CLR? Since the string hashing algorithm on the 64-bit CLR stops when it hits '\0', all strings that start with '\0' have the same hash code as the empty string.

    As an example, executing

    new HashSet<string>(Directory.EnumerateFileSystemEntries(@"\Program Files (x86)", "*", SearchOption.AllDirectories).Select(s => s.Replace('\\', '\0')))

    is a couple of orders of magnitude faster on my machine when targeting x86 than x64. Changing the HashSet to a SortedSet makes them run equally fast.

    Incidentally, the 64-bit hash algorithm seems to be about 50% slower than the 32-bit version on my machine, too.
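A toy hash that truncates at the first zero character reproduces the collision pattern described (this is not the CLR's actual hash function, just an illustration of the failure mode):

```python
def truncating_hash(s: str) -> int:
    """Toy djb2-style hash that stops at the first '\\0', mimicking the
    64-bit CLR behavior described above (not the real CLR algorithm)."""
    h = 5381
    for ch in s:
        if ch == "\0":
            break                  # the bug: everything after is ignored
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

# Every string starting with '\0' collides with the empty string,
# degrading a hash set to a linear scan of one giant bucket:
assert truncating_hash("\0abc") == truncating_hash("")
```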

  • "My previous example still holds, and there are others. " - how does your example not apply equally to UTF-8?

    Anything that you can legitimately ignore surrogate pairs for, you can legitimately ignore UTF-8 for.

    "With a UTF-8 string, to implement the substring logic, you have to scan the string all over again just to translate the character index to a position in the actual data structure's storage." is a straw man - if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index.

  • "if you can gain efficiency without losing correctness with a "character index" for UTF-16 that ignores surrogate pairs, you can do the same in UTF-8 with a byte index."

    Not if you want the API to remain character-based.

    There's a difference between designing for absolute efficiency, and designing for usability. You may feel the latter is pointless, but thankfully not all people do (including the designers of .NET).
