Can't we all just get along?

One of the struggles we go through every release is figuring out how to handle updates to the behavior of linguistic sorting. People rely on NLS APIs to sort strings appropriately for display to users. That means that strings need to be in the right order, where the right order is defined as the order that's going to allow a user to locate some particular item in a sorted list. Most of the time the expected order for a user who speaks some particular language doesn't change a whole lot. A lot of the time the expected order reflects something that people learned in school, or see consistently in dictionaries and encyclopedias and phone books. Oftentimes the order of encoding in Unicode is pretty close to what users expect. And sometimes, where none of the above applies, we find highly intuitive and aware native speakers who not only have an innate awareness of what to expect in terms of sort order, but can also articulate why.

Then there are all the other times.

And many of the languages for which technology support is more recent? Are the other times.

For at least one language that we added during Vista, there was not only no national standard available, but four published dictionaries used four different (and internally inconsistent) sort orders. The articulate and perceptive native speakers working with us? Did not agree with each other. For many other languages, literacy patterns are emergent enough that native speaker intuitions are quite unclear and often wrapped up in other issues of cultural identity. There are groups pushing to follow sort orders and literacy traditions from other languages with overlapping character sets. There are groups pushing to establish local standards. There are still other languages for which there are no groups at all.

And yet, in many of these places, people are using computers. And the moment people can read and enter text on the screens in front of them, they're gonna be enveloped in sorted data. In the file system. In Excel files and documents. In other application UI. And for all of those users in all of those languages, the lists need to be predictable and linguistically appropriate. When for whatever reason they're not predictable or appropriate, people tell us about it, and we have bugs to fix.

Even for languages with long literacy traditions and relatively advanced technology support, every once in a while the group of people who own the language (and here I mean a speech community of native speakers, who may or may not be aligned to a governmental language authority) changes how things work. Spelling reforms get all the media play, but dictionary order reforms happen too. And when that happens, sorting behavior that was appropriate at the time we started supporting it stops being appropriate for the users it was intended to support.

And then there are all the new characters that get added to Unicode, as new scripts are encoded and new code points are added to existing scripts. These characters get encoded because users need them.

So every time we release a version of Windows or the .NET Framework, we end up addressing some issues that reflect either changes in local expectations or bugs in things we’ve shipped before. For applications that rely on us for linguistic correctness of sorted strings, this is a good thing. It means that app developers never need to roll their own behavior for linguistic appropriateness: they need to know how to make the right CompareStringEx() call with the right locale information for their user, and they’re done.
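To make the difference between ordinal and linguistic order concrete, here is a minimal sketch in Python. The `toy_linguistic_key` function is a hypothetical stand-in I made up for illustration: it only ignores accents and case, which is far less than what CompareStringEx() does for a real locale. It is exactly the kind of thing you should not roll yourself in a shipping app, but it shows why code point order is not what users expect to see.

```python
import unicodedata


def toy_linguistic_key(s):
    """Hypothetical, simplified sort key: compare letters without accents
    or case first, then break ties with the original string.
    This is NOT how a real collation API works; it only illustrates the idea."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), s)


words = ["zebra", "Zoo", "apple", "éclair", "Eagle"]

# Ordinal order: plain code point comparison. Uppercase sorts before
# lowercase, and accented letters land after "z".
ordinal = sorted(words)

# "Linguistic" order using the toy key above.
linguistic = sorted(words, key=toy_linguistic_key)

print(ordinal)     # ['Eagle', 'Zoo', 'apple', 'zebra', 'éclair']
print(linguistic)  # ['apple', 'Eagle', 'éclair', 'zebra', 'Zoo']
```

The ordinal list is stable forever, but nobody looking for "éclair" would think to scroll past "zebra" to find it.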

But then there are the databases. With indexes that can become corrupt and cause huge compatibility headaches for users.

Anyone who needs to persist sorted data, like people building indexes for a database, needs to rely on the consistency of their index. When sorting behavior changes, the index can become corrupt, which means that anyone searching on it won’t find the strings that they expect to find. So all the goodness of linguistically appropriate sorting is great for applications that present strings to users for ordered display, but these updates can wreak havoc for databases that aren’t designed correctly. That’s why many people who persist sorted data avoid linguistic sorting entirely, instead using ordinal sorting to ensure consistency of behavior across a given character set.

In Windows Server 2003, we introduced GetNLSVersion(), which allows developers to query the version of sorting behavior that is present on the OS and decide whether or not to reindex. If the version number has changed, then some sorting behavior has changed since the previous version, which means that databases probably want to reindex to avoid the risk of corruption.
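Here is a rough Python sketch of both halves of that story: how a persisted index goes bad when the comparison rules change underneath it, and how stamping the index with a sort version lets you notice and rebuild. Everything in it is an assumption for illustration. The integer versions, `SORT_KEYS`, and the `SortedIndex` class are hypothetical; a real Windows application would get its version information from GetNLSVersion() rather than from a dictionary.

```python
# Hypothetical sort "versions". Version 1 compares by code point;
# version 2 ignores case first. These stand in for an OS sorting update.
SORT_KEYS = {
    1: lambda s: s,
    2: lambda s: (s.casefold(), s),
}


class SortedIndex:
    """A persisted sorted list stamped with the sort version used to build it."""

    def __init__(self, items, sort_version):
        self.sort_version = sort_version
        self.items = sorted(items, key=SORT_KEYS[sort_version])

    def contains(self, item, current_version):
        """Binary search using the comparison rules of current_version.
        Only correct if the list was sorted under those same rules."""
        key = SORT_KEYS[current_version]
        target = key(item)
        lo, hi = 0, len(self.items)
        while lo < hi:
            mid = (lo + hi) // 2
            if key(self.items[mid]) < target:
                lo = mid + 1
            else:
                hi = mid
        return lo < len(self.items) and self.items[lo] == item

    def reindex_if_stale(self, current_version):
        """Rebuild when the stamped version differs from the current one.
        Returns True if a rebuild happened."""
        if self.sort_version == current_version:
            return False
        self.items = sorted(self.items, key=SORT_KEYS[current_version])
        self.sort_version = current_version
        return True


index = SortedIndex(["apple", "Banana", "cherry", "Date"], sort_version=1)

found_before_update = index.contains("apple", 1)        # True
found_after_update = index.contains("apple", 2)         # False: index is stale
rebuilt = index.reindex_if_stale(2)                      # True
found_after_reindex = index.contains("apple", 2)         # True
```

The string never left the index. The search just walks the wrong way because the list is no longer sorted by the rules the search is using, and that is what "corrupt" means here.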

We hope that index builders are sorting ordinally, but what if they’re not? We think that many databases are using SQL or Jet or some other technology that shields them from having to think about the issue, but what if they aren’t? We believe that most databases go offline to reindex at major OS releases, but what if they don’t? We want people to check the NLS version to make informed decisions about reindexing, but what if people don’t know that they can do this? We know that many more people rely on us for linguistic sorting for display than for persisting ordered indexes, but every release we end up stuck between a rock and a hard place, where we need to make updates to serve the needs of international customers, but we need to be cautious about updates… to serve the needs of international customers.

And that about sums up my week. :-)