
# November, 2006

Sorting it all Out: Michael Kaplan's random stuff of dubious value. Be sure to read the disclaimer here first!
• #### It may not be special, but 3 is a *magic* number, dammit!

So on Tuesday, Eric Lippert posted about how Every Number Is Special In Its Own Special Way.

Now everyone over there geeked out about proving that all numbers are special, as one might expect.

But as soon as I saw the title, I knew what popped in my head -- Three is a magic number!

Just as I learned it on Schoolhouse Rock!

I couldn't find the original online but I found the Blind Melon cover of it here on YouTube, and Everything2 has the lyrics here, plus some fun info from Bob Dorough (the author) who wrote it the way I write some blog posts (first the title, then try to fill in the post that goes with that title!).

Ignoring the possible semantic/pragmatic differences between "magic" numbers and "special" numbers, it seems like a reasonable leap of logic to me!

(As a side note, the number eight (8) is also quite special given the common association with infinity (∞), due to the code page best-fit mapping most commonly attributed to Cathy Wissink, by everyone except for her!)

This post brought to you by 3 (U+0033, a.k.a. DIGIT THREE, a magic number!)

• #### So when is Esperanto coming? (short version)

Well, depends on how fast you can create it with the Microsoft Locale Builder!

But do be sure to let us know that work is proceeding. We're kind of tired of waiting, too!

(longer version here)

This post brought to you by ䷲ (U+4df2, a.k.a. HEXAGRAM FOR THE AROUSING THUNDER)

• #### So when is Esperanto coming?

Not entirely on the topic of my prior post Subsets of subsets of subsets of subsets of subsets, but Bertilo Wennergren asked in a comment to that post:

So when are you going to add an Esperanto locale? There are actually quite a few people using Esperanto in their computers. Probably many more than for some of the languages that are already in that list (Sorbian, Tamazight, Sami, Romansh, Occitan, Mapudungun, Sanskrit...). What those Esperanto people mostly need is actually a keyboard layout. Wouldn't be so hard, now, would it?

Most of the big Linux distros have Esperanto locales and keyboard layouts nowadays. Don't know about the Macs.

I promise we won't sue you (like the Mapudungun people did...).

Heh heh heh, cute. :-)

I hinted at the issues in Fictional could make things less functional and then talked about them more fully in And while I'm on the subject, there is the rest of the world (I even mentioned Esperanto there explicitly!). The answer is really there -- we are talking about a general weakness in the locale model that Windows uses.

Obviously architectural problems need to be figured out before we can implement the solutions. :-)

In the meantime, support for custom locales in Vista should ease the pain a bit, since there is a very easy way to support any locale that one wants to create.... especially as one can get all the language support one desires, from locales to fonts to keyboards, and so on....

This post brought to you by ௲ (U+0bf2, a.k.a. TAMIL NUMBER ONE THOUSAND)

• #### Subsets of subsets of subsets of subsets of subsets

The big master list of locales that Microsoft has assigned LCID values for is quite large and even includes ones like Yiddish (0x043d) that are unlikely to be added to the Windows locale list any time soon.
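For anyone who has not pulled an LCID apart before, here is a minimal Python sketch of how a value like 0x043d splits into its pieces. The bit layout mirrors the Win32 LANGIDFROMLCID, PRIMARYLANGID, SUBLANGID, and SORTIDFROMLCID macros; the function name `split_lcid` is made up for this illustration.

```python
def split_lcid(lcid):
    """Split an LCID into (sort_id, sublanguage, primary_language).

    Bit layout assumed here: bits 0-9 primary language, bits 10-15
    sublanguage, bits 16-19 sort ID (the remaining bits are reserved).
    """
    lang_id = lcid & 0xFFFF        # same idea as LANGIDFROMLCID
    primary = lang_id & 0x3FF      # same idea as PRIMARYLANGID
    sub = lang_id >> 10            # same idea as SUBLANGID
    sort_id = (lcid >> 16) & 0xF   # same idea as SORTIDFROMLCID
    return sort_id, sub, primary


sort_id, sub, primary = split_lcid(0x043D)
print(hex(primary), hex(sub), hex(sort_id))  # 0x3d 0x1 0x0
```

So the Yiddish value above is primary language 0x3d with sublanguage 0x01 and the default sort.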

There is a subset of that big list, the official list of locales in Windows. I was going to post it till I saw that Kieran actually did already.

There is another not entirely matching subset of locales that are supported by Office for language support and document text tagging. Many locales in fact made it into the big list based on requests from Office.

Now there is a smaller subset of locales that Office supports proofing tools for. This to me is the coolest list since it is the one list with the power to help shape language in positive directions when things like standardized spellings are hard to come by.

There is another subset representing the locales supported by the .NET Framework (it is only smaller now because locales were not added to bring it up to the Vista list, but here is the list that it natively supports, without the Windows-only locales).

There is that weird subset/superset of locales supported by SQL Server for their collation support (subset because they folded many of them together and also because they did not update for Vista or even Server 2003, superset because they added a few collations to try to bring some up to the Server 2003 level), and then there is the subset supported by SQL Server's independent locale list.

Which brings me to Aldo Donetti's mail that he sent to me yesterday:

In SQL 2005 if you get the Collations and their LCIDs and group the latter, you’ll see there are approximately 46 LCIDs.

Now there’s a feature called Full Text index which also has some language support – but only for 16 LCIDs (+ a Neutral one)

So if anyone were to programmatically try to set the LCID on the Full Text index based on the DB/Table/Column collation, there’s a good 75% chance it will fail.

The list of languages supported for the Full text index is this:

• German
• English
• French
• Italian
• Japanese
• Korean
• Dutch
• Swedish
• Thai
• Simplified Chinese
• British English
• Chinese (Hong Kong SAR, PRC)
• Spanish
• Chinese (Singapore)
• Chinese (Macau SAR)
• + the “Neutral” one

So maybe chances to fail are less than 75% given some of the most used languages are in there, but Hindi is missing and so is Hebrew, Arabic, Cyrillic, Turkish and a bunch of others (30 overall)
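Taking the counts in Aldo's mail at face value, the back-of-the-envelope number is easy to check. This is only a sketch: it assumes each of the roughly 46 collation LCIDs is equally likely to be picked and that all 16 full text LCIDs are among them, neither of which the mail actually states.

```python
collation_lcids = 46  # approximate count quoted in the mail
full_text_lcids = 16  # count quoted in the mail, not counting "Neutral"

unsupported = collation_lcids - full_text_lcids
failure_rate = unsupported / collation_lcids

print(unsupported)             # 30
print(f"{failure_rate:.0%}")   # 65%
```

Under those assumptions the naive figure lands nearer 65% than 75%, which is consistent with the "30 overall" in the mail.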

Not sure how many people in the SQL division knew about this, but I (and a bunch of people in DevDiv, including our VP) know about it now. :-(

Notice how they weren't even upset that (for example) both Swedish and Finnish weren't even on their list of SQL Server collations, but were unhappy about the lack of overlap between collation support and full text support? :-)

So the actual "missing" list is even bigger if you come at it from the .NET Framework point of view, especially in 2.0 and later where the .NET Framework picks up all that is in Windows even if they don't support it natively.

The languages for which Index Server/SQL Server Full Text Search have word breakers and stemmers form yet another subset, a theoretically open one (since anyone can write one and the SDK tells people how) but in practice a mostly closed one given how difficult it is to write a word breaker/stemmer.

I guess it is easy to look at this mess and wonder how interoperability works across these products at all, isn't it? :-)

Though the changes to bring Windows and the .NET Framework into sync with Vista and .NET 2.0 are a good first step. The next steps would be to try to bring Office and SQL Server into the fold if we can.

If we are lucky, the only ones that will always remain semi-open subsets are proofing tools and full text search indexes, since they are the two that are not just adding a compatibility layer but require unique and difficult work. :-)

I think the fact that there is apparently a VP who has been made aware of all this might help us though (any big cross-division effort needs a champion!), so maybe I need to follow up with Aldo on this!

This post brought to you by ऩ (U+0929, a.k.a. DEVANAGARI LETTER NNNA)

• #### Where'd *that* language come from?

Now I have talked in the past about how the international settings of other accounts on the system can affect settings such as the keyboard list in the logon dialog and the actual setting of MS Shell Dlg and DEFAULT_GUI_FONT.

The truth is that these other accounts can have interesting additional effects on any services that run under them, and these just happened to be two of the more noticeable effects.

Until now, I mean. :-)

The question Kenichi asked was simple enough:

I am playing with Vista Ultimate and lang-pack and found some resources are not covered by MUI...

...the System Property dialog is not in French but in Japanese - Native Language of this installation.

Is this expected?

The question was asked of an alias I did not belong to, but luckily Tim Wegner of the MUI test team had the answer:

This is expected... ...The reason for the behavior you see is that the dialog runs on an impersonated thread which by default uses the language selected at install time (Japanese in your case).  The account you have set to French does not affect other accounts on the system.  You can use the “Copy to reserved accounts…” button on the Administrative tab of intl.cpl to get the dialog to use French.  Please be aware however that this will affect other accounts on your system that have not specifically set a language.

The fact that any dialog coming up under LUA a.k.a. UAC type elevation might show in a different language is going to be pretty obvious to anyone working on a machine with MUI installed.

Thankfully this will not always happen, especially in mixed apps like our own Regional and Language Options that handle resource loading more consistently. But it will pop up sometimes, and when it does, it can be jarring. And not in a good way, especially if you don't know the other language....

So even though this is currently considered by design, in the long run, I can imagine wanting to see this issue addressed by making it easier for these elevated dialogs to pick up the settings of the desktop they are sitting on top of even if they are not really interacting with it otherwise due to the isolation of these elevated processes. It simply makes sense for there to be such an option for resource loading, to allow things to feel more integrated.

This post brought to you by ஹ (U+0bb9, a.k.a. TAMIL LETTER HA)

• #### Your layout (in all likelihood) bores me

Now if I were Rory, I might try to take the fact in the title, combine it with some girlfriends of the past who were actually in a position and/or profession to have photographic layouts of themselves done as part of their career, and move into an extended analogy about how layouts in general do not impress me.

Of course I am not as eloquent as he is, and even if I were I don't think I'd be posting cheesecake photos to make my point on this, my MSDN blog. So I'll just skip all that and get to the actual technical bit and skip the extended analogy.

Random keyboard layouts don't impress me.

I mention this because I have gotten tons of mail on the subject, from people who hold patents on a layout that they believe will change the world (if only Microsoft wanted to buy it) to supporters of the Colemak keyboard layout (which used to have a Wikipedia link here, which went away when the article was deleted in accordance with their deletion policy) who think I'll jump on board and support those who are re-inventing the most intuitive keyboard layout.

Well, Microsoft ain't buying it and neither am I.

Microsoft Keyboard Layout Creator was created to address the single largest customer request that we had been getting.

Or to be more technically accurate, the requests put it on the roadmap; it was the personal pain that I and others had to go through in creating a bunch of keyboards for a big customer in a hurry that led to getting the tool done. :-)

And I am happy MSKLC is out there; I am even happier at being able to convince management that the update for MSKLC running on Vista is a requirement.

But that doesn't mean I am going to be excited about people's layouts. If people have something interesting to say then I am glad to hear it. But it isn't the layout or the reasons behind it that are going to fascinate me.

Hmmm..... that is kind of why the other kind of layout doesn't impress me either. Maybe I should have spent more time on the analogy. :-)

Ah well, too late now; the post is almost over. But please keep in mind that hearing about the super cool Spaceman Spiff keyboard layout that you created or know about, or that you scientifically proved would cure Lupus Erythematosus, is, while slightly more interesting to me than watching cars rust, slightly less interesting to me than watching paint dry.

Nothing personal....

This post brought to you by ⌨ (U+2328, a.k.a. KEYBOARD)

• #### When you ask how long it is, keep in mind that some guys may exaggerate their answer

A common question someone might have if they need to allocate a buffer is How much space do I need to allocate?

The question seems simple enough to ask, right?

Of course there are two different possible answers to the question when one is dealing with the non-Unicode versions of functions:

• Functions like GetDateFormat and LCMapString, when called with a NULL target buffer, will return the exact size required, since they call the Unicode version (which will allocate as needed to get the exact size) and then convert the resulting string back out of Unicode;
• Functions like RegQueryInfoKey and GetWindowTextLength, which will just assume two bytes for each WCHAR that the Unicode version reports and return a "maximum buffer size" rather than the actual buffer size.

Now the hint I gave above about all of the extra work that the NLS API does is the answer to the obvious question people ask about why they have to put up with getting an estimated answer rather than the actual answer -- there is a performance hit for that extra effort!

There are also plenty of people (like the SQL Server team) that hate the NLS "call twice" semantic due to the performance issue and the fact that we allocate. They prefer to use their own buffers and just call us once each time.

So clearly these two models each have a place in Windows (one is faster and less memory fragmenting, the other is more accurate), though admittedly the documentation could perhaps be clearer in some cases about which is which....
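To make the difference between the two answers concrete, here is a rough Python sketch that uses Python's own codecs as a stand-in for the conversion out of Unicode. Nothing here calls Windows; the helper names `exact_size` and `estimated_size` and the choice of code page 932 are assumptions made purely for illustration.

```python
def utf16_units(text):
    """Number of WCHARs (UTF-16 code units) in the string."""
    return len(text.encode("utf-16-le")) // 2


def exact_size(text, codepage):
    """Model 1: do the full conversion and measure the result."""
    return len(text.encode(codepage))


def estimated_size(text):
    """Model 2: skip the conversion and assume two bytes per WCHAR."""
    return 2 * utf16_units(text)


sample = "SO-DIMMソケット"
print(exact_size(sample, "cp932"), estimated_size(sample))  # 15 22
```

The estimate is cheap and never too small for this sample, but a caller who treats it as the real length will think the string is longer than it is.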

The rules are not always totally consistent, either; even within NLS, there is the NormalizeString function, which estimates (and in some extreme cases even returns a buffer size that is too small, a factoid that causes lots of grief for the FoldString function in Vista, which guarantees exact results even though for MAP_FOLDCZONE, MAP_PRECOMPOSED, and MAP_COMPOSITE it wraps NormalizeString).

Which just goes to show you that as in life you have to keep in mind when people might exaggerate when you ask them how long something is for reasons that they may or may not choose to make entirely clear to the people asking. :-)

This post brought to you by 𐨌 (U+10a0c, a.k.a. KHAROSHTHI VOWEL LENGTH MARK)

• #### Any database developers reading this?

Attention all database developers out there!

I have been somewhat dismayed at getting so few comments after I posted Wild[card] thing, You make my CHAR sing.

Is this really what people expect out of wildcards? Is there really so little interest in seeing the issue explained and documented? And maybe even future versions of query processors having the syntax expanded to support the notion that the engine underneath it does (in regards to "single characters that are actually single sort elements")?

It also affects Access (all versions) and Exchange/ESENT, yet it appears not to be documented anywhere....

Let me know what you think, our phone lines are open. :-)

This post brought to you by ࿄ (U+0fc4, a.k.a. TIBETAN SYMBOL DRIL BU)

• #### Judge not one area by performance in another?

So I was kind of re-reading DeMille's The General's Daughter yesterday.

I think the book goes best if you keep in mind Madeleine Stowe as Sara Sunhill and forget about John Travolta as Paul Brenner. The movie is kind of a distraction, otherwise. Plus Simon West took out everything beyond the romantic tension between the two characters, putting it all in the past. I know that movies cannot be 100% accurate to books, but I do hate when they take out whole plotlines.

I had really only read the book once before, but it was kind of vacation so I was relaxing with cheap fiction. So sue me.

Anyway, there was a part near the beginning that distracted me....

As she spoke, a slide projection screen behind her flashed images of ancient battles taken from old prints and paintings. I recognized "The Rape of the Sabines," by Da Bologna, which is one of the few classical paintings I can name. Sometimes I wonder about myself.

Distracted me why, you might ask?

Well, it's not like I have a degree in art history or anything, but I could have sworn that the Da Bologna associated with "The Rape of the Sabines" was a sculptor, not a painter. Wasn't he?

I stopped reading and went to the browser for a quick look.

Yep, sure enough Giovanni Da Bologna did indeed sculpt, not paint, The Rape of the Sabine Women. Several others, from Poussin to Rubens to David to Picasso had painted it, but Giambologna had sculpted it.

Hmmm.

My first thought, the one that distracted me, was similar to the character's -- the fact that I had remembered enough about the reference to be distracted? Sometimes I wonder about myself, too.

I got over that quickly (since I actually remember a whole bunch of other art, too) and then started wondering about the book. I took a lot of the little details in the book about the Army and about the CID as being somewhat accurate though I have no direct knowledge of such things. But now I wondered -- finding a mistake in a small unrelated and arguably insignificant piece I did know something about had cast the whole book into a small bit of doubt.

Maybe it was meant to do this, some sort of statement about Warrant Officer Paul Brenner, about the kind of person who had never been to Italy to actually see Giovanni Da Bologna's work but who had seen pictures of it and remembered them.

But that seemed awfully subtle, so maybe it was just a mistake.

Or maybe DeMille, a Vietnam veteran himself, had made the same error.

And how bad of a mistake was it, really? It is a powerful work of art even in a photograph.

This got me thinking about how often we do subconsciously (or sometimes even consciously) judge the work that people do that we do not understand by how they refer to the work that we do.

Internationalization is (interestingly enough) one area where I consciously don't do this, because there are plenty of software developers who I have a ton of respect for who can do really well in so many areas but who will easily make rookie mistakes in this particular area. I'd probably judge everybody in a negative light if I were unable to avoid this snap judgment.

It made me wonder about how often that snap judgment may be the wrong thing to do -- perhaps we need to not be so quick to assume that it is time to flip on the bozo bit about all areas just because someone makes a mistake in a particular one. It seems silly to judge someone for making a mistake related to someone's knowledge of art (a subject about which I myself actually know very little anyway) when I don't do the same thing about a subject I know a lot more about.

I mean, it seems almost ironic that the more qualified one is to judge something, the less likely they are to actually judge it, right?

So I decided to go back and read the book. Which seemed much like I remembered it.

And then I decided to try to consciously avoid that particular form of snap judgment in the future, when I could. Everyone deserves the chance to not be dismissed so easily, right? :-)

This post brought to you by ್ (U+0ccd, a.k.a. KANNADA SIGN VIRAMA)

• #### Strategic avoidance of stepping in a CrapFest

First I read Joel's Choices = Headaches. And then I read his follow-up (How many Microsofties does it take to implement the Off menu?) and of course the link to Moishe Lettvin's The Windows Shutdown CrapFest.

And I think about a project I am involved with that has as many strange interconnected pieces -- keyboards. And most especially the new "Text based TIPs" that I owe a whole bunch of posts on, which are almost done.

I sang a bit about the underlying organizational complexity I am referring to in my Who runs the keyboards? post, and I can assure you that the actual chain of organizational connections is even worse than the one Moishe laid out, involving the hardware team that is an entirely different division, folks in Shell, USER, Cicero, Setup, and NLS (plus the MUI dev lead who got involved due to his prior work). Things had the potential to be even worse, especially when you factor in the IME team that we also had to interact with, which means we were crossing continents, not just divisions.

We were definitely in different trees, and it could take months to see changes make it from one team's tree to all the others, especially in the earlier days of the project.

Thankfully, we took it in a different direction that stayed away from many of the problems that plagued the "Off menu"....

In our case, we provided the setup team with a library that did all the work and asked them to call it (this approach is used for all of the NLS/MUI functionality setup uses, and replaces the almost-as-good old solution where the setup team would just use intl.cpl).

We still aren't meeting with the hardware folks, even though we promised to do it more often. Darn, we should work on that!

The "Cicero" (Text Services Framework) team and the USER team thankfully were put in the same org, so I could talk to someone on Hiro's team as easily as I could talk to someone on Yutaka's (like Kazuhiko), and although our source trees are in vastly different places in the hierarchy, we would regularly work with each other's private drops to verify that changes worked properly. And all of us hated process, and all of us wanted to minimize change and minimize the impact to test. So on each side me providing copies of the text file I was developing and them providing me with updated versions of TableTextService.dll, sometimes input.dll, and occasionally msctf.dll made everyone's life easier. And the same applied to the shared pieces that were used by setup that we were providing a library for that used their library as well.

We did not schedule standing meetings ever -- mainly we just sent mail when we needed something, just like they did. There was a brief panic on the setup piece of this due to a PM liaison getting kind of random and messing up what all of us thought was going on, but that PM left and everyone on both sides worked to fix up this problem.

In the end we shipped. :-)

I still owe a blog post on the story about the Amharic input method. I promise I'll try to get to it this upcoming week. It is part of a "Vista non-hero" story I have actually been working on....

So, back to the non-nightmare for a moment. How did we avoid the same kinds of problems that plagued the "Off Menu"?

Well, I think we avoided the layer of process involving:

• having our leads come with us to every meeting (we just told them later what was happening!);
• having regular meetings at all when email and private builds can do the job for us;
• involving the organizational tree that would have had to pivot on Jim Allchin (since he is the first shared management contact we all had!);
• building up tons of process based on huge plans and goals rather than just trying to solve the problems without overthinking them;
• having multiple points of contact from every team with every other (we tried to have one act as a liaison when possible).

Which is not to say that all of this is what happened in the "Off Menu" case, I guess I am thinking more of the problems that would have led us into similar troubles.

The moral of the story? There were teams in Vista that also worked hard to stay focused and get the job done, in addition to some reportedly pathological cases. None of the teams involved were any better or worse than any other team, and we were not alone.

This post brought to you by ፑ (U+1351, a.k.a. ETHIOPIC SYLLABLE PU)

• #### Math in Unicode is hard. So let's have Murray make it easier!

I have talked about math in Unicode before, like in For those who enjoy mathematics (or, 'Also new in Vista').

I figured it is only fair to point out that Murray Sargent, who I mentioned in that post, has a blog, and that it is on the list of blogs I read....

Murray also works on RichEdit and has already done several fascinating posts like Some RichEdit History and LineServices, that have quite a few international implications.

Welcome, Murray!

This post brought to you by 𝖺 (U+1d5ba, a.k.a. MATHEMATICAL SANS-SERIF SMALL A)

• #### Punctuation... now, isn't that SPECIAL [weights] ?

Well, apologies to Dana Carvey and all, but actually, it isn't!

The other day when I talked about The problem of string comparisons, WORD sorts, and the minus that is treated like the hyphen, Oleg was thinking about the documentation for sort keys in LCMapString and maybe even about my post How do sort keys work?. And he realized that prior mappings that explained how sort keys worked had a flaw in them. After I pointed out that special weights were actually for some particular differences in Kana, Oleg commented:

Then example string could be SO-DIMMソケット×2.

The sort key for this string is:

0e 91 0e 7c 0e 1a 0e 32 0e 51 0e 51 22 16 22 0d 22 1c 22 1e 08 1c 0c 33 01 01 12 12 12 12 12 12 01 c6 c6 c4 ff 02 c4 c4 c4 c4 ff ff 01 80 0f 06 82 00

In this key "Special weights" piece is:

c6 c6 c4 ff 02 c4 c4 c4 c4 ff ff

and the "punctuation weights" piece is:

80 0f 06 82

And the LCMapString with the LCMAP_SORTKEY flag stores a sort key in the buffer, as an array of byte values in the following format:

[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x01 [Punctuation weights] 0x00

"Punctuation weights" piece is specific for WORD sort and contains information about hyphen/apostrophe characters.

Is this correct description now?

Indeed, Oleg's description is correct here. Because punctuation weights are not special weights. And THIS is how sort keys really work....

• UW -- PRIMARY weights -- Unicode a.k.a. alphabetic weights, in almost all cases two bytes per sort element
• DW -- SECONDARY weights -- Usually diacritic weight, but also other second order distinctions
• CW -- TERTIARY weights -- Usually case weight, but also other third-order distinctions like final forms
• SW -- QUATERNARY weights -- Specific Kana differences, sometimes internally thought of as "Extra weights" or XW.
• PW -- QUINARY weights -- Specific punctuation differences in WORD sorting
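The five-level layout above can be checked against Oleg's sample key with a few lines of Python. This is just a sketch that assumes the layout exactly as quoted (fields separated by 0x01, the whole key terminated by 0x00, and neither byte showing up inside a field); `split_sort_key` is a name invented for this example and not any Windows API.

```python
SORT_KEY_HEX = (
    "0e 91 0e 7c 0e 1a 0e 32 0e 51 0e 51 22 16 22 0d 22 1c 22 1e "
    "08 1c 0c 33 01 01 12 12 12 12 12 12 01 c6 c6 c4 ff 02 c4 c4 "
    "c4 c4 ff ff 01 80 0f 06 82 00"
)

LEVELS = ("UW", "DW", "CW", "SW", "PW")


def split_sort_key(key):
    """Split a sort key into its five weight fields.

    Assumes fields are separated by 0x01, the key ends with 0x00,
    and neither byte appears inside a field.
    """
    if not key.endswith(b"\x00"):
        raise ValueError("sort key is not 0x00-terminated")
    fields = key[:-1].split(b"\x01")
    if len(fields) != len(LEVELS):
        raise ValueError("expected five weight fields")
    return dict(zip(LEVELS, fields))


fields = split_sort_key(bytes.fromhex(SORT_KEY_HEX))
for name, value in fields.items():
    print(name, value.hex(" "))
```

For this key the SW field comes out as c6 c6 c4 ff 02 c4 c4 c4 c4 ff ff and the PW field as 80 0f 06 82, matching the pieces Oleg picked out, while the DW field is empty.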

I'll try to get onto updating the documentation to give a slightly abbreviated version of this in future updates. :-)

This post brought to you by ソ (U+30bd, a.k.a. KATAKANA LETTER SO)

• #### Even if they aren't talking international, at least they aren't stomping international!

Now I am no stranger to keyboard shortcuts -- in fact if you look at posts like these, you might think it's obvious that I have been talking about best practices surrounding shortcuts as they relate to language and as they relate to keyboards off and on for years now....

So I am of two minds when I see that over on the Shell Blog they have put together a quick reference with the Shell shortcuts in the post Do things faster with Keyboard Shortcuts.

On the one hand, I have always loved the fact that the Shell doesn't stomp all over the bulk of the ALTGR keystrokes like some other nameless Microsoft products that I could mention....

But on the other hand, plenty of these shortcuts only give the Latin letters for the keystroke combinations, something that ignores the fundamental fact that there are a helluva lot of people who have other letters printed on those keys. It seemed like it would at least be worth a quick note. :-(

Oh well, at least they aren't stomping on people's keyboards in the name of usability. :-)

This post brought to you by K (U+004b, a.k.a. LATIN CAPITAL LETTER K)

• #### No comment, other than

No comment about this, other than that (and to agree with Michen's point that no good deed goes unpunished).

• #### Whither U+0081?

Bart asks in the Suggestion Box:

Why does WideCharToMultiByte(CP_ACP, ... accept 0x0081 and turn it into 0x81 even though the ACP is cp1252 and 0x81 is not defined?

http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx

These tables could be thought of as more of an idealized notion of the code pages, perhaps? :-)

The more accurate and up-to-date tables are actually posted up on the Unicode site; the Cross Mapping Tables link to the various vendors like Microsoft, including the "Best Fit" tables off of this page.

The 1252 "WindowsBestFit" table there has the entry in question (just as Windows has had it, for quite some time).

About the only real benefit to such a mapping even existing is that it will allow a C1 Unicode control character to roundtrip. Dubious benefit, to be sure, but too late to change at this point....
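The difference between the two kinds of table is easy to see outside of Windows, too. In the Python sketch below, Python's own cp1252 codec plays the part of the "idealized" table (it leaves 0x81 undefined), and a one-entry dictionary stands in for the extra best-fit entry. The names `C1_PASSTHROUGH`, `encode_table_only`, and `encode_best_fit_style` are made up for the example, and none of this is how Windows actually implements the conversion.

```python
# Hypothetical stand-in for the extra entry in the best-fit table.
C1_PASSTHROUGH = {0x0081: 0x81}


def encode_table_only(text):
    """Encode using only the mappings in Python's cp1252 codec."""
    return text.encode("cp1252")


def encode_best_fit_style(text):
    """Same, but let U+0081 pass through to byte 0x81."""
    out = bytearray()
    for ch in text:
        if ord(ch) in C1_PASSTHROUGH:
            out.append(C1_PASSTHROUGH[ord(ch)])
        else:
            out += ch.encode("cp1252")
    return bytes(out)


try:
    encode_table_only("\u0081")
except UnicodeEncodeError:
    print("table only: U+0081 has no mapping")

print(encode_best_fit_style("a\u0081b"))  # b'a\x81b'
```

With the extra entry in place, the control character survives the trip to bytes, which is the roundtrip benefit described above.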

This post brought to you by U+0081, a C1 control character
