Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
Please read disclaimer; content of Michael Kaplan's blog not approved by Microsoft!
Once upon a time, everybody knew that the Earth was flat.
And not too long after that, everybody knew that the Sun orbited around the Earth instead of the other way around.
And most developers know that there is no way to get Unicode output support in the console that will work properly both in the console and when redirected to a file.
I was just looking at some internal guidelines for developers inside of Microsoft that bemoaned all the things that don't work in the console:
The main limitations can be summarized as:

1. Unicode I/O with ReadConsoleW/WriteConsoleW is always supported; but default raster font will break it.
2. Changing console codepage is always possible, but DBCS codepages are supported under DBCS system locales (not necessarily matching).
3. The console does not support complex script languages such as Arabic or the various Indic languages that can only be rendered with Uniscribe.
4. Unicode I/O is supported through Win32, but with several limitations:
   - a user has to select a truetype font in console;
   - no redirection of input/output is supported;
   - CJK languages are supported only under CJK system locales.
Now of these caveats that everybody knows, most of them are not true!
And not too long ago someone sent me a piece of mail about that second sub-bullet in #4 and the code he wrote to work around it:
I think I’m mostly ready. However, I’m still not quite certain about proper handling of redirection of output to a file. The site above suggests the option of using WriteConsole for normal output, and WriteFile+BOM for file output; this is the direction I’ve tried, but I wanted to get confirmation that I’m doing it correctly.

I’ve written the following function (mostly copied from other code):
    static const WCHAR UNICODE_BOM = 0xFEFF;

    void UPrint (LPCWSTR String)
    {
        DWORD ConsoleMode;
        BOOL ConsoleOutput;
        DWORD FileType;
        BOOL Result;
        HANDLE StdOut;
        DWORD StringCharCount;
        DWORD Written;

        //
        // StdOut describes the standard output device. This can be the console
        // or (if output has been redirected) a file or some other device type.
        //
        StdOut = GetStdHandle(STD_OUTPUT_HANDLE);
        if (StdOut == INVALID_HANDLE_VALUE) {
            goto PrintExit;
        }

        //
        // Check whether the handle describes a character device. If it does, then
        // it may be a console device. A call to GetConsoleMode will fail with
        // ERROR_INVALID_HANDLE if it is not a console device.
        //
        FileType = GetFileType(StdOut);
        if ((FileType == FILE_TYPE_UNKNOWN) && (GetLastError() != ERROR_SUCCESS)) {
            goto PrintExit;
        }

        FileType &= ~(FILE_TYPE_REMOTE);
        if (FileType == FILE_TYPE_CHAR) {
            Result = GetConsoleMode(StdOut, &ConsoleMode);
            if ((Result == FALSE) && (GetLastError() == ERROR_INVALID_HANDLE)) {
                ConsoleOutput = FALSE;
            } else {
                ConsoleOutput = TRUE;
            }
        } else {
            ConsoleOutput = FALSE;
        }

        //
        // If StdOut is a console device then just use the UNICODE console write
        // API. This API doesn't work if StdOut has been redirected to a file or
        // some other device. In this case, write to StdOut using WriteFile.
        //
        StringCharCount = (DWORD) wcslen(String);
        if (ConsoleOutput != FALSE) {
            WriteConsoleW(StdOut, (PVOID)String, StringCharCount, &Written, NULL);
        } else {
            //
            // Write out a Unicode BOM to ensure proper processing by text readers
            //
            WriteFile(StdOut, (PVOID)&UNICODE_BOM, sizeof(UNICODE_BOM), &Written, NULL);

            //
            // The number of bytes to write to standard output must exclude the null
            // terminating character.
            //
            WriteFile(StdOut, (PVOID)String, (StringCharCount * sizeof(WCHAR)), &Written, NULL);
        }

    PrintExit:
        return;
    }
Based on a couple quick tests, this seems to do the right thing, but review from someone more familiar with the area would be much appreciated. :-)

Thanks!
Well, at the time the only comment he had gotten back was that it was a little odd that the BOM was being written on every call, since it should only be at the beginning of the file.
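To make that comment concrete, here is a minimal portable sketch of the "BOM once per stream" idea. It is not the code from the mail and it is not Win32: the Utf16Writer struct and the utf16_write function are hypothetical names made up for illustration, and the sketch assumes the stream was opened in binary mode so that no newline translation touches the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-stream state: remembers whether the BOM went out yet. */
typedef struct {
    FILE *stream;
    int bom_written;   /* 0 until the first write has emitted U+FEFF */
} Utf16Writer;

/* Writes one UTF-16 code unit as two little-endian bytes. */
static int put_unit_le(FILE *f, uint16_t unit) {
    unsigned char b[2];
    b[0] = (unsigned char)(unit & 0xFF);
    b[1] = (unsigned char)(unit >> 8);
    return fwrite(b, 1, 2, f) == 2 ? 0 : -1;
}

/*
 * Writes `count` UTF-16 code units as little-endian bytes.
 * The BOM is emitted only ahead of the first write on this writer.
 * Returns 0 on success, -1 on a write error.
 */
int utf16_write(Utf16Writer *w, const uint16_t *units, size_t count) {
    size_t i;
    assert(w != NULL && w->stream != NULL);
    if (!w->bom_written) {
        if (put_unit_le(w->stream, 0xFEFF) != 0) {
            return -1;
        }
        w->bom_written = 1;
    }
    for (i = 0; i < count; i++) {
        if (put_unit_le(w->stream, units[i]) != 0) {
            return -1;
        }
    }
    return 0;
}
```

Keeping the flag next to the stream, rather than in a static inside the print function, is just one way to do it; it lets each redirected handle get its own single BOM.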
Anyway, remember the other day in Some armchair root cause analysis of the suckage of lstrcmpi, when I mentioned that STL dropped by and we were talking about stuff?
One of the things I mentioned was this problem, and related ones like how you had to use binary mode to write out Unicode text with the CRT functions, thus losing all of the newline and line semantics. He agreed this was lame.
Then last night he showed me, in both Visual Studio 2005 and 2008 (well, Visual C++ 8.0 and 9.0), that it was not true!
Basically he created a file something like this:
    #include <fcntl.h>
    #include <io.h>
    #include <stdio.h>

    int main(void) {
        _setmode(_fileno(stdout), _O_U16TEXT);
        wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
        return 0;
    }
And then I compiled the file in the Visual Studio 2005 command line (where I already had my console font set to Lucida Console):
cl /W4 foo.c
And that FOO.EXE worked beautifully, outputting the Cyrillic and Ideographic text (кошка 日本国) without corruption, to the command line....
And when I redirected it to a file, it worked then too, writing out the file as Unicode!
Notepad opened it fine and detected it as Unicode even without the BOM; I did have to save it in Notepad to have the console see it that way when I used the type command in the console.
Here was that console window:
When I copied those boxes from the console and pasted them into Notepad, I once again got the Unicode text both times.
So much for conventional wisdom. All that WriteConsoleW blitting, the binary file mode, the chcp, the console output CP crap. All to get an answer not as cool as the above.
The Earth? It isn't flat.
The sun? It doesn't orbit around the Earth.
And the CRT? Starting in 2005/8.0, it knows more about Unicode than any of us have been giving it credit for....
The heroes of the day? _O_U16TEXT and _O_U8TEXT, which I will probably talk about more at some point.
Or you can look at the _setmode and _wsopen topics (the latter is the only place that _O_U16TEXT and _O_U8TEXT seem to be mentioned):

    _O_U16TEXT  Open the file in Unicode UTF-16 mode. This option is available in Visual C++ 2005.
    _O_U8TEXT   Open the file in Unicode UTF-8 mode. This option is available in Visual C++ 2005.
    _O_WTEXT    Open the file in Unicode mode. This option is available in Visual C++ 2005.
But it works right here too.
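For a feel of what the difference between a UTF-16 mode and a UTF-8 mode means at the byte level, here is a small portable sketch that encodes the same code units used in the wprintf call as UTF-8. This is not the CRT's implementation; utf8_encode_bmp and utf8_total_bytes are hypothetical helpers, they assume each input unit is a standalone BMP character (no surrogate pairs), and they do not model any newline translation a text mode might perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encodes one BMP code unit as UTF-8 into out[0..2].
 * Returns the number of bytes produced (1, 2 or 3).
 * Assumption: `unit` is not half of a surrogate pair.
 */
size_t utf8_encode_bmp(uint16_t unit, unsigned char out[3]) {
    assert(out != NULL);
    if (unit < 0x80) {
        out[0] = (unsigned char)unit;
        return 1;
    }
    if (unit < 0x800) {
        out[0] = (unsigned char)(0xC0 | (unit >> 6));
        out[1] = (unsigned char)(0x80 | (unit & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (unit >> 12));
    out[1] = (unsigned char)(0x80 | ((unit >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (unit & 0x3F));
    return 3;
}

/* Sums the UTF-8 byte count for an array of BMP code units. */
size_t utf8_total_bytes(const uint16_t *units, size_t count) {
    unsigned char scratch[3];
    size_t total = 0;
    size_t i;
    for (i = 0; i < count; i++) {
        total += utf8_encode_bmp(units[i], scratch);
    }
    return total;
}
```

Under those assumptions the Cyrillic letters take two bytes each and the ideographs three, where UTF-16 spends two bytes on every one of them.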
Awesome, truly.
And conventional wisdom is quite retarded!
This post brought to you by U (U+0055, aka LATIN CAPITAL LETTER U)
Everybody hates Microsoft.
Well, not everybody.
But hating Microsoft seems awfully popular....
It seems like to try to be the best at anything you have to make choices that lots of people won't like. And then before you know it, people are hating you.
Everyone hates what Microsoft does with the BOM (Byte Order Mark). That thing I talked about in Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!).
Lots of people hate it so much that they will complain about it when it is not completely on topic, like in that other post (unicodeFFFE... is Microsoft off its rocker?).
But I feel I must ask one question.
Why are people writing their UNIX Shell scripts in Notepad such that the issue of Notepad saving the BOM in UTF-8 is such an issue?
I mean, people who are writing UNIX shell scripts are not guaranteed to be among the Microsoft haters, but all things being equal they are probably more likely to be than the people who pay their own fees to go to TechEd or PDC.
So why are they writing their UNIX shell scripts in Windows Notepad, exactly?
I'd just like it if someone could explain this one. It just makes no sense to me....
This post brought to you by U+fffe, a permanently reserved code unit in Unicode so that BOM determination can remain easier....
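For anyone wondering why a shell script would care: the "#!" has to be the very first two bytes of the file, so three bytes of UTF-8 BOM in front of it get in the way. Here is a minimal sketch of spotting a BOM at the start of a buffer so it can be skipped. BomKind and detect_bom are hypothetical names, and the sketch deliberately ignores UTF-32 (whose little-endian BOM starts with the same two bytes as the UTF-16 one).

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } BomKind;

/*
 * Looks for U+FEFF at the start of `buf` in its UTF-8, UTF-16LE and
 * UTF-16BE byte forms. Stores the BOM length in *bom_len (0 if none).
 * Assumption: UTF-32 input is out of scope for this sketch.
 */
BomKind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len) {
    assert(buf != NULL && bom_len != NULL);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16_LE;   /* bytes of U+FEFF, low byte first */
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16_BE;   /* bytes of U+FEFF, high byte first */
    }
    *bom_len = 0;
    return BOM_NONE;
}
```

The two UTF-16 cases can be told apart only because the byte-swapped value, U+FFFE, is never a real character.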
Pronunciation based sorting for Traditional Chinese is commonly requested from people, and the requests fall within two broad categories:
Cantonese pronunciations -- The requests have been for Pinyin style orderings, based on the Cantonese pronunciations for the ideographs. As I mentioned in The Cantonese IME, however, there is no accepted single transliteration standard to map ideographs to Cantonese pronunciations, and accepting schemes like Jyutping would leave the system only of use to some people, and it is difficult to assess whether the benefits would outweigh the costs.
Mandarin pronunciations -- Again the requests have been for a Pinyin style ordering, but of pronunciations of Traditional Chinese ideographs. The data for this also does not exist (we only currently have Bopomofo pronunciation data for ~48,566 ideographs, based on a Taiwanese standard that could be missing ideographs used in Hong Kong and Macao). And as I mentioned in Is it Macau or is it Macao?, it is unclear how close to expected results the pronunciations provided by China (for the 62,289 ideographs they covered) would be in these other markets. Mandarin is Mandarin, obviously -- but it is unlikely that there is enough in the way of standards here to guide what the expected pronunciations would be.
Given the problems that seem to exist for stroke-based sorts (ref: How bad does it need to be in order to be not good enough, anyway?), perhaps I can be forgiven my skepticism for the ability of the PRC-provided Mandarin pronunciations to match what they would expect in Taiwan or Macao....
There are interestingly even some people in Taiwan who have expressed interest in a Mandarin-style ordering. This is theoretically easier to do by mapping Bopomofo to Pinyin if one can come up with agreed upon ways to do that mapping, e.g. taking the first twelve entries in the Bopomofo table which all have a Bopomofo pronunciation of ㄅㄚ (BOPOMOFO LETTER B + BOPOMOFO LETTER A), with the PRC-provided Pinyin provided for reference and to prove that using the data as is may not match expectations:
Now obviously one could take all 48,566 ideographs and use this information to produce Pinyin-esque letters, and if one goes further down the table other Bopomofo is there, including many that include tone marks:
In the table of data, the first tone is never included, so mapping as follows is easy enough:
ㄅㄚ ---> ba1
ㄅㄚˊ ---> ba2
ㄅㄚˇ ---> ba3
ㄅㄚˋ ---> ba4
ㄅㄚ˙ ---> ba5
Thus it seems like it would be quite easy to do the transformation.
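The tone part of that mapping can be sketched in a few lines. This is only an illustration of the five-row table above, not anything from the actual collation data: bopomofo_tone_number and pinyin_with_tone are hypothetical names, the input is assumed to be UTF-8, the tone mark is assumed to be the last character of the syllable (as it is in the rows above), and the caller is assumed to supply the Latin letters ("ba") from some separate lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Returns the Pinyin-style tone number (1-5) for a UTF-8 Bopomofo syllable,
 * judged only by a trailing tone mark. No mark means first tone.
 */
int bopomofo_tone_number(const char *syllable) {
    const unsigned char *tail;
    size_t len;
    assert(syllable != NULL);
    len = strlen(syllable);
    if (len < 2) {
        return 1;
    }
    tail = (const unsigned char *)syllable + len - 2;
    if (tail[0] != 0xCB) {
        return 1;              /* all four marks share the lead byte 0xCB */
    }
    switch (tail[1]) {
        case 0x8A: return 2;   /* U+02CA, second tone */
        case 0x87: return 3;   /* U+02C7, third tone */
        case 0x8B: return 4;   /* U+02CB, fourth tone */
        case 0x99: return 5;   /* U+02D9, neutral tone */
        default:   return 1;
    }
}

/*
 * Formats `letters` plus the tone digit into `out`, e.g. "ba" -> "ba4".
 * Returns what snprintf returns.
 */
int pinyin_with_tone(const char *letters, const char *syllable,
                     char *out, size_t out_size) {
    return snprintf(out, out_size, "%s%d", letters,
                    bopomofo_tone_number(syllable));
}
```

The hard part, of course, is not this digit; it is agreeing on the letters and on which pronunciation comes first.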
So at least in Taiwan the resultant Pinyin-esque sort might be just what people are looking for, and whether that matches what other people expect would have to be determined to see if it would be a useful ordering in Macao or Hong Kong or parts of China where Traditional Chinese is preferred.
The overall problem of applicability in Hong Kong and Macao is really just another piece of the same puzzle that came up before -- without information it is hard to fathom how relevant the data would be.
Plus the political issues inherent with providing a Pinyin-esque sort for Taiwan because even with some people thinking it a good idea there may be just as many who could fear the long term consequences of such a thing, not to mention that it would bring differences between the PRC data and the Taiwan data into much sharper focus!
I think it might be interesting to have some people in Macao and Hong Kong look at all of the differences between the PRC Pinyin data and the transformed Bopomofo (such as the four examples I gave) and determine which one they would expect to be the most common pronunciation. If their expectations veered toward the Taiwan choice then a sensible Traditional Chinese Pinyin pronunciation could emerge for any/all countries using Traditional Chinese, though the characters needed for Mandarin (if any) in those other countries would probably also have to be added, too....
At this point I remind myself that I don't work on collation in Windows anymore, which does make pie-in-the-sky speculations thinking about future version support a less than useful endeavor....
It is fun to think about potential solutions to the problems from time to time, though.
This blog brought to you by ㄅㄆㄇㄈ (U+3105 U+3106 U+3107 U+3108, aka BOPOMOFO LETTERS B P M F)
Content of Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)! Plus it is entirely off-topic so if that kind of thing bothers you, then you should just move on....
I have found myself listening to Louis XIV a bit more than I ever actually was before.
I mean, I had their first album (The Best Little Secrets are Kept) and I thought the cover was really out there (more on this in a moment). And I vaguely recall having listened to it.
I didn't have their second album until very very recently.
Even now, Louis XIV is a guilty pleasure for me. I think I like them for the same reason I like South Park -- because even though I have trouble imagining I might enjoy hanging out with the band, I enjoy the lines and the double entendres. I feel evil for enjoying it, but I do. But then I feel evil for laughing at some of the stuff in South Park, too!
It was new virtual friend Samantha ("You can call me Sam. Do I have to call you Mr. Kaplan? Well, you are like nearly half my age... just kidding!") who got me listening to them a bit more closely.
Though I think her interest is different than mine -- I am in it for the lyrics, while her interest seems like Rachel Green's interest in going to soap opera parties or Penny Lane's interest in Gladwell -- she wants to be a groupie, to hang out with the band....
I have yet to find a woman who enjoys Louis XIV music who doesn't fall in the groupie category, though I admit I haven't asked most of my female friends up until now what they think, given my embarrassment; perhaps this blog will stimulate a conversation or two! :-)
Here is the "Walmart cut" of the CD cover (technically it is even "cleaner" than the Walmart one, actually -- I trimmed further up than they did!):
The actual cover shows a bit more of Karen Miller (the model on the cover) than this picture does. I explained to Sam that the album I had actually contained two extra songs (It's the Girl That Makes Him Happy and The Grand Apartment) but they probably couldn't get Karen to come back in to get redrawn on -- especially not with the "font" size used since there wasn't space for two more songs!
Sam was actually really impressed that I knew who the cover model was; she thought that maybe I knew the band.
She seems like she could easily enjoy being a groupie for these guys! Truly!
I had to let her down on that count -- I had just read a backstory piece about it entitled The Making of THE BEST LITTLE SECRETS ARE KEPT Design, which I remembered vaguely from back then because of the fact that the Parental Advisory was given to the album solely due to the cover art, not the music. From the article:
The cover ended up taking on a life of its own once in public. It set the tone for interviews and reviews. The Best Little Secrets are Kept is apparently the first album to ever have a e content sticker solely for its artwork. The content of the CD is the same for both versions of the CD. The only difference is a little ass crack on the cover and back. That's an uptight country. A friend of mine informed me that the cover had ended up in the design magazine PRINT. He said it didn't have a very kind write-up. "Sexism isn't the worst thing about John Hofstetter's design," it read. They said I had blatantly and inappropriately mimicked the stencil titling on the West Side Story soundtrack. We also copied the cover of an Eric Clapton album. EC Was Here has the title written on a girl's back in lipstick with a similar crop to ours. The Eric Clapton album has a horrible image and is an unmemorable cover. I like Eric Clapton and I had never seen the cover. It did seem similar to the Clapton cover, but West Side Story? Come on. We didn't copy either. But this wasn't the first time it was called sexist. I'm not going to speak for the music, only for the design. I don't really understand the sexist claim. There's a girl's back with writing on it. I think it's a tasteful image. I don't consider myself sexist. I guess sexism is the act of objectifying woman. I suppose you could make that case. Maybe we are sexist. I don't know. Does sexy always have to be sexist?
The whole notion stuck with me and I remembered the title, so the article was an easy Google search away.
Irregardless, Samantha seemed kind of impressed by the fact that I remembered what I did, but then again she found my blog when (in her words) Googling for "Love Monkey" "Barbarian Brothers", and was briefly convinced that she (like the young lady I have mentioned here previously) had fallen in love with my blog, though (with my help) within 1.5 conversations she realized (to quote Aimee Mann when she spoke about Noel Gallagher at a show a few years back):
This next song I wrote about Noel Gallagher from Oasis, because I had a crush on Noel Gallagher from Oasis, about eight years ago when they first came out, just pre-rock stardom. And actually of course the crush was almost entirely fueled by barely knowing him and having nothing in common. Which always seems like such a romantic idea, like "well the thing they have in common is the love part" but that really doesn't do you much good when he is passed out on the floor from sniffing glue or something. Not implying that Noel Gallagher himself would be passed out, but you get the point.... So I had this crush on Noel Gallagher. I had met him a few times, and I was in Boston and he was in New York; Oasis was playing in New York. I actually got on a plane to go down and see if I could hang out with him, which was totally pathetic. I didn't even have his phone number! It wasn't even like "come on down and hang out." It was like I was going to show up and then go "Hi, remember me?" so it was as pathetic as you can imagine. But we did wind up at some big rock and roll party with lots of like hot girls and drugs later and I just thought "This is, this is not about him. This is about me...."
that it wasn't really about me. It was just about my blog. :-)
The other bit was about a Samantha Ronson song too, which is also themed on a situation it is best to avoid when possible -- her thinking she is in love, convincing him of it, and then realizing she was just infatuated -- which leads to him blaming her and leaving her wondering if it's a flaw in her since it seems to happen every time. What can she do but start over and try again?
I have a friend who I really identify with that song -- she told me that she seemed to be in a rut lately, unable to make a relationship last longer than two months, each time she was the one who did the breaking up. She really does get left wondering if she is to blame, for real.
This all came up around the time that Sam asked me if I wanted to go to Vampire Weekend with her and her friends (after I mentioned in a blog that I had given away the tickets I had and then I saw them on SNL and thought maybe I should go but before I convinced her I wasn't actually her type and that I was almost old enough to be her father and that we have so very little in common other than musical tastes and mostly not even that).
I decided not to go, I am sure there will be some other time I'll see them. There is always more music....
Hell, I will be seeing Nada Surf on Leno in less than half an hour and possibly kicking myself for passing on Showbox tickets to their show on the 27th, too. So there is a pattern -- I need to see more shows!
In any case, I enjoy lots of the lines in Louis XIV songs, even if some others of them make me almost a bit uncomfortable -- actually, just like some of those South Park episodes now that I think about it. Take Pledge of Allegiance for example and the lyrics there. We are insane collectively about the Parental advisory crap in this country, but being worried about a bit of a girl's ass on the cover and somehow not worried about lines like
We don't have to go to the pool
If you want me to make you wet
and so on (personally I don't mind either one, but the inconsistency of the advisory stuff just annoys me). I have no kids but if I did I think the latter would concern me a hell of a lot more than the former -- everyone has an ass, but not everyone has to talk like that in public. Which of the two are kids more likely to emulate -- being models doing nude shoots? Or quoting naughty lyrics?
Tons of double entendres, obviously mostly about sex but some of it about more than that, too.
They gave a great almost closing line to my conversation with Samantha, after I begged off of Vampire Weekend:
Her: Ignoring the age thing, why not just go for it?
Me: I'm not really looking for that right now, I'm just not in that place.
Her: Are you seeing somebody?
Me: Um, no.
Her: Would you tell me if you were?
Me: Hmmm. Good question. Perhaps not.
Her: Why?
Me: If it's not in my blog, it's a secret, and...
Both: The best little secrets are kept.
Me: Exactly!
Whoa, song lyrics and album title. Cool!
Enjoy Vampire Weekend and Nada Surf, those of you who are going, and if you go be sure to let me know if I should be jealous of you kicking myself for passing or thanking my lucky stars I was washing my hair those nights or whatever....
This blog brought to you by ㊙ (U+3299, aka CIRCLED IDEOGRAPH SECRET)
The story behind today's blog post started in Why do we call w 'double u' -- doesn't it look more like a 'double v'?, where I talked about the Swedish Academy's change to the way the letters W and V were to be handled in collation, and the impact on Microsoft software when this change eventually makes it to the point where it needs to be integrated.
In particular I discussed the implications for software such as Jet [Red] and SQL Server, which unify Swedish and Finnish collations together into one single collation, when the two collations become different. Obviously there are far-reaching implications here that have to be carefully considered.
And one day this "theoretical" issue, a punch line in a blog post from Raymond or me, would have far-reaching design consequences....
A couple of days later, I noted in The disunification of Norwegian and Danish sorting that there was actually already such a change -- a change in Norwegian collation that was not happening in Danish (two collations which had also been unified in Jet [Red] and SQL Server), and one that had already gone into Vista.
Obviously with each of these changes it is only a matter of time before the issue has to actually be addressed, but neither is theoretical -- like pregnancy, at some point the situation will force itself on people and cannot be ignored. And to continue the ridiculous pregnancy analogy for just this one more sentence you are reading now, the Norwegian "baby" is much further along than the Swedish one....
Now on top of all of this, factor in Not all in sync quite yet (aka SQL and the CLR and Windows and .NET), which spent some time talking about the consequences of these various collation solutions that move further and further apart. And in particular looked more specifically at the Danish/Norwegian disunification issue and SQL Server, at the collations covering the two languages in SQLS 2005, and what to worry about going forward....
Now of course if you care about SQL Server, you probably noticed (though when I say this it is with full knowledge that at least one regular reader to whom it is crucial missed it!) my blog On changing the world, or at least the way people order things in it, where the Windows Server 2008 collation support in SQL Server 2008 feature had achieved line item status, and obviously the whole Danish/Norwegian disunification issue must have been solved!
And indeed it has.
If you take the query I used in this post:
    SELECT name,
           COLLATIONPROPERTY(name, 'CodePage') as CodePage,
           CONVERT(binary(4), COLLATIONPROPERTY(name, 'LCID')) as LCID,
           CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
           description
    FROM ::fn_helpcollations()
    WHERE COLLATIONPROPERTY(name, 'LCID') = 1030
       OR COLLATIONPROPERTY(name, 'LCID') = 1044
       OR COLLATIONPROPERTY(name, 'LCID') = 2068

(this query returns 20 rows in SQL Server 2005) and run it on that CTP of SQL Server 2008, you can see what it says.
And here it is:
So the old Danish_Norwegian collation still exists, but two new 10.0 collations (Danish_Greenlandic_100 and Norwegian_100) were also added, bringing the total number of rows this query returns to 56.
Now accidentally, the SQLS folks managed to work around an interesting conceptual backwards compatibility issue, too.
I'll explain. :-)
Now not every version of Windows will necessarily have an updated Unicode version that would require an updated default table. Which means that, in theory, had there not needed to be a new default table, only a simple Norwegian collation would have needed to be added.
If you look at the Danish_Norwegian collation, you will see how it is associated with LCID 0x00000406, aka MAKELANGID(LANG_DANISH, SUBLANG_DANISH_DENMARK), so the new collation could be added with the different LCID and the old one left right in place -- thus no change in behavior would be required for the existing collation!
But I imagine that the name (Danish_Norwegian) was likely based on the random chance of alphabetical order, as was the LCID they chose. So there was a 50% chance that this behavior-preserving change would not have been possible and that the same LCID would have required a behavior change while the newly added LCID wouldn't....
Luckily, Vista took a long time to ship and some pretty hefty Unicode updates made it in. So it was a non-issue, since two new collations ended up being needed irregardless of this issue.
Of course as it turns out they will get lucky again on the Finnish/Swedish disunification whenever that happens, since Finnish_Swedish uses 0x0000040b, aka MAKELANGID(LANG_FINNISH, SUBLANG_FINNISH_FINLAND) -- so the Swedish change in some unknown future version would just be a single additional set of collations, rather than two sets.
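If the jump between the decimal LCIDs in the query and the hex values here is not obvious, this little sketch shows the arithmetic. It is a portable stand-in for the MAKELANGID macro rather than the SDK header itself: the demo_ functions and DEMO_ constants are names invented for illustration, with values that mirror the Windows SDK definitions as I understand them (primary language in the low 10 bits, sublanguage in the 6 bits above).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative copies of the SDK values; the DEMO_ names are made up. */
enum {
    DEMO_LANG_DANISH               = 0x06,
    DEMO_LANG_FINNISH              = 0x0b,
    DEMO_LANG_NORWEGIAN            = 0x14,
    DEMO_SUBLANG_DEFAULT           = 0x01,  /* Denmark, Finland, Norwegian Bokmal */
    DEMO_SUBLANG_NORWEGIAN_NYNORSK = 0x02
};

/* Same layout as MAKELANGID: (sublanguage << 10) | primary language. */
uint16_t demo_make_langid(unsigned primary, unsigned sub) {
    assert(primary <= 0x3FF && sub <= 0x3F);
    return (uint16_t)((sub << 10) | primary);
}

/* Low 10 bits: the primary language. */
unsigned demo_primary_langid(uint16_t langid) {
    return langid & 0x3FFu;
}

/* High 6 bits: the sublanguage. */
unsigned demo_sublangid(uint16_t langid) {
    return (unsigned)(langid >> 10);
}
```

So 1030 in the query is 0x0406 (Danish), 1044 is 0x0414 (Norwegian Bokmål), 2068 is 0x0814 (Norwegian Nynorsk), and the Finnish value comes out to 0x040b.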
Two near misses, though. Though perhaps this is a good time to bring up that it may not have been the best design in the world to do all of this unification work, given how things can disunify over time. And perhaps one day the disunification will fall on the wrong side of random chance....
This blog brought to you by å (U+00e5, aka LATIN SMALL LETTER A WITH RING ABOVE)
Content of Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)!
Warning: This particular blog consists mostly of self-indulgent meta-blogging crap and is thus almost entirely ignorable.
Note: You know those times when almost anything you would say is going to be the wrong thing? This is one of those times. Thus, while comments are not disabled in this post, they are strongly discouraged....
Apologies for the X-Files allusion in the title, as softened by the accompanying eroteme.
The question about blogging about matters so closely tied to my work when my employer (or at least my management) wants it to be made crystal clear that this is a personal blog has become an interesting one, to me.
Yet if I only posted about personal stuff, then people might find that hosting it on blogs.msdn.com is inappropriate.
So clearly by some metric this blog is about my work and my job, even though the people who manage my job are denying me three times and all that.
Which is not to suggest that I am anyone's Jesus or anything; you know what I mean.
But in the aftermath of what happened and what is, I am left wondering what this blog really should be.
Or even whether it should be anything, while it sits on this server and while I sit where I do, job wise.
I have been largely coasting for the last few months on existing items I said I'd talk about and such, but without a real sense of direction.
Avoiding the question.
I could just say I miss Liz and leave it at that, but this isn't about her.
Well, not really about her, at least.
But once upon a time she did provide something of a passive compass for me that I only sometimes consulted but always knew was there, and I admit I'm having trouble figuring out where North is at the moment, without that compass.
Though I can't truthfully claim to be unhappy. In fact, I am not unhappy.
Professionally speaking I have been rather pleased and happy to find that even though my management (in a burst of wankitude/wingnuttery that I could calculate if my slide rule was not in storage in a box in Philadelphia) doesn't appreciate me, a lot of other people and groups do -- it keeps me feeling good about Microsoft, at least. And about a lot of the people and groups in it. And I do believe in the work that happens in the group and in the building.
And I believe in a lot of other things too.
I could likely keep coasting with the blog posting for a while, as that huge mound-o-things on my To-Do list is still there. With probably a dozen written posts and twice as many half written, and then with three or four times as many that are almost fully composed in my head and all I have to do is dictate them and thereby insert the expected typos/speechos.
Though there is no rule that says that all of them, or even any of them, have to be posted.
When I think about how the Blog is aimlessly wandering without a theme, while each blog continues to have some point, I wonder. Do people notice that lack of overall direction as opposed to the original slowly changing overall direction, all in a sea of for the most part directed slices of time?
The quote that pops into my mind: The assassin's gun may believe it is the surgeon's scalpel, but the assassin must know the task.
At the moment I am unsure. Not discontent with life, but perhaps a bit discontent with all of this. At the moment.
So I am left wondering -- what if I stopped blogging?
The 20 people who read regularly would notice right away (adjusted for the fact that this is posted on a weekend), the regular but busy folk/the in-love ones/others more casual would notice within a week or three, and it might make for a great WEHT blog that someone else could write in a year or so, after looking into whatever did happen. Nothing would actually "happen" though, in the end.
That persona would be gone, and people who missed it would have to seek out me and a less passive type of connection than the blog enables for them. A terrible loss, for some.
And I might miss it, too. My soapbox and my pronouncements from on low. I could imagine missing that, some at least.
If the existing other unpublished posts never went live? Not that much of a difference, really -- perhaps the difference between the Blog being hit by a bus and the Blog passing on after a long illness.
I am reminded of how I miss Suzanne E. McCarthy's Abecedaria, which is now approaching its second anniversary of not being posted to. We miss them when they disappear, whether with bang or whimper, knowing the people might still be alive, just no longer in our lives, right? With the blog just sitting, dormant -- a testament to the disappearance without corresponding backstory.
What if SiaO just stopped? Like with no warning or indication, perhaps even in mid-
Please read disclaimer; content of Michael Kaplan's blog not approved by Microsoft -- plus it is as off-topic as you can get without a prescription!
From nearer to the end than the beginning of So Long, and Thanks for All the Fish by Douglas Adams:
"I just thought you'd like to see," he said, "what angels wear on their feet. Just out of curiosity. I'm not trying to prove anything, by the way. I'm a scientist and I know what constitutes proof. But the reason I call myself by my childhood name is to remind myself that a scientist must also be absolutely like a child. If he sees a thing, he must say that he sees it, whether it was what he thought he was going to see or not. See first, think later, then test. But always see first. Otherwise you will only see what you were expecting. Most scientists forget that. I'll show you something to demonstrate that later. So, the other reason I call myself Wonko the Sane is so that people will think I am a fool. That allows me to say what I see when I see it. You can't possibly be a scientist if you mind people thinking that you're a fool."
It is in that spirit of scientific inquiry that, remembering vaguely an old Gallagher comedy routine, I scooted over to the other side of the building to talk to some of the good people in Microsoft Typography.
I had an actual work reason to do this as well, but I had my scientific inquiry to act as comic relief....
The Typography folks have that slightly artistic tendency that (on the continuum spanning people who are engineers and people who are normal humans) tends to make them more like normal human beings.
The question?
Well, it was:
You know how M&Ms melt in your mouth but not in your hands? What would they do, say, under your arms?
A simple question, but an interesting one.
And one for which the scientific method of observation (in the words of Wonko "See first, think later, then test") won't work, since I simply am afraid of being looked at as the kind of fool who puts M&Ms under his arms.
Obviously the M&Ms melting is not just about heat, because if one holds M&Ms tightly in one's hands and heat would do the trick then they'd melt there.
I actually tested this when I was growing up. My mother even had the pleasure of finding them in my pockets sometimes after they went through the wash after I'd forget I put a few of them there.
Word to the wise, which I learned after a scolding or three -- M&Ms do melt in the wash.
So if it's not heat, yet they melt both in your mouth and in the wash, then that suggests that it is about the moisture.
I did put deodorant on, I do every day. And if I don't go to the gym and pit out then it doesn't fail and then moisture shouldn't be a huge problem.
I talked with Dave and Nick and Carolyn, and they all added some of their insights into the question, but in the end I guess we decided that they wouldn't really melt under your arms unless your anti-perspirant failed, and no one would want to investigate that too deeply.
This suggests some interesting pitfalls in the research methodology if someone ever wanted to verify this all experimentally. Which is way too much for me; my upper limit was the 72-hour Sex & Salad Diet and that was nearly 20 years ago. And M&Ms under the arms just seems like a whole order of magnitude more serious since even though it would not take as long, it really doesn't have nearly as much of an upside!
And then there was Carolyn's fun twist on the question, widening it out to other M&M receptacles that also made my day -- by the same calculations, the moisture factor there would likely lead to melting. :-)
Please keep in mind that you should never give your cat chocolate, and thus the feline M&M experiment should also remain a theoretical one!
This blog brought to you by Ｍ (U+ff2d, aka FULLWIDTH LATIN CAPITAL LETTER M)
So the other day Stephan T. Lavavej (also known by his initials, STL, which also happens to be what he works on!) came by and we chatted for a bit about development stuff.
It was I think a pretty fascinating conversation, you know the kind where two people don't know each other (except crossing paths in emails to distribution lists), where each person knows a bit about something that the other person knows a lot about, but we walk into the conversation knowing that so there is no weirdness or competitive stuff about it....
Anyway, at one point I was talking about that whole issue with lstrcmpi, the one I talked about in How do I feel about lstrcmpi? I think it blows.... and elsewhere.
And I pointed out that people almost universally tend to use the function incorrectly.
"Why?" he asked, genuinely curious.
The question stopped me short.
That kind of consistency is pretty rare, after all.
Why is it so easy to get wrong?
Hmmmm. I hadn't thought about it too much, really.
After he left I thought about it a bit. And then it occurred to me....
If you look at the string functions topic in the Platform SDK, there are six functions that have that naming pattern: lstrcat, lstrcpy, lstrcpyn, lstrlen, lstrcmp, and lstrcmpi.
Given that the first four functions have such similar behavior to their C runtime counterparts, the similarly named last two can reasonably be expected to do the same thing, right?
The fact that they don't is obviously the source of a lot of problems here.
Now if the functions were named lstrcoll and lstrcolli then perhaps the function would not be so commonly misused.
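To make the naming point concrete, here is a minimal sketch in portable C (not Win32) of the two CRT behaviors the names could suggest. The helper names compare_ordinal and compare_collated are made up for this illustration, and it assumes that lstrcmp sits on the strcoll side of the fence (a locale-aware comparison) rather than the strcmp side:

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Ordinal comparison: looks only at code unit values,
 * so 'B' (66) sorts before 'a' (97). */
int compare_ordinal(const char *left, const char *right)
{
    assert(left != NULL && right != NULL);
    return strcmp(left, right);
}

/* Collation: follows the LC_COLLATE rules of the current locale.
 * In the default "C" locale this matches strcmp; after something like
 * setlocale(LC_COLLATE, "") the answer can change. */
int compare_collated(const char *left, const char *right)
{
    assert(left != NULL && right != NULL);
    return strcoll(left, right);
}
```

A caller who wants a stable, binary answer (for a lookup key, say) wants the first helper. A caller sorting strings for a person to read wants the second. A name ending in "cmp" suggests the first even when the behavior is the second.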
Nobody just ever gave these functions a chance -- it is like giving a child a terrible name that gets them beaten up in school or something!
Nobody I talked to could recall what the lowercase L prefix was intended to mean, either. And now as the CRT deprecates most of these functions in favor of names that fit the ISO standard rather than the POSIX one, the Win32 functions will become even further removed from understanding.
Makes me wonder if Microsoft just ought to deprecate its "L" prefixed family, too!
This blog brought to you by l (U+006c, aka LATIN SMALL LETTER L)
You may (if you are a regular reader) have read Predictably (in retrospect), aka Where Wild^H^H^Hindows-Only Things Are, aka SHORT [on ]TIME for a LONG TIME from back in November of 2007.
In one bit of that post, I mentioned:
Ignoring occasional heroics that I find myself involved with (which, let's be honest are really the exception, not the rule!), there are a handful of times that I feel like I've been involved in something really unique...
Today I got to see the first part of another one of those times....
I'll explain.
By necessity the work I do is the kind of thing that customers only tend to notice when things go wrong, when there are bugs.
Kind of the price of working in internationalization, I figure. :-)
This weekend involved a rather exciting adventure with the latest SQL Server 2008 Community Technology Preview (CTP). It started with a popular diagram, that you can see here in the What's New in SQL Server 2008 February CTP, or I'll include it below with the emphasis on what I wanted to mention as the inspiration:
That's right -- Windows Server 2008 collation support in SQL Server 2008. The feature is a line item for this diagram that helps map the features put into the next version of SQL Server!
Now I didn't do the development work on this feature (Brandon is the one who stepped up and not only provided the solution for this CTP but who also took care of the performance issues due to "fast enough" for Windows not being quite fast enough for SQL Server that you'll see in the next one -- a solution that I hope gets looked into for Windows in a future version!).
And I didn't find the bugs that exist in Windows related to this feature that SQL Server found (Brandon again, with help from various testers -- anyone who thinks developers are cookie cutter resources that you can plug into any feature has nothing on him here!).
And I didn't do the program management work on this feature (Goldie is the author of the spec, the one who met with all of the partner teams to educate them on what was happening and worked to assuage their concerns, and worked with the central release people to get the feature into the product -- anyone who thinks PMs are disposable resources understands nothing of the work she had to do here!).
And I definitely didn't meet with Corporate Vice President Ted Kummert to get the feature approved for SQL Server 2008 (Fernando is the guy who got on Ted's schedule and made the case well enough to get the VP buyoff done -- I have only ever met with VPs to get coffee or hot cocoa and have "how've ya been?" conversations -- anyone who thinks the serious conversation work can be done by any random person who is handy probably shouldn't ever be allowed to meet with VPs!).
But I know that I had the chance to be a positive influence on the efforts of all of these people (including those I am not calling out specifically) in each of these accomplishments -- whether it was providing info for the business case, data, testing information, bug fixes, or answers to random questions.
Now I am vastly oversimplifying the problem here in calling out only one developer, two program managers, and a vice president. There were a lot more people involved, obviously. These are just some of the crucial people who did the right thing in pivotal moments, the kind that make or break features.
And in the end, kind of like with Synthetic Cultures, I got the chance to be involved with a chance to change the way some of the world is going to work (in this case the way that some of the world is going to order itself!). And primarily due to the hard work of this small piece of the SQL Server Engine, I got to be involved in something that will be truly great for millions of customers using the formerly SQL-disenfranchised languages that had no weight in sorting, until this change rolled around.
And installing that CTP build, seeing the new collations in it? It meant a lot more than seeing them in Books Online, and it took my breath away.
Good teams can provide good results, and it was an honor to be interacting with a great team here whose members have provided incredible results!
It is why I work at Microsoft.
And moments like these are very important when I forget that they can happen....
I'll probably talk more about the actual feature and what is there in some future blogs. I'll provide one gratuitous screenshot to tide you over until then (or until you install the CTP yourself!):
This post sponsored by ག (U+0f42 a.k.a. TIBETAN LETTER GA)
As a follow-up to the popular Learn Tamil in 30 Days (or something like that), there is this other book I picked up of about the same size entitled Learn Bengali in a Month that takes a slightly different approach than its Tamil cousin....
I think it was the explanatory text on the back cover of the book that hooked me in the bookstore (as I ignored the "never judge a book by its cover" principle!):
Bengal gave literary giants like Rabindranath Tagore, Sarat Chandra, Bankim Chandra Chatterjee. Tagore's Geetanjali which won Nobel Prize for him, is acclaimed as the best expression of the writer's mystical philosophy. Inspiring song of Bankim Chandra is our national song and the subtle analysis of human mind in Sarat Chandra's novels touches the highest hallmark of literary creation. Would any one like to remain ignorant of such literary works? We study these literary works through translation but forget Tagore's immortal comment "Learning through translation is like wooing a lady through an attorney". We can capture the beauty, the subtlety and the majesty of the Bengali literature only through Bengali language. Readwell's "Learn Bengali" will open the windows on this beautiful literary landscape surely in one month.
I was hooked by the time I hit that attorney quote, and I already was reading Tagore in English, ever since someone pointed me at Santiniketan Song a while back:
She is our own, the darling of our hearts, Santiniketan. Our dreams are rocked in her arms. Her face is a fresh wonder of love every time we see her, for she is our own, the darling of our hearts. In the shadows of her trees we meet, in the freedom of her open sky. Her mornings come and her evenings bringing down heaven's kisses, making us feel anew that she is our own, the darling of our hearts. The stillness of her shades is stirred by the woodland whisper; her amlaki groves are aquiver with the rapture of leaves. She dwells in us and around us, however far we may wander. She weaves our hearts in a song, making us one in music, tuning our strings of love with her own fingers; and we ever remember that she is our own, the darling of our hearts.
I realized at the time I knew as little about what these words were about as I did about Jimi Hendrix's Little Wing the first time I heard it, and maybe that is the point.
In both cases, the words are independent of the underlying meaning of the authors in so many ways that they weave their own meaning for the people who hear them. And learning the meaning later did not change my appreciation....
Anyway, after I read the back of that book all I could think was that if the words I had read in the past were like talking to an attorney, I had to meet the client somehow.
So I bought the book. :-)
Admittedly I have had more trouble here than with that Tamil book, though I found the examples for conjuncts fascinating as it suggested example words like মিথ্যা বাক্য ("untrue word") and দরিদ্র ছাত্র ("poor student") and স্রীহীন পল্লী ("unprosperous village") and র়গ্ন দেহ ("ill health") and ক্লান্ত শরীর ("tired body"). It just makes me wonder whether there is a bit in the subtext here -- a subtle attempt to make conjuncts seem harder by using such negative examples? :-)
And then there are the self-described harder conjuncts like নিস্ফল ক্রোধ ("impotent anger") and fun practice sentences like পরের দ্রব্য ল্ইও না ("Do not take things which are not your own") and ঘেমন কর্ম তেমন ফল ("As you sow, so you reap") -- the latter being the title of the post, hopefully spelled correctly though if not perhaps someone will point it out to me....
But one of the most striking things I found was that (even moreso than the Tamil book) large parts of what I do know from Bengali were so completely at odds with the transliteration used in the book that without some rudimentary mapping between the two, I am lost much of the time. I could nag Bengali friends and colleagues of mine like Goldie about it, but her knowledge of Bengali Unicode is not so great, so I don't think it would prove to be very helpful -- she would likely be more like the book in that her knowledge is based on the same principles anyone would have learned Bengali growing up.
In essence, it is the distance between these two different bits of knowledge which must be bridged for this book to be successful, for me.
More disturbing in all this is the implied notion that the way that the language is taught would need to be modified for the sake of Unicode. Yuck!
Though thankfully I don't think that is what is being proven here; instead, the problem goes back to technology limitations, e.g. the crappy input methods on Windows that are Unicode based and thus INSCRIPT based and which do not match the book's transliteration scheme and thus require an additional mental mapping to make books like this one useful to me (unless I am willing to forget everything I know and then learn Bengali Unicode after learning Bengali). In which case I would go through the same troubles as a native speaker....
The fact that the UI style of a font like Vrinda is not a great match for the font used in the book is a side issue that in most cases does not hinder too much, but if I were a native speaker I could imagine being annoyed by it from time to time....
Which is not to say I won't be able to use the book at all; it is just that the subtitle of the book ("Easy Method of Learning Bengali through English without a Teacher") has been hindered by what I do know, as knowledge of the script via Unicode has proven quite able to hinder knowledge of the language....
Though on the brighter side, I do recognize the sort order, at least!
In the end though, I won't be learning Bengali in a month, at least not from this book. Unicode ruined the approach for me.... :-(
This blog brought to you by ঔ (U+0994, aka BENGALI LETTER AU)
Some of you may recall Igor Levicki, the guy who had 64-bit keyboards working before MSKLC 1.4 was released who I mentioned in If you just don't think you can hold it (64-bit style!).
The other day he sent me mail about a bug (actually a small bundle of bugs, but the bug he found was a crash bug) in a third party application (name withheld to protect the guilty, and also the embarrassed!).
Igor looked at the crash via the disassembly, which I'll put here just for the sake of completeness. If you are the same kind of person you can work along here and try to find the problems within the disassembly:
; Exported entry 592. ?GetDoubleFormat@CAppUtils@@SAPB_WNH@Z
; wchar_t* __cdecl CAppUtils__GetDoubleFormat(double, int)
public ?GetDoubleFormat@CAppUtils@@SAPB_WNH@Z
?GetDoubleFormat@CAppUtils@@SAPB_WNH@Z proc near

var_108 = qword ptr -108h
LCData = word ptr -0D4h
Value = word ptr -0D0h
var_4 = dword ptr -4
arg_0 = qword ptr 8
arg_8 = dword ptr 10h

    push ebp
    mov ebp, esp
    and esp, 0FFFFFFC0h
    sub esp, 0FCh
    mov eax, dword_1007E01C
    xor eax, esp
    mov [esp+0FCh+var_4], eax
    fld [ebp+arg_0]
    push esi
    sub esp, 8
    fstp [esp+108h+var_108]
    lea eax, [esp+108h+Value]
    push offset aF ; "%f"
    push eax ; String
    call ds:_swprintf
    add esp, 10h
    push 0FFh ; cchNumber
    push offset word_100873B0 ; lpNumberStr
    push 0 ; lpFormat
    lea ecx, [esp+10Ch+Value]
    push ecx ; lpValue
    push 0 ; dwFlags
    push 400h ; Locale
    call ds:GetNumberFormatW
    push 2 ; cchData
    lea edx, [esp+104h+LCData]
    push edx ; lpLCData
    push 0Eh ; LCType
    push 400h ; Locale
    call ds:GetLocaleInfoW
    mov esi, [ebp+arg_8]
    cmp esi, 0FFFFFFFFh
    jz short loc_10008344
    mov eax, dword ptr [esp+100h+LCData]
    push eax ; Ch
    push offset word_100873B0 ; Str
    call ds:wcsrchr
    add esp, 8
    test eax, eax
    jz short loc_1000833F
    lea eax, [eax+esi*2+2]

loc_1000833F:
    mov word ptr [eax], 0

loc_10008344:
    mov ecx, [esp+100h+var_4]
    pop esi
    xor ecx, esp
    mov eax, offset word_100873B0
    call sub_10059B35
    mov esp, ebp
    pop ebp
    retn
?GetDoubleFormat@CAppUtils@@SAPB_WNH@Z endp
From assembly it is pretty hard to know whose application it is, but what is happening in this one function is not too hard to figure out.
Now for those who really aren't as comfortable working this way, where you have to watch parameters get placed on the stack and such (though it should be fairly straightforward to work with in this case for those so inclined!), here is some essentially equivalent C code:
wchar_t Formatted[256];

wchar_t *GetDoubleFormat(double Value, int DecimalPlaces)
{
    wchar_t Buffer[MAX_PATH];
    int DecimalChar;

    _swprintf(Buffer, "%f", Value);
    GetNumberFormatW(LOCALE_USER_DEFAULT, LOCALE_NOUSEROVERRIDE, Buffer, NULL, Formatted, 255);
    GetLocaleInfoW(LOCALE_USER_DEFAULT, LOCALE_SDECIMAL, &DecimalChar, 2);
    if (DecimalPlaces != -1) {
        wchar_t *Point = wcsrchr(Formatted, DecimalChar);
        if (Point != NULL) {
            Point = Point + DecimalPlaces + 1;
        }
        *Point = 0x0000;
    }
    return Formatted;
}
Now this bit of code is a veritable bug farm of how to misuse the NLS API.
What it is trying to do, with many of the bugs embedded in the descriptive language for easy retrieval:
Originally Igor was thinking this might have even made a great entry for The Daily WTF, and I'm not gonna disagree with him on that.
The crash bug Igor had run into initially was based on the fact that he had Serbian locale settings with a customized decimal separator -- thus the user override mismatch in #1 and #2 above quickly led to a problem searching for a "." in a string such as "44,90" -- then after properly detecting that wcsrchr returned NULL for failure, a somewhat catastrophic situation arises when the code tries to dereference that NULL in order to assign to it.
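For anyone who wants to see the failure mode avoided in isolation, here is a minimal sketch in portable C. It assumes a made-up helper name (truncate_after_separator) and leaves out the Win32 calls entirely, so it covers only the truncation step and is not a fix for the application's code. It writes nothing when the separator is missing and never writes past the end of the string:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper (not from the third party application): keep at
 * most `places` digits after the last `separator` in `text`.
 * Returns 1 if the string was shortened, 0 if it was left untouched.
 */
int truncate_after_separator(wchar_t *text, wchar_t separator, int places)
{
    wchar_t *point;
    wchar_t *cut;
    size_t available;

    assert(text != NULL);

    if (places < 0) {
        return 0;           /* caller asked for no truncation */
    }

    point = wcsrchr(text, separator);
    if (point == NULL) {
        return 0;           /* separator absent: nothing to dereference */
    }

    available = wcslen(point + 1);      /* characters after the separator */
    if ((size_t)places >= available) {
        return 0;           /* already short enough; stay inside the string */
    }

    /* Design choice for this sketch: with zero places, drop the separator too. */
    cut = (places == 0) ? point : point + 1 + places;
    *cut = L'\0';
    return 1;
}
```

With a string like "44,90" and a search for ".", the helper returns 0 and leaves the string alone, which is the case that crashed above.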
Basically what they (apparently) wanted to do was format a number with the user's preferences but overriding the user's choice of the number of decimal places. The resulting string is put in their user interface, returning the result to the user in a property sheet.
Worse ways of achieving that goal have been reported in code reviews, but not by reliable witnesses.
(There are other silly/problematic issues here, such as:
Now in fairness to the people who wrote this code (whoever they are and whenever they wrote it), GetNumberFormat has some specific limitations that make it less useful and that make it more difficult to write the idealized version of this function.
I am going to enumerate the three big problems here as I see them, and Igor might have some additional thoughts on this matter either for comments here or on his own site. :-)
But the NLS team could think of this next bit as feedback to them on ways to make functions like GetNumberFormat better, faster, easier to use, and more generally useful.
First of all, since GetNumberFormat can only take a string rather than a number due to the lack of a LOCALE_SPECIFY_NUMBER flag (as I mentioned in Pass the string please, three years ago), the caller must format their number as a string via a function like _swprintf so that it can then format the number within the string as yet another string -- and perhaps this is proof that I have changed my mind a bit since that blog; the NLS code really ought to help out more with the more common obvious cases when it can, such as this one.
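As a small illustration of that first step, here is a sketch of the number-to-string half in standard C. The helper name format_fixed is invented for this example, and it uses the size-checked standard swprintf rather than the _swprintf call seen in the disassembly. It asks for the wanted precision up front with "%.*f" rather than chopping the string afterwards, and it assumes the program is still in the default "C" locale so the decimal point comes out as ".":

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: write `value` into `out` with `places` digits
 * after the decimal point (rounded by the C library, not truncated).
 * Returns the number of characters written, or -1 if it did not fit.
 */
int format_fixed(double value, int places, wchar_t *out, size_t out_count)
{
    int written;

    assert(out != NULL && out_count > 0);

    if (places < 0) {
        places = 6;         /* same default precision as plain "%f" */
    }

    written = swprintf(out, out_count, L"%.*f", places, value);
    if (written < 0) {
        out[0] = L'\0';     /* leave a valid empty string on failure */
        return -1;
    }
    return written;
}
```

The resulting string is what a string-only formatter would then consume. Note that this rounds where the original code truncated, so the two are not drop-in equivalents.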
Second of all (and perhaps most importantly), the longstanding behavior of GetNumberFormat that requires either a fully filled in NUMBERFMT in lpFormat or a NULL lpFormat means that the only way to specify an lpFormat->NumDigits value is to also fill in the lpFormat->LeadingZero, lpFormat->Grouping, lpFormat->lpDecimalSep, lpFormat->lpThousandSep, and lpFormat->NegativeOrder values as well -- thus requiring up to five calls to GetLocaleInfo for information entirely derivable from the passed-in locale (assuming one value being overridden).
There are any number of alternate possible ways it could have been done -- like establish the NULL case for the struct as being the default which means "take the locale's data", or specify values that mean the same thing, or even define flags to specify which values to pay attention to in the structure (there is room in the dwFlags for this in both this function and GetCurrencyFormat!).
Third of all, where GetNumberFormat always requires either a fully filled in NUMBERFMT in lpFormat or a NULL lpFormat, if you don't pass in that filled-in NUMBERFMT then the code behind the function always grabs all of the information even if it does not need it -- meaning for example that it will grab the user overridable lpFormat->lpThousandSep even if the number is not big enough to need it according to the number itself. The upshot of this is that the performance of the function is much better if you pass in the lpFormat yourself if you have the data handy -- because the function is not smart about how it does its work.
Now one may argue against that type of optimization due to potential thread safety issues caused by SetLocaleInfo calls happening while number formatting is happening, but given that this is already really a bit of a problem in terms of consistent number formatting in such situations (and also how uncommon SetLocaleInfo calls or user-specific Regional Options changes that amount to SetLocaleInfo calls actually are), there are much better solutions for this problem that are possible. Much more so than the current solution (trying to front-load all of the data loading calls to minimize the number of possible SetLocaleInfo calls that could happen). And many of those better solutions would be more performant, too!
On top of all that, this front-loading in the case of the NULL lpFormat happens when LOCALE_NOUSEROVERRIDE is specified, too -- meaning there is no thread safety issue at all, but the code is happy to fill a structure with six separate data items without ever returning the data it loaded to the caller (in case they were going to call the function repeatedly). If one has to pay the price, one would like a bit more for one's money, in my opinion....
If those issues did not exist, then all of the work done by the third party GetDoubleFormat could happen in a single very fast function call, rather than a whole bunch of code, such as the buggy code Igor pointed out in the version of the third party application he was looking at.
Now none of this excuses the bugs in the third party application -- that code screws up the usage of NLS functions with multiple bugs including some that crash? They have no good reason to have those.
But if the NLS function did a bit more of the work here, perhaps the kind of developers who were going to make those mistakes when left to their own devices would have one less opportunity to do so?
Now for the interactive bit of this blog (and incidentally of this Blog):
Any developers want to take a stab at implementing the
wchar_t *GetDoubleFormat(double Value, int DecimalPlaces)
function without any of the numerous aforementioned bugs? :-)
This blog brought to you by . (U+002e, aka FULL STOP)
Now it isn't Bangalore vs. Bengaluru.
And it isn't Uighur vs. Uyghur.
And it really isn't Farsi vs. Persian.
Or Macao vs. Macau.
But the other day, reader James asked me via the Contact link:
I have noticed that you have mentioned the last name Chaudhuri in your blog a few times. What is the relationship between Chaudhuri and Chaudhary, another name I have seen before?
An interesting question, one that has at its heart the very informal and non-standard way that non-English words are transliterated into English for the purpose of people really moving to use English as their main language in some or most contexts....
Now in this case the name (চৌধুরী) moved into English can find itself in many different forms -- and not just Chaudhuri and Chaudhary, either -- there is Choudhury, Choudhuri, Chowdhury, and so on -- the Wikipedia article lists 21 different spellings:
I once almost wondered (after seeing so many different people within Bengal and Bangladesh with that same last name who were not related) whether this was like the way Sikhs see all boys with a middle or last name of Singh and all girls with the middle or last name Kaur.
Then I looked at some of those various name sites and in addition to long articles like the Wikipedia one came up with this text from The Chowdhury Surname at ancestor.com:
Indian (Bengal) and Bangladeshi: Muslim and Hindu status name for a head of a community or caste, from Sanskrit catus- ‘four-way’, ‘all-round’ + dhuriya ‘undertaking a burden (of responsibility)’ (Sanskrit dhura ‘burden’). The title was originally awarded to persons of eminence, both Muslims and Hindus, by the Mughal emperors. The Khatris have a clan called Chowdhury. In some traditions the term is said to derive from a title for a military commander controlling four different fighting forces, namely navy, cavalry, infantry, and elephant corps, but this is probably no more than folk etymology.
Is it any stranger than the Hebrew last name that is derived from the High Priesthood originally the Sons of Aaron making it to our modern times as both priests and non-priests with the last names of Cohen, Cohn, Kohn, and so on? Probably not -- and both come from the same problem of mapping to English sounds in another language.
Choosing the name one will use goes beyond the issue that is faced in the Khadafi vs. Ghadafi vs. Qadafi and so on situation -- where the person with the name is not necessarily the one choosing what the spelling ought to be. Standards are developed within a single news organization to try and be consistent in last name spellings, but between organizations it is anyone's guess, and the person following the news back when Muammar and Libya were in the public eye so much had to be ready for any of dozens of different possible spellings.
And if I am deriving information from five different articles on the Libyan Prime Minister, I would likely regularize the different spellings in my own article rather than use their five different spellings.
But in names? Amit Chaudhuri is the name on the books and the one I would use in talking about those books, just as I would refer to the moveon.org advocate as Nita Chaudhary or the guy in the UN who did so much to further the rights of women and children as Anwarul Karim Chowdhury, all without blinking an eye and without being tempted to try to regularize the spellings.
My own last name I have seen as Kaplan and Caplan/Caplin (and many others), and the conventional wisdom that the ones spelled with a "C" are not Jewish and the ones spelled with a "K" are is a "rule" that I have met exceptions to myself, in both directions -- even ignoring examples you may know something of yourself like Alfred Gerald Caplin (you may know him as Al Capp, best known for Li'l Abner).
I suppose the fact that we take people at their personal preference for their own name in some cases but not others is perhaps a fascinating study in both respect and the lack thereof.
And I say that as someone who has no problems having opinions on issues like calling a city Bengaluru, or a language name Persian or Macau or Uyghur....
This blog brought to you by চ (U+099a, aka BENGALI LETTER CA)
It is that time of year -- you know, when HR is in the air (so to speak).
It is the midyear, and everyone scrambles to fill in the forms and such, hoping it will keep the nag mails from coming but knowing somehow that they will keep coming anyway, as all those who have finished already can readily attest!
Now perhaps it is because I have a blog that speaks so candidly about just about anything (too many things, according to some!) and most especially language issues, but I seem to attract a slightly greater than expected number of questions from non-native English speakers in the group about some of the language surrounding reviews....
Like the other day, when a colleague was genuinely confused about what actual difference was being conveyed between the words influence and impact.
I gave him some of the typical examples that they often give, but he had read those and still seemed a little unconvinced.
"Well," I suggested, "it's like my scooter."
"Your scooter?" he asked, incredulously.
"Yeah, like my scooter. Let me explain...."
"I would like that," he said, clearly quite curious about what my scooter could have to do with the definition of two words that would be relevant.
"Well," I explain, "as I am scooting down the hallway, people will move out of the way without me saying a word, without the scooter ever touching them. That is because the scooter has influence over people. The scooter has influence over their behavior, in a way that people generally don't tend to in ordinary situations."
"And the impact?" he asked tentatively, almost afraid of the obvious answer and hoping that it will be something different.
"Ah, that one should be obvious -- if the scooter hits someone, there will be some serious impact. Like when KC used to ask us in triage over suggested development solutions What's the impact to test? but more directly -- the impact would be a scooter-shaped dent in a tester, or in the general case, in whoever the scooter hit," I explained wryly.
He shakes his head. "So you are saying that the influence is about people understanding the consequences of the impact? That doesn't seem right."
"No, it isn't that simple. After all, who knows what is behind the influence, exactly? Maybe it is the fear of being hit, maybe they are being polite, maybe it is pity, maybe it is feeling bad about being in my way. The reason doesn't matter, in the end. It is an overlapping, sometimes but not always causative force."
He nods his head slowly. "I think I understand what you mean."
"And there is another aspect as well," I continue, clearly on a roll. "If the scooter crashes into a wall, the impact is self-evident after the crash. But in this case there was no influence whatsoever. It isn't like the wall could get out of the way, or flinch, or even cry out a warning. So it is not only possible to have influence without impact; it is quite possible to see impact without influence!"
When I relayed this story later to Cathy, she admitted that her first thought was about the incident that inspired the Note to self: don't run over the General Manager blog -- even if people don't manage to get completely out of the way, they can still feel the influence and the desire to avoid the impact. :-)
This post brought to you by J (U+ff2a, a.k.a. FULLWIDTH CAPITAL LETTER J)
You may have read Vietnamese is a complex language on Windows, which discusses some fixes that were put in for Vietnamese in Vista.
The nature of the bug and of the fixes was discussed there as well -- it amounted to some characters used in the language not being included in the Vietnamese exception table (characters also missing from the keyboard and the code page), plus a few inconsistencies found in the weights.
Anyway, recently the person who originally reported the bug commented on the Vista behavior:
Hi Michael,
Not until today, I got a chance to test the new collation on Vista. It seems that those bugs have been fixed for the Unicode composite format but not for the precomposed. Moreover, the fixes seem to have introduced new bugs. The list below includes the Vietnamese characters in question.
Reference  : aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ
Composite  : aAàÀảẢãÃáÁạẠâÂầẦẩẨẫăĂẪằấẤẰẳậẲẬẵẴắẮặẶiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ
Precomposed: aAàÀảẢáÁạẠãÃăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIỉỈĩĨíÍịỊìÌoOỏỎóÓọỌòÒõÕyYỳỲỷỶỹỸỵỴýÝ
Given the latest environment (in terms of OS, .NET framework), how can I get the correct sort?
Quan
Let's take a closer look at the values he gave, marking the ones that do not match the reference in red (blowing them up and putting the reference in the middle to make visual comparisons easier):
Composite  : aAàÀảẢãÃáÁạẠâÂầẦẩẨẫăĂẪằấẤẰẳậẲẬẵẴắẮặẶiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ
Reference  : aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIìÌỉỈĩĨíÍịỊoOòÒỏỎõÕóÓọỌyYỳỲỷỶỹỸýÝỵỴ
Precomposed: aAàÀảẢáÁạẠãÃăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬiIỉỈĩĨíÍịỊìÌoOỏỎóÓọỌòÒõÕyYỳỲỷỶỹỸỵỴýÝ
The information here is useful in the sense of reporting that there are problems, but ultimately not in resolving the problems.
Perhaps I should explain what I mean by that. :-)
Now ignoring all that for a moment and explaining a bit about the actual differences noted....
The nature of the problem here is twofold:
As a rule, characters that are not used in a language do not tend to get moved along with the ones that are, but this particular discontinuity means that when such letters are sorted, their results may not be a 100% match for the precomposed/composite forms, which is why there is a difference between the two forms (the default table does match them, as do the identified Vietnamese letters; but when the moved diacritics combine with letters not in the known set, they will be moved out of matching their analogous precomposed forms).
This is a problem that cannot be fixed for Vietnamese without a major version change, which would mean a new version of Windows -- as I pointed out in 2001, a Correctness Odyssey (aka What's the matter with Ü?), it has been decided that this kind of change cannot be done in a service pack....
So in any case, someone over in NLS has some investigation to do, for both repertoire and order, for a future version of Windows.
This blog brought to you by Đ (U+0110, aka LATIN CAPITAL LETTER D WITH STROKE)
The other day, Qian asked me via the Contact link:
Hi Michael,

I've been reading your blog for the past several months and find it really helpful. I've been able to crush several quite puzzling bugs just by reading the archives. Thank you for sharing such valuable knowledge.

I wonder if you can shed some light on a problem I've come across. I've been trying to use the ImmIsIME function to determine if the currently active input language uses an IME. I've got a plain US English XP (32-bit) system with the Simplified Chinese Pinyin IME installed. And in my code I'm doing the following:

    HKL hKeyLayout = GetKeyboardLayout(0);
    BOOL bIsIME = ImmIsIME(hKeyLayout) != 0;

If I look at the HKL value, I see that I get 0x4090409 for the English layout and 0x8040804 for the Chinese. But I get TRUE for ImmIsIME for both of them.

This didn't seem right to me, so I googled a bit and found this thread:

http://groups.google.com/group/microsoft.public.win32.programmer.international/browse_thread/thread/6431db45bf81e718/e8111909da418549?hl=en&lnk=st&q=ImmIsIme#e8111909da418549

In it you said, "All IME's have a 0xE000000 prefix in front of them when you look at the HKL value. You can easily use the GetKeyboardLayout API to retrieve the active keyboard layout and test for whether it has such a prefix in the HKL."

And the OP of the thread indicated that this method worked for him. He was using a Japanese IME on an English XP system, so I installed the Japanese IME and got an HKL value of 0x4110411 for the Japanese layout. Obviously neither the Japanese nor the Chinese HKL on my system had an 0xE000000 prefix, so this test for IME didn't work either.

I haven't been able to find any information on why ImmIsIME would return TRUE for the US English layout. I've also not been able to find any other references to the 0xE000000 prefix. Do you see anything that I'm doing wrong or have any insight into why I'm not getting the expected values here?

I've also tested my code on Vista (32 and 64-bit) with exactly the same results. So if I'm screwing up somewhere, at least I'm doing it consistently. :)

I'd greatly appreciate any pointers you can give me on this. And thanks again for all the great info on your blog.

Best,
Qian
Well, I guess it can't ever hurt to have people say nice things about you, especially when you're feeling under-valued. :-)
In this case the problem is once again that the move from the IMM (Input Method Manager) based to the TSF (Text Services Framework) based IMEs has made an old behavior one could rely on go away forever. :-(
There are now a huge ton of IMEs that do not follow that "E" prefix rule, since their support is entirely through TSF now.
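To make that old rule concrete: the prefix test boils down to checking the top nibble of the HKL. Here is a minimal sketch in plain C, assuming the HKL has already been narrowed to a 32-bit integer (in real code that would be a cast of whatever GetKeyboardLayout returned) and reading the "0xE000000 prefix" as 0xE in the top nibble. The helper names and the 0xE0010411 value are made up for illustration; the other three values are the ones Qian reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the pre-TSF heuristic, applied to an HKL that has
   already been narrowed to its low 32 bits. The old rule looked for 0xE
   in the top nibble of the handle. */
bool hkl_has_imm_ime_prefix(uint32_t hkl)
{
    return (hkl & 0xF0000000u) == 0xE0000000u;
}

/* The values from Qian's mail, plus one made-up handle with the prefix. */
void check_reported_hkls(void)
{
    assert(!hkl_has_imm_ime_prefix(0x04090409u)); /* US English layout */
    assert(!hkl_has_imm_ime_prefix(0x08040804u)); /* Simplified Chinese Pinyin IME */
    assert(!hkl_has_imm_ime_prefix(0x04110411u)); /* Japanese IME */
    assert(hkl_has_imm_ime_prefix(0xE0010411u));  /* illustrative value only */
}
```

All three of the reported handles fail the test, which is exactly why the old check cannot tell these IMEs apart from a plain keyboard layout.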
Now if one knows all of the GUIDs and such one could probably use ITfInputProcessorProfiles::IsEnabledLanguageProfile to see if the IME in question is enabled, and one could probably use ITfInputProcessorProfiles::EnumInputProcessorInfo to enumerate all of the Input Processor Profiles on the machine and look for the one that was desired here.
But the real problem is not answering this one particular question.
It is that there is no link in documentation between the Input Method Manager and the Text Services Framework.
Not even a notice to developers that lots of the IMM won't work anymore.
Let alone a huge map akin to the Microsoft Win32 to Microsoft .NET Framework API Map that I mentioned here and suggested corrections to here.
Let alone what would be best of all, which would be a little link in each Input Method Manager topic that provided the best way to accomplish the same thing in the Text Services Framework.
People over here talk all the time about gaps in international support, but this lack of information, not only on how to migrate but on the fact that migration is needed at all, is a lot more than a gap. It's a huge chasm!
And this problem is not one you can just aim a doc writer at -- not even a doc writer ninja or doc writer ninja II. This requires developers and testers, people who actually know the answers to the questions here and who know enough about the IMM "wrappers" put in for compatibility that they could even find and perhaps fix bugs in the current backcompat work that happens.
There is an old story about the four friends -- Everybody, Anybody, Nobody and Somebody. There was an important job to be done and Everybody was sure that Somebody would do it. In truth, Anybody could have done it, but Nobody did it. Somebody got angry about that because it was Anybody's job and the fact that Nobody did it was incredibly irresponsible (something that Everybody knew). In the end, Nobody did the work and Everybody got blamed for not taking responsibility.
I just wonder how long these four friends are going to rule the IMM/TSF documentation situation?
I don't want to minimize the importance of this problem with humor, and to be honest I wish more developers around the world and especially in East Asia would complain about this more loudly and force Microsoft to do something about it. The lack of assistance here is unpardonable considering the sizes of the markets affected....
This blog brought to you by ꇥ (U+a1e5, aka YI SYLLABLE GAP)