Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
Over in the Suggestion Box, DreymaR asked:
Hello Michael, and thanks ever so much for sharing insights about the workings of the Windows layout routines!
I hope you're the person to ask about this, or if you aren't, that you could tell me who might be?
In these days I'm very portable, and happily so. I 'take my digital life with me' by USB flash drive, and this allows me to work comfortably on computers where I have no installation rights. That's nice!
When it comes to the keyboard layout however, I'm not yet fully satisfied. I've used MSKLC to make an enhanced layout (Unicode, some keys moved around etc) that I'm very fond of. I can use the freeware script PortableKeyboardLayout to take that with me, but this solution is imperfect because it's too slow at times (it's written in a scripting language) and doesn't always play 100% nice with the input stream. (It has other virtues, but that's another matter.)
So what I'd really like would be to be able to code or script an analogue to the LoadKeyboardLayout API, that instead of looking in the registry and system folders could take its values from a specified file and an MSKLC-made install! (I hope I'm right in thinking that what the MSKLC installer does is detect the architecture and then simply copy the right .dll and some registry values to where the system expects them to be?)
That way, I could run a script or program that loaded and activated a layout from my USB drive, without installing anything to the local hard drive! Any tips/ideas whether this is doable and if so, how?
Now Bob knows I hate to be the bearer of bad news....
But this is a capability that MSKLC does not provide, and there is no back door into the sanctioned input stream in Windows to support what is being requested here.
Now obviously Text Service Framework (TSF) Text Input Profiles (TIPs) can accomplish this, but they are complex DLLs that also require administrative permissions to write to the registry and let the system know where they are. There is no mechanism for inserting an input method without that level of access/authority on the machine....
Sorry about that!
Over in the Suggestion Box, Michal asked:
With Dr. International (http://blogs.msdn.com/drintl/) being inactive for years - who's gonna write "Developing International Software, Third Edition" now? I think there is still some stuff missing:
- localized size units (French "octets") to cover... - the Japanese (and Korean) path separators... - more information on registry keys and paths translated in localized OSes...
Cheers,Michał
Ah, the questions that are painful to even read, let alone answer....
Interestingly, that book has never had a "real" author in the conventional sense.
The first edition has Nadine Kano listed as the author but even she would point out that it was the work of a whole team of people. And there is no publisher other than perhaps Wrox that would be comfortable with an army of authors listed on the cover.
The second edition was also written by a team, but rather than focus on any one person (a focus that I suspect Nadine did not always enjoy, especially now so many years later that she is in another group doing unrelated work), they chose the virtual team member, Dr. International.
The secret about Dr. International is that once upon a time, I was him!
When it first started out, Bjoern wanted to meet with me, for two reasons:
Since his team had a responsibility to be doing that as well, he thought maybe I could be on his team!
And he had an idea about it - like an international version of Dr. GUI, a "Dr. International".
I thought it was a cool idea, though I doubted it would be a fulltime gig. He suggested maybe 10 hours a week tops, I'd write a column and could do the whole talking about myself in the third person thing that Dr. GUI was so famous for, etc.
After six months the contract ended and other members of the team took over the job of being Dr. International.
And of course he eventually got a blog.
In general they dropped the third person thing which I admit I really got into -- and still do in my facebook statuses I post. It isn't pretension, it is just fun. I missed that....
Twice since that time I have proposed a third edition of the book, one more focused on Windows (and perhaps .Net) rather than all Microsoft products like the second edition tried to do. Both times I was not told no outright, but the project was never formally approved.
And now with the last member of that team who was still working on various projects gone (one of the primary successors to me all along, good for continuity!), Dr. International is truly gone now.
Given the re-org within Windows, the third edition is unlikely since no group would be likely willing to fund it and the book would not make enough to do as an independent project.
I suppose I'll just have to keep blogging, and not writing (ref: About [not] writing books...).
The question was easy enough:
Hi,
I am trying to use the GetTimeFormat to retrieve the Time format from the system. Win7 added the new customized format. How can I get the format(as in the attached image)?Which Locale ID should I use?
Thanks
Well, first let's take a moment to recognize something important that this screenshot shows.
Do you see it?
Hint #1: It is from Windows 7.
Now?
Hint #2: Remember blogs like Customizing the SHORT time format? and We do seem to be short on time... and I see LONG TIME and SHORT TIME; where are SHORTER TIME and SHORTEST TIME? and Predictably (in retrospect), aka Where Wild^H^H^Hindows-Only Things Are, aka SHORT [on ]TIME for a LONG TIME and such.
Oh, never mind. I gave it away.
They added a short time format to Regional and Language Options!
Awesome, right? Now there is parity between this piece of managed code and native code, between NLS and the Globalization classes.
Very cool!
Well, almost.
Getting back to that question:
I am trying to use the GetTimeFormat to retrieve the Time format from the system. Win7 added the new customized format. How can I get the format
Suddenly we lose some awesomeness, here.
You see, GetTimeFormat/GetTimeFormatEx add no new flags to get at this new data item that Regional and Language Options exposes. They have the same old flags they always had, but no new meanings are ascribed to them, and behavior changes would be bad so it is probably good that nothing changed here.
And the Windows 7 version of winnls.h adds no new flags to get at the short time, either (just in case there was some worry that the docs were falling behind the product features).
There is no way to get at this new format directly from the time formatting functions themselves.
Though you can get at it through EnumTimeFormats/EnumTimeFormatsEx with the new TIME_NOSECONDS flag, or GetLocaleInfo/GetLocaleInfoEx with the LOCALE_SSHORTTIME flag.
As a by the way, the LOCALE_SSHORTTIME flag has some really disturbing (to me, at least) information:
Windows 7 and later: Short time formatting string for the locale. Patterns are typically derived by removing the "ss" (seconds) value from the long time format pattern. For example, if the long time format is "h:mm:ss tt", the short time format is most likely "h:mm tt". This constant can specify multiple formats in a semicolon-delimited list. However, the preferred short time format should be the first value listed.
Um, if true, that means that LOCALE_SSHORTTIME does not behave like LOCALE_SSHORTDATE or LOCALE_SLONGDATE. It returns the semicolon-delimited list of short times that EnumTimeFormats/EnumTimeFormatsEx enumerate.
Now if true that would make it easy for the .NET Framework's upcoming version to ask Windows for the information, since that is the format it grabs out itself anyway. It is not so easy for developers in Windows or outside of Microsoft using Windows, since which entry is the current default (possibly customized) is not documented, and thus Windows code that grabs formats out of EnumTimeFormats/EnumTimeFormatsEx can't officially rely on it (in practice it should be the first one, I suppose; maybe that should be officially documented).
And even if it could, parsing a semicolon delimited list of formats is easier than calling the enumeration functions which make callbacks and so forth.
That may be why LOCALE_SSHORTTIME has this unusual data return value - the fact that this keeps the .Net folks from having to call the complicated callback function. So at least someone might have had an easier job.
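To make the "easier than callbacks" point concrete, here is a minimal sketch of splitting such a list without any enumeration function. It assumes the list really is plain semicolon-delimited text with no semicolons hidden inside quoted literals; the function name split_time_formats is made up for illustration and is not part of any Windows API.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not a Windows API): splits a semicolon-delimited
   list such as L"h:mm tt;hh:mm tt;H:mm" in place.
   Assumption: no entry contains a ';' inside a quoted literal.
   Stores a pointer to each non-empty entry and returns how many were found. */
size_t split_time_formats(wchar_t *list, wchar_t **entries, size_t max_entries)
{
    size_t count = 0;
    wchar_t *cursor = list;

    while (*cursor != L'\0' && count < max_entries) {
        size_t length = wcscspn(cursor, L";");   /* characters before the next ';' */
        wchar_t *next = cursor + length;
        int at_end = (*next == L'\0');

        *next = L'\0';                           /* terminate this entry */
        if (length > 0)
            entries[count++] = cursor;
        if (at_end)
            break;
        cursor = next + 1;                       /* step past the delimiter */
    }
    return count;
}
```

If the documentation is right, entries[0] would be the preferred short time format and the rest are the alternates.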
The solution to the original problem?
Well, Step 1 is to call GetLocaleInfo/GetLocaleInfoEx with the LOCALE_SSHORTTIME flag.
And then Step 2 is to take the string that is returned and pass it as the format to GetTimeFormat/GetTimeFormatEx.
And of course the answer to that last question (which locale?) is to use either the LOCALE_USER_DEFAULT or LOCALE_NAME_USER_DEFAULT constant, depending on which function you call.
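The two steps might be sketched roughly as follows. This is only an outline under a few assumptions: that the first entry of a semicolon-delimited result is the preferred format (as the documentation quoted above suggests), that a 256-character buffer is enough, and that the helper names keep_first_format and format_short_time are made up for this example. The Windows-specific part only compiles on Windows.

```c
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0601   /* Windows 7, needed for LOCALE_SSHORTTIME */
#endif
#include <windows.h>
#endif
#include <wchar.h>

/* Hypothetical helper: cuts a semicolon-delimited format list down to its
   first entry, in place, on the assumption that the first one is preferred. */
wchar_t *keep_first_format(wchar_t *list)
{
    wchar_t *delimiter = wcschr(list, L';');
    if (delimiter != NULL)
        *delimiter = L'\0';
    return list;
}

#ifdef _WIN32
/* Formats the current local time with the user's short time pattern.
   Returns what GetTimeFormatEx returns: characters written, or 0 on failure. */
int format_short_time(wchar_t *out, int out_chars)
{
    wchar_t formats[256];   /* buffer size is an arbitrary choice for this sketch */

    /* Step 1: ask for the short time format string(s). */
    if (GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT, LOCALE_SSHORTTIME,
                        formats, 256) == 0)
        return 0;

    /* Step 2: pass the first pattern back in as the format picture. */
    return GetTimeFormatEx(LOCALE_NAME_USER_DEFAULT, 0, NULL,
                           keep_first_format(formats), out, out_chars);
}
#endif
```

A caller on Windows 7 would hand format_short_time a wchar_t buffer and its size, then display the result if the return value is nonzero.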
You might get a sense of why I stopped thinking it was awesome.
I mean, don't get me wrong, it is better than nothing.
But there is a lot of room for improvement in future versions, based on the ways that developers might want to make use of the information....
UPDATE 12:32pm - it seems that they did the work in GetTimeFormat[Ex] to support the formatting; if you pass TIME_NOSECONDS they give you the short time in Regional and Language Options, with seconds stripped out if you added them. Though if you want what the user put in you have to go through the above steps due to the overloading of this flag that has had its meaning changed. Maybe I'll write about that tomorrow....
A few days ago, via several different methods (the Visual C++ Development Center forum, email to my non-Microsoft account, the contact link here, multiple off-topic comments with increasing impatience apparent in each for a solution), Rajesh asked:
Hello Michael:
I have visited your blog, and know that you are an expert in Windows Uniscribe, here I have some questions about Uniscribe to ask you.
Inter-character spacing for labeling results in a composite text collection with each character being split as a separate one. Hence each character is presented as a separate one and cannot arrive at a combination character. Problem with combinational characters is not only specific to right to left language( Arabic Language- Example:يُساوِي), the problem can exist with left to right language(Hindi Language - Example:ठऑक्षझॉ) also.
So,Please let us know if there exists any API that identifies the given set of pre composed characters comprises a composite character.
Thanks in advance,Rajesh Reddy
Now of course I generally can't do the kind of 1-on-1 support that the many messages entailed, and people who are looking for support like that really need to find a more appropriate method, as I point out in my Contacting Me link.
But the question is an interesting one, and the blog that was going to be put in for today has to have a bit more done to it, so I thought I'd take a stab at it.
For starters we'll have to take the word composite out of the mix. Not that the word isn't descriptive enough; it just carries some baggage with it. It can confuse people into thinking the question is more about code pages and the difference between what Microsoft calls composite vs. precomposed sequences. This is the problem that the support engineer had in this forum thread at first.
Now the biggest problem is in the assumption that simply adding space in between every character is the right thing to do, as any language/script that does shaping when certain characters are placed next to each other will fail -- and this is the very problem that Rajesh points out.
What someone trying to do a complex operation like full justification could use is the information that Uniscribe returns from its ScriptString_pLogAttr function (if one is using the ScriptString* functions) or the ScriptBreak function (if one is calling the fuller low-level Uniscribe functions). In particular, each function returns an array of SCRIPT_LOGATTR structures that, for each character in the text Uniscribe is processing, provides all of the following information:
Now once one has all of this information, one knows the safe places where space can be inserted if one is trying to extend the width of a line in order to make the justification match other lines, if one is using simple space insertion to do so.
But this is the wrong approach.
Note that in pretty much all cases such an algorithm has a pretty fundamental flaw, which is that the actual widths one might need to insert can be different and using full characters between the words will make the text jagged on the far edge (as can the different widths of the words themselves).
The better way to perform such operations is by use of the ScriptJustify Function as possibly modified by a more advanced editor, as the function indicates:
This function provides a simple implementation of multilingual justification. It establishes the amount of adjustment to make at each glyph position on the line. It interprets the SCRIPT_VISATTR array generated by a call to ScriptShape, giving top priority to kashida. The function uses interword spacing if no kashida points are available. It uses intercharacter spacing if no interword points are available.
Note: Sophisticated text formatters might generate their own delta dx array by combining formatter-specific features with the information retrieved by ScriptShape in the SCRIPT_VISATTR array.
The application should pass the justified advance widths generated by ScriptJustify to ScriptTextOut in the piJustify parameter.
ScriptJustify creates a justified array containing updated advance widths for each glyph. When an advance width for a glyph is increased, the extra width is rendered to the right of the glyph, with a white space or, for Arabic text, a kashida.
This is the Uniscribe model for dealing with the kind of advanced justification one might see in a program like Word or PowerPoint or Publisher -- as it can be used to precisely place text to allow desired justification to take place....
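To illustrate the general idea of adjusting per-glyph advance widths instead of inserting space characters, here is a toy sketch. It is not ScriptJustify's algorithm: it ignores the kashida / interword / intercharacter priority order entirely, and the may_expand array is a hypothetical stand-in for what a real formatter would derive from the SCRIPT_VISATTR data. The function name is made up for this example.

```c
#include <stddef.h>

/* Toy illustration only (not Uniscribe): spreads extra_width over the glyphs
   flagged in may_expand and writes the adjusted advance widths to justified.
   Any remainder goes to the earliest eligible glyphs, one unit each.
   Returns the number of eligible glyphs, or 0 if nothing was adjusted. */
size_t justify_advances(const int *advances, const unsigned char *may_expand,
                        size_t glyph_count, int extra_width, int *justified)
{
    size_t slots = 0, seen = 0, i;
    int share, remainder;

    for (i = 0; i < glyph_count; i++) {
        justified[i] = advances[i];          /* start from the original widths */
        if (may_expand[i])
            slots++;
    }
    if (slots == 0 || extra_width <= 0)
        return 0;

    share = extra_width / (int)slots;
    remainder = extra_width % (int)slots;
    for (i = 0; i < glyph_count; i++) {
        if (!may_expand[i])
            continue;
        justified[i] += share + (seen < (size_t)remainder ? 1 : 0);
        seen++;
    }
    return slots;
}
```

The point of the sketch is that the widths can differ from glyph to glyph and need not be a whole space character, which is what lets the far edge line up exactly.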
For the other issue, the way of getting my (or anyone's) attention, I expect in most cases it helps if one just thinks of me not as an employee of your company or of you personally but as someone who has a job and really just blogs because it is fun and interesting to talk about the things that interest me (such as Uniscribe). If you met such a person, how would you approach them? If you had their email address, how would you word the email? And what would your expectation be? I expect the majority of people who frame the question that way will come up with an appropriate answer.
If the answer is needed urgently (which I assume it is) then there are many more formal support options that will guarantee the timeliness of the response, much more effectively than shouting the question from the rooftops (sometimes I end up involved with those too, and I serve at the pleasure of the customer).
I mean if they have an interesting enough question maybe I'll answer anyway. But my interests are pretty hard to pin down sometimes, and even the girl I go out with wonders how she catches my eye (though she does and I suppose once one catches my eye and not my ire then the hardest part is taken care of!).
And all of that is ignoring the challenges of figuring out my blogging schedule!
It kind of reminds me of the days back when I blogged about Doing a little more in Sri Lanka....
Back in the beginning of 2005. After all of the problems in Sri Lanka and an urgent need that people had for help.
Sure lots of money was given, by many people. Many people from Microsoft, in fact.
But there was also some assistance that was of help in a more technological sense, as I mentioned back then.
Well now it is 2010, and another natural disaster has taken place. In Haiti.
Again, the money has been flowing out there, to help with many of the different relief efforts.
Plus one additional effort, discussed in Announcement: Haitian Creole support in Bing Translator and other Microsoft Translator powered services (see also the info about the bot that can speak the lingo in Haiti, as well -- a huge boon in a country like Haiti that has ~55% illiteracy in the country)....
The question that came in was:
My customer have a questions with EUDCEDIT program on Windows XP.
As we know, if we use EUDCEDIT to add some characters on XP. It will create two files: eudc.tte and eudc.euf.
The question is, if we lost eudc.euf file, could we restore this file from corresponding eudc.tte? Because even we only have eudc.tte, the new characters still can work well.
Now I have written about EUDC before, on several occasions. But knowing something about how it is used and how it interacts with different parts of the system doesn't necessarily make someone knowledgeable about the authoring issues.
I mean, I had some thoughts on the subject but this seemed like a better one to get some more expertise on....
Luckily Peter was around to provide the answer that I suspected was true:
I believe the .euf file contains the originally-edited bitmap data from which the .ttf is edited. If the .euf is lost but you still have the .ttf, then you can display those EUDC characters, but any editing of the glyphs would have to be done in a different font-editing tool, such as Fontographer or Fontlab. I don’t know of any way to recover the .euf file from the .ttf file. (In theory, it should be possible to generate an .euf that was approximately the same, but I don’t know of any tools that support that.)
It is possible that he is being a shade optimistic about tools being able to view/edit the .TTE files, but there aren't a whole lot of technical issues blocking it, so if they don't then they ought to. EUDCEdit itself is primitive enough that sophisticated options such as this seem a little out of scope, but of the many tools out there I assume some must be able to do something with TTE files.
Any of the regular readers here know for sure?
Over in the Suggestion Box, Andrew West asked:
Hi Michael,
I have used the Table Driven Text Service to create an input method for a not-yet-encoded script (Tangut) using PUA codepoints. It was easy and it works great, except that the PUA Tangut characters do not display on the candidate list, which is quite inconvenient, especially in those cases where two or three characters share the same input sequence. So I was wondering, is there any way to specify what font to use for the candidate list?
Thanks,
Andrew
Now it is interesting that you ask this question, Andrew.
As you know, we aim to please here. :-)
As it turns out, the folks in Taiwan were not thrilled about the way that the Array and DaYi IMEs looked on non-CJK UI language systems -- since DEFAULT_GUI_FONT was being used, the size ended up being 8 on those machines, which is not great for those IMEs on Chinese, Japanese, or Korean UI language systems.
They wanted 9pt, darn it!
Plus any time they were on Simplified Chinese UI language system, they were getting the font chosen in the IME that they just didn't want (SimSun).
They wanted PMingLiU, gosh darn and to heck with it!
So in Windows 7, the information/option to support them was added to the TableTextService text files. You can check it out in the Windows 7 files for TableTextServiceDaYi.txt and TableTextServiceArray.txt in the configuration section:
[Configuration]
FontFaceName=PMingLiU
FontSize=9
And there you go!
You can put in some other font facename/size there and that font will be used instead (otherwise it will default to whatever GetStockObject(DEFAULT_GUI_FONT) returns, as it does for most of the built-in IMEs, and in Vista)....
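For a custom table like the Tangut one, the section might look something like this. The face name below is purely a placeholder for whatever PUA font is actually installed, and the size is an arbitrary pick; neither value comes from the Windows files.

```ini
; Hypothetical values for illustration only
[Configuration]
FontFaceName=My Tangut PUA Font
FontSize=12
```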
So I guess maybe you could say that you love Windows 7, and that it was your [independently requested] idea, Andrew!
Warning: this blog will not be as nice as some of the other blogs in this Blog have been, previously.
Remember how I used to have a Unicode character sponsor every blog in this Blog?
Well yesterday I was in my Twitter account (http://twitter.com/michkap) having some random goofy moments, so I was tweeting some sponsorship tweets, with characters sponsoring me.
Like this one:
is sponsored by ䷲ (U+4df2, aka HEXAGRAM FOR THE AROUSING THUNDER). This one spins itself, no additional comment required. #fb
Now people who had seen the feature on the blog would see this as nothing new. But this was a novel thing for me in Twitter and probably a good reason that most people think Twitter can be a waste of time since it was not doing anything useful.
Now since the primary purpose of Twitter (for me) is to bring some of the discipline that its 140-character limit brings to its tweets and extend it to the statuses in my Facebook account (http://facebook.com/michkap), that #fb hashtag at the end causes a Facebook app to pick up the tweet and make it a Facebook status.
This lets me waste twice as much in the way of people's time while spending half the time that would usually require.
Anyway, I managed to [re]discover a problem this way.
You see, fellow blogger Larry Osterman is also a Facebook friend of mine, and he noticed a problem with this tweet:
do you have a link for those of us who are running a unicode challenged browser like IE?
Now I had been running in FireFox for reasons not terribly relevant here relating to a bug which I am told has been fixed but it takes me a while to recover.
So my view was like this:
Clearly I was able to see the character.
So I launched Internet Explorer to see it as Larry was seeing it:
and yes, the character is not visible there. Bummer.
Now I know some people hate Internet Explorer.
And one could jump on the bandwagon and show this is convincing proof that FireFox rules while Internet Explorer drools but one would be wrong to do so.
Because this problem is not Internet Explorer's fault.
Well, not really.
It is a problem of group focus versus customer scenarios, in actuality. I should probably explain:
You see, the Windows team is most focused on and concerned with the product version they are working on. This makes sense since for the most part there is another sustained engineering group that is most focused on prior versions.
But if you are a member of a group that produces Microsoft Office or Visual Studio or the .NET Framework or SQL Server or Internet Explorer then you know you have to run on other versions. So the exciting features of the latest version of Windows are of interest but hardly the only consideration, since these products have minimal interest in sucking on earlier versions of the operating system.
The picture:
Got it?
Now the NLS folks have owned MLang for years now, part of a restructuring from I believe when the original IE6 team kind of disbanded and went to the four winds, which in its own way is kind of unfortunate since the reconstituted IE team did not take it back but relied on the NLS team to continue to own this library.
Why is it unfortunate?
Because the NLS team didn't touch it.
They fixed security bugs and occasionally fixed major reported problems (though usually did not touch those either, due to lack of testing resources to verify changes or backcompat concerns).
In my opinion every time a feature was better developed in Windows than MLang, the MLang feature should have been gutted and made into a wrapper around the non-MLang version. And any feature that did not exist elsewhere but was needed by the IE team should continue to be maintained, for the simple reason that Internet Explorer was depending on it and they are a partner team whose product is, for some people, the only place that some international features would ever be used with any kind of frequency. And if not then we should just give it back to them and let them control their destiny here. This worked with .Net (where we chose to continue to own the globalization pieces) and in Office (where we provided them with snapshots of the data).
I was regularly overruled on this opinion.
So support for several international features in Microsoft's premier Internet platform piece just started falling farther and farther behind.
In this particular case, the case that Larry noticed here is yet another side effect of the problem I mentioned in The importance of Tagalog to Burmese, aka "Of course I'd lie to you, I'm a font!".
Yet another bit of font support that the typography team worked so hard to support - in this case to add to the new Segoe UI Symbol:
becomes a great opportunity to make FireFox look better on Windows 7 while Internet Explorer 8 gets to look dumb for no good reason.
Note that after I first put up The importance of Tagalog to Burmese, aka "Of course I'd lie to you, I'm a font!" I made the recommendation that at a minimum the two bugs get fixed (since those were scripts that Windows claimed to support and were legitimate bugs) but ideally the basic table get updated to support everything in Unicode (which would be harder; the bug only involved entries in a table while the full fix involves some new script IDs which means other work).
I was overruled since the idea of updating MLang was simply not one that the folks deciding stuff wanted to entertain.
Personally, I think Internet Explorer should just make a land grab and take back MLang, doing a good solid job on it to bring their support to where it ought to be. Because being owned by the NLS team is a good thing when they are supporting you and your goals, but it really sucks if you are being put in maintenance mode. IE8 is by report a pretty good browser and deserves to be treated with more respect by its partners.
Perhaps this won't happen either, but if nothing else maybe the team that owns MLang now (post Windows re-org I cannot claim to know with 100% certainty who that is) can be shamed into updating a frigging table. Either on their own or with help.
I could do the bug fix work myself in an afternoon by updates to one source file. I'll even give them the updated mlflink.cpp source file myself if they are worried about the time sink to look up the latest Unicode information. I'll even give the update to the SE folks in case they would like to unlameify any of the prior versions of IE. Plus I'd help whoever wanted to do the full fix any way I could....
Internet Explorer 9 (and frankly Internet Explorer 8 and Internet Explorer 7) deserve better.
This post brought to you by ䷐ aka ䷐ if you explicitly tag the font (U+4dd0, aka HEXAGRAM FOR FOLLOWING - as @DaleSchultz pointed out to me, a great character for Twitter!)
It was actually just over a year ago when Michael Holtstrom asked a question over in the Suggestion Box:
Hi. Here's a little program. I know it's not unicode, but the product I'm working on is 14yrs old, so it's just too late for that.
#include <iostream>
void info() { char line[1024]; printf("\n Input via gets() "); gets(line); printf(" Echo via printf() %s\n",line);}
int main(int argc, char** argv) { info(); setlocale(LC_CTYPE,""); info(); return 0;}
So, on my dos console, built from visual studio 98, this works just fine, but built from visual studio 2008 the characters no longer round-trip.
For example, after the setlocale call, ALT+252 shows SUPERSCRIPT LATIN SMALL LETTER N as expected from cp437. And the byte from gets is xFC as expected. But when you give xFC to printf, it displays as LATIN SMALL LETTER U WITH DIAERESIS as would be expected from cp1252.
Now I realize that I can work around this by using ReadConsole/WriteConsole instead, but isn't it a little insidious that on a completely default system, using basic calls like gets/printf/setlocale, simple IO doesn't round-trip?
Maybe I'm missing something, but it seems like someone has intentionally gone out of their way to make me suffer.
I'd love to know why.
Thanks.
P.S. why call setlocale? Because we always have, and they're scared of what will happen to the database drivers, etc. if we change it.
P.S. why care about non-ascii? Because many apps talk to our db and all latin1 is legal. We've already gone to a lot of trouble to avoid best-fitting when printing to the console, and the new behaviour destroys that.
Sorry it took me so long to get around to this one, Michael. There has been a lot going on....
Now there are shades of the Anything still wrong is probably wrong for good.... issues here and the complex issues surrounding the CRT's setlocale.
As I've mentioned there and other places and as people have noted for a long time, the nature of setlocale with the "" locale call is complicated and seems to change from time to time due to both OS settings and CRT changes.
In this case, there was a concerted effort to take the implied meaning of setlocale's "" setting, namely
Sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.
and actually switch more of the CRT to use the ACP rather than leaving the old behavior unchanged.
To make it work the old way in all versions, you can change
setlocale(LC_ALL,"");
to
setlocale(LC_ALL,".OCP");
though note that this will potentially also change more than was intended (it will fix the reported issue, but you could run into another problem with something changing that you didn't expect to).
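One way to limit the "more than was intended" risk is to touch only the category the original repro used. The following is a minimal sketch, assuming the Microsoft CRT; the helper name use_oem_ctype is made up for illustration, and ".OCP" is a Microsoft-specific locale string that other C runtimes are expected to reject.

```cpp
#include <clocale>

// Hypothetical helper: switch only the character-type category to the OEM
// code page, leaving the other locale categories alone.
// ".OCP" is a Microsoft CRT locale string; on a runtime that does not know
// it, setlocale returns a null pointer and the current locale is unchanged.
bool use_oem_ctype() {
    return std::setlocale(LC_CTYPE, ".OCP") != nullptr;
}
```

Calling use_oem_ctype() in place of the setlocale(LC_CTYPE,"") call in the repro is the idea; whether that is safe for the database drivers mentioned in the question is something only testing can answer.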
But the CRT, in many cases, is beholden to a standard that it at least passively tries to live up to, just as most C compilers have a C runtime that tries to live up to that standard (possibly with their own extensions like Microsoft's does).
So if a function is documented as being impacted by locale setting changes then fixing problems where the impact is not happening is simply making the CRT more conformant -- a change that many people feel is long past due and they are glad it happens more and more each version....
I have no strong feelings in either direction, but I will note that there is no way to become more conformant if one retains nonconformant behavior, like ignoring statements about expected locale dependencies.
Now note that I do think the inconsistencies that remain are still kind of weird - like the fact that puts (which behaves the same as printf) does not do the same thing as gets here. This seems like a bug, although usually it boils down to the fact that the implementations of these functions are not isolated from each other: a function that specifies no locale will deep down be calling one that does, so there has to be an actual locale in there, and then there you go.
But perhaps there is a way to dig in here a bit and treat the functions that deal with the console differently -- and have them use the settings attached to the given console in which they are running, across the board. Since the console is also "locale" based in a broad sense such behavior would also be conformant....
I have a blog I have been writing off and on for a couple of years now all about digit substitution.
That blog is coming soon and will be my definitive and final thoughts on the feature and its implementation(s).
This is not that blog.
This is a blog about digit substitution, though.
Digit substitution and GDI+, this one is about, actually.
The question:
Issue:
Need to get the lang id and “standard digits” values that the user has chosen for digits.
Scenario:
the user locale is set to "English (United States)" In "Additional settings..." in Region and Language, "Standard digits" has been set to ٠١٢٣٤٥٦٧٨٩. "Use native digits" has been set to National.
Problem:
In the above scenario, all the windows UIs like shell, notepad, etc are showing ٠١٢٣٤٥٦٧٨٩. However, when I render text using GDIplus drawstring, it is rendered as 0123456789
I was able to get the desired output when I used format.SetDigitSubstitution(0x0C01, Gdiplus::StringDigitSubstituteNational); (Our code is in C++.)
Question:
Does anyone know the win 32 api or how to get the lang id and “standard digits” values that the user has chosen for digits.
This ought to be easy enough to answer, one might expect.
Unless one has spent any time dealing with digit substitution, of course!
Unfortunately, the answer is that there is no way to query for this information in this scenario, where one changes the fundamental digit choice to be different from that of the user's standards and formats language, also known as the default user locale for the given user.
Now if one looks at the internal tables behind Uniscribe and GDI+ one will see lots of hard-coded, locale based info (as discussed in Digits -- there is no substitute), but the fact that they are kind of locale based is hidden from the caller of both technologies, though hidden in very different ways. In the case of GDI+ this way is a lot harder to work around, as this problem illustrates, since there is no intrinsic way to say "use the default user locale setting as modified by the user when appropriate". This is the setting that would easily solve this StringFormat::SetDigitSubstitution Method-based question.
Now in theory one could hope that not calling the above method with its limitations would lead to correct behavior, but this is apparently not the case. Before one gets too haughty it is reasonable to consider that one of the most common reported performance complaints about Uniscribe is its overabundant interest in user locale settings due to the anal retentive checking of the digit substitution settings.
So GDI+ is giving us some gain for the lack of this functionality - one less performance issue!
Unfortunately, if GDI+ does not provide a parameter on a StringFormat::SetDigitSubstitution Method overload that lets one say this (as the Uniscribe ScriptRecordDigitSubstitution Function not only allows but explicitly documents), then GDI+ is leaving this hole in support, a hole that really has no good excuse: if the lookup is only triggered by a StringFormat::SetDigitSubstitution Method call with LOCALE_USER_DEFAULT then it is hardly a negative performance issue, since no one is monitoring anything at all, really -- it can be a one-time lookup.
Note also that the Uniscribe ScriptRecordDigitSubstitution Function and in particular the SCRIPT_DIGITSUBSTITUTE structure that both it and the ScriptApplyDigitSubstitution Function depend on, introduces the notion of separate NationalDigitLanguage and TraditionalDigitLanguage values that do not have to be the same as the user locale since they can be updated via other processes.
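To make concrete what "national" substitution does to a string, here is a small portable sketch. It does not call GDI+ or Uniscribe at all; the function substitute_digits and its parameters are hypothetical, and the ten-code-point native_digits string simply stands in for the "Standard digits" value from the question.

```cpp
#include <cassert>
#include <string>

// Replace each ASCII digit in `text` with the matching entry of
// `native_digits`, which must hold the ten digits for 0 through 9 in order.
// This models only the visible effect of substitution, not the way GDI+ or
// Uniscribe decide when to apply it.
std::u32string substitute_digits(const std::u32string& text,
                                 const std::u32string& native_digits) {
    assert(native_digits.size() == 10);
    std::u32string out;
    out.reserve(text.size());
    for (char32_t c : text) {
        if (c >= U'0' && c <= U'9') {
            out.push_back(native_digits[c - U'0']);  // index by digit value
        } else {
            out.push_back(c);  // everything else passes through untouched
        }
    }
    return out;
}
```

Given the Arabic-Indic digits U+0660 through U+0669 as native_digits, the text "2010" comes back as U+0662 U+0660 U+0661 U+0660, which is the rendering the questioner was expecting from DrawString.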
There is a part of me that would like to mistrust the report from the questioner that passing LOCALE_USER_DEFAULT to the GDI+ StringFormat::SetDigitSubstitution Method really fails here, but I'm going to trust that it did fail to fix the problem as they told me when I made the suggestion.
Worst case has me reporting that I was mistaken here and I will only be the second most embarrassed person and I can live with that! :-)
Plus if it does actually work I'll still get the last word about how weird it would be that LOCALE_USER_DEFAULT and the LCID that is the current default user locale would have different behavior, despite all the work that NLS does to make them the same thing. Even though it is a screwy semantic at times, it has been there for a long time and it really ought to be respected by people who wish to opt in to the "LCID" datatype....
The question was one that has been asked before:
Our app could be deployed on Windows XP/2003 which doesn’t support en-SG (English-Singapore) and en-MY (English Malaysia), two cultures that we are shipping.
We are setting the CurrentUICulture explicitly and deploying the appropriate satellite assemblies. My question is: will this just work on XP/2003 or would we have to register custom cultures on these operating systems?
This issue is one that I often blame on the way UICulture keys off of CultureInfo, in particular as I described in Two things that suck about CurrentUICulture, part 1 and part 2. This keeps an architecture that could have worked just based on names, without requiring cultures to exist, from reaching its full potential: working in the specific scenario of supporting, on a given machine, cultures that may or may not exist on that machine.
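As an illustration of what "just based on names" could look like, here is a rough sketch of a fallback chain computed purely from the culture name, with no lookup of whether the culture exists on the machine. This is not how the .NET Framework implements resource fallback; fallback_chain is a hypothetical name, and the empty string standing for the invariant (neutral) resources is a convention chosen for this sketch.

```cpp
#include <string>
#include <vector>

// Build the list of resource names to probe, most specific first, by
// repeatedly trimming the last "-subtag" from the name.
// The empty string at the end stands for the invariant resources.
std::vector<std::string> fallback_chain(std::string name) {
    std::vector<std::string> chain;
    while (!name.empty()) {
        chain.push_back(name);
        const std::string::size_type dash = name.rfind('-');
        name = (dash == std::string::npos) ? std::string()
                                           : name.substr(0, dash);
    }
    chain.push_back("");  // invariant fallback
    return chain;
}
```

For "en-SG" this yields "en-SG", then "en", then the invariant entry, so a satellite assembly directory named en-SG could be found without the operating system knowing anything about that culture.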
Now the workaround is straightforward enough - a custom culture on the machine will let the culture work on that machine.
But this is an honest step backward from the world of native resources in the pre-Vista era, when a single binary could contain multiple languages keyed off of LCID values and it did not matter whether the LCID was understood by the machine or not.
And while on the topic, the Vista and later story is also a step backward in some ways, since it is also name dependent and the locale must exist on the machine. Though the old approach, all language resources in one binary, should still work there, a fact that would make this more of a step sideways, since it primarily affects the enhanced MUI story rather than the resource loading story that predates MUI entirely.
Managed code is better than native code in one very important way: the custom culture solution works much better there overall, as I pointed out in Thinking about MUI is making me bipolar. Though that bug is fixed in Windows 7 (verified just a moment ago, with random-piece-of-crap-locale working just fine there for native MUI too).
But both still fail in the "culture/locale not on the box" scenario....
Gwyneth's question was an interesting one:
Out of curiosity, do you know the history of why Unicode didn’t create separate characters for the Turkish i? The i is the only character that changes casing based on the language (Turkish/Azeri). I did a little searching online, but didn’t find any obvious references to the rationale behind this.
Thanks!
As was Peter's answer:
That would have been a Unicode 1.0 decision, before I had even heard of Unicode (which I first heard about around 1992/3), so I’m not sure. I suspect it was pre-determined by legacy standards. Encoding a Turkish i probably wouldn’t have been enough; it probably would have been necessary to encode a separate Turkish I as well. Arguably, both would have been duplicating characters and certainly they would have resulted in confusion with 0049/0069 – likely with some data getting encoded one way and other data using the other. Chances are we would have ended up facing the casing issues as well as data in mixed representations.
This sums up the principal reasons quite nicely!
It is unfortunate that the experience most people have with Turkish is how it highlights code that does not handle globalization issues (blogs like this one summarize the approach quite well and the "Turkey Test" is no worse than anything else one could call it).
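For anyone who has not run into it, the casing behavior at the heart of the "Turkey Test" can be sketched in a few lines. This is a simplified illustration that uses single code point (simple) case mappings only; upper_i and lower_i are hypothetical helpers, not any library's API.

```cpp
// Uppercase one code point from the i/I family. With `turkic` set, the
// Turkish/Azeri rule applies: dotted stays dotted, dotless stays dotless.
// Code points outside this family are returned unchanged.
char32_t upper_i(char32_t c, bool turkic) {
    if (c == U'i') return turkic ? U'\u0130' : U'I';  // U+0130: capital I with dot above
    if (c == U'\u0131') return U'I';                   // U+0131: small dotless i
    return c;
}

// Lowercase counterpart, again using simple one-to-one mappings.
char32_t lower_i(char32_t c, bool turkic) {
    if (c == U'I') return turkic ? U'\u0131' : U'i';
    if (c == U'\u0130') return U'i';
    return c;
}
```

The same two code points, U+0049 and U+0069, get different partners depending on the language flag, which is exactly the mixed situation Peter describes: separate Turkish characters would have avoided the flag but split the data instead.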
Though I think I still owe a blog post discussing vowel harmony and other linguistic features affecting Turkish. It is coming, eventually. The goal of giving Turkish a better legacy than the Turkey Test may be tilting at windmills, but maybe I can at least point out there is more out there....
Some of you may remember a few years back when a VP over at Adobe commented that perhaps the solution to the problem of rampant software piracy in China that the government there seemed uninterested in combating, or at the very least unable to combat (not to mention other problems), was to simply stop shipping software there.
Now it was very quickly denied as a matter of official Adobe policy, but the point was made - that patience could be exhausted.
I was reminded of all of this after reading David Drummond's (Google Senior VP, Corporate Development and Chief Legal Officer) A New Approach to China on Google's official blog.
Now of course Microsoft has seen both the issues the Adobe VP mentioned and the ones the Google VP mentions, in the case of Google before that company even existed, and despite feeling offended by rampant software piracy in a theoretical sense I have to admit that it makes little real difference to my own personal bottom line. So it bothers me, sure. But I only have so much RAM to devote to issues each day, and it often falls off.
Some of the information issues that Drummond raises feel kind of the same to me, despite the fact that I can recognize I am taking my own morality and using it to judge China's. It makes sense that an Adobe (or a Microsoft) would feel as offended by people stealing their product as Google would feel about the open flow of information being controlled -- in each case the stronger response about China's approach affecting something important to them is obvious.
For me, the China issue has always been most important in terms of the issues that directly affected the things in front of me -- their baffling minority language policies toward Uyghur, their Taiwan policies, their policies toward Tibet and Bhutan and even Kashmir, their terrible and inconsistent approach to international encoding and other standards, all sharing the same feel of being a large country that acts small, that acts so worried that it will be small in people's eyes that it will prove it is big even at the expense of language.
I guess you could say that is my biggest issue, and in a way it is the most arbitrary of all of the reasons to try to take issue with China or any of its policies, any of its politics.
My role vis-a-vis support in China, from minority languages whose support is demanded even as expertise is not made available, to the supplementary DLL to assist with GB18030 support, to GB18030 work in standards, to adding support for more ideographs than China will ever need or use, doesn't make me feel proud of the work I do so much as ashamed that I and the company I work for are so easily bullied.
I don't know how much time I can really spend railing on the issues though. Even if I were a VP at Microsoft (and I'm obviously not), these aren't Microsoft's issues exactly. My "test balloon" therefore carries none of the weight of Adobe's or Google's. But in my own way the fact that it is not my business, my money, my bottom line, that is on the line makes it feel like it is less about self-interest, for what it is worth.
Maybe I'll put some thought into what I can talk about, and then talk about it. Because my passions here do run high and it would be nice to be able to put some of my thoughts out there more specifically....
Developer Jason (an enthusiastic reader of the Blog) asked:
We need to be able to convert UCS-2/UTF-16 to a user-specified SBCS/DBCS/MBCS code page. Currently, we achieve this by simply taking the UCS-2 string and passing it on to WideCharToMultiByte with dwFlags set to zero. When converting to the Vietnamese code page 1258, this process can’t find a representation for the Vietnamese character U+1ec5 (Latin e with circumflex and tilde) even though one actually does exist (albeit with a combining diacritic from code page 1258: 0xea 0xde).
Converting Vietnamese glyphs from the Unicode BMP to the corresponding glyph representation in the Vietnamese code page seems like a reasonable thing for us to be doing. My question is, should I be expecting WideCharToMultiByte to know this and successfully convert the character? I can’t be the first person to hit this issue and I imagine the mapping tables have been reasonably static, so it seems like perhaps there is something more that I should be doing. Is there, for instance, an expectation that the input string is normalized into some canonical form before calling WCToMB? Presumably decomposed form?
An interesting question that will really draw on information from several different blogs from this Blog:
There are several people who tend to be dismissive about this code page, calling it at best incomplete and at worst broken. From a Unicode standpoint it certainly is, and arbitrary, to boot!
But there is a reasoning behind the code page, a point to which regular reader John Cowan's comment to blog #5 above is particularly relevant:
There's a Vietnamese-specific logic to CP 1258 that transcends the arbitrary Unicode normalization rules. The breve, circumflex, and horn accents, unlike the rest, affect vowel quality. If you look at a Vietnamese alphabet like the one at Wikipedia, you'll see that A WITH BREVE, A WITH CIRCUMFLEX, E WITH CIRCUMFLEX, O WITH CIRCUMFLEX, O WITH HORN, and U WITH HORN (as well as D WITH STROKE, which isn't Unicode-decomposable) are considered separate letters from their unaccented correspondents. Consequently, in 1258 they are encoded using seven precomposed characters.
On the other hand, the grave, acute, hook above, tilde, and dot below accents are tone marks, conceptually not part of the letters they appear on. They're encoded using combining characters, since encoding them using precomposed characters would create a combinatorial explosion of 12 x 6 x 2 = 144 distinct vowel characters. (The VISCII encoding actually does that, at the expense of filling the whole 0x80-0xFF space with letters and even usurping six of the control characters!)
Unsurprisingly, Vietnamese conventions always place the tone mark outside any breve, circumflex, or horn diacritic (and therefore following it according to Unicode rules).
Thus it is incorrect to say that U+1ec5 is not supported by cp1258; it may be true that U+1ec5 (ễ, aka LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE) is not supported as a discrete single code point, but U+00ea U+0303 (ễ, aka LATIN SMALL LETTER E WITH CIRCUMFLEX + COMBINING TILDE) is - and according to Unicode those two things are to be treated as the same thing. Given that the tilde is a tone mark in this particular case the language-specific way in which the various components of the letter are used makes sense whether the form follows Unicode normalization rules or not.
Which in this case it doesn't.
Thus the wider question of which Unicode normalization form to use that was one of the main points of Jason's inquiry is in fact a trick question: the answer is neither!
Instead, the Microsoft-specific normalization pseudo-Form V mentioned in #5 above is what would be needed here if one wanted to convert.
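A minimal sketch of what such a conversion has to do, limited to the lowercase e-with-circumflex letters so it stays short. The function name to_cp1258 is hypothetical and this is not WideCharToMultiByte's behavior; the 0xEA 0xDE pair is the one from Jason's question, and the other tone mark bytes are taken from the code page 1258 table as I understand it.

```cpp
#include <string>

// Hypothetical partial mapping from precomposed Unicode letters to cp1258
// bytes in "pseudo-Form V" order: the vowel-quality letter first, then the
// combining tone mark. An empty result means "not covered by this sketch".
std::string to_cp1258(char32_t c) {
    const char e_circumflex = '\xEA';  // U+00EA is a single byte in cp1258
    switch (c) {
        case 0x00EA: return std::string{e_circumflex};
        case 0x1EBF: return std::string{e_circumflex, '\xEC'};  // + U+0301 acute
        case 0x1EC1: return std::string{e_circumflex, '\xCC'};  // + U+0300 grave
        case 0x1EC3: return std::string{e_circumflex, '\xD2'};  // + U+0309 hook above
        case 0x1EC5: return std::string{e_circumflex, '\xDE'};  // + U+0303 tilde
        case 0x1EC7: return std::string{e_circumflex, '\xF2'};  // + U+0323 dot below
        default:     return std::string();
    }
}
```

The U+1EC7 row is the one that shows why no standard normalization form helps: Unicode decomposes that letter into U+1EB9 (e with dot below) plus a combining circumflex, while the code page wants e with circumflex plus a combining dot below.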
Now that is a big if in that last sentence.
Since Microsoft's Vietnamese keyboard layout produces text that will be perfectly represented on code page 1258, there are only three scenarios where one would need that "pseudo-Form V" to convert out of Unicode:
For the third point the quick answer is to just not do that, if it is possible.
But of course even that is not always possible, so if the sad truth is that some component that cannot be changed is putting the data in some other form then some type of conversion from [probably] normalization Form C to pseudo-Form V would be needed.
This is something that does not exist, though the only requirement that a single byte code page such as 1258 cannot handle is the times when one code point would need to be converted to two, e.g.
and so on through all the other various letters covered by the code page.
Unfortunately a simple, table-based double byte code page could not properly support such a custom "Vietnamese Plus" code page mapping.
EXTRA CREDIT: Can anyone here discern and/or explain why, exactly? :-)
Thus one could build a DLL-based mapping (as in Custom code pages? Redux) and just keep these tables around in code if one wanted to. But one would obviously have to have some vested interest in wanting to (e.g. a need to support cp1258 data with data in Unicode that isn't currently in pseudo-Form V).
I was most of the way to having this done (auto-generated) to post as a sample before it occurred to me that there might be very good reasons for a full-time Microsoft employee, even a pain in the ass one like me, not to post such a thing.
Though if anyone wants to do it, note that I was using cp 51258, for obvious reasons.
If you wanted to create such a DLL-based code page and there is any way to create a standard usage out of a non-standard/unsupported code page, I would encourage you to do the same! :-)
Now for the record let me say this is an area where I do not really tend to agree with the Microsoft party line completely. I mean, I truly believe that Unicode is the best answer here in the long run, but I am hardly naive enough to believe that everyone has made that change yet and surprisingly [to some] not obnoxious enough to think it is acceptable to do nothing further to assist customers. Especially when we expect people to migrate and we know we aren't the most popular non-Unicode solution, the fact that we provide no assistance here and aren't even remotely apologetic as we vote to make Unicode less and less compatible with our own solution even as we make it harder to use is really not my style. To be honest, the fact that we do not have a better solution for integrating with Unicode in the Vietnamese case is also pretty bad -- not even the excuse of backcompat, the only explanation is that no one wants to do the work because supporting Vietnamese correctly and more consistently with Unicode just doesn't hit anyone's radar. So no one wants to use us and the problem perpetuates itself.
When you consider in particular the history of Microsoft in regard to VNI, it just makes Microsoft look worse. Perhaps there are even legal reasons related to the VNI thing that we are required to suck here that no one has told me about?
But anyway and either way, that is how things are right now, so my dissenting opinion is unlikely to reach any higher level than the blog post you are reading....
I got an email from Mike the other day:
Hi Michael, Just a quick FYI, a bit of great news (I guess 15 years is as good a time as any). VC2010 now generates Unicode RC files (when using the project wizard to generate a new app). Wow, I'd never thought I'd see the day. It was a great day when VC2005 actually supported opening and saving of Unicode RC files, but this is the icing on the cake. Now all those people using obsolete source control systems and diff utilities are really gonna have to update to support these newly generated projects that include Unicode RC files or they're in for a surprise :)
Woo hoo!
I agree this is very good news, and very good icing on this particular cake.
Remember when I talked about the first part of this, in The Unicode train is leaving the station, back in 2005?
I remember inside Microsoft how awful the diff'ing situation was until someone updated WinDiff to support Unicode; Mike has a good point about those other diffing programs!