If I had a dollar for every time someone said the exact words that form the title of this post, I would be able to rival the annual salary I paid myself as Chief Software Architect of my software company before I became a serf at Microsoft. :-)
As Cathy Wissink first told me a long time ago, NLS may be best thought of as an acronym for Not Localization, Stupid. Which is not to call localization stupid; it's not. In fact, as I implied yesterday and the day before, localization really depends on good support from what is covered by NLS, and could thus be thought of as a proper subset in any software that is not itself stupidly implemented for international markets.
The term stupid in that recasting of the acronym is meant to refer to the people who think that what I do is localization. Though in polite Kyoto to Tokyo form, I usually tell the person who made the mistake that NLS stands for Not Localization, Silly. It is just more polite....
Also to alter the aside from TStT:
[Aside: This anecdote is sort of, but not exactly, an instance of what Mark Liberman calls "silly talk about linguistics". It doesn't really qualify, though, because the misconception (a) was really about localization, not internationalization or NLS in general, and (b) was brought up not because I mentioned I work in NLS, but rather because in this case they claimed to have read my blog when pretty obviously they could not have read any part of it.]
Speaking of the RHDL Emeritus again for a moment (Kieran claims that I have caused Cathy's Google hits to expand voluminously though a quick look seems to disprove that notion), when she and I write articles she usually has to translate the Michaelisms into English, so perhaps Cathy is in fact a localizer in that context? :-)
And that reminds me, speaking of articles, we have one in the October 2005 MSDN (the one I referred to last month), which will be available online in a few days. She struggled to localize this one back to English in addition to all of the work to provide actual content. I wonder if she pines for the days before I was a serf when we actually got paid for that sort of thing? :-)
Maybe that is a better answer for the silly NLS question:
woman on plane: You work for Microsoft, what do you do?
me: I work in NLS.
woman on plane: NLS, what is that?
me: Well, basically the international support. So people can use their language (or really any language) in Windows.
woman on plane: NLS, that is wonderful! How many languages do you speak?
me: Only one. But I have a colleague who is really patient about it, and she translates for me.
Kind of passive/aggressive, but no worse than when I used to respond to "Jewish? Where are your horns?" type questions as a teenager travelling through West Virginia with "I don't wear them when I travel."
Back in January of this year I said a little bit about the new CharUnicodeInfo class.
And then back in March of this year I talked about the stability of the Unicode character database.
I specifically tied these two together at the end of that second post:
So what does that mean for us in the world of the .NET Framework and the new class in Whidbey that captures (among other items) the Unicode general category, as described in A little bit about the new CharUnicodeInfo class?
Well, it means two things, primarily:
- These values will not change very often.
- There are times that some will change. Not many, and there is always a carefully thought out reason, but it can happen.
And the class is not called "CharMicrosoftSpinOnUnicode" which means that by and large the class needs to follow the standard. Any code that you write using the CharUnicodeInfo class must take this into account....
As Microsoft gets better and better about standards, it will become more and more important for code to recognize that this sort of thing is possible.
That text was recently put to the test in the first release to include CharUnicodeInfo, in a decision to update the version of Unicode it supported from 3.2 to 4.1.
When I read that stability post, I remember that I was actually at all of the Unicode Technical Committee meetings when the changes between 3.2 and 4.1 were decided, and I can promise you that the concerns about compatibility with the changes that were made were very serious and were very extensively discussed.
And I remember the sense of deja vu I got when I was explaining to the various people involved with Whidbey breaking changes why this update was important. They had the same concerns, even for a breaking change between beta versions of the class. But the truth is that a class purporting to represent Unicode has to represent Unicode. Truly. Even if it does mean an occasional change that impacts code that depends on the results.
I can promise you that the people at UTC meetings representing Microsoft will be taking a stronger interest in changes made here, to make sure they are as small as they possibly can be. And no smaller....
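To make the "take this into account" advice a bit more concrete, here is a small sketch of the same idea in Python rather than .NET. It uses Python's standard `unicodedata` module as a stand-in for CharUnicodeInfo (that substitution is mine, not something the Whidbey team did), and the helper names `describe` and `is_letter` are made up for the illustration. The point is only that code should ask the database at run time, and note which Unicode version it is talking to, rather than bake in a table captured from one version.

```python
import unicodedata


def describe(ch: str) -> str:
    """Summarize a character using whatever Unicode data this runtime ships."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name} [{unicodedata.category(ch)}]"


def is_letter(ch: str) -> bool:
    # Query the general category at run time instead of hard-coding a
    # list of "letters" copied from one particular Unicode version.
    return unicodedata.category(ch).startswith("L")


if __name__ == "__main__":
    # The version differs between runtimes, which is exactly why results
    # for a handful of characters can differ too.
    print("Unicode data version:", unicodedata.unidata_version)
    print(describe("\u00fe"))
```

If a result matters enough to persist (say, in an index), storing the data version next to it makes it possible to notice when the underlying standard has moved.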
This post brought to you by "þ" (U+00fe, a.k.a. LATIN SMALL LETTER THORN)
A developer at Microsoft recently asked me the following:
In control panel -> Regional options, the short date sample some times show national digit shapes. Is there any way, I can find out using an api which digit shape is used in the short date example of regional options?
The most important place to look is at GetLocaleInfo and specifically the LCTYPE values LOCALE_IDIGITSUBSTITUTION and LOCALE_SNATIVEDIGITS:
Value   Meaning
0       Context - the shape depends on the previous text in the same output.
1       None/Arabic - gives full Unicode compatibility.
2       Native - national shapes determined by LOCALE_SNATIVEDIGITS.
Regional and Language Options uses GetDateFormat to get the formatted string and it is Uniscribe that takes the string when it is rendered and decides what to substitute, as appropriate.
You see, it is Uniscribe that does the work, and more specifically the ScriptApplyDigitSubstitution and ScriptRecordDigitSubstitution functions (plus the supporting SCRIPT_DIGITSUBSTITUTE structure). Since Uniscribe has shipped in the box for every version of Windows since Windows 2000, the information about when and how to substitute digits is indeed valuable, and the Uniscribe implementation proves it. You can use these functions to do all of the "heavy lifting" for figuring out how to handle digits in a wide variety of situations....
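For anyone who wants to see what the three values amount to, here is a rough, platform-neutral sketch in Python. It is not the Win32 or Uniscribe API: the `DigitSubstitution` enum just mirrors the 0/1/2 values in the table above, `substitute_digits` is a hypothetical helper, and the ten-character native digit string stands in for what LOCALE_SNATIVEDIGITS would hand back. Context mode is deliberately left out, since it depends on the surrounding text and is the part Uniscribe is there to handle.

```python
from enum import IntEnum


class DigitSubstitution(IntEnum):
    """Mirrors the LOCALE_IDIGITSUBSTITUTION values listed above."""
    CONTEXT = 0
    NONE = 1
    NATIVE = 2


# Thai digits U+0E50..U+0E59, in the same "0 through 9" order that a
# native digit string is assumed to use here.
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))


def substitute_digits(text: str, native_digits: str, mode: DigitSubstitution) -> str:
    """Apply digit substitution to already formatted text (sketch only)."""
    if len(native_digits) != 10:
        raise ValueError("native_digits must hold exactly ten characters")
    if mode == DigitSubstitution.NONE:
        return text
    if mode == DigitSubstitution.NATIVE:
        return text.translate(str.maketrans("0123456789", native_digits))
    # Context mode needs analysis of the preceding text; not attempted here.
    raise NotImplementedError("context substitution is left to the shaping engine")


if __name__ == "__main__":
    print(substitute_digits("25/09/2005", THAI_DIGITS, DigitSubstitution.NATIVE))
```

Note that the substitution happens on the way to the screen; the formatted string itself keeps its ASCII digits, which matches the split between GetDateFormat and Uniscribe described above.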
This post brought to you by "๙" (U+0e59, a.k.a. THAI DIGIT NINE)
A recent thought from a reader, sent via the Contact link:
Hi. I actualy tried finding the correct blog post to submit this response too - but I couldn't
Anyways - A while ago you had a couple of posts on internationalized text esspecially in the browser however you also mentioned how it can be used to cloak a bad file in explorer etc...
Would this make any sense at all?
In vista- Create a "Secure Unicode" Rendering function - Sort of a "overriden" implemetation of drawText that will draw a little squiglly under any unicode charecter that is deemed suspious (You linked to a RFC that had some good ideas there) - this suiglly would be draw in the sane pen as the font and it would look similiar to the squiggly that word draws under misspelled words.
In any situtation where a unicode char might be used to fool the user into doing something he probablyu does not want to do Windows ,(and third party apps0 can use this version to ensure that the user is notified when a charecter might not be exactly what it looks like.
I can see this being used in windows explorer for file listings - or perhpas in login text boxes etc, email address boxes (I can send yo a link asking you to send sensitive info to a email address that looks similiar to an address you trust) etc....
Just a thought (obvioussly..)
This is an interesting suggestion, and it would be a fascinating use of the mitigation tools for IDN security problems that I posted about last month in any application, whether it was from Microsoft or not, even if a specific Win32 or managed API function were not added to the platform or the .NET Framework.
But with that said, it would be fascinating to see such a function!
I would love to see such an idea with even more functionality, like an underlying "confidence level" that would score the confidence that a string was in fact valid and a way to pass to the function the score required to show the visual difference between the two forms of text. And maybe even two HDC values, one for the safe text and the other for the potentially suspect text. I think it would be a fascinating extension to the tools that were originally posted for dealing with IDN security problems but which obviously could play a much wider role in software.
So it is just a thought but one that is good enough that I would even give it attribution had the person left a full name. :-)
Now the original functionality was added in the final days of the 'Whidbey' product cycle, so it was really too late to add any more functionality there, and it is unclear what more could be added to Vista in the way of new features, but the idea (as evidenced by the ideas I spitballed in just a few moments two paragraphs ago!) has a lot of potential in my mind.
I do not know if such a function is planned, but it may already be in the works. If I hear anything I'll let you know. I think it is a truly intriguing thought, the potential design of which would make for a fascinating interview question.... :-)
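Since I called it an interview question, here is one very rough way the "confidence level plus threshold" idea could be sketched, in Python. Everything in it is an assumption of mine: the function names are hypothetical, the script of a character is approximated by the first word of its Unicode name, and the "squiggle" is just a row of `~` marks under the suspect characters. It is not the IDN mitigation functionality mentioned above, and a real design would want the proper script property and confusables data.

```python
import unicodedata
from collections import Counter


def script_of(ch: str):
    """Crude script guess: first word of the Unicode name (an approximation)."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def score_string(text: str):
    """Return (confidence, suspect_indexes) for a string.

    Confidence is the share of letters that belong to the dominant script;
    suspects are the positions of letters from any other script.
    """
    scripts = [script_of(ch) for ch in text]
    counts = Counter(s for s in scripts if s)
    if not counts:
        return 1.0, []
    dominant, dominant_count = counts.most_common(1)[0]
    suspects = [i for i, s in enumerate(scripts) if s and s != dominant]
    return dominant_count / sum(counts.values()), suspects


def mark_suspects(text: str, threshold: float) -> str:
    """Build the 'squiggle' line: '~' under suspects if confidence < threshold."""
    confidence, suspects = score_string(text)
    if confidence >= threshold:
        return " " * len(text)
    return "".join("~" if i in suspects else " " for i in range(len(text)))


if __name__ == "__main__":
    sample = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
    print(sample)
    print(mark_suspects(sample, threshold=0.9))
```

The caller picks the threshold, so a file listing could be more forgiving than a login box, which is the sort of knob I had in mind.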
This post brought to you by "а" (U+0430, a.k.a. CYRILLIC SMALL LETTER A)(the original sponsor of the mitigation post, and a letter that truly resents those who would USE it to try to fool users of computers in any kind of phishing expedition!)
Yesterday, in response to my post Its not localization, really, regular reader CornedBee commented:
I always defined internationalization as "giving an application the ability to be localized with reasonable effort," and localization as "adapting an application to local language, appearance and similar issues."
Now note that this does fit the model I had where I pointed out that localization presumed a good internationalization story, but it goes a little further than I would here....
I always called "giving an application the ability to be localized with reasonable effort" localizability rather than internationalization.
Localizability not only included "good internationalization" and "good typography" but it also included stuff like
Compare this with the items in Norman's list that I considered internationalization:
The difference between "localizability" and things under "internationalization" in my mind being that items in the "I" list are required even if a customer is going to use the product in its original user interface language. So if they want to consider the plain old English UI to be properly internationalized for them, they need all of those items to be correct. Even if it is never localized at all. And if the UI needs to be localized then it is much more expensive and difficult to do so if the items in the "L" list are not done. But items in the "L" list are technically not required if the product is never localized -- there is no real harm if they are done, but it is time that is only needed to prepare for good localization and if it is never done then the lack will not be noticed by a customer who is not looking at the localized version....
Perhaps I need to add a glossary here? People may or may not agree with my definitions, but listing them out in one place may save some time, like what I did with some keyboarding terms.... :-)
This post brought to you by "z" (U+007a, a.k.a. LATIN SMALL LETTER Z)(A letter that knows it would have a less prominent placement if there were a UK localisation of internationalised software products!)
Kevin asked me, via the Contact link:
Having spent a considerable amount of time browsing the internet, knowledgebases etc. and reading lots of information about encodings, I need some "expert" help (please!).
We are writing a worldwide .NET application to get data from IBM mainframes including Japanese, and though I've seen codepages 50930 and 50939 documented I've never been able to find the actual encodings or .nls files.
Do these encodings exist for Windows (.NET) and if so where can I get them?
I even tried installing Japanese Windows XP!
We are a Certified Partner but getting this information has proved to be near impossible!
When I look at the Character Set Recognition (Internet Explorer - DHTML), it has the following items in it:
CharsetFriendlyName | Preferred Charset Label | Aliases | IE Ver | Min OS | CodePage | FamilyCodePage
IBM EBCDIC (Japanese and Japanese Katakana) | x-EBCDIC-JapaneseAndKana | | IE5 | Win2000 | 50930 | 932
IBM EBCDIC (Japanese and Japanese-Latin) | x-EBCDIC-JapaneseAndJapaneseLatin | | IE5 | Win2000 | 50939 | 932
IBM EBCDIC (Japanese and US-Canada) | x-EBCDIC-JapaneseAndUSCanada | | IE5 | Win2000 | 50931 | 932
IBM EBCDIC (Japanese katakana) | x-EBCDIC-JapaneseKatakana | | IE5 | Win2000 | 20290 | 932
Note that these code pages do not exist in NLS, so either they exist in MLang or maybe even in IE directly (though probably in MLang, which is mostly where this whole list comes from). I think I mentioned that sometimes MLang actually calls the NLS code page functionality, anyway, right?
You can certainly try to do the conversions in MLang and see what happens, right?
Now the EBCDIC code pages are not very great when it comes to coverage of a huge variety of scripts, if you know what I mean. There are many good reasons why IBM has ICU rather than extensive international support based on the EBCDIC code pages!
Perhaps you may even find that 932 covers it all just as well or better?
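The "try it and see what happens" approach can be sketched in a few lines. This is Python, not MLang or .NET, and the helpers `codec_available` and `try_decode` are names I made up; the only encodings it can probe are the ones that particular runtime ships. I use `cp037` (a single-byte EBCDIC code page that Python's standard library includes) purely as a stand-in, and I am not claiming anything about whether 50930 or 50939 are available on any given platform.

```python
import codecs


def codec_available(name: str) -> bool:
    """Report whether this runtime knows an encoding by the given name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def try_decode(data: bytes, candidates):
    """Decode with the first candidate encoding that exists and accepts the data.

    Returns (encoding_name, text), or None if no candidate worked.
    """
    for name in candidates:
        if not codec_available(name):
            continue
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None


if __name__ == "__main__":
    sample = b"\xc8\x85\x93\x93\x96"  # "Hello" in EBCDIC code page 037
    print(try_decode(sample, ["no-such-codec", "cp037"]))
```

Probing like this at least separates "the code page is not installed" from "the code page is there but the data does not fit it", which are two rather different support calls.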
This post brought to you by "ヒ" (U+30d2, a.k.a. KATAKANA LETTER HI)
Recently overheard in the newsgroups:
Okay, I've had enough of trying to figure this out so it's time to ask those who know. What I'm trying to do is quite straightforward but seems to be a nightmare to get right.
I have a system that needs to format a monetary value based on the associated monetary symbol. i.e. It needs to format a GBP value to 2 decimal places, EGP to 3 decimal places etc.
I can query the Locale Info to get the number of decimal places for these locales which is fine but the last time I went to Spain they were using the Euro, not the Pesata. France wasn't using the Franc anymore. When I enumerate the available locales, I never get an entry that says it's using the Euro. Therefore my Spanish values format to 0 decimal places and my French format to 2 decimal places.
As far as I know, I do have the latest version of the OS. Win2k with all the service packs and Windows Updates. How can I check what's not up to date?
Can someone point me in the right direction please.
Thanks in advance.
The problem is simple enough: locale data is not really updated on Windows in service packs; it is generally only updated in major releases. To that end, both Windows XP and Windows Server 2003 have full support for the Euro in the appropriate "Euro zone" countries. But for earlier versions of Windows, there is no way to pick up support for all of those locales.
Now there is a tool that Microsoft released on Windows Update that will update the appropriate Regional Options settings for downlevel versions of Windows if your user locale is in one of those Euro zone countries. And the tool is left on the machine so it can be rerun as needed for other users or after later changes are made in Regional Options. But to get that support for all of the Euro locales, you have to have the up-to-date OS....
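If the requirement really is "format by currency, not by locale", one workaround on a downlevel OS is to keep the number of decimal places keyed by currency code in your own table rather than asking the locale. Here is a hedged sketch in Python; the table `CURRENCY_DIGITS` and the function `format_amount` are hypothetical, the table is deliberately tiny, and the digits shown are the usual ISO 4217 minor units for those four codes. A real application would need a complete, maintained table.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical, deliberately small table: decimal places per currency code.
CURRENCY_DIGITS = {
    "GBP": 2,
    "EUR": 2,
    "JPY": 0,
    "KWD": 3,
}


def format_amount(value: Decimal, currency: str) -> str:
    """Round a value to the number of decimal places its currency uses."""
    digits = CURRENCY_DIGITS[currency]  # KeyError for codes not in the table
    quantum = Decimal(10) ** -digits    # e.g. 0.01 for two decimal places
    return str(value.quantize(quantum, rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    for code in ("EUR", "JPY", "KWD"):
        print(code, format_amount(Decimal("1234.5"), code))
```

The locale still has a job to do (decimal separator, grouping, symbol placement); this only takes the number of decimal places out of its hands.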
This post brought to you by "€" (U+20ac, a.k.a. EURO SIGN)
Recently in the newsgroups, Robert asked:
I'm just trying to understand that an encoding is a way to represnt numbers UTF-8/16 - a codepage is a mapping of numbers to characters - a localization is a ....
the localization is the bit I'm have trying to understand.
He got a few answers and then Norman replied in a slightly different way:
UTF-8, UTF-16, codepages, and other encodings are all encodings. National standard encodings developed after corporate-developed encodings but before international standard encodings. UTF-8 and UTF-16 are designed to support larger collections of characters than individual national standards supported, and they are intended to support all characters that were included in national standards plus some additions besides. But of course the Unicode codepoints (numeric values of the encodings) for most characters have to differ from the codepoints that were used in national standards and codepages and most existing databases.
Localization includes compatibility with existing codepages (and other existing encodings), and translations of words and sentences, and the ordering of fields in dates (year at the beginning or at the end) and time fields (am/pm indicator at the beginning or the end), rules for sorting lists, and anything else that's needed for communication with humans in various cultures.
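Norman's point that the Unicode code point for a character usually differs from the value it had in an older code page is easy to see with a single character. The sketch below is Python, and `encodings_of` is just a helper name I made up; it shows the Euro sign's code point next to its bytes in two Unicode encoding forms and two legacy code pages.

```python
def encodings_of(ch: str, names):
    """Map each encoding name to the hex bytes it uses for one character."""
    return {name: ch.encode(name).hex() for name in names}


if __name__ == "__main__":
    euro = "\u20ac"
    # One character, one code point...
    print(f"code point: U+{ord(euro):04X}")
    # ...but a different byte sequence in every encoding.
    for name, hexbytes in encodings_of(
        euro, ["utf-8", "utf-16-le", "cp1252", "iso8859-15"]
    ).items():
        print(f"{name:12} {hexbytes}")
```

Same character every time; only the numbers used to store it change, which is why all of these are encodings and none of them is localization.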
Now I have to say that the definition of localization he gave does not sit well with me.
Traditionally, most of what is listed there is known as internationalization, not localization. And this is not just my opinion, it is one shared by myself and many of my colleagues -- all of whom consider internationalization to be one of the largest aspects of our actual jobs.
The same opinion is shared by many of our colleagues who work in localization, too!
Now, I do not want to minimize the important kernel of truth in Norman's words here. If you do a poor job of providing internationalization support for a language, then your efforts to localize into it will pretty much be wasted. One could say that good localization presumes good internationalization.
But that does not mean they are the same thing -- after all, in order for the localization to be of good quality you need good font support. But no one considers typography to be localization. For that matter, the localization efforts are wasted if your laptop does not have the appropriate AC adapter for your market, yet no one considers power management to be a part of localization, either.
The simple fact is that a lot of what happens in computer software involves dependencies -- and a quality localization presupposes good internationalization and typographical support upon which to build. There is simply no other way to be sure that you have a product that can be accepted by a customer.
And of course they won't even get there if the power management folks don't have their act together yet....
This post brought to you by "A" (U+0041, a.k.a. LATIN CAPITAL LETTER A)
As I mentioned earlier when I posed the question Where's Michael, there were some people who were a little worried about what happened to me when I appeared to be gone for a few days.
I was at my cousin Julie's, because her two oldest kids (Sarah and Brian) had their Bat Mitzvah and Bar Mitzvah, respectively.
It was a very interesting visit, and not just because she is one of my favorite cousins, and also not just because Sarah and Brian did such great jobs! :-)
I had the chance to see a few of the more recent innovations in the Shabbat services of some congregations:
At first I kind of fought these changes, mentally. I mean, I am used to different tunes for songs and I have never heard the exact same service in two different synagogues (heck, I usually would not even hear the exact same service in one synagogue on separate weeks!).
But then I thought about it -- there is honestly no harm in the insertion of the foremothers (though perhaps it would be more meaningful to me if it were a bit more targeted -- the actual additions were done across the board with the wife being named every time the husband was, rather than being done based on the actual meanings of the words in the various prayers -- which to me indicates that it has more to do with equal rights between the sexes than a real attempt to extend the prayers).
Now to be snarky for just one more moment, I don't think everyone who was there necessarily knows Hebrew fluently enough to judge the literary aspects. I really do not see a problem in the particular innovation, and it does not destroy any of the meaning of the prayers.
And when I think about the additional reading, my father's point at the time seemed quite relevant -- that a lot of the congregation would not end up going to both Friday night and Saturday morning services, so that the Friday night reading meant that more people would be able to hear it. And that is not a bad thing at all.
There are several readings during the week that are done, originally on days that people would traditionally be going to the market so they were in town. Clearly that 'innovation' has a pretty earthbound, secular basis, and I am sure there were people who felt at the time that it was somehow a betrayal of the higher purpose of reading from the Torah to do the readings just because people were there. But everyone really got over that, to the point where today it is the most devout, traditional Jews who will be there on Mondays and Thursdays for Torah readings. So clearly the additional reading has a somewhat holy precedent, even if it does indeed have a somewhat secular basis.
So when I think of my initial reaction, I realize that it is not that different than the more traditional folks I talked about when I talked about how Good things can happen when religious authorities work with science and technology. I was not really judging anyone in that post, but I had mentally placed myself in that "flexible" category of people who are willing to rethink traditions when there is a reasonable cause to do so. And even then, at my first chance to perhaps prove that I have that kind of flexibility, I prove that my instinct is to fight against it!
Ah well, there would be nothing wrong with me if I had chosen to not be so accepting, anyway. These two "shocks" made it easier to handle when they sang Adon Olam to the tune of Deep in the Heart of Texas. :-)
This post brought to you by "ץ" (U+05e5, a.k.a. HEBREW LETTER FINAL TSADI, a.k.a. TSADI SOFIT)
And then the other day Chuckles asked:
I have the Hebrew keyboard installed but need to type vowels as well. Any suggestions?
Now it can be a little harder to find them because on the Hebrew keyboard on Windows the SGCAPS feature of keyboards is used, and the vowels are in the SGCAPS+Shift state. Which means that they may not be so easy to find!
But you can use the OSK (On Screen Keyboard), or you can look at the Hebrew keyboard layout on the Windows Keyboard Layouts site.
Personally, I like to use MSKLC to look at keyboard layouts, where it is easy to load any layout on the machine and see what is there.
There is also a great online resource that Jony pointed out entitled Typing Hebrew Points (Niqud) in Windows and Word.
If you want all of the Hebrew points as well as the vowels, you may have to use something like MSKLC to add them yourself; they are really not used in modern Hebrew as they are used as the trope that signify the way one would chant sections of the Torah or Haftorah. But it is easy enough to add them to a keyboard if that is something you need....
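If you are deciding what to add to a custom layout, it can help to list the candidates first. The sketch below is Python using the standard `unicodedata` module; the helper names are mine. It relies only on how Unicode names the characters in the Hebrew block: the vowel signs are named HEBREW POINT ... and the cantillation marks (the trope) are named HEBREW ACCENT ....

```python
import unicodedata


def hebrew_marks():
    """Split the Hebrew combining marks into points and accents by name."""
    points, accents = [], []
    for cp in range(0x0591, 0x05C8):
        name = unicodedata.name(chr(cp), "")
        if name.startswith("HEBREW POINT"):
            points.append((cp, name))
        elif name.startswith("HEBREW ACCENT"):
            accents.append((cp, name))
    return points, accents


def with_point(letter: str, point: str) -> str:
    # A combining mark is stored after the base letter it attaches to.
    return letter + point


if __name__ == "__main__":
    points, accents = hebrew_marks()
    for cp, name in points:
        print(f"U+{cp:04X} {name}")
    print(len(accents), "accents (cantillation marks) found")
    print(with_point("\u05d1", "\u05b0"))  # BET followed by SHEVA
```

The list of points is short enough to fit comfortably on one shift state; the accents are the ones that will take some planning.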
This post brought to you by "ְ" (U+05b0, a.k.a. HEBREW POINT SHEVA)
Yes, I'm still around. :-)
I have a huge backlog of topics to post about and I will see about taking care of that tonight or tomorrow....
Serge Wautier asks in the Suggestion Box:
When I switch my locale from French (Belgium) to Thai, the date switches correctly to sometime in year 2548. However, if I open the calendar applet in Control Panel, the calendar displays Thai months but year is displayed as 2005 !
Why so ? Is it because there's no ReverseGetDateFormat() that the applet could use ?
Just curious...
TIA, Serge.
PS: Don't worry, it's not because you were so kind as to reply my previous question immediately that I intend to flood your suggestion box ;-)
I was not worried -- I still choose when/if to post an answer, after all....
Anyway, the Thai Buddhist calendar will indeed put us somewhere in the 26th century, thus making us around 60 years post Buck Rogers with the single change of the user locale in Regional Options. Pretty impressive, and we even got to skip the 'being frozen in space due to a freak mishap' and all that! :-)
But the calendar applet is not really making full use of the NLS API that it could be, so it is unable to provide an intuitive calendar that has all the right info (it makes the same sort of mistake with the Hijri calendar, and really any calendar that changes the year will hit the problem).
FWIW, more than one person with the desire to see this fixed, or the ability to fix it, or both, has noticed how annoying the problem is. So there may be some movement here one day....
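For the curious, the arithmetic behind 2005 turning into 2548 is a fixed offset of 543 years between the Gregorian year and the Thai Buddhist Era year. A minimal sketch in Python follows; the function names are hypothetical, the d/m/yyyy layout is chosen only for illustration, and this is not how the NLS API or the applet does it.

```python
from datetime import date

# Thai Buddhist Era year = Gregorian year + 543 (for modern dates).
BUDDHIST_ERA_OFFSET = 543


def to_buddhist_year(gregorian_year: int) -> int:
    return gregorian_year + BUDDHIST_ERA_OFFSET


def format_thai_date(d: date) -> str:
    """Day/month/year with the year expressed in the Buddhist Era."""
    return f"{d.day}/{d.month}/{to_buddhist_year(d.year)}"


if __name__ == "__main__":
    print(to_buddhist_year(2005))
    print(format_thai_date(date(2005, 9, 25)))
```

A calendar display only has to apply the same conversion to the year it shows, which is what makes the applet's behavior look like an oversight rather than a hard problem.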
This post brought to you by "؍" (U+060d, ARABIC DATE SEPARATOR)