Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
THE WINDOWS 7 INUKTITUT LANGUAGE INTERFACE PACK IS LIVE!
Please note that we are releasing a 32- and 64-bit version of the Inuktitut Windows 7 LIP. This LIP can be installed only on a system that runs an English client version of Windows 7.
You can download it from right here.
A LITTLE BACKGROUND ON INUKTITUT
NUMBER OF SPEAKERS:
NAME IN THE LANGUAGE ITSELF:
Inuktitut is, along with English and French, an official language of Nunavut, the largest of the territories of Canada, which was created in 1999. Inuktitut is spoken by about 80% of the population there, as well as in all other areas in Canada north of the tree line, like the Northwest Territories, where it is an official language as well. In Nunavik, a semi-autonomous portion of Quebec, it has legal recognition and enjoys official support. While for a long time Inuktitut shared the fate of most indigenous languages in the Americas, namely getting closer and closer to extinction, the last census data indicate that the number of speakers has stopped declining in Canada and might even be increasing in Nunavut.
Because of the huge area in which Inuktitut is spoken, it has great dialectal diversity. Some scholars even count Greenlandic as a variant, though it is more commonly considered a language of its own. Inuktitut is an agglutinative language, in which a succession of different morphemes is added to root words to express what other languages need several words or sentences for.
Inuktitut belongs to the eastern group of Inuit, one of the two branches of the Inuit-Aleut (Eskimo-Aleut) language family.
Inuktitut is written either in the Latin alphabet (which was introduced to the region by Moravian missionaries) or the Inuktitut syllabary, which is based on the Cree syllabary created by the missionary James Evans. This syllabary got its present form in the 1970s when it was adopted by the Inuit Cultural Institute in Canada.
Yep, that's the deal -- Unicode 6.0 has been released!
You can check it out right here....
Now of course the big questions that come up next are when we pick it up, when Microsoft supports it. For that, see the following blogs:
And of course keep in mind that we just shipped Windows 7 and Office 2010 -- BEFORE Unicode 6.0 came out. The comments in that last blog on the list about the non-synchronization between Unicode and Microsoft have just been given an epic example!
From the Unicode Announcement, titled: Unicode Version 6.0: Support for Popular Symbols in Asia:
The newly finalized Unicode Version 6.0 adds 2,088 characters, with over 1,000 new symbols.

A long-awaited feature of Unicode 6.0 is the encoding of hundreds of symbols for mobile phones. These emoji characters are in widespread use, especially in Japan, and have become an essential part of text messages there and elsewhere. Unicode 6.0 now provides for data interchange between different mobile vendors and across the internet. The symbols include symbols for many domains: maps and transport, phases of the moon, UI symbols (such as fast-forward) and many others.

A late-breaking addition is the newly created official symbol for the Indian rupee. With the help of the Indian government and our colleagues in ISO, the consortium was able to accelerate the encoding process. Once computers and mobile phones update to the new version of Unicode, people will be able to use the rupee sign like they use $ or € now.

This October 2010 release includes the Unicode Character Database (UCD), Unicode Standard Annexes (UAXes), and code charts. With the release of these components, implementers are able to update their software to Unicode 6.0 without delay. The final text of the core specification will be available in early 2011.

* To access Unicode 6.0, see http://www.unicode.org/versions/Unicode6.0.0.
* For more information on emoji, see http://unicode.org/faq/emoji_dingbats.html
* For a formatted version of this message with images, see http://unicode.org/press/pr-6.0.html.
Have at it.... :-)
But note the information I started this blog with: for new characters and scripts, there is no collation or font or property support yet. We don't have the ability to predict the future!
Over in the Suggestion Box, saqqara bundled up several questions and asked:
I've counted 12 Unicode scripts missing from the Opentype tag list (15 if we count Unicode 6.0) at www.microsoft.com/.../scripttags.htm. The list has not been updated for almost 3 years. See my blog jtotobsc.blogspot.com/.../isounicode-scripts-missing-in-opentype.html for a little more on the topic.
It would be interesting to learn why it is that Microsoft has got into the habit of arriving late to the party in matters Unicode.
Google search supported Unicode 5.2 rapidly; twelve months later, there is still no sign of it in Bing.
Internet Explorer 9 Beta is lagging Firefox.
Even more interesting would be any news. Is the renewed focus on standards in IE9 having a positive effect on these matters?
Ordinarily when there are many questions bundled together like this covering so many areas, I would not cover them all in one blog.
However, since pretty much all of them are outside my core areas of expertise I figure it is better not to drag it out more. I'll cover it all now. :-)
I'll start from the bottom of the list.
Once a long time ago I had a manager who was not entirely pleased when I suggested that Microsoft really resembled several different companies rubber-banded together more than one big company.
That image is somewhat appealing as a shortcut to describe the various business units, but probably it would be more accurate if less viscerally desirable to think of it as a bunch of metal sticks rubber-banded together, with a seemingly random set of random points where they are soldered together. Because even with entirely different business units that honestly don't care about each other some of the time, there are clearly other times that they have deliverables and dependencies that connect them.
The last question, about the connection to the IE9 focus on standards, could be thought of in that context, I suppose. But this would not be one of those connected points. I severely doubt that one had anything whatsoever to do with the other, with one major exception that I'll talk about in a bit. Mostly, the IE9 push for standards (e.g. HTML5) has excellent reasons that have essentially nothing whatsoever to do with a push related to OpenType (which is also a standard).
I won't comment on the IE9/Firefox issue since it's really out of my area. I do like IE9 on the machines I am running it on, though (FWIW).
In regard to Google's adoption of Unicode 5.2 versus Bing: no idea (I hardly know anyone on the Bing team, and the few I do know aren't, so far as I know, working in this area). But if I had to guess, I'd imagine that having the president of the Unicode Consortium as your "Senior International Software Architect," as Google does, has to have some impact here. As far as I know, no one from the Bing team has ever even attended a Unicode UTC meeting (let alone been an officer in Unicode). The Bing team and others have to base their priorities on those old-fashioned ideas (like customer scenarios and explicit customer requests), which can obviously have an impact. By contrast, the Google of the days before Mark Davis and others moved there could not be called Unicodally backward, but on the spectrum from Unicode Junior Community College to Unicode Stanford/Unicode Harvard, I'd claim it was closer to the former -- probably one of the big reasons they were hired!
And of course one cannot discount how much less Unicode support matters in search than in applications -- and in that area (Google's apps) the company is not nearly as far along, even for its own applications. Mark et al. have a big territory to cover there; it is going to take some time. I suspect they will catch up soon enough anyway.
Anyway, none of this supposition assumes strategic thought; it is just common sense. There is a reason that Apple has supported Deseret, and it's the same reason Microsoft does so much for Farsi/Persian, and it is not based merely on objective decisions of individuals with no connection to the script/language. Proximity and interest can really drive people.
"Microsoft late to the party in Unicode" is an interesting idea, but I don't see it that way; it is at best an accident of largely incompatible release schedules, or at worst a lack of ability to drive synchronization between Unicode releases and Microsoft's major product releases when it would make sense.
Regarding the latter idea, I'll never forget conversations that folks from Unicode had with us about how excited they were when a version of Windows was going to be shipping with Unicode 5.0 support (if only it weren't Vista, as they found out later; I imagine this revelation dampened the enthusiasm!). The fact that Windows is so common means that a new version carrying a new Unicode version is a tremendous opportunity to show the impact of the latest version of Unicode, and the fact that neither Microsoft nor Unicode ever tries to do better at synchronization (in a legal, forward-looking, marketing-timing kind of way, not an illegal conspiracy-in-restraint-of-trade kind of way) is kind of a detriment, in my opinion, to both Unicode and Microsoft.
As for the OpenType script tags, one could almost think of this as a case where standards work against speedy updates, since OpenType updates mean updating an ISO standard. With the group that owns that work being a part of Windows, the simple fact that no one outside of MS was explicitly asking the standard for new script tags, combined with a Unicode 5.2 release date that made Windows 7 support of it impossible/impractical, makes it a lot harder to push an update to the standard when no product was going to be released in the near term.
Beyond that, the big push in the font world both inside and outside of Microsoft is not so much on adding a few script tags to OpenType; it is the ISO Open Font work in general, which touches on a lot more stuff of much wider interest. This is why I suggest that in this instance putting OpenType in a standard makes it more work to do updates, since it isn't just the time to add a section to a spec and update a web site. This is not me bad-mouthing standards, it's just me pointing out that standards take time! :-)
More generally, looking around at the wider world:
Now these last three bullet points are not shining beacons for Microsoft either:
I guess I'm just saying that for all the complaints one could (and some do) make in general about Microsoft, in the specific case of Unicode support I think we're shooting par. :-)
The question I received via the Contact link was an old one -- one that goes back centuries, really. It was a request to explain the issues surrounding the English spelling of 한국 (or 조선 in the North)....
Now we're not talking Chaudhuri vs. Chaudhary? here.
And it isn't Bangalore vs. Bengaluru, either.
Though bearing some surface resemblance to Farsi vs. Persian, it ain't that, either.
Perhaps if you think about Macao vs. Macau or Uighur vs. Uyghur, you'd be a little closer.
But I'll tell you, the situation of
Korea vs. Corea
has them all beat when it comes to rumors and stories and conspiracy theories. Hands down.
The conspiracy theory generally tends to suggest that the move to use Korea rather than Corea was orchestrated by the (at the time) occupying Japanese, in order to make sure that Korea did not come before Japan in alphabetical order.
The wikipedia article on Names of Korea lays out the broad strokes:
English usage

Both South and North Korea use the name "Korea" when referring to their countries in English.
As with other European languages, English historically had a variety of names for Korea derived from Marco Polo's rendering of Goryeo, "Cauli" (see Revival of the names above). These included Caule, Core, Cory, Caoli, and Corai as well as two spellings that survived into the 19th century, Corea and Korea. (The modern spelling, "Korea", first appeared in late 17th century in the travel writings of the Dutch East India Company's Hendrick Hamel.)
Despite the coexistence of the spellings "Corea" and "Korea" in 19th-century English publications, some Koreans believe that Japan, around the time of the Japanese occupation, intentionally standardised the spelling on "Korea", so that "Japan" would appear first alphabetically. Both major English-speaking governments of the time (i.e. the United States and the United Kingdom and its Empire) used both "Korea" and "Corea" until the early part of the Japanese occupation. English-language publications in the 19th century generally used the spelling Corea, which was also used at the founding of the British embassy in Seoul in 1890. However, US minister and consul general to Korea, Horace Newton Allen, used "Korea" in his works published on the country. At the official Korean exhibit at the World's Columbian Exhibition in Chicago in 1893 a sign was posted by the Korean Commissioner saying of his country's name that "'Korea' and 'Corea' are both correct, but the former is preferred." This may have had something to do with Allen's influence, as he was heavily involved in the planning and participation of the Korean exhibit at Chicago.
A shift can also be seen in Korea itself, where postage stamps issued in 1884 used the name "Corean Post" in English, but those from 1885 and thereafter used "Korea" or "Korean Post".
[See Wikipedia article for the stamp pictures]
By the first two decades of the 20th century, "Korea" began to be seen more frequently than "Corea" - a change that coincided with Japan's consolidation of its grip over the peninsula. Most evidence of a deliberate name change orchestrated by Japanese authorities is circumstantial, including a 1912 memoir by a Japanese colonial official that complained of the Koreans' tendency "to maintain they are an independent country by insisting on using a C to write their country's name." However, the spelling "Corea" was occasionally used even under full Japanese colonial rule and both it and "Korea" were largely eschewed in favour of the Japanese-derived "Chosen".
This makes for a nice summary, but just as the history of the battle is usually written by the winning side, the description does tend to favor the Korea camp....
As a general principle, Germanic languages are more likely to go with a "K" while Romance languages are more likely to go with a "C". English is of course somewhere in the middle there, which is why both spellings "make sense" in a way -- until the eigenstate is resolved and one is chosen, at least.
Articles like this one are really able to give a better flavor of the "Corea camp", though the many comments from people who disagree tends to shift the overall feel of the page in the other direction.
There are several other such articles on the Internet, of course....
I won't say that the article swayed me exactly, but it made a lot more sense as a theory (even if I ultimately decided my opinions went the other way) when I gave people the chance to present their case.
I don't find the notion that a country or a language could be petty unlikely in general; it is just this case in particular that seems like a stretch.
But by reading these two articles (before looking at many others), I felt like I now had looked at some balance in the situation - that I had looked a little into both sides.
I suppose some might say this puts the whole "Wikipedia:Neutral point of view" thing in doubt a little, though I don't know if I would go quite that far for sure. But I do know that there are times I feel like I leave Wikipedia with a much more fair, balanced, and accurate view of a historical dispute that remains contentious to this day.
But not this time....
For comparison purposes, look at the Greek vs. Ethnic Macedonian positions on the Macedonia Naming Dispute in Wikipedia. This is in my humble opinion much more in line with the stated intent of Wikipedia's neutral point of view.
Perhaps my presentation here on the issue also shows bias since I ultimately chose a side and do not pretend to be fair and balanced, but I have no "neutral POV" philosophy: my blog is the world from my point of view.
Plus regular readers know that if I am truly neutral on an issue then I'm unlikely to write about it....
But that is just me. :-)
As a random side note, Japan itself has had names in English like Wa and Wagoku and Nippon and Nihon, the bulk of which come alphabetically after Korea anyway. This does make the "alphabetical conspiracy" argument feel a little less likely.
Note: if you found this blog to be interesting, then the new blog written on this topic a week later (The ordinary nature of being exciting) should be right up your alley...
So, way back in 1999, Microsoft eliminated the MVP program that they had created and nurtured.
It was done quickly and entirely due to an AOL lawsuit where its newsgroup leaders sued for benefits as "de facto" employees.
Then over a weekend, so much mail was received by senior execs about how awful that decision was that they reinstated the MVPs days later.
You can see it described here. My only contribution is that I can confirm that it was the AOL lawsuit that inspired it (the Wikipedia article is less definitive on the matter) and that I had a minor impact on the list of exec email addresses (I recommended certain email addresses be removed as some of the people on the list were not the right people -- I think my advice was mostly ignored, at the time.)
You may wonder why this is relevant.
Well, because Microsoft, multi-billion dollar company that it may be, has no spine whatsoever to speak of.
Even though the cases were miles apart, even though there was no threat from any MVPs suing, they wigged out at the chance it could happen and over-reacted in a way that was not in the best interest of customers. Including some of their best customers, the ones who did so much to support their other customers.
Fast forward a few years.
And the fact that Microsoft has great health coverage.
Given my pre-existing condition, coverage was a huge issue when it came to the idea of working full time for Microsoft, as I discussed in My thoughts on the health care thing (given my life, my multiple sclerosis, and my iBot). As that blog discusses, they cleared this hurdle with ease.
Multiple sclerosis costs money -- for symptoms, for tests, for treatment.
But Microsoft has always had some of the best health care there is, something they are in fact renowned for.
Calling it universal health care is an understatement -- it is practically pan-dimensional it is so huge.
A few years later, we got a mail about changes in the health plan:
Shocking! Microsoft was taking two steps to do the thing that every other company in the country had been doing all along.
We were still astoundingly well-covered.
Now recently the new health care bill passed.
Foremost in Microsoft's cowardly mind and spineless guts was the Cadillac Insurance Plan notion, which as Wikipedia explains is "an informal term for any unusually expensive health insurance plan, usually arising in discussions of medical-cost control measures in the United States".
As it further goes on to explain:
Although neither bill uses the "Cadillac" term, the 2010 Patient Protection and Affordable Care Act, as amended by the Health Care and Education Reconciliation Act of 2010, imposes an excise tax on plans with an annual cost exceeding $10,200 for individuals or $27,500 for a family (not including vision and dental benefits).
Now me saying this scared Microsoft is not me spinning yarns. They directly said so in a communication just after it all passed. Sure, it would be in eight years, but they stood out like a sore thumb from everyone else and made it clear they were actively worried about this tax.
Then yesterday they dropped the bomb on us, just eight years in advance: the insurance that used to never have a co-pay would now always have one. Up to an annual maximum not yet determined, we the employees would now be contributing to our health care costs. As Ina Fried reported:
"We can confirm that Microsoft has begun to evolve its employee health care benefit," Microsoft spokesman Lou Gellos said in a statement. "There will be no changes for the next two years, but in 2013, employees will contribute to their health care. A guiding principle in this evolution is that Microsoft will continue to offer market-leading health and wellness benefits that rank among the best in the country."
The company didn't go into details on what exactly employees will have to pay, but said there will be a maximum amount that workers will have to pay both individually and as a family. Microsoft currently pays 100 percent of the cost of coverage for workers as well as for a spouse or domestic partner and children.
This is the same language they use when the poll shows many employees feel their pay is less than the industry's and we are assured how competitive the company is. Of course they point to the benefits as part of that, so that other language will now be taking a hit too as they decide that the best way to proceed is to act like everyone else does.
Because maybe if they fool the government into thinking we're a Pontiac Insurance Plan they won't call us a Cadillac Insurance Plan.
UPDATE 10:59am: I have had several people, at multiple levels of authority and influence, point out that in the announcement the larger issue of health care costs skyrocketing was also emphasized. So I will point that out now since it was not my intent to mislead, though the issue of Cadillac plans was raised as an important issue. My larger points still stand, I believe.
We don't have numbers yet but for now I can pretty much assume that I am likely going to hit that maximum any year I am working, which means I can probably look at that amount as an annual salary decrease.
And Microsoft will just get some extra money from me, so that they can try to make sure they won't have to pay more to the government in eight years.
My initial thought is cheap ass, cowardly, and greedy bastards, and I'll admit further reflection does not water that sentiment down much, as people making millions and more decide that people making way less can foot the bill to protect the company from having to pay out more.
After spending the night drinking too much and reflecting further, I have a different thought.
They no longer look as bad as they did, and I'm not angry any more.
But Microsoft now looks ordinary to me.
I never consciously thought of the health care plan at Microsoft as handcuffs keeping me here until they announced how they planned to remove those shackles.
And now I sit here, and when I think about Windows 7, or the next version, or Windows Phone 7, or really anything planned, I think meh. The idea of doing my job when I am not loving it is something that I have never found appealing. But now I have a type of "short timer" syndrome, because in a couple of years working for Microsoft will be just like working for anybody, and perhaps I should not have been so hasty in turning down prior job offers on the basis of how well Microsoft treats its employees.
I'm grateful for that bonus I got a few years ago when they paid for the iBot entirely. Because in the new world, it would cost me, since they've decided intrinsically that we aren't worth as much.
They used to care about their MVPs until they proved they'd drop them in an instant due to even the most irrational fear of lawsuits.
They used to care about their employees too much to get rid of them in large numbers, and then they laid off over 5000.
And they used to care about their employees too much to give us anything other than benefits to rival and top the industry.
Now they've decided money is better than loyalty.
Microsoft did want to be like everyone else, and not be special.
Congratulations Microsoft. Now you are.
It's a funny thing about off-by-one errors, you know.
Looking at the subject in Wikipedia:
An off-by-one error (OBOE) is a logical error involving the discrete equivalent of a boundary condition. It often occurs in computer programming when an iterative loop iterates one time too many or too few. Usually this problem arises when a programmer fails to take into account that a sequence starts at zero rather than one (as with array indices in many languages), or makes mistakes such as using "is less than or equal to" where "is less than" should have been used in a comparison.
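To make that concrete, here is a toy example of my own (not from the Wikipedia article) showing the classic form of the bug:

```python
def sum_first_n(values, n):
    """Sum the first n items of a list -- the correct version."""
    total = 0
    for i in range(n):        # range(n) yields 0..n-1: exactly n items
        total += values[i]
    return total

def sum_first_n_buggy(values, n):
    """The classic off-by-one: range(1, n) yields only n-1 items."""
    total = 0
    for i in range(1, n):     # skips index 0 -- one iteration too few
        total += values[i]
    return total

data = [10, 20, 30, 40]
print(sum_first_n(data, 4))        # 100
print(sum_first_n_buggy(data, 4))  # 90 -- the first item never got counted
```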
I found another case yesterday.
By which I mean today, since it is today that I am writing it, but by the time you read it then either you hacked my account on the blog server, or it is tomorrow. Or later still.
Anyway, the error.
So if you look at keyboard hardware, each key has a scan code that it sends to the computer when you hit it.
Here is a typical layout, shown via MSKLC:
Now the key with the tilde on it has a scan code of 29.
And then the numbers 1234567890 have the scan codes of 02 to 0b.
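So the codes are one off from the digits printed on the keys, with 0 tacked on at the end. A quick sketch of my own (assuming the standard PC scan code assignments just described) to spell the mapping out:

```python
def digit_to_scan_code(ch):
    """Map a top-row digit key to its hardware scan code.

    '1' through '9' get 0x02 through 0x0a (one more than the digit
    itself), and '0' comes last at 0x0b -- so the scan codes never
    line up with the digits printed on the keys.
    """
    if ch == '0':
        return 0x0b
    if '1' <= ch <= '9':
        return int(ch) + 1
    raise ValueError("not a top-row digit key")

for ch in "1234567890":
    print(ch, hex(digit_to_scan_code(ch)))
# 1 0x2, 2 0x3, ... 9 0xa, 0 0xb
```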
You may know where I am going with this one....
If you look at the keyboards of India that ship on Windows, most of them have one thing in common: they put U+200D (ZERO WIDTH JOINER) and U+200C (ZERO WIDTH NON-JOINER) in the CTRL+SHIFT shift state.
This was very important at the time when Windows 2000 was being developed, back when Unicode had the grand plan for how to use these two characters I showed in Why don't all the half forms sort right? -- before fonts started widely doing something different as I discussed in Which form to use if the form keeps changing?.
So because of this fact, they were put on most of the keyboards.
Here we come to the problem, though.
You see, all the keyboard layouts we ship (e.g. the Hindi Traditional keyboard) put these two control characters on the CTRL+SHIFT+1 and CTRL+SHIFT+2 keys; but a colleague of mine who had created some updated versions of some of the layouts had put them instead on the CTRL+SHIFT+2 and CTRL+SHIFT+3 keys.
I pointed this out to her, and she admitted she did the work in the .KLC file rather than in MSKLC itself. And when looking at the following rows in the file:
02	1	0	09e7	0021	-1	-1
03	2	0	09e8	0040	-1	200d
04	3	0	09e9	%%	-1	200c
The confusion of the scan codes (02, 03, 04) sitting right next to the VK_* values (1, 2, 3), plus her knowledge that the scan codes were one off from the numbers, caused her to put them in the wrong spot.
And thus the decision long ago that caused the scan codes to not line up with these digits when they could have, combined with an incorrect compensation for that off-by-one assignment (by intentionally being one off from the scan codes), led the characters to be in the wrong slots.
Clearly this was not due to the traditional "off-by-one" error due to 0-based vs. 1-based counting that the Wikipedia article was referring to.
But the fact that the scan codes were off-by-one from the numbers atop them due to the way they were assigned and the potential confusion thereof made it easy for her to introduce an off-by-one error of her own!
The story has a happy ending, though. Her keyboards will now be correct when she uses them, and she was only a little embarrassed by the fact that I told her I'd be writing this blog you are reading (though she admitted that it wouldn't make sense to not do!). She even bought me a beer for looking at her layouts. And thus everybody wins....
I have been writing about internationalization and the console off and on for over five years now, and this fact highlights two problems with blogs:
Because of this, I thought I would take a little time to really summarize the current state of affairs, with examples in-line for shorter stuff and with links for the longer stuff. So you can just look here to get the full story in one place.
Random people trying to improve their console application story can use this blog to find out about everything they need to know. They are perhaps not in the ideal order here either, but at least they are all here. In a "myth-busting" format.
So, without further introduction, let me proceed....
Myth #1: You cannot detect within a console application whether a console handle has been redirected to a file.
Given some of the central differences between applications that are principally used within the console and ones that expect to spend the bulk of their time redirected, this is a pretty important issue, especially given other myths related to the "least common denominator" of each when it comes to Unicode support.
This myth, however, is demonstrably false. You can (for example) check out the IsConsoleRedirected function from this blog to see how easy this is to do. There are many other examples, but this seemed like the most contained.
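The IsConsoleRedirected function in that blog is Win32 code; as a portable sketch of the same idea -- my own analogue, not the code from that post -- the isatty() check does the equivalent job:

```python
import tempfile

def is_redirected(stream):
    """True when the stream is not attached to an interactive console.

    The linked blog does this with Win32 calls on the console handle;
    isatty() is the portable stand-in for that check.
    """
    return not stream.isatty()

# A stream backed by a plain file is always "redirected":
with tempfile.TemporaryFile(mode="w") as f:
    print(is_redirected(f))  # True
```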
Myth #2: You cannot detect whether a console handle that has been redirected to a file is appending to an existing file or is creating a new file of its own.
Another myth to bust -- this one is trivial to detect, as I point out in the section entitled "First of all" in this blog:
A simple call to GetFileSizeEx will tell you that, immediately! Just pass in that stdout pointer that you have already determined is a redirected file, and then you will know by the size of the file if they redirected with a > or a >>.
Sometimes the simplest answers are the hardest ones to figure out!
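GetFileSizeEx is the Win32 way; the same trick can be sketched portably (my own analogue, not the code from that blog): if the redirected handle already holds bytes, you got there via >>, and if it is empty, via >.

```python
import os, tempfile

def is_appending_to_existing(f):
    """The post's heuristic, portably: a fresh redirect (>) starts at
    size zero, while an append redirect (>>) onto earlier output does not."""
    f.flush()
    return os.fstat(f.fileno()).st_size > 0

with tempfile.TemporaryFile(mode="w") as f:
    print(is_appending_to_existing(f))  # False -- empty, like a fresh >
    f.write("earlier output\n")
    print(is_appending_to_existing(f))  # True -- like >> onto existing data
```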
Myth #3: You should never add a BOM (byte order mark) to console application output.
Given the truths behind Myth #1 and Myth #2, this one obviously is weird. If you use those techniques to determine that your output has been redirected to a brand new file, then you can write U+FEFF, the Unicode Byte Order Mark. Let the underlying encoding do its thing with the BOM and you don't need to worry about anything.
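For example (a sketch of my own), the encoder -- not your code -- decides what bytes U+FEFF becomes:

```python
# U+FEFF serialized under different encodings; write it once as a
# character and the underlying encoding produces the right signature:
print("\ufeff".encode("utf-8"))     # b'\xef\xbb\xbf'
print("\ufeff".encode("utf-16-le")) # b'\xff\xfe'
print("\ufeff".encode("utf-16-be")) # b'\xfe\xff'
```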
Myth #4: When appending console output to an existing file you don't/can't know what the encoding is.
Now this myth causes all kinds of problems because one can easily talk oneself into being unable to safely support anything beyond ASCII -- and I have seen people do this very thing.
But untrue is untrue, and in Orwellian terms this one is perhaps even doubleplusuntrue.
As this blog points out, in its section entitled "Second of all":
A quick call to GetFinalPathNameByHandle to get the path and you can look at the contents and see what is in there and make the appropriate decision (if you need it to be pre-Vista you can use code like this).
Remember that the file is guaranteed to be opened to you or else all of your write operations would fail. So you are the one person with access.
By combining the knowledge of Myth #3 and Myth #4 with the general truth that people who "redirect append" do so to files created by applications using similar techniques (or by the very same application), you can look at the first few bytes in most cases to do rather precise detection of whether it is Unicode.
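A minimal sketch of my own of that first-few-bytes detection (only the common BOM signatures; a BOM-less UTF-8 file comes back unknown, which is exactly the ambiguity the myth worries about):

```python
def sniff_unicode_bom(first_bytes):
    """Classify a file's likely encoding from its opening bytes."""
    if first_bytes.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if first_bytes.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if first_bytes.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None  # no BOM: could be ANSI, could be BOM-less UTF-8

print(sniff_unicode_bom(b"\xef\xbb\xbfhello"))  # utf-8
print(sniff_unicode_bom(b"plain old ascii"))    # None
```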
Myth #5: The console does not support Unicode.
This is a pretty weird one; between all of the mystery surrounding the default use of CP_OEMCP in the console and the longstanding poor documentation and bugs surrounding the console in general and the CRT in particular, it has been one of the most enduring myths of all time.
But when you use the knowledge behind Myth #1's debunking to determine you are in the console, the WriteConsoleW and ReadConsoleW Win32 API functions have supported Unicode for not quite as long as CMD.EXE has existed, but certainly on any version of Windows you are likely to see.
Myth #6: The console is able to support Unicode, but sometimes it doesn't work and for some characters it doesn't work -- and you can't ever tell what's what.
Thank goodness the myth was worded this way, so I can once again say it is wrong. Wrong, wrong, WRONG.
By using the knowledge of this blog, one can detect both
By the use of those two functions, you can know exactly what is supportable/supported. From there you can choose (if you so wish) to make intelligent decisions on how to proceed, even perhaps going so far as to warn people what they ought to be doing instead if they are likely to be unable to support the text your console application might want to output.
Myth #7: The Microsoft Visual C Run-time library (msvcrt*) doesn't support Unicode output to a Unicode console.
This myth was proven to be wrong back in this blog.
Adding

_setmode(_fileno(stdout), _O_U16TEXT);
_setmode(_fileno(stderr), _O_U16TEXT);

before calling Unicode console functions like wprintf -- rather than getting back error text -- is all you need to have the CRT do all the work behind Myth #1 and properly handle console output. This has been true since Visual Studio 2005 (VC++ 8.0).
There are several such modes -- _O_U16TEXT, _O_U8TEXT, and _O_WTEXT among them -- that will do much wonderfuler things for Unicode text in the console using the CRT.
Myth #8: The Microsoft Visual C Run-time library (msvcrt*) completely supports Unicode text processing in a Unicode console.
Ah, I must have lulled you into complacency after Myth #7 was proven wrong, and you took it too far.
Because the truth is that there is a bug that has existed in one form or another in every version of the CRT since 2005 that makes it so that even though stdout and stderr can handle Unicode, stdin cannot.
I discuss the issue in this blog, if you are interested in details. It is one of my most fervent hopes that this bug is fixed at the next available opportunity in every place it can be fixed.
And since one cannot count on either hopes or prayers to make such things happen, I am trying to make sure that the fix is made by more conventional means within my job....
Myth #9: You should really use resource fallback to handle the scenario of a console that cannot fully support the text.
Once again, the myth is wrong.
As this blog discusses at length, the claim that blogs like this one make about the need to force the resource loader to do fallback in order to avoid problems in the console has many flaws in it.
There are scenario-based flaws (e.g. the notion that the world is teeming with people writing console applications that output Arabic text), implementation flaws (e.g. the ridiculous fallback to en-US for most complex-script locales guarantees that a lot of text that might have succeeded -- such as that used by many European languages -- will fail, due to the en-US use of code page 437 as CP_OEMCP), and conceptual flaws (e.g. fixing the UI language does nothing for text governed by the user locale, like date formats, which will have the same problem and no ready, sensible solution).
As I explained in the debunking of Myth #6, if you detect a case where the application may not fully support the text, the answer is at best to detect the problem and warn the person, and at worst to output the junk but make sure the right documentation or KB articles are available to tell people what is going on, so they can address the problem.
Your console application does itself and its users a tremendous disservice by unconditionally falling back.
Now the world of MUI_CONSOLE_FILTER isn't the worst thing you can do to users, since it is not legally assault by a software developer. But it is still pretty bad....
Myth #10: Okay, I am convinced. You can support Unicode in pretty much all of the console.
Wow, that is incorrect. I'm sorry I lulled you into thinking everything works. Across all of the built in commands in CMD.EXE itself and all of the common binaries that extend the console, many support Unicode but not all of them do.
No list exists of which commands and executables fall into each category, or of the rules governing each one (for example, find.exe never supports Unicode, while, as I discussed in this blog, the type intrinsic completely supports Unicode -- provided the file has a BOM in front of it). Such a list would be really cool, but it does not currently exist and no one seems to want to take the time to create it....
Plus occasionally other long-standing issues can exist like the one I talked about in this blog, a bug that none other than Mark Zbikowski ended up fixing for me in the Vista cmd.exe. The effort was truly appreciated.
Myth #11: All of the defaults in CMD.EXE like whether you are using a TrueType font are out of your control.
This particular myth is also wrong. Surprised?
You can change this setting in any CMD.EXE shortcut -- something that every single VS, SDK, and WDK console shortcut should be doing!!!
I am totally serious here. I can understand why people would be afraid to change the defaults in CMD.EXE itself for back-compat reasons of changing behavior in legacy console apps, but the shortcuts? Seriously, these should get updated.
Myth #12: You can't change the setting of whether a console window is using a TrueType font.
This myth too is quite untrue.
I have used a few console API functions and that IsConsoleFontTrueType function from this blog to change the font within a console window to a TrueType font, from code running in the console window.
This is something I would never recommend in production code, mind you; I only did it because someone told me it wasn't possible and I was sure she was mistaken.
The impact of the accomplishment was interesting, mind you; she and I dated for about a month after that. ;-)
And the story there may be worthy of its own dedicated myth-busting blog, along with the code itself. If people are interested, I mean. Let me know....
Myth #13: There are thirteen incorrect myths about support of Unicode text within the console that this blog will talk about.
Crap. You foxed me.
If I say you are wrong and that there are only 12 myths, then this is a genuine myth. Which means there are 13 myths.
But if there are 13 myths then you are right and this isn't a myth, it's just a fact. In which case there are just 12.
Maybe you should go shave a Spanish barber or something! :-)
So a while back, in a comment to The difference between 'Dangerous Characters' and 'Dangerous Minds' is the lack of Michelle Pfeiffer, user Abeywickrama commented:
Michael, I was in Turkey and could not log in to the Google account from Hotel PC. Latter I found the reason was that my username had letter 'i' in it. In Turkish keyboard the letter i [which is generally between u & o] represent a different 'i' than what we use. But I found letter i in a different place.
Now the comment was from a few years ago (kind of like the blog itself was), but I thought I'd follow up on it a little....
I'll start by giving the keyboard layouts.
First the Turkish F layout in BASE and BASE+SHIFT states:
And then the Turkish Q layout in BASE, BASE+SHIFT, ALTGR, and ALTGR+SHIFT states:
Now the Turkish F keyboard puts the dotted I on VK_OEM_4 (just to the left of the "u" character in the third row), and the dot-less I on VK_I between the ğ and the O in the second row (uppercase in the shift state).
(note that on a US keyboard, this would be on the "S" key and the "R" key, respectively)
The Turkish Q keyboard keeps the dot-less I in the base state while putting the dotted I in the ALTGR state, which can make typing the dotted I much harder. Though it puts all four letters on VK_I -- the "I" key on the US keyboard -- which taken together maybe makes it all a little easier....
When typing on a keyboard one is not used to, the problem Abeywickrama notes is probably a lot more common to run into -- which points to one of the very real problems with using unfamiliar computers in general, and the keyboards attached to them in particular....
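The collision Abeywickrama hit is easy to show concretely. There are four distinct letters in play, and under the default (non-Turkish) Unicode case mappings they pair up the "wrong" way for a Turkish user. The naive_match helper below is hypothetical -- an illustration of the kind of case-insensitive comparison that causes the problem, not anyone's actual login logic:

```python
# The four distinct letters involved (standard Unicode data):
#   U+0049 I  LATIN CAPITAL LETTER I
#   U+0069 i  LATIN SMALL LETTER I
#   U+0130 İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
#   U+0131 ı  LATIN SMALL LETTER DOTLESS I

# In Turkish, 'i' uppercases to 'İ' and 'I' lowercases to 'ı'; the
# default Unicode mappings, which Python's str methods use, instead
# pair 'i' with 'I' -- and also uppercase dotless 'ı' to 'I'.
assert 'i'.upper() == 'I'        # default (ASCII-style) pairing
assert '\u0131'.upper() == 'I'   # dotless ı also maps to plain I

def naive_match(a, b):
    # Hypothetical check: case-insensitive compare under the default
    # rules -- exactly the kind of thing that conflates the Turkish
    # i's, so a name typed with the "other" i fails or collides.
    return a.upper() == b.upper()

print(naive_match('\u0131', 'i'))  # True -- ı and i are conflated
```

Proper Turkish-aware casing (ICU, or the Win32 LCMapString with a Turkish locale) keeps the two pairs separate.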
Potential workarounds include soft keyboards (on computers, in browsers, in toolbars, and in particular web pages), though if the web page has its own special handling that can complicate things further.
I was told Google was sometimes doing extra work here but that they aren't doing it any longer, though I am unable to verify the veracity of either claim....
So I read the English becomes Europe's second language story from the Telegraph.
Of course the title was a bit sensationalistic, since that isn't really what happened....
The subtitle explains what the story was really going to say.
English becomes Europe's second language
The subtitle said:
English has become Europe's second language of choice with two thirds of people in the continent able to speak it, according to a survey.
Now that is a fairly large difference!
The article goes on a bit further to stir the pot with some more incendiary text:
The study found that English is the first foreign language studied in secondary schools in every country outside Britain and Ireland.
The results of the survey are a particular blow to the French, who recently launched a failed bid for their language to be made the sole official language of the EU headquarters in Brussels, claiming their mother tongue was "more precise".
Now I suppose I could go on, but you should probably read it yourself if you want to see some good old fashioned muckraking.
No, I'll do a little more:
It has also prompted calls for the EU to cut back on the £1 billion it spends every year translating official documents into the organisation's 23 recognised languages.
A spokesman for the Foreign Office said: "It's only right that the EU institutions think carefully about every penny they spend to ensure that they're getting the most from their money.
Though there is some truth in the article, I guess I'd be more impressed if a non-English speaking country were having its foreign office making the statements. And if so many different issues were given the clear timeline of a full story instead of a bit of sensationalistic fluff.
On the other hand, this is a sign of the times... or rather, The Telegraph.... :-)
Whether one agrees with the Wiktionary definition or not, if you work in software you know what a breaking change is.
Let's look at that definition for a second:
breaking change (plural breaking changes)
1. (computing) A change in one part of a software system that causes other parts to fail; occurs most often in shared libraries of code used by multiple applications. "Not possible to fix old entries without a breaking change, so remap old to new in import lib."
We have had the locale data in Windows for over 15 years now, and every version several things happen:
Literally every version this happens.
Occasionally other stuff has happened too. But these three things represent a cluster of things that has always happened.
Now the interesting philosophical question pops up here, one that I am kind of curious about:
Is each individual change within the second two bullet points above a separate breaking change?
Perhaps the above is an outrageous instance of reductio ad absurdum, though.
I mean, perhaps some changes would not really be reasonably considered to be breaking changes, and some would.
Other changes are there because people saw the unchanged behavior itself as a bug. And dependence on incorrect behavior is always treated as a second class citizen compared to not having such dependencies.
Is there a reasonable manperson standard that could be applied here, in order to determine which types of changes could be reported as breaking?
And, if that is the case, should such reports be given for each change?
Or should there be a list of all the qualifying changes at designated intervals, such as ship dates, or at beta dates?
On the other hand, maybe this is the wrong way to think about the problem, since we actually have an over 15 year history of not treating anything at all as a breaking change, even when the change was in fact later determined to break something.
And I will not lie, that has happened.
There are times in society where the first action a new team takes is to repudiate and assume fault in everything the old team did, but this hardly seems like the most prudent course here. So maybe the old team's approach (to focus on the message that anything in the data can change between versions, so if something of theirs broke then it was caused by their broken assumptions) is the correct way to look at things, rather than focusing on each individual change.
Yet this can also be cast as a form of laziness by someone who disagrees with the idea. They could look at it as someone being afraid to take on a complicated problem.
This point, and how to resolve it, is not a trivial one.
While it is not the most important issue I am dealing with at the moment, it is in the top five.
And it is the most important issue about which I am really able to talk at the moment.
I am interested in people's thoughts here.
How does the notion of breaking changes fit the model of per version locale changes?
It is something I talked about years ago, in my ISO 8601 is asinine blog. And not long after in ISO 8601 redux, which references Shawn's ISO 8601 Week of Year format in Microsoft .Net blog and Isaac's Best Week Ever blog.
Both Shawn and Isaac were trying to work around the fundamental difference between what Microsoft does with its GetLocaleInfo LOCALE_IFIRSTWEEKOFYEAR flag in native code (and its DateTimeFormatInfo.CalendarWeekRule in managed code) and the ISO 8601 rules, which are used by several different countries.
Unfortunately, the documentation on the Windows/.Net side doesn't help enough to set expectations.
Managed code describes it all as
A CalendarWeekRule value that determines the first calendar week of the year.
and native code describes it similarly.
With these descriptions, even with the cool example tables in the CalendarWeekRule enumeration, its official statement "This does not map exactly to ISO 8601", and its direct link to Shawn's blog (proving once and for all that blogs can indeed be treated as official documentation!), it is still way too easy for a person to assume that the Windows rules and the ISO 8601 rules line up when they do not.
Probably we should just add an actual ISO 8601-compliant locale field, though for whatever reason no one seems eager to do that (we can't change the meaning of the existing property). The workaround is pretty easy, since the MS implementation and the ISO standard aren't that different.
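The size of the gap is easy to demonstrate. In the Python sketch below, strftime's %U (weeks start on Sunday, days before the first Sunday land in a partial "week 0") stands in as an analogue of the FirstDay-style Windows rule -- it is not the actual CAL_* or CalendarWeekRule implementation, just a stand-in for that family of rules:

```python
from datetime import date

d = date(2010, 1, 1)  # a Friday

# ISO 8601: the week containing the year's first Thursday is week 1,
# so January 1, 2010 still belongs to week 53 of ISO year *2009*.
print(tuple(d.isocalendar()))  # (2009, 53, 5)

# A FirstDay-style rule: numbering restarts at January 1, with the
# days before the first Sunday in a partial "week 0" of 2010.
print(d.strftime('%U'))  # '00'
```

Same date, and the two schemes don't even agree on the year the week belongs to -- which is exactly the mismatch people keep reporting as a bug.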
I think we're just going to be stuck with this difference between the two, with people periodically reporting this as a bug....
The email came, as most emails seem to, in the middle of the day to the account I don't tend to check much in the middle of the day.
Not their fault; it just wasn't a work email and one can't spend all day checking one's personal mail and facebook if one has a job!
Okay, on second thought some people do that. But it isn't really for me....
Ok, end digression. The e-mail:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Redmond Area MS Support Group Starting

A new support group for people living with MS in the Redmond area is being planned. Find comfort and strength with your peers as you learn, cope, laugh and celebrate the challenges and successes of living with MS.

Monday, October 11, 2010
6:30 - 7:30 p.m. PST
Redmond Library
15990 NE 85th St
Redmond, WA 98052

Join us for the group's first meeting with an open discussion about issues that matter to you!
- Who else in my neighborhood lives with MS?
- Where can I meet others coping with similar symptoms?
- How do I build on a positive approach to managing my MS?
- What speakers would I like to hear from?
- What resources am I looking for?

Contact Sue Dahlin at firstname.lastname@example.org [mailto:email@example.com] if you plan to attend or are interested in attending future meetings.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I've had kind of a tenuous connection with support groups over the years.
I went to one not too long after I was first diagnosed and kind of in denial about the whole thing.
I was working for neurologists and neurosurgeons, and even without the Internet I knew probably too much about MS for my own good. It was easy not to really think of myself that way, since I sometimes couldn't even convince myself it wasn't in my head.
Anyway, being the person least affected by the disease, I found the support group to be depressing. I'm not the sort of person to derive hope from people who bear terrible burdens well. When I think about parts of Heinlein's Stranger in a Strange Land like the bit about Rodin's Fallen Caryatid Carrying Her Stone:
"This poor little caryatid has fallen under the load. She's a good girl - look at her face. Serious, unhappy at her failure, not blaming anyone, not even the gods... and still trying to shoulder her load, after she's crumpled under it.
But she's more than just good art denouncing bad art; she's a symbol for every woman who ever shouldered a load too heavy. But not alone women - this symbol means every man and woman who ever sweated out life in uncomplaining fortitude until they crumpled under their loads. It's courage... and victory.
Victory in defeat, there is none higher. She didn't give up... she's still trying to lift that stone after it has crushed her... she's all the unsung heroes who couldn't make it but never quit."
I am simply not inspired in that case, or in the ones where the load is being shouldered.
Generally, I found myself moved to pity, which I knew that they didn't want.
And that moved me to a bit of shame that I couldn't cast out the pity.
So not going just seemed easier.
I was invited to other groups and other events from time to time, but usually found suitable excuses. And as I got worse but kept on living life people started suggesting I could be some kind of inspiration for others.
But I knew how I felt about that same group of people from that first meeting -- I felt pity. And I really wasn't interested in inspiring pity. Whether I had fallen under my load or carried on in spite of it all.
Which is not to say everyone would feel that way, many probably wouldn't.
It wasn't until a few years back, when I took a Negotiation Strategies class, that I learned a lesson really outside of what the class itself was teaching, and applied it to my life....
What I realized was that yes, some people will be moved to pity. And others to denial or discomfort or shame or pseudo-empathy (it wouldn't be my pain they were feeling, it would be their reflection of what they thought my pain might be).
But all of that? It isn't about me -- it's about them. Those are their issues.
If I understand that then I can move past however it might have made me feel and accept it or deflect it or ignore it or whatever. But I could avoid letting it bother me....
I suppose one of the reasons I enjoy going out again (after a bit of hermit-like break) is that now with the iBot it changes the reactions a lot, and I don't mind mixing up the landscape a bit.
And there are times I forget that "it's all about them" lesson for a moment, but not for too long. Because it's still all about them.
So maybe I could get something out of that meeting on October 11th.
It's weird to think I had a bunch of years where I wasn't healthy enough to be spending time in a support group, but that was kind of where it was.
Now of course there is a new problem -- the iBot can be kind of distracting.
I'll think about it, in any case. It's probably worth the time to give it a shot.
If you have MS and are interested, maybe I'll see you there. :-)
The title of this blog was inspired by a blog from a few years ago in Elizabeth Pyatt's A Linguist in the Wild blog entitled Arabic with a Kurdish Accent?. However, this blog has nothing to do with accent and is really focused on issues of script and collation.
It was a little over 16 months ago, in my The Whey doesn't get a locale, either blog, that I responded to the It is Time to Add Kurdish Culture to VS .NET Globalization paper, which mentioned me and my work a bit.
I skipped over several of the (in my opinion) interesting linguistic issues related to Kurdish, with just a hint to the fact that there are multiple scripts involved in a theoretical locale list.
But one of the linguistic issues touches on the area of
and since these topics are all interesting to me, I thought it might make sense to talk about Kurdish and these issues a little.
I'll start with the two major branches of Kurdish:
I won't be talking about Yekgirtú in this blog, though its unification efforts are also kind of interesting, so I could talk about it some other time!
Now an interesting question comes up when one answers the question of how to sort the language if one is creating a custom culture and must pick a sort order from all the ones that exist.
If you look at the It is Time to Add Kurdish Culture to VS .NET Globalization paper, you will see that Azad Ali and Frederick Kohun explicitly use the ar-IR (Arabic - Iraq) culture to get their sorting from. Arabic sorting.
This is obviously not so very Perso-Arabic due to the specific differences in Persian when compared to Arabic. But if one looks at the letters used for Sorani, you may not get the same kind of "Persian", Perso-Arabic sense that languages like Persian and Urdu might give you.
Whether this is due to worry about support for some letters or is how people want it? That much is unclear, but we'll assume for now that the results are actually correct. So perhaps using ar-IR rather than something like fa-IR might be okay in this context, if there are no letters that would be different.
Another problem quickly rears its head, however -- one that the Yekgirtú alphabet I said I wasn't going to talk about helps to highlight.
You see, the Sorani alphabet for Kurdish has a feature that makes it somewhat unique among languages using the Arabic script.
That difference? Well, vowels are required, and the result becomes a true alphabet rather than an abjad.
This factor could lead one to want a sort that is more like the Latin script Kurmanji, because even though it is a different script there is a total mapping between letters in the two cases.
You can actually find examples of both sorts out there, though without a mechanized solution it is possible that this is more of an anomaly than anything else - a limitation of technology and of sketchy manual labor.
Now in the Latin script, Kurmanji does not have Turkic rules about either letters or casing. So perhaps the English sort or something like it is good enough for them -- or perhaps they are just using the thing most easy for them to get on computers and if they had their choice they'd want something different.
Interestingly, both English and Arabic sit in the default collation table of Windows, so assuming one doesn't need the Arabic letters of Sorani in the Latin order of Kurmanji, then collation will simply work (if not, then one would have to work a little harder to get the right results).
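To illustrate the "work a little harder" case, here is a hypothetical Python sketch of a tailored Kurmanji sort. The alphabet string follows the commonly cited Kurmanji letter order, and the sample words are illustrative only -- treat both as assumptions, not authoritative data:

```python
# Commonly cited Kurmanji (Latin-script Kurdish) letter order; treat
# this string, and the sample words below, as illustrative assumptions.
KURMANJI_ORDER = "abcçdeêfghiîjklmnopqrsştuûvwxyz"
RANK = {ch: i for i, ch in enumerate(KURMANJI_ORDER)}

def kurmanji_key(word):
    # Characters outside the alphabet sort after everything else
    return [RANK.get(ch, len(KURMANJI_ORDER)) for ch in word.lower()]

words = ["çav", "cil", "dar", "bav"]

# Raw code-point order exiles ç (U+00E7) to after z:
print(sorted(words))                    # ['bav', 'cil', 'dar', 'çav']
# The tailored key restores ç to its place right after c:
print(sorted(words, key=kurmanji_key))  # ['bav', 'cil', 'çav', 'dar']
```

The same technique -- a rank table built from the desired letter sequence -- would also let one experiment with placing the Sorani Arabic-script letters in the Kurmanji order, as discussed above.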
It does seem odd to be placing the vowels in the "Arabic" places when they act like the vowels in Sorani Kurdish do, but perhaps that is indeed what people are used to at this point; perhaps it is what they want.
Note that the "Arabic" sort puts some of these letters in different places than the letter lists for Kurdish typically do. Which makes this question all the more complicated....
What bothers me most here is the thought that perhaps technology has led to changes in language made merely for the expedience of getting onto computers. I don't know why exactly, but that bothers me -- as if there is some part of language and linguistics that technology should not sully....