Yesterday, I was having an interesting conversation, one that has given me pause.
We were talking about Unicode, and the need for components in the OS to do a better job of directly embracing it.
This is obviously nothing new around these parts, but a new twist was interjected into the conversation.
You see, the components we were talking about were consistently calling the "W" suffixed Unicode functions -- either explicitly or because Windows has been compiled with /DUNICODE and /D_UNICODE for years now.
However, at certain critical bottlenecks, they had two requirements added:
The net effect of these two requirements was a system default locale dependency, which resulted in many serious limitations, the most important of which:
Now the fact that they were mistaken was not a surprise; people have been making this mistake for almost two decades.
The shocker (for me) was how long it took them to understand and accept that some characters were not in any ACP or OEMCP at all. And that "Unicode-only" locales, first introduced during the early betas of NT 5.0 (aka Windows 2000), even existed.
I first wrote Code pages are really not enough.... and Why ACP != OEMCP (usually) over six years ago, but the very real destruction of text caused by flags like ES_OEMCONVERT was simply new information to some people who would never blink at swearing by the need for Unicode support.
So I've decided that the mantra of making sure components "must support Unicode" is insufficient.
I'll need to make sure it's always clear that a system default locale/CP_ACP/CP_OEMCP dependency is just as bad, and perhaps even worse. Because removing a code page dependency can be more involved than just compiling the code differently.
Sometimes a lot more involved....
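To make the ES_OEMCONVERT style of damage concrete, here is a minimal sketch in Python. It assumes Python's cp1252 and cp437 codecs as stand-ins for a typical ACP and OEMCP, and the helper name oem_round_trip is made up for illustration. Windows itself uses best-fit mappings, so its substitutions may differ from the plain '?' shown here.

```python
def oem_round_trip(text, acp="cp1252", oemcp="cp437"):
    """Model text passing through the ACP, then the OEMCP, and back.

    Hypothetical helper: Python codecs stand in for the Windows code pages.
    Any character missing from a code page is replaced with '?'.
    """
    # What a code-page-based component sees: the text squeezed into the ACP.
    ansi_view = text.encode(acp, errors="replace").decode(acp)
    # The ES_OEMCONVERT-style trip: ACP text -> OEM code page -> text again.
    oem_bytes = ansi_view.encode(oemcp, errors="replace")
    return oem_bytes.decode(oemcp)


print(oem_round_trip("café"))    # survives: é exists in both code pages
print(oem_round_trip("Škoda"))   # Š is in cp1252 but not in cp437
print(oem_round_trip("हिन्दी"))    # Devanagari is in neither code page
```

Note that every call in the sketch is "Unicode" as far as the caller can tell; the loss happens entirely inside the code page hop.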
There are a lot of people in the world.
A subset of those people have some interest in language and technology.
A percentage of them read this blog from time to time.
Some of those people who read this Blog work for Microsoft.
And some of them are in the Puget Sound area.
And a few of those people are still in town now rather than being off for the rest of the year due to wanting to avoid losing vacation rollover time.
Today's blog is for that subset of people -- the rest of you can go read what's left of TechCrunch (or as I like to think of it, "Why I don't like startups anymore!").
I won't say you're children of destiny, but as it turns out you may be one of the people who turn out to be immortal, invulnerable superheroes who have outstanding luck with your preferred gender and are incredibly adept in countless other ways that you could spend countless hours enumerating if you were not too busy enjoying your fulfilling lives.
So you may in fact be Children of Destiny™!
If you're one of them, then all I can say is Congratulations. And Welcome.
If you're not sure whether you're one of those people, there is a great way to find out.... you can attend one or both of the presentations I did for the 35th Internationalization and Unicode Conference!
Today (Tuesday November 29th, 2011, in 86/2835)!
@ 10am:
Locales on Windows - the view from 18 years in
It was 1993 that the basic model for locales was integrated into Windows in its current form, and that model has been largely unchanged for much of that time. In this unique view of those 18 years, you can find out about the lessons learned, unlearned, relearned, and mis-learned. You'll leave this all up view feeling both more impressed and more embarrassed to know Microsoft than you ever have before, even if you were there while it was going on!
This talk shows off a lot of the new exciting features in Windows 8 locales, keyboards, and fonts!
@ 11 am:
Korean Hangul: from Sejong the Great's Hunmin Jeongeum to Unicode 6.1
Hangul has had a long history from the 1446 document that first described the underlying Jamo to the latest Jamo additions to the Unicode Standard. This presentation will do a whirlwind and only mildly irreverent tour of that history in the form of a presentation to Sejong to explain what has happened, highlighting the use, encoding, and re-encoding of one of the more perfect alphabets, imperfectly handled, in this or any other age.
This talk shows off the new Old Hangul IME in Windows 8!
So, if you are one of those potential Children of Destiny™, then for the sake of you and your significant other and everyone who will have the chance to interact with you in the future, pop on over to 86/2835 for one or both presentations!
Leo's question was, like many of the questions I cover here, pretty interesting:
Hello,
I’m writing UI automation for ***REDACTED***, and dev UI displays number in native Arabic digits for ar-SA. Hence, I was trying to convert my test integers into Arabic digit strings so I can do string comparison. However, this code below doesn’t seem to work. Anybody knows what’s the proper way to do this in C#? Thanks.
CultureInfo info = new CultureInfo(CultureInfo.CurrentCulture.LCID);
info.NumberFormat.DigitSubstitution = DigitShapes.NativeNational;
int num = 3;
Log.TraceInfo("Name: " + info.Name); // this outputs Name: ar-SA
Log.TraceInfo("NUMBER: " + string.Format(info, "{0:d}", num)); // this outputs NUMBER: 3
Now there are a few different problems here.
First of all, as I initially mentioned more than six years ago in Is Whidbey's international support finished?, these properties involving digit substitution (NumberFormatInfo.NativeDigits and NumberFormatInfo.DigitSubstitution) and the DigitShapes enumeration all are data-only properties not hooked up to the behavior of the cultures they are nominally connected to.
And second of all, even now over six years later, there has been no move to change that, to add support for digit substitution.
I don't want to ignore the related third of all, which is that even if you ask the question of the CurrentCulture, which is directly connected to the current user default locale in Windows and picks up every other customization for data formats, it does not give you the current user override values. So you can't even use this information provided by .Net to get the data being used.
Thankfully, none of this matters.
Because Digit Substitution as a feature is a font level switch which does not change the underlying numbers themselves.
Therefore even if these other problems did not apply, the fact is that the code, which looks at the numbers themselves, is giving accurate answers, returning actual ASCII digit values.
And beyond that the only way to retrieve the actual digits based on shape would probably involve OCR technology.
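For the test automation scenario in the question, one way to sidestep the whole issue is to compare on digit values rather than digit shapes. The following is only a sketch in Python, not .NET, and the helper names are hypothetical. It shows that a number formatted with ASCII digits stays U+0030..U+0039 no matter how a font draws it, and that a string that really does contain Arabic-Indic digits can be normalized before comparing.

```python
import unicodedata

ARABIC_INDIC_ZERO = 0x0660  # ARABIC-INDIC DIGIT ZERO


def to_arabic_indic(number):
    """Build a string that really contains Arabic-Indic digit characters."""
    table = {ord("0") + i: ARABIC_INDIC_ZERO + i for i in range(10)}
    return str(number).translate(table)


def normalize_digits(text):
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(str(value) if value is not None else ch)
    return "".join(out)


formatted = "NUMBER: " + str(3)
# Font-level digit substitution would not change these code points.
print([hex(ord(ch)) for ch in formatted[-1:]])

# If a string does hold native digits, normalizing makes the comparison safe.
print(normalize_digits("NUMBER: " + to_arabic_indic(3)) == formatted)
```

With this approach the expected value in the test never needs to know which digit shapes the UI happens to be showing.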
All of this underscores a tangential point not often discussed -- that the underlying technology in screen readers never gives any indication that digit substitution has been applied -- so if you're blind we don't even act like we're substituting anything.
And thus the title of today's blog, and the irony thereof....
I can't even imagine how we'd do this any other way, really. Can you?
Previous blogs from this series:
Continuing the tradition started in part 10 about things that are more likely to make you say meh than oooooo!, let's talk about some of those new keyboards again.
Looking all the way back to part 2 of the series, I had an interesting list of keyboards at the end:
And of course people are already noticing these keyboards, and even trying to use them.
Like the other day, when a developer here asked me:
Hi Gentlemen,
I installed the Myanmar keyboard on Windows 8. When we call ::GetKeyboardLayout(), we got 0xFFFFFFFFF0302400. So, the primary langid is 0x0 (instead of 0x55 for LANG_MYANMAR). Is this a bug?
Now that value for LANG_MYANMAR has been reserved since at least 2007, but it has never shown up anywhere before in either ntdef.h or winnt.h.
Remember that the main purpose of the LANG_* and SUBLANG_* constants is to define legal values to use in all functions that accept or return LCIDs -- and no data is defined for a Burmese locale.
Now in theory since we can work with expatriate linguists and language experts, the fact that we can't work directly with people in country does not block such a locale -- in fact this helps us with the font work we did do for Myanmar and the others covered by the list above.
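The arithmetic behind the developer's observation can be sketched in a few lines of Python. The function name split_hkl is made up; the masks mirror the PRIMARYLANGID and SUBLANGID macros in winnt.h, applied to the low word of the value that GetKeyboardLayout returned.

```python
def split_hkl(hkl):
    """Split a keyboard layout handle into (LANGID, primary, sublanguage)."""
    langid = hkl & 0xFFFF      # the low word of the HKL is the LANGID
    primary = langid & 0x3FF   # same mask as the PRIMARYLANGID macro
    sublang = langid >> 10     # same shift as the SUBLANGID macro
    return langid, primary, sublang


langid, primary, sublang = split_hkl(0xFFFFFFFFF0302400)
print(hex(langid), hex(primary), hex(sublang))
```

For the value in the question the LANGID is 0x2400, which gives a primary language of 0x0 and a sublanguage of 0x09, so the 0x55 reserved for LANG_MYANMAR never appears.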
Generally, this change can be thought of as a way to try to make huge chunks of what is displayable in fonts able to be typed in keyboards.
In the Developer Preview and even in latest builds, there doesn't seem to be an indication in the registry of what the language/script might be -- e.g. no "Layout Locale Name" registry value. I'm not sure if this is an oversight or not, but perhaps people will have to keep the list around themselves if they get a 0 for the LANGID and there is no "Layout Locale Name" there:
Some might wonder why I don't just suggest that people use the "Layout Text" registry values like the ones in the table, but since they can change occasionally (the underlying keyboard doesn't, but the string can if the name of the keyboard changes), it seems like a bad idea to me.
I originally asked for "Layout Locale Name" to be added, since users trying to suss out custom keyboard layouts would have to look for that anyway. But people didn't see the point -- anyone reading here want to weigh in on that?
Perhaps there will be another way to get at the info programmatically at some point though....
Now I'm a much bigger fan of evolution than intelligent design.
Though as opposites go, they are a bit unbalanced.
Design can be carefully planned yet be quite unintelligent, after all -- that is in fact more provable.
So here in this series that talks about the EVOLVING story of locale support, I thought I'd give such an example, in order to help out a colleague!
Now in part 9, I took the following list that applied to all the prior parts:
Then in part 9 I found one where only the first three applied.
Now I'm going to point out something that none of them really apply to, but which nevertheless talks about the evolution of locale support at Microsoft....
It started when we added a locale for "English - Caribbean" (LCID value of 0x2409).
We already had several locales that conceivably fit under that grouping:
One could think of "English - Caribbean" as being an easy way to avoid having even more people ask for locales for the Cayman Islands and Barbados and Antigua and so on and so on.
And in the world where everything was based on LCIDs, that was fine and dandy.
But as we became more and more of the opinion that LCIDs suck, we at first for the sake of .Net used the "name" of en-CB until we rather embarrassingly realized how unintelligent we had been here -- what's the point of moving from a proprietary solution to a more standards-based one if we just start making up codes?
Thus, as I discussed in The road to standards compat is paved with app back-INcompat, we changed it to use the UN M.49 region code for the Caribbean: en-029.
Perhaps this points to the solution to the issue raised by former colleague Sébastien Molines in the Suggestion Box:
Hi Michael
I've read recently about es-419, or "Spanish appropriate to the UN-defined Latin America and Caribbean region" -- See http://www.inter-locale.com/ID/rfc5646.html#region
Such a language code would be really useful for the company where I work, which specifically targets the South American and Caribbean markets rather than Spain. But it seems that the .NET Framework doesn't support region subtags. Do you think it's something that will be supported some day? If not or in the meantime, what's the best way of supporting South American/Caribbean locales and in particular localizing for these markets in one go?
Hope all is good--I have good memories of working with you!
Seb
Now whether or not Microsoft ever wanted to add es-419, the fact that we added en-029 sets a clear precedent that could be used in any company's strategy for custom locales/custom cultures.
Sébastien's company could thus create an es-419 custom locale, though he may run into the big problem that our en-029 had -- that the date formats and currency preferences and even the dialect can vary between the Cayman Islands and Barbados and Antigua and Saint Vincent and the Virgin Islands and all of the other places where Caribbean English is spoken.
This points out the downside of such groupings -- it is hard to give consistent and correct data for them!
Perhaps it would have made more sense to create en-BB for Barbados, en-JM for Jamaica, en-AG for Antigua, and so on. :-)
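For anyone sorting out names like en-029, es-419, and en-JM, here is a rough Python sketch of how the two kinds of region subtag can be told apart. It assumes the simple language-region shape only (no script, variant, or extension subtags), the function name region_kind is hypothetical, and the check is purely syntactic: it does not verify that a code is actually assigned, which is exactly how a made-up code like en-CB slips through.

```python
import re

# Simplified shape: 2-3 letter language, then either a 2-letter
# ISO 3166-1 country code or a 3-digit UN M.49 area code.
TAG = re.compile(r"^(?P<language>[A-Za-z]{2,3})-(?P<region>[A-Za-z]{2}|[0-9]{3})$")


def region_kind(tag):
    """Classify the region subtag of a simple language-region tag."""
    match = TAG.match(tag)
    if match is None:
        return None
    region = match.group("region")
    return "UN M.49 area" if region.isdigit() else "ISO 3166-1 country"


for name in ("en-029", "es-419", "en-JM", "en-CB"):
    print(name, "->", region_kind(name))
```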
Meanwhile, over on stackoverflow, there was a recent thread, started by DeadMG:
I'm looking at the IsCharAlphaNumeric Windows API function. As it only takes a single TCHAR, it obviously can't make any decisions about surrogate pairs for UTF16 content. Does that mean that there are no alphanumeric characters that are surrogate pairs?
Unfortunately, neither the IsChar* functions in USER32.DLL nor the NLS GetStringTypeW function underneath them can handle supplementary characters. There is no Win32 way to get the info.
You can use managed code, and the CharUnicodeInfo class I first mentioned in A little bit about the new CharUnicodeInfo class:
Note that every one of these methods has two overloads -- one that accepts a single System.Char, and the other which takes a System.String and an index value. The latter case is for dealing with supplementary characters, which are made up of a high and low surrogate (also known as a surrogate pair).
Unfortunately, even functions like GetStringTypeW (which takes whole strings and could in theory return info about surrogate pairs), don't handle them.
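To answer the original question directly: yes, there are alphabetic characters that need a surrogate pair in UTF-16. The sketch below uses Python's unicodedata module purely as an illustration (it is not a Win32 or .NET API), with U+10400 DESERET CAPITAL LETTER LONG I as an example of my own choosing and a hypothetical helper named utf16_units.

```python
import unicodedata


def utf16_units(text):
    """Return the UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


letter = "\U00010400"  # DESERET CAPITAL LETTER LONG I

# As a whole code point it is an uppercase letter...
print(unicodedata.category(letter), letter.isalpha())

# ...but in UTF-16 it is two code units, a high and a low surrogate.
units = utf16_units(letter)
print([hex(u) for u in units])

# An API that is handed one code unit at a time sees only a surrogate
# (category Cs), which carries no letter/digit information by itself.
print(unicodedata.category(chr(units[0])))
```

This is the gap a single-TCHAR function cannot close: the classification lives on the pair, not on either half.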
Back in 2005 I wrote a speclet (what people in Windows today would call a "one pager"), that did two things:
I even had a prototyped version of this change, which wasn't actually accepted for Longhorn/Vista and wasn't picked up for Windows 7.
In fact, it isn't in Windows 8, either....
Win32 simply refuses to see beyond the BMP.
Raymond Chen asked me a somewhat related question that occurred to him when he was thinking about all of this:
Why does IsCharAlphanumeric check for C3_KATAKANA|C3_HIRAGANA and explicitly exclude them? In other words is it
* Katakana and Hiragana characters are genuinely alphabetic, but IsCharAlphanumeric wants to reject them because <obscure reason>, -or-
* Katakana and Hiragana characters are not genuinely alphabetic, but for <obscure reason>, they are reported as C1_ALPHA, so IsCharAlphanumeric needs to filter them out.
From looking at http://www.fileformat.info/info/unicode/char/30d8/index.htm it appears that Katakana and Hiragana (or at least character 30d8) are considered Letters by Unicode. I.e., we are in case 1 above. So what are the <obscure reasons> that IsCharAlphanumeric wants to reject them?
This weird "Kana isn't alphabetic" approach is something I previously talked about in IsCharSomethingOrOther? and Is Kana 'alphabetic' ? Depends on who you ask....
Summary: there is no good reason; just some random person who was burned taking Japanese lessons nearly two decades ago and decided to take it out on everyone else....
Maybe he flunked out of the class or something.
Luckily, GetStringTypeW gives the right answer here, for Japanese at least. Though not so much for supplementary characters (including the 200-odd Extension B ideographs in JIS X 0213).
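Raymond's reading of the Unicode data is easy to double check. This small Python sketch (using unicodedata, with a helper name of my own) looks at the character he cited, U+30D8, alongside its Hiragana counterpart U+3078.

```python
import unicodedata


def is_unicode_letter(ch):
    """True when the General_Category of ch is one of the letter categories."""
    return unicodedata.category(ch).startswith("L")


for ch in ("\u30d8", "\u3078"):
    print(unicodedata.name(ch), unicodedata.category(ch), is_unicode_letter(ch))
```

Both come back as Lo (Letter, other), so as far as Unicode is concerned kana is alphabetic, whatever IsCharAlphaNumeric may think.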
Looking ahead to Windows 8 and modern, the CharUnicodeInfo class should be available in the sandbox, which covers the future, at least.
But for native code, we really don't give a solution.
CT_CTYPE4 FTW? :-)
As the Unicode 6.1 beta marched along happily, Peter Constable noticed and asked the following:
In UnicodeData.txt (both 6.0 and 6.1 beta), why is the bidi category of 1F48C LOVE LETTER set to L rather than ON? By design, or (I'm guessing) a bug?
And Ken Whistler jumped in with the answer and explanation:
It is an artifact of the heuristic which is used to assign initial values to the 1000+ new entries typically appearing for a new version of UnicodeData.txt for a release. That heuristic guesses that a character with "LETTER" in its name is a letter, and assigns initial Bidi_Class properties accordingly, before I go through attempting to find all the exceptions manually and correcting them.
Apparently both I and everybody else missed this during the beta review for Unicode 6.0. This clearly is a bug, and should be fixed for 6.1. I'd suggest dropping a short note in the hopper as feedback on PRI #206, so we don't lose track of this and remember to get it fixed along with anything else that turns up in the data files.
BTW, when you report that one, there is another with the exact same problem:
U+1F524 INPUT SYMBOL FOR LATIN *LETTER*S
which is also bc=L, instead of the expected bc=ON.
Cf.
U+1F520 INPUT SYMBOL FOR LATIN CAPITAL LETTERS
which *did* get corrected, and is the expected bc=ON.
--Ken
Now obviously, the heuristic Ken refers to here could easily be improved.
For example, if it says LETTER with nothing after it, then maybe it's a love letter, versus an actual letter.
And again, for example, if it has the word SYMBOL in it, then perhaps that would override it having the word LETTER in it.
And so on.
You get the idea....
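Just to show the shape of it, here is a toy version in Python. To be clear, this is my guess at the kind of rule involved, not Ken's actual tooling: the naive check is a plain substring test, and the refined one applies the two suggestions above. It only decides whether a name looks like a letter at all; picking between L, R, and AL is a separate problem it does not attempt.

```python
def naive_looks_like_letter(name):
    """The guess as described: any name containing LETTER is a letter."""
    return "LETTER" in name


def looks_like_letter(name):
    """A slightly less gullible guess (hypothetical refinement)."""
    words = name.split()
    # The word SYMBOL overrides any mention of letters.
    if "SYMBOL" in words:
        return False
    # LETTER must be followed by something, as in LATIN SMALL LETTER A.
    # A name that ends in LETTER is more likely mail than an alphabet.
    return any(word == "LETTER" for word in words[:-1])


for name in ("LOVE LETTER",
             "INPUT SYMBOL FOR LATIN LETTERS",
             "LATIN SMALL LETTER A"):
    print(name, naive_looks_like_letter(name), looks_like_letter(name))
```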
Over in the Suggestion Box, "Eli the Bearded" asked:
I see a need for a Unicode character story behind 'PILE OF POO' (U+1F4A9).
Ah yes -- good old U+1f4a9.
U+d83d U+dca9, as a UTF-16 "surrogate pair".
Pictorially, scatologically, this one:
Originally known as DUNG, or "dog dirt".
💩
It is one of the Emoji.
And a fun one.
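For the curious, the surrogate pair given above is not magic; it falls out of the standard UTF-16 arithmetic. A short Python sketch (the function name is my own):

```python
def to_surrogate_pair(code_point):
    """Compute the UTF-16 high and low surrogates for a supplementary code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F4A9)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
print("\U0001F4A9".encode("utf-16-be").hex())
```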
I mean yes, Unicode sold its soul when it agreed to encode the Emoji. And this one has become a fun counter example to every proposed character people have, with the pattern:
"They encoded a PILE OF POO but they didn't encode _____________."
e.g.
"They encoded a PILE OF POO but they didn't encode Klingon."
It has even had time in Reddit, in a "Maybe they added too much to Unicode 6.0" thread that people had fun with -- you can read it here.
And it is in Windows 8, too.
No weirder than any other Emoji, though. All of it adds up to a PILE OF POO in my book.
I mean, "threads" like this one notwithstanding....
I can't do this. I can't.
Some stories are better left untold....
So I've talked a lot about new keyboards and new locales and such, and everything I have talked about so far has four things in common:
Some casual (and not-so-casual) readers might assume that I am perhaps only going to talk about stuff that meets all four criteria.
But today I'm going to talk about something that only the first three apply to; my involvement has been limited at best.
In other words, you can think of it as proof that I believe there are cool features in Windows 8 that I had very little to do with! :-)
I'll start with a blog of mine from this last March, Nastaliq is not just another script....
And the new Windows 8 font: Urdu Typesetting, a new member of the Arabic script font family:
It will be (and given how many people have or are installing the //Build Developer Preview of Windows 8, is) the first widely available Unicode font to support Nastaliq.
Here you can see it contrasted with Arabic Typesetting, a Naskh based font, for the same text:
That's in WordPad.
And here it is in Notepad:
Now as I pointed out yesterday:
That Nastaleeq and not Naskh should be the writing style used for computers is also based on this misconceived “Nastaleeq or Naskh” notion – which in turn is an unfortunate legacy of Urdu word processing packages which supported one style or the other. So far as Unicode is concerned, for example word Pakistan would always comprise of characters Pay, Alef, Kaf, Seen, Tay, Alef and Noon.
The underlying text here is the same either way, but the Nastaliq is quite simply overwhelmingly preferred by many people for use in Urdu documents -- like Urdu poetry, for example.
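The point in that quote is easy to see in code. Here is a small Python sketch that lists the characters behind the word Pakistan. One assumption to flag: I am using the usual Urdu spelling with U+06A9 (which Unicode names KEHEH) for the letter the quote calls Kaf. Whether a Naskh font or a Nastaliq font draws the result, these code points are what is stored.

```python
import unicodedata

# Pay, Alef, Kaf, Seen, Tay, Alef, Noon (assumed Urdu spelling)
PAKISTAN = "\u067e\u0627\u06a9\u0633\u062a\u0627\u0646"

for ch in PAKISTAN:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```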
Anyway, if you have the Windows 8 Developer Preview, you can try out Urdu Typesetting to see how it works for you.
Keep in mind that it is mostly meant to be a Document font, not a UI font -- which should suit the needs of most of the people who have been asking for it, though I imagine people trying it out might try it many different places and at many different sizes. This is a font that really does need a little space -- you have been warned!
One caveat: a lot of work happened over the last few months to improve the font: lots of kerning was added, for example, and some compatibility work to fix minor bugs people found in Word vs. WordPad vs. Notepad. I was almost tempted to say nothing until beta, but someone would have stumbled on the font. Perhaps someone already has. So after talking to the owner of the font in MST, I decided to go ahead and write this up. I figured everyone understands about pre-beta vs. beta vs. release, and there are probably some people who would be very, very interested to find out about this long-requested feature -- now a part of Windows 8!
I have several people I'll be forwarding this blog to who have been asking for it over the years, and if you know people who have been looking for a good Nastaliq font you should do the same. Enjoy!
Special thanks to colleague and friend Irfan Gowani for loaning me some of his Urdu poetry for the screenshots -- I will probably be using them in another blog or two in the future....
Back in the middle of 2002, Abdul-Majid Bhurgri wrote a white paper for Microsoft, entitled Enabling Pakistani Languages through Unicode.
It was not so much about Microsoft's own support of Pakistani languages, which if you go back to 2002 was fairly scant -- we supported an Urdu - Pakistan locale (added for Windows 2000) but with no specific sort order (meaning it had the sort order from the default table, intended primarily for the Arabic language), even though we knew there was a different collation. The white paper had a purer intent than that: a love of language, and a desire to see the right thing done with the languages of Pakistan. A great white paper done as the microsoft.com/middleeast site was just starting to think beyond the Middle East and look at languages in other parts of the world that shared the Arabic script.
Abdul-Majid's site can be found here.
The 35-page white paper contains the following table of languages and the number of people who speak them as their mother tongue:
And some interesting text about written language in the country:
Major spoken languages of Pakistan are: Punjabi, Saraiki, Sindhi, Pashto, Urdu, Balochi, Hindko and Brahui. Of these, only Urdu, Sindhi, and Pashto have a standardized alphabet. There are very few written works available in these other languages. Speakers of these languages, if they ever need to write in their language, use the alphabet of some other major language (usually Urdu or Sindhi) in which they have been formally educated. For Punjabi, mostly Urdu alphabet and writing style is used because most of the Punjabis have received their schooling in Urdu. For Saraiki, Urdu as well as Sindhi alphabet is used because Saraiki is spoken in Punjab as well as Sindh. Balochi also does not have any standardized alphabet. Mostly Urdu, sometimes Farsi, and occasionally Sindhi alphabets are used for it. Situation of the remaining languages is not much different.
The paper explains some of the important distinctions about Arabic script vs. language to answer people who misunderstand that distinction, and perhaps more importantly provides some of the best text I have read explaining the Naskh/Nastaleeq difference, e.g.:
The white paper also did some work to contrast the sort orders of Pashto, Sindhi, and Urdu.
Some time in 2004, the data in this paper, and its information about Urdu, was used as one of the sources for the Urdu collation that was finally added to Vista, many versions after the locale was added to Windows (as well as supporting data for the different sort order for Pashto, which was being added for Afghanistan in Vista).
It was preferred over the document Michael Everson wrote about the languages of Afghanistan, because that document primarily used the Unicode Collation Algorithm tailoring syntax without word list examples (and we don't use the UCA). The sorts were comparable either way.
Anyway, fast forward to part 5 of this series you are reading now, which listed the locales being added to Windows 8, which include:
This is pretty exciting, since at one point Sindhi was being considered for Vista (but was ultimately not done).
I suspect that Abdul-Majid Bhurgri (who I was in contact with back in 2007 talking about Urdu and Sindhi) will be pleased to see Sindhi finally being added to Windows 8!
Interesting trivia about our support of Pakistan:
Our GEO location data for Pakistan includes the following data:
Neither name is wrong, and context varies, but of course locales don't have this notion of two kinds of names. If you look at the Developer Preview from //BUILD, we're still working out which name to use across these three different Pakistani locales. Don't worry, we'll figure it out (we had the same problem in Vista pre-release with whether to call China P.R.C. or People's Republic of China)....
A part of me wonders whether (with 11,000,000 speakers) we won't wonder about not choosing to add a ps-PK (Pashto - Pakistan), too? :-)
So over in my Explaining the Windows XP/Server 2003 Regional and Language Options Dialog post (from 2004), Mahima Natarajan asked (this weekend):
Hi Michael..I want a non-unicode program "Toolbook" to display hindi. I don't have the option "Hindi" in the Select a language for Non-Unicode program under the Advanced tab. How do I include it? Im a novice and its urgent. Please help.
I don't usually jump in to answer questions that are urgent since I'm not an official support venue.
But I figured since the answer wasn't going to be the desired one, I'd let down fast and easy rather than drawing it out....
I'm not familiar with Toolbook specifically, but I'll assume it isn't Unicode based on the description.
As the post itself explains:
Language for Non-Unicode Programs (aka Default System Locale) - Located on the third tab, this setting is the one that controls, at the machine level, the locale that will be used for all conversions to and from Unicode for applications without Unicode support (like VB 6.0, for example). If you change the Default System Locale, you will be prompted to reboot afterwards (you may be prompted for your Windows CD first if you need to install some files). But I cannot stress it strongly enough: this is the top control on the third tab. You would not believe how many people mess this up and try to change the language at the top of the first tab under "Standards and Formats"! So think carefully and allow yourself to be one of the people laughing about the confusion, rather than one of the people being laughed at.
The list specifically filters out "Unicode-only" languages like Hindi, because there is neither an ACP nor an OEMCP that can support these languages.
In fact, looking at the Unicode-only locales, their ACP is 0 (which is CP_ACP's numeric value) and their OEMCP is 1 (which is CP_OEMCP's numeric value).
Thus there is really no way to support Hindi on Windows outside of Unicode....
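If you want to see this for yourself without touching any settings, here is a rough Python sketch. It assumes Python's built-in codecs are a fair stand-in for the Windows code pages, and the list of code pages is my own selection of the usual system default choices; the helper name is hypothetical.

```python
# Assumed list of the usual Windows system code pages, as Python codec names.
WINDOWS_CODE_PAGES = ["cp874", "cp932", "cp936", "cp949", "cp950"] + [
    f"cp{number}" for number in range(1250, 1259)
]


def code_pages_that_can_encode(text):
    """Return the code pages able to represent every character in text."""
    usable = []
    for code_page in WINDOWS_CODE_PAGES:
        try:
            text.encode(code_page)
        except UnicodeEncodeError:
            continue
        usable.append(code_page)
    return usable


print(code_pages_that_can_encode("Ελληνικά"))  # Greek has a home
print(code_pages_that_can_encode("हिन्दी"))      # Hindi does not
```

An empty list for the Hindi text is the whole story: there is no "language for non-Unicode programs" that could ever make it work.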
In fact, if you try to force the change in the registry, you can make your system unbootable and then it will force you to roll back to a "last known good" choice. And yes, I have had to help several people to untoast their systems after they tried to reverse engineer the default system locale and unsuccessfully tried to make this change!
Unicode is the answer here -- the only supportable answer, in fact.
Now Alan asked (early in 2010):
We have many initiatives currently to bring together a computing environment that will facilitate both English and Hawaiian languages. For example, we are starting to require our enterprise application to support entry/storage/display of Unicode characters, we have engaged via a 3rd party to work with the MS Office folks to develop a Hawaiian spellchecker. One of the key items was Unicode support at the OS level. While XP doesn’t do it well, IE7 and up + Windows Vista and up are starting to provide the out-of-the-box experience we need.
One thing that seems to be outstanding is fonts. In both out-of-the-box Windows XP and Windows 7, we found only 2 Unicode fonts (Arial and Lucida). Is there more available? The Hawaiian language has non-Latin characters. In the past, we’ve use other things like prepositions to *pretend* (e.g. an apostrophe to represent an okina) but that causes all sort of issue with things like spelling and grammar checker, text parsers, etc. Not to mention it doesn’t work with any dictionary files. So font support seems to be insufficient in that area and I would like to know if there are things current available from MS and what the roadmap is on general out-of-the-box Unicode font distribution.
Of course anyone who is a fan, or even a reader, of this Blog knows from part 5 that we are adding Hawaiian - United States in Windows 8, which would nominally make him happier....
But back then there was no such plan. Murray Sargent pointed out in response to the question raised:
Hawaiian uses the usual Latin script plus the okina, a phonetic glottal stop, which is encoded in Unicode as U+02BB. The Calibri font used by default in Word 2010 has this character. It appears in most other common Western fonts on Windows like Times New Roman, Arial, Cambria, and Courier New. Word 2010 supports Hawaiian language tagging, so I’d think a Hawaiian spell checker is either available or would work if one is added. Others copied may know for sure.
And back then, colleague Stu Stuple pointed out:
Word will look for a dictionary for any defined tag so it is possible to create an Hawaiian speller. Note that is an Office component, not part of Windows.
This kind of says it all.
It has been in all the core fonts for some time, and Word was thinking about this before Windows got the notion of being involved.
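The parsing trouble Alan described with the apostrophe stand-in is easy to reproduce. This Python sketch uses a deliberately simple \w+ tokenizer (my own choice for illustration; real spellers and parsers are more involved) to show why U+02BB behaves and the apostrophe does not.

```python
import re
import unicodedata

OKINA = "\u02bb"  # MODIFIER LETTER TURNED COMMA


def words(text):
    """Split text into words using a plain Unicode-aware \\w+ match."""
    return re.findall(r"\w+", text)


# The 'okina is a letter (Lm); the ASCII apostrophe is punctuation (Po).
print(unicodedata.category(OKINA), unicodedata.category("'"))

print(words("Hawai" + OKINA + "i"))  # stays one word
print(words("Hawai'i"))              # splits at the apostrophe
```

Any dictionary lookup built on top of that second result is going to be looking up the wrong words.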
Back then, John McConnell pointed out as a sanity check:
AFAIK, the only input issue with Hawaiian is the ‘okina. The character was added to Unicode a few years ago and has been in the core fonts nearly as long. Anyone could create a Hawaiian keyboard (with ‘okina) using MSKLC. When I’ve visited the islands, I’ve noticed that public signs don’t use the ‘okina, but use apostrophe instead. Input should also not be an issue.
Now between part 4 of the series (which allows Windows to add keyboards that aren't necessarily locales) and part 5 (which makes Hawaiian a locale), we are covered now.
Though the one item -- adding the 'okina to the "complex script" character list (so that if it isn't in a font, it will be found anyway) is the last part -- and that is a part of Windows 8, too.
It would be on them to provide a speller for Hawaiian in Office 15, I suppose? :-)
No idea what their plans are in this area.
Originally it was the "poster child" for the "localeless keyboards" feature in Windows before we made that a non issue by adding it anyway. But it is unclear whether our "off the cuff" plans would cause them to do something, too.
Now I suppose in the [now friendly!] rivalry between Office and Windows, this move in Windows 8 would (in Poker terms) be a "call and a raise" for Hawaiian....
This blog brought to you by ʻ (U+02bb, aka MODIFIER LETTER TURNED COMMA - typographical alternate for U+02BD or U+02BF, used in Hawai`ian orthography as `okina (glottal stop))
Over in the Suggestion Box, Aakash Mehendale asked:
Today's DailyWTF has an image of a newspaper article in which the text has come out mangled with a mangling I've never seen before. It seems to have become *mostly* x and z. Any ideas on what's happened here? Bad OCR? Some really oddball re-encoding issue?
" rel="nofollow" target="_new">thedailywtf.com/.../~QCCmp2Txt~.aspx
I can't find the article in question about the mangling.
Because, as you may have noticed, my Blog seems to have mangled the text.
I don't know what to do with it now.
Except maybe revel in the irony?
I didn't do it myself, but I seem to have lost some info here...
Now part 5 of this series put a slightly more direct take on new locales, by providing an explicit list of them.
But I wanted to talk about one locale in particular.
And one keyboard in particular.
The locale is Cherokee.
And the keyboard is the Cherokee Phonetic layout.
I remember talking with some folks from the Cherokee Nation (ᏣᎳᎩᎯ ᎠᏰᎵ) from Oklahoma, as well as someone at Microsoft from the Eastern Band of Cherokee Indians (EBCI, or ᏣᎳᎩᏱ ᏕᏣᏓᏂᎸᎩ). They were telling us of lots of the things they were doing to help increase the usage of technology in their homes and their headquarters in Oklahoma. It's one of the biggest reasons that I enjoy this work, being able to help such endeavors.
As a part of that, Cherokee Nation's Roy Bonny and Joseph Erb mentioned the Phonetic layout and how much easier people found it to use, though with a few problems:
1) The one built via MSKLC, which was dead key based, was forced to use the wrong letters in a few cases to make up for the fact that there was no solution to having three key strokes make one Unicode character -- e.g.
T + L + A to get Ꮭ (U+13dd, a.k.a. CHEROKEE LETTER TLA)
2) The IME-based solutions didn't work in all applications and the typing feel was not quite as natural as they would have liked.
3) The Cherokee-QWERTY keyboard that is pre-installed on Mac OS X forced people to memorize many shortcuts (not unlike the MSKLC layout -- same problem!), such as:
Roy and Joseph both conveyed the frustration their adults and elders had expressed to them at such shortcuts, and at the need to type incorrectly in order to type things correctly.
Honor is a big deal here, and this problem leaves a bad taste....
There is another layout, the "Cherokee Nation" layout. But that one fails on the "intuitive" metric. So if they learn it they can use it.
I really felt like more should be done here -- enabling language should be better than this; we should be better than this.
I should be better than this.
So I asked them for the list of keys and pronunciations:
I took the info, and I told them I couldn't promise them anything.
While nevertheless promising myself that I would solve this problem.
So, armed with the MSKLC layout that fell short of their expectations, the graphic above, and links to sites like this one which had some info on not just the Cherokee Nation desired phonetic layout but the alternate phonetic choices for the Eastern Band vs. Cherokee Nation, I decided to see if I could create a layout that would feel delightful to a Cherokee user.
Then of course, since this is me we're talking about, I had to blog about the methods I ended up using!
We had not yet announced that we were adding a Cherokee keyboard, or even a Cherokee locale, so I just talked about the technology, in Chain Chain Chain, Chain of Dead Keys and The Dead Keys Conundrum: An Encyclopedia Brown Mystery and both the Solution to the mystery and The Sally Kimball Addition to the solution.
This is the single most complicated layout we have, by a factor of two or more -- the most complicated layout ever hoping to be put in a Microsoft product.
If it succeeds, at least, in swaying the users.
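For anyone who has not run into chained dead keys, the idea of several keystrokes producing a single character can be sketched roughly as below. This is only an illustration under loud assumptions: the four key sequences are a tiny table I made up for the example (it is not the shipped layout), the names cherokee, SEQUENCES, and type_keys are hypothetical, and a real Windows layout does this inside the keyboard layout DLL rather than in Python.

```python
import unicodedata


def cherokee(name):
    """Look up a Cherokee syllable by the tail of its Unicode character name."""
    return unicodedata.lookup(f"CHEROKEE LETTER {name}")


# Assumed, heavily trimmed table: Latin key sequence -> one Cherokee character.
SEQUENCES = {seq: cherokee(seq.upper()) for seq in ("a", "la", "ta", "tla")}


def type_keys(keys):
    """Simulate chained dead keys: hold keystrokes until a sequence completes."""
    output = []
    pending = ""
    for key in keys:
        pending += key
        if pending in SEQUENCES:
            # A complete sequence: emit the single character and reset.
            output.append(SEQUENCES[pending])
            pending = ""
        elif not any(seq.startswith(pending) for seq in SEQUENCES):
            # Not the start of any known sequence: pass the keys through as typed.
            output.append(pending)
            pending = ""
        # Otherwise the keys so far act like a (chained) dead key: keep waiting.
    return "".join(output) + pending


print(type_keys("tla"))   # three keystrokes, one character
print(type_keys("tala"))  # two syllables, typed as they sound
```

Here T behaves like a dead key, T followed by L behaves like a second dead key chained onto the first, and only the final A produces output. That is the shape of the problem, even though the real layout has to deal with far more sequences than this.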
The completed layout, I nervously sent to Roy and Joseph and Tracy and Jeff Edwards - four users uniquely qualified to judge what ought to work, even if only because of how often others have fallen short.
I gave minimal instructions (basically how to install it), since I wanted to see how effective I was in "translating" all of this source material into an intuitive phonetic Cherokee keyboard.
I almost trembled at the presumption, believing myself able to accomplish this. Surely I must have missed something. Who can beat Apple at delighting users? That never happens, right?
So I held my breath, and waited.
And fretted.
Turns out, I shouldn't have worried so much. Because the results were beyond my wildest hopes.
In their own words:
From Joseph:
HSiyo Michael,
Thanks for all your help on this. We are really excited here on this keyboard. As for language experts for typing Cherokee , jeff is one of the best we have. He has perhaps typed up more Cherokee documents then anyone alive today. We rely a lot on Jeffs skill on typing fast and accurately. We will share the key board with many other Cherokees as we get closer. But jeff should be able to answer any question you might have.
Again thanks so much on this , it will change many lives here on our end
Wado Joseph
ᏣᎳᎩ ᏗᏟᏃᎮᏗ ᏂᏓᏳᏓᎴᏅᏧᎾᏕᎶᏆᏍᏗ ᎠᏓᏁᏢᏣᎳᎩ ᎠᏰᏟ
From Tracy (who noticed I got rid of one of the most notorious of the "bogus shortcuts" other layouts were forced to use):
Wow, you killed the ‘j’…there is a metal for that alone. Early glance is that is fantastic work. I’ll dig in more tomorrow.
From Jeff:
Well man we truly appreciate what you are doing for us. I wish I had made it out on initial visit to put a face with a name but my grandson was being born so I sort of had to stay behind!
But your work will be appreciated and used by not only oklahoma but also north Carolina Cherokees as well.
I like your way of thinking on the space after the s character. Unfortunately I have typed up more Cherokee documents than anyone on the planet combined! Not boasting by any means but I can type Cherokee faster than most can type English (Joseph words not mine). I actually got my Cherokee name, skasdi, from my mad Cherokee typing skills, which means in today's terms awesome, powerful or nasty, all apply. But if it's been typed chances are my fingers had a part in it. Over 6000 pages of curriculum, 5000+ word dictionary and everything in between. So to me that would be way easier grabbing the space instead of the x. I by no means consider myself a fluent speaker because I am not but when it comes to typing, reading or writing I can blow the doors off!
But using your keyboard would be best described as fluid non stop motion in my opinion which is awesome. You never have to look up, the leading is perfect and it just looks really nice when you look at the end product. Others I have used had quarks and tricks which was constantly interrupting your work but so far this one had been a home run.
But again thanks for your time and dedication to the cause. I can speak for everyone when saying this was needed years ago and it will truly help our language efforts. I will have to sneak off one day and thank you in person.
I think I may have successfully met the bar that I set for myself to delight the target users. :-)
I did finally get to meet Jeff (at the IUC last month), and the first impressions were backed up by the continuing delight. I stand ready to tweak the layout as they continue to use it for any problems they see, but so far so good.
I'll admit I'm pretty pleased about all this.
Especially the fact that no one else -- including Apple -- has ever solved this problem well enough for whatever percentage of the over 300,000 Cherokee want to use the script and type the language phonetically.
I want to visit Oklahoma next year some time, to help people and watch them be helped.
And then the next thing to work on: since even on Windows 8 you can't use MSKLC to load the full layout and then use it on other platforms, I'd like to look at some point into releasing the layout for download, for everyone who wants to type Cherokee text using this cool keyboard.
But for now, if you pick up the Developer Preview of Windows 8, you can try it out, too.
If the code I write for tools like MSLU or MSKLC is my prose, then the Cherokee Phonetic keyboard layout is my poetry. Enjoy!
So after going way off-topic in the last part, I thought I'd come back to the reservation.
Ready?
Every version of Windows -- plus XP SP2 and XP SP3 -- has added new locales.
We've never taken any out, once they've shipped.
And we've only ever removed four sorts (described in Four exceptions to prove the rule).
Anyway, so it is hardly breaking news that we added locales in Windows 8.
The full list we've added (already mentioned at the IUC35) is:
This is a very interesting list, for several reasons, one of which is the wide variety of reasons each one was added.
In particular, there are several here that I am quite pleased about, for all of which I recused myself from most of the justification conversations because I was personally rooting for them and didn't want an inappropriate bias to take away from my advocacy...
Now these aren't UI languages mind you -- these are just locales -- though as I pointed out earlier, locales are huge. It takes your breath away the first time you see your locale on the Standards and Formats list -- I know because I've witnessed it several times over the years (more on this another time).
So they are cool on their own, and I'll give part of the list of why I think a few of them are especially cool:
There is Central Kurdish (Arabic) - Iraq, which allowed me to be proven wrong about my earlier statements almost 2.5 years ago in The Whey doesn't get a locale, either.
And there is Tamil - Sri Lanka, which allows me to be a part of keeping ~1/3 of my promises about Tamil locales in Windows.
And there is Hawaiian (United States), which I despondently assumed would never be added, for no reason I could fathom, until the decision was made to not refuse to add it....
For the record, I'll say that Shawn Steele's advocacy of Hawaiian was a definite factor in its approval -- at the point where we asked about it, we were past the point of expecting to succeed; we were like the Chicago Cubs (assuming we already lost), which made the approval all the more sweet!
There is more here that is interesting and fun, which I'll be covering in future parts of the series....