
June, 2011

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Oops! (aka Locking yourself out of Windows)

    • 21 Comments

    Conceptually, it is a bit like locking yourself out of your car (by locking it up with the keys inside).

     But it is easy to do it -- change your password and include characters in the password that are not included on the keyboards you have for your logon screen.

    See Updating the keyboard list in the logon dialog for more on how to update the keyboard list of the logon screen -- how to get caught short here with a password containing letters not on any of those keyboards is left as an exercise for the absent-minded!

    This problem is also caused by a scenario that I found common personally when the first internal betas of MSKLC became available, actually.

    It seems people would create keyboards that did not contain all of the letters in the password they had.

    And this was back in the XP days where you might already be in session 0, and that keyboard change might be followed by being unable to unlock the computer or log in after a reboot.

    This ended up becoming quite an interesting "support" issue for those early MSKLC builds!

    And now for today's interesting problem for all you readers....

    Think of it as a Microsoft interview question that I would never ask, except maybe if it was for a test position and I really wanted to test the ability to think outside the logon box. :-)

    The question: how would you get them back into their machine?

    Scoring will be as follows:

    • 0 points for a solution that would not work (if it is creative enough I might give some points for it).
    • 5 points for the solution that the person at the Microsoft Help Desk suggested.
    • 10 points for any of the solutions that I or someone came up with that seemed kind of boring or ho-hum.
    • 15 points for any of the solutions that I or someone came up with that would work but that could be rather labor-intensive.
    • 25 points for any of the solutions that I or someone came up with that seemed kind of brilliant.
    • 50 points for any solution that hasn't ever been suggested to or thought of by me for solving this problem.

    Some time next week I'll do the roundup with my scoring decisions (which are arbitrary and final!) and any interesting methods that are missed....

    Ready? Set? Go!

  • Sorting it all Out

    There's no "I" in IDN, part 4: the 'path' to Hell is paved with IDN bugs

    • 16 Comments

    Prior blogs in this series:

    As you work through the IDN story in your own company, you will likely find an interesting mix in the support story, just as I have.

    Perhaps one of the most interesting areas has been in the tiny (one might say "puny" to invoke a groaner of a pun!) detail work.

    Like path detection.

    There are algorithms in RichEdit that colleague Murray Sargent tells me are quite sophisticated. You get to it in RichEdit with things like EM_GETAUTOURLDETECT and EM_AUTOURLDETECT and such. He has even mentioned the AutoURL stuff in his blog before (like here).

    Or in places like WordPad you just type a URL or a path, or load a file with one, and you can see the effect of turning the behavior on via these programmatic means -- it will detect the path and mark it as if it was a clickable URL/path in a browser.

    If you move over to Word it's even more sophisticated, with a config option in the UI:

    The good old "Replace Internet and network paths with hyperlinks" feature!

    Very cool.

    Until you include IDN in the mix, at least.

    Let's take four URLs that could easily be created if you have set up a machine to test out IDN (server names/namespaces changed to protect something or other), and try to put them in Word 2010:

    which was unable to properly detect one of the URLs out of the four.

    Can you guess why it failed?

    Or you could try the same four URLs in WordPad on Windows 7:

    Wow, 0/4. Not too sophisticated!

    I'll point Murray to this post, and within a few hours he'll tell me that the latest version of RichEdit (essentially the one on his machine) supports all four URLs.

    Of course not everything on Murray's machine gets checked in without a bug report so I'll work on that too. :-)

    Let's try pasting those same four URLs here to see what this Blog Editor does with them:

    http://नांदरी.日本国.test.corp.testcompany.com
    http://idn-iis1.日本国.test.corp.testcompany.com 
    http://idn-iis1.日本国.test.corp.testcompany.com/интернет_страница
    http://テストサイト.test.corp.testcompany.com

    Wow, that's disappointing.

    It looks like there are a bunch of URL detection functions that don't do so well with IDN.
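
    To make the failure mode concrete, here is a minimal Python sketch. Both patterns are hypothetical stand-ins that I made up for illustration; they are not the RichEdit, Word, or blog editor logic, which I have not seen. The first assumes host names are ASCII only, the second just stops at whitespace and URL delimiters.

```python
import re

# Hypothetical detectors for illustration only; not any product's real logic.
# ASCII_ONLY assumes host names contain nothing but ASCII letters, digits, "." and "-".
ASCII_ONLY = re.compile(r"https?://[A-Za-z0-9.-]+(?:/[A-Za-z0-9._~/-]*)?")
# PERMISSIVE accepts any run of characters up to whitespace or a URL delimiter.
PERMISSIVE = re.compile(r"https?://[^\s/?#]+(?:/[^\s?#]*)?")

# The four test URLs from this post.
URLS = [
    "http://नांदरी.日本国.test.corp.testcompany.com",
    "http://idn-iis1.日本国.test.corp.testcompany.com",
    "http://idn-iis1.日本国.test.corp.testcompany.com/интернет_страница",
    "http://テストサイト.test.corp.testcompany.com",
]


def detect(pattern, text):
    """Return the first span the pattern treats as a URL, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None
```

    With these assumptions, the ASCII-only pattern finds nothing at all in the first and fourth URLs and cuts the second one off after "idn-iis1.", while the permissive pattern returns each URL whole. A real detector has more to worry about (trailing punctuation, for one), so treat this as a picture of the bug class rather than a fix.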

    I wonder if UNC paths fare any better?

    I'm just kidding, I don't wonder, because I tried it.

    \\idn-iis1.日本国.test.corp.testcompany.com\сетевой

    The other two (WordPad and Word 2010) behaved about like the third URL did, for reasons that might be obvious if you think about how the Autodetect code works (or doesn't, in this case).

    Now this kind of stuff is obviously not core feature work, it's a nice little "extra", but really it isn't so nice when it screws up.

    As it does with IDN on the absolute latest version of every product/app/control I had immediate access to.

    Sounds like there are some bugs for people to look at, huh? :-)

    In the end, the roadpath to Hell is paved with IDN bugs!

  • Sorting it all Out

    There's no "I" in IDN, part 5: Stephen Colbert's job is not in any jeopardy

    • 15 Comments

    Prior blogs in this series:

    I suspect some of my readers are either fans or at least regular watchers of The Colbert Report.

    Perhaps just my smarter readers.

    Or maybe just the ones with basic cable....

    Today's blog ends up being about a combination "tip of the hat/wag of the finger" question.

    In honor of Stephen Colbert....

    The question goes something like this:

    SUBJECT: String.Compare for double byte characters in .Net

    I have following two string characters whose comparisons in SQL are equal, however I couldn’t figure out any comparisons in .net (culture/ordinal/case insensitive) that would return me equality. Any ideas?

    Goal is to not change SQL settings, but to find insensitive compare in .net.

    String.Compare(
        "0336753496aaa@ae2.dion.ne.jp",
        "0336753496aaa@ae2.dion.ne.jp",
        CultureInfo.InvariantCulture,
        CompareOptions.IgnoreCase)

    OR (Tried all combinations)

    String.Compare(
        "0336753496aaa@ae2.dion.ne.jp",
        "0336753496aaa@ae2.dion.ne.jp",
        StringComparison.OrdinalIgnoreCase)

    Now I'll consolidate the many different tips and wags:

    First of all, a wag of the finger since the question referred to "double byte characters" despite every string involved using Unicode, in a language (C#) that uses Unicode.

    Perhaps somewhat forgivable since the example was clearly referencing Japan, so perhaps the questioner was thinking about Japanese at the time. And therefore "double byte" was just old school thinking about CJK. Kind of like how they never migrated all those people off the FAREAST domain, even as everything else started referencing east Asia. Even though domain account migrations are so much easier these days after those thousands of migrations in Windows kind of forced ITG to get better at it....

    Second of all, a tip of the hat to the genuine attempt to try to do comparisons that fold out distinctions in an attempt to get parity between SQL Server and the .NET Framework.

    Third of all, a wag of the finger for ignoring the most important distinction in this case -- the implicit Width Insensitive nature of all _C*_A* collations in SQL Server. It could have been simulated by adding CompareOptions.IgnoreWidth to the first call, had the collation names not masked the fundamental nature of the "hidden width" (which makes me wonder if someone in SQL Server isn't worried about their weight too much)....
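
    For readers who want to see the width distinction outside of .NET, here is a rough Python sketch. It is an approximation under stated assumptions: NFKC normalization folds full-width forms but also folds other compatibility characters, so it is broader than CompareOptions.IgnoreWidth and is not what SQL Server collations do. The full-width input is hypothetical; I am not claiming which characters were full-width in the original question.

```python
import unicodedata


def fold_width_and_case(text: str) -> str:
    """Fold compatibility forms (including full-width) with NFKC, then fold case."""
    return unicodedata.normalize("NFKC", text).casefold()


def loosely_equal(left: str, right: str) -> bool:
    """Compare two strings while ignoring width and case differences."""
    return fold_width_and_case(left) == fold_width_and_case(right)


# Hypothetical input: three FULLWIDTH LATIN SMALL LETTER A (U+FF41) characters.
full_width = "0336753496\uff41\uff41\uff41@ae2.dion.ne.jp"
half_width = "0336753496aaa@ae2.dion.ne.jp"
```

    Here the two strings differ ordinally, but loosely_equal(full_width, half_width) treats them as the same address.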

    Fourth of all, a wag of the finger for taking a question obviously covering E-mail Address Internationalization (EAI) but doing it without even asking the question in a way or to a distribution list that suggested they were thinking about EAI.

    With a bonus fifth of all wag of the finger to SQL Server since it is hiding so much of the problem here that people come out of SQL Server wondering how to make other products act like them, rather than coming out asking the real questions....

    Okay, seems like a lot more wags than tips on this one. And that's even ignoring the extra wags I decided to leave for another day.

    I've decided I can't do "tip of the hat/wag of the finger" very well. I should leave that sort of thing to the professionals. From now on, I will.

    I'll talk more about EAI another day, too....

  • Sorting it all Out

    Since nobody @#%&*! owns en-US…

    • 8 Comments

    Long time readers may recall Who owns English, exactly? from back in November 2006 where I talked about the philosophical issues underlying who was the "owner" of transliterations and/or translations of native language names into English.

    Everything I said there was of course true, and the issues are of course very complicated.

    But there is an additional factor to be considered here too, one much wider than the "corner case" of translated/transliterated language names.

    The factor can be summed up in a simple question:

    Who the @#%&*! owns en-US?

    I mean, it's not like we have an English Language Academy we are beholden to, and there is no special ANSI standard we claim to be compliant with.

    The original locale was put together by people in the US, sure. And English Windows has never suffered from not enough beta testers.

    But all of that is hardly official, is it?

    This occurred to me the other day when someone asked me via mail:

    You have the value for LOCALE_IPOSITIVEPERCENT for the US locales as 0, shouldn’t this be 1?

    Now the LOCALE_IPOSITIVEPERCENT lctype documentation delineates the choices pretty clearly:

    Value Format
    0 Number, space, percent; for example, # %
    1 Number, percent; for example, #%
    2 Percent, number; for example, %#
    3 Percent, space, number; for example, % #
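
    As a small sketch of what those four values mean in practice (plain Python string substitution, not a call into the Windows NLS API; the function name is my own):

```python
# Patterns copied from the table above; "#" stands for the number.
POSITIVE_PERCENT_PATTERNS = {0: "# %", 1: "#%", 2: "%#", 3: "% #"}


def format_percent(number, pattern_id):
    """Substitute the number into the pattern selected by pattern_id (0 through 3)."""
    return POSITIVE_PERCENT_PATTERNS[pattern_id].replace("#", str(number))
```

    So format_percent(50, 0) gives "50 %" and format_percent(50, 1) gives "50%".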

    Now I have lived in the USA my entire life, though I have traveled abroad a lot (enough to make filling out an SF-86 too troublesome to ever bother trying to get cleared!).

    If you showed me a formatted string like

    %50

    or

    % 50

    I'd know it was wrong. But looking at

    50%

    vs.

    50 % 

    I could go either way; I see both of them, all the time.

    To be honest I don't like either of them; I think

    50 %

    or

    50 %

    using a THIN SPACE (U+2009) or a HAIR SPACE (U+200A) in between is better than any of the other choices -- between the original two "not wrong" choices, one is too close and the other is too far.

    But we don't have either as an option.

    And it's not like we have an authority we can ask.

    Since nobody @#%&*! owns en-US....

  • Sorting it all Out

    ●●●●●●●●●●●●●● isn't complex, no matter what the underlying language is

    • 6 Comments

    This blog you are reading was originally a very different one, based on people with questions about several forum posts like this one, this one, this one, and this other one. It was gutted and re-written after a large thread that took place yesterday ended up seeming a lot more on topic....

    The question was clearly written, so I will give credit for that, at least:

    Hi,

    We have a UI page that takes a password input. To increase the entropy we would like to ensure that the password as characteristics such as

    a) Great than x characters (current 8)
    b) Has at least one upper case
    c) Has at least one number

    We can use the GetStringType* APIs but I am not sure if for languages such as Japanese etc. if the API will help us do what we would like to, for example not sure if all languages have upper cases, numbers etc.

    What will you suggest to enforce a high entropy password that will work globally?

    <Please redirect as appropriate>

    Thanks!

    My first thought was of when a similar question came up a little over three years ago (as previously described in You want to know what's weak? Strong password rules, that's what's weak!).

    But pretty self-aware, if you catch my drift....

    It is quite easy to look at such "criteria" as being rather biased toward languages with case in them, and the question about Japanese makes things all the more complicated, truth be told. As one colleague responded in part:

    As you look at passwords in a global perspective, it seems like there are several things that could be considered besides case. For instance, suppose you expect passwords to be comprised of characters that are somehow related—e.g. they can all be generated by the same keyboard layout (seems like a good assumption), then it seems pretty significant that an Amharic input method can generate one order of magnitude more possible values for each information element than can an English input method, and a CJK input method can generate over two orders of magnitude more possible values. In other words, even though those writing systems do not have case, the set of likely keys will be 10^2^n and 10^3^n (for password length n) greater than for passwords generated using an English input method. In other words, while you don’t have case adding to the entropy, you have different factors that might potentially add much more robustness than case can provide.

    Of course, there are other factors for something like CJK: since each information element can be interpreted as a word, it seems like there’d be a certain user propensity to create passwords that are mnemonic phrases—and that will reduce the entropy.

    Okay, this might be overboard, though.

    Because one of the most interesting facts about password fields (once you ignore the fact that no one should ever be asking for passwords except through CredUI these days!) is very relevant to Amharic. And Japanese. And CJK.

    Password fields turn off the IME.

    I feel I should repeat that a little louder, as it seems vaguely important!

    Password fields turn off the IME.

    Of course this specific fact that undercuts almost every single example given doesn't attack the basic argument's soundness.

    By opening up the repertoire and suggesting more reasonable additional potential ways to add the required level of complexity, one can remove the more provincial aspects of this whole policy that to me represents the most famous of all the dumb interview questions: writing a RegEx expression to validate the password string....

    For example -- one can suggest multiple scripts, like with Japanese Katakana and Hiragana.

    Or one can suggest both full width and half width letters, like with Korean Jamo.

    Note that when they use the English keyboard for other languages (like English to get the Pinyin for Chinese), case is often not applicable there either. So requiring case often makes minimal sense, culturally.

    And before you suggest that I have knocked Japanese and Korean and Chinese out of the running, I'll point out one of the most important misunderstood features for Japanese and Korean (and even Chinese, though Chinese does it differently) -- the additional non-IME keyboard that is kind of "installed" and gets used when the IME itself is disabled.

    You can't put random Kanji/Hanja/Han in there, but you can certainly get the underlying Kana/Jamo/English that you would be typing.

    It gets even more interesting as you consider passwords in Indic languages/scripts and typing in strings of characters that can be a lot closer to meaningless given how some strings simply will never look right if you were to look at them rendered.

    Though as luck would have it, ●●●●●●●●●●●●●● isn't complex -- no matter what the underlying language is.

    And I'll add one basic clarifying assumption: people will generally type a password using a single keyboard -- so you have to decide the meaning of complex as it applies to a given keyboard.

    Or you can keep the current lame rules that tend to assume it's English or English-like.

    In the long run, I think we need to mix it up a bit, we need to add some additional entries to the recommended Password Complexity Requirements or even better the more recently updated Passwords must meet complexity requirements:

    Passwords must contain characters from three of the following five categories:

    • Uppercase characters of European languages (A through Z, with diacritic marks, Greek and Cyrillic characters)
    • Lowercase characters of European languages (a through z, sharp-s, with diacritic marks, Greek and Cyrillic characters)
    • Base 10 digits (0 through 9)
    • Nonalphanumeric characters: ~!#$%^&*_-+=`|\(){}[]:;"'<>,.?/ @
    • Any Unicode character that is categorized as an alphabetic character but is not uppercase or lowercase. This includes Unicode characters from Asian languages.
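
    To see how that five-category rule treats non-Western input, here is a rough Python sketch. It approximates the categories with Unicode general categories from the unicodedata module rather than calling GetStringTypeW, so edge cases will differ from what Windows actually does; the function names are my own.

```python
import unicodedata

# Nonalphanumeric characters as listed in the policy text above.
SYMBOLS = set('~!@#$%^&*_-+=`|\\(){}[]:;"\'<>,.?/')


def categories_used(password):
    """Return the set of policy categories that appear in the password."""
    used = set()
    for ch in password:
        if ch in "0123456789":
            used.add("digit")
        elif ch in SYMBOLS:
            used.add("symbol")
        elif unicodedata.category(ch) == "Lu":
            used.add("upper")
        elif unicodedata.category(ch) == "Ll":
            used.add("lower")
        elif ch.isalpha():
            # Alphabetic but caseless, e.g. Kana, Hangul, Han, Thai.
            used.add("other_alpha")
    return used


def meets_policy(password):
    """True if the password draws on at least three of the five categories."""
    return len(categories_used(password)) >= 3
```

    Under these assumptions "Passw0rd" passes on case and a digit, while an all-Hiragana password lands in a single category and needs a digit and a symbol added before it passes, which is exactly the bias toward cased scripts being discussed here.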

    Now look, the article makes it clear that they are using GetStringTypeW to make their decisions, which means that a whole lot of Unicode is available here -- and the current restrictions seem tailored to support some Japanese and Korean just as well as "Western" cased scripts. So we are already part way there....

    So now let's just take it the rest of the way and open up this list to even more possibilities!

  • Sorting it all Out

    An irresistible force walks into an immovable object (aka the Thai that binds us)

    • 5 Comments

    It dates back to some time between when Windows XP shipped and Windows Server 2003 shipped.

    Suddenly, in a wee bit longer than a blink of an eye, it was determined that Microsoft's essentially promiscuous use of both the Unicode PUA (private use area) and unpaired Unicode surrogates were not only bad and/or wrong -- they were in some situations potential security issues.

    This was kind of the point of Keeping out the undesirables? and similar blogs.

    Now if you ask me, we went a little overboard on this one, which is the theme of The torrents of U+fffd (aka When security and conformance trump compatibility and reality) and other U+fffd blogs in this Blog. But that's neither here nor there, today.

    I just wanted to point out that now, all these years later, the work is done. We have bombed the PUA village and we've strafed the unpaired surrogate survivors.

    Even when (as pointed out in Short-sighted text processing #5: PU[A]! That pad THAI is pretty spicy....) we find we have told backward compatibility to Get Bent.

    The PUA is gone from our hearts and our minds and our memories -- never to darken our doorstep again.

    Um, except the place we forgot.

    The Microsoft implementation of The Thai Pattachote keyboard layout.

    You can see it right here. Or otherwise here it is, hidden in plain sight right atop the Enter key for over a decade:

    This is just Microsoft, mind you.

    The various other places where you can see the Thai Pattachote keyboard layout, like here and pretty much most other places, do not have it, if you ignore a few places like here, which seem to be following us in putting a U+f8c7 on that key.

    Note that is not one of the PUA characters on the list that once gave us Thai as we wanted it that I mentioned in Short-sighted text processing #5: PU[A]! That pad THAI is pretty spicy.... This is just a random piece of the PUA whose continued presence is mostly due to this error of ours dating back to at least Windows NT 4.0, if not earlier.

    We even have it mapped in Code Page 874, albeit underhandedly.

    Oops, we have it there in the code page, too!

    Now we have an interesting decision to make, one that will test the universal prohibition that has existed about never removing anything from a keyboard layout once it has shipped.

    Do we remove U+f8c7 from the Thai Pattachote keyboard layout in future versions?

    In this true case of an irresistible force versus an immovable object, it will be interesting to see what we do....

    The advantage here is that no matter what, we are being consistent with previously established behavior, but the disadvantage is that no matter what, we are betraying a previously established pattern of behavior....

  • Sorting it all Out

    There and Back Again (aka ACP --> UTF-8 --> ACP)

    • 5 Comments

    Shortly after Raymond's How do I convert an ANSI string directly to UTF-8?, someone with the handle (I'm assuming it is not his real name) of Otis asked me to weigh in on the issue.

    I chose to hold back, it seemed to me that the blog and comments to it were proceeding appropriately.

    But Otis would periodically ping me on it, thinking there was perhaps more to say.

    I did respond to the other question Otis asked me, about whether I was jealous of the fact that more people commented on Raymond's blogs than mine -- I'm not, and you only have to look at many of the comments over there to see why. I'd probably stop reading the comments or turn them off if I had to deal with all that!

    But the big question:

    Is there a way to convert an ANSI string directly to UTF-8 string? I have an ANSI string which was converted from Unicode based of the current code page. I need to convert this string to UTF-8.

    Currently I am converting the string from ANSI to Unicode (MultiByteToWideChar(CP_ACP)) and then converting the Unicode to UTF-8 (WideCharToMultiByte(CP_UTF8)). Is there a way to do the conversion without the redundant conversion back to Unicode?

    has already been answered as well as I would have answered it -- there isn't one. You should use Unicode as your pivot encoding between the ACP and UTF-8.
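
    The same two-step pivot can be sketched in Python, where the intermediate Unicode string is explicit. This is only an analogy to the MultiByteToWideChar/WideCharToMultiByte pair in the question; the "cp1252" default is my assumption standing in for whatever the ACP happens to be on a given machine.

```python
def ansi_to_utf8(ansi_bytes: bytes, ansi_codec: str = "cp1252") -> bytes:
    """Convert "ANSI" bytes to UTF-8 by pivoting through a Unicode string."""
    text = ansi_bytes.decode(ansi_codec)  # step 1: ANSI code page -> Unicode
    return text.encode("utf-8")           # step 2: Unicode -> UTF-8
```

    For example, ansi_to_utf8(b"caf\xe9") returns b"caf\xc3\xa9". Anything the original Unicode-to-ACP conversion already dropped stays dropped, of course.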

    And bemoan the fact that the ACP can't be UTF-8, since that would make this question much easier to answer.

    I could post many links to blogs that continually tell the story about how that isn't gonna happen but I'll just be lazy and point to one of them: UTF-8 and GB18030 are both 'NT' code pages, they just aren't 'ANSI' code pages. It points to some of the others.

    It still ain't gonna happen.

    If you are using the ACP anywhere, then you are lossy -- you should stop doing that. Keep around the old interface if you must but tell people to not use it.

    I'll have to talk more about the UELNT project (as I called it in that post) aka the MSL8 project (as I call it here), one of these days.

  • Sorting it all Out

    No, we're still not gonna see Klingon in Unicode (Pssst! Don't tell Shawn!)

    • 5 Comments

    After Klingon was officially rejected from Unicode (as I hinted at previously in Fictional could make things less functional), we all thought we'd heard the last of it.

    Well, not all of us.

    Just those of us with whom I find myself most comfortable hanging around. :-)

    Then just yesterday Karl Williamson sent the following to The Unicode List:

    I just found out about the information contained in the article linked
    to below.  I wonder if people knew about this when it was decided to not
    encode Klingon:

    Klingon interpreter needed for social work

    May 12 2003 at 10:12am

    --------------------------------------------------------------------------------

    Portland, Oregon - Position Available: Interpreter, must be fluent in Klingon.

    The language created for the Star Trek television series and movies is one of about 55 needed by the office that treats mental health patients in metropolitan Multnomah County.

    "We have to provide information in all the languages our clients speak," said Jerry Jelusich, a procurement specialist for the county's department of human services, which serves about 60 000 mental health clients.

    Although created for works of fiction, Klingon was designed to have a consistent grammar, syntax and vocabulary.

    And now Multnomah County research has found that many people - and not just fans - consider it a complete language.

    "There are some cases where we've had mental health patients where this was all they would speak," said the county's purchasing administrator, Franna Hathaway.

    Officials said that obligates them to respond with a Klingon-English interpreter, putting the language of starship Enterprise officer Worf and other Klingon characters on a par with common languages such as Russian and Vietnamese, and less common tongues including Dari and Tongan. - Sapa-AP

    Note the date, of course!

    Luckily Peter Edberg jumped in quickly before it oscillated into one of those long threads:

    That article was probably based on a mistake. See about 2/3 of the way down in
    http://www.cbsnews.com/stories/2003/05/14/national/main553897.shtml:

    ---
    Klingon Interpreters Out Of Work After All

    PORTLAND, Oregon - Sorry, no Klingon interpreters needed, after all. The government
    agency that treats mental health patients in the Portland, Oregon, area had listed
    Klingon as one of 55 languages that clients might speak. Now, Multnomah County
    officials are taking back their call for Klingon interpreters. County Chair Diane
    Linn says the inclusion of the "Star Trek" language on the list was a mistake.
    Officials note that no mental patient had ever come in speaking only Klingon. And
    not a dime of public money was spent on Klingon interpretation.
    ---

    Sorry.
    - Peter Edberg (Eugene, Oregon)

    This nicely managed to stop the conversation before it started.

    And oh what a conversation it would have been!

    Thankfully, there are not enough psychiatric patients claiming to only be able to communicate in Klingon that an interpreter would be required.

    And certainly not enough reading and writing in it, using the non-Latin script that not even the Klingon Language Institute uses!

    I suppose one day I should more fully cover the rejection directly.

    And I shall.

    Another day....

  • Sorting it all Out

    To get Korean wrong you have to SHIFT things a bit...

    • 4 Comments

    No, this blog is not a political statement about policies of Kim Jong-il or the DPRK or the next ruler. It is a technical blog about a keyboard....

    It started not long after the Keyboard Layouts, everywhere! blog I wrote.

    The one that pointed out the revamped site showing all the Windows keyboard layouts, which in this new incarnation would support not only Internet Explorer, but now also Firefox, Opera, Safari, and Chrome.

    You can call it

    http://msdn.microsoft.com/en-us/goglobal/bb964651

    Or you can call it

    http://msdn.microsoft.com/bb964651.aspx

    Or you can even call it

    http://msdn.com/bb964651.aspx

    They all resolve to that same site.

    Shortly after that blog, in The key KEY perspective is that 'Perspective is Key', I discussed an issue in the Japanese and Korean layouts on that site that is an intentional choice, but one that some might consider to be unintuitive or even a bug.

    In the process of that, a bigger issue which one could much more defensibly consider a bug was being ignored.

    It can be found in the Korean keyboard layout. Not in the base state of either the ENG or KOR modes:

      

    both of which remain bug free. Instead if you look at the SHIFT mode on both you'll see some problems:

       

    Now there are two problems here.

    Did you spot them both?

    They are:

    • The parentheses are backwards in both modes, and
    • The KOR mode has a bunch of blank keys on it, while the actual keyboard types the same characters that exist in the BASE state.

    You can look at this picture for a more obvious highlighting of the problems (the first problem in red, the second in green):

    Luckily the soft keyboard built into the IME does it all correctly, though:

    Now I guess the idea of only showing the "doubled" Jamo (I talked about the doubled Jamo previously in Ssang Your Life (or alternately: I'd Like To Teach the World To Ssang)) was intended as a way to emphasize them via contrast, but honestly it would probably be better to be accurate here....

    As for the reversed parentheses, that's simply a straight bug without even a weak justification....

    It looks a little like something else that isn't as much of a bug though it is a little bit of one that I'll talk about another day; but THIS is definitely just a bug.

    These bugs should definitely be looked into, though -- being correct is so much better for everyone!

  • Sorting it all Out

    Does your code avoid the [government sanctioned] Y1.45K bug?

    • 4 Comments

    So I was asked if I knew what the following comment from the site was about:

    The Microsoft Implementation of the Umm al-Qura Calendar

    The implementation of the Umm al-Qura calendar in Microsoft’s recent operating system Vista is claimed to be valid between 1318 AH and 1450 AH. However, as the Microsoft Vista algorithm erroneously assumes that the computational rule used between 1395 AH and 1419 AH was also used before 1395 AH the Microsoft algorithm will often give faulty results for dates before 1395 AH.

    I am really not sure what this comment refers to.

    I mean I have mentioned issues I know about in blogs like Long term planning is not always done, and Using a culture's format, without using that culture to format?.

    Not to mention Evil Date Parsing lives! Viva Evil Date Parsing!, explained.

    Or Grody to the Max[Date]!.

    Even without issues that would cut down our effective dates, looking to the future the 1450 limit would take us to (unless my mental math is off) some time in 2028, which is not that many years away.
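
    For anyone who wants to check that mental math, here is a back-of-the-envelope Python sketch. It uses the mean year length of the arithmetic (tabular) Islamic calendar, which is an assumption on my part and is not how the Umm al-Qura calendar is actually determined, so it is only good to within a few days.

```python
# Assumptions: the 30-year tabular cycle has 10631 days, the mean Gregorian
# year has 365.2425 days, and the Hijri epoch (mid-July 622 CE) is about 622.54.
TABULAR_HIJRI_YEAR_DAYS = 10631 / 30
GREGORIAN_YEAR_DAYS = 365.2425
HIJRI_EPOCH_AS_YEAR = 622.54


def approx_gregorian_start(hijri_year: int) -> float:
    """Approximate Gregorian date, as a fractional year, on which a Hijri year begins."""
    elapsed_hijri_years = hijri_year - 1
    return (HIJRI_EPOCH_AS_YEAR
            + elapsed_hijri_years * TABULAR_HIJRI_YEAR_DAYS / GREGORIAN_YEAR_DAYS)
```

    By this estimate 1450 AH begins in roughly the spring of 2028 and 1451 AH about a year later, so the supported range runs out within a year or so of the figure above either way.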

    With computer programs written in .Net that work with MaxDate so gratuitously (and I have seen many of those), this is not a very wide range.

    Given that there are features in Windows that can deal with dates outside of this range (including certificate creation), I think any application running on Windows that doesn't handle the Umm al-Qura calendar is pretty irresponsible....

    By the way, does anyone know what that "bug" in our implementation is? I'll ask some people around here, but I thought I'd ask my readers first in case anyone knew.

  • Sorting it all Out

    One of the coolest parts of my job is when I don't have to do it

    • 4 Comments

    Maybe I should explain the title statement before it ends up having an effect on my upcoming review....

    You see, the other day, Dan asked:

    Subject: What are the F4 80 80 80 bytes in a UTF-8 HTML email body for?

    Hello;

    I’m working on an issue involving UTF-8 encoding.  The customer is seeing these bits coming back from an Exchange EAS call which contains utf-8 encoded content which contains: F4 80 80 80. This content is in the HTML of an email body which contains simplified Chinese and English characters.   The customer is saying that these bits are not valid.  Are these some sort of special marker in UTF-8 encoding?

    Thank you,
    Dan

    I saw the message on my phone, but I was at a late lunch (it was 1:34pm on a Wednesday). I figured I could answer it when I got back to my office.

    But there really was no need....

    Because at 2:08pm, just 34 minutes later, colleague Laurentiu responded:

    Hello Dan,

    <F4, 80, 80, 80> is a well-formed UTF-8 sequence that represents the Unicode character U+100000 (five zeros), which is a Unicode supplementary private-use character.  Unicode contains three private-use areas: U+E000–U+F8FF in Plane 0 (Basic Multilingual Plane), and two supplementary planes, Planes 15 and 16 (Supplementary Private Use Area-A and -B).  U+100000 is the first character of Plane 16.

    Although they can contain anything the user defines, supplementary private-use areas are typically used for CJK ideographs that are not encoded in Unicode.  So it looks like somebody is using a PUA character, possibly a private CJK ideograph.

    Regards,
    L.
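
    Laurentiu's decoding is easy to check for yourself. Here is a minimal Python sketch (the variable names are mine, not anything from the mail thread) that decodes the four bytes strictly and reports the code point, its plane, and its general category:

```python
import unicodedata

raw = bytes.fromhex("F4808080")          # the four bytes from Dan's question
text = raw.decode("utf-8")               # strict decoding; raises on ill-formed input

code_point = ord(text)
plane = code_point >> 16                 # each plane holds 0x10000 code points
category = unicodedata.category(text)    # "Co" is the private-use category

print(f"U+{code_point:06X}, plane {plane}, category {category}")
```

    If the sequence were not well-formed, the decode call would raise a UnicodeDecodeError rather than hand back a character, which is a quick way to settle the "are these bytes valid?" part of the question.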

    And there it is -- knowing that there are other people who know the answers to many of the random globalization, localizability, world-readiness, and other issues to help out when many of these questions come up, and who are willing and able to respond is a huge help as more and more people are trying to do the right thing in their projects and products and support cases.

    I still do answer many questions. But there are many others who help out as well.

    And that is one of the coolest parts of my job -- the fact that so many others are around to help do it! :-)

    There are still questions that no one else seems to know the answers to, so I still have some utility. But it's nice to know that there are others around with knowledge and interest....

  • Sorting it all Out

    Translating Tamil songs isn't entirely there yet....

    • 4 Comments

    I am so predictable, I really am.

    Like when I saw Google Translate welcomes you to the Indic web on the Google Blog, wasn't the idea that I would respond by blogging about it kind of a foregone conclusion?

    I decide to start by taking பிறப்பொக்கும் எல்லா உயிர்க்கும் (Pirapokkum ella uyirkum), the song I talk about in The song starts something like "All are equal at birth, they should all live as one race...", and translate it.

    Of course the site is an Alpha version, as their link says:

    You can expect translations for these new alpha languages to be less fluent and include many more untranslated words than some of our more mature languages—like Spanish or Chinese—which have much more of the web content that powers our statistical machine translation approach. Despite these challenges, we release alpha languages when we believe that they help people better access the multilingual web. If you notice incorrect or missing translations for any of our languages, please correct us; we enjoy learning from our mistakes and your feedback helps us graduate new languages from alpha status. If you’re a translator, you’ll also be able to take advantage of our machine translated output when using the Google Translator Toolkit.

    Remembering that, the goal is not poetry -- it is understandability....

    It starts with something like "All living humans are one in circumstances of birth" and goes from there, right?

    And this site would be the way to take communication in Tamil and make it understandable to those who do not know Tamil.

    Here is the original Tamil and the translation as of a couple of hours ago, for your perusal:

    Original text Transliteration Google Translate text

    பிறப்பொக்கும் எல்லா உயிர்க்கும் -
    பிறந்த பின்னர்,
    யாதும் ஊரே யாவரும் கேளீர்

    உண்பது நாழி உடுப்பது இரண்டே
    உறைவிடம் என்பது ஒன்றே என
    உரைத்து வாழ்ந்தோம் – உழைத்து வாழ்வோம்

    தீதும் நன்றும் பிறர்தர வாரா எனும்
    நன்மொழியே நம் பொன் மொழியாம்
    போரைப் புறம் தள்ளி
    பொருளைப் பொதுவாக்கவே
    அமைதி வழி காட்டும்
    அன்பு மொழி
    அய்யன் வள்ளுவரின் வாய்மொழியாம்

    செம்மொழியான தமிழ் மொழியாம்…

    ஓரறிவு முதல் ஆறறிவு உயிரினம் வரையிலே
    உணர்ந்திடும் உடலமைப்பை பகுத்துக் கூறும்
    ஒல்காப் புகழ் தொல்காப்பியமும்
    ஒப்பற்ற குறள் கூறும் உயர் பண்பாடு
    ஒலிக்கின்ற சிலம்பும், மேகலையும்
    சிந்தாமணியுடனே வளையாபதி குண்டலகேசியும்

    செம்மொழியான தமிழ் மொழியாம்…

    கம்ப நாட்டாழ்வாரும்
    கவியரசி அவ்வை நல்லாளும்
    எம்மதமும் ஏற்றுப் புகழ்கின்ற
    எத்தனையோ ஆயிரம் கவிதை நெய்வோரும்
    புத்தாடை அனைத்துக்கும்
    வித்தாக விளங்கும் மொழி

    செம்மொழியான தமிழ் மொழியாம்…

    அகமென்றும் புறமென்றும் வாழ்வை
    அழகாக வகுத்தளித்து
    ஆதி அந்தமிலாது இருக்கின்ற இனிய மொழி -
    ஓதி வளரும் உயிரான உலகமொழி
    நம் மொழி – நம் மொழி – அதுவே

    செம்மொழியான தமிழ் மொழியாம்…

    தமிழ்மொழி – தமிழ்மொழி – தமிழ்மொழியாம்
    தமிழ் மொழியாம் – எங்கள் தமிழ் மொழியாம்
    வாழிய வாழியவே… தமிழ் வாழிய வாழியவே…
    செம்மொழியான தமிழ் மொழியாம்…




     

    Pirapokkum ella uyirkum
    pirandha pinnar

    yaddhum ooreee yaavarum kellirrr

    onnbadhu naazhi udupathu irende

    uraividam enbadhu ondre

    uraithu vazhndhom uzhaithu vazhvom..

    theedhum nandrum pirar thara vaarai yenum
    nan mozhiye nam pon mozhiyaam

    porrai puram thallzhi porulai podhuvaakave..

    amaidhi vazhi kaatum anbu mozhi
    ayyan valluvarin vaaimozhiyaam

    SEMMOZHIYAANA TAMIZH MOZHIYAAAM

    orrarivu mudhal aararivu uyirinam varayile
    unarndhidum udal amaipai pagirthu koorum

    orrarivu mudhal aararivu uyirinam varayile
    unarndhidum udal amaipai pagirthu koorum

    thozhgapugal thozhgapiyamum
    oppatra kural koorum uyar panpaadu

    olikindra silamubum meghalayum sindhamaniyudane..

    vazhayapathi kundalakesiyumm..

    SEMMOZHIYAANA TAMIZH MOZHIYAAAM

    kamba naataalvarum kavi arasiyevai nallaalum
    yemmadhamum yetrum puzhgal endrum
    yethanayo aayiram kavidhai neivor tharum
    thadai anaithukkum vithaaga vilangum mozhi

    SEMMOZHIYAANA TAMIZH MOZHIYAAAM

    aagam endrum puram endrum
    vazhvai azhagaaga vaguthalithu
    aadhi anddam illathu irukindra iniya mozhi
    modhi vazharum uyiraana uzhaga mozhi
    thamm mozhi namm mozhi
    adhuve...

    SEMMOZHIYAANA TAMIZH MOZHIYAAAM

    tamizh mozhi tamizh mozhi tamizh mozhi ya ghaaa...

    SEMMOZHIYAANA TAMIZH MOZHIYAAAM

    Thamizh Mozhiyam Engal Thamil Mozhiyam

    Vazhiya Vazhiya ve..Thamizh .. Vazhiya Vazhiya ve

    SEMMOZHIYAANA TAMIZH MOZHIYAAAM

    Pirappo of all life -
    After birth,
    No Information

    Now now only eating utuppatu
    That is the only shelter
    Uraittu lived - working live

    Pirartara titum bad as well
    Our golden nanmoliye moliyam
    Away at war
    Potuvakkave item
    Quiet way
    Love language
    Ayaan's vaymoliyam valluvar

    Moliyam cemmoliyana Tamil ...

    The first creature orarivu ararivu varaiyile
    Shape analysis and confirmed unarntitum
    Tolkappiyamum olka fame
    Rate this unique element of high culture
    Olikkinra cilampum, mekalaiyum
    Projects cintamaniyutane kuntalakeciyum

    Moliyam cemmoliyana Tamil ...

    Thanmalwila nattalvarum
    Well kaviyaraci Avvai
    Load praising emmatamum
    Neyvorum many thousand poetry
    Refresh all
    Vitt, as a language

    Moliyam cemmoliyana Tamil ...

    Puramenrum life akamenrum
    Beautiful vakuttalittu
    My language is antamilatu Adi -
    Oti developing codenamed ulakamoli
    Our language - our language - it

    Moliyam cemmoliyana Tamil ...

    Tamilmoli - tamilmoli - tamilmoliyam
    Moliyam Tamil - Tamil moliyam us
    Tamil Site Support Site Support ... ...
    Moliyam cemmoliyana Tamil ...









    Perhaps we have not hit complete understandability just yet.

    Well, let's just call it a work in progress. We'll try back in a month or three and see if there are any improvements.

    In the meantime, I can still point out that I am completely in love with Shruti Hassan's voice, for those following along with that kind of thing.... :-)

  • Sorting it all Out

    There's a ™ joke in here somewhere, I just don't know what is (aka And if a 't' turned out to be 'm'…)

    • 3 Comments

    It isn't always true, but it often is -- the issue, when it came up, was brought up by a tester.

    This makes sense, really.

    No matter how much self hosting people like me might do, by the very nature of the job they do, most of the day of the tester is a mix of random ad-hoc, directed ad-hoc, and other testing....

    Now do you remember when I talked about pseudo-localization, back in One of my colleagues is the "Pseudo Man" (a rich source of puns in conversation!)?

    Well, a tester noticed that in the pseudo build a bunch of strings that were supposed to have "t" in them seemed to have "m" in them instead.

    A shaping problem? A font problem? A pseudo problem? No one was sure.

    Well, I did know there was no shaping involved. A font problem seemed unlikely, and a pseudo issue seemed far-fetched.

    The mail thread had some perplexed folks on it, let me tell you.

    Now the way pseudo works, each letter has a set of various "replacement" characters that look like them. So take, for example:

    t      U+0074    LATIN SMALL LETTER T

    For this letter, there are several potential "alternates":

    Letter  Codepoint  Name
    ţ       U+0163     LATIN SMALL LETTER T WITH CEDILLA
    ť       U+0165     LATIN SMALL LETTER T WITH CARON
    ŧ       U+0167     LATIN SMALL LETTER T WITH STROKE
    τ       U+03c4     GREEK SMALL LETTER TAU
    т       U+0442     CYRILLIC SMALL LETTER TE
    ｔ      U+ff54     FULLWIDTH LATIN SMALL LETTER T
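
    To make the mechanism concrete, here is a rough Python sketch of that kind of substitution. It is not the actual pseudo-localization tool; the ALTERNATES table, the pseudo_localize function, and the seed are all hypothetical, and only the six code points come from the table above:

```python
import random
import unicodedata

# Alternates for "t", taken from the table above. A real tool would carry a
# table like this for every letter; this one is purely illustrative.
ALTERNATES = {
    "t": ["\u0163", "\u0165", "\u0167", "\u03c4", "\u0442", "\uff54"],
}

def pseudo_localize(text, rng):
    """Replace each letter that has alternates with a randomly chosen one."""
    out = []
    for ch in text:
        choices = ALTERNATES.get(ch)
        out.append(rng.choice(choices) if choices else ch)
    return "".join(out)

source = "title text"
rng = random.Random(2011)   # fixed seed so runs are repeatable
result = pseudo_localize(source, rng)

# List what each substituted character really is, whatever it looks like.
for before, after in zip(source, result):
    if before != after:
        print(f"{before} -> U+{ord(after):04X} {unicodedata.name(after)}")
```

    The point of printing the names is that the replacement is a different character from U+0074, so how it ends up looking on screen is entirely up to the font.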

    There was a part of me that was sad that the Tenge was not on the "Capital T alternatives", especially after I wrote It is with a tenge of sorrow that I say this. I mean, I'm over it now. But I was sad for a little while....

    Anyway, back to the t that becomes an m, do you know what's going on?

    Just think of it as yet another flavor of the issue I talked about in blogs like Small case is not just tinier capitals; italics are not merely slanted letters and You say ĭtalics, I say ītalics. It is much more complicated in Cyrillic...., previously.

    Judy pointed out some of the form issues, with art:


    This fact interested me for many different reasons.

    So I went off and looked at what various fonts did.

    I am pretty sure something in here might be weird. Probably not bugs though:

    Sometimes your t is gonna look like an m; deal with it!

  • Sorting it all Out

    Whither アイヌ・イタㇰ (Ainu)?

    • 3 Comments

    The message I received the other day via the Contact link was:

    Dear Mr. Michael S. Kaplan,

    We see Microsoft created Language Interface Packs for near extinct languages like Irish and locales for endangered langages like Sami. We would like Microsoft to consider creating a locale for Ainu, a language of the island of Hokkaidō. Most of us from Hokkaidō only can read and write Japanese due to policies that discriminated for many years, but we see that Meiryo, Meiryo UI, MS UI Gothic,  MS ゴシック, MS Pゴシック, MS 明朝, and MS P明朝 fonts all support extended Katakana used for Ainu. Can Microsoft take the additional steps to support Ainu?

    I am Ainu. Our language is dying, but this does not have to be. A new life in Windows can save us.

    Thank you for reading this communication and sorry for the poor English.

    I'll admit I feel a little bit under-equipped to respond to this message - my knowledge of Ainu, and of the Ainu, is not terribly extensive.

    Though for the record I have no complaints about the English in the request, I only wish my Japanese could be good enough to have understood a message written in it!

    The information in blogs like Why one LIP and not another? and One Uyghur walked into a Blog, and... help explain many of the complex factors that are involved in the language and locale lists of Windows.

    I will communicate the request on, but of course asking me is not what makes the decision happen.

    In the case of Ainu, efforts of the Japanese government, the Microsoft subsidiary in Japan, and the MS Public Sector folks in country would likely be important (just as efforts of the analogous people in Canada were important for the Inuktitut LIP and the Mohawk locale).

    To be honest, I would hate to think of Microsoft, and Windows, being the only hope here. At best we can be a passive tool that can help an active effort to revitalize a language (on the whole it seems like proofing tools do more for that than anything Windows can do when it comes to supporting a language!).

    The BCP-47 tag for Ainu in this case would likely be ain-JP, and a custom locale could be created for it quite easily. Collation may already work (Kana works in all locales -- and the Katakana Phonetic Extensions are covered with the rest of the Kana). A dedicated keyboard via MSKLC would also help (a full IME wouldn't be needed, just better Katakana coverage). As the person asking noted, the characters are already in the fonts....
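
    For anyone curious which letters are the "extended" ones, here is a small Python sketch that walks the string from the title of this post and flags anything in the Katakana Phonetic Extensions block (U+31F0 through U+31FF). The variable names are mine, and the last letter is written as an escape just so it cannot get lost in transit:

```python
import unicodedata

# Katakana Phonetic Extensions block: U+31F0..U+31FF
EXT_FIRST, EXT_LAST = 0x31F0, 0x31FF

word = "アイヌ・イタ\u31f0"   # the string from the title of this post

def is_extension(ch):
    return EXT_FIRST <= ord(ch) <= EXT_LAST

for ch in word:
    marker = "phonetic extension" if is_extension(ch) else "regular Katakana block"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({marker})")

extension_letters = [ch for ch in word if is_extension(ch)]
```

    Only the final small KU comes from the extension block; everything else is ordinary Katakana that any Japanese keyboard or IME already reaches, which is why a dedicated keyboard layout rather than a full IME would probably be enough.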

    In fact, I would do it myself right here if I had a source from which to get the data like month and day names. I found many references to the work of Emiko Ohnuki-Tierney in regard to the concepts of time among the Ainu and they clearly have both months and days. But neither days nor months were named, and I had no luck finding this information elsewhere in quick web searches -- perhaps if anyone knows of any they could point the info out?

    Languages in trouble are definitely a matter of concern, always. But in cases like this we shouldn't even wait for a locale or a LIP that may or may not arrive -- there is work that can be done today....

  • Sorting it all Out

    Wait til you see my Õ (Ō), Latvian edition

    • 3 Comments

    Riff on Wait til you see my 'O' [pattern] and Wait til you see my 'O'[EMCP based technology] entirely intentional!

    The note from Peter Klavins came to me via the Contact link in early March of this year:

    Subject: Latvian (QWERTY) keyboard bug, all O/Ses to W7

    Hi Michael, I would like Microsoft to accept a bug on the Latvian (QWERTY) keyboard that has been around ever since Microsoft developed that keyboard (thank you, apart from the bug, it is great!), and it is still present in Windows 7, despite my having over time written various e-mails to people in Microsoft about it (and never got a response), and also having submitted a bug to Windows Vista beta, but it eventually got closed before final release without resolving the bug. Therefore I am hoping that you will have more clout than I obviously have, but I am not sure whether according to your disclaimer on the web page that I am submitting it on, whether I should expect a response or not even for this request. The bug is clearly a bug, and the only reason that it isn't resolved is that it may be too much trouble to do so for Microsoft.

    The bug is this:

    The Latvian (QWERTY) keyboard incorrectly produces the following characters for these AltGr-key combinations:

    AltGr-O Õ U-00D5 LATIN CAPITAL LETTER O WITH TILDE
    AltGr-o õ U-00F5 LATIN SMALL LETTER O WITH TILDE

    The characters that should be produced are:

    AltGr-O Ō U-014C LATIN CAPITAL LETTER O WITH MACRON
    AltGr-o ō U-014D LATIN SMALL LETTER O WITH MACRON

    The same bug exists in the "true" non-QWERTY Latvian keyboard. The character at position 'N' on the US keyboard produces 'O' on the Latvian keyboard, but with AltGr it also produces the incorrect characters in the same way as outlined above for the Latvian (QWERTY) keyboard.

    There is no room for interpretation whether in fact it is a bug. The other vowels in the Latvian character set, a, e, i, and u all produce characters with a MACRON above when pressed with AltGr. For a more authoritative reference, look at the "comments" line in the articles about the two Unicode characters in question:

    Portuguese, Estonian: http://www.fileformat.info/info/unicode/char/f5/
    Latvian: http://www.fileformat.info/info/unicode/char/14d/

    I would appreciate it if you could submit this bug into the Windows 8 database. My software engineering experience suggests that it would be cheaper for Microsoft to fix a bug in an upcoming release of Windows rather than produce a HotFix for previous versions. But nevertheless I would like the bug to be finally fixed.

    Thanks for your help.

    Regards Peter Klavins

    Now for the record, yes -- if you look at the keyboards as we document them here in the ALTGR and SHIFT+ALTGR states:

     

     

    These keyboards do indeed have

    Õ  (U+00d5, aka LATIN CAPITAL LETTER O WITH TILDE)

    and

    õ (U+00f5, aka LATIN SMALL LETTER O WITH TILDE)

    on them.

    And further, neither

    Ō (U+014c, aka LATIN CAPITAL LETTER O WITH MACRON)

    nor

    ō (U+014d, aka LATIN SMALL LETTER O WITH MACRON)

    are on either keyboard.
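
    If the tilde and the macron are hard to tell apart at small sizes, the Unicode data makes the difference plain. A quick Python sketch (the describe helper and the two list names are mine) that breaks each of the four letters into its base letter and combining mark:

```python
import unicodedata

on_keyboard = ["\u00d5", "\u00f5"]   # what the AltGr states produce today
requested = ["\u014c", "\u014d"]     # what Peter's mail asks for

def describe(ch):
    # decomposition() returns e.g. "004F 0303": base letter, then combining mark
    base, mark = unicodedata.decomposition(ch).split()
    mark_name = unicodedata.name(chr(int(mark, 16)))
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} = U+{base} + {mark_name}"

for ch in on_keyboard + requested:
    print(describe(ch))
```

    The first pair decomposes to O plus COMBINING TILDE and the second to O plus COMBINING MACRON, so these really are different letters and not a rendering quirk.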

    I have not heard a report of this previously, so I apologize about the lack of response (if Peter provides me with more information on those reports I will follow up to see what happened and try to keep it from happening again, as the title indicates, at least on language issues!).

    However, with all that said....

    I pointed out in July of last year in I swear the Latvian bug is fixed; it was fixed 4.5 years ago!:

    You see, at this point (starting with the new Vista results), we were almost entirely conformant to LVS 24:1993.

    But there were four differences:

    • LVS 24: 1993 puts "O macron" and "o macron" in a unique alphabetic weight after "O" and "o".
    • LVS 24: 1993 puts "R cedilla" and "r cedilla" in a unique alphabetic weight after "R" and "r".
    • LVS 24: 1993 puts "Y" and "y" just after "I" and "i", kind of like Lithuanian does.
    • LVS 24: 1993 puts capitals before small letters (like the Hungarian Technical Sort and unlike every linguistic sort we have for Latin script letters).

    Now the first three differences in this list relate to old characters from the orthography of Latvian prior to some reforms that date back to about June of 1946 (see here for more info on that).

    (emphasis added to the bits related to the current discussion)

    And as the Latvian language Wikipedia article indicates:

    The letter O indicates both the short and long [ɔ], and the diphthong [uɔ]. These three sounds are written as O, Ō and Uo in Latgalian, and some Latvians campaign for the adoption of this system in standard Latvian. However, the majority of Latvian linguists argue that o and ō are found only in loanwords, with the Uo sound being the only native Latvian phoneme. The digraph Uo was discarded in 1914, and the letter Ō has not been used in the official Latvian language since 1946.

    Because of this, while it is clearly wrong to include the Õ and õ, it would not be a specific requirement for Latvian to include Ō and ō.

    Since we cannot remove letters from keyboards, the "bug" part can't be fixed.

    As for the macron characters: sometimes, tech companies cannot take sides -- and if Microsoft added these letters, it would be quite easy to use that fact in the argument that the orthography should be changed to allow those letters in standard Latvian.

    We try to stay out of stuff like that, so that destiny can unfold on its own. When orthographies do change, we show up to support them.

    So I guess we can call this by design -- this is like that dumb extra letter on the old Ukrainian keyboard that the language didn't need: if we put it there, we can't really remove it....
