April, 2010

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    The Albanian LIP! (why can't I think of a pun for "Albanian" anyway?)

    Perhaps I am just getting old, but I can't think of a pun for the title on this one....

    The Albanian Language Interface Pack is now available (32-bit only, requires an English base language).

    You can get it right here!

    And now a bit of background on Albanian:

    NUMBER OF SPEAKERS:  5.1 million native speakers

    NAME IN THE LANGUAGE ITSELF:  Shqip

    Albanian is spoken by the entire population of Albania (about 3 million), another 1.5 million in the disputed region of Kosovo, and about 600,000 in the Former Yugoslav Republic of Macedonia (FYROM). Isolated and long-established pockets of Albanian exist in Greece and the south of Italy. There are two major dialects: Gheg, spoken in the north, and Tosk, spoken south of the Shkumbini River. Additionally, two variants of Tosk have developed in Italy and Greece, brought there by emigrants and mercenary soldiers from Albania centuries ago: Arvanitika is spoken in some rural enclaves of Greece, mostly by older people; and Arbëreshë is spoken in southern Italy. Gheg and Tosk are mutually intelligible, within certain limits. Standard Albanian has been based on Tosk since 1952 (from 1908 to 1952 a form of Gheg had been used).

    Albanian literature dates back to the 16th century (1555, Gjon Buzuku’s prayer book Meshari). The development of a unified literary language happened during the national renaissance (Rilindja) of the late 19th century.

    FUN FACTS:

    • The name Albanian comes from a Latin term for a tribe of the region, the Albani. In the language itself, Albanian is called shqip.
    • Albanian sometimes employs terms that help avoid taboo words, so that no bad luck is generated by naming the real thing: a wolf might be called mbyllizogojën (derived from an expression meaning May God close his mouth!), a fairy shtozovalle (from an expression meaning May God increase their round-dances!).
    • The introduction of a new alphabet (see below) was opposed heavily by the ruling Turkish government.  That dispute ultimately contributed to the declaration of independence by Albania on November 28, 1912.

    CLASSIFICATION:  Albanian is an Indo-European language, constituting a separate and independent branch of this family (It is considered the sole survivor of the Illyrian branch).

    SCRIPT:  Albanian adopted the Roman alphabet in 1908 (Congress of Manastir). Until then the Greek and Cyrillic alphabets and the Ottoman Turkish version of the Arabic alphabet had been used to write Albanian. The letters ç and ë were added, while the w is not used. There are nine digraphs (pairs of letters used to write one sound) for certain sounds, e.g. sh (pronounced as in English) or xh (pronounced like the j in jungle). These are treated like letters in Albanian (for example in encyclopedias).

    Enjoy!

  • Sorting it all Out

    [Unicode Announcement] Call for Participation: IUC 34, Oct 18-20, 2010

    Thought I'd put this out there for those who are interested. :-)

    Mountain View, CA, USA – April 26, 2010 – The Unicode® Consortium today announced a call for participation in The Thirty-fourth Internationalization & Unicode® Conference (IUC 34), taking place in Santa Clara, Calif., USA; October 18-20, 2010. The conference is produced by OMG™.

    The Internationalization & Unicode Conference is the premier annual technical conference for topics on the design and global deployment of multilingual applications and web sites. Internationalization and Unicode experts, implementers, clients, teachers, students, and vendors are invited to attend this unique conference. The interactive format makes the Internationalization & Unicode Conference a great place to meet and exchange ideas with leading experts, find out about the needs of potential clients, and get information about Unicode-enabled products.

    To be considered as a presenter for the conference, please submit a brief abstract before Wednesday, May 26. Topics should be related to internationalization and localization; presentations structured as tutorials are also welcome. Suitable topics include, but are not limited to:

    Best Practices and New Approaches

    • New technologies, algorithms and methodologies
    • Using internationalization libraries and programming environments
    • Handling bidirectional or other complex scripts
    • Data formats and evolving standards, e.g. XML, JSON, HTML5, DITA
    • Project management for global development teams
    • Localization technologies, Crowd Sourcing, Machine Translation, et al
    • Development, test, and deployment techniques and experiences
    • Improving globalization capabilities within organizations
    • Migrating legacy applications to global markets
    • Unicode, Emoji, and character encodings

    Application Areas

    • Social networks
    • Search engines, SEO, discovery and navigation best practices
    • Websites, Cloud Computing, SAAS, and Web services
    • Libraries and education
    • Mobile applications, including iPhone, Android, iPad, Kindle, etc.
    • Publishing and broadcasting for a global audience
    • Internationalized Domain Names and other identifiers
    • Security concerns and practices

    Language and Locale Support

    • African, Asian, Middle Eastern, and support for other languages
    • Unicode Common Locale Data Repository (CLDR)
    • Font development

    Details of the call for participation are available at:
    http://www.unicodeconference.org/iuc34call

    Interested individuals or organizations are invited to submit a brief (up to 600 word) abstract of their proposed conference presentation by Wednesday, May 26 using this web form:
    http://www.unicodeconference.org/abstracts

    The Program Committee will notify authors by Wednesday, June 9. Final presentation materials will be required from selected presenters by Tuesday, August 31. The conference agenda will be available by Tuesday, June 15 at: http://www.unicodeconference.org/

    Sponsorships and exhibit space are available; for more information on sponsoring contact Ken Berk at kenberk@omg.org, +1-781-444 0404. For exhibiting questions email event_marketing@omg.org . For all other questions email: info@unicodeconference.org

     

    ###


    About the Unicode Consortium

    The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards.

    The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry. Members are: Adobe Systems, Apple, DENIC eG, Google, Government of India, Government of Tamil Nadu, IBM, Microsoft, Monotype Imaging, Oracle, The Society for Natural Language Technology Research, SAP, Sybase, The University of California (Berkeley), The University of California (Santa Cruz), Yahoo!, plus well over a hundred Associate, Liaison, and Individual members.

    For more information, please contact the Unicode Consortium at http://www.unicode.org/contacts.html or visit http://www.unicode.org.


    About the Event Producer

    OMG™ is the Event Producer for the Internationalization & Unicode Conferences. OMG is an open membership, not-for-profit consortium that produces and maintains computer industry specifications for interoperable enterprise applications. Our specifications include MDA®, UML®, CORBA®, MOF™, XMI® and CWM™. OMG’s specifications are all available for download by everyone without charge.

    For more information about OMG, visit us online at http://www.omg.org.

    Note to editors: Unicode Standard, Unicode and the Unicode Logo are trademarks of Unicode, Inc. Unicode Consortium is a registered trademark of Unicode, Inc. OMG and Object Management Group are trademarks of Object Management Group. All other trademarks are the property of their respective owners.

    Now just so you know.... I might see if I can submit a talk or three myself, perhaps on some of the things I am doing related to collation, or language, or Tamil, or Bidi, or localizability, or keyboards, or one of the many other things I find myself doing these days. I'll keep you posted if anything happens. :-)

  • Sorting it all Out

    New for Windows 7: The PROCESS to keep MUI from being THREADbare....

    I fully admit that once upon a time, MUI would make me bipolar.

    But thankfully, that all ended in Windows 7. With two specific changes.

    The first change is the obvious one: the bug I mentioned in that blog that existed for Vista and Server 2008 for SetThreadPreferredUILanguages? They fixed it so that in Windows 7 and Server 2008 R2, you can set a custom locale to be the thread preferred UI language!

    Now that was a bug fix. While very cool, there is something cooler.

    It has to do with the model here.

    MUI was something that would work as the UI language for an entire session or per thread. And nothing in between.

    This is great if you were following the UI language being used by Windows (what the session level setting would have) or if you had one application where you had a limited number of threads and could set this information for each one that might be loading resources.

    Obviously there is a whole sheaf of scenarios for ISVs and occasionally even IHVs where having only these two levels of support was simply terrible.

    You can probably think of many of these on your own, like applications using thread pools for a big one.

    Now this architectural limitation bleeds over into managed code as well (by the way), where the CurrentUICulture is a per-thread setting.

    How many languages can you say that sucks in? :-(

    But in Windows 7 and Server 2008 R2, a new function was added: SetProcessPreferredUILanguages!

    It works the same way as its thread-based cousin, but all newly created threads from this process will pick up this default! In fact, if you change the function names in that sample and remove the MUI_THREAD_LANGUAGES constant from the call (obviously that flag makes no sense when getting the process UI languages!), it works as is.

    Simply amazing!

    There are perhaps a few places in the docs that call this a Vista feature -- it was not. New in Windows 7!

    Now .Net of course could choose in the future to pick this up and use it -- perhaps even extend it to the AppDomain level if that makes sense, but if not, at least make it work in all of the applications that currently have less-than-ideal behavior for their thread-level dependencies.

    So now I can set the UI language at the appropriate place -- the PROCESS level; and I can set it to any locale that is available on the system -- including one I might have just made myself!
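A minimal sketch of what a process-level call might look like from Python via ctypes. It is not the sample referred to above. The flag value `MUI_LANGUAGE_NAME = 0x8` and the double-null-terminated buffer layout come from the Windows API documentation rather than from this post, and the helper names `build_language_buffer` and `set_process_ui_languages` are made up for illustration. Off Windows the wrapper simply returns `None`.

```python
import ctypes
import sys

# Flag value from winnls.h (an assumption relative to this post):
# the buffer holds language names such as "en-US" rather than numeric IDs.
MUI_LANGUAGE_NAME = 0x8


def build_language_buffer(languages):
    """Pack names into a double-null-terminated list: "a\\0b\\0\\0"."""
    return "".join(name + "\0" for name in languages) + "\0"


def set_process_ui_languages(languages):
    """Ask Windows to use these UI languages for the whole process.

    Returns the number of languages Windows accepted, or None when not
    running on Windows (where the API does not exist).
    """
    if sys.platform != "win32":
        return None
    packed = build_language_buffer(languages)
    # Build the wide-character array element by element so that the
    # embedded null separators are preserved.
    buffer = (ctypes.c_wchar * len(packed))(*packed)
    accepted = ctypes.c_ulong(0)
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    ok = kernel32.SetProcessPreferredUILanguages(
        MUI_LANGUAGE_NAME, buffer, ctypes.byref(accepted)
    )
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return accepted.value
```

Calling `set_process_ui_languages(["de-DE", "en-US"])` early in the process, before worker threads are created, is the usage this sketch has in mind. Whether a given language name is accepted depends on what is installed on the system.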

    I really do feel that this "therapy" that Erik Fortune and the whole MUI team provided for me in Windows 7 and Windows Server 2008 R2 has mostly nursed me back to health from all of the problems inspired by previous versions.

    By the way, I verified that .Net never picks up the process default, either in an initial thread for a process if set right away or in any later thread you create within that process (I did not check what happens with processes created from a process after calling the function, so there may be a workaround to make it work).

    This behavior in .Net seems like something they are over-optimizing ("since the UI language couldn't ever change"), even though it could before in native contexts and now can in a very feature-driven way that .Net really ought to pick up. Note to Josh!!!!

    As soon as .Net can pick up this work too, I will officially be cured.... :-)

  • Sorting it all Out

    "Does my buttload look too big for that stream?" (from the Tales of the "That's what she said!" files)

    The whole thing really takes me back.

    It takes me back over four years to one of my favorite blogs (What do you get when you combine a base character with a buttload of diacritics?).

    It was just a few months back in Most combining characters in a Unicode glyph/character/whatever that Shawn was talking about a kind of related issue:

    Recently on the Unicode list someone asked basically what the biggest number of combining characters could happen in a sequence.  It's as many as someone wants to use, though the normalization UTS15 adds a limit, and the font rendering problem gets weird.

    I had some people ask me about that blog; I guess you could say it was those people who took me back, specifically. :-)

    Now the font rendering issue Shawn mentioned is something I already talked about, and that "buttload" blog even showed how fonts exist that can kinda handle even extreme cases, such as the aforementioned "buttload" scenario -- even as others cannot.

    Now there is no UTS (Unicode Technical Standard) 15; Shawn was actually referring to UAX (Unicode Standard Annex) 15 (Unicode Normalization Forms), specifically section 21 (Stream-Safe Text Format), which states:

    There are certain protocols that would benefit from using normalization, but that have implementation constraints. For example, a protocol may require buffered serialization, in which only a portion of a string may be available at a given time. Consider the extreme case of a string containing a  digit 2 followed by 10,000 umlauts followed by one dot-below, then a digit 3. As part of normalization, the  dot-below at the end must be reordered to immediately after the  digit 2, which means that 10,003 characters need to be considered before the result can be output.

    Such extremely long sequences of combining marks are not illegal, even though for all practical purposes they are not meaningful. However, the possibility of encountering such sequences forces a conformant, serializing implementation to provide large buffer capacity or to provide a special exception mechanism just for such degenerate cases. The Stream-Safe Text Format specification addresses this situation.

    D7. Stream-Safe Text Format: A Unicode string is said to be in Stream-Safe Text Format if it would not contain any sequences of non-starters longer than 30 characters in length when normalized to NFKD.

        * Such a string can be normalized in buffered serialization with a buffer size of 32 characters, which would require no more than 128 bytes in any Unicode Encoding Form.
        * Incorrect buffer handling can introduce subtle errors in the results. Any buffered implementation should be carefully checked against the normalization test data.
        * The value of 30 is chosen to be significantly beyond what is required for any linguistic or technical usage. While it would have been feasible to chose a smaller number, this value provides a very wide margin, yet is well within the buffer size limits of practical implementations.
        * NFKD was chosen for the definition because it produces the potentially longest sequences of non-starters from the same text.

    Okay, so for this one scenario (when the stream-safe text format is needed), the unbounded case is limited to 30 of those combining characters.
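The D7 definition quoted above can be sketched directly with Python's standard `unicodedata` module: normalize to NFKD, then find the longest run of characters whose canonical combining class is non-zero. This is a literal reading of the definition for illustration, not a buffered normalizer, and the function names are invented here.

```python
import unicodedata

# Limit taken from definition D7 quoted above.
MAX_NON_STARTERS = 30


def longest_non_starter_run(text):
    """Length of the longest run of non-starters after NFKD normalization."""
    longest = current = 0
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch) != 0:  # non-zero class means non-starter
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_stream_safe(text):
    """True if no run of non-starters exceeds the D7 limit."""
    return longest_non_starter_run(text) <= MAX_NON_STARTERS


# The extreme case from the annex: digit 2, 10,000 umlauts, one dot-below, digit 3.
extreme = "2" + "\u0308" * 10000 + "\u0323" + "3"
print(longest_non_starter_run(extreme), is_stream_safe(extreme))
```

Note that this loads the whole string at once, which is exactly what a streaming implementation cannot do; it only answers whether a given string meets the definition.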

    The word "buttload" in my original blog was meant to imply a very large number without giving specific bounds though some bounds are obviously implied.

    Just the other day Gweneth was in a meeting I was in and she was amused by my "indefinite adjective" use of buttload in such cases.

    This 30 character limit is obviously shorter than the example I used in What do you get when you combine a base character with a buttload of diacritics?, which would mean that "my" buttload is not stream-safe; it is simply too big for that stream. :-)

    I will try not to take it too personally.

    My butt reportedly doesn't look too big in my pants, so I think I'm okay with this one blog anomaly, that does not cross over into my social life.

    Shawn also mentioned the "user character" that was represented by the largest well-known grapheme cluster in Unicode, which is:

    U+0f67 U+0f90 U+0fb5 U+0fa8 U+0fb3 U+0fba U+0fbc U+0fbb U+0f82

    also known as:

    TIBETAN LETTER HA +
    TIBETAN SUBJOINED LETTER KA +
    TIBETAN SUBJOINED LETTER SSA +
    TIBETAN SUBJOINED LETTER MA +
    TIBETAN SUBJOINED LETTER LA +
    TIBETAN SUBJOINED LETTER FIXED-FORM WA +
    TIBETAN SUBJOINED LETTER FIXED-FORM RA +
    TIBETAN SUBJOINED LETTER FIXED-FORM YA + 
    TIBETAN SIGN NYI ZLA NAA DA

    also known as:

    HAKṢHMALAWARAYAṀ

    also known as:

    ཧྐྵྨླྺྼྻྂ

    which is kind of a useless ink smudge for me, perhaps you can see it better.

    Maybe we can turn up the font size a scosh:

    ཧྐྵྨླྺྼྻྂ

     

     

     

    Better? Looks great here!

    It really is a beautiful script. And that bit of text is certified stream safe!

    Now the interesting bit about this one (you'll only see if you have a font like Microsoft Himalaya) can be noted if you put it alongside some text:

     

    ABCDEཧྐྵྨླྺྼྻྂedcbaཧྐྵྨླྺྼྻྂabcdeཧྐྵྨླྺྼྻྂEDCBA

     

     

    (Look here if you can't see it in your browser but want some idea of what the hell I'm talking about!)

    Clearly many of these subjoined beasties are underneath the main bit of the stack, well below the baseline -- enough that this particular well-known grapheme cluster isn't even using the space that a full uppercase letter could use.

    It seems to me like there oughtta be things that could be done to make Tibetan more usable on Windows, but the full scope of what would be required momentarily escapes me. It would probably take an effort akin to the one I described in Want to hear about a cool new typographic convention? Khmer, and I'll tell you about it... for Khmer, which would really require forces outside of us to see the work done. Forces inside Tibet, for instance....

    Note that the current implementation of collation in Windows does not allow a compression (i.e. a UCA contraction) of more than eight UTF-16 code units, which means that the collation of the HAKṢHMALAWARAYAṀ is probably not going to be exactly right.

    So that one character's butt is a bit large for the jeans one might try to fit it in, on Windows!
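The size problem is easy to check by counting UTF-16 code units. The sketch below uses the nine code points listed earlier and the eight-unit figure quoted above; it only does the arithmetic and says nothing about how Windows collation is actually implemented. The names `utf16_code_units` and `fits_compression_limit` are hypothetical.

```python
import unicodedata

# The nine code points listed in the post.
HAKSHMALAWARAYAM = "\u0f67\u0f90\u0fb5\u0fa8\u0fb3\u0fba\u0fbc\u0fbb\u0f82"

# Limit quoted in the post for a Windows collation compression.
MAX_COMPRESSION_UNITS = 8


def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def fits_compression_limit(text, limit=MAX_COMPRESSION_UNITS):
    """True if text is short enough to be handled as a single compression."""
    return utf16_code_units(text) <= limit


for ch in HAKSHMALAWARAYAM:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(utf16_code_units(HAKSHMALAWARAYAM), fits_compression_limit(HAKSHMALAWARAYAM))
```

All nine characters are in the Basic Multilingual Plane, so each takes one code unit and the stack comes out one unit over the limit.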

  • Sorting it all Out

    Were it not for Emoji, this blog would look very different (and would have been unnecessary)

    At first I was U+1f600, content to realize that Unicode would never encode the Emoji, especially after so openly dismissing competing attempts at standards that included emoticons "needed by regular people in common chat-type scenarios."

    After that I was U+1f634, sleeping off the celebratory drinks afterward.

    Suddenly I was awake and U+1f62e, assuming I must have heard wrong.

    I decided to be U+1f62f, and wait for people to repeat themselves.

    Then I was U+1f615, when people were clearly continuing to talk about the Emoji.

    As it continued I was U+1f61f; could it really be happening?

    As the Emoji progressed, I was U+1f62c; what else could my reaction be during such turmoil?

    As things got further along, I was an inconsolable U+1f611, as I found that food didn't even taste like food anymore.

    And then the email from Peter came, telling me that:

    The proposal to encode Emoji characters in Unicode / ISO 10646 progressed to the last stage in the standardization process last week. At this point, the character repertoire and code positions are fixed. Final documents are still being prepared, but you can see the new characters (along with other new characters being processed together) in this doc:

    http://www.dkuug.dk/JTC1/SC2/WG2/docs/n3838.pdf

    Changes to the Emoji repertoire at last week’s meeting included
    - Various changes to character names and representative glyphs
    - Addition of one new emoticon
    - Re-organization of the Emoticons block (changed code positions)
    - Some changes in the EmojiSrc.txt mapping data

    I couldn't even remember the name of the Bytext guy, but I am sure if he knew of all this, he would be quite U+1f61b about all of this, saying "Tonight I will go home and U+1f619 my wife, think about question #3 in the Bytext FAQ, and yell out the window at the top of my lungs: I TOLD YOU SO!!!."

    Thankfully, Peter mentioned 13 additional emoticons that were accepted for some later version of the standard:

    1F600    GRINNING FACE
    1F611    EXPRESSIONLESS FACE
    1F615    CONFUSED FACE
    1F617    KISSING FACE
    1F619    KISSING FACE WITH SMILING EYES
    1F61B    FACE WITH STUCK-OUT TONGUE
    1F61F    WORRIED FACE
    1F626    FROWNING FACE WITH OPEN MOUTH
    1F627    ANGUISHED FACE
    1F62C    GRIMACING FACE
    1F62E    FACE WITH OPEN MOUTH
    1F62F    HUSHED FACE
    1F634    SLEEPING FACE
    with their glyphs as shown in document N3834.
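For readers who do not have the code chart memorized, the U+xxxx shorthand used in this post can be decoded mechanically. A small sketch, assuming a Python whose `unicodedata` tables already include these characters (they were still pending when this was written); `decode_code_points` is a made-up helper name.

```python
import re
import unicodedata


def decode_code_points(text):
    """Replace each U+XXXX token with the character and its Unicode name."""

    def replace(match):
        ch = chr(int(match.group(1), 16))
        # Fall back to a placeholder if this Python does not know the name.
        return f"{ch} ({unicodedata.name(ch, 'UNNAMED')})"

    return re.sub(r"U\+([0-9A-Fa-f]{4,6})", replace, text)


print(decode_code_points("At first I was U+1f600"))
print(decode_code_points("After that I was U+1f634"))
```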

    If he had not mentioned these to-be-encoded characters, this blog would not have had nearly as much emotional structure.

    Let's face it, I am not that organized!

    Come to think of it, I could have used these ones like a month and a half ago!

  • Sorting it all Out

    Look out Maharashtra, the Marathi LIP is now available!

    Yes, that is right, available for the English version of Windows 7 is the Language Interface Pack for Marathi! :-)

    You can download it right here (32-bit only).

    And for a bit about the language (not done previously for the Vista version):

    NUMBER OF SPEAKERS: 70 million native speakers, plus about 20 million second language speakers

    NAME IN THE LANGUAGE ITSELF: मराठी

    Marathi is mainly spoken in the Indian state of Maharashtra, where it is the official language, and to a lesser extent in the neighboring states. In fact, Maharashtra was formed in 1960 when the former Bombay state was split up into the linguistic areas of Marathi and Gujarati. Marathi is one of the official languages of India. Also known as Maharashtri, Maharathi, Malhatee or Marthi, the language derives its grammar and syntax from Sanskrit and is therefore one of the Indo-Aryan languages. With about 90 million speakers, it is comparable in rank with languages like Korean or Vietnamese.

    CLASSIFICATION: Marathi is the southernmost of the Indo-Aryan languages (which include Hindi, Bengali, Gujarati, Punjabi) which in turn belong to the big Indo-European language family. Its closest relative is Konkani.

    SCRIPT: Marathi is written in Devanagari (like Hindi, for example).

    You can look to Wikipedia for more info on Marathi, right here. In fact, if it makes sense someone might even link to the LIP from there? :-)

    Some other factoids, Michael Kaplan style, right here:

    • Marathi is one of the distinguished languages on the list of Windows locales whose ISO-639 code (mr) does not match its Windows 3-Letter Language Code minus the third letter (MAR --> MA), as described in LOCALE_SABBREVLANGNAME is so not an ISO-639 code. It is for this reason that the Language Bar abbreviation is MA and not MR:



    • As early as 2002, the information in the following slide was included in IUC presentations done by Cathy or me or both of us:



      But even though we both cited this difference, it was not captured in the Windows collation tables until Vista (in all prior versions the Indic tables were always combined and only Hindi among the Devanagari script languages was given its unique sort). As far as I know, Konkani is still not captured separately, but this example we had been citing for years seemed important to do, at least. :-)

    • In response to Learning to spell in Bengali (when one has a cool input method), Suraj commented:

         Google also provides a similar tool (http://www.google.com/ime/transliteration/) which I found to be better for Marathi input. Perhaps you should have a look?

      But I lack the language knowledge to compare the two input methods for Marathi - I will leave that for others to judge....
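The abbreviation mismatch in the first factoid above comes down to a two-character slice. A tiny sketch: the Marathi values (MAR, mr) are from the text, while the English (ENU, en) pair is added here only as a contrasting case where the two happen to agree. Function names are invented for illustration.

```python
def language_bar_abbreviation(three_letter_name):
    """Drop the third letter of the Windows three-letter language name."""
    return three_letter_name[:2]


def matches_iso639(three_letter_name, iso639_code):
    """True if the two-letter abbreviation happens to equal the ISO-639 code."""
    return language_bar_abbreviation(three_letter_name).lower() == iso639_code.lower()


# (Windows three-letter name, ISO-639 code); the English row is an assumed extra.
SAMPLES = {
    "Marathi": ("MAR", "mr"),
    "English (United States)": ("ENU", "en"),
}

for language, (windows_name, iso_code) in SAMPLES.items():
    print(language, language_bar_abbreviation(windows_name), iso_code,
          matches_iso639(windows_name, iso_code))
```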

    Enjoy!

  • Sorting it all Out

    Arising from one's own ashes. Like [up to 80% of ]a phoenix....

    Wow, two off-topic blogs in a row. It must be one of those weekends. If multiple sclerosis, iBots, drugs, and/or me hold no particular interest in your life, you can skip this one....

    I thought I had moved past the $1000/month phase of Multiple Sclerosis.

    Having rolled into a secondary progressive Dx (diagnosis), my Tx (treatment) options were much more limited.

    Off my plate of opportunity are the Tysabri and CRAB (Copaxone/Rebif/Avonex/Betaseron) type drugs that are such a mainstay of the "supported" world of MS treatment for those who (like me) avoid the various snake oil solutions (I carefully define "snake oil solutions" to mean those solutions that do not have multiple double blinded studies behind them, as given the nature of the disease new treatments are guilty until proven innocent in that regard!).

    But then comes a new drug.

    Ampyra.

    It is pronounced am-peer-ah (as in "Have a Beer, Ya?"), not am-pyre-ah (as in "dude, is that Elvira?").

    Gratuitous Elvira picture

    Sorry Cassandra, maybe the next drug will have a better sound-alike name for the Mistress of the Dark!

    Ampyra is also known as 4-Aminopyridine or 4-AP or dalfampridine or H2NC5H4N.

    It looks like this if you should happen to care about such things:

    As the Wikipedia article indicates:

    Fampridine has been shown to improve visual function and motor skills and relieve fatigue in patients with Multiple Sclerosis (MS). 4-AP is most effective in patients with the chronic progressive form of MS, in patients who are temperature sensitive, and in patients who have had MS for longer than three years.

    All of those criteria are fairly applicable to me at this point. And further that:

    MS patients treated with 4-AP exhibited a response rate of 29.5% to 80%. A long-term study (32 months) indicated that 80-90% of patients who initially responded to 4-AP exhibited long-term benefits. Although improving symptoms, 4-AP does not inhibit progression of MS.

    And interestingly:

    Spinal cord injury patients have also seen improvement with 4-AP therapy. These improvements include sensory, motor and pulmonary function, with a decrease in spasticity and pain.

    These are some very impressive facts and statistics.

    Speaking as someone who hasn't played the soprano saxophone in 15 years, who hasn't been to Whistler since before it was renamed to Whistler Blackcomb, who hasn't been running in 12 years, who hasn't been able to walk across the street to work in at least 8 years, who hasn't been dancing without looking like a dork in 10 years, who hasn't been dancing like someone who actually knows the steps in 15 years, even thinking about this drug's possibilities is, to be frank, more addictive than heroin.

    I don't want to diss the iBot since it gave me a life again, something that had been eluding me for some time during my slouch toward Bethlehem that was the bulk of my 30's.

    But as I found myself comp'd for parties that I would never have been invited to with models and centerfolds in attendance, as I was given all access passes to shows I would gladly have paid for, as I hung out at the Playboy mansion on Halloween with the best friend I have ever had dressed in nothing but gold paint and ran into Mr. Belding, as I fielded questions from models and adult film stars about whether it would be possible to do "stuff" in that chair without blushing, I realized that it wasn't by any means the life I had before.

    It wasn't even a life I ever thought was possible.

    I won't call it better or worse; it certainly had its moments. I had fun, and I only lost my Facebook account once because of the photographic evidence thereof.

    But it wasn't my life.

    I had its full measure and it was becoming a little boring for me by the end of last November. Maybe even a little before that, mind you. But that is about when I found myself less likely, all things being equal, to do something wild than to not do something wild.

    And I learned, by looking at it all through someone else's eyes by the middle of December that when used in more conventional situations the iBot had its ups and downs:

    • the ups are the eye level thing and the ability to handle stairs and all kinds of terrain and the way it felt to kiss a girl while in it, and
    • the downs are the inability to be out in public for more than 15 minutes without at least one and occasionally as many as ten people accosting me to say "I'm sorry, but I have to tell you how cool your chair is".

    The lesson I learned was to no longer find it so interesting to talk about the iBot when people expressed interest in it. I am distractable enough that this is something that might never have occurred to me (despite it being true) had I not been spending time with someone who I could tell had lost patience with it, in a way akin to how I felt about someone I spent time with years ago who would be stopped for autographs (I much preferred to not have that aspect of life that circumstances enforced intrude on our time).

    Now don't get me wrong, if my choice is between a regular wheelchair (or even a scooter) and the iBot, I'll take the iBot any day.

    But if the choice is between the iBot and walking (or playing the sax, or running 5 miles, or dancing a tango, or ice skating with friends, or roller skating one Sunday, or skiing one Saturday) in the inconspicuous way that everyone else does it, I'll park the iBot somewhere without blinking. I don't need to be a show (up on 2 wheels) or a spectacle (climbing stairs).

    Not park it too far away, in case I have just a Sabbath day's walk in me, even with the drug. But even so....

    Ampyra's potential becomes quite astounding now.

    I have the Rx (prescription). With my Dx (diagnosis) and my Sx (symptoms) , it is a reasonable symptomatic Tx (treatment) to try.

    By report, my insurance company covers it 100% with no co-pay, the Cadillac-esque thing that I guess the plan may in fact be.

    No pharmacy around here stocks the drug, apparently (well, none of the eight I tried) -- assuming 10mg BID (ten milligrams twice a day) -- the $1000/month price tagged drug. Talking to the pharmacy techs on a weekend, they don't even know if they can order it from their suppliers, who don't have it on the list.

    But it has only been FDA-approved since January, "available" since March. So it makes sense if they have not been asked yet, or put it on their lists yet.

    I'll need to try them on Monday when they can talk to their suppliers -- just a month to start. It might be faster to get one month from the mail order pharmacy, though even that too will have to wait until Monday (they are not around on weekends either).

    About rethinking limitations and what I may or may not like in who I am now versus in the past, my friend Cathy suggested to me via Twitter:

    ...where possible, dust off that part of your personality and try, try again. Ignore stuff you can't change. QED...

    But now some things might change. Which perhaps changes the landscape of these things.

    Last week I was talking to my neurologist, who is really my healer, confidante, and therapist, at this point. She suggested I needed to change my residual image from that guy who can run/walk/dance/skate/ski/play since not only is it no longer me, but there are fewer and fewer people who have ever seen (and thus can ever see) me that way who I deal with even semi-regularly. Though she admitted changing the image based on whatever level the medication puts me at wouldn't be the most unreasonable of ideas.

    She also suggested I not become so hopeful that I forget that it doesn't help everyone. I explained that I understand that.

    But the scientifically proven in multiple double blind studies chance to potentially be up to 80% of what I was?

    Holy crap!

    Who are we kidding? I'd be paying cash out of my pocket for that if insurance wasn't covering it.

    Now I just have to find a pharmacy that can get it.

    And hope.

    For the chance to rise from my own ashes in such a phoenix-like way....

  • Sorting it all Out

    Attn: Google - Amount due: USD$307.50 (FOURTH AND FINAL NOTICE)

    • 6 Comments

    This blog is from me, Michael S. Kaplan, private person, and not in connection with Microsoft in any way.

    You may want to read the previous sentence a few times before continuing. If the idea does not sit well with you, then you should leave. Now.

    It happened back in July.

    Not last July.

    Like several years ago.

    I'll start over.

    It happened back in July of 2006.

    I interviewed over at Google in Mountain View, CA.

    It was flattering to be asked to do so, and they flew me down and I had a hotel room and a rental car.

    I did sign an NDA so I am not going to talk about the interviews themselves at all. Many people seem to do that but I did sign something so I'll just not do that, and I'll talk about other stuff.

    I liked the food, and the food price (free!). The doors on their buildings were easier to open for me (in a scooter) than a lot of the Microsoft doors.

    That was nice.

    But at the end of the day, they didn't think I was a good fit for the stuff that kind of interested me, and I had no interest in the stuff they were suggesting I might be a better fit for.

    Ain't that the way these things often work? :-)

    All very amicable and nice, some very professional people throughout the day.

    And the good food. I think I mentioned that.

    Until....

    Well, how do I say it?

    They are a big huge company making billions of dollars, after all. It will sound kind of silly.

    I'll just say it.

    They didn't reimburse my expenses.

    $159.50 for the hotel room (I stayed at the Hotel Avante, an old favorite from Unicode Technical Committee meetings, and a source of a few good memories).

    $80 for the rental car. It was Budget.

    $38 for the charge to park my car at the SeaTac airport garage. I honestly only included that receipt based on something their reimbursement policy said that made me think it might be covered - that they would have reimbursed taxi cabs to and from the airport (this cost them a lot less so it seemed it was nice for me to do that!).

    I spent a little over the $30 for dinner because I like food, so we will call it $30 since that is the maximum they said they would reimburse per diem for the meal.

    They had taken care of the plane ticket already.

    So that comes to like $307.50.

    I sent the receipts as they asked me to, and got back to my actual job....

    Some time passed with no check, so I finally sent some email asking about it.

    They said they never received anything.

    The reimbursement rules I was looking at said they needed copies of all the receipts within 15 days and now it was almost two months!

    Crap.

    No worries, they said just send the receipts again and they would take care of it.

    So I did.

    And so they didn't.

    A few more months went by.

    I felt silly bringing it up again, to be honest. So I didn't.

    But I then ran into one of the recruiters I met that day, at some conference. He asked how it all went and I told him the story. He wasn't working for Google anymore but he was surprised. He encouraged me to try again, because they didn't just seem like good people; they were good people.

    So try again I did.

    Everyone was very apologetic although there was (apparently?) no record of the prior conversation. If there were one I'm sure they would have found it, I mean they do have good search stuff at Google. Everyone knows that.

    But they encouraged me to send copies of the receipts once more, and they would take care of the reimbursement.

    So I did.

    And so they didn't.

    I'll admit at that point I just kind of gave up.

    Then, just a couple of years ago, a recruiter who wasn't from Google but who was doing some recruiting for Google contacted me to see if I was interested in some specific career opportunities with Google that they thought I might be good for. I told him quite honestly that the answer was no (pointing out the "hell" is silent there but it could be implied in the tone), until Google reimbursed me for the last time and agreed to pay for everything this time upfront -- and I told him the story.

    The burned child fears the fire.

    He was surprised, too. But he said he would talk to his contacts and have someone look into this and follow up with me.

    I never heard from anyone.

    I figured my "requirements," amounting to a prima donna rider of a thing, got some note in some file that meant I would never hear from them again for job offers. But we were at a stalemate anyway in this weird chess game so I figured I could live with that. I still had an actual job, after all.

    Maybe they were mad I didn't have a GMail account or something.

    Now, fast forward to the present, not quite four years later, I hear from yet another recruiter, a contract recruiter working for Google who was pointed to me by someone working at Google (I do not know who, but they probably didn't know this story!), asking if I would "ever consider opportunities with Google".

    This just happened. Like yesterday.

    Maybe the recruiter didn't see the postulated "cranky pants fetish" note in my file about this reimbursement situation? :-)

    Taking a step back for a moment: in the end, in the last ten years, over and above the trips other people paid for for various jobs and shows and such, between two different girls I dated who lived in Los Angeles, I have paid for dozens of trips to California. And since those two girls are both ex-girlfriends now with neither relationship "succeeding" in the larger sense, I guess you could say the costs from those trips on me were ultimately not fully "reimbursed" either.

    So why should Google be any different?

    I mean Google and me? That was just a very brief thing, actually little more than a summer "fling" that really only lasted a few days and was never "consummated".

    I will stop this metaphor now before it gets downright inappropriate!

    The whole situation technically does not violate the idea of Google "doing no evil" since looking at it from the outside it is almost certainly a bunch of bureaucratic snafus and accidents that just leave me out my $307.50 plus the price of the two extra first class stamps, etc.

    Ordinarily I'd add the cost of making the two extra copies (I felt uncomfortable going to Microsoft's copy machines to copy the receipts to get Google to reimburse me, so I actually went to Kinko's (as it was called back then!) all three times). But I did not keep receipts for that so I won't ask for that money. They never said they would cover that anyway.

    But am I really asking for any money here, at this point?

    I mean, the notion of making copies again and sending them to Google again seems ridiculous to me.

    I suppose I could put copies of my receipts here:

             

    (clicking on the receipts gives you bigger versions of them)

    I did not get receipts for either the copying or the mailing, and although interest and penalties for the money that Google has had use of for nearly four years seem appropriate, I am not going to bother with it.

    Given the power of Google in regard to searching for information on the web, I am reasonably sure that the receipts will not be lost/misplaced this time.

    I doubt anything will happen, but if anyone from Google wants to forward this on to the appropriate people in charge of reimbursement then they should feel free to do so. All of the people I have talked to have been very courteous and professional to me, which makes the whole thing seem out of character. Bizarrely so, in fact.

    If someone wants to cut a check, my address should be on file (use the contact link here if it is not, and I'll give it to you). But there will be no further collection action, as the amount has essentially been written off as bad debt. No further action will be taken after this Fourth and Final Notice.

    I am going to go answer that recruiter now, and send the URL for this blog. I just don't have the energy to tell the story to someone from Google on the phone again and be told how strange it is and how they would take care of it, etc....

  • Sorting it all Out

    From Seattle (USA) to Coimbatore (India) in June? You betcha!

    • 7 Comments

    Regular readers1 may remember my On not being in Germany in October blog from last September, when I explained that I could not attend the Tamil Internet conference being held in Köln, Germany.

    It would have been great to attend.

    I have enjoyed the previous Tamil Internet conferences I have attended, the people I met, the feedback I got from people who attended my talks, and the connections I have made (many of which are still around to this day2).

    But I have continued during all this time to enjoy the language and the script (and the drama!) of Tamil, as well as the many technical issues that arise around it.

    Thus, while it is still early yet3, I am pleased to say on this blog you are reading right now that I plan to speak at the Tamil Internet 2010 Conference in கோயம்புத்தூர் (Coimbatore), also sometimes known as கோவை (Kovai), from June 23rd to June 27th of this year....

    When they asked what I would be willing to talk about I suggested four different topics, and they agreed with all of them (plus they suggested one or two more). They covered a wide range from Unicode to TACE16 to Indic text in Microsoft products to linguistic issues to text input to history. And we'll negotiate some of that part a bit (there are some presentations I want to attend, making the continuous Michael show impractical!), and I'll take some extra days in India, while I'm going that far anyway, of course.

    Now in remembrance of prior trips -- to Malaysia, India, Thailand, Singapore, and other countries nearby -- at various times during the year, I am preparing myself for greater heat at the time of year I will be there than the last time I was in India.

    And having been to Chennai and Bangalore with a scooter (something I blogged about previously) -- thinking about both the benefits and drawbacks of the scooter -- I believe I am prepared mentally to spend time in Southern India with an iBot -- in fact I am looking forward to spending time there again, this time with a device that lets me get around much more normally than the scooter did, with all those tall curbs without sidewalk cutouts! :-)

    Now in blogging this, I know I risk there being some unforeseen circumstance that would keep me from going, but with the visa in progress (no red flags raised yet) and the plane tickets on the way (ditto) I am quite hopeful that this will be happening. And assuming it will, and once the final list of presentations is available I'll put that list up here with information on each.

    Along with the blogs while I am there, about what's going on. A future iBot in India series -- have iBots been to India before?

    If you are in that part of the world then be sure to pop by to the conference in Kovai, introduce yourself, and mention that you are a reader here!

    Other places that might be visited while I am there are still being considered....

     

    1 - Also irregular readers with somewhat consistent reading habits
    2 - This is the principal reason I ultimately bowed out of the idea of doing confidential work for the US government since the SF-86 is much more complicated to fill out for Top Secret clearance if you have as much foreign travel as I have had and as much regular contact with foreign nationals, not to mention how much travel I have simply forgotten....
    3 - and there is many a slip twixt a cup and a lip, as they say.

  • Sorting it all Out

    AppLocale cannot stand mute

    • 10 Comments

    So the customer question was straightforward enough:

    A user is running the English version of Windows XP with the system locale set to English-US (Windows codepage 1252). This user wants to run a popular Japanese application that is code-page based. In order to run this app flawlessly in Windows XP, the user needs to set the system locale to Japanese (Windows codepage 932) and reboot the machine. Two restrictions: the user might not be an administrator to force this setting change; and/or the user might not want to force a reboot.

    You probably know the answer, right?

    AppLocale!

    Now the customer was generally pleased.

    Though there was a bit of feedback:

    Is there any way to hide this message?  I think this is going to work for us, it would be great if we could get rid of this message.  /q doesn’t work either.

             

    Uh oh.

    The answer to this question is that there simply isn't a way.

    AppLocale is a temporary solution for non-Unicode applications.

    Its goal was not to make this temporary solution into a permanent one -- it was not to make it easy to remove the reminder that it is a temporary solution....

    In the majority of cases if someone has no plans to tell the application writer to convert it to Unicode or to find some other application that is Unicode already, then one can live with the nag screen which is really pretty minor, as far as pains go.

    Just a symbol, a reminder nag screen to remind one of all the problems with not using Unicode!
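
    To make the underlying problem concrete, here is a minimal sketch (in Python, which is not what AppLocale or the customer's application uses) of what happens when text written under code page 932 is read under code page 1252. The sample string and variable names are illustrative assumptions only.

```python
# Text as a code-page-based Japanese application would store it (code page 932).
original = "\u65E5\u672C\u8A9E"          # the word for "Japanese language"
raw_bytes = original.encode("cp932")      # six bytes, two per character

# What a system whose default code page is 1252 makes of those same bytes.
mojibake = raw_bytes.decode("cp1252")

# The bytes are intact; only the interpretation was wrong, so reading them
# with the right code page recovers the text.
recovered = mojibake.encode("cp1252").decode("cp932")

print(repr(original), repr(mojibake), repr(recovered))
```

    The bytes never change; only the code page used to interpret them does, which is why a Unicode version of the application makes the whole question go away.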

  • Sorting it all Out

    If no one supported the OLD Old proposal, jumping in to support the NEW Old proposal may not make sense…

    • 1 Comments

    It is day two of the Text Summit, and I am doing a talk there, with the usual style or lack thereof that I bring to such things. Today's blog has nothing whatsoever to do with what I am presenting about....

    Just yesterday I talked about how for Korean You can't get this particular bit of proverbial toothpaste back into the tube.

    Just before that, I was asked a question by someone else, via the Contact link, about how to see support for the new jamo in Unicode.

    The ones I was talking about "in that blog post."

    By which he meant this blog post.

    Back in A&P of Sort Keys, part 14: The Hangul is really getting OLD I pointed out a problem-to-be that was waiting in the wings for Microsoft:

    And now we get to a slightly less contrived case, namely the various doubled and tripled conjoining Jamo both in Unicode now and the new ones being added (now in Stage 6 of the approval process) that I discuss in Using a character proposal for a 'repertoire fence' extension. They will be in these three subranges in an upcoming version:

    • 29 in Old Hangul initial consonants (in Hangul Jamo Extended-A block: A960..A97F)
    • 23 in Old Hangul medial vowels (in Hangul Jamo Extended-B block: D7B0..D7FF)
    • 49 in Old Hangul final consonants (also in Hangul Jamo Extended-B block: D7B0..D7FF)

    So these ones that were constructed now are meant to exist on their own, and you can even see the leading and trailing consonants in the proposal:

    HX124 한글 초성 비읍-시옷-티읕
    HANGUL CHOSEONG PIEUP-SIOS-THIEUTH

    HX335 한글 종성 비읍-시옷-디귿
    HANGUL JONGSEONG PIEUP-SIOS-TIKEUT

    And in the not-yet-official data for Unicode as:

    A972;HANGUL CHOSEONG PIEUP-SIOS-THIEUTH;Lo;0;L;;;;;N;;;;;
    D7E7;HANGUL JONGSEONG PIEUP-SIOS-TIKEUT;Lo;0;L;;;;;N;;;;;

    though the vowel (which would be HANGUL JUNGSEONG YO-A-I) you cannot find there, which would suggest that OpenType has at least one Jamo vowel sequence defined for Old Hangul that neither the existing Unicode standard nor any proposal from Korea lists!

    Oops?

    I wonder whose bug that is?
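
    For readers who have not looked at that data format before, here is a small sketch (Python, with hypothetical helper names) that splits the two semicolon-delimited lines quoted above into their leading fields and checks which of the blocks listed earlier each code point falls in. Only the first five fields are interpreted; the rest are left alone.

```python
from typing import NamedTuple

class UcdRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    combining_class: int
    bidi_class: str

def parse_unicodedata_line(line: str) -> UcdRecord:
    """Split one UnicodeData-style line and keep its first five fields."""
    fields = line.strip().split(";")
    return UcdRecord(int(fields[0], 16), fields[1], fields[2],
                     int(fields[3]), fields[4])

def block_of(code_point: int) -> str:
    """Map a code point to one of the two block ranges named in the post."""
    if 0xA960 <= code_point <= 0xA97F:
        return "Hangul Jamo Extended-A"
    if 0xD7B0 <= code_point <= 0xD7FF:
        return "Hangul Jamo Extended-B"
    return "other"

# The two lines exactly as quoted in the post.
LINES = [
    "A972;HANGUL CHOSEONG PIEUP-SIOS-THIEUTH;Lo;0;L;;;;;N;;;;;",
    "D7E7;HANGUL JONGSEONG PIEUP-SIOS-TIKEUT;Lo;0;L;;;;;N;;;;;",
]
records = [parse_unicodedata_line(line) for line in LINES]

for rec in records:
    print(f"U+{rec.code_point:04X} {rec.name} -> {block_of(rec.code_point)}")
```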

    Anyway, back to the point -- it would be easy to add an entry to the table for the new characters that will be added to Unicode based on the proposal any time, assuming that the Jamo exists. Thus all of these new characters can be kept backwards compatible with the old sequence, though it is likely that the order might not be the same between what the Korean proposal suggested vs. what is there now, which means Microsoft gets to decide what order it wants to be compatible with at whatever point these characters are added....

    Either with itself or with whatever order the standard suggests.

    There is also a problem for either the OpenType info on Old Hangul or proof that the exhaustive search for all known Old Hangul combinations is missing one that was on the OpenType list all along. But that is an issue for another day. Or to ignore until it all comes up again.

    But there is the main issue.

    Since Windows 7 officially picked up only Unicode 5.1 data, the bullet has been dodged for now. The (if you will pardon the unfortunate expression) landmine can be left alone for the time being.

    Support for these characters isn't in there.

    So the final decision on which way Microsoft will go is unknown.

    There are four options:

    1. Do nothing;
    2. Make the new Jamo equivalent to the existing pieces of the old one;
    3. Follow the recommendation in the Korean standard that does not make this equivalence;
    4. Do something else entirely.

    Now #1 was kind of the solution for Windows 7 obviously, but conceivably it could be the situation for future versions too.

    After all it isn't like anyone was supposedly relying on the OLD Old Hangul implementation so why jump in to support the NEW Old Hangul implementation before there is indication that it is going to be needed/wanted?

    The answer would not satisfy me personally (for whatever that is worth), but there are more formal metrics that people would use in making product feature decisions. :-)

    #2 is the solution I would prefer for a whole host of reasons, up to and including the fact that any existing data (that looks the same if you have the fonts for it) will work the same.

    But if work is done then #3 or #4 may have to be what is done (it looks like the current state machine Microsoft uses for Old Hangul can't fit all of the extra Jamo so in order to meet the requirements of the new standard a new solution will have to be written anyway, or at least a re-jiggering of the old solution if that is possible -- when the original dev is the manager of the manager of the people who would do it and the program manager is now in a different division and yours truly is in whatever place I am).

    All of this is trivial compared to the font side of all this, where the solution is easier but the decision on what to do with the OLD support the OpenType definitions for Old Hangul and the many "no longer blessed combinations" will lead to either widespread duplicate appearances of characters that are not equivalent (great for the to-be-registered www.oldhandulspoofing.com web site), or widespread backcompat breaks as all of the formerly working OLD Old Hangul is made to look wrong and not conjoin (a problem somewhat mitigated by the fact that it is so hard to find fonts that use the support!).

    I suppose they can also do nothing too -- the burned child fears the fire is that old Irish saying, isn't it? They arguably did the most work on the OLD solution, and it was openly ignored for many years.

    To be honest I don't envy either group, as they both have a fairly nasty mess to deal with, no matter what they do. And I never even got into the input method side of all this -- who is gonna create that little (or maybe big) beast? Maybe no one? It will make the problem much easier if everyone holds their hands over their ears and says LA-LA-LA-LA-LA very loud.

    And of course everyone should be kept in sync -- the nightmare of good display mixed with strings with no weight (and, to a lesser extent, the vice versa) is truly scary....

    The more I think about it, the more likely I would be to sit in the "do nothing" camp for both (all three?) teams, because if one does have plans to step in a bunch of crap, it really does benefit everyone to see where things are going before one takes those overt steps.

    Plus misery loves company -- I think everyone can step together!

    But that is just me. And recent circumstances in my life have made me less playfully eager to try new stuff, and thus have probably made me a bit more cautious (a lot less likely to want to either bite the bullet or step on that thing that may be a landmine).

    Ultimately someone else (or someones else) will be deciding, maybe or maybe not based on these arguments.

    Though obviously they have some time. A bunch of it.

    What would you (the reader who read the whole blog all the way to the bottom!) do if it was your decision? Taking all of the issues I mention into account and any others you could think of, what would you do?

  • Sorting it all Out

    You can't get this particular bit of proverbial toothpaste back into the tube

    • 0 Comments

    Before I forget, I'll wish everyone a happy 420 on this day, 4/20. I'm just saying....

    It is often quite ironic that the main point of an inquiry is buried at the end of the inquiry.

    Sometimes this is done to take emphasis away from the main point as a means to avoid showing bias.

    I'll give a likely example of this on the morning of the first day of the Text Summit (for you MS internal folks!), talking about something vaguely relevant to the Summit itself but completely irrelevant to pretty much everything being discussed at the Summit itself, including my own talk being given there. :-)

    So anyhow, a recent question to the Contact link left me a little nonplussed:

    I have looked extensively on your blog but did not receive an answer to this question, so I will just ask directly.

    If Microsoft claims to support Unicode, how can it not put the equivalence between Unicode Normalization Forms C and D into collation for Korean?

    At first I was not sure how to respond.

    I mean, I feel like I had answered this question before in blogs like Theory vs. practice for Korean text collation and Theory vs. practice for Korean text collation, redux

    The conformance to Unicode issue vis-a-vis the UTS 10: Unicode Collation Algorithm is really not an issue at all; as the Unicode Standard itself says in the UCA document which spells out the meaning of a UTS (Unicode Technical Standard):

    A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.

    Instead of UTS 10, Microsoft has its own independent support of collation, a support that by and large supports much of the intent of UTS 10. It even predates UTS 10; when one considers the number of times that Microsoft weighed in with thoughts/opinions on UTS 10 starting from when it was DUTR 10, images of Microsoft's feature telling the young DUTR "I am your father" are only squelched to avoid "evil empire" jokes.

    Now Microsoft supports Unicode, in many ways. Via usage in its products, via hosting their main offices in one of our own campuses in Mountain View CA, via its full membership, via board membership at present and many times in the past.

    More to the point at hand, Microsoft supports it by supporting Normalization as defined in UAX 15: Unicode Normalization Forms, which as in the previous case spells out the meaning of a UAX (Unicode Standard Annex):

    A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

    Normalization is something that Microsoft has largely supported in a de facto manner in both fonts and collation before it was formally defined as a standard, with the bulk of the exceptions rightfully considered either a) bugs to fix, or b) explicit design decisions not to.

    For the Normalization conversion itself, it is completely supported in Microsoft products and has been for many years.

    So, to review:

    • Microsoft supports UAX 15;
    • Microsoft does not support UTS 10;
    • Microsoft's own independent implementation of collation is
      • intended to support the same requirements as UTS 10,
      • was largely created before either standard existed,
      • and in de facto manner happens to support most of both UAX 15 and UTS 10 in its operations.

    Now obviously this talk of "largely supports" and "supports most of" has an obvious implicit statement that there are times the support isn't there.

    And the biggest "exception" to the idea of generally supporting them both is, ironically in the Alanis or maybe Britney sense, the most likely central point of that original question I was asked, even though it was the word at the end:

    Korean.

    Now Korean has an interesting place in languages, and in Unicode.

    A "perfect alphabet" developed in the 1400's by a king who wanted it to be easy for everyone to read and write, something opposed by the powers that could (as I discussed slightly in some of the introductory material to Traditional versus modern sorts), one could argue that its encoding in Unicode and ISO 10646 is anything but perfect.

    It is not only technically encoded 4 times in Unicode (as I mentioned in One more thing about Korean....).

    But because of the very natural and rational direct connection between the composed Hangul and decomposed Jamo, an operation that has a sound linguistic basis that no unbiased linguist would challenge, Unicode and ISO 10646 have borne the displeasure of the government for so long that the war of attrition finally succeeded in getting some Jamo added that were in theory already encoded (ref: Using a character proposal for a 'repertoire fence' extension).

    Though there were no widely used implementations available in practice, due to the government itself actively seeking to discourage those implementations that existed.

    I have referred to the chutzpah of killing one's parents and begging the court for leniency on the grounds of being an orphan in the past. :-)

    In that context, one can look at this exception to the degree to which Microsoft supports UAX 15 in collation as largely an effort in support of the government's desire to not so widely support an equivalence between the two forms of Korean characters, for modern Hangul.

    A good way to placate, if nothing else.

    From technical and "almost a linguist, minus the education" perspectives I may find the solution unsatisfying, though the workaround is easy enough: convert the string to a particular Normalization form and then compare them. This allows both the people who know they are the same to be happy to see their knowledge confirmed while still allowing those who need to differentiate them to feel they are treated differently.
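
    The workaround just described can be sketched as follows. This uses Python's standard unicodedata module as a stand-in, not the Windows functions themselves, and the helper name is a hypothetical one chosen for illustration.

```python
import unicodedata

# The same modern Hangul syllable, written two ways.
precomposed = "\uD55C"               # a single precomposed syllable
decomposed = "\u1112\u1161\u11AB"    # leading consonant + vowel + trailing consonant jamo

def equal_under_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# A plain code point comparison treats them as different strings...
print(precomposed == decomposed)

# ...while comparing after normalization treats them as the same text.
print(equal_under_normalization(precomposed, decomposed))
```

    Which form is chosen matters less than using the same one on both sides; callers who need to keep the two spellings distinct simply skip the normalization step.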

    To be honest, Unicode does what it does via a long process that started as a sensible defining of canonical decompositions that finally became Normalization, all in a way that made conformance guarantees to Unicode and compliance requirements to other standards that use the Normalization definitions.

    I suspect if they knew everything we know now, even those initial canonical decompositions would likely have been defined as something else (not compatibility decompositions but some other, third type), and would have saved over a decade of headaches, both political and the other kind.

    But it is too late now, as you can't get this particular bit of proverbial toothpaste back into the tube.

    In the current situation, there is nothing you can't get if you use the support methods and functions that Microsoft provides. Even if you have to work a little harder to make some of it happen....

    And Microsoft is conformant to Unicode here, completely. It "fails" the test of being 100% conformant to the goals of UTS 10 in an area that Unicode itself would likely skip if it could. Which is a wonderful advantage Microsoft (and every other company) has over Unicode in this case, in my opinion!

    The moral of the story -- be careful what you promise. When there are people recording what is said, at least!

  • Sorting it all Out

    The sad sad tale of the BARREE YEH

    • 13 Comments

    Warning: it will take me some time to get where I am trying to go here. If you lack patience you may want to skip it today....

    Before you read this blog, you may want to look at a couple other blogs I have written:

    That second blog in particular is important, where it describes some interesting issues of representation in Pashto.

    I have been promising my old colleague Irfan (I'm not saying he is old, I am referring to how we go a long way back) since the end of 2007 that I would talk about an issue related to Urdu, and I am finally getting to it now.

    His original mail:

    As you probably know that unlike Arabic, Urdu uses yeh barree (ے).  But what is not known by many non-Urdu users is that yeh barree can be used in the middle of a word and when it is used, it looks like the Arabic yeh, like this

               سیب 

    Currently, our shaping engine doesn’t do this and will show this word as

               سےب

    Would it be possible to change this in the next version of Uniscribe?  Also, do you know what other dependencies there are for it if this issue is decided to be fixed in the future?

    Ah, gotta love when script rules block language rules. Not!

    Peter contributed the Unicode side of the puzzle here, explaining the nature of the blockage to a "fix" here:

    In terms of Unicode, we cannot cause 06D2 to have dual-joining behaviour: that is a normative property of the character, and if we changed the behaviour in Uniscribe we might in so doing break user documents. Jonathan describes a workaround which he believes is what users, in fact already do: type 06CC “choti yeh” for initial and medial positions for both /e/ and /i/ vowels, and use 06D2 in final position when the barree yeh form is required.

    It would be possible to propose a new dual-joining barree yah be added to Unicode. But whether that would be useful or not depends on multiple factors, including how existing data is encoded to deal with this, and how users are likely to enter data. For instance, suppose that new character were encoded at 06NN: if users continue to enter 06CC in initial and medial positions for both /e/ and /i/, so that the only change is to use 06NN in final position rather than 06D2, then nothing has been gained, and there may be some detrimental effects because old data is mismatched against new data and implementations. (E.g. users would enter search strings with 06NN and may not find old data with 06D2.)

    And this does make the whole idea of adding a new character problematic.

    It is hard to argue with usage, trying to cover what people are doing today and have been doing for years.

    In relation to Kashmiri and Urdu (which both have this issue), Kamal and Jonathan were having a conversation themselves:

    {Kamal}: According to Daniels & Bright (see Table 62.5, The Kashmiri Alphabet, p. 753), both Letter Barree Yeh with half-ring above & Letter Barree Yeh can occur in initial, medial, and final positions while 06D2 is classified as right-linking (i.e. having only final and separate shapes).

    {Jonathan}: I suspect the situation with U+06D2 BARREE YEH in Kashmiri would be the same as in Urdu. This character represents an 'e' vowel, while U+06CC FARSI YEH (known as CHOTI YEH in Urdu) represents an 'i' vowel. However, in initial or medial positions, the same YEH is used for either vowel; in the (rare) case that the writer wishes to make the distinction clear, KASRA is added before YEH.

    {Jonathan}: Urdu typists, therefore, instinctively type "choti yeh" (U+06CC) for all initial and medial YEH characters, whether these are functioning as 'i' or 'e' vowel sounds (or the 'y' semivowel), and only expect to type a different character when the special BARREE YEH final form is required. It could be argued that medial 'e' should logically be encoded as a (dual-joining) BARREE YEH, and in fact I implemented such a system (in pre-Unicode days), but in practice typists do not think this way.

    Given that usage, adding a new character at this point adds many strange legacy issues for all existing documents and trying to find information in them, in addition to complicating text input methods and spell checkers and so on.

    And this is assuming that everyone moves to the new letter, when we know that not everyone would.

    These kinds of "change the way things are done in technology, years after people have been doing it" proposals are pretty much guaranteed to be complicated.

    Irfan pointed out how limited the backcompat issue would be, to which Peter responded:

    Responding to a couple of your comments:

     

    > should be similar to yeh where the same keystroke will bring two different shapes for barree yeh, based on its place in the word

     

    But there is also a problem of two vowels with identical shapes in non-final position, but different shapes in final position. Consider the words for ‘cats’ /billeeyan/ and ‘cows’ /gaa’een/. (My knowledge of Arabic script and of Urdu are very limited: my Romanizations may be a bit off; I’ll attempt to enter the words in Arabic script, but I probably won’t get them completely correct. And I’ll try to colour the yeh in red in each case, but Word won’t let me highlight just that letter in some cases.)

     

    بِلّیان

    گایٔن

     

    For these words, will the user type the same yeh, or different yehs? Since there’s no difference in the initial forms or in the medial forms, one might expect users would enter these the same. (Would they really know they should enter these differently? And what would be displayed on the key caps?) From the singular forms of these words, though, it’s evident that the underlying vowels are different:

     

    بِلّی

    گاۓ

     

    So, linguistically, in the plural forms of words, users should enter different yeh characters, but there’s a good chance they wouldn’t, let alone do that consistently.

    Now this puts a cat (or a cow!) among the pigeons!

    In talking about other differences that would potentially sway people to believing a new character should be added to Unicode, different sorting needs for the two letters had come up previously, but Irfan was even more interested in the example Peter used:

    I believe that the examples you provided of the plurals for cat and cow make the case even stronger of having a dual joining barree yeh.  When writing these plurals, in the traditional sense of using pen and paper, the writer simply selects a glyph, without thinking whether it’s /i/ or /e/.  However, there are two different categories of users when using a PC to write these plurals.  The first group will simply use the glyph, like they do in the traditional method, and the second group will use /i/ or /e/ depending on which vowel the word has.  Of course, the accessibility of these glyphs will have some effect on the usage too.

    Since I belong to the latter group, and am a bit familiar with the writing habits of that group, I would say that this group while writing words similar to cats and cows will type plurals by typing the singular first, just like in English, which will be linguistically correct.

    So, there are basically two groups: one enters yeh as a glyph and the other enters yeh which is linguistically correct. We are already supporting the first group, and now we need to support the second group for linguistic reasons, as well as ensure that we continue to support the first group and don’t break legacy docs.

    A bit more back and forth, and Peter found something interesting in newer proposals:

    What is interesting for your case, though, is a pair of characters being added to Unicode 5.1:

    077A ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE
    077B ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT THREE ABOVE

    These are used in the Burushaski language, as well as 06D2 and 06CC. In the proposal doc (06149-bashir-prop.pdf), the author displays these characters as dual-joining. Because of that, the first draft of the Unicode character properties files listed them as dual joining. A question was raised as to why they are dual joining when the character for skeletal form they are based on, 06D2, is right joining. This led to a subsequent doc from the proposal author (07264-arabic-shaping.pdf) in which essentially the same argument you are making is presented.

    Very interesting!

    These two new characters, being added in Unicode 5.1, are interesting since they weren't really added to any of the fonts as far as I could tell.

    I stopped to make sure they made it in, and they did, into the Arabic Supplement block.

    Yep, there they are. I'll blow them up for your convenience:

    Okay, so if one were willing to ignore these little numbers above the letter, one could use these two new characters.

    If fonts supported them.

    Which they don't seem to be doing yet very well.

    The conversation got away from me after that, and I don't think it came up in Unicode again though I'm not sure.

    And now we come to the bigger problem. The one I really wanted to cover today.

    Just the other day someone was talking about a bug in the Windows Fax component where it was not being mirrored under a Bidi language other than Hebrew or Arabic (a bug no one noticed since even in the three LIPs previously provided for Bidi languages -- Urdu, Pashto, Persian -- this component isn't localized). Sure enough, in true How To [NOT] detect that a locale is bidi form, the code was pretty much doing the following for its check:

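    // The flawed check: it only recognizes the Arabic and Hebrew *languages*,
    // so Urdu, Pashto, Persian and other right-to-left locales return FALSE.
    BOOL IsBidi(LCID lcid) {
        return (PRIMARYLANGID(lcid) == LANG_ARABIC ||
                PRIMARYLANGID(lcid) == LANG_HEBREW);
    }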

    Oops.
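    To make the failure concrete, here is a small portable C sketch of that same check run against the three LIP languages. The SK_ macros are my stand-ins for the winnt.h definitions so that it builds without the Windows headers, and the helper names are hypothetical; the locale identifiers are the usual ones for ar-SA, he-IL, ur-PK, fa-IR and ps-AF.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the winnt.h definitions so the sketch builds anywhere. */
#define SK_LANG_ARABIC 0x01u
#define SK_LANG_HEBREW 0x0du
#define SK_PRIMARYLANGID(lcid) ((uint32_t)(lcid) & 0x3ffu)

/* The same logic as the check quoted above. */
int naive_is_bidi(uint32_t lcid) {
    return SK_PRIMARYLANGID(lcid) == SK_LANG_ARABIC ||
           SK_PRIMARYLANGID(lcid) == SK_LANG_HEBREW;
}

/* Locales for the three right-to-left LIPs mentioned in the text. */
static const uint32_t kBidiLipLocales[] = {
    0x0420, /* ur-PK, Urdu    */
    0x0463, /* ps-AF, Pashto  */
    0x0429, /* fa-IR, Persian */
};

/* Counts how many of those right-to-left locales the naive check misses. */
size_t count_missed_bidi_lips(void) {
    size_t missed = 0;
    size_t count = sizeof kBidiLipLocales / sizeof kBidiLipLocales[0];
    for (size_t i = 0; i < count; i++) {
        if (!naive_is_bidi(kBidiLipLocales[i])) {
            missed++;
        }
    }
    return missed;
}
```

    All three come back as "not bidi". The sturdier route is to ask the locale data instead of keeping a list of languages; on Windows 7 and later I believe LOCALE_IREADINGLAYOUT (via GetLocaleInfoEx) is one way to do that, though treat that as a pointer to check against the documentation rather than something this sketch demonstrates.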

    You see, the problem is that so much of Microsoft product, and of Unicode encoding, is done with the trailblazers like the Arabic language, without really taking enough time to recognize that of all of the languages that use the Arabic script, the Arabic language has the simplest requirements.

    And all of these "can't ever change" properties in Unicode were assigned before the full measure of the various things people were doing with other Arabic script languages was grokked (though much of it was in Daniels and Bright already, it appears not all of it was).

    When you imagine the number of cases like the BARREE YEH in Urdu and Kashmiri (which aren't getting their letter) and the number of random ones that are (some of which will get the correct joining behavior even if their original analogues do not), it makes approaching the whole block frustrating for the myriad of people who just want their language to work.

    One could argue that some of the fault lies with the other languages added earlier, like Urdu, where the people pushing the inclusion lacked the full understanding of the consequence of these properties, but that is just an unconvincing strawman position.

    It may be 100% true, but the languages added later have the same problems, so if those earlier languages had not been added there would be even fewer flexibilities available.

    The Microsoft problem I mentioned above, which has similar causes, is more easily fixed since it is just a bug and there is no conformance requirement in keeping bits of code like that dumb. But the standards question is a thornier one, with no good answers for the many extensions to language that are asked of longstanding scripts.

    The Arabic case is particularly bad given all of the various behaviors and properties that have to be defined....

  • Sorting it all Out

    How reasonable it is to translate something is directly proportional to the likelihood someone will see it

    • 10 Comments

    We all know that English is not the only language spoken in the world.

    Even for software developers.

    Okay, not all of us know that last part; I routinely talk to people who make certain assumptions about how much English a particular segment of customers might know.

    Even the assumption that everyone in India making more than $2 a day knows English turns out to be untrue (I vividly recall the woman next to me on the plane during my last India trip whose only language other than Tamil was French, not English).

    So she did not know English, I did not know French (beyond counting and a couple of phrases that might have gotten me slapped). So I found myself struggling to communicate in the only language we came even close to sharing - Tamil.

    If we had to spend a month on the plane, I'd be a fluent speaker of Tamil now. :-)

    Anyway, my point? That not everyone speaks English.

    When I bat about phrases like Extent of Localization and talk about Language Interface Packs those are simply compromises: if it didn't cost money or take hard-to-find expertise to accomplish, companies like Microsoft would localize every bleeding word of every product shipping everywhere in the freaking world.

    The reason we don't, the reason no one does, is that it is expensive.

    So we make those trade-offs.

    But even the trade-offs are complicated.

    I mean, when someone asks a question like:

    Do we have any data that demonstrates the potential value to users v. cost of localizing error messages?

    the intent is clear.

    Someone is trying to determine how important it is to localize a certain bit of a software product.

    Even the simplest question like this one simply spawns in my mind more questions, in order to give an honest opinion. Questions like:

    1. What are the target languages and markets in question? (we have anecdotal knowledge of places many developers prefer some other language, often English)
    2. Will the errors be seen mostly by end users or developers (kind of like #1 but trying to find out if we are putting the burden on those other developers)
    3. Does the product generally have errors occur as a very exceptional occurrence or will they tend to be common (obviously if they are rare the cost may not be worth the benefit)
    4. Are there tiers of error messages that would have different answers to questions 2 and 3, or even 1? (if certain problems are more common among certain markets or customer segments then the answers may be different for different customers)
    5. What has been done in the past with similar products doing the same thing?

    Extend this to all the other segments of the user interface, factor in differences of the types of products and who uses them, and so on.

    You would need extensive formal research studies to get real answers.

    Of course the difficulty in terms of time, cost, and reliability of doing formal studies to get answers to these questions is in some cases insurmountable: the cost will easily be more than the market could ever contribute in revenue in decades, for just doing the studies, let alone the actual localization!

    So the problem remains -- how to decide how much to localize if there is no good way to know how important it is to do so?

    There are some good, easy rules of thumb that can help, though.

    Like in general there is a bias in favor of localizing all top level UI. This plays neatly into answering the point #4 I raised while kind of giving reasonable context for #2 and #3.

    Thus if we ignore the "market specific guesses" since they are largely anecdotal and even our best contacts in markets cannot help us since they speak English and thus often have no real frame of reference to compare the relative importance of issues to people who are so completely unlike themselves, we make the problem at least a little easier to frame.

    Stated simply, the re-framed principle is easy enough to grok:

    How reasonable it is to not localize something is inversely proportional to the likelihood of someone seeing that thing.

    Or if you like the version I put in the title better, you can go with that, instead.

    Because once again, in an ideal world we would be shipping a Babel fish and a Star Trekian Universal Translator to every single frigging customer in the world.

    All these other conversations are about how to make sure that resources are invested sensibly enough that products do not cost more to make than they will later be able to earn back.

    Answers.

    Sigh.

    It is so hard to formulate reasonable answers beyond that.

    Well, I mean other than attacking the source a bit!

    Attend me for a moment while I do this. :-)

    For example, I tend to see some of the weirdest and most obscure and hard to understand error messages one could ever imagine.

    They seem to contain English words but use unrecognizable jargon and sentence structures with which I am entirely unfamiliar.

    I want to ask them to translate it into ENGLISH so I can understand it; the idea of trying to translate it to some other language, where for most users the best one could hope for is the same experience but in their own language (be it French or Japanese or whatever), seems premature.

    I would say: fix the product to make it more useful in your own native tongue before foisting it on the rest of the world so that any time you put up text the user either:

    • knows what they are being told and what (if anything) to do, or
    • can call someone for support and tell them what it says

    And that is just for the original English where we still fail, long before we add other languages to the mix.

    Which makes the answer to the original question easier: clean your own house first, then come asking about how to make the localization cheaper. :-)

    But maybe a little less snarky than that, since who is asking the question and when it is being asked usually make it unlikely that the software will be redesigned....

    A study about my dissatisfaction at an error message I cannot understand even though it is my language, and how it compares with the dissatisfaction of someone who gets the error in a language they do not know at all, might even find that I am even less satisfied than the other person -- at least they can blame it on [a possibly otherwise good product] not being localized, whereas I have no one to blame but the original core product's poor usability! 

    How expensive is it to localize a product?

    Well, it depends.

    First answer me just one question:

    How intuitive and understandable was the product, to start with?

  • Sorting it all Out

    Sing. Sing a song. Sing it Lao'd (just in case the sort's still wrong!)

    • 5 Comments

    One thing about this blog....

    It is [apparently] kinda popular.

    I have no clue why. Truly I don't.

    If you can believe the stats about it, then it ranks #32 of all the blogs on the server, and if you take off the team blogs for a moment (like IE and Excel and Outlook and such), then it is like #14.

    If you don't count Raymond's bump late last month that awoke a March 31st reddit burp and ycombinator belch, then the numbers are more accurately #34 and #16, for what it's worth. That'll sort itself out soon enough when the UNIX heads realize I'm not teasing them with every blog!

    So the number is real. As unreal as that may seem to me.

    Like when the girl I'm going out with says she thinks I'm hot. I never know what to do to that, since I don't think I am. They certainly didn't all say it.

    My self image is a bit like that Cerebus picture in the corner; I think they may eventually see me that way.

    But the looking hot thing is weird, like these numbers. And the numbers are even weirder than the girlfriends since the numbers don't seem to be breaking up with me quite as readily even when I abandon them for months for no good reason, or have an online nervous breakdown over the death of a friend, and no matter how outrageous I've been over these last few years1.

    Maybe the Blog fills some unique niche; I had a reader point out to me that if you look at the Official Google Blog they have 27 posts on accessibility, 2 on Africa, 8 on Asia, 24 on Europe, 8 on Geo, 11 on India, 7 on Latin America, and 0 on Unicode, 0 on Globalization, 0 on internationalization, 0 on localization, 0 on translation, 0 on localizability, 0 on language, and 0 on linguistics.

    I pointed out that this might be unfair since some of those regional tags might point to relevant content. And they did have the blog about the goats, which was kind of cool and I'll give them props for.

    Plus I am hardly in competition with the Official Google Blog anyway2.

    Anyway, if you compare the number of comments per blog on this Blog, the numbers are much lower than most of the others on the server -- even ones lower in the stats.

    I can't quite figure it -- maybe people pop in and leave when they realize I'll be talking about Mongolian or uppercase or song lyrics or iBots or Unicode or whatever, so they never make it to the bottom of the page where the Comments box is.

    Actually, I know that probably isn't what it is -- the issue is that the majority of the traffic is people coming off of Bing or Google searches. More and more Bing, by the way -- and not just from the search box here. Like Bing from outside the server.

    So when I talk about "regular readers" as if there are many of them, the number is probably not as huge as the "rank" might indicate; people just show up here searching for answers to some random thing, and I happen to write about all kinds of random things. So Presto, people get their answer and go.

    I'll keep the myth going and talk about regular readers as if you've all been here for the last half decade or so, like me.

    Hope that's okay, if not then you're leaving anyhow so who cares? :-)

    Anyway, if you've been around here for a few years at least, you may remember a little over two years ago, when I reported (in Despite progression, the bug calls out to me quite LAOdly) how the Laotian sorting in Windows did well on the consonants but not so much on vowels and tone marks. And that it was essentially broken.

    In comments, John Durdin3 and Marc Durdin4 expressed concern that even if the bug did not exist it is likely that more would need to be done and that Lao sorting would not have looked right if it was just a matter of adding the proper weights to these code points; some compressions (what the UCA calls contractions) would be needed.

    We do that with Thai (299 2-to-1 compressions and 230 3-to-1 compressions, which when applied put us within conformance range with the Royal Thai sort).

    Which makes their claims a reasonable set of suppositions.

    So obviously they, knowing a hell of a lot more than I do about Laotian, would be likely to be correct.

    And as it turns out, they were.

    Windows 7 added a 464-entry 2-to-1 Laotian compression table covering a huge array of letter and vowel combinations.

    Clearly someone felt the Vista sort wasn't the full solution.

    Maybe even the same person!
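    If compressions are new to you, here is a toy C sketch of a 2-to-1 compression at work. Nothing in it comes from the Windows tables: the struct, the function names and the weights are all invented, and instead of Lao or Thai it uses the traditional Spanish treatment of "ch" as a single letter sorting after "c", purely because that fits in ASCII. It assumes lowercase a-z input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 2-to-1 compression entry: two characters that sort as a
   single element with its own primary weight. */
typedef struct {
    char first;
    char second;
    unsigned weight;
} Compression;

/* Toy table: "ch" gets a weight between 'c' (30) and 'd' (40). */
static const Compression kCompressions[] = {
    { 'c', 'h', 35 },
};

/* Invented primary weights for single letters: a=10, b=20, c=30, ... */
static unsigned single_weight(char ch) {
    return (unsigned)(ch - 'a' + 1) * 10u;
}

/* Writes the primary weights for s into out (at most cap entries) and
   returns how many were written. Compressions are tried before singles. */
size_t primary_weights(const char *s, unsigned *out, size_t cap) {
    size_t n = 0;
    size_t len = strlen(s);
    size_t table_size = sizeof kCompressions / sizeof kCompressions[0];
    for (size_t i = 0; i < len && n < cap; ) {
        int matched = 0;
        if (i + 1 < len) {
            for (size_t k = 0; k < table_size; k++) {
                if (s[i] == kCompressions[k].first &&
                    s[i + 1] == kCompressions[k].second) {
                    out[n++] = kCompressions[k].weight;
                    i += 2;          /* two characters consumed, one weight */
                    matched = 1;
                    break;
                }
            }
        }
        if (!matched) {
            out[n++] = single_weight(s[i]);
            i += 1;
        }
    }
    return n;
}

/* Compares two strings by primary weight only; anything past 64 weights
   is ignored in this sketch. Returns -1, 0 or 1. */
int compare_primary(const char *a, const char *b) {
    unsigned wa[64], wb[64];
    size_t na = primary_weights(a, wa, 64);
    size_t nb = primary_weights(b, wb, 64);
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (wa[i] != wb[i]) return wa[i] < wb[i] ? -1 : 1;
    }
    return (na > nb) - (na < nb);
}
```

    With this table "chico" sorts after "cuna", even though a plain character-by-character comparison would put it first. A 3-to-1 or 4-to-1 compression is the same idea with a longer match, which is roughly the sort of thing the syllable clustering John and Marc described would call for.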

    However, in thinking about John and Marc's words....

    And the fact that the MAI EK, MAI THO, MAI TI, and MAI CATAWA tone marks were given alphabetic (primary) weight....

    And the fact that there are no 3-to-1 or 4-to-1 compressions to allow for the clustering requirements they both suggested (which is pretty much how we do it for Thai)....

    All of that makes me suspect that Lao sorting may well be closer to goal but is likely still off the mark in Windows 7.

    I will have to wait to hear from some of these very interested people on how close it ends up being -- it may also be right under the "hundreds of wrong answers that give you a right behavior" principle that table based collations can sometimes bring to the mix, which are unsatisfying to linguists since hundreds of wrongs shouldn't make a right, but I tend to be okay with due to being more results driven....

    Anyone have any opinions on this? :-)

     

    1 - With the arguable high point being when a VP asked why I wasn't being fired and one HR Generalist offered me an only mildly insulting RIF. Though he left and so did she, so I guess that movement kind of fizzled out.

    2 - Or looking for people who think I'm hot.

    3 - Spake John: "Conventions for sorting are probably still not fully accepted in Lao PDR, but sorting according to the rules given in Kerr's 1972 Lao-English Dictionary is widely followed.  The algorithm is (primarily) phonetic, unlike Thai, which uses an orthographic sorting approach.  From a user's perspective, it is much easier, since you can find a word in a dictionary without knowing how it is spelled.  Most Thai university students do not use a dictionary effectively - if you don't know how the word is spelled, it can be quite hard to find it (and Thai, like English, has very irregular spelling).  The problem with the Lao approach is that words (or text) *must* be split at syllable boundaries (reasonably well) before determining the sorting key for each syllable, which adds computational complexity, but can be done."

    4 - Spake Marc: "I'm not quite sure I can see how the sorting can even kinda work without taking each syllable as a whole.  Do you work on a syllable-by-syllable basis? Unless you take each syllable as a whole (initial consonant, final consonant, vowel, tone), the sorting just won't work.  And because there can be ambiguity with final consonants and open syllables, you really need to split each syllable before sorting."
