
August, 2010

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    And how exactly do you justify those frigging kashidas?

    • 3 Comments

    Over in the Suggestion Box, DJPN asked:

    Hi,

    could you describe the correct way, or maybe some ways ,of doing justification of text containing arabic characters.  For instance, if I use ScriptJustify (using my home grown uniscribe-using library) and pass iMinKashida>0 then it occasionally goes wrong when there are characters that have combining diacritic marks.  I note that MS Word doesn't justify using kashidas.

    Now I have discussed kashidas before, in blogs like On character justification (in *both* senses) and You've got to be kashidding me, and generally speaking, unless you are doing all the work yourself, the ScriptJustify function is the way to "Kashidize" your text.

    However, the iMinKashida parameter, documented only as:

    Minimum width of a kashida glyph to generate.

    is really just part of the story.

    In fact if you look at the remarks, it notes that:

    This function provides a simple implementation of multilingual justification. It establishes the amount of adjustment to make at each glyph position on the line. It interprets the SCRIPT_VISATTR array generated by a call to ScriptShape, giving top priority to kashida. The function uses interword spacing if no kashida points are available. It uses intercharacter spacing if no interword points are available.

    Note   Sophisticated text formatters might generate their own delta dx array by combining formatter-specific features with the information retrieved by ScriptShape in the SCRIPT_VISATTR array.

    The application should pass the justified advance widths generated by ScriptJustify to ScriptTextOut in the piJustify parameter.

    ScriptJustify creates a justified array containing updated advance widths for each glyph. When an advance width for a glyph is increased, the extra width is rendered to the right of the glyph, with a white space or, for Arabic text, a kashida.

    Note   Kashida insertion occurs to the right of the glyph to justify visually. Microsoft Word and Microsoft PowerPoint use this concept. Any change in the kashida placement algorithm should accompany a change in the corresponding ScriptTextOut handler for a particular script, for example, the Arabic TextOut justification handler.

    That matters because if you modify the SCRIPT_VISATTR struct that you are going to pass in here, you can quite easily control whether you want and/or expect to see kashidas placed at a given glyph, by deciding whether psva->uJustification contains SCRIPT_JUSTIFY_ARABIC_KASHIDA or one of the other interesting flags in the SCRIPT_JUSTIFY enumeration. For the particular case of diacritics, psva->fDiacritic tells you whether a diacritic is present. What you do here can get pretty intricate!
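    To make the idea concrete, here is a toy model in Python rather than real Uniscribe code. It assumes a simplified glyph record whose justify and diacritic fields stand in for psva->uJustification and psva->fDiacritic; the names Glyph, suppress_kashida_at_diacritics and justify are made up for illustration, and the distribution rule only follows the priority order from the remarks quoted above (kashida, then interword, then intercharacter), not the actual ScriptJustify algorithm.

```python
from dataclasses import dataclass, replace

# Hypothetical stand-ins for SCRIPT_JUSTIFY classes (not the real enum values).
NONE, CHARACTER, BLANK, KASHIDA = "none", "character", "blank", "kashida"


@dataclass(frozen=True)
class Glyph:
    advance: int             # unjustified advance width
    justify: str = NONE      # models psva->uJustification
    diacritic: bool = False  # models psva->fDiacritic


def suppress_kashida_at_diacritics(glyphs):
    """Return a copy in which no kashida point sits on or right before a diacritic."""
    out = []
    for i, g in enumerate(glyphs):
        near = g.diacritic or (i + 1 < len(glyphs) and glyphs[i + 1].diacritic)
        out.append(replace(g, justify=NONE) if g.justify == KASHIDA and near else g)
    return out


def justify(glyphs, extra, min_kashida=0):
    """Spread `extra` units over the highest-priority class that has usable points."""
    widths = [g.advance for g in glyphs]
    for cls in (KASHIDA, BLANK, CHARACTER):
        points = [i for i, g in enumerate(glyphs)
                  if g.justify == cls and not g.diacritic]
        if not points:
            continue
        share, leftover = divmod(extra, len(points))
        if cls == KASHIDA and share < min_kashida:
            continue  # too narrow for a kashida, fall back to spacing
        for n, i in enumerate(points):
            widths[i] += share + (1 if n < leftover else 0)
        break
    return widths


line = [Glyph(10, KASHIDA), Glyph(0, NONE, True), Glyph(10, KASHIDA),
        Glyph(5, BLANK), Glyph(10, CHARACTER)]
print(justify(line, 8))                                  # default behavior
print(justify(suppress_kashida_at_diacritics(line), 8))  # customized attributes
```

    The point of the sketch is the shape of the customization: edit the per-glyph attributes first, then let the justification pass run on the edited copy.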

    Now I admit this isn't very well documented; about all I could find online is docs like Uniscribe: The Missing Documentation & Examples, which is a decent brain dump of some of the work in Google Chrome to support Uniscribe. It kind of ignores this kind of issue in mixed text, and its specific suggestions indicate that not very much customization is happening there.

    Doing this one yourself can be complicated. Though if you want to try it?

    Going in with clear goals and good sample text to try different customizations out with could allow you to do some fairly impressive work that the default results might not do very well with....

    Now once again this would be a great place for a Uniscribe sample. Though this one can be quite complicated to accomplish; if anyone makes the attempt and wants to discuss what they are finding, let me know and I'll write it up in the future.

  • Sorting it all Out

    A difference that makes no difference shouldn't make a difference, but...

    • 6 Comments

    It was over five years ago that I pointed out that Number format and currency format are not always the same.

    In that blog, I pointed out that one could not necessarily assume that GetNumberFormat and GetCurrencyFormat will return the same results, because the number format and the currency format were two different things.

    And that is entirely true -- it is why there is both LOCALE_SDECIMAL and LOCALE_SMONDECIMALSEP, both LOCALE_STHOUSAND and LOCALE_SMONTHOUSANDSEP, both LOCALE_SGROUPING and LOCALE_SMONGROUPING.

    But it represents a very idealized view of locale data and what is involved with it.

    Perhaps I was just a wild idealist then and now I know better, but with half a decade to think about it, I've decided to rethink this one a little.

    Because there are really only two times that any of these three pairs of constants will return different data for any locale passed to GetLocaleInfo:

    • When there is a bug in the definitions, and
    • When you are looking at the fa-IR locale, aka 0x0429.

    Now the latter case is because it is the one time that LOCALE_SDECIMAL and LOCALE_SMONDECIMALSEP are different (it uses . aka U+002e aka FULL STOP for the former and / aka U+002f aka SOLIDUS for the latter).
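    The comparison described above can be sketched in a few lines of Python. This does not call GetLocaleInfo; it just compares the three pairs in a plain dictionary. The fa-IR decimal separators are the ones given above, while the thousand separator and grouping values in the sample are placeholders I am assuming purely for illustration.

```python
# The three number/currency pairs of LCTYPE names discussed above.
PAIRS = [
    ("LOCALE_SDECIMAL", "LOCALE_SMONDECIMALSEP"),
    ("LOCALE_STHOUSAND", "LOCALE_SMONTHOUSANDSEP"),
    ("LOCALE_SGROUPING", "LOCALE_SMONGROUPING"),
]


def differing_pairs(locale_data):
    """Return the (number, currency) pairs whose values are not identical."""
    return [(num, mon) for num, mon in PAIRS
            if locale_data[num] != locale_data[mon]]


# Sample data: only the decimal pair comes from the post; the rest is assumed.
FA_IR = {
    "LOCALE_SDECIMAL": ".",          # U+002E FULL STOP
    "LOCALE_SMONDECIMALSEP": "/",    # U+002F SOLIDUS
    "LOCALE_STHOUSAND": ",",         # placeholder
    "LOCALE_SMONTHOUSANDSEP": ",",   # placeholder
    "LOCALE_SGROUPING": "3;0",       # placeholder
    "LOCALE_SMONGROUPING": "3;0",    # placeholder
}

print(differing_pairs(FA_IR))
```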

    But just about anyone you talk to from Iran knows that with an exchange rate of over 10,000 Rial to the dollar, it has been a long time since anyone talked about any fractional Rial value (Dinars or otherwise) except in a metaphorical diddly/squat sense. Of those I asked, many were unfamiliar with any expected difference in the decimal separator for the two different cases whatsoever, and those who were familiar with it considered it ridiculous to support for any time in the last century or longer.

    Ironically, I was (many years back) one of the advocates for the Bidi algorithm exception that put U+002f in the "ES" (European Number Separator) category for bidi_class, largely to preserve this largely theoretical behavior. Though I was just one of the chorus on that one; I did not lead the choir. This change helps fractional Rial currency values in Farsi/Persian, an edge case extraordinaire!

    I don't know the exact period in history that fractional Rials became so "theoretical" so I am going to rely on what others have told me for that point.

    Either way, six different fields exist in every locale, even though it could have been three all these years.

    You can contrast this with a similar though separate case -- the one of some of the percent locale fields I described in And you can't set all of the properties all of the time... -- a case where some properties are separate but the underlying data is not (since none of them were ever different in the locale data).

    We could have just saved ourselves the extra complication and not separated things until the need was proven and the difference actually mattered.

    If the same problem came up in the future, I would not advocate the split (in either the interface or the underlying data) until there was proof that the split would be sensible and expected in actual usage.

    Because A DIFFERENCE THAT MAKES NO DIFFERENCE SHOULDN'T MAKE A DIFFERENCE.

  • Sorting it all Out

    Seeing a complex problem is not the same thing as seeing a solution

    • 4 Comments

    I had a lot of discussions and presentations and meetings and conferences and lectures while I was in India. So many that I actually lost my voice twice during that month.

    There are three in particular that stood out for me in a marked way, and that have caused me to really mentally re-assess how I look at the core language and market issues in India.

    They were not the three most important conversations I had in terms of position/status/impact/influence of the people I was interacting with.

    Each of the conversations I am thinking about now took place on successive days of the conference, with people whose English skills on a scale of 1 to 10 (with 1 being essentially no knowledge and 10 being complete fluency) would be at a 1, or maybe a 2 at best. They came from the larger conference on Tamil and saw me in the time that I was leaving the building where lunch was being served and hanging out getting sunburned (thus they were not at the Tamil Internet conference piece of it that I was speaking in, just nearby at the World Classical Tamil Conference).

    These three conversations were only possible due to the help of surrounding people who spoke both English and Tamil to act as interpreters.

    You see, almost all of the people I interacted with in India before, during, and after the World Classical Tamil Conference were in that ~7% that know English.

    But these three were in that ~93% of those who do not.

    That invisible ~93% that no one seems to be looking at, other than perhaps in the sense that they are trying to encourage the widening of the number of English speakers.

    The conversations were very interesting, as were their thoughts about Tamil and the conference. None of them were captives to propaganda and all of them saw a potential that impressed me because of how often those in the 93% are assumed to not be part of the solution, even though in a sense they kind of are the solution.

    Now I don't want to knock the power of learning English if you live in India -- it increases opportunities, and while it may not be the truth that "anybody making more than USD$2 a day knows English", there are powerful reasons to assume it is a truism, since it can be.

    Though few doubt that in Tamil Nadu one can go further with just Tamil. Since even people who know English push for more Tamil. Especially at conferences such as this!

    With a literacy rate of up to 73.5% in Tamil Nadu, focusing the efforts of technology on just the ~7% who speak English is leaving a very large group out in the cold who one can reasonably surmise are educated and interested enough to interact with technology and the wider world, and all they need is technology to "lower its minimum requirements" enough to allow for those who can't speak English.

    Because the opportunities provided to people who know English have a lot more to do with the ability to interact with technology than they do with the language itself.

    This makes the whole ENGLISH barrier feel very artificial.

    Especially in Tamil Nadu, more than almost anywhere else in India (including some parts of India that have other, more pressing, barriers blocking their forward motion).

    It is only the barrier to technology because we (the technology companies) make it the barrier: in most cases we do not provide localization targeted at people other than that 7%, and what we do provide works, at best, as a crutch to help them improve their English.

    I have multiple emails that were sent to me by people who saw the article in The Hindu (ref: Me in The Hindu, aka Clearly not all press is bad press) and wanted to know how they could help the effort to make computers more widely available. These are people who know English and yet recognize the bigger problem and feel that learning English is not the answer.

    Those people do not make a battalion, or even really a squad. But they mean something. Something important.

    I try to make sure to never forget that in Tamil Nadu, English is not the best answer; it is merely the answer that is better than Hindi. No one I talked to that month misunderstood the word Hindification, or what I meant by it, even if they had never heard the term before!

    And in talking to people who had a chance to use the Language Interface Packs for several of the languages of India (especially Tamil, in some cases after I cajoled them into doing so on the first day of TI2010), I found they were by and large not terribly impressed at the overall quality of the translation. I got this feedback specifically for Tamil, Hindi, and Bengali. But I heard secondhand reports that suggest this is a widespread issue.

    That is an issue that I believe is genuine and has a specific cause.

    Brief embracing of a tangent: True story: When the Hindi Starter Edition was being put together, the subsidiary folks in India inquired about the time frame for the new localization to be done. The folks in HQ were amazed that the subsidiary thought there was ever a plan for two separate localizations for Hindi when there weren't separate localizations based on version of Windows components for any language. They put up with it, but they knew that this would make the Hindi and other Indic language Starter Editions of Windows far less than ideal for the bulk of the customers that Starter Edition targeted.

    OK, I guess that wasn't irrelevant enough to be a true tangent, since the theoretical customer of a localized Starter Edition is the very same customer as I am talking about.

    Much of the process apparently is done without a clear target customer in mind.

    This is a bold statement, but I will say it based on the authority of native speakers of these languages, who do not feel there is any target customer that would find the localization to be "made for them."

    It is not good enough for the brand new user (as I said) and it looks wrong to the experienced user who doesn't need it but can appreciate it and would appreciate it if it looked good.

    So in essence, Microsoft spends a bunch of money to put something together (Language Interface Packs) that is free but which in some cases is not built in a way that any of the theoretical target users will consider usable. They get minimal to no feedback, which of course inspires them to make no changes because if they do not hear about problems then there is no impetus to make changes.

    I changed my Windows 7 user interface language to Tamil and showed it to those three non-English speaking people I met in Coimbatore. They all got a quick look at the Tamil UI in Windows.

    All three of them were amazed and delighted to see a computer that was entirely in Tamil, just like I kind of said in that interview for The Hindu. None of them were regular computer users, and none had their own computer at home, but they had notions of what computers were and had clearly seen them before. But as they looked, I watched their delight move to confusion which they covered up as quickly as they could.

    It was clear to me that this was not built for them.

    But I knew now that a lot of people would want something.

    Seeing a complex problem is of course not the same thing as seeing a solution.

    Unfortunately.

    But perhaps there is a solution here. And I just haven't found it.

    Yet.

  • Sorting it all Out

    And here comes Macedonian!

    • 2 Comments

    The Windows 7 Macedonian Language Interface Pack is live!

    Click here to download the Macedonian Windows 7 LIP via the Microsoft.com Download Center.

    Please note that the Macedonian Windows 7 LIP can only be installed on a system that runs an English client version of Windows 7.   It is available for both 32-bit and 64-bit systems on the Download Center.

    A LITTLE BACKGROUND INFORMATION ON MACEDONIAN

    NUMBER OF SPEAKERS

    1.5 – 2.5 million speakers

    NAME IN THE LANGUAGE ITSELF:

    Македонски

    Macedonian is the official language of the Former Yugoslav Republic of Macedonia (FYROM), where it is spoken by about 1.4 million speakers. Another 200,000 native speakers live in Greece, and there are larger numbers of speakers in Serbia, Montenegro, Bulgaria and Albania.  In some areas outside FYROM, Macedonian is not recognized as an independent language and is even repressed (which is the reason why there are only estimates for the number of speakers).

    FUN FACT:

    • Macedonian grammar is markedly analytic (having a low morpheme-per-word ratio)  in comparison with other Slavic languages, having lost the common Slavic case system.

    Click here for more information about the Macedonian language

    CLASSIFICATION:

    Macedonian, an Indo-European language, is a Slavic language belonging to a group of South Slavic languages that also include Slovenian, Serbian, Croatian, and Bulgarian. It is most closely related to Bulgarian. These two languages share several features with Romanian, Greek and Albanian (though those are all from different language families) and form the so-called Balkan linguistic union.

    Click here for more information about Macedonian classification

    SCRIPT:

    Macedonian is written in a modified Cyrillic alphabet with 31 letters, more similar to that used for Serbian than those for Russian or Bulgarian.

    Click here for more information about the Macedonian script

     

    MICROSOFT-SPECIFIC AND NAME-SPECIFIC:

    Now I have previously talked about the lengths Microsoft goes to in order to avoid taking sides. In fact, it was in Who owns English, exactly? that I mentioned Macedonian in the context of a case where Microsoft let the politics of another country entirely (Greece) determine the "English name" of Macedonian:

    A company like Microsoft will often make use of the LOCALE_SENGLANGUAGE, LOCALE_SLANGUAGE, and LOCALE_SNATIVELANGNAME parameters to GetLocaleInfo in ways that are fairly wimpy and non-confrontational in order to avoid taking sides -- thus we have Macedonian (FYROM) as the SENGLANGUAGE, македонски јазик or Македонија as the SNATIVELANGNAME, and whatever individual localizers come up with as the best choice for SLANGUAGE on a per-language basis.

    The most current English LOCALE_SLANGUAGE is the full Macedonian (Former Yugoslav Republic of Macedonia), meaning the English SLANGUAGE does not match the SENGLANGUAGE.

    As Wikipedia says in its Macedonia naming dispute article:

    The Macedonia naming dispute refers to the disagreement over the use of the name Macedonia between Greece and the Republic of Macedonia. Greece opposes the post-1991 constitutional name of its northern neighbour, citing historical and territorial concerns resulting from the ambiguity between it and the adjacent Greek region of Macedonia. Greece also objects to the ambiguous use of the term Macedonian for the neighbouring country's main ethnic group and language. The dispute has escalated to the highest level of international mediation, involving numerous attempts to achieve a resolution, notably by the United Nations.

    The provisional reference the former Yugoslav Republic of Macedonia (FYROM) is currently always used in relations involving states which do not recognize the constitutional name, Republic of Macedonia. Nevertheless, all United Nations member-states, and the UN as a whole, have agreed to accept any final agreement resulting from negotiations between the two countries. The ongoing dispute has not prevented the two countries from enjoying close trade links and investment levels (especially from Greece), but it has generated a great deal of political and academic debate on both sides.

    Negotiations aimed at resolving the dispute are ongoing.

    This is where that FYROM in the LOCALE_SENGLANGUAGE comes from. But it is worth noting that the name is not the UN-brokered "compromise" name, by any means (and in Macedonian the FYROM issue is skipped entirely). Technically this means Microsoft is doing its own version of a "double name" solution, which is the solution most outside of Greece have expressed interest in as some kind of compromise.

    In any case, enjoy the Macedonian Language Interface Pack!

  • Sorting it all Out

    I went to Tatarstan and all I got was this Language Interface Pack

    • 0 Comments

    THE WINDOWS 7 TATAR LANGUAGE INTERFACE PACK IS LIVE!

    Woo hoo! :-)

    Click here to download the Tatar Windows 7 LIP via the Microsoft.com Download Center.

    Please note that the Tatar Windows 7 LIP can only be installed on a system that runs a Russian client version of Windows 7.   It is available for both 32-bit and 64-bit systems on the Download Center.

    A LITTLE BACKGROUND INFORMATION ON TATAR:

    NUMBER OF SPEAKERS

    6-7 million speakers

    NAME IN THE LANGUAGE ITSELF:

    татарча

    The Tatar language is one of the two official languages of the republic of Tatarstan in the Russian Federation (Russian being the other). Tatar is spoken there by around 5.7 million speakers; smaller communities of Tatar speakers can be found in neighboring regions like Bashqortostan, in southwestern Siberia and in post-Soviet central Asia and eastern Europe.

    During the Soviet era, Tatar lost ground to Russian; it is estimated that in the last 30 years of the Soviet Union more than 8 percent of the population of Tatarstan switched from Tatar to Russian as their preferred language. The language of higher education as well as the mass media is still predominantly Russian, and in urban areas more Russian is heard. But the Tatar language is being promoted by an active language policy in the republic, and since the end of the 20th century there has been a renaissance of the language.

    Tatar has a large number of dialects, which can be classified into three major groups: Central, Western/Misharian and Eastern/Siberian. Modern standard Tatar shows features mostly of both the Central dialects (especially in lexicon and phonology) and the Western/Misharian dialects (more in morphology).

    FUN FACTS:

    • In Turkish, the Tatar language is called Turkish Tatar (Tatar Türkçesi) to stress its membership in the Turkic language family.
    • Tatar literature flourished in the empire of the Golden Horde, founded by Genghis Khan's grandson, Batu Khan. The empire existed from the early 13th to middle 15th century.

    Click here for more information about the Tatar language

    CLASSIFICATION:

    Tatar belongs to the Northern Kypchak branch of the Turkic languages, which might belong to the (disputed) Altaic language family. The classification of Tatar itself is not undisputed either (as for most Turkic languages). The closest relative of Tatar is Bashkir; other relatives include Crimean Tatar and Kazakh.

    Click here for more information about Tatar classification

    SCRIPT:

    Until the late 1920s Tatar was written in a modified Arabic script (which did not suit Tatar well and imposed very complex spelling rules). The Latin alphabet introduced then was replaced by a Cyrillic one as early as 1939. The second introduction of a Latin alphabet, which was made official in September 2001, was reverted by the Russian Supreme Court, which argued that a unified script is necessary for maintaining unity in Russia. Therefore Tatar is written in a Cyrillic script with 6 special characters unknown in Russian.

    Click here for more information about the Tatar script

    Enjoy!

  • Sorting it all Out

    The last of the Malays...

    • 2 Comments

    The Malay (Brunei Darussalam) Windows 7 Language Interface Pack is live!

    You can download it from here.

    Please note that the Malay (Brunei Darussalam) Windows 7 LIP can only be installed on a system that runs an English client version of Windows 7.   It is available to download for both 32-bit and 64-bit systems.

    A LITTLE BACKGROUND INFORMATION ON MALAY (BRUNEI DARUSSALAM):

    NUMBER OF SPEAKERS: 

    47 million speakers

    NAME IN THE LANGUAGE ITSELF:

    Bahasa Melayu

    The Malay language is the official language in Malaysia and Brunei, where it is spoken by about 23 million people.  In Brunei, Singapore, southern Thailand, and the southern Philippines it is called Bahasa Melayu "Malay language".  "Bahasa Melayu" was defined as Brunei's official language in the country's 1959 Constitution.

    It is a variant of a language diasystem, having its counterpart in the Indonesian language. Malay/Indonesian has been a trade language for at least a thousand years on the Malay Peninsula and the Indonesian islands; the difference between the two languages started to form only in colonial times, when today's Malaysia, Brunei, and Singapore were under British rule and were influenced by English while Indonesian was influenced by Dutch. The differences are still small enough to make both variants mutually intelligible.

    The grammar of this agglutinative language is rather simple: There is no inflection for either nouns or verbs, no articles are used for nouns, only very few words (those borrowed from Sanskrit) have a grammatical gender, and the plural mostly gets indicated by using a numeral (often with a classifier) or simple duplication (orang, person, orang-orang, people). There are only two different tenses for verbs: the present tense and a form of future tense.

    FUN FACTS:

    • English words of Malay origin include amok, bamboo, compound (from kampong, enclosure), ketchup (originally referring to fish sauce), orangutan (literally meaning forest person).
    • While Indonesian was influenced by Dutch during colonial times, Malay borrowed many words from English. Striking examples for the resulting difference in the vocabulary include akaun (account, Indonesian: rekening), farmasi (pharmacy, Indonesian: apotek) and tiket (ticket, Indonesian karcis, from Dutch kaartje).

    CLASSIFICATION:

    Malay belongs to the Western Malayo-Polynesian languages, along with languages like Javanese, Balinese, which are spoken in Indonesia, Malagasy, spoken in Madagascar, or Tagalog, spoken in the Philippines. The Malayo-Polynesian languages form a subgroup of the Austronesian language family.

    SCRIPT:

    Malay is usually written in a Latin alphabet called Rumi, although a modified Arabic alphabet (Jawi) also exists. Rumi and Jawi are co-official in Brunei.

    You can learn more about Malay here.

    Now all three flavors of this language have their LIPs released: this one, Indonesian, and Malay (Malaysia).

    Enjoy!

  • Sorting it all Out

    Yet another cost to not supporting Unicode?

    • 1 Comments

    It is very hard to type when one has a sprained shoulder. Thank goodness for Dragon Dictate! I'm just saying....

    Over in the Suggestion Box, Alex asked:

    Hi, Michael.

    I've googled through your blog, but haven't found an answer to the following question. This question is very popular (at least in Russia)  and I was surprised that you didn't covered it yet. So, may be you can tell us a story behing it. This issue is about clipboard, text and non-unicode application.

    Take a old non-Unicode application (like Notepad from Win9x) and run it on new Windows (like XP), which have 2 input languages installed (like English and Russian, for example). Suppose that "Language for non-Unicode application" setting is set to Russian.

    In Win9x you can copy text via clipboard from any application to any other application without problem. Sure, old apps don't bother to set CF_LOCALE along with CF_TEXT, but things worked very well then, since the same code page was used by all apps (okay: almost all).

    Now, take a modern "Unicode" OS, like XP. You take your old app, which served you many years, copy text to clipboard, paste it in other application (like modern Notepad) and... whoa: you get question marks or gibberish. What's wrong? Heck, you forgot to _switch keyboard input to Russian_. Once you do that - everything start acting smooth again.

    Example: dl.dropbox.com/.../552912982__krakozyb.gif

    Top row: left - notepad from Win98, right - notepad from XP. Bottom row: left: notepad from XP, right - notepad from Win98. Current input language is set to "English (United states)" (like we forgot to switch it to Russian). Red lines indicate copy/paste operations via clipboard. (I took Microsoft's application to indicate that this is not a bug in particular 3rd party application).

    The problem (as I see it) is getting Unicode text from ANSI-text. Why Windows uses keyboard input method for that, rather than using "Language for non-Unicode applications"?

    This is terrible break in user experience. Most people thinks that this is a bug. I hear this complain very often (heck, I hear cursing Microsoft in almost all cases of mentioning this). It's especially painfull to explain that your application, written a many years ago, has nothing to do with this change.

    Can you shred a bit of the light on the background of this issue, please? Why Microsoft decided to (okay, not to break, but to) compicate lives of zillions of existing applications? If Microsoft cares a lot about backward compatibility - why was such decision made?

    This sounds very familiar.

    I can't quite put my finger on it.

    Oh yeah, I was thinking of Double Secret ANSI, part 1 (Somewhere between ANSI and Unicode) and Double Secret ANSI, part 2 (the brokenest one yet, sorry 'bout that!).

    The Win9x version of this very same feature that actually allowed a tiny bit of cross-codepage stuff to work if people tried hard enough (like Adobe did) was kind of incomplete.

    The NT-based version of it fills in the holes, which are apparently what some people were relying on a little bit?

    Technically the NT-based version of this feature has always been broken though. That makes fixing it kind of a tough sell, as opposed to just supporting Unicode in the apps.

    Historically, we seem to be in the habit of breaking people who aren't using Unicode. Not because of an attempt to sabotage, but just because Unicode support gets a lot better coverage....
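    The gibberish in Alex's screenshot is what you get when bytes written in one code page are converted to Unicode using another. Here is a minimal Python sketch of just that conversion step, assuming Windows-1251 for the Russian text and Windows-1252 as the code page implied by an English (United States) input language. The function name ansi_to_unicode is made up, and nothing here touches the real clipboard or CF_LOCALE.

```python
def ansi_to_unicode(data: bytes, code_page: str) -> str:
    """Model the ANSI-to-Unicode step with an explicitly chosen code page."""
    return data.decode(code_page)


russian = "Привет"
cf_text = russian.encode("cp1251")  # what an old non-Unicode app would hand over

# Converted with the code page that matches the text: round-trips cleanly.
print(ansi_to_unicode(cf_text, "cp1251"))

# Converted with the code page of the "wrong" input language: gibberish.
print(ansi_to_unicode(cf_text, "cp1252"))
```

    The bytes never change between the two calls; only the choice of code page does, which is why switching the input language appears to "fix" the paste.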

  • Sorting it all Out

    Ok, we all know when 6 turns out to be nine. But when does 17 turn out to be 10?

    • 5 Comments

    So does everyone know when 17 equals 10?

    When there is a bug! :-)

    Now this bug was fixed a long time ago.

    It first started happening in XP, and in fact I believe it still happens there.

    When I say fixed, I mean they fixed it in the new version.

    And by new version I don't mean Windows Server 2003, as the bug still happens there.

    They fixed it in all the versions after that.

    How to get the bug? Well let's say you tried to add a bunch of keyboards using KB 289125 to handle Regional and Language Options.

    And your unattend looks something like this:

    [RegionalSettings]
    InputLocale=0406:00000406,0414:00000414,0816:00000816,041d:0000041d,0405:00000405,040e:0000040e,0415:00000415,0408:00000408,0419:00000419,041f:0000041f,040c:0000040c,0410:00000410,0407:00000407,0409:00000409,040a:0000040a,0411:e0010411,0804:00000804

    Yes, there are 17 keyboards there.
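    If you would rather not count by hand, a few lines of Python will split the InputLocale value above into its locale:layout pairs. This only checks the count; it says nothing about what setup does with the list.

```python
# The InputLocale value from the unattend file above, split over two lines.
INPUT_LOCALE = (
    "0406:00000406,0414:00000414,0816:00000816,041d:0000041d,0405:00000405,"
    "040e:0000040e,0415:00000415,0408:00000408,0419:00000419,041f:0000041f,"
    "040c:0000040c,0410:00000410,0407:00000407,0409:00000409,040a:0000040a,"
    "0411:e0010411,0804:00000804"
)

# Each entry is "<locale id>:<keyboard layout id>".
entries = [tuple(item.split(":")) for item in INPUT_LOCALE.split(",")]

print(len(entries))
```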

    When it runs, only 10 get installed, though.

    That seems like a bug, right?

    I wonder if anyone has any ideas what might be happening.

    Well, it is kind of a fun, oopsie kind of a bug that happens when the code gets confused somewhere.

    Perhaps not much of a hint, but I don't want to make it too easy!

    Now although the problem was fixed, the way of doing everything was also totally changed, so the fix probably wasn't needed anyway (the only people who benefited were some internal folks who might have run into the bug back when Vista was Longhorn, etc.).

    Every once in a while someone hits the bug again, and it suddenly occurred to me that I never talked about it before -- so that maybe I ought to write something about it....

    Any guesses?

  • Sorting it all Out

    On Feedback (some positive, and some the other kind)

    • 2 Comments

    The other day, in response to Farsi? Persian? You'll be getting some LIP about it either way, stillife had a very interesting response, excerpted here:

    I've been using (and loving) the LIPs since Windows XP. They are really useful for getting people who haven't mastered English yet to still be connected to the world (like my grandmother). So kudos to Microsoft.

    Now I find the full response to be interesting for several different reasons.

    For today I am just going to deal with one of them; other issues can be done on other days. :-)

    The easiest part of this is the kudos part -- it is always nice to know when people find the work to be useful.

    Not all languages necessarily get lots of feedback, and generally one only hears from people when one knows them or when they are unhappy about something, but a comment pointing out one of the basic and cool uses of Language Interface Packs is nice.

    This is a usage that I wish was pushed more often -- especially for languages where this core idea should be kept in mind during the original localization phase (this is the kind of feedback that I would probably love to hear if I were involved with the actual localization in question, because it is a type of validation of goals that just can keep you happy to be working for the rest of whatever week you find out about it!).

    Now of course it is just short of the most awesome type of response -- hearing from that grandmother (or whoever is directly able to use the LIP and who enjoys it).

    But for a Monday morning "pick me up" this is still pretty cool....

    I'll get to giving some other examples of feedback in this area -- some positive and some the other kind -- soon. But if you are one of those Persian speakers who has tried out the LIP (in any version), what did you think, for yourself and/or for the people you know who may not know as much English? Do you agree with the feedback, or did you see problems?

  • Sorting it all Out

    ...from the Microsoft point of view

    • 5 Comments

    Sometimes it depends on your point of view.

    Gaurav's question was:

    Hi,

    We have a question related to System.Text.Encoding.GetEncoding() API. Encoding.GetEncoding().WebName returns following values:

    Encoding.GetEncoding(1200).WebName        // 1200 represents UTF16 Little Endian
    "utf-16"
    Encoding.GetEncoding(1201).WebName       // 1201 represents UTF16 Big Endian
    "utf-16BE"

    Thus unmarked representation (“utf-16”) is interpreted as Little Endian.
    But according to RFC#2781, and below pasted excerpt, it looks like unmarked representation should be interpreted as Big Endian in the absence of BOM.
    4.3 Interpreting text labelled as UTF-16
       Text labelled with the "UTF-16" charset might be serialized in either
       big-endian or little-endian order. If the first two octets of the
       text is 0xFE followed by 0xFF, then the text can be interpreted as
       being big-endian. If the first two octets of the text is 0xFF
       followed by 0xFE, then the text can be interpreted as being little-
       endian. If the first two octets of the text is not 0xFE followed by
       0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
       interpreted as being big-endian.

    Is there a bug in the API or our understanding is wrong?

    Thanks,
    Gaurav

    As I said, it really depends on your point of view.

    The word SHOULD in standards is an interesting one though.

    In truth, the implied text when you see SHOULD is SHOULD, UNLESS YOU HAVE A REALLY, REALLY, REALLY GOOD REASON NOT TO.

    Now Microsoft really is a Little Endian UTF-16 shop through and through, so much so that you get other weird stuff happening, like I mentioned in unicodeFFFE... is Microsoft off its rocker?. From the Microsoft point of view, that is the really good reason.

    In fact, it only gets truly weird when .Net is on a platform where UTF-16 little endian may not be a sensible assumption, but thankfully such cases are few.

    Or maybe you have such a case in front of you? :-)
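
    To make the two points of view concrete, here is a minimal sketch in Python of the BOM rules quoted from RFC 2781 above, with the no-BOM fallback left as a parameter. It is not how .Net implements anything; the function name decode_utf16 and its default parameter are hypothetical, chosen only to show the little-endian assumption next to the RFC's SHOULD.

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-le") -> str:
    """Decode bytes labelled "UTF-16", honoring a BOM when one is present.

    `default` is the byte order assumed when there is no BOM. RFC 2781
    says this SHOULD be big-endian; this sketch defaults to little-endian
    to mirror the Microsoft point of view described in the post.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # 0xFE 0xFF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # 0xFF 0xFE
        return data[2:].decode("utf-16-le")
    return data.decode(default)                # no BOM: fall back


raw = "Hi".encode("utf-16-le")                 # b'H\x00i\x00', no BOM

print(decode_utf16(raw))                       # little-endian reading
print(decode_utf16(raw, default="utf-16-be"))  # RFC 2781 reading
```

    The same four bytes come out as "Hi" under one assumption and as two unrelated CJK characters under the other, which is why the choice of default matters so much whenever the BOM is missing.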

  • Sorting it all Out

    That Effing font is Not Safe For Work

    • 0 Comments

    It was Annie Colbert who first pointed me to it.

    At first I wasn't going to say anything.

    But the pull of a NSFW (Not Safe For Work) typeface is simply not to be denied.

    And thus I present the Effing typeface for your viewing pleasure:

    You can see the full A-Z here.

    No word on lowercase forms; I imagine some would be easier than others in a metaphor extension kind of way.

    And of course nothing for the rest of Unicode. Though that would be kind of fun to think about, too. :-)

    Some of these letters require much more knowledge of meanings to be understood, though not in the usual linguistic sense of the words themselves. I think this could prove to be pretty distracting....

  • Sorting it all Out

    The song^H^H^H^Hbug remains the same

    • 5 Comments

    In the background, The Song Remains the Same is playing. Ironic much?

    The question came up the other day:

    {REDACTED} uses notepad to look at info from text files that contain redirect output from commands when troubleshooting user systems.  This has worked for them until they added Russian and Greek to their multilang images.

    In Windows 7 (and probably all other versions) on Russian

    If you pipe the output of a command to a text file such as ipconfig.txt


    Настройка протокола IP для Windows

       Имя компьютера  . . . . . . . . . : Xx
       Основной DNS-суффикс  . . . . . . :
       Тип узла. . . . . . . . . . . . . : Гибридный
       IP-маршрутизация включена . . . . : Нет
       WINS-прокси включен . . . . . . . : Нет
       Порядок просмотра суффиксов DNS . : redmond.corp.microsoft.com

    ……

    Opening the file in notepad gives

    I have tried various fonts and Cyrillic script option in notepad but have not gotten complete Russian output.

    Is there a combination that works for notepad?  Word can display these correctly but Office will not be available on all of their systems.

    Greek has similar issues.

    Thanks,

    Regular readers probably know what is going on already.

    Though the mention of fonts in the question may throw people off a skosh!

    The console writes its output in the OEM code page, while Notepad reads non-Unicode files as if they were in the ANSI code page.
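
    Here is a small sketch in Python of that mismatch, using the first line of the ipconfig output quoted above. It assumes the usual pairing for Russian Windows (OEM code page 866, ANSI code page 1251) and uses Python's codecs as stand-ins for what the console and Notepad actually do; the variable names are made up.

```python
text = "Настройка протокола IP для Windows"

# What ends up in the redirected file: bytes in the OEM code page
# (cp866 is assumed here for a Russian system).
oem_bytes = text.encode("cp866")

# What Notepad effectively does with a non-Unicode file: read the same
# bytes as the ANSI code page (cp1251 assumed). errors="replace" guards
# against byte values that cp1251 leaves undefined.
as_ansi = oem_bytes.decode("cp1251", errors="replace")

# Reading the bytes with the code page they were written in.
as_oem = oem_bytes.decode("cp866")

print(as_ansi)   # ASCII survives, the Cyrillic turns to garbage
print(as_oem)    # the original text
```

    Note that the ASCII pieces ("IP", "Windows") come through either way, which matches the "have not gotten complete Russian output" symptom: no font can fix bytes that were decoded with the wrong code page.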

    Personally, the part of the message that excited me most was the fact that more companies that build lots of Windows images are moving into building other language images. :-)

    To be honest, it makes me wonder whether adding one more encoding choice to the Notepad load list and save list for the CP_OEMCP would make sense.

    I mean, given how many commonly used command line tools will have output in these code pages, how much would it really cost to add?

    I think they really ought to add this. For the sake of this very scenario.

    Now on the other hand a fuller work item, adding it to the detection list (described in this blog post), would be a bit more problematic, though.

    I'm going to see if any of the regular readers want to guess why detection based on simple "stupid byte tricks" (apologies to David Letterman!) would be complicated (I'll answer too in the comments, eventually)....

  • Sorting it all Out

    A bit about some Arabic script but not Arabic language stuff...

    • 4 Comments

    All companies both big and small can often see the things they do influenced in interesting ways by the people who are in them.

    In the case of Microsoft, you can see proof of this in support for languages like Persian (aka Farsi).

    Despite the fact that there is no Iranian standard keyboard in Windows, there is a Persian locale and other support.

    This is not for Iran (a place we are not allowed to ship software to); it is nominally for the expat community.

    But even more than that, the particular expats who work for Microsoft are able to act as advocates for much of what is to be done. And thus they are able to help influence decisions relevant to them based on their interest in and advocacy of their language.

    Of course with a large company like Microsoft the same could be said about many different languages, but Persian is in the very small group of languages whose most obvious advocates are people we cannot actually communicate with. But if you look at that list (one that also covers other areas disputed for a variety of reasons, like Myanmar and North Korea), Iran stands out for two very interesting reasons:

    • The largest expat community communicating with us, and
    • The largest population of the members of this group of languages at Microsoft.

    It made me think of blogs like Arabic ≠ Hebrew, and Hebrew ≠ Arabic and in particular the Phase 4 piece of it, as Persian is one of several non-Arabic languages that use the Arabic script and face interesting challenges because of it: for all of the times such a language is "lumped in" with Arabic, some of the time the actual result is very different.

    Another interesting development a customer asked me about on the "Download Languages for Windows" page I blogged about the other day is the following entry:

    Language | Native name | Base language and edition required | Windows 7 | Windows Vista
    Dari | درى | English (any edition) | Coming soon | Not available

    Interesting, right? :-)

    Now Dari can at some point be added to that list of languages that includes Persian, Urdu, and Pashto representing user interface languages in Language Interface Packs!

    Now all of these languages have an issue that is particular to them, and this is an issue I'll be talking about in a future blog (perhaps tomorrow!)....

  • Sorting it all Out

    The Portuguese version. No, the other Portuguese version...

    • 7 Comments

    The question came up just the other day:

    One of our clients needs Windows Server 2003 x64 in PT-PT.

    We have found the PT CD but it is PT-BRA and not PT-PT.

    Do we have a PT-PT MUI?

    Thanks!

    It reminded me of my recent About that Portuguese localization question, redux..., and about the limited number of languages that both Windows Server 2003 x64 and Windows XP x64 were localized into.

    It may be surprising how the decisions are made when fewer languages are included, but they aren't that surprising if we think about it very carefully....

    Alas, there is no European Portuguese version of either of these products.

    Though the 64-bit versions of Vista, Server 2008, Windows 7, and Server 2008 R2 are all available in pt-PT for both SKU and MUI....

  • Sorting it all Out

    It would be like spelling it Anerica or something.

    • 0 Comments

    Now in the past I have talked about Microsoft's relationship with the Unicode Collation Algorithm, in blogs such as:

    And I talked about some of the technical differences between the two, as well as some of the reasons behind the differences.

    In my personal view, the UCA and the tailorings in the CLDR still change a bit too often for my tastes, and I am mostly happy with what Microsoft does. But neither those facts nor the exceptions to the latter that make me say "mostly" are the subject of today's blog.

    Today I'm going to talk about one of the biggest philosophical differences that I see as a blocking issue to the idea of using either the Unicode Collation Algorithm or several other Unicode standards in Microsoft products, except in the case of data being transmitted elsewhere.

    It is not that I am calling them bad -- they are not. It is just that the issue can really block the notion of considering certain operations to be desirable as built-in operations performed on all data.

    Note that in some cases, such operations may be built in already in some applications or APIs -- my opinion on such is either already known and mentioned, or you can likely guess what it would be.

    It is a principle you can see in Microsoft platform pieces, for example in other parts of Windows like NLS encodings and the code in and atop the NTFS file system, and one you can also see in Jet Blue products (e.g. Exchange, Active Directory) and SQL Server.

    It was not explicitly mentioned, but it sort of applies, in Normalization and Microsoft -- whats the story?, too.

    You may be able to guess what that principle is.

    Your guess?

    I'll just mention it, you can pretend you knew what I was getting at all along. :-)

    It is leaving the data alone and not screwing with it.

    Thus not uppercasing or lowercasing text just because you wanted to ignore case (and losing the information), not normalizing to Unicode Normalization Form C or Form D (and losing something unique about the original form it was in), and not removing certain "ignorable" characters (which turn out to change meanings when the characters are gone).

    Just in case you were doing something special with that text that showed different results or looked different.

    Now each of these operations from Microsoft's and/or a language's point of view can really be destructive to data, whether one imagines the stuff I mentioned in this blog or this other blog or 2.3b of Unicode's UAX 31 (Unicode Identifier and Pattern Syntax).

    For that last case, the original assumptions about the ignorability of characters like ZERO WIDTH JOINER and ZERO WIDTH NON-JOINER led to several problems: languages such as Sinhalese were seeing strings broken through normal processes that Unicode was originally recommending, because the full consequences of algorithms and layout/shaping rules (and the interaction of all of them) were not fully understood. Thus the following two strings:

    Text (in case your browser knows how to render this properly) and Unicode code points
    ශ්‍රී ලංකා 0dc1 0dca 200d 0dbb 0dd3 0020 0dbd 0d82 0d9a 0dcf
    ශ්රී ලංකා 0dc1 0dca 0dbb 0dd3 0020 0dbd 0d82 0d9a 0dcf

    The first one of them is meaningful -- it actually is the name of the country Sri Lanka in Sinhala; the other is not. And the UAX lists other similar examples, though none perhaps as top level bad to get wrong as the name of a country.

    Like spelling America as Anerica or something, because of some truncation operation that clipped a letter. Would you want such an operation running on your machine? :-)
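
    To see how little it takes, here is a minimal sketch in Python built from the code points listed above. The helper strip_format_chars is hypothetical; it stands in for any "cleanup" step that drops format characters (Unicode general category Cf, which includes ZERO WIDTH JOINER) on the assumption that they are ignorable.

```python
import unicodedata

# The two strings from the table above, spelled out as code points.
with_zwj = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0020\u0dbd\u0d82\u0d9a\u0dcf"
without_zwj = "\u0dc1\u0dca\u0dbb\u0dd3\u0020\u0dbd\u0d82\u0d9a\u0dcf"


def strip_format_chars(s: str) -> str:
    """Hypothetical cleanup step: drop every format (Cf) character."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")


print(with_zwj == without_zwj)                      # False
print(strip_format_chars(with_zwj) == without_zwj)  # True
```

    One dropped code point, and the meaningful string has silently become the other one.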

    Anyway, the fact that database platforms like Jet or SQL Server (and file system platforms like NTFS) do not normalize means that no version of these products screwed these strings up. And comparison operations worked to treat the equal things as being the same by storing the different forms with the same weights, never transforming the strings as part of the storage or the comparison logic.
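
    The general shape of that idea can be sketched in Python as below. To be clear, this is not how Jet, SQL Server, or NTFS do it: the text above says they assign the same weights to the different forms, while this sketch swaps in a derived comparison key built with unicodedata.normalize and str.casefold as a stand-in. The names compare_key and stored are made up for illustration.

```python
import unicodedata


def compare_key(s: str) -> str:
    """Key used only for comparing; the stored string is never modified."""
    return unicodedata.normalize("NFD", s).casefold()


# Two canonically equivalent spellings, stored exactly as they arrived:
# one with precomposed U+00E9, one with "e" plus combining U+0301.
stored = ["caf\u00e9", "cafe\u0301"]

same = compare_key(stored[0]) == compare_key(stored[1])

print(same)                    # True: they compare as equal
print(stored[0] == stored[1])  # False: the originals are left alone
```

    The comparison treats the two forms as the same thing, yet whoever stored each string gets back exactly the code points they put in.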

    As a point of comparison (by which I mean contrast), I am told that some platforms and databases transform the data and store just one form, since Unicode "rules" allow it and it makes some operations easier.

    Now I could claim that this was because we were smarter, but I would be lying.

    It was just that these platforms and products were formed in the primordial stew before Unicode had ideas like canonical equivalence and ignorable characters and such, then later no one wanted to change anything.

    Partially this may have been laziness, and partially it was a reluctance to change code that worked. But even then the idea of not "screwing up" data was present in the minds of some people. I mean if people took the time to make something different then they may have had their reasons.

    Thus Microsoft has had a long history of not wanting to go the Unicode way, since Unicode's eagerness for process and algorithm and operations, which has messed things up in an earlier version and then fixed them in a later version, feels a bit young at times.

    Of course with a new generation of people in charge of things and those who were there before either gone or just elsewhere, I am clearly speaking of the past; I have no idea if these philosophical principles still guide the product.

    Though I am pretty sure NTFS will still keep working the same way no matter what happens. :-)
