August, 2011

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    And then, the unrelated KB article fixes the problem...

    The question was an interesting one:

    Query raised by client:

    In the last fortnight, some business unit cannot print office document correctly if the document contains Calibri font. The initial thought was to upgrade to the latest version of driver from the printer vendor but it did not resolve the issue. Searching the Internet, I found the following article which appears to resolve the issue but I don't know why. The article has no connection to the problem when reading it initially:

    The characters in an equation are not printed when you print a Word 2007 document on a Windows XP-based or Windows Server 2003-based computer

    They are using Windows XP SP3 with Office 2007 or Office 2010. We have packaged the above kb article but we are a bit uncertain to deploy it without knowing the root cause of the issue, I tried to reproduce the issue but could not reproduce it in the lab. The printing problem is affecting a few hundred users and it could be higher as the problem is hidden for most users if the user do not print document with Calibri font. I hope you could refer this question to someone in Microsoft who could give us an explanation of why the above article seems to fix the issue.

    Customer wants to know how exactly the workaround mentioned in the kb article resolves this issue. Does the workaround updates Calibri Font?

    Any help is highly appreciated.

    There are several things going on here:

    First of all, the KB article itself describes a completely unrelated issue; its workaround just happens to fix both problems.

    Second of all, the KB article fix does not provide an update for Calibri.

    Third of all, I have talked about many fixes that complex script support provides, all of which seem unrelated, like IsComplexEnoughForYou? and IsCrAndLfComplex or what? and What happens when you involve an unenabled Uniscribe with vertical text, given that Uniscribe doesn't handle vertical text? and so on.

    That "Install files for complex script and right-to-left languages (including Thai)" option helps empower many fonts with OpenType features like the C* fonts to behave properly in a large number of circumstances. This is in fact one of the main reasons that the setting was changed to be "always on" in Vista -- to avoid the random problems that the setting being turned on fixes...

    Fourth of all, I continue to be surprised by how often people will look at three different pieces of the technology puzzle:

    • The computer
    • The operating system
    • The version of Office

    and mix and match them in ways that cause situations that were never anticipated when the technology originally shipped to be commonplace.

    Like using C* fonts when the OS doesn't have complex script support turned on (you can probably think of other examples)....

    Fifth of all, I suspect the reason that some problems occur in printing but not on the screen is different Uniscribe versions -- like the private version used by Office versus the one used by lower level print operations (or perhaps most strikingly when the latter Uniscribe is turned off -- since complex script support is).

    Sixth of all, I suspect this is a reasonable situation to consider an Office 2007/2010 bug.

    After adding all of those fonts, why wouldn't Office install complex script support too?

    To avoid these kinds of issues entirely....

  • Sorting it all Out

    When you want to get horizontal or vertical with some character, don't ask Uniscribe where to go

    Vinay asked:

    Hi
    There are cases when copying a glyph from charmap into WORD rotates it 90 degree or sometimes 180 degree "in case of Vertical writing".

    Does this information about rotation come from Uniscribe and if yes,how to get this info??

    Examples of glyphs which are rotated 180 degree in vertical writing are U+2523,U+2517,etc and most other glyphs are rotated 90 degree(braces,brackets,etc).

    Please help me how does uniscribe help me to sort out the degree of rotation for a glyph in case of vertical writing.

    This really isn't a Uniscribe feature at all, Vinay!

    It's a font feature - with the fonts containing the glyphs to use when you are writing in vertical mode.

    When no vertical substitute is specified, the character just gets rotated.

    Now knowing what characters to rotate and what ones not to is something widely understood by font foundries, but there's no Win32 function that has the info (since fonts don't need to call such a function just to do what they already know how to do anyway).

    I suppose it would potentially be interesting data for Unicode to provide as a property. Though it would perhaps then have to be extended beyond Han and Kana and into other scripts as well.

    No one has yet volunteered to write the Unicode Technical Report for the UVA (the Unicode Vidi Algorithm).

    Though UVA makes a great backformation of an idea, doesn't it? :-)

  • Sorting it all Out

    At this point, it's a doc bug....

    Andrew asked:

    Is it a bug in RtlUTF8ToUnicodeN, or a doc bug in MSDN? Thanks.

    User Mode Codes:
    WCHAR  wsUtf16[100];
    ULONG  cwUtf16Written;
    CHAR   chTest1 = 'A';
    CHAR   chTest2[] = {'A','B'};

    status = RtlUTF8ToUnicodeN(wsUtf16,
                               ARRAYSIZE(wsUtf16),
                               &cwUtf16Written,
                               &chTest1,
                               sizeof(chTest1));
    printf("cwUtf16Written = %d\n", cwUtf16Written);

    status = RtlUTF8ToUnicodeN(wsUtf16,
                               ARRAYSIZE(wsUtf16),
                               &cwUtf16Written,
                               chTest2,
                               sizeof(chTest2));
    printf("cwUtf16Written = %d\n", cwUtf16Written);

    Output:
    cwUtf16Written = 2
    cwUtf16Written = 4

    MSDN:
    NTSTATUS WINAPI RtlUTF8ToUnicodeN(
      __out      PWSTR UnicodeStringDestination,
      __in       ULONG UnicodeStringMaxWCharCount,
      __out_opt  PULONG UnicodeStringActualWCharCount,
      __in       PCCH UTF8StringSource,
      __in       ULONG UTF8StringByteCount
    );

    http://msdn.microsoft.com/en-us/library/ee453688(VS.85).aspx

    Regards,

    Andrew

    Now of course that link is just one of those relevant to this function.

    The two links are:

    As this would hint at, there are a potentially huge number of callers of this function, all of whom would be broken if the buffer size variable meaning was changed.

    Therefore, whether this was ever in the noble history of RtlUTF8ToUnicodeN considered a bug in the function, it is officially now just a doc bug.

    Bugs like this that exist in different doc libraries (despite actually being in just one part of the source tree, albeit a part compiled many different times) end up being complicated to keep in sync.

    The vast differences in the two doc topics help indicate that -- and suggest that there may even be a better way to keep them in sync.

    I mean, it wouldn't have helped with this bug, but....
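
    For what it's worth, the numbers in Andrew's output are what you would expect if the third parameter reports bytes rather than WCHARs. Here is a minimal sketch in Python (not the Rtl function itself; utf16_sizes is a made-up helper for illustration) that computes both counts for the same two inputs:

```python
def utf16_sizes(utf8_bytes):
    """Return (byte_count, wchar_count) for the UTF-16 form of UTF-8 input."""
    # Decode the UTF-8 source, then re-encode as UTF-16 without a BOM.
    utf16 = utf8_bytes.decode("utf-8").encode("utf-16-le")
    # Each UTF-16 code unit (a WCHAR on Windows) occupies two bytes.
    return len(utf16), len(utf16) // 2


for sample in (b"A", b"AB"):
    byte_count, wchar_count = utf16_sizes(sample)
    print(f"{sample!r}: bytes = {byte_count}, WCHARs = {wchar_count}")
```

    The byte counts come out as 2 and 4, matching what Andrew saw. Under that assumption, a caller who wants a WCHAR count would divide the reported value by sizeof(WCHAR).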

  • Sorting it all Out

    On swastikas that don't (or at least shouldn't) offend

    As befits a non-practicing, passively atheistic M.O.T. (aka Jew) who has spent time researching religion in both East and South Asia, I look at the swastika through a decidedly odd lens.

    Obviously the atrocities of World War II are unsupportable, and I feel neither need nor desire to support them.

    But knowing the widespread usage connected to Jainism, Buddhism, and Hinduism across so much of East Asia and South Asia that predates Nazi Germany by millennia, the latent desire to support the symbol in these other contexts becomes much more of a temptation.

    Quoting a little from the Wikipedia article:

    Historical use in the East

    The swastika is a historical sacred symbol in Indian religions. It first appears in the archaeological record here around 2500 BC in the Indus Valley Civilization. It rose to importance in Buddhism during the Mauryan Empire and in Hinduism with the decline of Buddhism in India during the Gupta Empire. With the spread of Buddhism, the Buddhist swastika reached Tibet and China. The symbol was also introduced to Balinese Hinduism by Hindu kings. The use of the swastika by the Bön faith of Tibet, as well as later syncretic religions, such as Cao Dai of Vietnam and Falun Gong of China, can also be traced to Buddhist influence.

    Buddhism

    Buddhism originated in the 5th century BCE and spread throughout the Indian subcontinent in the 3rd century BCE (Maurya Empire).

    The swastika symbol (right-hand) is alleged to have been stamped on Gautama Buddha's chest by his initiates after his death. It is known as The Heart's Seal. The swastika figures on the Pillars of Ashoka.

    With the Silk Road transmission of Buddhism, the Buddhist swastika spread to Tibet and China.

    Known as a "yung drung" in ancient Tibet, it was a graphical representation of eternity.

    The paired swastika symbols are included, at least since the Liao Dynasty, as part of the Chinese language, the symbolic sign for the character 萬 or 万 (wàn in Mandarin, man in Korean, Cantonese and Japanese, vạn in Vietnamese) meaning "all" or "eternality" (lit. myriad) and as 卐, which is seldom used. The swastika marks the beginning of many Buddhist scriptures. The swastika (in either orientation) appears on the chest of some statues of Gautama Buddha and is often incised on the soles of the feet of the Buddha in statuary.

    Hinduism

    The swastika is one of the 108 symbols of Hindu deity Vishnu and represents the Sun's rays, upon which life depends. Its use as a Sun symbol can first be seen in its representation of the god Surya. The swastika is used in all Hindu yantras and religious designs.

    Swastika is also considered as a symbolic representation of Ganesha, in Hinduism. Ganesha as per Hindu rites is offered first offerings and as such in every pooja, at first Swastika is made with Sindoor during any religious rites of Hindu.

    Among the Hindus of Bengal, it is common to see the name "swastika" (Bengali: স্বস্তিক shostik) applied to a slightly different symbol, which has the same significance as the common swastika, and both symbols are used as auspicious signs. This symbol looks something like a stick figure of a human being.

    Jainism

    Jainism gives even more prominence to the swastika than does Hinduism. It is a symbol of the seventh Jina (Saint), the Tirthankara Suparsva. In the Svetambar (Devanagari: श्वेताम्बर) Jain tradition, it is also one of the symbols of the ashta-mangalas. It is considered to be one of the 24 auspicious marks and the emblem of the seventh arhat of the present age. All Jain temples and holy books must contain the swastika and ceremonies typically begin and end with creating a swastika mark several times with rice around the altar.

    Jains use rice to make a swastika (also known as "Saathiyo" or "Saathiya" in the state of Gujarat, India) in front of statues in a temple. Jains then put an offering on this swastika, usually a ripe or dried fruit, a sweet (Hindi: मिठाई, Mithai), or a coin or currency note.

    Iran

    In Iran, Golden necklace of three Swastika in Marlik, Gilan province Iran, dates back to first millennium B.C probably symbolising Indian influence there.

    Other Asian traditions

    During the Chinese Tang Dynasty, Empress Wu Zetian (684-704) decreed that the swastika would be used as an alternative symbol of the Sun.

    The Mandarin "wan" is a homophone for the number 10,000 and is commonly used to represent the whole of Creation, e.g. 'the myriad things' in the Dao De Jing.

    In Japan, the swastika is called manji. Since the Middle Ages, it has been used as a coat of arms by various Japanese families such as Tsugaru clan, Hachisuka clan or around 60 clans that belong to Tokugawa clan. On Japanese maps, a swastika (left-facing and horizontal) is used to mark the location of a Buddhist temple. The right-facing manji is often referred to as the gyaku manji (逆卍, lit. "reverse manji") or migi manji (右卍, lit. "right manji") , and can also be called kagi jūji (literally "hook cross").

    In Chinese and Japanese art, the swastika is often found as part of a repeating pattern. One common pattern, called sayagata in Japanese, comprises left- and right-facing swastikas joined by lines. As the negative space between the lines has a distinctive shape, the sayagata pattern is sometimes called the "key fret" motif in English.

    Of course in Unicode only the ideographic characters (U+5350 and U+534D) exist, which means that if you want to support either symbol in a [presumably non-ideographic] font of South Asia, you need to include one or both of these ideographs.

    Anyone who understands East Asian linebreak and width rules can be uncomfortable using the ideographs, for purely technical reasons.

    And while this is not the kind of embarrassing PR flap that the Bookshelf symbol font inspired back in 2004, it certainly would have a similar feel when one is reviewing such a font. It would be easy to see the swastika or the reverse swastika at the end of the list of characters in Character Map.

    Thus it would be easy to take offense.

    I find it troublesome when I consider the impact on Hindus and Jains in South Asia versus the relatively more protected Buddhists in East Asia, yet I find myself lacking the desire to suggest a new character be encoded.

    Even in Germany the relevant law (Strafgesetzbuch section 86a) carves out a huge exception for usage in these religious contexts -- so this ends up as more of a Unicode issue than anything else.

    Since no one else is asking, it's perhaps easier to dismiss this "requirement" as not actually being required.

    But perhaps there are some Hindus or Jains who might disagree.

    I know how I'd feel if U+2721 was being removed from a font, so if they would feel the same then I wouldn't refuse to support their view....

  • Sorting it all Out

    Every character has a story #34: LATIN LETTER T WITH CEDILLA (U+0162/U+0163)

    It's possible to go a long way when you don't even exist.

    Look how it worked out for the Capital Sharp S? :-)

    Some of them get baked into ISO 8859 and Unicode much earlier though.

    Like

    • Ţ (U+0162, aka LATIN CAPITAL LETTER T WITH CEDILLA)
    • ţ  (U+0163, aka LATIN SMALL LETTER T WITH CEDILLA)

    for example.

    Yes, there was a proposal for it to be used in French once upon a time, as per Wikipedia:

    In 1868, Ambroise Firmin-Didot suggested in his book Observations sur l'orthographe, ou ortografie, française (Observations on French Spelling) that French phonetics could be better regularized by adding a cedilla beneath the letter "t" in some words. For example, it is well-known that in the suffix -tion this letter is usually not pronounced as (or close to) /t/ in either French or English. It has to be distinctly learned that in words such as French diplomatie (but not diplomatique) and English action it is pronounced /s/ and /ʃ/, respectively (but not in active in either language). A similar effect occurs with other prefixes or within words also in French and English, such as partial where t is pronounced /s/ and /ʃ/ respectively. Firmin-Didot surmised that a new character could be added to French orthography. A similar letter, the t-comma, does exist in Romanian, but it has a comma accent, not a cedilla one.

    But since it never happened, it doesn't count so much.

    Let's face it, this one was added for the sake of Romanian.

    And we know how that worked out.

    (ref: The history of messing up Romanian on computers)

    Oh well, it will work out some day....

  • Sorting it all Out

    The SQL app that works fine until you have to support Chinese....

    The other day, Serge asked me via the Contact link:

    I coming to you based on a question I have post on SQL forum relative to how can I handle Chinese characters in SQL server tables.

    I have been pointed to your blog by a guy, hoping you could help me cause I could not find any answer for now.

    The scenario is as follow :
    I have an application which is getting different text from an SQL server database. Those database can contains different text language inside and based on the system settings I am fetching from my tables proper language ID identify by its short string ( Ex: US-en would be en)

    We have then now a customer which request to get Chinese characters. For that he has send us its Chinese text in CSV format and then through how application we import those text inside SQL. After importing we have notice that all imported Chinese characters are shown in SQL server tables as "?".

    For now our SQL server version is US running under Windows server 2008 US. We use default SQL server settings for Collation.

    As I have never work with Chinese characters yet in SQL I have no idea how to set SQL server properly.

    Hope you could give me a lift
    Regards
    Serge

    By long-standing convention/decision, the default server collation for the US product is a SQL compatibility collation.

    One that cannot handle Chinese in its non-Unicode columns.

    But rather than run through collation choices that would still lead to limitations (both code page 936 and 950 are missing characters), the better strategy in the end will be to use Unicode columns.

    And of course to use Unicode throughout the process, so you never lose data....
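
    To see where the "?" comes from, here is a minimal sketch in Python rather than SQL. It assumes the non-Unicode column uses code page 1252 (an assumption about Serge's setup; the actual code page depends on the collation), and both helper names are hypothetical:

```python
def store_in_code_page(text, code_page="cp1252"):
    """Simulate a lossy trip through a non-Unicode column."""
    # Characters missing from the code page are replaced with "?",
    # and the original character cannot be recovered afterwards.
    return text.encode(code_page, errors="replace").decode(code_page)


def store_as_utf16(text):
    """Simulate a trip through a Unicode (UTF-16) column: nothing is lost."""
    return text.encode("utf-16-le").decode("utf-16-le")


sample = "\u4e2d\u6587"  # the two characters of "Chinese (written language)"
print(store_in_code_page(sample))       # each character becomes "?"
print(store_as_utf16(sample) == sample)  # the Unicode path round-trips
```

    In SQL Server terms the Unicode path usually means nchar/nvarchar columns and N'...' string literals, with every step of the import (the CSV reader included) treating the text as Unicode.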

  • Sorting it all Out

    The history of messing up Romanian on computers

    It was years ago that I first predicted about Romanian (in blogs like Be careful what you wish for (just in case it comes true!) aka When a Cedilla needs to be a Comma Below (and vice versa)) that, despite claiming to be pleased, people would continue for some time to note problems.

    The most recent proof of that showed up in my inbox the other day, from Cristian Adam:

    Hi Michael, Windows 7 has a few problems left regarding the Romanian S and T comma bellow characters. Read more about them here: http://cristianadam.blogspot.com/2011/08/windows-7-and-romanian-language.html Thank you, Cristian.

    His blog covers many complaints across different fonts.

    I had been working on something else, though -- the real look into the whole history of the problem.

    All of this started many years ago, and I wanted to provide more context.

    In the end, I found someone who did a better job.

    In this timeline from KitBlog, a great blog, the whole history is laid bare. I will copy it here, but it is worth the visit there, and not only for the additional info!

    • 1987. Romanian language is associated with ISO 8859‑2 (Latin 2)—the international standard stipulates S-cedilla and T-cedilla glyphs. Romanian officials are oblivious to the matter. Very, very bad.
    • 1995. Unicode consortium specifies in version 1.1.5 codepoints U+015E (Latin Capital Letter S With cedilla), U+015F (Latin Small Letter S With cedilla), U+0162 (Latin Capital Letter T With cedilla), U+0163 (Latin Small Letter T With cedilla) as suitable for both Turkish and Romanian, and defined them as containing the cedilla accent. Turkish language indeed uses cedilla in U+015E, U+015F but does not make any use of U+0162, U+0163. Romanian language doesn't use any of them. Very bad.
    • 1995. Windows 95 launches with no support for Romanian language by default. Support is available on CD-ROM Extras for Microsoft Windows 95 Upgrade. The typeface ILP Rumanian B100 substitutes Q/q with Ă/ă. Dark ages. Bad.
    • 1997. Apple’s MAC OS 7.6.1 honors Romanian S/s with comma below and T/t with comma below diacritics with MacRomanian (ten years before Microsoft). Interesting enough, its tables do not resolve U+015E, U+015F, U+0162 nor U+0163 (no S/s with cedilla nor T/t with cedilla)—at all! Good.
    • 1997. Adobe Glyph List (AGL 1.0 and 1.1) specifies "Tcommaaccent" and "tcommaaccent" instead of Tcedilla/tcedilla (no resolve for Scedilla and scedilla). The consequence of this decision is that Romanian documents using the (unofficial) Unicode points U+015E/F and U+0162/3 (for Ș/ș and Ț/ț) are rendered in Adobe fonts in a visually inconsistent way using S/s with cedilla and T/t with comma below. Good going bad...
    • 1997. It takes ten years for the Romanian Standards Association (ASRO) to react. In 1997 the association complains to ISO about the S-cedilla and T-cedilla standardization requesting an amendment. Good.
    • 1998. The revised version of ISO/IEC 8859‑2 (Latin 2) is ratified without the requested amendment. A note mentions that "the letters S and T with cedilla below may be used to substitute for the letters S and T with comma below". Very bad.
    • 1998. Adobe switches 015E/F back to T/tcedilla. Defines 0218/9 as S/scommaaccent, 021A/B as T/tcommaaccent before Unicode's 3.0 revision but after Apple's MAC OS 7.6.1. Good.
    • 1999. In its release 3.0 the Unicode consortium adds the mappings U+0218 (Latin Capital Letter S With comma below), U+0219 (Latin Small Letter S With comma below), U+021A (Latin Capital Letter T With comma below), U+021B (Latin Small Letter T With comma below), and defined them as containing a “commaaccent”. Great.
    • 1999. After 12 years The Romanian Standards Association standardizes the right glyphs. The Romanian Standards Association adopts SR 13411 standard that stipulates S/s-comma and T/t-comma as official Romanian letters. Good.
    • 2001. ISO publishes ISO/IEC 8859-16 also known as Latin-10 or "South-Eastern European" incorporating Romanian SR 13411 standard, in spite of strong opposition from USA's representatives and from Mr. J. W. van Wingen, Netherlands' representative. Finally Romanian language's standard form is also the correct one. Good.
    • 2001. Microsoft Office v. X for Mac OS X is released crippled, without support Unicode font display or input. Office documents with diacritics created on Windows won't display properly on the Macintosh. Bad.
    • 2001. Apple immediately aligns their OS X to ISO/IEC 8859-16. Good, but...
    • 2001. Unfortunately, Mac OS X does not recognize the "*commaaccent" glyphnames that are defined by Adobe for Romanian and Baltic languages (such as Tcommaaccent, Rcommaaccent, Kcommaaccent, Ncommaaccent) but instead only recognizes the "*cedilla" names (T/tcedilla, R/rcedilla, K/kcedilla, N/ncedilla) or the "uni****" names (uni0162, uni0156, uni0136, uni0145). This means that Mac OS X will fail to recognize the glyphs T/tcommaaccent, R/rcommaaccent, K/kcommaaccent, N/ncommaaccent and map them to their respective Unicodes. [Adam Twardoch] Bad.
    • 2001. Microsoft along with other software vendors disregards ISO/IEC 8859-16. Ugly.
    • 2001. Microsoft Windows XP is launched. In order to correctly encode and render both S-comma and T-comma, one has to install the European Union Expansion Font Update. Unfortunately, there is no official way to add keyboard support for these characters. In order to type them, one has to either install 3rd party keyboards, or use the Character Map. Bad.
    • 2003. Macromedia Freehand MX (11) is released without OpenType support. Bad.
    • 2003. Adobe releases Creative Suite 1 applications with Unicode support. Designers are able to produce inter-platform Romanian typography without hacking fonts. Good.
    • 2003. People protest against Microsoft practices—most notable is Mr. Cristian Secară with his 2003 open letter to Microsoft Romania (link in Romanian). Good.
    • 2003. After 16 years The Linguistic Institute of the Romanian Academy finally answers. The dormant Linguistic Institute of the Romanian Academy finally honors the request concerning the exact form of the glyphs under letters S and T—says it must be a comma. Very late, still good.
    • 2004. Microsoft Office 2004 for Mac is released with Unicode support. Good.
    • 2007. Six years late and five months after Romania (and Bulgaria) joined the EU, Microsoft releases updated fonts that include all official glyphs of Romanian alphabet. This font update targeted Windows XP SP2, Windows Server 2003 and Windows Vista. Good, at last.
    • 2007. Microsoft Windows Mobile 6 is released without support for comma-below variants of S/s and T/t in any of its three bundled fonts. Bad.
    • 2007. Mac OS X ignores the glyph-to-Unicode mapping provided in the “cmap” table of OpenType PS (CFF/.otf) fonts, while it uses it for OpenType TT (.ttf) fonts. For OpenType PS fonts, Mac OS X uses the glyph-to-glyphname mapping provided in the font and then maps the glyphnames to Unicodes itself. Bad.
    • 2007. The subset of Unicode most widely supported on Microsoft Windows systems, Windows Glyph List 4, still does not include the comma-below variants of S/s and T/t. Bad, as usual.
    • 2008. Some OpenType fonts from Adobe and all C-series Vista fonts implement the optional OpenType feature GSUB/latn/ROM/locl. This feature forces S-cedilla to be rendered using the same glyph as S with comma below. When this second (but optional) remapping takes place, Romanian Unicode text is rendered with comma-below glyphs regardless of code point variants. Good.
    • 2008. Very few Windows applications support the locl feature tag. From the Adobe CS3 suite, only InDesign has support for it. Bad.
    • 2008. Apple updates iPhone OS X to version 2.1, adds Romanian keyboard and correct glyphs for Romanian diacritics. Good.
    • 2008. Some Nokia phones still use incorrect S-cedilla and T-cedilla glyphs. Some Sony Ericsson phones use an uneven T with comma below and S-cedilla combo. Bad.
    • 2010. Apple launches the iPad in the US and a few European countries—from where it is unofficially imported to Romania—without support for Romanian language. iPad's current iOS 3.2 (7B367) can display the full range of Romanian letters with diacritics but lacks the keyboard for typing them. Expected, but annoying.
    • 2010. Current version(s) of Android officially supported by the carriers (1.5–2.1) cannot display correctly all letters with Romanian diacritic marks and are completely unable to generate them (via keyboard) out of the box. In order to make diacritics available for typing, Romanian Keyboard for Android (sporting itself a Turkish S-cedilla on its icon) has to be downloaded and installed. [See comments 5157 below.] Back-to-1995-kind-of-bad.
    • 2010. Romanian-language Wikipedia switches to the correct Romanian diacritic marks. Excellent.

    While one can see many problems attributed to many companies and others over the last few decades, the worst problems in my view were the Romanian standards folks doing so much wrong, for so long.

    Now everyone is dealing with the fallout, as we all have been for years (and will continue to do so for many more)...

  • Sorting it all Out

    The secret missing Unicode letters?

    Longtime reader ReallyEvilCanine asked:

    Michael,

    I'm splashing around in the character and glyph separation cesspool I have a question pertaining to initial, medial, final, and isolated forms.

    While different code points were encoded as presentation forms for Arabic and Hebrew back in 3.0 for backward compatibility and simplicity (respectively), vowels in a language such as Devenagari, were not supposed to be encoded. Instead the character choice and display was to be left to the renderer.

    However it appears that each form of any particular Devenagari vowel now has an assigned codepoint. Was a decision made between Unicode 3 and 5 to encode all possible presentation forms? Have I  stumbled across another "exception to the rule" (which now appears to be the rule rather than the exception)? Is there any language which I can use to demonstrate that a single codepoint may have multiple glyphs which the renderer has to pull from a font?

    Thanks,
    REC

    I'll be honest; I'm at a loss here on what on earth he could be referring to.

    Unicode didn't change its encoding model for the various Indic scripts in general, or Devanagari in particular....

    Anyone know what he's talking about?

  • Sorting it all Out

    I've got your number! Here's how...

    Amul asked me a question via the Contact link:

    I am trying to find a solution to get the decimal value to/from Unicode strings.  I have been searching high and low and came across your blog post from a few years back.
    http://blogs.msdn.com/michkap/archive/2006/10/02/783066.aspx

    My goal is to collate Hindi, Gujarati and Arabic numbers together. (Do warn me if that's a pipe dream :)

    I'm not using Windows, so all I have are the ICU libraries from IBM (and I have no idea what MS uses).  I can convert from Unicode characters to decimal numbers but I don't see anything that'll convert back to Unicode characters.

    Would you know how Windows handles this issue?

    thanks,
    Amul

    Well, there is good news and really good news and bad news here.

    First the good news:

    The managed CharUnicodeInfo class has GetDecimalDigitValue, GetDigitValue and GetNumericValue methods that can be used to get the numeric value of any of the numbers.

    The three methods are derived from the Numeric_Value and Numeric_Type properties from the Unicode Character Database....

    You still have to divine your own "interlaced differentiation", of course.

    And the really good news?

    The collation support on Windows already does the work to collate them together!

    Of course if you aren't using Windows, you can't get that stuff. :-(

    Finally the bad news: although I'm sure ICU also wraps this info, I have no idea how. So if you need the answer about ICU then someone else will have to provide it....
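
    For the "convert back" half of Amul's question, one observation helps: for these scripts the ten decimal digits sit in consecutive code points, so a value can be mapped back by adding each digit to that script's zero. A minimal sketch in Python (not ICU, and not what Windows does internally; to_int, from_int and the ZEROS table are names made up for illustration), using the standard unicodedata module for the forward direction:

```python
import unicodedata

# Code point of the digit zero in each script (illustrative table).
ZEROS = {
    "ascii": 0x0030,
    "arabic-indic": 0x0660,
    "devanagari": 0x0966,
    "gujarati": 0x0AE6,
}


def to_int(digits):
    """Convert a string of decimal digits from any script to an int."""
    value = 0
    for ch in digits:
        # unicodedata.decimal raises ValueError for non-digit characters.
        value = value * 10 + unicodedata.decimal(ch)
    return value


def from_int(value, script):
    """Render a non-negative int using the digits of the given script."""
    if value < 0:
        raise ValueError("only non-negative values are handled here")
    zero = ZEROS[script]
    return "".join(chr(zero + int(d)) for d in str(value))


# Devanagari 10, Gujarati 9, Arabic-Indic 2: sort by numeric value.
mixed = ["\u0967\u0966", "\u0aef", "\u0662"]
print(sorted(mixed, key=to_int))
print(from_int(123, "gujarati"))
```

    Sorting on to_int only orders strings that consist entirely of digits; text that mixes digits with letters would still need a real collation to get the rest right.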

  • Sorting it all Out

    It isn't like bringing Democracy to Cuba, but still...

    In the end, it reminded me of a South Park episode from years ago:

    Sheila: Alright, fine Kyle, you can go to the Raging Pussies concert if you clean out the garage, shovel the driveway and bring democracy to Cuba.
    Kyle: What's Cuba?
    Gerald: A communist country run by a dictator named Fidel Castro.
    Kyle: And do I have to shovel the whole driveway or just the side the car's on?
    Sheila: The whole thing.
    Kyle: Ah jeez.

    It started as a push to get a new requirement added to the Quality Gates, so that failing the test would block the checkin.

    Meeting the folks who would have to do the work here, some pushback was expected given where we were in the cycle.

    But they were going to set the bar pretty high.

    They pointed out that without signoff from Jon DeVaan and "the directors" (a senior collection of distinguished engineers, dev directors, and a VP), it was really too late to get it done.

    So I pondered my next move.

    It wasn't exactly bringing democracy to Cuba, but then the goal was also not as frivolous as a concert, either.

    And we're talking about some really smart people, after all....

    One of whom oversees the folks in charge of the quality gates, and another the underlying technology in question.

    This turned out to be my first solo Exec review (Jan and Dave both had conflicts - I could have delayed, but no sense putting off a trip to the dentist!).

    Anyway, the review went really well, and the meeting with the directors went really well too.

    In fact, the change made it in earlier this month.

    Mainly, I suspect, because some people figured an impossible task was better than a no. :-)

    Well that and the case was compelling.

    And Jan/Dave/Chuck were all very helpful!

    Next step: bring Democracy to Cuba!

  • Sorting it all Out

    Console Unicode support, and a little Hillel....

    • 1 Comments

    Just the other day Igor asked me:

    Hi Michael,
    As a globalization expert, I’d like to ask you : can I say that “Command line does not support Unicode input” in Windows?

    I know it has some restrictions, e.g., it does not support  BiDi, and the OEMCP should be set to according to the ACP for CJK.

    However I can type in Russian and see the text correctly  if I switch to Unicode font on En-US OS with the same system locale.

    I suppose I could have responded with a link dump -- a list of various links from the Blog.

    Or I could have just told him to search for CMD himself or whatever....

    But instead this time I gave a quick, abbreviated response:

    Well, the console can support Unicode, but:

    1. No complex script support
    2. No font substitution
    3. Some console apps don't support Unicode even though they could
    4. 1 and 2 can be fixed by using the PowerShell ISE

    Sometimes it's best to teach the Torah while standing on one foot....
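
    To make point 3 a little more concrete, here is a minimal Python sketch (not anything taken from the console itself) of why an app that pushes its text through a legacy code page can lose characters that the Unicode path keeps. The helper name is hypothetical, and the use of cp437 and cp866 as stand-ins for the OEMCP of an en-US and a Russian system is an assumption made for illustration.

```python
def survives_code_page(text, encoding):
    """Return True if text round-trips through the given encoding unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no slot for at least one character.
        return False


russian = "Привет"

# cp437: assumed OEM code page of an en-US system; it has no Cyrillic letters.
print(survives_code_page(russian, "cp437"))      # False
# cp866: assumed OEM code page of a Russian system.
print(survives_code_page(russian, "cp866"))      # True
# UTF-16LE is the encoding the wide-character (Unicode) console APIs work in.
print(survives_code_page(russian, "utf-16-le"))  # True
```

    An app that only ever writes bytes in the OEM code page is in the first situation whenever the text falls outside that code page, no matter which font the console window is using.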

  • Sorting it all Out

    Having 103, 106, or 109 keys when they may not be expected

    • 0 Comments

    Sometimes Microsoft has a bug.

    It happens.

    I know that you all have trouble believing this!

    Usually, it gets fixed. Sometimes right away, sometimes eventually.

    Did we know about the bug in time? Was it fixed after the impact became clear? Every bug has its own story. Its own battles. Its own mythology.

    But one bug really stands out in my mind.

    That bug is the periodic inability of Microsoft to handle the following keyboards well:

    • Japanese 106-key keyboard
    • Japanese 109-key keyboard
    • Korean 103-key keyboard
    • Korean 106-key keyboard

    Now these problems aren't software layout issues -- they're hardware.

    After setup, you get to them through the hardware device update UI:

    (note that it takes a minimum of NINE clicks to get to the UI here that lets you make a selection), not the UI I usually talk about:

    (it takes just FOUR clicks to get here, which is a little better, I guess).

    Now the problems here are huge, as Cameron Beccario's Installing Japanese Keyboards on Windows XP from 2005 points out.

    And as I have pointed out in blogs like Keyboards: plug-and-play, not plug-and-communicate-what-they-look-like.

    And also as countless bugs from customers and partners and OEMs and IHVs have made clear.

    The bug got "worse" in Vista (the old "text mode setup" was no longer available) and then failed to improve in Windows 7.

    And the problem is simple:

    1. There is no good way to detect what the hardware is, and
    2. There is no good way to understand the problem when the wrong hardware is chosen

    So everything works fine except when your hardware doesn't match, or you buy new hardware that doesn't match.

    The communication with big OEMs like Dell (where it shouldn't be a problem since they completely control the hardware situation and can thus choose the exact driver to use!) has at times broken down badly enough that, for example, it killed their efforts at one point to use the Japanese 109-key keyboard for Japanese customers.

    And the communication with the small IHVs and OEMs is not much better.

    Trivia -- you can plug in multiple monitors or multiple printers and have no problem loading multiple drivers for them -- this does not work for keyboards, which allow only one hardware level driver. There are good reasons for that, with a backcompat bar high enough to require weeks of review for a simple two-line change (let alone a huge architectural one like this).

    But even ignoring the multiple driver issue, efforts to fix this problem (choosing the right driver for the given hardware) have been stopped at every turn:

    • It can't be fixed in the OOBE (Out of Box Experience) wizard since it should have been solved by the OEM before the OOBE wizard ever comes up, and it would introduce an "unnecessary" reboot, and confusing UI;
    • It can't be fixed in core setup because there is no good way to detect what the hardware is and no good UI to let people change it (this used to be a text mode setup option back in the XP and earlier days and didn't help much);
    • It can't be fixed in OEM communications because there is no good way to help OEMs who are getting it wrong understand how to do it right;
    • It can't be fixed in IHV communications because it is too hard to tell when it will be a problem until it is too late (e.g. we never know it's as bad as the 109-key keyboard until it's too late);
    • It can't be fixed in PNP because the hardware here is not identifying what it is well enough to automatically detect which kind of keyboard it is, in either the PS/2 or USB worlds;
    • It can't be fixed in the driver INF files because it would introduce huge backcompat problems, and they too lack the ability to detect what the hardware is in many cases.

    I could go on quoting the various issues from various teams that have over the years been asked to ease the problem, and to be honest I am just getting started with the above list to give you a flavor of the problem.

    But even though I could argue against many of the arguments raised successfully, they would beat me in their next salvo, or the one after that. These people have their reasons, and the vast majority of them are correct.

    And yet we still have this one small set of broken keyboard hardware/software scenarios that belies the claims of my Keyboards: hardware vs. software blog about how well these two teams support stuff.

    Though our best hope lies in improving the OEM/IHV communication channels, as this is the most solvable area in the stack. Stay tuned....

  • Sorting it all Out

    The road to standards compat is paved with app back-INcompat

    • 4 Comments

    The other day, Jacob Schäffer responded to Windows isn't Office (and vice versa) in a comment:

    The fact is that I'm hunting a *stable* way to use Locale Names for lookup *AND* need access to locale data on XP as well as on newer OS versions - using unique Locale Names for input. However, since the Locale Names for some locales can't be directly built on XP - even with good will - this appear to be a problem for me, since the environment I work in is VBA7 in Office 2010 (which - by the way - don't implement ANY way to lookup locale information by Locale Name, but only by LCID).

    What I see is that the Windows environment can NOT deliver a *stable* mapping from Locale Names to Locale Identifiers unless I implement all sorts of workarounds. That's perfectly fine with me, since the world goes on and *if* I need backward compatibility I'm asking for trouble. Period.

    Anyway, please point to an ISO standard that define "029" as a proper country identifier. The MSDN documentation for the LOCALE_SISO3166CTRYNAME says that it should return the ISO 3166 name for a country. I'd like to know which country "029" represent and where to find the ISO standard that says so ???

    Now, let's assume that such a standard doesn't exist. Then, can developers rely on LOCALE_SISO3166CTRYNAME after all, or is the ISO standard insufficient in this regard ???

    All the best /Jacob

    One of the big problems here is that Jacob's Windows versions of interest span the period between when Microsoft's NLS data was pretty much based on Microsoft's terms and when Microsoft decided to try and follow the relevant standards that were in the process of being formalized at the time.

    Most of the actual differences are listed in Microsoft Knowledge Base Article 939949 (Error message when you run an application or try to access a Web site on a computer that has a particular .NET Framework 2.0 software update installed: "Culture name 'Culture' is not supported").

    The article enumerates 13 changes:

    Old culture name    New culture name
    az-AZ-Latn          az-Latn-AZ
    uz-UZ-Latn          uz-Latn-UZ
    sr-SP-Latn          sr-Latn-CS
    az-AZ-Cyrl          az-Cyrl-AZ
    uz-UZ-Cyrl          uz-Cyrl-UZ
    sr-SP-Cyrl          sr-Cyrl-CS
    bs-BA-Cyrl          bs-Cyrl-BA
    sr-BA-Latn          sr-Latn-BA
    sr-BA-Cyrl          sr-Cyrl-BA
    bs-BA-Latn          bs-Latn-BA
    iu-CA-Latn          iu-Latn-CA
    div-MV              dv-MV
    en-CB               en-029

    Most of the first eleven represent Microsoft trying to build names the same way it built LCIDs, in language-region[-script] order. But after the folks who wrote one of the early name RFCs refused to entertain the notion of "name aliases", Microsoft eventually bit the bullet and decided it was better to break backward compatibility with less commonly used locale names than have yet another Microsoft-specific standard....

    The last item was to replace an ill-advised "private" name for "English in the Caribbean" (which had no ISO 3166 name) with a numeric region code: 029, the UN M.49 code for the Caribbean, drawn from the same numbering that the ISO 3166 numeric codes use.

    And the second-last item was to correct a mistake where Win2000 originally thought Divehi had only an ISO 639-2 name, and no ISO 639-1 name.
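
    For code that has to accept names from both eras, one possible workaround is a plain lookup table that folds the old names onto the new ones before doing anything else. The Python sketch below is only that: the mapping is transcribed from the list in the KB article, while the function name and the case-insensitive matching are assumptions of the sketch rather than anything Windows or .NET provides.

```python
# Old-to-new culture names, transcribed from the KB 939949 list.
RENAMED_CULTURES = {
    "az-AZ-Latn": "az-Latn-AZ",
    "uz-UZ-Latn": "uz-Latn-UZ",
    "sr-SP-Latn": "sr-Latn-CS",
    "az-AZ-Cyrl": "az-Cyrl-AZ",
    "uz-UZ-Cyrl": "uz-Cyrl-UZ",
    "sr-SP-Cyrl": "sr-Cyrl-CS",
    "bs-BA-Cyrl": "bs-Cyrl-BA",
    "sr-BA-Latn": "sr-Latn-BA",
    "sr-BA-Cyrl": "sr-Cyrl-BA",
    "bs-BA-Latn": "bs-Latn-BA",
    "iu-CA-Latn": "iu-Latn-CA",
    "div-MV": "dv-MV",
    "en-CB": "en-029",
}

# Lower-cased index so lookups ignore case (an assumption of this sketch).
_BY_LOWER = {old.lower(): new for old, new in RENAMED_CULTURES.items()}


def normalize_culture_name(name):
    """Return the newer name for a renamed culture; other names pass through."""
    return _BY_LOWER.get(name.lower(), name)


print(normalize_culture_name("en-CB"))   # en-029
print(normalize_culture_name("en-US"))   # en-US
```

    This does nothing for the reverse direction (finding the name an older OS expects), which would need the inverted table and a decision about which OS version is being targeted.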

    Can we trust the changes? Well, whether you can or not, you can trust the direction. It's coming from a good place.

    In the end, Microsoft decided to do "the right thing", and all it cost was a bit of compatibility....

  • Sorting it all Out

    If you change the behavior of typing sequences you should never type, is it a bug?

    • 0 Comments

    The Array IME has had a colorful history.

    Especially in the move from the old IMM32 based world to the new Cicero based world....

    Finally things are pretty stable.

    I mean, they are as long as your goal is fast, error-free input!

    Even back in XP and Server 2003 and earlier, it was possible to see lots of notdef glyphs:

    Of course they could not be selected, and hitting the number or trying to choose it would fail.

    It would at least beep to tell you, though!

    Now in Vista and later, the experience is not entirely unfamiliar:

    One big difference, though.

    Now, you can choose these "invalid" choices.

    Mostly they are U+25A1 (WHITE SQUARE), a fitting image often used to represent the notdef glyph.

    Though an actual NULL might have been better -- given the longstanding behavior described in Short-sighted text processing #1: Uniscribe filters nothing and The Sally Kimball Addition To The Dead Keys Conundrum: An Encyclopedia Brown Mystery for illegal sequences.

    Not in the TableTextServiceArray.txt source file. But in the code, somehow?

    Note that this matches the original behavior of the IMM32-based Array IME...

    The actual strings in the source file look like they were probably a straight migration, though the behavior didn't migrate as well. :-(

    From an input point of view, this migration was not entirely unlike the one from Win9x to WinNT described in The Romanian keyboard layout on XP is the brokenest layout of all!

    But in the end, how serious is this change in behavior?

    Well, it depends.

    According to one point of view (which one might even call a "more Asian" point of view?), a change of behavior in invalid/illegal sequences isn't terribly important.

    Because it would be wrong to type them anyway!

    And, as a more concrete example, the original author of the Array IME, who voiced serious concerns about prior functionality loss and bugs in Vista only fixed in SP2 and Windows 7, has never AFAIK seriously complained about the issue...

    Though to someone with as much error-prone typing as I, it may seem harder to dismiss.

    What do people here think about it?

  • Sorting it all Out

    There's no "I" in IDN part 8: Punycode don't do the PUA

    • 5 Comments


    The other day an interesting issue came up while a team worked to provide IDN support:

    Picking an Uri item fails when the input data is something like "http://www.覞嬁ﺞ쫽礗萦笧䶉럼.com".

    Note: the domain name portion is a randomly generated string and it doesn't have to be exactly above. It is repro with another randomly generated string also.

    Also note that if the above string is shortened to "http://www.覞嬁ﺞ쫽礗萦.com", the action works.

    In a way, the answer is in the question!

    On the road to Punycode, there are two different processes going on:

    • Canonically equivalent and otherwise "confusable" characters are all folded down into a subset of "valid" characters, and
    • Characters deemed illegal/invalid in IDN cause the process to fail.

    And characters in the PUA (PRIVATE USE AREA), like U+E216, have no public, agreed-upon context.

    Thus, they have no place in Punycode. Or in IDN.

    This is different but somewhat akin to the behavior of IsNLSDefinedString that I described in Keeping out the undesirables? a few years back...
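
    As a rough illustration of the second bullet above, the Python sketch below refuses a label that contains a private-use character before any conversion is attempted. It is a sketch under stated assumptions: the function names are hypothetical, "PUA" is taken to mean Unicode general category Co, and Python's built-in idna codec (which implements the older IDNA 2003 rules) stands in for whatever IDN library the team in question was using.

```python
import unicodedata


def private_use_chars(text):
    """List the private-use code points (general category Co) found in text."""
    return [
        "U+%04X" % ord(ch)
        for ch in text
        if unicodedata.category(ch) == "Co"
    ]


def to_ascii_label(label):
    """Convert one host label to its ASCII form, rejecting PUA characters first."""
    bad = private_use_chars(label)
    if bad:
        raise ValueError("private-use characters not allowed: " + ", ".join(bad))
    # Python's built-in codec applies the IDNA 2003 preparation and Punycode steps.
    return label.encode("idna").decode("ascii")


print(to_ascii_label("bücher"))        # xn--bcher-kva
print(private_use_chars("abc\ue216"))  # ['U+E216']
```

    Checking up front like this at least makes it possible to tell the user which character caused the failure, rather than just failing the whole Uri.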
