March, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Font substitution and linking #1

    • 22 Comments

    Ok, there will be several posts on this topic, starting from the core support in GDI/Windows and moving concentrically outward to information on usage in Uniscribe, MLang, and Office.

    I'll start with font substitution.

    At the simplest level, this feature is what it sounds like -- simple substitution of one font name with another.

    It starts with a registry key (HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes). In Windows Server 2003 that key contains the following:

    "Arial Baltic,186"="Arial,186"
    "Arial CE,238"="Arial,238"
    "Arial CYR,204"="Arial,204"
    "Arial Greek,161"="Arial,161"
    "Arial TUR,162"="Arial,162"
    "Courier New Baltic,186"="Courier New,186"
    "Courier New CE,238"="Courier New,238"
    "Courier New CYR,204"="Courier New,204"
    "Courier New Greek,161"="Courier New,161"
    "Courier New TUR,162"="Courier New,162"
    "Times New Roman Baltic,186"="Times New Roman,186"
    "Times New Roman CE,238"="Times New Roman,238"
    "Times New Roman CYR,204"="Times New Roman,204"
    "Times New Roman Greek,161"="Times New Roman,161"
    "Times New Roman TUR,162"="Times New Roman,162"
    "Helv"="MS Sans Serif"
    "Helvetica"="Arial"
    "Times"="Times New Roman"
    "Tms Rmn"="MS Serif"

    "MS Shell Dlg"="Microsoft Sans Serif"
    "MS Shell Dlg 2"="Tahoma"

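    These entries can be put into three categories (color-coded in the original post): BLACK for the names that carry a character set suffix, GREEN for the four short names from "Helv" through "Tms Rmn", and BLUE for the two MS Shell Dlg entries: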

    BLACK - these are entries that were formerly used by many applications to combine font family (name) choice with font character set choice (basically the lfFaceName and lfCharSet members of the LOGFONT struct). I'll talk more about lfCharSet and what it used to do (and sometimes still does) another day. But in any case these names are not really used much anymore. When they are used in applications, their presence in the FontSubstitutes subkey makes them work properly.

    GREEN - these entries allow some common abbreviated names to work. Their usage is self-explanatory.

    BLUE - these entries are the ones behind the huge effort to support MS Shell Dlg as a UI font name (also described here and in article 282187 in the knowledge base). In fact, these two entries are the only ones that can be considered useful for more than just backward compatibility with no longer used methodologies. Raymond Chen also has good advice about getting the right font used via DS_SHELLFONT in the articles What's the deal with the DS_SHELLFONT flag? and What other effects does DS_SHELLFONT have on property sheet pages? for those who are interested.

    Of course it seems odd that MS Shell Dlg, documented as a version-independent, language-independent pseudo-font name, seems to be hard-coded to use fonts that do not support all languages. Wasn't it designed to get people away from hard-coding Tahoma or Microsoft Sans Serif? And the answer is yes -- it was. Luckily those font names are affected by a different font mapping technology, font linking, which I will describe in a future post in this series....
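    To make the lookup concrete, here is a minimal Python sketch of name substitution as described above. It is a model, not GDI's actual algorithm: the table holds only a few entries copied from the list, and the matching rules (case-insensitive names, a charset-specific entry tried before a name-only entry) are my assumptions for illustration. The helper names are hypothetical.

```python
# Illustrative model only: a few entries copied from the FontSubstitutes list above.
FONT_SUBSTITUTES = {
    "Arial CE,238": "Arial,238",
    "Arial Greek,161": "Arial,161",
    "Helv": "MS Sans Serif",
    "Helvetica": "Arial",
    "MS Shell Dlg": "Microsoft Sans Serif",
    "MS Shell Dlg 2": "Tahoma",
}


def parse_entry(text):
    """Split 'Face Name,charset' into (face, charset); charset is None if absent."""
    face, sep, charset = text.rpartition(",")
    if sep and charset.strip().isdigit():
        return face.strip(), int(charset)
    return text.strip(), None


def build_table(entries):
    """Index the entries by (lowercased face name, charset or None)."""
    table = {}
    for source, target in entries.items():
        src_face, src_charset = parse_entry(source)
        table[(src_face.lower(), src_charset)] = parse_entry(target)
    return table


def substitute(table, face, charset):
    """Return the (face, charset) to use for a requested face name and charset."""
    key = face.lower()
    # Assumption: an entry with a matching charset wins over a name-only entry.
    if (key, charset) in table:
        return table[(key, charset)]
    if (key, None) in table:
        new_face, new_charset = table[(key, None)]
        return (new_face, charset if new_charset is None else new_charset)
    # No entry: the request passes through unchanged.
    return (face, charset)


table = build_table(FONT_SUBSTITUTES)
print(substitute(table, "Arial CE", 238))
print(substitute(table, "MS Shell Dlg 2", 1))
```

    Note that nothing in a table like this makes "MS Shell Dlg" language independent; it is still one name mapped to one font, which is why font linking is needed.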

    There is another kind of font substitution that is occasionally seen in documentation, which relates to printer drivers and what they do to substitute fonts built into printer hardware when they can. My personal belief (with my admitted bias towards good international functionality) is that it is important to not use this feature, since even printer fonts that accurately handle the basic glyphs seldom have full support for all scripts (not to mention complex scripts!). In fact, one of the first things I do with each new version of Word is find out how to set the Print TrueType Fonts as Graphics setting that allows what is on the screen to be what gets printed rather than using device fonts....

     

    This post brought to you by "ڜ" (U+069c, a.k.a. ARABIC LETTER SEEN WITH THREE DOTS BELOW AND THREE DOTS ABOVE)
    A character not seen in most device fonts!

  • Sorting it all Out

    The WinForms DateTimePicker and MonthCalendar do not support culture settings

    • 8 Comments

    The issue is partially described in the Microsoft Knowledge Base (article 889834) but this article does not tell the full story (and some of what it tells is wrong).

    Let's start with the title and its problems between CurrentUICulture and CurrentCulture:

    The DateTimePicker control and the MonthCalendar control do not reflect the CurrentUICulture property of an application's main execution thread as you expected when you created a localized application in the .NET Framework or in Visual Studio .NET

    Now the MonthCalendar and the DateTimePicker are not based on the UI settings, they are based on the user settings. So even if the control is fully globalized, it would never be based on the UI settings. This is because the date, time, calendar, number, currency, and collation settings are always based on the default user locale (and on CurrentCulture in the .NET Framework). If a localized application's language were to match this, it would only be because the user happened to set CurrentCulture and CurrentUICulture to be the same culture, which is often the case but does not have to be.

    Now the article is smart enough to point out that the user locale settings control the language of the DateTimePicker and the MonthCalendar controls, and it does point out why -- because these two controls are wrappers around the Windows Shell common controls.

    But this is not the full story.

    As imperfect as calendars are in Win32 (cf: Calendars on Win32 -- just there for show.... and Calendars on Win32 -- Not all there yet) and .NET (cf: Calendars.NET -- new platform, new issues), both platforms have serious advantages over the Shell controls, since the Shell DateTimePicker and MonthCalendar common controls only support the Gregorian calendar.

    Thus even if your default user locale settings include a calendar setting where GetLocaleInfo with LOCALE_ICALENDARTYPE returns any of the following values:

    Value Constant                     Meaning                                     
    1     CAL_GREGORIAN                Gregorian (localized)
    2     CAL_GREGORIAN_US             Gregorian (English strings always)
    3     CAL_JAPAN                    Year of the Emperor (Japan)
    4     CAL_TAIWAN                   Taiwan calendar
    5     CAL_KOREA                    Tangun Era (Korea)
    6     CAL_HIJRI                    Hijri (Arabic lunar)
    7     CAL_THAI                     Thai
    8     CAL_HEBREW                   Hebrew (Lunar)
    9     CAL_GREGORIAN_ME_FRENCH      Gregorian Middle East French calendar
    10    CAL_GREGORIAN_ARABIC         Gregorian Arabic calendar
    11    CAL_GREGORIAN_XLIT_ENGLISH   Gregorian Transliterated English calendar
    12    CAL_GREGORIAN_XLIT_FRENCH    Gregorian Transliterated French calendar

    the DateTimePicker and MonthCalendar controls will never go beyond the Gregorian calendar.

    Now the methods and properties on the calendar classes derived from the Calendar class (GregorianCalendar, HebrewCalendar, HijriCalendar, JapaneseCalendar, JulianCalendar, KoreanCalendar, TaiwanCalendar, and ThaiBuddhistCalendar) contain the capabilities to let you create your own calendar control. I'll try and throw such a sample of a calendar together another time.

    Note that the above limitations do not apply to the ASP.Net control (System.Web.UI.WebControls.Calendar). I will cover this control and its capabilities another day....

     

    This post brought to you by "ฟ" (U+0e1f, THAI CHARACTER FO FAN)

  • Sorting it all Out

    Even every version of XP Home is fully internationalized....

    • 18 Comments

    Last night, I received the following question in e-mail:

    Dear Michael.

    I wonder if you can clarify this matter.
    I was under the impression that Tamil Unicode was possible only under XP professional, - since regional lang settings are available in the Control Panel.
    Yesterday I bought a new laptop that came with XP home edition. The regional language settings  was not available initially. I was disappointed and thought of upgrading to XP pro. But I went to the Help Button and saw a link that installed the regional lang setting. I was able to set up Tamil Unicode settings and I am now happily using Tamil Unicode in the preinstalled Microsoft Works..
    The query is how come there is this impression that only Windows 2000 and XP pro support (Tamil)Unicode??

    Thank you  for your time

    Kalaimani
    Singapore

    I reassured him of one important fact here -- that every version of Windows 2000, Windows XP Home, Windows XP Professional, and Windows Server 2003 contains all of the international support, no matter what localized version the SKU is.

    So if you have XP Home then you have the support for Tamil, Georgian, Punjabi, Russian, Greek, Traditional Chinese, Bulgarian, Afrikaans, Catalan, Korean, Basque, Vietnamese, Spanish, Thai, French, Hindi, Japanese, Belorussian, Icelandic, Farsi, Galician, Danish, Ukrainian, Romanian, Simplified Chinese, Swedish, Konkani, Italian, German, Hungarian, Armenian, Lithuanian, Divehi, Swahili, Czech, Dutch, Hebrew, Estonian, Gujarati, and all of the rest of them.

    You may have to install the proper international support to get the language (and some on this list are only available in XP and later), but they are all there waiting for you to use them. Today!

     

    This post brought to you by "ஶ" (U+0bb6, TAMIL LETTER SHA)
    Recently (as of Unicode 4.1) added to Unicode based on a proposal by INFITT (International Forum for Information Technology in Tamil)

  • Sorting it all Out

    I coffee, therefore IFilter (or, Language-specific processing #1)

    • 13 Comments

    Apologies for the title, I still cannot resist that sort of thing. Maybe one day....
    If you have not read it yet, look at Language-specific processing #0 for more info about this series!

    IFilter is one interface that you can use to lower the barriers between the engines that do the work of indexing and the data that may be sitting in proprietary formats. The documentation probably explains it better than I could here:

    The IFilter interface scans documents for text and properties (also called attributes). It extracts chunks of text from these documents, filtering out embedded formatting and retaining information about the position of the text. It also extracts chunks of values, which are properties of an entire document or of well-defined parts of a document. IFilter provides the foundation for building higher-level applications such as document indexers and application-independent viewers.

    Immediately several shipping implementations of a feature like this will come to mind: Full Text Search in SQL Server, SharePoint, Exchange, and Index Server for starters. And then there are those like MSN Desktop Search as well: all of the places where search supports additional file formats. Imagine being able to get in on the fun and make sure your own format is supported for some type of indexing/searching.

    This is a COM interface, so to implement it you have to implement AddRef/Release/QueryInterface as always. The additional methods you have to implement are:

    • IFilter::Init - Initializes a filtering session.
    • IFilter::GetChunk - Positions filter at beginning of first or next chunk and returns a descriptor.
    • IFilter::GetText - Retrieves text from the current chunk.
    • IFilter::GetValue - Retrieves values from the current chunk.
    • IFilter::BindRegion - Retrieves an interface representing the specified portion of object. Currently reserved for future use (for now you would always return E_NOTIMPL).
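    The calling pattern behind those methods can be sketched in a few lines of Python. This is a toy analogue and not the COM interface: the class, the made-up "paragraphs separated by blank lines" format, and the method signatures are all hypothetical, and the two exceptions merely stand in for the FILTER_E_END_OF_CHUNKS and FILTER_E_NO_MORE_TEXT results that a real filter returns. What it does show is the shape of the loop: ask for a chunk, then pull text in buffer-sized pieces until the chunk is exhausted.

```python
class EndOfChunks(Exception):
    """Stands in for the FILTER_E_END_OF_CHUNKS result."""


class NoMoreText(Exception):
    """Stands in for the FILTER_E_NO_MORE_TEXT result."""


class ToyParagraphFilter:
    """Hypothetical filter for a made-up format: paragraphs split by blank lines."""

    def init(self, document):
        # Analogue of Init: start a filtering session over one document.
        self._chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
        self._index = -1
        self._offset = 0

    def get_chunk(self):
        # Analogue of GetChunk: move to the next chunk and describe it.
        self._index += 1
        self._offset = 0
        if self._index >= len(self._chunks):
            raise EndOfChunks
        return {"chunk_id": self._index + 1}

    def get_text(self, buffer_size):
        # Analogue of GetText: return at most buffer_size characters per call.
        text = self._chunks[self._index]
        if self._offset >= len(text):
            raise NoMoreText
        piece = text[self._offset:self._offset + buffer_size]
        self._offset += len(piece)
        return piece


def extract_text(flt, document, buffer_size=8):
    """Drive a filter the way an indexer would: chunk by chunk, buffer by buffer."""
    flt.init(document)
    chunks = []
    while True:
        try:
            flt.get_chunk()
        except EndOfChunks:
            break
        pieces = []
        while True:
            try:
                pieces.append(flt.get_text(buffer_size))
            except NoMoreText:
                break
        chunks.append("".join(pieces))
    return chunks


print(extract_text(ToyParagraphFilter(), "first paragraph here\n\nsecond one"))
```

    The small fixed buffer is deliberate: the caller owns the buffer size, so a filter has to cope with its text being drained in several calls.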

    The general topic about the IFilter interface has pointers to summaries, samples, instructions on building, applying and testing filters, as well as methods to bind to already existing IFilter implementations.

    It is also nice to see such a great effort on the security side -- links and information to help guarantee that ISVs who write code against this interface do it securely. Throughout there are good warnings:

    Caution    IFilters for Indexing Service run in the Local System security context. They should be written to manage buffers and to stack correctly. All string copies must have explicit checks to guard against buffer overruns. You should always verify the allocated size of the buffer. You should always test the size of the data against the size of the buffer.

    That and a link to secure code practices to consider when implementing these interfaces are a welcome touch as far as I am concerned (as it does no good for Microsoft to write secure code if an ISV writes a component with a security issue!).

    Now note that this interface, this IFilter, is not really about language-specific processing as much as it is about format-specific processing. But one of the greatest strengths of a service like MS Search is the ability to apply it to different file formats. It makes IFilter a very important interface to stretch the boundaries of what can be searched.

    And it gives the future topics that deal with the more linguistic aspects of language-specific processing a much wider reach than they would otherwise have. So I will give IFilter an honorary "cool" status that I would usually reserve for things more linguisticalish :-)

     

    This post was sponsored by "F" (U+0046, a.k.a. LATIN CAPITAL LETTER F)
    A letter that realized it would never get to sponsor any of the fun "F" words while I am working for Microsoft, so it thought it should take "Filter" while it was available.

  • Sorting it all Out

    Dere are qvestions? In zat case...

    • 16 Comments

    J. Daniel Smith asked about ToLower() (and ToUpper()) and some trouble he was having with them:

    The comment about Turkish in the docs with regards to "i" doesn't carry a lot of weight with fellow programmers and we only care about 8 languages: English, FIGS and CJK.

    One example that occurs to me is the word "Straße" in German. When upper-cased it should become "STRASSE" (no ß), but I can't seem to get code to do that. Also, being a noun, you can't lower-case this word as nouns always start with a capital in German; "straße" is wrong (unless there is a verb "strassen").

    Windows and the .NET Framework mainly support simple, reversible casing -- which is to say single code point casing that has ToUpper() and ToLower() as inverse operations that can "undo" each other. As such, you cannot use either method to convert between "ß" and "SS".

    Comparison, on the other hand, will handle this case. If you compare "ß" to "SS" with CompareString and the NORM_IGNORECASE flag in Windows or the CompareInfo.Compare method and the CompareOptions.IgnoreCase flag in the .NET Framework, the two strings will be considered equal. Because in truth, they are equal -- just a case pair apart....

    This happens on all locales, not just in German -- because the "ß" (U+00df, a.k.a. LATIN SMALL LETTER SHARP S) is considered to be a simple case difference away from "SS" in the default table. Give it a try!
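    The difference between the two kinds of casing is easy to see outside of Windows too. The Python sketch below is an illustration, not the Windows or .NET implementation: Python's own str.upper() applies Unicode's full (expanding) case mappings, so simple_upper is a hypothetical helper that approximates single code point casing by refusing any mapping that would grow the string, and casefold() is only a stand-in for a case-insensitive comparison such as NORM_IGNORECASE.

```python
def simple_upper(text):
    """Approximate single-code-point uppercasing.

    Any character whose uppercase form would expand to more than one
    code point is left as it is, so the string length never changes.
    """
    out = []
    for ch in text:
        upper = ch.upper()
        out.append(upper if len(upper) == 1 else ch)
    return "".join(out)


def equal_ignoring_case(a, b):
    """Case-insensitive equality using Unicode case folding."""
    return a.casefold() == b.casefold()


word = "Straße"
print(simple_upper(word))                 # length preserved, sharp s untouched
print(word.upper())                       # full casing expands the sharp s
print(word.upper().lower())               # ...and lowercasing cannot bring it back
print(equal_ignoring_case("ß", "SS"))     # comparison treats them as a case pair
```

    The third line is the reason a reversible casing table cannot include the expansion: once "ß" has become "SS", nothing in the string says which "ss" pairs used to be a sharp s.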

    J. Daniel went on further to ask some additional questions:

    In German, there is always an alternate spelling for words with umlauts: "für" is the same as "fuer". However, the converse is not always true; not every "ue" can be replaced with "ü".

    Similarly for "ß", it can always be replaced with "ss" (and must when UPPER-CASING as there is no such thing as an upper-case "ß"). But not every "ss" can be replaced with "ß".

    First, I can't seem to get ToUpper() to turn "ß" into "SS".

    Second, how do I correctly deal with "für"=="fuer"?

    Ok, I think I took care of explaining the deal with the Sharp S. But let me add that this is not a conditional operation -- Windows is neither drawing on huge German dictionaries to avoid treating them with this sort of equivalency nor using machine reading techniques and schoolboy knowledge of German to read the text....

    For the second point, you will want to look at what is known as the German Phonebook Sort -- LCID of 0x00010407. It will have all of the following equivalences in collation:

    Ä == AE
    ä == ae
    Ö == OE
    ö == oe
    Ü == UE
    ü == ue
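    As a rough picture of what those equivalences mean for a comparison, here is a small Python sketch. It only models the six expansions listed above by building a comparison key; real collation involves far more than this, and the function names are hypothetical. Input is normalized to NFC first so that a "u" followed by a combining diaeresis is treated like the precomposed "ü".

```python
import unicodedata

# The six expansions listed above, and nothing else.
PHONEBOOK_EXPANSIONS = str.maketrans({
    "Ä": "AE", "ä": "ae",
    "Ö": "OE", "ö": "oe",
    "Ü": "UE", "ü": "ue",
})


def phonebook_key(text):
    """Build a comparison key in which each umlaut is spelled out."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(PHONEBOOK_EXPANSIONS)


def phonebook_equal(a, b):
    """Compare two strings under the expansions above."""
    return phonebook_key(a) == phonebook_key(b)


print(phonebook_equal("für", "fuer"))
print(phonebook_equal("für", "fur"))
```

    The key only ever expands an umlaut; it never contracts "ue" into "ü", so no dictionary is needed to decide which "ue" sequences are "really" umlauts.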

    You can just think of collation as the technology that will travel to where casing fears to go.... :-)

     

    This post is sponsored by "Ä" (U+00c4, a.k.a. LATIN CAPITAL LETTER A WITH DIAERESIS)

  • Sorting it all Out

    "Michael, why does ToTitleCase suck so much?"

    • 13 Comments

    In the title of this post I am actually quoting email I have received on the topic, mail similar to others I have been sent many times ever since I started posting about case issues over the last few months. And that is one of the tamer ones I have received!

    People seem to hate the TextInfo class for its ToTitleCase method.

    To quote a slightly "nicer" version of the question, someone named Ruben posted the following in the suggestion box:

    Perhaps an article on the problems when using things like ToTitleCase, which is at war with just about any style guide, just loves acronyms (albeit the feeling is not mutual), and breaks spelling for languages like Dutch and Gaelic (e.g., IJmuiden and Oileán na gCapall, which are perfectly regular capitalizations in their respective languages; as an illustration on why Unicode doesn't solve linguistic issues, despite many people's assumptions that it does).

    You can probably see why I put the word "nicer" in quotes, since many will feel as I do that you do not have to use foul language to post text that is harsh and biting!

    For the origins of attempts at this method, we will need to head into the way-back machine to look at the old VB/VBA function, StrConv and its vbProperCase conversion. This function "Converts the first letter of every word in string to uppercase." It does so by defining the word breaking characters as follows:

    The following are valid word separators for proper casing: Null (Chr$(0)), horizontal tab (Chr$(9)), linefeed (Chr$(10)), vertical tab (Chr$(11)), form feed (Chr$(12)), carriage return (Chr$(13)), space (SBCS) (Chr$(32)). The actual value for a space varies by country for DBCS.

    Note that this function shows the same qualities of international ignorance, and even though the function has an LCID parameter, the actual amount of variation between locales is pretty small.
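    A few lines of Python are enough to imitate that definition and to see where it goes wrong. This is a sketch of the documented rule, not the VB implementation: the separator set is the SBCS list quoted above, and lowercasing the rest of each word is an assumption on my part (the quoted description only mentions the first letter), so it is exposed as a parameter.

```python
# Word separators from the quoted description (SBCS values only).
SEPARATORS = {chr(0), chr(9), chr(10), chr(11), chr(12), chr(13), chr(32)}


def proper_case(text, lower_rest=True):
    """Uppercase the first character after every separator.

    lower_rest=True (an assumption, see above) also lowercases the
    remaining characters of each word.
    """
    out = []
    at_word_start = True
    for ch in text:
        if ch in SEPARATORS:
            out.append(ch)
            at_word_start = True
        elif at_word_start:
            out.append(ch.upper())
            at_word_start = False
        else:
            out.append(ch.lower() if lower_rest else ch)
    return "".join(out)


for sample in ("taming of the shrew", "IJmuiden", "Oileán na gCapall"):
    print(proper_case(sample))
```

    The two place names are the ones from Ruben's comment: a rule that only knows about separators has no way to keep the "IJ" of IJmuiden or the lowercase "g" of gCapall.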

    The VB function gets a little better in VB.Net (cf: VB.Net's StrConv), in that it now has a linguistic casing option, which is great for Turkic....

    And then there is the Unicode Standard, which defines the title case property values in Unicode and the Unicode Character Database with the following excerpted quotes:

    "Because of the inclusion of certain composite characters for compatibility, such as U+01F1 "DZ" LATIN CAPITAL LETTER DZ, there is a third case, called titlecase, which is used where the first character of a word is to be capitalized. An example of such a character is: U+01F2 "Dz" LATIN CAPITAL LETTER D WITH SMALL LETTER Z. "

    "The choice of which words to titlecase is language-dependent. For example, "Taming of the Shrew" would be the appropriate capitalization in English, not "Taming Of The Shrew". Moreover, the determination of what actually constitutes a word is also language-dependent. For example, l'arbre might be considered two words in French, while can't is considered one word in English."

    "In most cases, the titlecase is the same as the uppercase, but not always. For example, the titlecase of U+01F1 "DZ" capital dz is U+01F2 "Dz" capital d with small z."

    "There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper, lower, nor titlecase. This format is sometimes called innerCaps, and is often used in programming and in Web names. Once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation. There are also single characters that do not have reversible mappings, such as the Greek sigmas above."

    Obviously Unicode hints at the complexities of title case in languages, but it does not really do much to support it in data (a task that would obviously require dictionaries, rules, and data). This even makes an interesting interview question, for people who are looking into those. :-)

    While the text does talk a mean game, the actual data in Unicode for title casing is limited to a few of the digraphs like DZ (U+01f1, LATIN CAPITAL LETTER DZ), which through the miracle of title casing becomes Dz (U+01f2, LATIN CAPITAL LETTER D WITH SMALL LETTER Z). In this context the Unicode data is little more than making sure that digraphs get their own say....
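    For anyone who wants to poke at that data, the digraph behavior is visible from Python's standard library, which carries its own copy of the Unicode Character Database. This is just an inspection sketch (nothing here is specific to Windows or .NET), and the describe helper is a hypothetical name.

```python
import unicodedata

# The three forms of the DZ digraph, written as escapes so they survive any editor.
DZ_UPPER, DZ_TITLE, DZ_LOWER = "\u01f1", "\u01f2", "\u01f3"


def describe(ch):
    """Code point, character name, and general category for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


for ch in (DZ_UPPER, DZ_TITLE, DZ_LOWER):
    print(describe(ch))

# Title casing the lowercase digraph gives the dedicated titlecase form,
# while uppercasing it gives the all-capitals form.
print(DZ_LOWER.title() == DZ_TITLE)
print(DZ_LOWER.upper() == DZ_UPPER)

# The data says nothing about words, so innerCaps names are simply flattened.
print("McGowan".title())
```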

    Now let us move to the TextInfo.ToTitleCase method, which explains itself as follows:

    Generally, title casing converts the first character of a word to uppercase and converts the rest of the letters to lowercase.

    The returned string might differ in length from the input string. For more information on casing, refer to the Unicode Technical Report #21 "Case Mappings," published by the Unicode Consortium (http://www.unicode.org). The current implementation preserves the length of the string; however, this behavior is not guaranteed and could change in future implementations.

    Casing semantics depend on the culture in use. If using the invariant culture, the casing semantics are not culture-sensitive. If using a specific culture, the casing semantics are sensitive to that culture. Words that are selected for title casing depend on the language.

    If a security decision depends on a string comparison or a case-change operation, use the InvariantCulture to ensure that the behavior will be consistent regardless of the culture settings of the system. However, the invariant culture must be used only by processes that require culture-independent results, such as system services; otherwise, it produces results that might be linguistically incorrect or culturally inappropriate.

    Now, currently the only culturally different casing behavior is the same rule one sees in Turkic languages, as I described in The [Upper]Case of the Turkish İ (or: Casing, the 2nd). While the potential for richer behavior exists such as some of the cases Ruben is referring to, none of them currently happen. But the way is open in the future for such things to possibly happen.
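    For readers who have not run into the Turkic rule, here is a minimal Python sketch of it. Python's built-in casing is not culture-sensitive, so the two helpers below (hypothetical names) patch the dotted and dotless i by hand before falling back to the default mappings; this illustrates the rule only and is not how TextInfo implements it.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def upper_turkic(text):
    """Uppercase with the Turkic pairing: i -> dotted capital I."""
    return text.replace("i", DOTTED_CAPITAL_I).upper()


def lower_turkic(text):
    """Lowercase with the Turkic pairing: I -> dotless small i."""
    text = text.replace("I", DOTLESS_SMALL_I)
    text = text.replace(DOTTED_CAPITAL_I, "i")
    return text.lower()


print("istanbul".upper())            # default rule
print(upper_turkic("istanbul"))      # Turkic rule keeps the dot
print(lower_turkic("ISTANBUL"))      # Turkic rule drops to the dotless i
```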

    This would, however, be an expensive operation to get right in terms of the amount of research that would be required. The help topic is therefore at best optimistic about such work happening. It may be best to set expectations more realistically and not talk about how culturally sensitive this method is (since it is not, at least not yet!).

    Perhaps we could point out how it goes along with Unicode's somewhat vague definition, so that at this point it is really just lame by the transitive theory of developing to a standard.

    It makes a catchy slogan -- do you think we can put

    TextInfo.ToTitleCase -- No lamer than Unicode

    on a T-shirt? :-)

     

    This post brought to you by "NJ", "nj", and "Nj" (U+01ca, U+01cc, and U+01cb, a.k.a. LATIN CAPITAL LETTER NJ, LATIN SMALL LETTER NJ, and LATIN CAPITAL LETTER N WITH SMALL LETTER J)
    (a.k.a. the Unicode UPPERcase, LOWERcase, and TITLEcase forms of the letter)

  • Sorting it all Out

    What does the third letter in GIFT stand for?

    • 10 Comments

    Robert Scoble has a blog. Now he does not need me to link to him, as he gets plenty of attention on his own. :-)

    Anyway, the other day he made a post entitled 'Light blogging week, first look at Longhorn fonts'. In it, he talked about Bill Hill:

    Today we're gonna take a hike around campus with Bill Hill. He was our first interview on Channel 9 (and still one of our most popular). His bit about why you should put only a single space after a period is still one of my favorites.

    Don't know who he is? He's in charge of typography at Microsoft. You know, fonts and stuff. His group is spending millions of dollars in font and font-rendering technology. So, I'm sure we'll talk about the fonts that his group designed for Longhorn.

    Ed Bott has the preview of those. He links over to a Poynter Online article about the new "C-fonts" designed for Longhorn.

    Now I am not going to knock Bill Hill, he is a smart guy and he has a smart team. And he is an engaging speaker as Robert indicates. ClearType is a very cool technology, and my team works with his team on a lot of different things.

    But he is not in charge of all of the typography that happens at Microsoft.

    You see (over in another group at Microsoft) I am on the GIFT team, and GIFT does not stand for Globalization Infrastructure, Flowers and Tools. And that "F" does not stand for "Folk Singers" or "Fabulous" or "Freaking" or anything else like that. It stands for Globalization Infrastructure, Fonts and Tools. Because a few years back MST (Microsoft Typography) merged with some other folks under Julie Bennett to form the GIFT team.

    Now the typography folks have a lot in common with those of us in NLS in that usually people do not really notice us unless something goes wrong. But their work is no less important; in fact I would argue it is often more important, since the utility of collation and casing and encoding is pretty limited if you can't see what the characters are (only people on the NLS team get good at speaking fluent question mark or square box). And it is the folks in typography who have been making it all happen for a lot of years.

    I could talk about the millions of dollars that they are spending on fonts in Longhorn for new languages (only some of which I can even talk about yet!).

    I could talk about all the work they did for Windows XP SP2 to add support for Bengali and Malayalam (referred to in Lions and tigers and bearsELKs, Oh my!).

    I could talk about all of their efforts for Sinhalese in advance of Longhorn (referred to in Doing a little more in Sri Lanka....).

    I could talk about the next dozen fonts that they are working on for new languages in future Windows updates and in Longhorn.

    I could talk about all of the free tools they release, like the Web Embedding Fonts Tool (WEFT) and the Font properties extension.

    I could talk about all of the developer tools they release, from Microsoft Font Validator to Visual OpenType Layout Tool (VOLT) to OpenType Layout Services Library (OTLS) to Visual TrueType (VTT) to the OpenType Font Signing Tool to Font Properties Editor to the OpenType Embedding SDK to the many other font development tools.

    I could talk about the shaping engines that they produce for both Uniscribe and Avalon, that help make the most out of the new fonts for the languages that we support.

    I could talk about how they own the OpenType specification, about which I know just enough to realize the extent to which I am "only an egg" compared to the typographers down the hall.

    I could talk about folks like Simon Daniels, who have spoken at GDDC and tons of other conferences. I wish I could find a pointer to some of the slides he has done, or better yet video of one of his presentations, since you have to see him speak to get the full effect of the work that he describes.

    I could talk about the large community of OpenType developers out there and the exciting work that OpenType enables.

    I could talk about all of the OpenType training that they do, around the world.

    I could talk about the cool font that they helped us to deliver for MSKLC that we use to give a visible display to characters that have no visible representation, from U+0020 to U+180b to U+034f and so on.

    I could even talk about some of the many other issues that Ning, Simon, Paul, Peter, Ali, Judy, Carolyn, Julia, Cathy, Adam, Vinay, Dave, Mushegh, Nick, Sergey, David, Michel, or the rest of the typography folks are dealing with even as we speak to make the font story at Microsoft a good one, for Longhorn and for everything else.

    Or maybe I have said enough to convince everyone that the "other Typography team" at Microsoft is also a place where important work is happening for Longhorn and beyond. Even if Channel 9 has not yet paid them a visit.... :-)

     

    This post is brought to you by U+034f, a.k.a. COMBINING GRAPHEME JOINER.

  • Sorting it all Out

    Post categorization (and people who pick up the feed)

    • 9 Comments

    Ok, a poll for the people who read here regularly....

    I was pinged by a few people to register this blog as being involved with the various technology areas that it touches (Windows, the CLR, SQL Server, Office, AD, globalization, etc.) and I did so.

    But note that my post categories (the list over to the left) are based on various internationalization topics, not on technology.

    This caused someone to post a comment suggesting that I fix the bug that caused the post to be listed under an unrelated technology.

    Well, they are right, no argument about that.

    To fix this, I have several options:

    1. I can add categories for the various technologies and then subscribe them as needed (seems messy since it means I will double the number of categories!)
    2. I can remove the current categories and replace them with new technology based ones (seems ugly since it does not really match how I think about stuff and much of it transcends product boundaries even if I do not call it out explicitly)
    3. I can do nothing and leave it all as it is now (seems like a bad idea if people are going to be unhappy)
    4. I can unregister in all of those product areas and just hope people wander over when they need to (will make the original folks who were pinging me unhappy)

    I guess I am leaning toward (3) or (4) at this point, slightly more towards (4) since I am not trying to maximize hit count; I'd rather be on fewer radars than too many, if I have to choose.

    Anyone have any thoughts? Does anyone use the old categories or are they sensible only to me? Would people prefer new product-based ones, instead?

  • Sorting it all Out

    In TV and movies, language is often done without thought

    • 8 Comments

    On the Language Log, Bill Poser posted about the use of Chinese in a particular episode of Law & Order in his post Chinese in Law And Order:

    Television is confusing. I was watching Law and Order a little earlier. It was the episode in which the police find a little Chinese girl and her baby sister alone in their apartment, their mother missing. The story is about what has happened to her. The Chinese-speaking detective and the little girl converse in Mandarin, and so do the little girl and her aunt. Near the end, when they locate the little girl's teenage sister, she and her aunt speak Mandarin with each other. But when the aunt goes into a shop in Chinatown to consult the owner, they speak Cantonese.

    He then points out the problems with this whole scenario.

    This scenario seems unrealistic to me. That the man in Chinatown should speak Cantonese is what I'd expect. Most Chinese immigrants to the US until recently spoke Cantonese. Recent immigrants include many Mandarin speakers, so it isn't a surprise that the girls and their aunt spoke Mandarin. Indeed, just recently I had what to me was the rather odd experience of encountering a little girl, maybe 8 or 9, in a shop in Chinatown, who spoke neither English nor Cantonese. We spoke Mandarin (she rather better than me - yet another area in which age and academic degrees don't help).

    What is odd is that the aunt spoke Cantonese with the man in Chinatown. Of course, many Cantonese-speakers learn Mandarin as a second language, so bilinguals are not rare, but it is quite unlikely that a Cantonese person who also knows Mandarin would speak Mandarin with her nieces. People who are basically Mandarin speakers rarely speak Cantonese; if they do it is usually because they have moved to a Cantonese-speaking area. The only other hypothesis that I can think of is that the adults are first-language Cantonese speakers who have learned Mandarin as a second language and who so strongly identify with Mandarin as the language of modernity that they have spoken Mandarin with their children and nieces. I guess that's possible, but I haven't ever met anyone like that. In my experience, Cantonese speakers always prefer Cantonese. They may make an effort to learn Mandarin because they perceive it as advantageous to know, but they would never use it with their children.

    It is often a mistake, however, to try to ascribe higher motives to writers of a gritty television show filmed in New York.

    So, I'm wondering whether the Law and Order folks had in mind some interesting scenario that would explain the choice of languages in this episode, or whether they just don't know one kind of Chinese from another, or don't think that anyone will notice.

    The latter, I would say.

    It is a bit like the work Mark Okrand did for Paramount in creating an entire Klingon language (for which he later created a dictionary). Dr. Okrand was once in school with Ken Whistler, whom I have talked about previously. And there are times that he may regret the fact that most of the links in Google Scholar pointing to him relate to scholarly work about a language that does not exist and whose principal speakers wear rubber protrusions on their foreheads when they speak it. Cornelis Krottje notes in his revisionary proposal of the Klingon Dictionary:

    The current dictionary of Klingon (Okrand, 1992) is a bilingual, bidirectional dictionary, consisting of a passive Klingon-English section and an active English-Klingon section. We will maintain this nature of the dictionary; the alternative, an active Klingon-English section and a passive English-Klingon section, is unrealistic, simply because of the fact that native speakers of Klingon do not exist.

    But note that despite the recognition of all of this by the lucid speakers of the language, the fact is that most of the Star Trek episodes that have involved Klingons since the original Star Trek movie for which Paramount commissioned Dr. Okrand have done so without any linguistic guidance. The script is used randomly on ships and controls, and the language used seldom matches the actual language beyond single words like nuqneH that the Klingon Language Institute has not yet managed to make as common in English as words like grok.

    The writers of Law & Order probably did not have any deep motives or hidden scenarios for what they did. I frankly doubt they even really knew that the actors did this. Perhaps it was just an easter egg that they produced for the show? :-)

    This post brought to you by "𠀀" (U+20000, the first Extension B ideograph meaning "the sound made by breathing in; oh!")


    Some people should feel ashamed of themselves

    • 7 Comments

    According to Bill Vaughn in his post Petitions and other Silliness:

    When I visited the speaker’s lounge I felt like a caged bear with kids poking me with sharp sticks. It seems that the Microsoft folks in attendance took exception to the Visual Basic 6.0 petition that I signed along with a number of other MVPs. A couple implied that I would be lucky to keep my MVP status because I chose to speak up.

    If you were one of those folks who implied such a thing, then shame on you.

    An MVP is not an unpaid shill. He or she is a Most Valuable Professional. This is a program which (as this site says) "...is a worldwide award and recognition program that strives to identify amazing individuals in technical communities around the world. Microsoft MVPs are recognized for both their demonstrated practical expertise and willingness to share their experience with peers in Microsoft technical communities."

    I used to be an MVP. And this is simply not done. I know of one incident (a few years ago) where a Product Manager implied retribution for public statements an MVP made (ironically it was also about Visual Basic). I also know that, after feedback from several internal and external people, the Product Manager was himself punished and required to apologize for doing it.

    Whoever did this was entirely out of line.

    Bill is actually a former employee of Microsoft who has forgotten more about several MS products than many of the people who are employees here will ever know.

    Even the thought of someone using their position as an employee to try to unfairly influence someone in this way pisses me off to no end. This is not why we are encouraged to go out and speak at conferences. As I said before, I did not even sign the freaking petition but I swear I am tempted to do it now, just to see if someone has the nerve to claim I do not have the right to do so. If I did not have philosophical qualms about what the petition was trying to do, I would probably do so right now, and dare these people (whoever they are) to try to show me the door.


    Before you find, or search, you have to *index* (or, Language-specific processing #0)

    • 9 Comments

    (I call this post #0 since it is more of an introduction to a topic that I will be returning to on a regular basis over the next few months.)

    Back at the end of 2000, I had a meeting with the lead international program manager of SQL Server. One of the architects in the group had written a multi-page email describing the collation support in SQL Server 2000, and the PM wanted to include some more information about other parts of the SQL Server product, so that there would be a single place where all there is to know about SQL Server's international support could be found.

    They estimated it would be about 10-15 pages. I put together an outline and let them know that to cover all of the topics on that outline it would actually be more like forty. They were a little staggered by the outline, but the "cover everything" idea was theirs, not mine. So they accepted my updated number.

    Turns out we were both wrong.

    The finished white paper International Features in Microsoft SQL Server 2000 came out in April 2001 and clocked in at somewhere between 57 and 65 pages, depending on whether you had the HTML version or the Word .DOC file version (content is the same, they just format pages and margins differently).

    They got ripped off. It was a really fun project, I would probably have done it for nothing had I known how much fun it would be. I probably would have paid them for the point when the person in SQL Server marketing wanted to talk about my over-use of the word unfortunately when I talked about limitations. :-)

    Now a lot of what was there, I already knew. For those topics the white paper was a chance to get it all down in one place. (There was also going to be a book by Sams Publishing, independent of this white paper, entitled Internationalization with SQL Server, but the publisher decided the market was not big enough to sustain the book. So they paid off my advance when the book was only about 10% turned in and 50% done. My Acquisitions Editor (Sharon Cox) had left Sams so I was not at all put out by this. Especially when they paid me off. :-) )

    Anyway, there were a few topics that were new to me, and one of those was the Microsoft Search service, which sits underneath SQL Server's Full-Text Search, Index Server, and Exchange Full-Text Search. And SharePoint. I had a few amazing conversations with Margaret Li where I learned about the work that the word breakers and stemmers do for the various languages supported by the engine underneath these search technologies. At the end, she pointed out that Nadine Kano first obliquely hinted at the interfaces that one would use to do this work in Developing International Software for Windows 95 and Windows NT:

    Line-breaking and word-wrapping algorithms are important to text parsing as well as to text display. The rules for Asian languages, however, are quite different from the rules for Western languages. For example, unlike most Western written languages, Chinese, Japanese, Korean, and Thai do not necessarily indicate the distinction between words by using spaces. The Thai language doesn't even use punctuation. For these languages, software applications cannot conveniently base line breaks and word-wrapping algorithms on a space character or on standard hyphenation rules. They must follow different guidelines.

    Because the Win32 full-text search engine for Microsoft WinHelp recognizes that word wrapping is more complex for some languages than for others, it supports the IWordBreak OLE interface. That way, if a third-party developer creates a superior word-wrapping algorithm for any language, the WinHelp engine can take advantage of it through OLE.

    The only problem is that the interface was not yet public. Oops!

    Margaret did tell me that her team was willing to publish it but that they needed the time to get it done (and there never seemed to be enough time). Luckily someone did find the time, because today you can read all about the interfaces right on MSDN. And in this blog, in this series....
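
    To make the point in the quoted passage concrete, here is a minimal Python sketch of why splitting on spaces is not enough for a language like Thai. It is only an illustration under stated assumptions: the function names and the four-word toy lexicon are hypothetical, the greedy longest-match approach is a stand-in technique, and none of this is the actual word breaker interface or the algorithm used by the Microsoft Search service.

```python
def whitespace_break(text):
    """Split on whitespace, as a naive Western-only indexer might."""
    return text.split()


def longest_match_break(text, lexicon):
    """Greedy longest-match segmentation against a toy lexicon.

    Any character the lexicon does not cover is emitted on its own.
    """
    longest = max((len(word) for word in lexicon), default=1)
    words = []
    position = 0
    while position < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(longest, len(text) - position), 0, -1):
            candidate = text[position:position + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                position += size
                break
    return words


# Hypothetical sample data: roughly "the cat is on the roof" in Thai.
thai_words = ["แมว", "อยู่", "บน", "หลังคา"]
thai_sentence = "".join(thai_words)  # written with no spaces between words

print(whitespace_break("the cat is on the roof"))  # one entry per English word
print(whitespace_break(thai_sentence))             # the whole sentence as one "word"
print(longest_match_break(thai_sentence, set(thai_words)))
```

    An index built from the whitespace version would only ever match the entire Thai sentence, which is why a language-aware component has to sit between the text and the index.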

    This post will be the first of what will be many articles on this fascinating area that is a cousin of collation and an uncle of search, but which has many interesting features and issues of its own.

    The interfaces I will be talking about here are ones that are used by MSN Desktop Search, as I talked about previously in Give me a [word-]break! Imagine for a moment -- perhaps the act of creating such a component might one day allow components like MSN Desktop Search, SQL Server Full-Text Search, or any of the others to index content for you using the rules of your own language.

    That is undeniably cool, is it not?

    And it definitely falls into both of the categories of opening it all up and getting out of the way. :-)

     

    This post brought to you by "L" (U+004c, LATIN CAPITAL LETTER L)
    Because L is for Language and this letter just couldn't stay away from a cool topic like this one!


    Code pages are really not enough....

    • 11 Comments

    Helen Custer, in Inside Windows NT, describes the situation back then in an interesting way:

    The lowest layer of localization is the representation of individual characters, the code sets. The United States has traditionally employed the ASCII (American Standard Code for Information Interchange) for representing data. For European and other countries, however, ASCII is not adequate because it lacks the common symbols and punctuation. For example, the British pound sign is omitted, as are the diacritical marks used in French, German, Dutch, and Spanish.

    The International Standards Organization (ISO) established a code set called Latin1 (ISO standard 8859-1), which defines codes for all of the European characters omitted by ASCII. Microsoft Windows uses a slight modification of Latin1 called the Windows ANSI code set. Windows ANSI is a single-byte coding scheme because it uses 8 bits to represent each character. The maximum number of characters that can be expressed using 8 bits is 256 (2^8).

    A script is a set of letters required to write in a particular language. The same script is often used for several languages. (For example the Cyrillic script is used for both the Russian and Ukrainian languages.) Windows ANSI and other single-byte coding schemes can encode enough characters to express the letters in Western scripts. However, Eastern scripts such as Japanese and Chinese, which employ thousands of separate characters, cannot be encoded using a single-byte encoding scheme. These scripts are typically stored using a double-byte encoding scheme, which uses 16 bits for each character, or a multibyte encoding scheme, in which some characters are represented by an 8-bit sequence and others are represented by a 16-bit, 24-bit, or 32-bit sequence. The latter scheme requires complicated parsing algorithms to determine the storage width of a particular character. Furthermore, a proliferation of different code sets means that a particular code might yield entirely different characters on two different computers, depending on the code set each computer uses.

    I thought it was interesting the way some of the technology terms were framed. It definitely does not fit the terminology we use today for several different terms. But what really caught my eye was the implicit idea that each of these code pages was enough for a language, and that the only real problems were the lack of good cross-code page support and the difficulty of parsing some of the more complex cases.

    The truth is much further from these points than you might guess. Because there are very few languages for which a code page (especially one of the 'Windows ANSI' code pages) actually has adequate coverage. I'd say that these code pages are perhaps 'good enough' for some languages but do not really contain all of the characters one might want to use to fully express information in most languages. Unicode in this context becomes more than just a luxury -- if you are missing letters you need in your language then it becomes a necessity.
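
    As a rough illustration of the coverage problem, the sketch below uses Python's built-in codecs as stand-ins for the Windows code pages. The helper names and the sample strings are my own assumptions for the example, and the Python codec tables may differ in small ways from what a given Windows version does (Windows conversion functions can apply best-fit mappings, for instance).

```python
def can_encode(text, code_page):
    """Return True if every character in text has a mapping in code_page."""
    try:
        text.encode(code_page)  # strict error handling is the default
        return True
    except UnicodeEncodeError:
        return False


def missing_characters(text, code_page):
    """List the characters in text that code_page cannot represent."""
    return [ch for ch in text if not can_encode(ch, code_page)]


# Hypothetical samples: the copyright sign, and the Welsh word for "water".
for sample, code_page in [("©", "cp1252"), ("©", "cp874"), ("dŵr", "cp1252")]:
    print(code_page, repr(sample), "missing:", missing_characters(sample, code_page))
```

    The Western European page (cp1252) carries the copyright sign but not the Welsh w-circumflex, while the Thai page (cp874) does not carry the copyright sign at all. A "good enough" code page still drops letters that real text in the language needs.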

    There was a recent thread in the microsoft.public.win32.programmer.international forum entitled "Developing ANSI application for multi-national Windows" where someone was strongly advocating not moving to Unicode because they believed their application (written in C, over 1 million lines, with over 50,000 strings, heavily relying on pragmas giving the code page and locale per source file to get their work done) was better served by keeping it all out of Unicode and relying on code page support. Of course almost immediately there were problems:

    My biggest wonderment, which perhaps you can answer or even solve, is why a non-Unicode localized application (for MBCS languages) will only run properly if the *system* default locale is set to the proper language.

    I run the international versions of XP and 2000, but only Unicode applications run properly unless the system default locale is set; there are no provisions that I have found that let me say, "This application uses Japanese.Japan.932." Dialog boxes, drawn text, and other problems are abundant.

    These issues are obviated by Unicode, but for a project my size that is an undertaking that will take quite a while and detract from product enhancements that are necessary for the marketplace.

    Though people did point to AppLocale as a workaround, the fundamental problems in trying to make a complex application work with such methods will (in my opinion) quickly outweigh the "benefits" of avoiding the move to Unicode. Because in the end, code pages are not really enough....
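
    The complaint quoted above comes down to this: a non-Unicode application hands the system raw bytes, and the system interprets them using whatever its default code page happens to be. A minimal sketch, again using Python codecs as assumed stand-ins for the Windows code pages (the variable names are hypothetical):

```python
text = "日本語"  # "Japanese language"

# What a non-Unicode application built for code page 932 would store.
stored_bytes = text.encode("cp932")

# The same bytes read back on two differently configured systems.
on_japanese_system = stored_bytes.decode("cp932")
on_western_system = stored_bytes.decode("cp1252", errors="replace")

print(on_japanese_system == text)  # the round trip works when the code pages agree
print(on_western_system == text)   # the same bytes are misread under another code page

# With Unicode there is no per-system interpretation to get wrong.
print(text.encode("utf-16-le").decode("utf-16-le") == text)
```

    Nothing in the byte stream itself says "these bytes are code page 932", which is why the application only behaves when the system default locale matches.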

     

    This post brought to you by "©" (U+00a9, a.k.a. COPYRIGHT SIGN)
    One of the most common code points people complain they lose in their non-Unicode applications since it is not on all ACPs


    The cat is on the roof

    • 18 Comments

    Warning -- no technical content!

    There is an old story/joke that has many versions, but here is the version I like best:

    A man is house sitting for his brother -- feeding the cat, getting the mail, etc. The brother calls to check in. "I’m sorry," says the man, "but your cat died."

    "What do you mean the cat died? How could you do this to me? You should have prepared me for the shock," says the man’s brother.

    "How was I supposed to prepare you?" asks the man.

    "Well," says the brother, "first you should have told me, the cat is on the roof, but don’t worry, we’re calling the fire department. Then the next time we talked you should have said, the fire department was doing everything it could and not to worry. Then the next time I called you could have told me that the cat had fallen, but not to worry -- the vet was doing everything she could to resuscitate him. Then, finally, you could have told me, the cat had died."

    "Sorry, I should have thought first" said the man, who was quite embarrassed at this point.

    "So anyway, how is the house?"

    "Um," says the man, "your house is on the roof...."

    An interesting situation, one where on one level you understand what you are being told and yet on another just don't seem to get it.

    I find this story oddly comforting at the moment.

    You see, I just found out that my almost 10-year-old cat (Chelsea Antoinette) has cancer. It has metastasized from one mammary gland to several others. I have not heard from the radiologist yet but the veterinarian thinks it looks like it may be in the lymph nodes too.

    Now this is way outside my limited field of knowledge of medical matters, but I know that it has to be pretty big to show up in an X-ray. It is obviously still metastatic if it is showing up in multiple places, even if the X-ray comes back negative.

    So why do the X-ray in that case?

    Well, I think it is to give the person in my position something to hang hopes on. After all it can mean extra months of life, at least. And then if the results are bad then something to focus on in terms of the severity of the disease.

    Yet even knowing this, even having this meta conversation with the veterinarian, it somehow does help.

    I tell the veterinarian the story and she smiles, then apologizes for smiling. I tell her there is no need for that -- it is comforting somehow to be eased into these things, and to try to smile when you can. And then I smile. And I took Chelsea home. Now I will await the results and pretend it makes a difference, even if it won't really, in this case.

    I went into work Sunday and did not say anything. As if that would make it less likely to be true or something. Idiot.

    And Chelsea? Well, she likes the taste of Amoxicillin. And she seems to appreciate that she is getting Fancy Feast out of a can rather than the usual dry food, even if she does not know why.

    I have to figure out where the line is drawn so that I can know the difference between being a heartless killer and being a benevolent caretaker. I'm afraid I will do what most people do in such a situation -- I will wait, while she suffers. Thinking that there must be a resolution that is somehow moral, and compassionate. In the end it is incredibly selfish on my part, though in a world so backasswards that death can be an act of love, a little selfishness in the hope that she won't hate having a little more time seems like more of a venial sin than a mortal one.

    In the meantime, I'll keep bringing Chelsea's food up to her, on the roof. And try to keep the house from falling on her....


    A bug with the new oleaut32.dll calendar support and VB, VBA, and VBScript

    • 4 Comments

    (Recycling some electrons for a fun bug that can still be reproduced in the latest versions and service packs of VB and VBA <= 6.x (despite all the pressure I was able to muster in my position of unimportance!). To see this article in Thai, go to http://blogs.msdn.com/michkap/archive/2005/03/07/386453.aspx)

    A bug with the new oleaut32.dll calendar support and VB, VBA, and VBScript
    (Originally posted 4/9/2000)

    The new version of OLEAUT32.DLL (2.40.4512.1) that ships with Windows 2000 has added support for the Thai calendar (previously it had only supported Gregorian and Hijri dates, which is all VB supports). There is an initial problem: although the DLL contains the support and MSDN documents the new capability, the Platform SDK does not contain the latest header files, so you can't actually use the feature. Here are those values (which should be there for the next Platform SDK in July 2000):
    #define VARIANT_CALENDAR_THAI 0x20 // SOUTHASIA calendar support 
    #define VARIANT_CALENDAR_GREGORIAN 0x40 // SOUTHASIA calendar support

    Ok, you can use the feature! Now for the bug:

    When you have this new oleaut32.dll on either a Thai Win95/Win98/NT4 machine or a US Windows 2000 machine with Thai regional settings, COM will recognize that the Thai calendar is the one to use. Unfortunately, VBA and VBScript do not understand the concept of Thai dates, so they will both assume that dates like 9/4/2543 (the Thai equivalent of April 9th, 2000) are simply very futuristic Gregorian dates. Suddenly, any code you have to display dates is going to look like everything is 543 years in the future! And if your users do not type in dates that are 543 years ahead, the incorrect date will then be used by your application.
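
    The arithmetic behind the mix-up is small enough to sketch. The following Python is only an illustration of the 543-year offset described above: the function names are hypothetical, the day/month/year string layout is assumed from the 9/4/2543 example, and it says nothing about how oleaut32.dll or VBA implement their conversions.

```python
from datetime import date

BUDDHIST_ERA_OFFSET = 543  # Thai calendar year = Gregorian year + 543


def to_thai_display(d):
    """Format a date as day/month/year using the Thai (Buddhist Era) year."""
    return f"{d.day}/{d.month}/{d.year + BUDDHIST_ERA_OFFSET}"


def parse_thai_display(text):
    """Parse day/month/year text whose year is a Thai (Buddhist Era) year."""
    day, month, thai_year = (int(part) for part in text.split("/"))
    return date(thai_year - BUDDHIST_ERA_OFFSET, month, day)


def parse_as_gregorian(text):
    """Parse the same text the wrong way, treating the year as Gregorian."""
    day, month, year = (int(part) for part in text.split("/"))
    return date(year, month, day)


shown = to_thai_display(date(2000, 4, 9))
print(shown)                      # the Thai rendering of April 9th, 2000
print(parse_thai_display(shown))  # round trips to the original date
print(parse_as_gregorian(shown))  # the same text misread as a Gregorian date
```

    The misread value is a perfectly valid date, just 543 years late, which is what makes this kind of bug easy to miss until the data is already stored.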

    The problem is slightly worse if you use Hijri dates, which are approximately 600 years prior to Gregorian dates. Everything will seem to be way off. I am not sure how VBA is actually calculating Hijri dates, but if they are asking COM to do it, they are not properly identifying the source date as being in Thai, since VBA seems to think that the date is 5/1/1964 rather than the proper Hijri date, 1/5/1421. There is even more potential for corruption here, since the date does not correspond to any date that a user would understand (as opposed to the Thai date, which presumably a user who has Thai regional settings would understand).

    The real fix is obviously for VBA to have a mechanism for its date handling that uses whatever COM has available to it, rather than being hardcoded to accept only two date types when COM will now support three. To work around this bug, you need to call the VariantChangeTypeEx function to convert a date to a Gregorian date string and a Gregorian date string back into a date. The declaration of that function and the relevant constants are:

    ' Flags for the wFlags parameter of VariantChangeTypeEx
    Private Const VARIANT_NOUSEROVERRIDE = &H4
    Private Const VARIANT_CALENDAR_HIJRI = &H8
    Private Const VARIANT_CALENDAR_THAI = &H20
    Private Const VARIANT_CALENDAR_GREGORIAN = &H40

    Declare Function VariantChangeTypeEx _
        Lib "oleaut32.dll" _
        (ByRef pvargDest As Variant, _
         ByRef pvarSrc As Variant, _
         ByVal lcid As Long, _
         ByVal wFlags As Integer, _
         ByVal vt As VbVarType) As Long

    and you can use the following procedure (if you have the new oleaut32.dll only!) to test this out:

    Sub TestDateFormats()
        Dim vSrc As Variant
        Dim vDst As Variant

        vSrc = Date

        ' LCIDs: 1033 = English (United States), 1025 = Arabic (Saudi Arabia), 1054 = Thai
        Call VariantChangeTypeEx(vDst, vSrc, 1033, VARIANT_CALENDAR_GREGORIAN, vbString)
        Debug.Print "Gregorian date: " & vDst
        Call VariantChangeTypeEx(vDst, vSrc, 1025, VARIANT_CALENDAR_HIJRI, vbString)
        Debug.Print "Hijri date: " & vDst
        Call VariantChangeTypeEx(vDst, vSrc, 1054, VARIANT_CALENDAR_THAI, vbString)
        Debug.Print "Thai date: " & vDst
    End Sub

    The output will look like this:

    Gregorian date: 4/9/2000 
    Hijri date: 05/01/1421
    Thai date: 9/4/2543

    To convert from a string to a date, simply specify vbDate in that fifth parameter instead of vbString. You can use this same code under more pleasant circumstances to help interpret dates from different locales.

    The bug is obviously pretty bad for existing applications running on Thai machines since their behavior will be changed. I hope Microsoft addresses this one quickly.

     

    This post and its localized Thai cousin brought to you by "ภ" (U+0e20, a.k.a. THAI CHARACTER PHO SAMPHAO)


    Takes your breath away

    • 6 Comments

    More non-technical content....

    You know the step you miss at the bottom of the stairs, or the one you try to take that is not there at the top? It takes your breath away. Kind of like today's news did. I'll explain...

    I am taking Copaxone every day, for my MS. It's like taking Insulin or something.

    I used to take Avonex and to be frank I liked the schedule better (just once a week). But I'd kind of feel like I had the flu for the next 36 hours, which kind of stunk, if you know what I mean.

    So I was biding my time with the Copaxone.

    Though some time this month I was going to be switching to the new drug that used to be called Antegren but was then renamed to Tysabri.

    I was kind of annoyed at the wait (the hospital wanted to set up stuff and the infusions have to be in the doctor's office). But a drug that you only have to take once a month seemed like a dream come true, you know?

    But then everything changed.

    Yesterday morning, my brother-in-law forwarded me an article through email and asked if I had heard about it -- a headline today on cnn.com. It read MS drug pulled after patient dies. The drug companies (Biogen Idec and Elan Pharmaceuticals) voluntarily pulled Tysabri for an investigation after one patient died and another contracted Progressive Multifocal Leukoencephalopathy (PML), a rare but often fatal disease of the central nervous system. Both patients were taking Avonex and Tysabri together for over two years.

    My first thought was how terrible that was. And I will admit my second thought was that maybe that would have been me some day, if they had actually rushed through that process stuff at the UWMC neurology clinic a bit faster.

    The stock of both companies reportedly dropped on the news (small wonder, huh?).

    Now tonight, I am looking at this Copaxone syringe and wondering if I am taking my life into my hands by falling into this trap that the drug companies have set up. It's a pretty profitable scam they have going there, you know?

    Drugs that have a statistically significant chance of helping me will by definition have a statistically significant chance of doing the fractional value diddly/squat.

    And I don't even want to think about what the rush to get what seemed like a promising drug "fast tracked" through the FDA could do, however minuscule the chance, to someone about to be on the bleeding edge of MS treatment. The Copaxone may be doing nothing at all for me (one has to love a $1000 per month placebo), but no one has died yet from taking it after over ten years. I'll take those odds over 1 in 5000 after just two years any day.

    Now I know none of this probably even applies to me -- they have a ton of Avonex data and nothing like this has ever been seen. Same thing for Copaxone. And even if I were on the Tysabri, I would never have gone on a combination therapy with Avonex. Even bothering to think that I had a close call is like thinking I had a close call when I was stranded in Los Angeles on 9/11, waiting for a next day flight to San Jose. In other words, I was never in any kind of danger.

    I am simply not feeling quite so experimental, if you know what I mean. I think waiting for the longer studies sounds like a safer plan....
