August, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    The Milk Bet lives!

    • 30 Comments

    They are never going to learn this one.

    Marlins suspend batboy for milk-drinking dare

    I'll ignore the suspension issue and talk about the "milk bet" here.

    Now this particular bet has been around for a long time. I first heard about it when I was working for the Access team, probably around eight years ago.

    Heath, a fellow developer on the team with an office right next door to mine, was certain that he could drink a gallon of milk in an hour without throwing up. He went to CalTech and because of this had a very logical way of thinking this through. He could easily drink one of those one pint milk containers in just a few minutes. So the gallon could be polished off easily since it really is just eight of those one pint containers.

    (for those outside the US, there are four quarts to a gallon and two pints to a quart!)

    And he had a whole hour to do it, so he could take his time and make it with no problem, right?

    Well, actually, it is wrong.

    Fellow developer Nicholas Shulman volunteered the explanation for why the bet is never won as Heath was running to the restroom to avoid throwing up in the conference room to which we had all adjourned.

    "A stomach," Nick explained with just the right inflection for irony, "is about a half a gallon."

    Milk needs time in the stomach to be broken down before it can go on -- it does a body good, but it needs a little time to do that good. And there is simply no way to break the milk down fast enough to take in a full gallon in an hour. If you try to do so, your body will rebel and if you try and force the issue, your stomach will settle the argument for you.

    Perhaps some future CalTech or MIT student who has read this blog will either refuse the bet, or anticipatorily buy something that will break down the milk and drink a bunch of that right before the bet starts.

    (via Spencer)

  • Sorting it all Out

    What if my strings are > 2 gb?

    • 24 Comments

    We do get our fair share of silly questions here in NLS.

    I should perhaps explain what I mean by silly. :-)

    I don't think I'd ever consider a question where somebody is asking about language and how it might work in a certain situation and call that silly. I mean, that's how people learn. It's the kinds of questions that I ask of native speakers and of linguists, and even if they smile or laugh I never get the sense that they are thinking me silly for the question.

    But today, somebody who is thinking about 64-bit Windows and who assumed that one day strings that are greater than 2 GB would be common looked at our signature for CompareString:

    int CompareString(
        LCID Locale,
        DWORD dwCmpFlags,
        LPCTSTR lpString1,
        int cchCount1,
        LPCTSTR lpString2,
        int cchCount2
    );

    and suggested that perhaps those int parameters containing the string lengths ought to be size_t instead.

    Now I would like to forget about the argument that this is a public API that has been around since NT 3.1. It's obviously important here, and makes the suggestion a little bit silly, but not everyone really pays attention to what's in the NLS API or how long it's been there.

    I'd also like to forget about the argument that 2 GB strings are uncommon, because one day they may not be. Especially in the 64-bit world. There may be a perfectly valid reason to have huge strings.

    The real problem I have here, and what makes the question silly to me, is the notion that you need to do linguistic comparisons on strings that are greater than 2 GB in size.

    There is simply no way to justify this as a reasonable use of the collation functionality in the NLS API.

    Perhaps some of you may disagree with this notion, and I'll be curious how people respond to this post. If you are somebody who disagrees, please be sure to include information about your "reasonable example" so that people have a chance to appropriately judge the judgment being used. :-)
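
    For anyone who does keep lengths in size_t, here is a minimal sketch of the guard a caller might put in front of an int-length API like CompareString. The helper name cch_to_int is hypothetical (it is not part of the NLS API), and the sketch only shows the range check, not the comparison call itself.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper (not part of the NLS API): narrows a size_t character
   count to the int that CompareString-style length parameters expect.
   Returns 1 and stores the count on success, 0 if the count does not fit. */
int cch_to_int(size_t cch, int *out)
{
    assert(out != NULL);
    if (cch > (size_t)INT_MAX) {
        return 0;   /* more characters than an int length can describe */
    }
    *out = (int)cch;
    return 1;
}
```

    A caller that gets 0 back knows up front that the string is too long to hand to the API in one piece, rather than finding out from a silently truncated length.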

     

    This post brought to you by "§" (U+00A7, a.k.a. SECTION SIGN)

  • Sorting it all Out

    What the hell does HTTP_ACCEPT_LANGUAGE mean?

    • 18 Comments

    The question is a simple one: what the hell does HTTP_ACCEPT_LANGUAGE mean?

    The answer is also quite simple: IT DEPENDS.

    The user is sending information from their browser, and could mean any of the following things:

    • language/locale to use for formatting/collation preferences
    • language/locale to use for the UI
    • language/locale about which to provide content
    • location for which to provide information

    Now sometimes all of the settings will be the same. It is obviously more common for that to be the case. But it is a huge Internet and frankly there are a lot of times that they're not the same. It is unfortunate that all of these different items have to be filtered through a single setting across all of the browsers. But life is about dealing with things as they are, not as we want them to be.

    It is therefore crucial to recognize that a user may have any of these in mind, and to be careful not to assume too much based on the HTTP_ACCEPT_LANGUAGE -- giving them an easy way to change the settings if you assumed more than they wanted you to....
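
    To make the "hint, not a decision" point concrete, here is a rough sketch that splits an Accept-Language value into language tags and their q weights. The struct, the function name, and the 35-character tag cap are all assumptions made for this sketch, and it deliberately stops at parsing: what the ordered list means for formatting, UI, or content is still a guess that the user should be able to override.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TAG 35   /* arbitrary cap chosen for this sketch */

struct lang_pref {
    char tag[MAX_TAG + 1];   /* e.g. "en-gb" */
    double q;                /* weight, 1.0 when the header gives none */
};

/* Parses a value such as "da, en-gb;q=0.8, en;q=0.7".
   Fills prefs[] in header order and returns the number of entries stored. */
size_t parse_accept_language(const char *header, struct lang_pref *prefs,
                             size_t max_prefs)
{
    size_t count = 0;
    const char *p = header;

    assert(header != NULL && prefs != NULL);

    while (*p != '\0' && count < max_prefs) {
        const char *end;
        const char *semi;
        size_t len;

        while (*p == ' ' || *p == '\t' || *p == ',') {
            p++;                               /* skip separators */
        }
        if (*p == '\0') {
            break;
        }

        end = strchr(p, ',');                  /* end of this entry */
        if (end == NULL) {
            end = p + strlen(p);
        }
        semi = memchr(p, ';', (size_t)(end - p));
        len = (size_t)((semi != NULL ? semi : end) - p);
        while (len > 0 && (p[len - 1] == ' ' || p[len - 1] == '\t')) {
            len--;                             /* trim trailing blanks */
        }

        if (len > 0 && len <= MAX_TAG) {
            memcpy(prefs[count].tag, p, len);
            prefs[count].tag[len] = '\0';
            prefs[count].q = 1.0;
            if (semi != NULL) {
                const char *q = semi + 1;
                while (q < end && (*q == ' ' || *q == '\t')) {
                    q++;
                }
                if (end - q >= 2 && (q[0] == 'q' || q[0] == 'Q') && q[1] == '=') {
                    prefs[count].q = strtod(q + 2, NULL);
                }
            }
            count++;
        }
        p = end;
    }
    return count;
}
```

    Entries that are empty or longer than the cap are simply skipped, which seems a safer default than guessing at a truncated tag.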

     

    This post brought to you by "Ǯ" (U+01ee, a.k.a. LATIN CAPITAL LETTER EZH WITH CARON)

  • Sorting it all Out

    My kingdom for some Unicode controls

    • 18 Comments

    I have certainly done my share of pushing for Unicode controls in various programming languages on Windows. From the UniToolbox controls link on this very blog to the book I wrote for Visual Basic (see Chapter 6 online!) -- this is the one that Joel Spolsky said all of the very nice things about in this post and the audio interview it links to (in the interview I was an example, he was mainly talking about how Amazon ratings/comments can be particularly biased/skewed by folks with a "smear tactic" agenda, a point on which I agree with him -- but I usually just filter the anonymous comments to get a more accurate answer!).

    (Joel, I'll cover my thoughts on a book in another post!)

    Anyway, I am a huge fan of Unicode controls.

    In prior versions of VB (<= 6.0) they were only half-Unicode, by which I mean they all had Unicode interfaces but for the most part were wrappers around non-Unicode intrinsic or common controls. Which means a lot of conversions back and forth (and back again in many cases on NT-based platforms since the underlying controls themselves are Unicode!). So you get all of the space and performance penalties of Unicode with none of the benefits (like the Shell Unicode interfaces in Windows 95!).

    It was very exciting that in .NET all of the WinForms controls are 100% Unicode any time the OS could support it happening. Even on Win9x all of the owner draw controls still support Unicode, and some of the common controls. You can see some of this in the documentation, like in this topic:

    However, certain controls do not support Unicode in Windows 98 and Windows Millennium Edition. These controls, all of which inherit from the common control, will process data with the Windows code pages, as ANSI. These controls are: TabControl, ListView, TreeView, DateTimePicker, MonthCalendar, TrackBar, ProgressBar, ImageList, ToolBar, and StatusBar. The result of this is that you cannot display Unicode data in these controls on the listed platforms. For example, you cannot display Japanese characters on an English Windows 98 system.

    We'll ignore the technical mistakes here and the fact that it does not mention some of the intrinsic-based controls like the TextBox also have this problem (and especially the fact that some of the common controls actually do support Unicode on Win9x, and will work properly in WinForms!) and concentrate on the issue that there are a few controls which will not support Unicode on Win9x, even in WinForms.

    It is easy to get worked up about this, but these days I do not. After all, the only time I ever run Windows 98 or Millennium these days is when I am looking at an MSLU bug, and it has been a long time since one of those has needed a look. And even if the controls fully supported Unicode, usually the fonts would not be there so all you would see is a bunch of square boxes a.k.a. NULL glyphs (□□□□□□□□) which is really not much better in terms of information than a bunch of question marks (????????).

    For me it is enough that everything is Unicode whenever it can be. That's cool.

    Now the final frontier is C++ projects -- since so many people still don't create the projects as Unicode ones, and a lot of developers still write that TCHAR code even if they are only writing for NT-based platforms like Win2000 or XP or Server 2003 or Vista -- LPTSTRs and TCHARs, yuck!

    In NLS we put our foot down in Windows Server 2003 -- no new NLS API functions will be written with ANSI counterparts. And we're continuing that in Vista. Not everyone has gotten the word on this yet, so we'll need to step up on the "internal evangelism" with other teams and groups. But it should be easier to suggest that people write less code, I think -- much easier than to suggest that people need to write twice as many functions and messages!

    The old functions will still work, sure. But there is plenty of new functionality like FindNLSString and NormalizeString and lots more that I will be covering in future posts -- and it is Unicode only, like many of the new locales in Vista are.

    So if you are writing C/C++ applications, you have to ask yourself if you really want half the world to have to speak fluent question mark to use the products you write?
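
    For a feel of what "fluent question mark" looks like, here is an illustrative sketch of a lossy narrowing step. It uses ISO 8859-1 as a stand-in for an "ANSI" code page and a made-up function name; it is not how any Windows conversion function is implemented, only a picture of what happens to text that a single-byte code page cannot hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: narrows code points to ISO 8859-1, replacing anything
   that does not fit with '?'. dst must have room for n + 1 bytes.
   Returns the number of characters that were lost. */
size_t narrow_to_latin1(const uint32_t *src, size_t n, char *dst)
{
    size_t i;
    size_t lost = 0;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < n; i++) {
        if (src[i] <= 0xFF) {
            dst[i] = (char)(unsigned char)src[i];
        } else {
            dst[i] = '?';   /* no slot for this character */
            lost++;
        }
    }
    dst[n] = '\0';
    return lost;
}
```

    Feeding it "A", TIBETAN LETTER KA (U+0F40), and CYRILLIC SMALL LETTER A (U+0430) leaves only the "A" intact; the other two come out as question marks with no way to get them back.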

     

    This post brought to you by "ཀ" (U+0f40, a.k.a. TIBETAN LETTER KA)
    (A letter that you are probably not looking at a NULL GLYPH for if you are running Vista Beta 1!)

  • Sorting it all Out

    Vietnamese is a complex language on Windows

    • 17 Comments

    Back in May of 2004, Quan Nguyen sent a message to Dr. International about Vietnamese collation in Windows and the .NET Framework:

    I tried to sort Vietnamese characters according to Vietnamese collation rules, as prescribed in http://vietunicode.sourceforge.net/charset/vietalphabet.html. However, .NET Framework's built-in sort order for CultureInfo("vi-VN") seems not correct. What should I do to get it to sort according to Vietnamese alphabetical order?

    This was not the only place that this information was asked -- Quan had asked this same question on several newsgroups and other places. We requested some more details, did the investigation, and were able to report on the claim -- he was right, there were a few letters that did not sort properly. In the end, the problem basically consisted of the uppercase and lowercase versions of the following letters:

    Of course since these letters are in Unicode and are used by several other languages, they have some default weights -- but they are not in the Vietnamese exception table. And their weights in the default table are not completely correct....
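
    A toy sketch of that lookup order may help: check the locale's exception table first, and fall back to the default table when the letter is not there. Everything here is an assumption for illustration. The table layout, the function names, and especially the weights are made up, and the real sorting tables carry far more than a single number per character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct weight_entry {
    uint32_t cp;       /* code point */
    unsigned weight;   /* made-up primary weight */
};

/* Made-up data: the default table covers three letters, and the locale's
   exception table overrides just one of them. */
static const struct weight_entry default_table[] = {
    { 'a', 10 }, { 'b', 20 }, { 'c', 30 }
};
static const struct weight_entry exception_table[] = {
    { 'b', 35 }   /* this locale wants 'b' to sort after 'c' */
};

/* Linear search; returns 1 and stores the weight if cp is in the table. */
int find_weight(const struct weight_entry *table, size_t n, uint32_t cp,
                unsigned *weight)
{
    size_t i;

    assert(table != NULL && weight != NULL);

    for (i = 0; i < n; i++) {
        if (table[i].cp == cp) {
            *weight = table[i].weight;
            return 1;
        }
    }
    return 0;
}

/* Exception table wins; anything it does not list gets the default weight.
   Returns 0 for code points that neither table knows about. */
unsigned sort_weight(uint32_t cp)
{
    unsigned w = 0;

    if (find_weight(exception_table,
                    sizeof exception_table / sizeof exception_table[0], cp, &w)) {
        return w;
    }
    if (find_weight(default_table,
                    sizeof default_table / sizeof default_table[0], cp, &w)) {
        return w;
    }
    return 0;
}
```

    The point of the sketch is the silent fallback: a letter missing from the exception table still sorts somewhere, just not necessarily where speakers of the language expect it.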

    Now no one had reported this problem before, so hopefully these are letters that are not used often in Vietnamese in situations where the small but definite differences in collation would be noticed.

    Which is not to say it is not a bug or that it should not be fixed -- it definitely is.

    But it is to perhaps explain why it took so long for someone to report to Microsoft a bug that has been in the code page and sorting tables since the very first Vietnamese enabled versions of Windows....

    Now Windows code page 1258 has its own set of problems here, because the above characters are not in cp1258, either. Well, they sort of are as combining characters since the code page has U+0300, U+0301, and U+0303 on it -- but the conversion to and from Unicode of the above characters can be quite nightmarish, for the reasons I mention when I pointed out a few of the gotchas of MultiByteToWideChar. We would have had to include them as the precomposed form listed above, and there are not enough free slots to do so (even if we were able to modify code pages, which we are not, as I explained when I wrote about why we cannot change the code pages).

    So let's just assume that cp1258 is about as limited in use as all of the rest of the attempts at the other (at last count 42!) 8-bit encodings of Vietnamese are (they all have problems due to the fact that there are too many characters or not enough slots to put them) and stick with Unicode....

    Getting back to collation, this particular problem that Quan Nguyen reported is fixed in the updated sorting tables in Vista Beta 1. It could not be fixed in earlier versions of Windows or the .NET Framework as it requires a major version change for Vietnamese to change the weights of code points that already have weights defined, so Vista is our first chance to make the fix (Whidbey's sorting tables are not being updated so the fix could not be made in .NET 2.0).

    On a happier note, the font story for Vietnamese has been really good on Windows for a while now, for all of these various letters.

    And the Vietnamese LIP was released in March 2005 which is also pretty awesome.

    It just took a little while for the NLS side of GIFT to catch up with everyone else, that's all. :-)

     

    This post brought to you by "Ý" (U+00dd, a.k.a. LATIN CAPITAL LETTER Y WITH ACUTE)

  • Sorting it all Out

    Mitigation tools for IDN security problems

    • 16 Comments

    Back in January, just before the flap at the hacker's convention with the paypal.com link that used a Cyrillic 'a' to prove that IDN without a way to ferret out phishing attacks is dangerous, I posted my own post entitled International Domain Names? The sign on the door says 'Gone Phishing'....

    It was an interesting flap because the RFCs for Internationalized Domain Names clearly point out the dangers and talk about the need to do some extra work to avoid security issues, but several browsers jumped ahead to support them and then just as quickly rushed out to turn them off by default.

    Folks at Microsoft, who knew about the need to do work here first, did not jump ahead without looking. And Microsoft was complimented for not jumping in too quickly. :-)

    Unicode has moved in to assist with Unicode Technical Report #36: Unicode Security Considerations.

    And now Microsoft has some functions to help ISVs jump in (functions that can and will also be used in future versions of Microsoft products!).

    Here it is: Microsoft Internationalized Domain Names (IDN) Mitigation APIs 1.0.

    From the overview:

    The "Internationalized Domain Names Mitigation APIs" download includes several API functions to convert an IDN to different representations, as well as several API functions specifically intended to allow applications to mitigate some of the security risks presented by this technology. The functions IdnToAscii, IdnToUnicode, and IdnToNameprepUnicode each convert an IDN string to a particular form. The functions DownlevelGetLocaleScripts, DownlevelGetStringScripts, and DownlevelVerifyScripts allow applications to verify that the characters in a given IDN are drawn entirely from the scripts associated with a particular locale or locales. However, these functions are only helpers; applications have still to perform comprehensive threat modeling and create appropriate mitigation for these threats.

    Also included are the Unicode normalization APIs IsNormalizedString and NormalizeString, which are used by the mitigation APIs.

    This package is supported on XP (Service Pack 2 or later) and Server 2003 (Service Pack 1 or later). And differently named functions will also be in Vista!

    For info on the Normalization API functions, look here.

    For info on the IDN API functions, look here.

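    The cool functions in the package to help with the mitigation (they make use of ISO 15924 for their script definitions) are the three named in the overview: DownlevelGetLocaleScripts, DownlevelGetStringScripts, and DownlevelVerifyScripts.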

    You can use these functions as part of your strategy for dealing properly with internationalized domain names -- warning users of potentially dangerous links to information.
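
    The idea behind a script check can be sketched in a few lines. This is not the Downlevel* API and does not reproduce its behavior; the function names and the hard-coded ranges are assumptions for the sketch, and real script data comes from the Unicode Character Database rather than a handful of ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SCRIPT_LATIN = 1, SCRIPT_CYRILLIC = 2, SCRIPT_OTHER = 4 };

/* Very rough classifier covering only a few ranges. Everything that is not
   Latin or Cyrillic is lumped into one "other" bucket, so mixtures inside
   that bucket go unnoticed. */
unsigned script_bit(uint32_t cp)
{
    if ((cp >= '0' && cp <= '9') || cp == '-' || cp == '.') {
        return 0;                   /* shared by all scripts, ignore */
    }
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) {
        return SCRIPT_LATIN;
    }
    if (cp >= 0x00C0 && cp <= 0x024F && cp != 0x00D7 && cp != 0x00F7) {
        return SCRIPT_LATIN;        /* accented and extended Latin letters */
    }
    if (cp >= 0x0400 && cp <= 0x04FF) {
        return SCRIPT_CYRILLIC;     /* the main Cyrillic block */
    }
    return SCRIPT_OTHER;
}

/* Returns 1 if the label draws characters from more than one bucket. */
int is_mixed_script(const uint32_t *label, size_t n)
{
    unsigned seen = 0;
    size_t i;

    assert(label != NULL);

    for (i = 0; i < n; i++) {
        seen |= script_bit(label[i]);
    }
    return (seen & (seen - 1)) != 0;   /* true when more than one bit is set */
}
```

    With this, "paypal" spelled entirely in Latin letters passes, while the same label with U+0430 swapped in for each "a" is flagged as mixed, which is the cue to warn the user before they follow the link.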

    Awesome!

     

    This post brought to you by "а" (U+0430, a.k.a. CYRILLIC SMALL LETTER A)

  • Sorting it all Out

    International support as a developer 'tax' ?

    • 15 Comments

    I read Raymond Chen's post entitled How do you convince developers to pay their "taxes"? with interest. And it made me wonder whether you could lump international support in this list, any time an international market is not specifically being targeted.

    It is an interesting question. Is proper support for internationalization akin to power management, roaming user profiles, Fast User Switching, Hierarchical Storage Management, multiple monitors, Remote Desktop, and 64-bit Windows, any time you are not specifically aiming to ship support in another country and/or for another language?

    What do people here think?

  • Sorting it all Out

    The Keyboard Convert Service

    • 14 Comments

    Some of you will remember when, a while back, Kate Gregory inspired me to talk about why sometimes the keyboard does not do what I tell it to!

    Sometimes you think you are typing in one keyboard, but it turns out you are typing in another.

    So some of the truly awesome GIFT team folks in Ireland decided that maybe there was something better that could be done about the problem than just blogging about it.

    Turns out they were right -- they created the Keyboard Convert Service, a free download that will do its darndest to fix the text in these cases!

    I will be trying it out this week to see if they were able to integrate some of the feedback I gave them early in the project cycle. :-)

  • Sorting it all Out

    Why I think the thread locale really stinks

    • 14 Comments

    I do not like the thread locale.

    Yes, GetThreadLocale and SetThreadLocale are two of the many NLS functions that the GIFT team supports.

    And yes, if we are to look at the functions we own as if they are our children then we are supposed to love them all.

    But in that case, I guess I am a lousy parent (if you will recall, I think that SetLocaleInfo really stinks, too).

    First of all, there are the weird dependencies in USER32 and SHELL32 that probably ought to be using the user locale but instead use the thread locale and fall back to the system locale if something goes wrong.

    Second of all, there is the poor story in GetThreadLocale:

    Return Values

    The function returns the system's default user locale.

    Remarks

    When a thread is created, it uses the system default user locale. The system reads the system default user locale from the registry when the system boots. This system default can be modified for future process and thread creation using Control Panel's International application.

    Since it always returns the thread locale, which starts its life as the user locale (set in Regional Options) but can be changed by a call to SetThreadLocale, you can make a fair case that both parts of the text are lousy.

    Third of all, there is the worse story in SetThreadLocale:

    Return Values

    If the function succeeds, the return value is a nonzero value.

    If the function fails, the return value is zero. To get extended error information, call GetLastError.

    Remarks

    When a thread is created, it uses the system default thread locale. The system reads the system default thread locale from the registry when the system boots. This system default can be modified for future process and thread creation using Control Panel's International application.

    The SetThreadLocale function affects the selection of resources that are defined with a LANGUAGE statement. This affects such functions as CreateDialog, DialogBox, LoadMenu, LoadString, and FindResource, and sets the code page implied by CP_THREAD_ACP, but does not affect FindResourceEx.

    Windows 2000/XP: Do not use SetThreadLocale to select a UI language. To select the proper resource that is defined with a LANGUAGE statement, use FindResourceEx.

    Where do I start? The function should return an LCID on success -- the previous thread locale! -- not a BOOL. Just like functions like SetWindowLong do. Oh well, I guess that is not the end of the world.

    There is that same type of silly text about the 'system default thread locale' which is a beast not found in nature. The notion that it is read from the registry on boot is also crap. It is based on user locale. Always.
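
    The shape being asked for is easy to sketch. None of this is the real API: lcid_t stands in for LCID, a plain static variable stands in for per-thread state, and the function names are made up. It only shows why a setter that hands back the previous value makes save-and-restore a two-line job.

```c
#include <assert.h>

typedef unsigned long lcid_t;            /* stand-in for the Win32 LCID type */

/* Stand-in for per-thread state; pretend the user locale is en-US (0x0409). */
static lcid_t current_locale = 0x0409;

/* Hypothetical setter that returns the value it replaced. */
lcid_t set_locale_returning_previous(lcid_t new_locale)
{
    lcid_t previous = current_locale;
    current_locale = new_locale;
    return previous;
}

/* Usage: switch temporarily, then put back whatever was there before. */
void demo_save_and_restore(void)
{
    lcid_t saved = set_locale_returning_previous(0x0407);    /* de-DE */
    /* ... locale-sensitive work would go here ... */
    lcid_t replaced = set_locale_returning_previous(saved);
    assert(replaced == 0x0407);
}
```

    With a BOOL return, the caller has to make a separate call to fetch the old value first and hope nothing changes it in between.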

    Then there is the text about how resource loading is affected -- and a warning sans explanation to not use that functionality on Win2000 and later. Since it is not supported on Win9x, what it is really saying is "good for only NT 3.1, 3.5x, and 4.0". In other words a warning that the function is not useful on modern platforms unless you want to affect the Shell and User subsystem in strange ways. Although it does not bother to say so.

    Fourth of all, the fact is that resource loading is incredibly complicated, in part because of this questionable functionality. It is not based on the user locale, and it is not in essence based on the thread locale unless you change it. In which case it suddenly is based on the thread locale. Unless it is based on the UI language -- which it should always be. I swear you need a spirit dancer and a Ouija board to know what resources load if you start mucking with the various locale settings, and that is mostly because of this weird setting that hovers between UI language and user locale -- the thread locale.

    On older platforms it is worse -- the system locale is the one that is used except when you set the thread locale (even though the thread locale is initially the user locale). I guess that is about the same as Win2000 and later, just substitute system locale for UI language.

    This is by the way (now that I think about it) the first reasonable explanation for the strange Visual Basic <= 6.0 behavior where in the IDE the user locale is the language used for resource loading, even though in the compiled application that would not work -- VB was setting the thread locale to the user locale and confusing millions of VB developers with this never-before-now-fully-explained behavior. Geez.

    Anyway, we document that developers should use LOCALE_USER_DEFAULT and not GetThreadLocale when they are trying to respect the user's preferences.

    If you ask me, we should treat every case where the thread locale is currently used as a bug to be fixed and not as a legacy behavior to be coddled. We have been slowly breaking the behavior anyway with each version without explaining why since it was already broken, so why not just cut the cord and stop using this dastardly functionality?

    The thread locale really stinks, after all!

     

    This post brought to you by "ײ" (U+05f2, a.k.a. HEBREW LIGATURE YIDDISH DOUBLE YOD)

  • Sorting it all Out

    'Our move. What do you think?' I asked. 'Napalm' she said.

    • 13 Comments

    (nothing technical here, at least not for computers -- if the MS thing doesn't interest you then you may want to skip this one)

    I did lift and modify the title quote from the movie Barbarians at the Gate. George Roberts (played by Peter Dvorsky) is talking with Henry Kravis (well played by Jonathan Pryce) just after a meeting where they were stonewalled. The clock is ticking, and time is running out. So George asks the question and Henry makes clear that a slash and burn strategy is called for.

    A great movie that I enjoy a lot, and a scene that has been running through my head a lot over the last few days, which have had some fairly huge distractions....

    I had an unusual appointment with my neurologist on Monday (I think I had already mentioned that it was a fairly odd day, professionally). We were talking about recent declines in my MS that did not seem related to a specific clinical exacerbation. Now this is a disturbing trend, any way you look at it. If you want to reverse these things and not have them stay as permanent deficits, you have to act quickly.

    But the options for acting are limited -- there are just not very many conventional treatments available, and as unconventional as some people believe me to be, I am about as likely to start the snake oil rounds as I am to sprout wings and fly to Guam. I certainly am not going to jump into the exciting world of stem cell research. There is just not a lot out there....

    So my neurologist made a very responsible suggestion -- that it may be time to think about Novantrone.

    Now Novantrone is pretty much the napalm of the MS world -- an antineoplastic agent (basically chemotherapy), and you can watch how much the official site for the drug tries to tap dance about how much milder the drug is when used for MS than for cancer (whoever decided it was okay to sell it under the same brand name rather than a different one was not thinking too clearly that day, if you ask me).

    But even though it is the napalm treatment, it is pretty much one of the only treatment options that really exist for what they like to euphemistically call "worsening MS". It is the sort of thing I had already learned a lot about because I have a morbid fascination for such things. The party line about how it works is pretty harsh, even after the spin doctors have had their way with it:

    Novantrone works differently from other MS drugs. This difference provides hope for people who have experienced worsening symptoms of MS while getting other treatments.

    Other drugs used to treat MS (like interferon) "moderate" the immune system. You can think of it as "calming down" certain immune cells, but not completely reducing their numbers.

    Novantrone works by "suppressing" the immune system. The active substance in Novantrone (called mitoxantrone) affects DNA, a basic building block of all cells. You can think of Novantrone as killing certain cells in the immune system (called T cells, B cells, and macrophages). These are the cells thought to become abnormal and lead the attack on the myelin sheath—producing the brain or spinal lesions characteristic of MS.

    You may hear the term chemotherapy associated with Novantrone. For some people, the word "chemotherapy" triggers scary images. However, the word simply means using chemicals to treat disease.

    When used to fight cancer, chemotherapy drugs are given in doses that are strong enough to kill aggressively-growing tumor cells.

    However, when Novantrone is used to treat MS the goal is to inactivate or destroy any immune cells that are "misguided." When Novantrone is prescribed by a doctor for MS, the recommended treatment schedule is far less intensive than for cancer. The overall dosage is given less frequently over a longer period of time.

    Well, okay. But the list of side effects is the same, the worry about the potentially destructive effect on the heart is in there, the small (less than 1%) chance of getting leukemia is present, and the fact that the effects can still be a worry years after you have stopped the treatment is sobering. The Patient Information Leaflet goes into the amusing-under-any-other-circumstances things like blue-green urine and the whites of the eyes turning blue. Some of the other risks are also interesting....

    But the fact that hits me hardest is not even one of the side effects -- it is that this drug has a lifetime maximum dosage of somewhere around 8-12 doses.

    This is where that morbid fascination thing kicks in -- a drug with such long-term destructive potential that there is not just a daily or weekly maximum but an actual maximum over the period of a lifetime.

    Wow. Or whatever the negative version of the word 'Wow' is.

    Am I really doing that badly?

    Of course that is the problem, and that is why it might be the right thing to try -- because I am not doing so badly that this would be a useless idea; it may well be the perfect time to knock this monkey off my back for a little, try to unseat the mostly unfettered ascendency of multiple sclerosis a bit.

    The other problem is a psychological one -- when the MS was not affecting me as much, I was in denial and everyone would go on about how great my attitude was and what a great example I was setting (and everyone has a friend or relative who is doing worse than I am). But now that it is affecting me and I finally may be reaching a more real and healthy attitude, a lot of people don't know what to do with me and have trouble thinking of me as setting a great example for anybody.

    I used to think that just about everything I ever needed to know I learned from the movie Apocalypse Now (yes, many will find that to be disturbing!), but this time Coppola failed me -- despite Robert Duvall's memorable Kilgore quote, I am not loving the smell of this napalm in the morning.

    On the other hand, I am finding comfort in Captain Willard's (Martin Sheen's) words, modified for this situation. "I took the Novantrone consult. What the hell else was I going to do? But I really didn't know what I'd do when I found if it was right."

    So we'll see what happens.

    This post may not have been everyone's cup of tea, so I am sorry for that. But it is hard for me to completely separate the various pieces of my life since they are all me, in the end. One way or another I do have to sort it all out....

  • Sorting it all Out

    Every character has a story #13: U+0241 and U+0294 (upper and lower case glottal stops)

    • 13 Comments

    It started just the other day when John Jenkins asked on the core Unicode mailing list:

    Now that we have an uppercase glottal stop, any recommendations as to how it should look in a font?  Both the uc and lc glottal stops occupy the full space from baseline to cap height, or so I've always understood it…

    To which Peter Constable suggested:

    He he he... This is the same matter I raised several months ago.

    A slight clarification to what you wrote: the orthographic bicameral glottal stops used by some Athabaskan languages are x height and cap height. The caseless glottal stop used in phonetic transcription is cap height -- identical to the orthographic uppercase.

    Well, the matter has temporarily gotten slightly more complex: last week, UTC approved the addition of LATIN SMALL LETTER GLOTTAL STOP as the x-height case pair to 0241 (tentatively assigned to 0242). If this is accepted in WG2, then it would likely go in amendment 3, meaning post the next version of Unicode. So, if you are creating fonts *now* to support TUS4.1, you can choose between:

    • having identical glyphs for 0294 and 0241, anticipating a later addition of the lowercase pair to 0241;
    • you can innovate a distinction between glyphs for 0294 and 0241 (e.g. make 0241 slightly wider); or,
    • don't support 0241 until you can also support 0242.

    Another logical option consistent with TUS4.1 would be to make 0294 x height, but I'd strongly advise against that, as it likely won't be consistent with a future version of the standard, and would not work for phonetic transcription, which will likely be relevant to a larger number of Apple customers than the orthographic usage.

    Ken Whistler's response was perhaps a little more cynical, though accurate as usual!

    Ah, John, you arrive right on time with the first set of misconceptions about what is going on.

    What happened last week is that a *lowercase* glottal stop was added, invalidating the relationship between the Unicode 4.1 *uppercase* glottal stop and the erstwhile glottal stop, and returning the erstwhile glottal stop to glorious *un*cased status. (See my data file message from yesterday.)

    So now we need *3* glyphs for glottal stops.

    First we have the *real* glottal stop, U+0294, used in most orthographies without case.

    It started out as a tiny hook, grew to the top half of a question mark to accommodate linguists filing the dots off their typewriters to be able to type the thing. It grew further, under the auspices of the IPA, into a taller and taller glyph, in a largely vain attempt to convince Europeans that "nothing" could be a *real* letter -- merely by making it so big they could no longer ignore its appearance in text.

    Then the Chipewyan, Dogrib, and Slavey communities in NW Canada, aided and abetted by linguists who should have known better, invented a case pair for glottal stop. And because IPA had turned the thing into a monumental cap form in their effort to get people to take it seriously, the Dene decided, quite reasonably, that that monstrosity *was* a capital letter, and so invented a tinier version to be their normal glottal stop -- the lowercase one in running text.

    In actual samples of Chipewyan and Dogrib texts using the case pair, the distinction is basically between a cap-height capital glottal stop and an x-height small glottal stop, otherwise of the same shape. I don't know of any systematic way to distinguish their cap-height capital glottal stop from what we now have as a caseless U+0294 glottal stop -- because, frankly, I don't think that issue ever occurred to the people who were using it and creating fonts for it.

    Michael Everson has stepped in, in L2/05-194, with an attempt to make a 3-way glyphic distinction. But I consider the shapes for U+0241 and U+0242 (the uc/lc pair) to be typographic fantasizing. We now have a LATIN CAPITAL LETTER GLOTTAL STOP that has grown "fatter" in another vain attempt to convince people it is a *real* letter and to be systematically distinguished from an ordinary glottal stop.

    Oh well. Take L2/05-194 for what it is worth. That is the font that is likely to go into the book to confuse future generations further. Now, basically because typographic inflation and casepair invention has still not convinced Europeans that glottal stops are real -- we have embarked down the road of multiplying the encoding of them. Perhaps by the time we have encoded 7 more glottal stops (in addition to U+0241, U+0242, U+0294, U+02BC, U+02BE, U+02C0, and U+097D) we'll finally manage to convince people to take it seriously. :-(

    Ken further responded, to Peter's post:

    > If this is accepted in WG2, then it would likely go in amendment 3,
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    It *must* go in Amendment 2. The whole point of the rush on their lowercase pairs for uppercase characters is because the door slams shut on casefolding stability issues as of Unicode 5.0.

    > meaning post the next version of Unicode.
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    meaning Unicode 5.0.         

    > Another logical option consistent with TUS4.1 would be to make 0294 x height, but I'd strongly advise against that,

    As would I. You might as well go with the fantasy, Big 'N Tall uppercase glottal stop to convince people for TUS 4.1 that it is the appropriate uppercase for U+0294, and then leave Ol' Fat Boy there in your font when you add the x-height U+0242 for Unicode 5.0 as the actual lowercase for it.

    --Ken

    Oh, and since you asked -- no, I'm not bitter at all about this travesty. ;-)

    Michael Everson agreed with this last assessment:

    > then it would likely go in amendment 3, meaning post the next version of Unicode.

    No, it has to go into the FPDAM2 because of the case-folding stability lockdown.

    >- you can innovate a distinction between glyphs for 0294 and 0241 (e.g. make 0241 slightly wider); or,

    I do this. The width of the footstem for 0241 should be the width of  a capital I; the width of 0294 should be the width of a small i.

    Though he did differ from some of Ken's opinions:

    >returning the erstwhile glottal stop to glorious *un*cased status.

    Which is the way it is used in IPA and in other Canadian  orthographies: as an uncasing character ignored e.g. in title casing.

    >...convince Europeans that "nothing" could be a *real* letter -- merely by making it so big they could no longer ignore its appearance in text.

    I don't know why you are picking on Europeans, Ken.

    >Then the Chipewyan, Dogrib, and Slavey communities in NW Canada, aided and abetted by linguists who should have known better, invented a case pair for glottal stop.

    Case-pairing is a perfectly natural thing for people to want to do, and it is no surprise that people did this, and I for one don't find it as distasteful as Ken does.

    >And because IPA had turned the thing into a monumental cap form in their effort to get people to take it seriously, the Dene decided, quite reasonably, that that monstrosity *was* a capital letter, and so invented a tinier version to be their normal glottal stop -- the lowercase one in running text.

    But of course the UTC decided that 0294 could not be turned from Ll to Lu, which is why 0241 was added as Lu, but this caused the problems that L2/05-194 identifies and addresses.

    >Michael Everson has stepped in, in L2/05-194, with an attempt to make a 3-way glyphic distinction. But I consider the shapes for U+0241 and U+0242 (the uc/lc pair) to be typographic fantasizing. We now have a LATIN CAPITAL LETTER GLOTTAL STOP that has grown "fatter" in another vain attempt to convince people it is a *real* letter and to be systematically distinguished from an ordinary glottal stop.

    Capital letters should have the same vertical weights. I, T, Y, and the capital glottal have stems which should be cap width. 0294 should have the same width as l and i. This is pretty easy, really.

    And of course Michael had to respond to Ken's postscript:

    >Oh, and since you asked -- no, I'm not bitter at all about this travesty. ;-)

    We've only ended up encoding what SIL asked for in the first instance, based on what was Peter's correct analysis. What was accepted by UTC then was an overunification, and that's had to be corrected now. Sorry if that smarts.

    John Jenkins responded to part of Ken's post, adding the ideographic point of view to all of this:

    > ...we'll finally manage to convince people to take it seriously. :-(

    Nah, we'll have to wait until there is an isomorphism between glottal  stops, middle dots, turtles, and grass radicals.  :-)

    Ken had to respond to Michael Everson's post about Europeans (just as Michael had to respond to Ken's!):

    > I don't know why you are picking on Europeans, Ken.

    Because it was the Spanish, English, Portuguese, French, and the Dutch, with their impoverished, Latin-based writing systems, who colonized the Americas, Michael.

    At least if the Arabs had colonized the Americas instead of the Western Europeans, they would have recognized (and written) a glottal stop when they heard one. On the other hand, then we'd probably be arguing about how to encode the 7 dots above and below the LAM needed for writing an ejective lateral affricate in these languages. :-(

    Rick McGowan had some fun with John's foray into ideographs:

    > Nah, we'll have to wait until there is an isomorphism between glottal stops, middle dots, turtles, and grass radicals. :-)

    Oh! oh! Look what I just found... The rare "Spotted Vegetarian Turtle Stop"...

    [MK -- picture not posted here in the blog, but it was pretty funny]

    To which Benson Margulies responded:

    Hmm. A mark turtle stop?

    Michael Everson responded to this one effectively:

    >Hmm. A mark turtle stop?

    A mo' tur'le sto'.

    Michael then responded to Ken's European qualification:

    >Because it was the Spanish, English, Portuguese, French, and the Dutch, with their impoverished, Latin-based writing systems, who colonized the Americas, Michael.

    Pity the Finns didn't take UPA with them.

    Peter responded to the theorizing about colonizing of the Americas with an interesting thought...

    Hmmm... There are Korean linguists who think that Hangul would do at least as good a job as Latin as a script that can be adapted to any transcription or orthographic need. So, what if the Koreans had colonized the Americas? I guess we'd be re-opening debates about Hangul encoding models.

    To which Michael Everson responded:

    Yeah... we only have three.

    Thomas Milo had a little fun with Peter's notion, too...

    What if...

    https://www.winston.nl/sitenieuw/artrooms/ar203.html
    https://www.winston.nl/sitenieuw/artrooms/poitiers.pdf

    A MILLENNIUM BUG

    Almost two millennia ago the Romans conquered the Netherlands. Today the Dutch still use Latin letters to write their language. Down with perennial spelling reforms. Just change the bloody script! That would have happened anyway, if Charles Martel at Poitiers AD 732 had lost the battle against the Muslim raider Abd-ar-Rahman ibn-Abdullah al-Ghafiqi. As a result, many European languages, including our own, might have been written with Arabic letters ever since and look today like medieval Spanish or 19th century Bosnian Croato-Serbian.

    What have we learned? Well, that the glottal stop is harder than we thought it would be. For me, a native speaker of no language that has a glottal stop, and someone who thinks of it as looking like punctuation (a lowercase question mark or something), it is even harder to imagine. I guess that's why my thoughts of linguistic aptitude are just delusions....

    If nothing else the whole conversation shows that the internal list can be just as off the wall as the external one can be! :-)
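For anyone who wants to see where the case relationships described in the thread ended up, here is a minimal sketch in Python. It assumes a Python build whose Unicode data is version 5.0 or later (so that U+0242 exists); the `describe` helper is just a name made up for this illustration.

```python
import unicodedata

# The three code points discussed in the thread.
GLOTTAL_STOPS = ["\u0241", "\u0242", "\u0294"]


def describe(ch):
    """Return (code point, name, uppercase mapping, lowercase mapping) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        f"U+{ord(ch.upper()):04X}",
        f"U+{ord(ch.lower()):04X}",
    )


for ch in GLOTTAL_STOPS:
    print(*describe(ch), sep="  ")
```

U+0241 and U+0242 map to each other, while U+0294 maps only to itself, which is the "uncased" status Ken describes above.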

  • Sorting it all Out

    ELK stampede!

    • 12 Comments

    Back in January I was saying "Lions and tigers and bears ELKs, Oh my!" when I first started talking about these new beasts here and the way that new locales were being added.

    Now, the ELKs are stampeding again! You can look at 897338 in the Knowledge Base for info, or head right to the download center and see locale support added to Windows XP SP2 for all of the following languages:

    • Bosnian (Cyrillic, Bosnia and Herzegovina) 
    • Filipino (Philippines) 
    • Frisian (Netherlands) 
    • Inuktitut (Latin, Canada) 
    • Irish (Ireland) 
    • Luxembourgish (Luxembourg) 
    • Mapudungun (Chile) 
    • Mohawk (Mohawk) 
    • Nepali (Nepal) 
    • Pashto (Afghanistan) 
    • Romansh (Switzerland)

    Awesome!

     

  • Sorting it all Out

    Let's get vertical

    • 11 Comments

    (computerized apologies to Olivia Newton John)

    Dmilat asked (in the Suggestion Box):

    @-prefixed fonts

    If you try to manually type a font name like @Arial Unicode MS in the MS Word font selection combo-box and then enter some text with CJK hieroglyphs (make sure the font name did not change), those characters will be rotated 90 degrees. I believe this allows for vertical text layout that may be used by people from East Asian countries. What is amazing is that I failed to find any info on that in MSDN. Is it some kind of undocumented feature? Can you give more info on that?

    This feature has been around for a long time, actually. The first mention of it I ever saw was in Nadine Kano's book (see Vertical Writing and Printing). An excerpt:

    As the following illustration shows, displaying text vertically doesn't mean that you simply rotate an entire line of text by 90 degrees. Most characters remain upright, but others, such as those identified by arrows, change orientation.

    Fortunately, with Win32 you don't need to write code to rotate characters. To display text vertically on Windows, enumerate the available fonts as usual and select a font whose typeface name begins with the at (@) character. Then create a LOGFONT structure, setting both the escapement and the orientation to 270 degrees. Calls to TextOut are the same as for horizontal text.

    The Far East Win32 SDK contains a sample application called TATE (short for tategaki, meaning "vertical writing") which demonstrates how to create fonts and display vertical text. Figure 7-22 shows a sample file displayed in TATE using a horizontal font. Selecting a vertical font from the Font dialog box (see Figure 7-23 below) causes the text to be displayed vertically. (See Figure 7-24 below.)

    And so on. See the link for the full story. :-)
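The recipe in the excerpt (pick the "@" face, set escapement and orientation to 270 degrees) can be sketched roughly as below. This is only an illustration under a few assumptions: it uses Python's ctypes to mirror the Win32 LOGFONTW field layout, `make_vertical_logfont` is a made-up helper name, and "MS Mincho" is just an example face. LOGFONT expresses both angles in tenths of a degree, so 270 degrees becomes 2700.

```python
import ctypes

LF_FACESIZE = 32  # size of lfFaceName in the Win32 headers


class LOGFONTW(ctypes.Structure):
    """Field layout of the Win32 LOGFONTW structure."""
    _fields_ = [
        ("lfHeight", ctypes.c_int32),
        ("lfWidth", ctypes.c_int32),
        ("lfEscapement", ctypes.c_int32),
        ("lfOrientation", ctypes.c_int32),
        ("lfWeight", ctypes.c_int32),
        ("lfItalic", ctypes.c_ubyte),
        ("lfUnderline", ctypes.c_ubyte),
        ("lfStrikeOut", ctypes.c_ubyte),
        ("lfCharSet", ctypes.c_ubyte),
        ("lfOutPrecision", ctypes.c_ubyte),
        ("lfClipPrecision", ctypes.c_ubyte),
        ("lfQuality", ctypes.c_ubyte),
        ("lfPitchAndFamily", ctypes.c_ubyte),
        ("lfFaceName", ctypes.c_wchar * LF_FACESIZE),
    ]


def make_vertical_logfont(face_name, height=-16):
    """Fill a LOGFONTW for the vertical ("@") variant of face_name (hypothetical helper)."""
    lf = LOGFONTW()
    lf.lfHeight = height
    lf.lfEscapement = 2700   # 270 degrees, in tenths of a degree
    lf.lfOrientation = 2700  # same angle for the individual characters
    lf.lfCharSet = 1         # DEFAULT_CHARSET
    # Vertical faces are the ones whose name starts with "@".
    lf.lfFaceName = "@" + face_name.lstrip("@")
    return lf


lf = make_vertical_logfont("MS Mincho")
print(lf.lfFaceName, lf.lfEscapement, lf.lfOrientation)
```

On Windows the filled structure would then be handed to CreateFontIndirectW and the resulting font selected into the device context before calling TextOut; a real application would enumerate the installed fonts rather than hard-code a face name.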

    There are probably other mentions in both the Platform SDK and MSDN, but it is harder to find them with symbols like @ usually being ignored in searches. :-)

     

    This post brought to you by "＠" (U+ff20, FULLWIDTH COMMERCIAL AT)

  • Sorting it all Out

    GetDateFormat is Gregorian based

    • 11 Comments

    The GetDateFormat function either takes a SYSTEMTIME struct or if you pass a NULL for that parameter it uses the GetSystemTime function to retrieve the current date in the form of a SYSTEMTIME struct.

    One weird thing about all of that? Well, SYSTEMTIME is Gregorian calendar based.

    So even if you have some other calendar selected as the current calendar (or if you specify a locale that has some other calendar as its default), you have to pass the current date as a Gregorian thing in order to format it as something non-Gregorian.

    Some people have trouble with this idea conceptually (so luckily this is hidden from all but the developer types, who usually can deal with it okay). But in the end the system has to pivot off of some calendar. It is hard to begrudge software for going with the calendar that is the default where the software comes from, originally.

    Imagine if it were not true, and every developer who called GetSystemTime was required to understand every other calendar in the world, depending on user settings.

    Imagine how your code would only work with some user settings and not others!

    One thing that is important to remember is my bias -- remember that old post, "Calendars on Win32 -- just there for show...."? It kind of explains how the calendars are just for display purposes. The same is true of the dates, which is why everything I have said in this post is basically true.
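To make the "just for display" idea concrete, here is a small sketch in Python rather than Win32. It is not how GetDateFormat is implemented; it only illustrates the pattern of keeping the date Gregorian and converting at formatting time. The `format_japanese_era` function and the deliberately partial `ERAS` table are assumptions made up for this example.

```python
from datetime import date

# (era name, first Gregorian day of the era), newest first.
# Deliberately partial table, for illustration only.
ERAS = [
    ("Heisei", date(1989, 1, 8)),
    ("Showa", date(1926, 12, 25)),
]


def format_japanese_era(d):
    """Format a Gregorian date as 'Era year/month/day' for display."""
    for name, start in ERAS:
        if d >= start:
            era_year = d.year - start.year + 1  # era years count from 1
            return f"{name} {era_year}/{d.month:02d}/{d.day:02d}"
    raise ValueError("date is before the first era in the table")


today = date(2005, 8, 15)          # stored and passed around as Gregorian
print(today.isoformat())            # 2005-08-15
print(format_japanese_era(today))   # Heisei 17/08/15
```

The caller never has to know the era rules; only the formatting step does, which is the same division of labor as passing a Gregorian SYSTEMTIME to GetDateFormat.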

    As an alternative, we could have made all date functions use some kind of system that is independent of all calendars and pivot off of that. But imagine doing all that work to make sure it is harder for everyone to work with the software. I am personally kind of glad that such a decision was not made. :-)

     

    This post brought to you by "۝" (U+06dd, ARABIC END OF AYAH)

  • Sorting it all Out

    Is there a new scam going around?

    • 11 Comments

    In the past few weeks, I have had messages coming to me from many different sources (my email addresses, my home phone number, the contact link on this site, etc.).

    Each message is different, but each has one thing in common with the others: someone claiming to know me from some time in the past is either

    • asking if I am the "right Michael Kaplan", or
    • assuming I am and asking what happened to me

    I have not answered any of them yet -- none of them are ringing any bells and some can't be right since they refer to living in places I have never visited.

    Is there some new scam or identity theft thing going on?
