
March, 2007

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    What's the problem with MapVirtualKey[Ex], on CE and otherwise?


    Also over in the microsoft.public.win32.programmer.international newsgroup, Norman Diamond (who is clearly doing a lot of stuff with CE these days!) asked:

    In Windows Mobile 5 (Windows CE 5), when calling MapVirtualKey with the second parameter set to 2,
    http://msdn2.microsoft.com/en-us/library/ms911789.aspx
    says:

    *  uCode is a virtual-key code and is translated into an unshifted character
    *  value in the low-order word of the return value. Dead keys (diacritics)
    *  are indicated by setting the top bit of the return value. If there is no
    *  translation, the function returns 0.

    The fact is that when uCode is virtual-key code 0x27 (VK_RIGHT), MapVirtualKey doesn't return 0, it returns 0x27.

    So even my programming to produce a reverse translation table (since Windows CE lacks the misnamed VkScanEx API) gets screwed, and applications that use it get screwed.

    Although I'm reading MSDN in English, most of the target environments are other foreign languages and the character code to VK code mapping is not constant.  I need the correct table.  How can I compute it?

    Well, it has been many years since I have done any CE programming at all (the last time was when I was working as a contract PM in CE Services, over six years ago!). But I'll speak of the things I know of, and then make some informed guesses as to the rest.... :-)

    First, there is the piece of this doc that is a bug -- even in the non-CE case.

    The claim they always make is that for one of the mappings, "uCode is a virtual-key code and is translated into an unshifted character value in the low-order word of the return value."

    Well, perhaps I am on drugs for thinking of this, but for me the definition of an unshifted character in this context is the character that is produced when you hit the key without any of the shift keys.

    Beyond that, in a quick test here:

    using System;
    using System.Runtime.InteropServices;

    namespace Testing {
        class MappingIsScrewy {
            [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
            internal static extern IntPtr LoadKeyboardLayoutW(string pwszKLID, uint Flags); 

            [DllImport("user32.dll", ExactSpelling=true)]
            internal static extern bool UnloadKeyboardLayout(IntPtr dwhkl); 

            [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
            internal static extern uint MapVirtualKeyExW(uint uCode, uint uMapType, IntPtr dwhkl); 

            [STAThread]
            static void Main(string[] args) {
                // KLID 00000419 is the Russian keyboard layout.
                IntPtr hkl = LoadKeyboardLayoutW("00000419", 0);
                Console.WriteLine(hkl.ToString("x8"));
                // Map type 2 is MAPVK_VK_TO_CHAR; 0x41 is VK_A and 0x27 is VK_RIGHT.
                Console.WriteLine(MapVirtualKeyExW(0x41, 2, hkl).ToString("x8"));
                Console.WriteLine(MapVirtualKeyExW(0x27, 2, hkl).ToString("x8"));
                Console.WriteLine(UnloadKeyboardLayout(hkl));
            }
        }
    }

    The console output of this code will be:

    04190419
    00000041
    00000000
    True

    This is an uppercase letter being returned on that second line. And an uppercase A (U+0041), no less, when VK_A (0x41) is passed. Even though the Russian keyboard actually puts a U+0444 (CYRILLIC SMALL LETTER EF) at VK_A.

    Thus MapVirtualKey and MapVirtualKeyEx both will return the uppercase character on Windows, and furthermore even if you call MapVirtualKeyEx with the HKL of a non-English keyboard, the uppercase "A" is still being returned. 

    Which probably just means the VK is being returned. This does seem to limit the usefulness of MapVirtualKeyEx.

    MSKLC never ran across a problem here as it only used mapping type 3 -- MAPVK_VSC_TO_VK_EX, not mapping type 2 -- MAPVK_VK_TO_CHAR. The more useful function for us was always ToUnicodeEx, since the tool is only ever interested in the actual character (including the shift state when applicable). It also understands properly the notion of "unshifted character". :-)

    Oh well, at least on Windows if the VK does not map to a character, it will not even return the VK; the function will return 0.

    On the other hand, according to Norman on CE it will not even do that. :-(

    I guess that means if you want to get the right character information (on ANY platform), use ToUnicodeEx and friends instead. The MapVirtualKey functions are a whole lot better at the SC <--> VK type mappings....
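    For what it is worth, here is a minimal sketch of what asking ToUnicodeEx for the unshifted character might look like, in the same style as the test above. It assumes Windows and the Russian layout from that test; the all-zero key state array is how "no shift keys" is expressed, and the Describe helper is hypothetical, just something made up here to print the result. Keep in mind that ToUnicodeEx can change the pending dead key state of the layout, so this is a sketch rather than something to drop into a shipping input loop.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

namespace Testing {
    class UnshiftedCharacter {
        [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
        internal static extern IntPtr LoadKeyboardLayoutW(string pwszKLID, uint Flags);

        [DllImport("user32.dll", ExactSpelling=true)]
        internal static extern bool UnloadKeyboardLayout(IntPtr hkl);

        [DllImport("user32.dll", ExactSpelling=true)]
        internal static extern uint MapVirtualKeyExW(uint uCode, uint uMapType, IntPtr dwhkl);

        [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
        internal static extern int ToUnicodeEx(uint wVirtKey, uint wScanCode, byte[] lpKeyState,
            [Out] char[] pwszBuff, int cchBuff, uint wFlags, IntPtr dwhkl);

        // Hypothetical helper (not a Win32 API): turns the ToUnicodeEx return
        // value and buffer into a readable description.
        internal static string Describe(int rc, char[] buffer) {
            if (rc < 0) return "dead key";
            if (rc == 0) return "no translation";
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < rc && i < buffer.Length; i++) {
                if (i > 0) text.Append(' ');
                text.Append("U+").Append(((int)buffer[i]).ToString("x4"));
            }
            return text.ToString();
        }

        [STAThread]
        static void Main(string[] args) {
            IntPtr hkl = LoadKeyboardLayoutW("00000419", 0);
            byte[] keyState = new byte[256];   // all zero: no Shift, Ctrl, Alt or Caps Lock
            char[] buffer = new char[8];
            foreach (uint vk in new uint[] { 0x41, 0x27 }) {   // VK_A, VK_RIGHT
                uint sc = MapVirtualKeyExW(vk, 0, hkl);        // 0 = MAPVK_VK_TO_VSC
                int rc = ToUnicodeEx(vk, sc, keyState, buffer, buffer.Length, 0, hkl);
                Console.WriteLine("{0:x2}: {1}", vk, Describe(rc, buffer));
            }
            UnloadKeyboardLayout(hkl);
        }
    }
}
```

    The scan code still comes from MapVirtualKeyEx, which is the kind of mapping it is good at; only the character lookup moves over to ToUnicodeEx.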

     

    This post brought to you by ф (U+0444, a.k.a. CYRILLIC SMALL LETTER EF)

  • Sorting it all Out

    Will the real Unicode character message please stand up?


    Over in the microsoft.public.win32.programmer.international newsgroup, Norman Diamond asks:

    http://msdn2.microsoft.com/en-us/library/ms646288.aspx
    *  The WM_UNICHAR message is equivalent to WM_CHAR, but it uses Unicode
    *  Transformation Format (UTF)-32, whereas WM_CHAR uses UTF-16. It is
    *  designed to send or post Unicode characters to ANSI windows

    So both WM_UNICHAR and WM_CHAR use Unicode (though different varieties), but only one of these posts Unicode characters to ANSI windows?  The other one posts a non-Unicode Unicode, or what?

    *  If wParam is not UNICODE_NOCHAR, return FALSE. The Unicode DefWindowProc
    *  posts a WM_CHAR message with the same parameters and the ANSI
    *  DefWindowProc function posts either one or two WM_CHAR messages with the
    *  corresponding ANSI character(s).

    So maybe the non-Unicode Unicode is one or two WM_CHAR messages with ANSI character(s) instead of UTF-16?

    http://msdn2.microsoft.com/en-us/library/ms646276.aspx
    *  The WM_CHAR message uses Unicode Transformation Format (UTF)-16.

    So the WM_CHAR message doesn't use ANSI.  Or applications aren't supposed to expect ANSI from WM_CHAR, they're only supposed to get surprised if WM_UNICHAR was handled by DefWindowProc and resulted in ANSI?

    I'll admit the documentation could be clearer here, but the behavior Norman was most confused about is not too hard to unravel:

    • WM_UNICHAR is always sent as UTF-32 any time a WNDPROC gets it as a message -- so in the case where you are being sent a Unicode code point value representing a supplementary character, you will get just one of these messages [1];
    • WM_CHAR to a Unicode WNDPROC is always sent as UTF-16;
    • WM_CHAR to an ANSI WNDPROC is always sent in the code page associated with the KLID, not the HKL, as discussed in this post.

    However, the text here around the wParam has some real problems too, though in defense of the doc writers for this case, the BEHAVIOR is quite confusing here. The full text is:

    wParam

    Specifies the character code of the key.

    If wParam is UNICODE_NOCHAR and the application processes this message, then return TRUE. The DefWindowProc function will return FALSE (the default).

    If wParam is not UNICODE_NOCHAR, return FALSE. The Unicode DefWindowProc posts a WM_CHAR message with the same parameters and the ANSI DefWindowProc function posts either one or two WM_CHAR messages with the corresponding ANSI character(s).

     And the return value info is:

    An application should return zero if it processes this message.

    Now all of this resembles English but it is a bit more complicated. :-)

    First of all, the wParam info talks about returning FALSE or TRUE when the return value info keeps you focused on the fact that the return is going to be 0 or not.

    Second of all, if it is a supplementary character, a Unicode WNDPROC will get two WM_CHAR messages, not one.
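    The one-versus-two count is just the UTF-32 to UTF-16 arithmetic showing through. A small sketch of that arithmetic, with made-up class and method names and no actual window messages involved:

```csharp
using System;

namespace Testing {
    class SupplementaryCharacter {
        // Splits one UTF-32 code point (the kind of value WM_UNICHAR carries)
        // into the UTF-16 code units that a Unicode WNDPROC would see as
        // separate WM_CHAR messages.
        internal static char[] ToUtf16(int codePoint) {
            return char.ConvertFromUtf32(codePoint).ToCharArray();
        }

        static void Main(string[] args) {
            foreach (int cp in new int[] { 0x0444, 0x10801 }) {
                Console.Write("U+{0:x4} ->", cp);
                foreach (char unit in ToUtf16(cp)) {
                    Console.Write(" {0:x4}", (int)unit);
                }
                Console.WriteLine();
            }
        }
    }
}
```

    This prints "U+0444 -> 0444" and "U+10801 -> d802 dc01": one code unit for the BMP character, a surrogate pair for the supplementary one.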

    But what is the point of the rest of the text?

    What it is really trying to say is that the only time it is okay to return 0 (FALSE) is when what was passed was not a character (wParam is UNICODE_NOCHAR). Otherwise, you should always return TRUE for this function.

    But how easy is it to glean that from the text given? Not very....

    1 - To be perfectly honest, it is unclear to me how often this will be true for non-CJK supplementary characters, since the text that goes through user/userk keyboard layout dlls is using two UTF-16 code units in the form of a "keyboard ligature" and whether it is smart enough to make those two UTF-16 code units into one UTF-32 is an unknown....

     

     

    This post brought to you by 𐠁 (U+10801, a.k.a. CYPRIOT SYLLABLE E)

  • Sorting it all Out

    It was déjà vu all over again, this time in quite an oxymoronic way


    It was just the other day that I got the message via the contact link:

    Michael,

    Hi.

    I'm a Neurologist in Dallas, TX, and a computer person.

    My Internet Explorer 6 (running on Win98) suddenly started using a strange font where the html page coding was PRE{font-family:sans-serif} Text between <PRE> </PRE> was coming out in Noam New Hebrew (file name is _AE4B263.TTF) instead of Arial.

    Interestingly, that became the default font for Write 3.1 (old 16 bit Windows 3.1 word processor). As I understand it, IE picks the generic CSS fonts such as serif, sans-serif, etc., and gives you no direct control over which it chooses.

    I followed some advise on the internet to:

    1. shut down IE
    2. uninstall the font from Control Panel (made a copy of the file in another directory)
    3. re-start IE and confirmed it has switched to another font (in this case Temp Installer Font, almost as bad)
    4. cold boot to make change permanent
    5. re-install Noam New Hebrew

    Now IE6 and write3.1 use Temp Installer Font

    Where is the info kept on what is the system's generic font.

    I tracked changes and couldn't find it in the registry, nor ini files. The changes of deleting the ttf file just show in c:\WINDOWS\FONTS\ and HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\Fonts

    I'd like to be able to manually tell it Arial.

    It must be a true system object since write3.1 picked it up.

    An interesting problem, and not just because I knew what was behind about half of what our neurologist was running into and had a strong hunch about the other half. :-)

    First, the half that I knew about....

    Many people never really fully understand how font selection works on Windows. The truth is that whether one is a programmer calling CreateFont or CreateFontIndirect or CreateFontIndirectEx or using the GDI+ System.Drawing.Font class or the WPF/Avalon System.Windows.Media.FontFamily class, or someone specifying a font face while authoring a web page, or even someone using the Shell Font Picker common dialog, when you pick a font, you are not really picking a font.

    What you are actually doing is picking some descriptive attributes that will describe something that the system will try to map to a font.

    Now sometimes you can't tell the difference. Like when the descriptive attributes are so specific that you end up right on the money.

    And other times, not so much. Like this time.

    When you give a generic description for a font, then the first one that the system believes is a good enough match will be the one that you get.

    Which kind of leads us to the other half. The half relating to the post from earlier today.

    Not since the events of It was déjà vu, man. Pure déjà vu... has the feeling of having been there before been so strong.

    You see, the various duplicated font caches built up by the various technologies build up their data from the font list in the registry -- a font list that is read in via an order that is subject to the same less-than-intuitive registry value ordering logic that I was talking about yesterday.

    This allows for reasonably deterministic behavior in font selection on a given machine, behavior that is subject to the fonts one adds and removes from the font folder, of course. :-)

    As a side point, there are currently only two ways to get the actual order that is in the registry that will be used by the system when looking for fonts:

    1. Get it programmatically via a series of RegEnumValue calls
    2. Export the registry key in question to text -- the values will be in that same order.

    If you try to do this with HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts, then you will see an important change between prior versions of Windows and Vista. And by important I don't mean intentional or feature-filled, I mean significantly different. In a way that can really impact font selection in the new operating system.

    As I mentioned, the change was not intentional, it just has to do with the way that the new setup in Vista accomplishes its goals via the installation of different components, and the fact that the ordering of font components was not done with the impact that the order can have in mind. I suspect there may yet be a bug or two reported by customers due to side effects from this change....

    So now we know why the behavior is consistent and why it can easily be changed by removing and adding fonts and why adding a font back after removing it does not mean that you will always get back the same behavior. And why you may end up with odd choices if you have been doing a lot of adding and/or removing of fonts.

    So now let's look at the last bit and try to explain the actual behavior....

    Let's start by taking a closer look at what was being requested in this case.

    The PRE Element in HTML is defined quite clearly -- it Renders text in a fixed-width font. Which is clearly something that Arial is not.

    The only possible way to get Arial to be selected when a fixed width font is requested is when every better possible choice is unavailable....

    So to be honest, if one is looking for Arial, then the PRE Element is the wrong thing to use anyway. Although other aspects of the element are what often lead people to want to use it, such as the way the text is directly rendered, including spaces which would otherwise be collapsed. Which makes it unfortunate that the functionality of these two different attributes is combined into one element, but perhaps understandable why people might want to try....

     

    This post brought to you by й (U+0439, a.k.a. CYRILLIC SMALL LETTER SHORT I)

  • Sorting it all Out

    It was déjà vu, man. Pure déjà vu....


    It happened some time ago. Right after the Community Server update on the blog, in fact.

    I was bothered by the way that the links over on the right hand side were displayed in the order they were added, even though in the design interface they were sorted alphabetically.

    This drove me nuts since the re-ordering buttons were gone and it was now a huge pain to sort them in the order you wanted given that the interface for adding them would never show the order they were going to be displayed in. I kind of gave up on ordering them, mostly -- too much of a pain in the ass.

    But let me tell you, it was like déjà vu for me.

    The whole problem -- a user interface for modifying a list being in one order while the actual order it was an interface for being something else? I had definitely been there before....

    The first time was a while back.

    Long before Vista shipped, in fact.

    Mike and I were going over to talk to a man named Dragos.

    We were looking for some ideas on the best way to optimize an HKCU registry story that we couldn't get out of just yet, and if anyone could help us figure out how to make things go faster, everyone said it would be Dragos.

    It was a great meeting, and one of the reasons that working at a place like Microsoft can be a lot of fun [1].

    A bunch of good old fashioned developer geek brainstorming happened in that meeting, as we laid out what we were doing and what we had tried, and he suggested some things for us to try out that just might make things better.

    One of the things he suggested had to do with the way registry values (i.e. the ones you would get from RegEnumValue) were stored.

    Turns out that the order you see in RegEdit is just a fluffy sort done in the tool; the actual order happens to be a FIFO kind of thing, with values stored in the order they are added. Changing a value's data keeps its position as it was, but renaming a value would delete and re-add it, and the value would end up at the end of the list. And when you read values in from a function like RegQueryValueEx or the new RegGetValue (Larry Osterman's "favorite" Win32 API function from last January), the OS scrolls through the values until it reaches the one you wanted (kinda what you'd have to do if you were doing it yourself by calling good old RegEnumValue, which is not a coincidence).
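    To make the difference between the two orders concrete, here is a toy model. It is not the registry and does not call any registry API; it assumes that changing a value's data leaves it where it is while a rename is a delete plus a re-add, and the class and the font names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

namespace Testing {
    // Toy model only: this is NOT how the registry is implemented.
    class ToyValueList {
        private readonly List<KeyValuePair<string, string>> values =
            new List<KeyValuePair<string, string>>();

        internal void Set(string name, string data) {
            int i = values.FindIndex(v => v.Key == name);
            KeyValuePair<string, string> entry = new KeyValuePair<string, string>(name, data);
            if (i >= 0) values[i] = entry;   // changing the data keeps the position
            else values.Add(entry);          // new values go to the end (FIFO)
        }

        internal void Rename(string oldName, string newName) {
            int i = values.FindIndex(v => v.Key == oldName);
            if (i < 0) return;
            string data = values[i].Value;
            values.RemoveAt(i);                                          // delete...
            values.Add(new KeyValuePair<string, string>(newName, data)); // ...and re-add at the end
        }

        // The order a RegEnumValue-style walk would see in this model.
        internal List<string> StoredOrder() {
            return values.ConvertAll(v => v.Key);
        }

        // The order a RegEdit-style sorted view would show.
        internal List<string> DisplayOrder() {
            List<string> names = StoredOrder();
            names.Sort(StringComparer.OrdinalIgnoreCase);
            return names;
        }

        static void Main(string[] args) {
            ToyValueList key = new ToyValueList();
            key.Set("Verdana", "verdana.ttf");
            key.Set("Arial", "arial.ttf");
            key.Set("Tahoma", "tahoma.ttf");
            key.Set("Arial", "arial2.ttf");   // data change: position unchanged
            key.Rename("Arial", "Courier");   // rename: moves to the end
            Console.WriteLine(string.Join(", ", key.StoredOrder()));
            Console.WriteLine(string.Join(", ", key.DisplayOrder()));
        }
    }
}
```

    The first line comes out as "Verdana, Tahoma, Courier" and the second as "Courier, Tahoma, Verdana", which is the whole problem in miniature: the interface shows one order while the thing behind it uses another.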

    "Would it always be this way?" I wondered later in mail to Dragos, thinking about the comments in this post of Raymond Chen's, where he warned:

    There is no guarantee that the order of RegEnumValue will be "in order of creation". The registry code was tweaked for performance in Windows XP and I suspect it will be tweaked for performance in the future. One of these tweaks may change the order of enumeration, since that is unspecified.

    and the Platform SDK documentation about the order in RegEnumValue:

    dwIndex
    The index of the value to be retrieved. This parameter should be zero for the first call to the RegEnumValue function and then be incremented for subsequent calls.

    Because values are not ordered, any new value will have an arbitrary index. This means that the function may return values in any order.

    I was encouraged not to worry in our case, since

    • in the first place the docs here are pretty much just plain wrong. But since there is no UI that shows the actual order, from a user's point of view it really is kind of arbitrary, so it is not quite so wrong;
    • in the second place even if it did ever change, it would not happen in a service pack -- that would be a pretty big re-architecting;
    • in the third place, something like that would obviously be communicated widely to developers working on Windows in a future version.

    So as long as we were willing to not cry foul later if we ended up having to rethink our optimization (due to the registry having had its architecture rethought), then there was no reason to worry (note that most people are not willing to do that, which is why Raymond's specific warnings in this case about undocumented behavior are completely true -- the unclear doc story tells people plain as day that they can really only rely on bubkes in the long run!).

    I only mention it here because I know the people who read this blog are really smart. Like DNRC smart. Never the kind of people who would only read the first half a paragraph while ignoring the warnings in the second half.

    (Plus if somebody wanders in after a Google search who isn't as smart, someone can make them feel plenty foolish for not reading the whole thing, so it all works out!).

    Now you may be wondering, if the original incident happened years ago and the one that inspired the déjà vu happened a while ago too, why I am talking about this now. Especially since neither story has much to do with the internationalization kind of posts I usually do.

    The answer is that it is all a kind of a foundation sort of thing, one that I will be using as the basis for my next post.

    Plus it makes for a good story, don't it? :-)

     

    1 - It is actually one of the things I like about my new job -- I get to spend time in the same sort of meeting, and even get to be the person who is pointing out the best way to do things!

     

    This post brought to you by Й (U+0419, a.k.a. CYRILLIC CAPITAL LETTER SHORT I)

  • Sorting it all Out

    Having a 'c09' container clearly implies that one can contain an Aussie


    So Ben asked me via the contact link:

    Hey Michael,

    Great Blog you've got going here. A colleague of mine at work introduced me and I thought i'd ask you a question which has been plaguing me for a while.

    I want to change the Display Name for my users within the display specifiers in my AD (Server 2003). I want English Australia to display as "Surname, GivenName".

    The problem I face is, I can't seem to find the correct container for English Australia. The only one which comes close is CN=409, but thats for English US.

    Do you what the correct container is for English Australia and how I get it to populate in my AD?

    Thanks

    Ben

    I don't actually control an Active Directory to test this on, but I'll say what I know of from reading some docs, and then maybe ask a few others to correct my understanding if it is not right. :-)

    Active Directory is limited in the locales you can specify to the legal Windows locales on the machine. As the DisplaySpecifiers Container topic states:

    The Configuration container stores the DisplaySpecifiers container, which then stores containers that correspond to each locale. These locale containers are named using the hexadecimal representation of the locale identifier. For example, the US/English locale container is named 409, the German locale's container is named 407, and the Japanese locale's container is named 411.

    There is also a C++ code sample in that topic that binds to the DisplaySpecifiers container for a specific locale.

    (The sample uses the GetSystemDefaultLCID function to get the locale, but clearly one could substitute MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_AUS) or 0x0c09 instead. We'll ignore the groaner of using the default system locale here)

    For Ben's case where he is looking for the container string, a CN=c09 should do the trick?
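    The arithmetic behind that name is short enough to sketch. The constants below are the winnt.h values for these language identifiers, MakeLangId repeats what the MAKELANGID macro does, and ContainerName is a made-up helper that only formats the string; nothing here talks to an actual Active Directory.

```csharp
using System;

namespace Testing {
    class LocaleContainer {
        internal const int LANG_ENGLISH = 0x09;
        internal const int SUBLANG_ENGLISH_US = 0x01;
        internal const int SUBLANG_ENGLISH_AUS = 0x03;

        // Same arithmetic as the MAKELANGID macro: sublanguage in the upper
        // bits, primary language in the low 10 bits.
        internal static int MakeLangId(int primary, int sub) {
            return (sub << 10) | primary;
        }

        // Hypothetical helper: hex locale identifier without leading zeros,
        // following the 409 / 407 / 411 examples in the quoted topic.
        internal static string ContainerName(int lcid) {
            return "CN=" + lcid.ToString("x");
        }

        static void Main(string[] args) {
            Console.WriteLine(ContainerName(MakeLangId(LANG_ENGLISH, SUBLANG_ENGLISH_US)));
            Console.WriteLine(ContainerName(MakeLangId(LANG_ENGLISH, SUBLANG_ENGLISH_AUS)));
        }
    }
}
```

    This prints CN=409 and then CN=c09.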

     

    This post brought to you by ệ (U+1ec7, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW)

  • Sorting it all Out

    Dumb ellipses?


    In the spirit of dumb quotes

    A regular reader pointed me this one....

    It would seem that Word's AutoCorrect feature strikes again. :-(

    Take a look at the episode list provided by epguides.com for the show What About Brian:

    You might recognize how some lines have had ... (U+002e U+002e U+002e) replaced by … (U+2026), rather than just doing something different typographically with the three dots.

    I wonder if starting an online petition to turn this feature off by default would stop the madness before it destroys the internet, Professor Chaos style!

     

    This post brought to you by … (U+2026, a.k.a. HORIZONTAL ELLIPSIS)

  • Sorting it all Out

    When language codes are reported that don't actually exist


    Marvin asked in the Suggestion Box:

    Looks like GetLocaleInfo(0x46C, LOCALE_SISO639LANGNAME,...)

    returns "ns" on XP. Which is not a legal ISO-639-1 abbreviation. Should have been "nso" from ISO-639-2.

    Is it a bug or something by [unimaginable] design?

    Yes, Marvin is correct. The locale represented by the LCID 0x046c (Sesotho sa Leboa - South Africa) should never have returned a two-letter code since it has no two letter ISO 639 code.

    There are a few such mistakes that can be found in the code from the various ELK locales that shipped with XP SP2 and after XP SP2.

    I don't really have an excuse to claim here, it was a transitional period for the custodian of the data and there were some misunderstandings about how to produce some of the data. It is the sort of thing that could normally be caught in review, but perhaps in their eagerness to see some of this support ship, people missed a few things.

    These problems were fixed in Vista, for what it's worth (it is actually one of the issues behind differences between XP and Vista that were discussed in this post, though not the cause of problems described in this other post). As was the process for reviewing and verifying this kind of data....

    Sorry about that. :-(

     

    This post brought to you by ⧳ (U+29f3, a.k.a. ERROR-BARRED BLACK CIRCLE)

  • Sorting it all Out

    Somewhat irreverent theories on the origins of the alphabet


    A friend of mine got me the new-ish 3-disc collection from the comedian Gallagher.

    In Melon Crazy, at 6:48 into the show, he points out that we "...park in a driveway and drive in a parkway".

    Like I said before. :-)

    Now Melon Crazy dates back to 1984, so maybe Steven Wright said it first or maybe not (not sure when Steven first said it, and none of the web sites that list the quote bother to say where it is from -- I hate unattributed quotes sometimes, it encourages the logic that attributes Elmo's Got a Gun to Weird Al Yankovic).

    Linguistic highlights include his theories about the origins of the alphabet (from Over Your Head) -- "...based on some kind of a bookkeeper's code to keep the Jews' and the Egyptians' nose out of the Phoenician cattle business".

    Or in another show when he talked about the order and why it started with ΑΒΓΔ -- "ah beh gu dae -- have a good day!"

    Very silly, but definitely fun....

    So, anyone know when (well, what show) Steven Wright first made the driveway/parkway observation? I know now that I wasn't crazy (or if I am then it is for other stuff, not for this), now I am just curious who said it first....

     

    This post brought to you by Β (U+0392, GREEK CAPITAL LETTER BETA)

  • Sorting it all Out

    Warning: when private is used in public, it can really suck


    Jeffrey's question earlier today:

    Hi experts,

    My customer has a custom designed font with some special characters in the Private Use Area (PUA). He is having a problem when trying to paste these characters (as text) into a RichEdit control in his application. Basically the character that gets displayed in the RichEdit is not the one he pasted in even though the font in the RichEdit control is set to his font.

    For example, if he pastes in character 0xF012 (which is an 'x' with a bar on top) from Notepad, he instead got a double-headed vertical arrow character, which we are not even convinced is from his font. A similar replacement seems to happen for most of his PUA characters.

    This occurs on Windows XP, so RichEdit 4.1 we think.

    Any ideas on this strange problem?

    Well, I had my suspicions, but Murray pointed out the actual problem, and it actually relates to the way that SYMBOL FONTS map to the PUA, as I have discussed in previous posts like More than you ever wanted to know about CP_SYMBOL, Strangely Symbolic font issues, and 'Doctor, it hurts when I do this.' Well, don't do that!

    The solution for dealing with this over-eager helping out that RichEdit does with the PUA code points between U+f000 and U+f0ff? (I would start the range at U+f020, but they apparently do this weird font behavior starting with U+f000.)

    Well, avoid that small section of the PUA if you want to use RichEdit, because even if they take out this functionality in the future there are going to be multiple existing versions that are going to do this.
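    If you want to find out ahead of time whether some text wanders into that section, a check along these lines would do it. This is only a sketch with made-up names; the range is the U+f000 to U+f0ff one described above, not something read out of RichEdit itself.

```csharp
using System;
using System.Collections.Generic;

namespace Testing {
    class PuaCheck {
        // The slice of the PUA that, per this post, RichEdit treats as
        // symbol font characters.
        internal static bool IsInSymbolRange(char c) {
            return c >= '\uf000' && c <= '\uf0ff';
        }

        // Returns the characters in text that fall inside that range,
        // formatted as U+xxxx.
        internal static List<string> FindRiskyCharacters(string text) {
            List<string> risky = new List<string>();
            foreach (char c in text) {
                if (IsInSymbolRange(c)) {
                    risky.Add("U+" + ((int)c).ToString("x4"));
                }
            }
            return risky;
        }

        static void Main(string[] args) {
            // U+f012 is the character from the question; U+e012 is a PUA
            // character outside the problem range.
            string pasted = "a\uf012b\ue012";
            Console.WriteLine(string.Join(", ", FindRiskyCharacters(pasted)));
        }
    }
}
```

    For the sample string this prints U+f012 only, since U+e012 sits in a part of the PUA that the symbol font behavior does not touch.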

    Now I personally find this behavior to be inexcusably lame, and putting in such a public contract for use of the PRIVATE USE AREA is just really ugly....

    But at least we know why RichEdit is doing what it is....

     

    This post brought to you by U+f020, a <Private Use> character

  • Sorting it all Out

    The lazy yet foxy jackdaw I love jumped over my quick brown sphinx dog of quartz


    A couple of days ago when I wrote about how In Vista, jackdaws appear to be somewhat endangered, I mentioned

    ...both strings are actually in Message Compiler resources which means they could actually be localized (though note that the above algorithm means that localization might make the situation worse here, not better. On top of that, what do you do when you have a font with no latins in it? By this algorithm, they will just get another Latin script string which will still have to use font linking to find the glyphs to display.

    With the help of Claus Juhl (you may recall him from the Channel 9 video I posted about last August), I was able to look to see what the localizers for the various Windows language releases did to both strings:

    • The quick brown fox jumps over the lazy dog.
    • Jackdaws love my big sphinx of quartz.

    Here are some of the highlights....

    First of all, Jackdaws love my big sphinx of quartz was not localized for any language. You can contemplate what this means for the algorithm I posted. :-)

    Second of all, this is clearly a problem like the one from 'Cette phrase en français est difficile à traduire en anglais', since it is clearly not intended that an actual translation of 'The quick brown fox jumps over the lazy dog' be done. What is desired is a pangram covering the letters in the target language.
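    Checking whether a candidate sentence really is a pangram is easy enough to sketch. The alphabet strings below are assumptions supplied by the caller (what counts as "the letters" is a per-language decision), the names are made up, and the comparison is a plain code unit match, so precomposed and decomposed forms of the same letter would not match each other.

```csharp
using System;
using System.Text;

namespace Testing {
    class PangramCheck {
        // Returns the letters of alphabet that never appear in sentence
        // (case-insensitive); an empty result means it is a pangram.
        internal static string MissingLetters(string sentence, string alphabet) {
            string lower = sentence.ToLowerInvariant();
            StringBuilder missing = new StringBuilder();
            foreach (char letter in alphabet) {
                if (lower.IndexOf(letter) < 0) missing.Append(letter);
            }
            return missing.ToString();
        }

        static void Main(string[] args) {
            string latin = "abcdefghijklmnopqrstuvwxyz";
            string[] samples = {
                "The quick brown fox jumps over the lazy dog.",
                "Jackdaws love my big sphinx of quartz.",
                "Tämä on malliteksti."
            };
            foreach (string sample in samples) {
                Console.WriteLine("{0} missing letter(s)", MissingLetters(sample, latin).Length);
            }
        }
    }
}
```

    The two English strings come back with 0 missing letters, while the Finnish string is missing 16 of the 26 basic Latin letters before you even get to å, ä and ö.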

    Let's see how it worked out with a bunch of those languages:

    Arabic: ‏‏من طلب العلا سهر الليالي. 

    Bulgarian: Вкъщи не яж сьомга с фиде без ракийка и хапка люта чушчица!

    Chinese (PRC): The quick brown fox jumps over the lazy dog [1].

    Chinese (Taiwan): 微風迎客,軟語伴茶

    Czech: Příliš žluťoučký kůň úpěl ďábelské ódy!

    Danish: Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen Walther spillede på xylofon.

    Dutch: Pa's wijze lynx bezag vroom het fikse aquaduct.

    Finnish: Tämä on malliteksti.

    French: Voix ambiguë d'un cœur qui au zéphyr préfère les jattes de kiwis.

    German: Franz jagt im komplett verwahrlosten Taxi quer durch Bayern.

    Greek: Θέλει αρετή και τόλμη η ελευθερία (Ανδρέας Κάλβος).

    Hebrew: ‏‏דג סקרן שט לו בים זך אך לפתע פגש חבורה נחמדה שצצה כך. 

    Hindi: सारे जहाँ से अच्छा हिंदोस्तां हमारा. 

    Hungarian: Árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP

    Italian: Cantami o Diva del pelide Achille l'ira funesta.

    Japanese: Windows でコンピュータの世界が広がります。

    Korean: 다람쥐 헌 쳇바퀴에 타고파.

    Norwegian: En god stil må først og fremst være klar. Den må være passende. Aristoteles.

    Polish: Zażółć gęślą jaźń.

    Portuguese (Brazilian): abcdefghijklmnopqrstuvwxyz.

    Portuguese (Iberian): A rápida raposa castanha salta em cima do cão lento.

    Russian: Съешь еще этих мягких французских булок, да выпей чаю.

    Slovak: Kŕdeľ ďatľov učí koňa žrať kôru.

    Slovenian: V kožuščku hudobnega fanta stopiclja mizar in kliče

    Spanish: El veloz murciélago hindú comía feliz cardillo y kiwi. La cigüeña tocaba el saxofón detrás del palenque de paja.

    Swedish: Flygande bäckasiner söka hwila på mjuka tuvor.

    Turkish: abcçdefgğhıijklmnoöpqrsştuüvwxyz.

    My favorites are Japanese and Iberian Portuguese.

    How about yours? :-)

    You may be wondering why I tagged this post with 'Unicode Lame List' -- just keep in mind how poor all of these sentences are at dealing with the issue of showing what makes a font unique to any user who might be curious. Just remember, it is not really the localizers who are lame here -- it is the implementation....


    1 - When in doubt, don't translate? :-)

     

    This post brought to you by ဆ (U+1006, a.k.a. MYANMAR LETTER CHA)

  • Sorting it all Out

    Double Secret ANSI, part 2 (the brokenest one yet, sorry 'bout that!)


    So when I first posted Double Secret ANSI, part 1 (Somewhere between ANSI and Unicode), I was only a little bit surprised when regular reader Mihai knew where I was heading....

    (Mihai works for a company that produces some of those Double Secret ANSI applications I was talking about!)

    You see, we kind of broke some of those Double Secret ANSI applications in Vista. 

    I was a bit slow posting the follow-up as I was waiting to see what the plans were. And Shawn has just posted about that in Some Keyboards fail with ANSI applications on Windows Vista RTM, and about how a fix is being planned.

    This is very cool, and I am sure we'll both be posting about the fix when it is available. It is a pretty bad bug.

    How the bug happens is interesting though, so I thought I'd talk about that for a bit....

    You see, at the point where a keyboard layout is loaded, the kernel mode keyboarding code thunks to a function that uses TranslateCharsetInfo and the LANGID that is inside of the KLID (Keyboard Layout Identifier) of the keyboard. It gets the lsCsbDefault piece of the LOCALESIGNATURE and uses it for three purposes:

    • To decide whether to pass INPUTLANGCHANGE_SYSCHARSET in the WPARAM of the WM_INPUTLANGCHANGEREQUEST notification1;
    • To choose what goes in the WPARAM of the WM_INPUTLANGCHANGE notification;
    • To have the code page to use for all ANSI/Unicode conversions that are needed for all non-Unicode applications.

     It is that last bullet point that is where the power of double secret ANSI applications comes from. :-)
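    To make that chain a little more concrete, here is a minimal Python sketch of the lookup as described: pull the LANGID out of the KLID, take the code page bit from the locale's default signature, and use the matching code page for the ANSI conversion. The FS_* bit values are the ones from wingdi.h; the function names and the sample signature value in the usage note are assumptions made up for illustration, not the actual kernel code or the real LOCALESIGNATURE data.

```python
# Code page bits as defined in wingdi.h (FS_LATIN1 ... FS_JOHAB), mapped to
# the Windows code page each one stands for.
FS_BIT_TO_CODE_PAGE = {
    0x00000001: 1252,  # FS_LATIN1
    0x00000002: 1250,  # FS_LATIN2
    0x00000004: 1251,  # FS_CYRILLIC
    0x00000008: 1253,  # FS_GREEK
    0x00000010: 1254,  # FS_TURKISH
    0x00000020: 1255,  # FS_HEBREW
    0x00000040: 1256,  # FS_ARABIC
    0x00000080: 1257,  # FS_BALTIC
    0x00000100: 1258,  # FS_VIETNAMESE
    0x00010000: 874,   # FS_THAI
    0x00020000: 932,   # FS_JISJAPAN
    0x00040000: 936,   # FS_CHINESESIMP
    0x00080000: 949,   # FS_WANSUNG
    0x00100000: 950,   # FS_CHINESETRAD
    0x00200000: 1361,  # FS_JOHAB
}


def langid_from_klid(klid: str) -> int:
    """Return the LANGID held in the low word of an 8-hex-digit KLID string."""
    return int(klid, 16) & 0xFFFF


def code_page_from_default_signature(cs_default: int) -> int:
    """Map the lowest set code page bit to its code page (0 if none is known).

    Hypothetical helper: it only mimics the kind of bit-to-code-page lookup
    that TranslateCharsetInfo performs.
    """
    for bit, code_page in sorted(FS_BIT_TO_CODE_PAGE.items()):
        if cs_default & bit:
            return code_page
    return 0


def to_ansi(text: str, code_page: int) -> bytes:
    """Convert Unicode text to bytes using the keyboard's code page."""
    return text.encode(f"cp{code_page}")
```

    With a Russian KLID of "00000419" and an assumed default signature of FS_CYRILLIC, `code_page_from_default_signature(0x4)` gives 1251, so `to_ansi("А", 1251)` converts through the Cyrillic code page no matter what the system default is. Feed the same function a wrong signature bit and the conversion goes through the wrong code page, which is the shape of the bug described here.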

    And it is that bug in a bunch of the LOCALESIGNATURE data in Vista that leads to the problems in those same applications for some keyboards :-(

    Which is not to say that there aren't analogous bugs in the LOCALESIGNATURE data in prior versions.

    Because there are. Remember this post?

    It is just that the values are much more broken in Vista. Did I mention that I am glad we are working to get this fixed?

    Though to be honest, in a future version I am going to try to look into simplifying this work a bit. I mean, having user/k thunk to user mode to call gdi32.dll so it can call kernel32.dll to retrieve what was originally loaded in kernel mode anyway, all to get at what has (historically speaking) been the single least reliable piece of locale data we ship, involves several links that are simply not needed.

    Talk about needless dependencies!

    Especially when you consider the fact that it is incredibly difficult to ever retrieve the code page value that is being used in the keyboarding code at any later point in the use of the keyboard layout. I mean, what's a nice, honest double plus ANSI application supposed to do? :-)

    There is probably a whole bunch more here that I can talk about another time, so if the topic is interesting to you then stay tuned....

    About the only mildly reassuring point here is that most apps that don't support Unicode never do much beyond the default system code page. Which is not saying much.

    For all of those double plus ANSI applications, sorry about that. This was not some kind of stunt to get people to write Unicode applications.

     

    1 - More on the weirdnesses here beyond these ones another day.

     

    This post brought to you by А (U+0410, a.k.a. CYRILLIC CAPITAL LETTER A)

    A way better model for features, part 2

    You may have seen the first part of this series (A way better model for features). Think of this as part 2....

    This time I am going to use as an example (or perhaps a victim) the Microsoft National Language Support Downlevel APIs 1.0, which I have talked about previously in posts like this one.

    Now I am not going to complain about the fact that they are tagging a small subset of functions from the NLS API as "APIs", though clearly someone is not paying attention to the fact that API stands for Application Programming Interface, and thus a function is not an API, and neither is a small set of functions APIs, plural.

    But this is just one of those minor language issues I try not to become all mavenesque about. And the other reason I won't complain about this is that I used to make the same mistake. I don't anymore, and I am hopeful that all of my colleagues will come to the same realization one day. :-)

    To see what my complaint is, let's look at the functions in this library:

    • DownlevelLCIDToLocaleName - converts an LCID to a locale name.
    • DownlevelLocaleNameToLCID - converts a locale name to an LCID.
    • DownlevelGetParentLocaleLCID - retrieves the parent LCID for a locale.
    • DownlevelGetParentLocaleName - retrieves the locale name for the parent of the indicated locale.

    Now let's compare that little set of functions to the ones they replace in the downlevel situation:

    At this point, you may see what the problem is -- you need to write two different kinds of code to use the two different functions!

    So what is the benefit of using locale names downlevel if all of the code that is going to be written has to be special cased anyway? There is no real benefit to moving the code to use names here; one may as well either

    1. have the special case code use LCIDs downlevel and names on Vista, or
    2. just stick with LCIDs on all platforms.

    Since the whole goal of nlsdl.dll is to try to make sure people don't do #2, this particular set of functions fails the basic requirements that led to the DLL being created.

    A much better model here is code that can look identical on all platforms. What I would do is one of the following, instead:

    • Ideally, add downlevel wrapper functions with the same names to kernel32.lib so that one can just compile always and the code in the .LIB will either use the stub or call the OS function;
    • As a fallback, do the same thing with stubs but put them in a separate .LIB file and then put in MSLU loader style rules to get the downlevel functions used.

    This way, there is a good and easy way to write the new code, not the old code. To use names and not LCIDs. And there is no eventual fear of needing to revisit code based on changes.

    To be honest, I would extend the model and add stub versions of all of the locale name based Ex functions added in Vista (discussed previously here) and put them in that same library. Because no one is going to want to write "If on Vista do this entirely different operation" type code, so the best thing to do is add these wrappers and call it a day.

    The existing DLL is simply not such a hot model for helping developers migrate to using locale names rather than LCIDs....

    So it looks like MSLU wins again, this time when it comes to model decisions for migration and giving developers a clean migration path. Barry Bond wrote the first prototype that did this for MSLU, so I am technically just saying it is Barry's ideas that trump nlsdl.dll's ideas, not mine. I just had the good luck to be the one to take his prototype and run with it. :-)

     

    This post brought to you by ဉ (U+1009, a.k.a. MYANMAR LETTER NYA)

    Hidden via the purloined post-it technique

    So this last week Cathy had her 40th birthday....

    Julie had sent a mail explaining in simple terms why it was so important that something be done for this important event:

    First, we need to celebrate this fine milestone in her life.

    Second, it’s payback time. :-)

    In keeping with Cathy's Inner Monica™, one of the important tasks was to cover the inside of her office with a reported 5000 Post-it notes of all different colors. It made for quite a sight all week, let me tell you!

    That morning as people were coming by to admire and/or gawk at this fine effort, Jenny and I were outside Cathy's office discussing why this was so amusing. I pointed out that there were way more than 5000 Post-it notes there even if only 5000 were put up for the 40th birthday celebration (or assault), given how many Cathy always put up on her monitor and her desk to keep track of important items/issues.

    Jenny pointed out how all of those Post-its were very well hidden, in plain sight. Like that story, the name of which she couldn't recall at that moment.

    I immediately realized what she was referring to, named the story, and I think both of us were probably way too amused at how many important issues in Cathy's work life were basically hidden by the Purloined Post-it method. :-)

    I realize there are probably more pictures I could post, right down to the one of a 10-year-old Cathy with identical hair color, but she is probably embarrassed enough at the Friends reference above so I'll show some mercy....

    (Hat tip to Edgar Allan Poe)

     

    This post brought to you by ™ (U+2122, a.k.a. TRADE MARK SIGN)

    In Vista, jackdaws appear to be somewhat endangered

    Earlier today (in There was an order for letters, iroha was it's name-oh!) I talked about a specific pangram that has an interesting educational functionality, even in modern times.

    Maybe for computers too, and maybe not -- the evidence isn't in just yet. :-)

    But there are cases where a good pangram is exactly what you might want on a computer when you think about the core purpose of a pangram -- to show each and every letter in the alphabet in as short a string as possible.

    How about in the sample text used by the Windows font viewer?

    If you look at the fonts folder in Windows XP or Server 2003 and just double click on any font you will get a nice dialog with a nice pangram in it:

    while other fonts would show a different pangram:

    As far as I know, no one has ever described the rules by which Fontview.EXE decides what string to display. Mainly, people seem to rely on knowing that certain fonts show certain strings, and they leave it at that. It is not a particularly interesting algorithm, basically going something like this:

    if the Thread Locale is CJK:
        if the LPLOGFONT->lfCharSet is SYMBOL_CHARSET, ANSI_CHARSET, DEFAULT_CHARSET, or OEM_CHARSET:
            The quick brown fox jumps over the lazy dog. 1234567890
        else:
            Jackdaws love my big sphinx of quartz. 1234567890
    else:
        if the font claims to support the CP_ACP of the system:
            The quick brown fox jumps over the lazy dog. 1234567890
        else:
            Jackdaws love my big sphinx of quartz. 1234567890

    Not an especially brilliant algorithm, and both strings are actually in Message Compiler resources, which means they could actually be localized (though note that the above algorithm means that localization might make the situation worse here, not better). On top of that, what do you do when you have a string with no latins in it? By this algorithm, they will just get another Latin script string which will still have to use font linking to find the glyphs to display.
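    For anyone who would rather read that as code, here is a rough Python rendering of the same decision tree. The charset and primary language values are the usual Win32 constants; the function names are made up, and the "font claims to support CP_ACP" test is reduced to a boolean the caller supplies, so treat this as a sketch of the logic described above rather than what Fontview.EXE actually contains.

```python
# Win32 charset values (wingdi.h) and primary language IDs (winnt.h).
ANSI_CHARSET, DEFAULT_CHARSET, SYMBOL_CHARSET, OEM_CHARSET = 0, 1, 2, 255
LANG_CHINESE, LANG_JAPANESE, LANG_KOREAN = 0x04, 0x11, 0x12

FOX = "The quick brown fox jumps over the lazy dog. 1234567890"
JACKDAWS = "Jackdaws love my big sphinx of quartz. 1234567890"


def is_cjk_locale(lcid: int) -> bool:
    """True when the primary language (low 10 bits of the LCID) is CJK."""
    return (lcid & 0x3FF) in {LANG_CHINESE, LANG_JAPANESE, LANG_KOREAN}


def sample_text(thread_lcid: int, lf_charset: int, font_supports_acp: bool) -> str:
    """Pick the sample string following the two-branch rule described above.

    `font_supports_acp` stands in for "the font claims to support the
    CP_ACP of the system"; how that is determined is left to the caller.
    """
    if is_cjk_locale(thread_lcid):
        generic = {SYMBOL_CHARSET, ANSI_CHARSET, DEFAULT_CHARSET, OEM_CHARSET}
        return FOX if lf_charset in generic else JACKDAWS
    return FOX if font_supports_acp else JACKDAWS
```

    Note that nothing in either branch looks at what the font can actually display; both outcomes are Latin script strings.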

    Plus, if you are trying to understand at a glance what each font is for, how does this honestly help?

    I am sure that the algorithm could be lamer than this, but offhand I can't think of how. :-(

    In Vista, the Jackdaw was put on the endangered species list and that string is no longer available. The string is still localizable, so maybe it is actually being modified in different language versions, though this is of limited use and still kind of lame since the string should be FONT driven, not UI language driven.

    So now let's think about how we could do a better job here!

    How would you proceed with the task of deciding the best possible way to produce the optimal default string, one that not only shows off the font's best characteristics but does so in the language that the user is most likely to be able to understand (if there is one, of course)?

    That just screams out for an interview question, any time the candidate claims knowledge of Win32 Text/GDI!

    Anyone want to take a stab here at the algorithm they would try and use? I'll post my thoughts tomorrow.....

     

    This post brought to you by ᘺ (U+163a, a.k.a. CANADIAN SYLLABICS CARRIER TLU)

    There was an order for letters, iroha was it's name-oh!

    (Apologies to the farmer's dog Bingo!) 

    So the question I got from a customer the other day was an interesting one:

    Does Windows support the Iroha ordering Kana? I did not see an option for it.

    Windows doesn't support it, no. Though maybe I should say a bit more about this, it being a blog and all....

    The basis of the Iroha ordering is a poem, one that is a nearly perfect pangram1 for Kana. The poem goes like this:

    いろはにほへと
    ちりぬるを
    わかよたれそ
    つねならむ
    うゐのおくやま
    けふこえて
    あさきゆめみし
    ゑひもせす

    Now this is a common ordering that many Japanese students in Japan may have learned during their youth while learning the language, but it doesn't really get used much after that (the Gojūon ordering is favored).

    In fact, after talking to some colleagues of mine who grew up in Japan the only real uses that came up were somewhat random, like the ones mentioned in that Wikipedia article I pointed to above:

    • Seat numbering in auditoriums/theaters (a good reason to learn the poem if you plan to live in Japan, huh?)
    • Go games in Japan which would have the letters at the top from right to left
    • The musical scale in Japanese (A B C D E F G becomes i ro ha ni ho he to)
    • A few other numbering/counting cases of particular items used in Japanese

    So let's back up to the original question -- if it is used in all of these places, then why is it not an alternate collation for Kana in Windows?

    Well, first the simple reason -- it really hasn't been requested (or, if it has, the request has not made it here yet!).

    Second is the fact that most of the cases where the ordering is used don't necessarily make sense in the context of an alphabetical ordering in a call to a function like CompareString.

    Which leads to the more complex reason, in the definition of what I meant by request. I mean with an actual scenario, a time when the ordering would make sense (and make sense morally and ethically!) to use (and of course I would not count making it easier to cheat on primary school exams in Japan when students own Windows Mobile devices as an acceptable reason, due to ethical concerns!).

    Now would such an ordering actually be useful in some scenarios? It is an interesting problem to contemplate (the person who asked did not give a specific reason but might well have had one in mind). Or would the results be confusing at this point to speakers of Japanese?
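    If you want to experiment with that question, a small Python sketch of an Iroha-based comparison is below. It derives the order directly from the poem quoted above and checks that the poem uses each kana only once. How to treat characters that are not in the poem (here they simply sort after everything that is) is my own assumption, and none of this reflects how CompareString works.

```python
# The poem as quoted above, joined into a single string.
IROHA = (
    "いろはにほへと" "ちりぬるを" "わかよたれそ" "つねならむ"
    "うゐのおくやま" "けふこえて" "あさきゆめみし" "ゑひもせす"
)

# Position of each kana in the poem becomes its sort rank.
RANK = {kana: index for index, kana in enumerate(IROHA)}


def uses_each_letter_once(text: str) -> bool:
    """True when no character in the text is repeated."""
    return len(set(text)) == len(text)


def iroha_key(word: str) -> list:
    """Sort key: poem rank per character; unknown characters sort last
    (an assumption), in code point order among themselves."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


def iroha_sorted(words):
    """Return the words ordered by their Iroha sort key."""
    return sorted(words, key=iroha_key)
```

    For example, `iroha_sorted(["はな", "いぬ", "ろく"])` puts ろく ahead of はな, while a plain code point sort of the same three words puts はな first of the two. That is about as small a demonstration as I can think of for how different the two orderings feel.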

     

    1 - A pangram is a sentence that uses each and every letter of an alphabet at least once; a perfect pangram is one that uses each and every letter only once.

     

    This post brought to you by (U+3044, a.k.a. HIRAGANA LETTER I)
