
March, 2006

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Getting all you can out of a keyboard layout, Part #6

    • 14 Comments

    Previous posts in this series: Parts 0, 1, 2, 3, 4 and 5.

    It is funny when I do posts like these and find that although they don't get a lot of comments, they do get a lot of hits. Of course that could mean that people don't find them helpful so they move on to the next site. :-)

    Anyway, if you look at the first six posts, between them all of the "basics" are covered; everything from here on in gets a little more complicated. Mainly because most of the additional information is more complicated....

    We'll start with the small issue I mentioned earlier about using Scan Codes versus using Virtual Keys. The problem is that as a rule there is not a 100% round trip between them, and the strange behavior happens when you start from a VK and go to an SC, not the other way around.

    So if I take the following code:

    for(KeysEx ke = KeysEx.VK_NUMPAD0; ke <= KeysEx.VK_NUMPAD9; ke++) {
        uint sc02 = MapVirtualKeyEx((uint)ke, 0, hkl);
        uint vk02 = MapVirtualKeyEx(sc02, 1, hkl);
        uint sc03 = MapVirtualKeyEx(vk02, 0, hkl);
        uint vk03 = MapVirtualKeyEx(sc03, 1, hkl);
        uint sc04 = MapVirtualKeyEx(vk03, 0, hkl);
        uint vk04 = MapVirtualKeyEx(sc04, 1, hkl);
        Console.WriteLine("{0} == {1:x2} -> {2:x2} -> {3:x2} -> {4:x2} -> {5:x2} -> {6:x2} -> {7:x2} == {8}",
          ke, ((uint)ke).ToString("x2"), sc02, vk02, sc03, vk03, sc04, vk04,((KeysEx)vk04));
    }

    then the output will be:

    VK_NUMPAD0 == 60 -> 52 -> 2d -> 52 -> 2d -> 52 -> 2d == VK_INSERT
    VK_NUMPAD1 == 61 -> 4f -> 23 -> 4f -> 23 -> 4f -> 23 == VK_END
    VK_NUMPAD2 == 62 -> 50 -> 28 -> 50 -> 28 -> 50 -> 28 == VK_DOWN
    VK_NUMPAD3 == 63 -> 51 -> 22 -> 51 -> 22 -> 51 -> 22 == VK_NEXT
    VK_NUMPAD4 == 64 -> 4b -> 25 -> 4b -> 25 -> 4b -> 25 == VK_LEFT
    VK_NUMPAD5 == 65 -> 4c -> 0c -> 4c -> 0c -> 4c -> 0c == VK_CLEAR
    VK_NUMPAD6 == 66 -> 4d -> 27 -> 4d -> 27 -> 4d -> 27 == VK_RIGHT
    VK_NUMPAD7 == 67 -> 47 -> 24 -> 47 -> 24 -> 47 -> 24 == VK_HOME
    VK_NUMPAD8 == 68 -> 48 -> 26 -> 48 -> 26 -> 48 -> 26 == VK_UP
    VK_NUMPAD9 == 69 -> 49 -> 21 -> 49 -> 21 -> 49 -> 21 == VK_PRIOR

    You can see how the journey from VK to SC to VK to SC to VK clearly has some round tripping issues, basically at that very first transition: the VK_NUMPAD# value goes to a Scan Code, and that Scan Code then maps back to a different VK.

    Now, if you think about how these keys are mapped, you'll see a pattern in what is happening here:

    7      8      9        Home  Up    PgUp
    4      5      6        Left        Right
    1      2      3        End   Down  PgDown
    0                      Ins

    Suddenly what is happening becomes clear, doesn't it? :-)
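
    To make the pattern concrete, here is a minimal sketch that replays the round trip from plain lookup tables instead of calling MapVirtualKeyEx. The two tables are copied from the output above; the class and method names (NumpadRoundTrip, RoundTrip) are made up for illustration and are not part of any Windows API.

```csharp
using System;
using System.Collections.Generic;

static class NumpadRoundTrip {
    // VK_NUMPAD0..VK_NUMPAD9 (0x60..0x69) -> Scan Code, copied from the output above.
    static readonly Dictionary<uint, uint> ScOfNumpadVk = new Dictionary<uint, uint> {
        { 0x60, 0x52 }, { 0x61, 0x4f }, { 0x62, 0x50 }, { 0x63, 0x51 }, { 0x64, 0x4b },
        { 0x65, 0x4c }, { 0x66, 0x4d }, { 0x67, 0x47 }, { 0x68, 0x48 }, { 0x69, 0x49 },
    };

    // Scan Code -> VK, also copied from the output above (the navigation keys).
    static readonly Dictionary<uint, uint> VkOfSc = new Dictionary<uint, uint> {
        { 0x52, 0x2d }, { 0x4f, 0x23 }, { 0x50, 0x28 }, { 0x51, 0x22 }, { 0x4b, 0x25 },
        { 0x4c, 0x0c }, { 0x4d, 0x27 }, { 0x47, 0x24 }, { 0x48, 0x26 }, { 0x49, 0x21 },
    };

    // VK -> SC -> VK; returns the VK you end up with.
    public static uint RoundTrip(uint vk) {
        return VkOfSc[ScOfNumpadVk[vk]];
    }

    public static void Main() {
        for (uint vk = 0x60; vk <= 0x69; vk++) {
            uint back = RoundTrip(vk);
            Console.WriteLine("{0:x2} -> {1:x2} -> {2:x2} ({3})",
                vk, ScOfNumpadVk[vk], back, back == vk ? "round trips" : "does not round trip");
        }
    }
}
```

    None of the ten values comes back as itself, which is the whole problem in one loop.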

    What is more, you do not need to have the NUMLOCK key toggled to get the VK_NUMPAD# Virtual Key values to give you numbers either -- they always work.

    And to stay consistently inconsistent with the rest of the keyboard, other VK values with the VK_NUMLOCK toggled don't return numbers.

    So, the advantage to using Scan Codes for most of the keyboard is that you are appropriately limiting yourself to keys that are expected to exist. Though to get all of the keys on the keyboard there are a few that you will need to handle via VK values anyway....

    So what we will insert into our keyboard code is a little bit of info to get the keys that we cannot get via Scan Code alone (the new code is everything under the "add the special keys" comment):

    // Scroll through the Scan Code (SC) values and get the valid Virtual Key (VK)
    // values in it. Then, store the SC in each valid VK so it can act as both a
    // flag that the VK is valid, and it can store the SC value.
    for(uint sc = 0x01; sc <= 0x7f; sc++) {
        uint vk = MapVirtualKeyEx(sc, 1, hkl);
        if(vk != 0) {
            rgScOfVk[vk] = sc;
        }
    }


    // add the special keys that do not get added from the code above
    for(KeysEx ke = KeysEx.VK_NUMPAD0; ke <= KeysEx.VK_NUMPAD9; ke++) {
        rgScOfVk[(uint)ke] = MapVirtualKeyEx((uint)ke, 0, hkl);
    }
    rgScOfVk[(uint)KeysEx.VK_DECIMAL] = MapVirtualKeyEx((uint)KeysEx.VK_DECIMAL, 0, hkl);
    rgScOfVk[(uint)KeysEx.VK_DIVIDE] = MapVirtualKeyEx((uint)KeysEx.VK_DIVIDE, 0, hkl);
    rgScOfVk[(uint)KeysEx.VK_CANCEL] = MapVirtualKeyEx((uint)KeysEx.VK_CANCEL, 0, hkl);

    Now, just in case any of this is making sense, I'll close with some #defines in kbd.h:

    #define SCANCODE_NUMPAD_FIRST 0x47
    #define SCANCODE_NUMPAD_LAST  0x52

    I think we have already established that most of the Scan Codes do not fit in this range (not to mention the fact that 0x47 to 0x52 would be twelve keys, not ten!). At this point I still have no idea what these are for, if not solely to confuse -- I did find them used a few places in the Windows source but clearly they are not meant for most people since the functions that come out of user32.dll do not ever provide them.
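
    For anyone who wants to check the count, here is a small sketch. The two constants are the #defines quoted above and the ten Scan Codes are the ones VK_NUMPAD0 through VK_NUMPAD9 mapped to in the earlier output; the class and method names are hypothetical. Counting both endpoints, the range holds twelve values, and all ten numpad digits land inside it.

```csharp
using System;

static class NumpadRange {
    // Values copied from the kbd.h #defines quoted above.
    const uint SCANCODE_NUMPAD_FIRST = 0x47;
    const uint SCANCODE_NUMPAD_LAST  = 0x52;

    // Scan Codes that VK_NUMPAD0..VK_NUMPAD9 mapped to in the earlier output.
    static readonly uint[] DigitScanCodes = {
        0x52, 0x4f, 0x50, 0x51, 0x4b, 0x4c, 0x4d, 0x47, 0x48, 0x49
    };

    // Number of Scan Code values in the range, counting both endpoints.
    public static int RangeSize() {
        return (int)(SCANCODE_NUMPAD_LAST - SCANCODE_NUMPAD_FIRST + 1);
    }

    public static bool InRange(uint sc) {
        return sc >= SCANCODE_NUMPAD_FIRST && sc <= SCANCODE_NUMPAD_LAST;
    }

    // How many of the ten numpad digit Scan Codes fall inside the range.
    public static int DigitsInRange() {
        int count = 0;
        foreach (uint sc in DigitScanCodes) {
            if (InRange(sc)) count++;
        }
        return count;
    }

    public static void Main() {
        Console.WriteLine("Scan Codes in range: {0}", RangeSize());
        Console.WriteLine("Numpad digit Scan Codes inside it: {0}", DigitsInRange());
    }
}
```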

    Coming up, even weirder stuff....

     

    This post brought to you by "6" (U+0036, DIGIT SIX)
    A Unicode character that is in the very small family of those whose VK value is the same as its code point!

  • Sorting it all Out

    If at first you don't succeed, there's probably still a bug

    • 38 Comments

    Back near the beginning of the month, when I posted about how Everybody's doing the wraparound...., I talked about how the particular implementation of multiple combining diacritic weights led to an interesting situation on Windows where the secondary weight (a.k.a. DW or 'diacritic weight') would wrap around.

    Regular reader Maurits suggested a real problem that seemed to be possible:

    Hmmm... I just thought of a potentially serious problem with the third method (wrap.)  What if the value wraps to 0xfe or 0xfd?  Then when you add 2, you get 01 or (worse?) 00.

    In particular:

    "e" with 15 tildes will have a DW of 15 * 17 which comes to 255; add 2 and you wrap to 1.

    "e" with 5 circumflexes and 12 tildes will have a DW of (5 * 10) + (12 * 17) = 254; add 2 and you wrap to 0.

    On the other hand, what are the odds of encountering a string with such a diacritically-loaded character?
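
    The arithmetic in those two examples can be sketched as follows. This is only a model of the worry being described, assuming the weights are summed, 2 is added, and the result is truncated to one byte; the weights of 10 and 17 are the ones quoted in the comment, and WrappedWeight is a made-up helper, not anything in Windows.

```csharp
using System;

static class DiacriticWrap {
    // Weights quoted in the comment above: 10 per circumflex, 17 per tilde.
    const int CircumflexWeight = 10;
    const int TildeWeight = 17;

    // Assumed model: sum the weights, add 2, keep only the low byte.
    public static byte WrappedWeight(int circumflexes, int tildes) {
        int dw = circumflexes * CircumflexWeight + tildes * TildeWeight;
        return unchecked((byte)(dw + 2));
    }

    public static void Main() {
        // 15 * 17 = 255; 255 + 2 = 257, which truncates to 0x01.
        Console.WriteLine("{0:x2}", WrappedWeight(0, 15));
        // (5 * 10) + (12 * 17) = 254; 254 + 2 = 256, which truncates to 0x00.
        Console.WriteLine("{0:x2}", WrappedWeight(5, 12));
    }
}
```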

    This was before he had actually tested things out -- once he did, his report was slightly different:

    I was trying to generate an example string that wraparounded (ugh) to 00 or 01, but I couldn't (on Windows 2000)

    Maybe I'm trying to solve a problem that doesn't exist?

    Well, let's back up a second....

    He is indeed right, the sort keys never seem to get created with bogus 0x00 or 0x01 weights inserted (therefore the bug Atsushi Enomoto reported still stands alone as the bug that inserts one of these byte sequences when it is not expected!).

    (That was another bug found through blogging!)

    But that does not mean there is no bug in this case.

    Now I suspected there might be a bug here after Maurits reported the original problem -- and after looking at it tonight I managed to find one....

    By carefully applying Tester's Axiom #5 and the knowledge that there is a potentially incorrect issue with the implementation, what can we find out here?

    Well, if you look at How to track down collation bugs from late last year, you will notice a couple of things:

    • Most of the points in that post do not apply, and
    • Point #12 leads to some interesting results

    Take the following sample code:

    using System;
    using System.Globalization;

    namespace SortTest {
        class Class1 {
            [STAThread]
            static void Main(string[] args) {
                CompareInfo ci = CompareInfo.GetCompareInfo(0x007f);
                string stOld= "e";

                for(int i = 0; i < 7; i++) {
                    string st = "e\u0323\u0323" + ((char)(0x031d + i)).ToString();
                    Console.Write(string.Compare(stOld, st));
                    Console.Write("\t\t");
                    Console.WriteLine(SortKey.Compare(ci.GetSortKey(stOld), ci.GetSortKey(st)));
                    stOld = st;
                }
            }
        }
    }

    The output of this little function (which should be two columns of numbers, where each row contains two values equal to each other) is worth 1000 words....

    -1              -1
    -1              -1
    -1              1
    -1              0
    -1              0
    -1              -1
    -1              -1

    What are the sort key values here? They are:

    \u0065\u0323\u0323\u031d        0e 21 01 fe 01 01 01 00
    \u0065\u0323\u0323\u031e        0e 21 01 ff 01 01 01 00
    \u0065\u0323\u0323\u031f        0e 21 01 01 01 01 00
    \u0065\u0323\u0323\u0320        0e 21 01 01 01 01 00
    \u0065\u0323\u0323\u0321        0e 21 01 01 01 01 00
    \u0065\u0323\u0323\u0322        0e 21 01 03 01 01 01 00
    \u0065\u0323\u0323\u0323        0e 21 01 04 01 01 01 00
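
    The right-hand column of the output can be reproduced from these bytes alone. The sketch below assumes that comparing two sort keys comes down to a plain byte-by-byte comparison (shorter key first on a tie); the key bytes are copied from the table above, and SortKeyBytes.Compare is a stand-in written for this example rather than the real SortKey.Compare.

```csharp
using System;

static class SortKeyBytes {
    // Sort keys for e + U+0323 + U+0323 + (U+031d .. U+0323), copied from the table above.
    public static readonly byte[][] Keys = {
        new byte[] { 0x0e, 0x21, 0x01, 0xfe, 0x01, 0x01, 0x01, 0x00 },
        new byte[] { 0x0e, 0x21, 0x01, 0xff, 0x01, 0x01, 0x01, 0x00 },
        new byte[] { 0x0e, 0x21, 0x01, 0x01, 0x01, 0x01, 0x00 },
        new byte[] { 0x0e, 0x21, 0x01, 0x01, 0x01, 0x01, 0x00 },
        new byte[] { 0x0e, 0x21, 0x01, 0x01, 0x01, 0x01, 0x00 },
        new byte[] { 0x0e, 0x21, 0x01, 0x03, 0x01, 0x01, 0x01, 0x00 },
        new byte[] { 0x0e, 0x21, 0x01, 0x04, 0x01, 0x01, 0x01, 0x00 },
    };

    // Assumed comparison: first differing byte decides; otherwise the shorter key sorts first.
    public static int Compare(byte[] a, byte[] b) {
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
        }
        return Math.Sign(a.Length - b.Length);
    }

    public static void Main() {
        // Compare each key with the one before it, as the loop in the sample does.
        for (int i = 1; i < Keys.Length; i++) {
            Console.WriteLine(Compare(Keys[i - 1], Keys[i]));
        }
    }
}
```

    That gives -1, 1, 0, 0, -1, -1, which lines up with rows two through seven of the right-hand column (the first row compares against plain "e", whose sort key is not listed here).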

    So, it looks like functions like CompareString (and its managed equivalent) will consistently wrap around and give expected results, while LCMapString for sort keys (and its managed equivalent) will, in an effort to avoid the problem with embedded invalid bytes, occasionally drop the wrapped weight altogether.

    Three bytes are dropped:

    • 0x00 -- reserved for the end of the string;
    • 0x01 -- reserved for a sentinel between subsections of the sort key;
    • 0x02 -- reserved for the minimal weight with no diacritics, which is dropped when it is not needed as a placeholder for later secondary weights that are greater than 0x02.

    As soon as I read that comment, I knew there was a bug here. :-)

    (Yet another bug found through blogging!)

     

    This post brought to you by " ̣" (U+0323, a.k.a. COMBINING DOT BELOW)

  • Sorting it all Out

    Is the CAPS LOCK on?

    • 9 Comments

    A few days ago, Mohammed asked:

    I am using System.Windows.Forms.SendKeys to send some keystrokes to the active application. When the CAPS lock on the hosting keyboard is turned on, all my keystrokes get sent with the opposite case, which is an acceptable behavior (SendKeys is dependant on the state of the keyboard).

    To work around this issue, I’m trying to detect the state of the CAPS lock key to account for case changes. Does anyone know a way to do this using a .NET library call? I’m using C#.

    Thanks,

    There is unfortunately nothing built in, which is not to say there shouldn't be!

    The trick is to use the GetKeyState function in the Win32 API, which has the following info about its return value:

    The return value specifies the status of the specified virtual key, as follows:

    • If the high-order bit is 1, the key is down; otherwise, it is up.
    • If the low-order bit is 1, the key is toggled. A key, such as the CAPS LOCK key, is toggled if it is turned on. The key is off and untoggled if the low-order bit is 0. A toggle key's indicator light (if any) on the keyboard will be on when the key is toggled, and off when the key is untoggled.

    That first half may sound familiar to those who read Getting all you can out of a keyboard layout, Part #5, since I used it to give information on whether the shift keys were pressed or not.

    Well, the second half is what we're dealing with now -- the low-order bit is set to indicate that a toggle key like CAPS LOCK is toggled on. So, you can use code something like this:

    [DllImport("user32.dll", ExactSpelling=true)]
    internal static extern ushort GetKeyState(uint nVirtKey);

    internal const byte VK_CAPITAL      = 0x14;

    if(0 != (GetKeyState(VK_CAPITAL) & 1))
    {
        // do whatever here
    }

    (You may remember that I hate the Keys enumeration, as I pointed out in Part 0; also, I did not know about VB.Net's My.Computer.Keyboard.CapsLock but luckily Sara Ford was nearby to point this out for those who are using VB and wanted to avoid the const definition!)
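
    To see both halves of that return value side by side, here is a minimal sketch that decodes the two documented bits from a 16-bit value. It does not call GetKeyState at all; the sample values in Main are made up for illustration, and the KeyStateBits helper names are hypothetical.

```csharp
using System;

static class KeyStateBits {
    // High-order bit of the 16-bit return value: the key is currently down.
    public static bool IsDown(short state) {
        return (state & 0x8000) != 0;
    }

    // Low-order bit: the key is toggled (for example, CAPS LOCK is on).
    public static bool IsToggled(short state) {
        return (state & 0x0001) != 0;
    }

    public static void Main() {
        // Made-up sample values covering the four combinations of the two bits.
        short[] samples = { 0x0000, 0x0001, unchecked((short)0x8000), unchecked((short)0x8001) };
        foreach (short s in samples) {
            Console.WriteLine("{0:x4}: down={1}, toggled={2}", (ushort)s, IsDown(s), IsToggled(s));
        }
    }
}
```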

    And there you have it -- how to find out if the CAPS LOCK is toggled.

    As I am sure you can imagine, we're going to be making use of this in a future post in the series. :-)

     

    This post brought to you by "Ŧ" (U+0166, a.k.a. LATIN CAPITAL LETTER T WITH STROKE)

  • Sorting it all Out

    Getting all you can out of a keyboard layout, Part #5

    • 8 Comments

    Previous posts in this series: Parts 0, 1, 2, 3, and 4.

    I have been slowly building up a sample that is (so far) actually of limited use. I mean, who the hell cares about running through one shift state of a keyboard, even if it is definitive?

    But it has only been a few days, and I am going to take care of the "easy shift states" now. So you can take the code from this sample and consider it done enough to handle a lot of the conventional scenarios and keyboards that exist.

    Now it won't get you everything, but there are future posts for that.... :-)

    We'll start by defining a nice enumeration to cover all those easy shift states:

    public enum ShiftState : int {
        Base            = 0,                    // 0
        Shft            = 1,                    // 1
        Ctrl            = 2,                    // 2
        ShftCtrl        = Shft | Ctrl,          // 3
        Menu            = 4,                    // 4 -- NOT USED
        ShftMenu        = Shft | Menu,          // 5 -- NOT USED
        MenuCtrl        = Menu | Ctrl,          // 6
        ShiftMenuCtrl   = Shft | Menu | Ctrl,   // 7
    }

    Now, there are a few important things to note here:

    • Only four of them have any real independent meaning; the other four are the various combinations when you consider the first four;
    • Two of the eight states (Menu and ShftMenu) cannot be assigned in keyboards on Windows, at all, and are included only for completeness;
    • One of the eight states (Ctrl) is not generally recommended in applications due to the fact that it is often co-opted for use in shortcuts;
    • There is assumed to be no difference between the LEFT and RIGHT versions of these keys -- we'll save that for later.

    And, how will we use this enumeration, exactly?

    Well, we will build a Key State array (basically an array of state bytes indexed by Virtual Key value), and send it in our various calls to ToUnicodeEx. According to the documentation for handy functions like GetKeyboardState:

    ...each member of the array pointed to by the lpKeyState parameter contains status data for a virtual key. If the high-order bit is 1, the key is down; otherwise, it is up.

    There is more there, but we'll get to it later. :-)

    The high order bit of a byte is the top bit -- thus setting the byte value to 0x80 will indicate it is pressed. We'll also add a handy function to do the bit work for us:

    private static void FillKeyState(KeysEx[] lpKeyState, ShiftState ss) {
        lpKeyState[(int)KeysEx.VK_SHIFT]    = (((ss & ShiftState.Shft) != 0) ? (KeysEx)0x80 : (KeysEx)0x00);
        lpKeyState[(int)KeysEx.VK_CONTROL]  = (((ss & ShiftState.Ctrl) != 0) ? (KeysEx)0x80 : (KeysEx)0x00);
        lpKeyState[(int)KeysEx.VK_MENU]     = (((ss & ShiftState.Menu) != 0) ? (KeysEx)0x80 : (KeysEx)0x00);
    }

    Since as I mentioned before, we are only dealing with the up/down state of these three keys, that's all we need to get the Key State array set up correctly.
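
    The same idea can be sketched with a plain byte array, which may make the "one byte per Virtual Key" layout easier to see. This is a simplified stand-in, not the sample's own code: it uses a cut-down ShiftState enum and spells out the standard values of VK_SHIFT, VK_CONTROL and VK_MENU instead of relying on the KeysEx enumeration from Part #0, and BuildKeyState is a made-up name.

```csharp
using System;

[Flags]
enum ShiftState { Base = 0, Shft = 1, Ctrl = 2, Menu = 4 }

static class KeyStateSketch {
    // Standard Win32 Virtual Key values for the three modifier keys.
    const int VK_SHIFT = 0x10;
    const int VK_CONTROL = 0x11;
    const int VK_MENU = 0x12;

    // Build a 256-entry key state array with 0x80 (high-order bit set, "key is down")
    // for each modifier that is part of the requested shift state.
    public static byte[] BuildKeyState(ShiftState ss) {
        byte[] keyState = new byte[256];
        keyState[VK_SHIFT]   = (byte)((ss & ShiftState.Shft) != 0 ? 0x80 : 0x00);
        keyState[VK_CONTROL] = (byte)((ss & ShiftState.Ctrl) != 0 ? 0x80 : 0x00);
        keyState[VK_MENU]    = (byte)((ss & ShiftState.Menu) != 0 ? 0x80 : 0x00);
        return keyState;
    }

    public static void Main() {
        byte[] ks = BuildKeyState(ShiftState.Shft | ShiftState.Ctrl);
        Console.WriteLine("SHIFT={0:x2} CONTROL={1:x2} MENU={2:x2}",
            ks[VK_SHIFT], ks[VK_CONTROL], ks[VK_MENU]);
    }
}
```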

    Then, as I mentioned in Part #4, we do need to clean up our output a bit to keep it sensible. So, let's look at our function:

    using System;
    using System.Text;
    using System.Windows.Forms;
    using System.Runtime.InteropServices;

    namespace KeyboardLayouts {
        class Class1 {

            //  You'll want to insert that enumeration from part #0 here!

            public enum ShiftState : int {
                Base            = 0,                    // 0
                Shft            = 1,                    // 1
                Ctrl            = 2,                    // 2
                ShftCtrl        = Shft | Ctrl,          // 3
                Menu            = 4,                    // 4 -- NOT USED
                ShftMenu        = Shft | Menu,          // 5 -- NOT USED
                MenuCtrl        = Menu | Ctrl,          // 6
                ShiftMenuCtrl   = Shft | Menu | Ctrl,   // 7
            }

            internal const uint KLF_NOTELLSHELL  = 0x00000080;

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="MapVirtualKeyExW", ExactSpelling=true)]
            internal static extern uint MapVirtualKeyEx(
                uint uCode,
                uint uMapType,
                IntPtr dwhkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="LoadKeyboardLayoutW", ExactSpelling=true)]
            internal static extern IntPtr LoadKeyboardLayout(string pwszKLID, uint Flags);

            [DllImport("user32.dll", ExactSpelling=true)]
            internal static extern bool UnloadKeyboardLayout(IntPtr hkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
            internal static extern int ToUnicodeEx(
                uint wVirtKey,
                uint wScanCode,
                KeysEx[] lpKeyState,
                StringBuilder pwszBuff,
                int cchBuff,
                uint wFlags,
                IntPtr dwhkl);

            [DllImport("user32.dll", ExactSpelling=true)]
            public static extern int GetKeyboardLayoutList(int nBuff, [Out, MarshalAs(UnmanagedType.LPArray)] IntPtr[] lpList);

            [STAThread]
            static void Main(string[] args) {
                int cKeyboards = GetKeyboardLayoutList(0, null);
                IntPtr[] rghkl = new IntPtr[cKeyboards];
                GetKeyboardLayoutList(cKeyboards, rghkl);
                IntPtr hkl = LoadKeyboardLayout(args[0], KLF_NOTELLSHELL);
                if(hkl == IntPtr.Zero) {
                    Console.WriteLine("Sorry, that keyboard does not seem to be valid.");
                }
                else {
                    KeysEx[] lpKeyState = new KeysEx[256];
                    uint[] rgScOfVk = new uint[256];

                    // Scroll through the Scan Code (SC) values and get the valid Virtual Key (VK)
                    // values in it. Then, store the SC in each valid VK so it can act as both a
                    // flag that the VK is valid, and it can store the SC value.
                    for(uint sc = 0x01; sc <= 0x7f; sc++) {
                        uint vk = MapVirtualKeyEx(sc, 1, hkl);
                        if(vk != 0) {
                            rgScOfVk[vk] = sc;
                        }
                    }

                    Console.WriteLine("SC\tVK                 \t_\ts\tc\tsc\tca\tsca");
                    Console.WriteLine("==\t==========\t\t====\t====\t====\t====\t====\t====");
                    for(uint vk = 0x01; vk < rgScOfVk.Length; vk++) {
                        if(rgScOfVk[vk] != 0) {
                            bool fSomeActualLetters = false;                // Were actual letters found in the row?
                            StringBuilder sbBuffer;                         // Scratchpad we use many places
                            StringBuilder sbLayout = new StringBuilder();   // Will hold an entire layout 'row'
                            StringBuilder sbStates = new StringBuilder();   // Will hold all of the shift state info

                            // First, get the SC/VK info stored
                            sbLayout.Append(string.Format("{0:x2}\t{1:x2} - {2}", rgScOfVk[vk], vk, ((KeysEx)vk).ToString().PadRight(13)));
                            for(ShiftState ss = ShiftState.Base; ss <= ShiftState.ShiftMenuCtrl; ss++) {
                                if(ss == ShiftState.Menu || ss == ShiftState.ShftMenu) {
                                    // Alt and Shift+Alt don't work, so skip them
                                    continue;
                                }

                                FillKeyState(lpKeyState, ss);
                                sbBuffer = new StringBuilder(10);
                                int rc = ToUnicodeEx(vk, rgScOfVk[vk], lpKeyState, sbBuffer, sbBuffer.Capacity, 0, hkl);
                                if(rc > 0) {
                                    StringBuilder sbChar = new StringBuilder(5 * rc);
                                    if(sbBuffer.Length == 0) {
                                        // Someone defined NULL on the keyboard; let's coddle them
                                        sbChar.Append("0000 ");
                                    }
                                    else {
                                        for(int ich = 0; ich < rc; ich++) {
                                            sbChar.Append(((ushort)sbBuffer.ToString()[ich]).ToString("x4"));
                                            sbChar.Append(' ');
                                        }
                                    }
                                    fSomeActualLetters = true;
                                    sbStates.Append(string.Format("\t{0}", sbChar.ToString(0, sbChar.Length - 1)));
                                }
                                else if(rc < 0) {
                                    fSomeActualLetters = true;
                                    sbStates.Append(string.Format("\t{0:x4}@", ((ushort)sbBuffer.ToString()[0])));

                                    // It's a dead key; let's flush out whats stored in the keyboard state.
                                    ToUnicodeEx((uint)KeysEx.VK_DECIMAL, rgScOfVk[(uint)KeysEx.VK_DECIMAL], lpKeyState, sbBuffer, sbBuffer.Capacity, 0, hkl);
                                }
                                else {
                                    sbStates.Append("\t  -1");
                                }
                            }
                            // Skip the layout rows that have nothing in them
                            if(fSomeActualLetters) {
                                sbLayout.Append(sbStates.ToString());
                                Console.WriteLine(sbLayout.ToString());
                            }
                        }
                    }

                    foreach(IntPtr i in rghkl) {
                        if(hkl == i) {
                            hkl = IntPtr.Zero;
                            break;
                        }
                    }

                    if(hkl != IntPtr.Zero) {
                        UnloadKeyboardLayout(hkl);
                    }
                }
            }

            private static void FillKeyState(KeysEx[] lpKeyState, ShiftState ss) {
                lpKeyState[(int)KeysEx.VK_SHIFT]    = (((ss & ShiftState.Shft) != 0) ? (KeysEx)0x80 : (KeysEx)0x00);
                lpKeyState[(int)KeysEx.VK_CONTROL]  = (((ss & ShiftState.Ctrl) != 0) ? (KeysEx)0x80 : (KeysEx)0x00);
                lpKeyState[(int)KeysEx.VK_MENU]     = (((ss & ShiftState.Menu) != 0) ? (KeysEx)0x80 : (KeysEx)0x00);
            }
        }
    }

    And there we go.

    Note some of the important changes:

    • Rather than directly writing results, they are put into StringBuilder objects, which are only written out for each "row", where a row represents everything that a single key can do, in the various shift states;
    • The fascinating case of a key having NULL assigned in a keystroke is handled, a necessity since for some ridiculous reason the US keyboard has such an assignment;
    • When no assignment was found in a shift state, -1 was entered (this is the convention used in .KLC files as well in part to act as a placeholder);
    • When a dead key assignment is found, a COMMERCIAL AT (@) sign is placed just after it, so we know it is a dead key;
    • I changed the "dead key buffer clearing character" from VK_SPACE to VK_DECIMAL.
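
    The per-cell conventions in that list can be pulled out into one small helper, which may be easier to read than the nested branches. This is a sketch under a couple of assumptions: it takes the ToUnicodeEx return code and the buffer contents as ordinary arguments instead of calling the API, and the LayoutCell.Format name is made up for this example.

```csharp
using System;
using System.Text;

static class LayoutCell {
    // rc:     the value ToUnicodeEx would have returned (>0 characters, <0 dead key, 0 nothing)
    // buffer: the characters it would have written
    public static string Format(int rc, string buffer) {
        if (rc > 0) {
            if (buffer.Length == 0) {
                // NULL was assigned to the key, so the buffer came back empty
                return "0000";
            }
            StringBuilder sb = new StringBuilder(5 * rc);
            for (int ich = 0; ich < rc; ich++) {
                if (ich > 0) sb.Append(' ');
                sb.Append(((ushort)buffer[ich]).ToString("x4"));
            }
            return sb.ToString();
        }
        if (rc < 0) {
            // Dead key: the code point followed by '@'
            return ((ushort)buffer[0]).ToString("x4") + "@";
        }
        // No assignment in this shift state
        return "  -1";
    }

    public static void Main() {
        Console.WriteLine(Format(1, "a"));     // a single character
        Console.WriteLine(Format(2, "~\""));   // two code points on one key
        Console.WriteLine(Format(-1, "^"));    // a dead key
        Console.WriteLine(Format(0, ""));      // nothing assigned
    }
}
```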

    Let's take a look at a random keyboard when we run this code, say the French keyboard layout (0000040c), chosen since we get to see both dead keys and keys that have multiple code points in them (even though the latter messes up our columns for a few rows!):

    SC      VK                      _       s       c       sc      ca      sca
    ==      ==========              ====    ====    ====    ====    ====    ====
    0e      08 - VK_BACK            0008    0008    007f      -1      -1      -1
    7c      09 - VK_TAB             0009    0009      -1      -1      -1      -1   
    1c      0d - VK_RETURN          000d    000d    000a      -1      -1      -1
    01      1b - VK_ESCAPE          001b    001b    001b      -1      -1      -1
    39      20 - VK_SPACE           0020    0020    0020      -1      -1      -1
    0b      30 - VK_0               00e0    0030    0000      -1    0040      -1
    02      31 - VK_1               0026    0031      -1      -1      -1      -1
    03      32 - VK_2               00e9    0032      -1      -1    007e@     -1
    04      33 - VK_3               007e 0022       0033      -1      -1    0023      -1
    05      34 - VK_4               0027    0034      -1      -1    007b      -1
    06      35 - VK_5               0028    0035      -1    001b    005b      -1
    07      36 - VK_6               002d    0036      -1    001f    007c      -1
    08      37 - VK_7               00e8    0037      -1      -1    0060@     -1
    09      38 - VK_8               0060 005f       0038      -1    001c    005c      -1
    0a      39 - VK_9               00e7    0039      -1    001e    005e      -1
    10      41 - VK_A               0061    0041    0001    0001      -1      -1
    30      42 - VK_B               0062    0042    0002    0002      -1      -1
    2e      43 - VK_C               0063    0043    0003    0003      -1      -1
    20      44 - VK_D               0064    0044    0004    0004      -1      -1
    12      45 - VK_E               0065    0045    0005    0005    20ac      -1
    21      46 - VK_F               0066    0046    0006    0006      -1      -1
    22      47 - VK_G               0067    0047    0007    0007      -1      -1
    23      48 - VK_H               0068    0048    0008    0008      -1      -1
    17      49 - VK_I               0069    0049    0009    0009      -1      -1
    24      4a - VK_J               006a    004a    000a    000a      -1      -1
    25      4b - VK_K               006b    004b    000b    000b      -1      -1
    26      4c - VK_L               006c    004c    000c    000c      -1      -1
    27      4d - VK_M               006d    004d    000d    000d      -1      -1
    31      4e - VK_N               006e    004e    000e    000e      -1      -1
    18      4f - VK_O               006f    004f    000f    000f      -1      -1
    19      50 - VK_P               0070    0050    0010    0010      -1      -1
    1e      51 - VK_Q               0071    0051    0011    0011      -1      -1
    13      52 - VK_R               0072    0052    0012    0012      -1      -1
    1f      53 - VK_S               0073    0053    0013    0013      -1      -1
    14      54 - VK_T               0074    0054    0014    0014      -1      -1
    16      55 - VK_U               0075    0055    0015    0015      -1      -1
    2f      56 - VK_V               0076    0056    0016    0016      -1      -1
    2c      57 - VK_W               0077    0057    0017    0017      -1      -1
    2d      58 - VK_X               0078    0058    0018    0018      -1      -1
    15      59 - VK_Y               0079    0059    0019    0019      -1      -1
    11      5a - VK_Z               007a    005a    001a    001a      -1      -1
    37      6a - VK_MULTIPLY        002a    002a      -1      -1      -1      -1
    4e      6b - VK_ADD             002b    002b      -1      -1      -1      -1
    4a      6d - VK_SUBTRACT        002d    002d      -1      -1      -1      -1
    1b      ba - VK_OEM_1           0024    00a3    001d      -1    00a4      -1
    0d      bb - VK_OEM_PLUS        003d    002b      -1      -1    007d      -1
    32      bc - VK_OEM_COMMA       002c    003f      -1      -1      -1      -1
    33      be - VK_OEM_PERIOD      003b    002e      -1      -1      -1      -1
    34      bf - VK_OEM_2           003a    002f      -1      -1      -1      -1
    28      c0 - VK_OEM_3           00f9    0025      -1      -1      -1      -1
    0c      db - VK_OEM_4           0029    00b0      -1      -1    005d      -1
    2b      dc - VK_OEM_5           002a    00b5    001c      -1      -1      -1
    1a      dd - VK_OEM_6           005e@   00a8@   001b      -1      -1      -1
    29      de - VK_OEM_7           00b2      -1      -1      -1      -1      -1
    35      df - VK_OEM_8           0021    00a7      -1      -1      -1      -1
    56      e2 - VK_OEM_102         003c    003e    001c      -1      -1      -1

    And there we have it -- all of the easy shift states in a nice grid.

    Items for future posts (not necessarily in this order!):

    • the base characters that go with the dead keys and the composite characters they create
    • the CAPS LOCK key
    • the harder shift states
    • SGCAPS
    • chained dead keys

    And perhaps a few more goodies if anyone is still reading by then....

     

    This post brought to you by "5" (U+0035, DIGIT FIVE)
    A Unicode character that is in the very small family of those whose VK value is the same as its code point!

  • Sorting it all Out

    Only ONE WCHAR per dead key

    • 6 Comments

    Regular reader Ivan Petrov asked the following in the Suggestion Box:

    Hi Michael

    I've the following problem:

    In MSKLC in the 'Control state' of the keyboard (when the Control key is pressed) I'm trying to make VK_OEM_3 a Dead Key. So, I assign to it  'U+0060 (') GRAVE ACCENT' and then I set it as Dead key. To this point everythig goes ok! Then I go into the Dead key dialog box. And here is the BIG problem. What I mean:
    I want to do the following:
    When I press "Ctrl" + "`"  followed by one of this vowels in the Bulgarian alphabet:
    "а", "е", "и", "о", "у", "ъ", "ю" and "я",
    the keyboard layout to produce one of this results:

    а̀  (0430 + 0300)
    ѐ  0450 or (0435 + 0300)
    ѝ  045d or (0438 + 0300)
    о̀  (043e + 0300)
    у̀  (0443 + 0300)
    ъ̀  (044a + 0300)
    ю̀  (044e + 0300)
    я̀   (044f  + 0300)

    And finally the problem:

    Let's take for example the first vowel "а":
    In the Ded Key dialog box in the Base (code point) field I type "а". Then in the Composite (code point) field I type "U+0430 U+0300" and then MSKLC says that "The value must be either a single character or code point." So, this is the problem!

    Can you help, how to deal with this.

    I've no problem with the two precomposed letters "ѐ" and "ѝ", but the rest ... ;-(

    Thank you in advance.
    Regards,
    Ivan Petrov.

    Hmmm.... maybe Ivan was not reading often enough. :-(

    The problem that he is reporting on has no solution. As I pointed out back in December of 2004 when I mentioned that Dead keys are not intuitive, and then a few times since then -- the end result of a dead key transaction must be a single UTF-16 code unit.

    This is also explained in the MSKLC help file, as is the fact that the dead key architecture is one that is around for legacy purposes only, and not generally for the creation of new keyboards.
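
    A quick way to see which of the requested results can work is to count UTF-16 code units, which is what the Length of a .NET string reports. The sketch below runs that check over the composites from Ivan's list; the FitsDeadKey helper is a made-up name for illustration and has nothing to do with how MSKLC itself validates the field.

```csharp
using System;

static class DeadKeyResult {
    // A dead key result has to be exactly one UTF-16 code unit.
    public static bool FitsDeadKey(string composite) {
        return composite.Length == 1;
    }

    public static void Main() {
        // Composites taken from the list in the question above.
        string[] composites = {
            "\u0430\u0300",  // a + combining grave: two code units
            "\u0450",        // precomposed ie with grave: one code unit
            "\u045d",        // precomposed i with grave: one code unit
            "\u043e\u0300",  // o + combining grave: two code units
        };
        foreach (string s in composites) {
            Console.WriteLine("{0} code unit(s): {1}", s.Length,
                FitsDeadKey(s) ? "usable as a dead key result" : "not usable");
        }
    }
}
```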

    Even if this ever were changed in a future release of Windows, it could not be used on existing versions due to the backcompat break that this would cause -- and the future version of MSKLC would have to support different keyboard layout DLLs for different versions of Windows. Which as I am sure people can imagine is not a terribly popular plan....

    For solutions, there are two obvious ones:

    • You can add each sequence to its own key on the keyboard, or
    • You can add the combining character and then type letter plus combining character on the keyboard

    Either of these plans will allow the characters to be supported....

     

    This post brought to you by "у" (U+0443, a.k.a. CYRILLIC SMALL LETTER U)

  • Sorting it all Out

    Technical jargon bordering on a new dialect?

    • 11 Comments

    Over at Language Log Plaza, Geoffrey K. Pullum was talking about Lexical Drift.

    I tend to think of the specific issue he raises there:

    Psychologist Alex Delaware and his best friend the gay detective Milo Sturgis are always having long and complex discussions about how much the current evidence favors this or that suspect. And Milo will often say, for example, "So, you like the husband now?" -- meaning (and I have no idea how it was that I could see this instantly), "So, you now favor the hypothesis that the husband is the murderer that we seek?" The verb like has taken on a new sense where A likes B means "A favors the hypothesis that B is the culprit." See how that works? Maybe the new sense will catch on more widely, maybe it will be limited (or is limited) to police talk, maybe it will never spread much; we don't know, and we can't predict.

    is actually more of a technical jargon thing that we accept in such situations, the same way we accept for example "The AIDS test was positive" to mean that the most negative of all possible results was returned.

    Such usage (which we seem to accept without too much trouble) does not seem to affect our own usage in other places....

  • Sorting it all Out

    Getting all you can out of a keyboard layout, Part #4

    • 9 Comments

    Previous posts in this series: Parts 0, 1, 2, and 3.

    We're going to do a bit of preparatory adjustment in this post. Just so we can be ready for what comes later, you see.

    If you look at information like the Scan Code and the Virtual Key, they really are independent of shift state. It means the code is kind of wasteful, continually asking for mappings over and over that it already has gotten.

    Or perhaps I should say would be wasteful once we started adding new shift states to the mix. At the moment we are only mildly wasteful, where the code gets the scan code for VK_SPACE more than once.

    Let's fix it now....

    (As before, the older code is gray, the new code is black)

    using System;
    using System.Text;
    using System.Windows.Forms;
    using System.Runtime.InteropServices;

    namespace KeyboardLayouts {
        class Class1 {

            //  You'll want to insert that enumeration from part #0 here!

            internal const uint KLF_NOTELLSHELL  = 0x00000080;

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="MapVirtualKeyExW", ExactSpelling=true)]
            internal static extern uint MapVirtualKeyEx(
                uint uCode,
                uint uMapType,
                IntPtr dwhkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="LoadKeyboardLayoutW", ExactSpelling=true)]
            internal static extern IntPtr LoadKeyboardLayout(string pwszKLID, uint Flags);

            [DllImport("user32.dll", ExactSpelling=true)]
            internal static extern bool UnloadKeyboardLayout(IntPtr hkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
            internal static extern int ToUnicodeEx(
                uint wVirtKey,
                uint wScanCode,
                KeysEx[] lpKeyState,
                StringBuilder pwszBuff,
                int cchBuff,
                uint wFlags,
                IntPtr dwhkl);

            [DllImport("user32.dll", ExactSpelling=true)]
            public static extern int GetKeyboardLayoutList(int nBuff, [Out, MarshalAs(UnmanagedType.LPArray)] IntPtr[] lpList);

            [STAThread]
            static void Main(string[] args) {
                int cKeyboards = GetKeyboardLayoutList(0, null);
                IntPtr[] rghkl = new IntPtr[cKeyboards];
                GetKeyboardLayoutList(cKeyboards, rghkl);
                IntPtr hkl = LoadKeyboardLayout(args[0], KLF_NOTELLSHELL);
                if(hkl == IntPtr.Zero) {
                    Console.WriteLine("Sorry, that keyboard does not seem to be valid.");
                }
                else {
                    KeysEx[] lpKeyState = new KeysEx[256];
                    uint[] rgScOfVk = new uint[256];

                    // Scroll through the Scan Code (SC) values and get the Virtual Key (VK)
                    // values in it. Then, store the SC in each valid VK so it can act as both a
                    // flag that the VK is valid, and it can store the SC value.
                    for(uint sc = 0x01; sc <= 0x7f; sc++) {
                        uint vk = MapVirtualKeyEx(sc, 1, hkl);
                        if(vk != 0) {
                            rgScOfVk[vk] = sc;
                        }
                    }

                    for(uint vk = 0x01; vk < rgScOfVk.Length; vk++) {
                        if(rgScOfVk[vk] != 0) {
                            StringBuilder sb = new StringBuilder(10);
                            int rc = ToUnicodeEx(vk, rgScOfVk[vk], lpKeyState, sb, sb.Capacity, 0, hkl);
                            if(rc > 0) {
                                StringBuilder sbChar = new StringBuilder(5 * rc);
                                for(int ich = 0; ich < rc; ich++) {
                                    sbChar.Append(((ushort)sb.ToString()[ich]).ToString("x4"));
                                    sbChar.Append(' ');
                                }
                                Console.WriteLine("{0:x2}\t{1:x4}\t{2:x2}\t{3}\t{4}",
                                    rgScOfVk[vk],
                                    sbChar.ToString(0, sbChar.Length - 1),
                                    vk,
                                    ((KeysEx)vk).ToString(),
                                    ((Keys)vk).ToString());
                            }
                            else if(rc < 0) {
                                Console.WriteLine("{0:x2}\t{1:x4}\t{2:x2}\t{3}\t{4}\t\t\tDEAD!!!",
                                    rgScOfVk[vk],
                                    ((ushort)sb.ToString()[0]),
                                    vk,
                                    ((KeysEx)vk).ToString(),
                                    ((Keys)vk).ToString());

                                // It's a dead key; let's flush out what's stored in the keyboard state.
                                ToUnicodeEx((uint)KeysEx.VK_SPACE, rgScOfVk[(uint)KeysEx.VK_SPACE], lpKeyState, sb, sb.Capacity, 0, hkl);
                            }
                        }
                    }
                    foreach(IntPtr i in rghkl) {
                        if(hkl == i) {
                            hkl = IntPtr.Zero;
                            break;
                        }
                    }

                    if(hkl != IntPtr.Zero) {
                        UnloadKeyboardLayout(hkl);
                    }
                }
            }
        }
    }

    So for now, since we have two items (the VK and the SC), and the VK is basically always a byte, it is easiest to store them in a small array (for our purposes, the scan codes always fit into a byte too, but indexes in .NET are easier to work with when they are int or uint types).

    If we need to store more data then a new class might make sense, but we'll put that off for now as we watch the complexity unfold before us. It is something we will make a decision on later.

    At the moment we can be pleased with the fact that we have saved ourselves a few function calls -- calls that in some situations on some versions of Windows may actually map to kernel calls as that kernel mode component (userk) is the one that has lots of the keyboard information....

    As an aside -- yesterday, I asked the question What the %$#!* is wrong with VkKeyScan[Ex]? and I did not mention the weirdest problem with the function -- its name! It takes a TCHAR and returns a VK and a SHIFT STATE. But it does not return a keyboard scan code, which of course makes the name sort of misleading.

    As I happen to be in the middle of the muddle of mapping between these things, it becomes even more noticeable than usual. :-)

    You may or may not remember how earlier in the series, I promised to explain why the code enumerated Scan Codes rather than Virtual Keys. Well, whether you do or not, I am not quite ready to do that just yet. It'll happen soon, I promise. :-)

    There will also need to be some thought on displaying all the information -- once we move into new shift states, a bit more economy will be needed. That change will come with the shift states, which are up next in the series....

     

    This post brought to you by "4" (U+0034, DIGIT FOUR)
    A Unicode character that is in the very small family of those whose VK value is the same as its code point!

  • Sorting it all Out

    Is it a bug?

    • 19 Comments

    Regular readers, you can think of this as a part of the Sorting It All Out mid-term.

    Basically we are looking at two calls to CompareString. The first is:

    CompareStringW(0x0409, 0, L"Hello-Bob", -1, L"Hello Bob", -1)

    which returns CSTR_GREATER_THAN, and the second is:

    CompareStringW(0x0409, 0, L"-", -1, L" ", -1)

    which returns CSTR_LESS_THAN.

    I promise there are no "spoofing" characters or anything else unexpected in the strings, it is literally

    • a comparison of two almost identical strings and
    • a comparison of two substrings that literally represent the only differences between those two almost identical strings

    The question -- is the difference between the two calls a bug? And if so, then which one is incorrect? And if not, then why?

    Answers will be graded for accuracy, or short of that for how convincing the provided expository bullshit is, in an otherwise inaccurate answer....

    (All posts will be moderated unless they do not give away the answers!)

  • Sorting it all Out

    What the %$#!* is wrong with VkKeyScan[Ex]?

    • 8 Comments

    The VkKeyScanW and VkKeyScanExW functions have a simple, documented functionality:

    ...translates a character to the corresponding virtual-key code and shift state...

    Not the sort of thing you would need all the time, but it can come in handy.

    They both take a WCHAR as a parameter and (from the documentation):

    If the function succeeds, the low-order byte of the return value contains the virtual-key code and the high-order byte contains the shift state, which can be a combination of the following flag bits.

    If the function finds no key that translates to the passed character code, both the low-order and high-order bytes contain –1.

    Bit   Meaning
    1     Either SHIFT key is pressed.
    2     Either CTRL key is pressed.
    4     Either ALT key is pressed.
    8     The Hankaku key is pressed.
    16    Reserved (defined by the keyboard layout driver).
    32    Reserved (defined by the keyboard layout driver).
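    A small sketch of unpacking that return value may help here. It assumes the result has already been obtained as a 16-bit value (the P/Invoke declaration is left out), and the VkShiftState and VkKeyScanResult names are invented for this example.

```csharp
using System;

// Flag bits from the documentation quoted above (names are made up here).
[Flags]
enum VkShiftState : byte {
    None    = 0,
    Shift   = 1,
    Ctrl    = 2,
    Alt     = 4,
    Hankaku = 8
}

static class VkKeyScanResult {
    // Splits a VkKeyScan-style return value into its two bytes.
    // Returns false when both bytes are -1 (0xFF), the documented "no key" case.
    public static bool TryDecode(short result, out byte vk, out VkShiftState shift) {
        vk = (byte)(result & 0xFF);                    // low-order byte: virtual key
        shift = (VkShiftState)((result >> 8) & 0xFF);  // high-order byte: shift state
        return !(vk == 0xFF && (byte)shift == 0xFF);
    }

    static void Main() {
        byte vk;
        VkShiftState shift;

        // 0x0141 is used as a literal sample: VK 0x41 with the SHIFT bit set.
        if (TryDecode(0x0141, out vk, out shift)) {
            Console.WriteLine("{0:x2} {1}", vk, shift);   // prints "41 Shift"
        }

        // -1 in both bytes means no key produces the character.
        Console.WriteLine(TryDecode(-1, out vk, out shift));   // prints "False"
    }
}
```

    Note that ALTGR shows up as CTRL plus ALT, so a shift state of 6 decodes to "Ctrl, Alt".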

    Seems simple enough, right? :-)

    Unfortunately, things are never as simple as they seem when it comes to keyboards.

    The other day, friend Gregory called me up to ask me a question about the keyboard he created with MSKLC.

    On his keyboard, he had added the same letter in two different spots on the keyboard (one on the VK_OEM_102 key, and the other on the ALTGR+VK_OEM_1 key).

    So he was looking at some code he had that used the VkKeyScan function, and it suddenly occurred to him that he had no idea which VK/SC it would return. After trying it out he found it returned the one he didn't actually want it to (the one on the ALTGR+VK_OEM_1 key).

    And now, he figured since he knew the person who did the development work on MSKLC that he could ask what the deal was here.

    After I explained it to him, I thought it might make a nice blog entry!

    I'll start off by saying that VkKeyScan has some limitations, starting with the one Gregory ran into -- when a function's purpose is to return the Virtual Key and the shift state, there is only so much it can do when there are two answers to the question, both equally valid.

    And the decision in the case of a keyboard created by MSKLC is not one that can be controlled within the tool -- it is in essence controlled by the order in which the entries in the LAYOUT table are written (that order is deterministically created by MSKLC as matching the order of the keys in the US keyboard, and then putting the VK_OEM_102 key almost at the end).
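    The "first entry in table order wins" effect can be modeled with a toy lookup. Everything in this sketch is an assumption made for illustration: the LayoutEntry type, the sample table, and the letter U+0259 are invented, and the real function does not necessarily work this way internally. Only the two VK values (0xBA for VK_OEM_1, 0xE2 for VK_OEM_102) are real.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for one cell of a layout table.
class LayoutEntry {
    public readonly byte Vk;
    public readonly byte ShiftState;   // 1 = SHIFT, 2 = CTRL, 4 = ALT (ALTGR = 6)
    public readonly char Ch;

    public LayoutEntry(byte vk, byte shiftState, char ch) {
        Vk = vk;
        ShiftState = shiftState;
        Ch = ch;
    }
}

static class DuplicateLookup {
    public const byte VK_OEM_1 = 0xBA;
    public const byte VK_OEM_102 = 0xE2;

    // Made-up table: VK_OEM_1 comes early, VK_OEM_102 almost at the end.
    public static List<LayoutEntry> SampleTable() {
        List<LayoutEntry> table = new List<LayoutEntry>();
        table.Add(new LayoutEntry(VK_OEM_1, 0, ';'));
        table.Add(new LayoutEntry(VK_OEM_1, 6, '\u0259'));    // ALTGR+VK_OEM_1
        table.Add(new LayoutEntry(0x41, 0, 'a'));
        table.Add(new LayoutEntry(VK_OEM_102, 0, '\u0259'));  // same letter again
        return table;
    }

    // A single-answer lookup can only report the first hit in table order.
    public static LayoutEntry FirstMatch(List<LayoutEntry> table, char ch) {
        foreach (LayoutEntry e in table) {
            if (e.Ch == ch) return e;
        }
        return null;
    }

    // A lookup that returns every key producing the character.
    public static List<LayoutEntry> AllMatches(List<LayoutEntry> table, char ch) {
        return table.FindAll(delegate(LayoutEntry e) { return e.Ch == ch; });
    }

    static void Main() {
        List<LayoutEntry> table = SampleTable();
        LayoutEntry first = FirstMatch(table, '\u0259');
        Console.WriteLine("first: vk={0:x2} shift={1}", first.Vk, first.ShiftState);
        foreach (LayoutEntry e in AllMatches(table, '\u0259')) {
            Console.WriteLine("match: vk={0:x2} shift={1}", e.Vk, e.ShiftState);
        }
    }
}
```

    In this toy table the single-answer lookup reports the ALTGR+VK_OEM_1 position, and the VK_OEM_102 position only turns up when every match is collected.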

    I pointed out to Gregory that he had to pick the one key that would pretty much always lose the battle here. :-)

    Now VkKeyScan has other limitations (documented and undocumented).

    Such as the fact that it will not work with ligatures.

    Or dead keys.

    Or SGCAPS.

    Or anything on the numeric keypad.

    There is also that it does not distinguish between the left and the right SHIFT/CTRL/ALT keys, which rules out all of those more complex shift states like the RIGHT CTRL key on the Canadian Multilingual Standard keyboard layout. Or any other complex shift states you can set up.

    Now that last paragraph talks about features not in MSKLC, but that I'll be covering soon enough in that series I am working on about interrogating keyboard layouts....

    But in any case, you get the point -- VkKeyScan and VkKeyScanEx will definitely never be able to be confused with functions that have all the answers. Since they have been around since Windows 95/NT 3.1, I can think of two possible reasons for this problem. Either

    • much of this additional functionality grew around them and after them, or
    • they were written to handle a specific limited scenario needed by Windows and were only exposed due to someone thinking they might be useful.

    The obvious question that comes up at this point is whether the limitation in MSKLC is one that would be considered a bug to fix.

    Interesting question, and from a triage standpoint an interesting issue to consider. There are a lot of relevant facts here:

    • There is no shortage of unfixable limitations of what would be useful functionality in VkKeyScan and VkKeyScanEx, which makes fixing this issue of limited use (this is sometimes thought of as the "why fix the screen door when the roof has blown off the house" theory of bug triage!);
    • Keyboards with duplicate entries for letters are, on the whole, relatively uncommon;
    • Exposing the "solution" here is fairly complicated when one tries to consider what the user interface for such a change would be;
    • Addressing the issue would involve some decidedly non-trivial changes to the way that keyboard layouts are loaded from and persisted to .KLC files.

    But maybe it would make a nice KB article at some point. Or a blog entry up here, some day....

    If it helps, the sample I am putting together in that other series of posts will not suffer from any of those limitations from either VkKeyScan or VkKeyScanEx! :-)

    Though I am sure those who are following that series will understand why a new version of those functions that addressed the limitations might not be likely. It is pretty difficult to extract that info!

    Which is not to say that a generic function that took an end result of one or more UTF-16 code units and returned the exact set (or sets) of keystrokes that would produce it would not be useful....

    But conceptually where does one draw the line between requiring two or more keystrokes for dead key entry but not allowing the typing of two or more separate characters for combining forms? Perhaps keyboard authors would be willing to make the distinction, but what about the rest of the world?

     

    This post brought to you by "ĩ" (U+0129, a.k.a. LATIN SMALL LETTER I WITH TILDE)

  • Sorting it all Out

    Non-default paths and instructions....

    • 5 Comments

    Leandro Becker wrote to me via the contact link:

    Hi

    When building the CRT 7.1 following your blog instructions, after some minutes compiling I've got the following error:

    # *** These are the compiler switches for the XDLL model (MSVCPRTD.LIB):
    #
    # CL = -c -nologo -Zelp8 -W3 -WX -GFy -DWIN32 -GB -Gi- -GS -Zc:wchar_t -Zc:forScope \
    # -DWIN32_LEAN_AND_MEAN -DNOSERVICE -Fdbuild\intel\MSLUP71D.pdb \
    # -D_MBCS -D_MB_MAP_DIRECT -D_CRTBLD -DWINHEAP -D_RTC -D_MT -D_DLL -DCRTDLL2
    #
    # ML = -c -nologo -coff -Cx -Zm -DQUIET -D?QUIET -Di386 -D_WIN32 -DWIN32 \
    # -D_MBCS -D_MB_MAP_DIRECT -D_CRTBLD -DWINHEAP -D_RTC -D_MT -D_DLL -DCRTDLL2

    =-=-=-=-= Doing CRTL Source build (Libraries) =-=-=-=-=
    NMAKE : warning U4004: too many rules for target 'build\intel'
    NMAKE : fatal error U1073: don't know how to make '"C:\Program Files\Microsoft Visual Studio .NET 2003\VC7\PlatformSDK\include\winver.h"'
    Stop.

    ***
    *** BUILD ABORTED -- ErrorLevel is non-zero!
    ***

    C:\Arquivos de programas\Microsoft Visual Studio .NET 2003\Vc7\crt\src>

    Do you know some issue like this ?

    As soon as I saw the email I knew what the problem was (you might see the problem as well!).

    And within five minutes of the first mail I got a second one from Leandro, who also saw what the problem was:

    Sorry, I think I've found the problem. I've copied the VCTOOLS from your post that is C:\Program Files path in the error, but my Windows is PT-BR, so the correct folder is C:\Arquivos de Programas.

    Sorry :-(

    No need to be sorry, Leandro!

    What I had been working on (and would have even finished before that second mail if Leandro had been neither as smart nor as quick!) was a small paragraph that I added to the 6.0, 7.0, 7.1, and 8.0 versions of the instructions (with the red emphasis included in each!):

    In all instructions below, the assumption is a default install path and an en-US copy of Windows; if either is not the case, make sure you replace paths such as C:\Program Files\Microsoft Visual Studio 2003 with the appropriate install location.

    Because this problem can occur any time there is a change due to:

    • Installation to the default location on a non-English version of Windows
    • Installation on another drive
    • Installation to a different path

    I had initially started changing all paths in the instruction for the first item, when I realized the second and third items made it more important to just let the person who knows where everything is installed make the changes here, as needed.

    Thanks for the heads-up, Leandro!

  • Sorting it all Out

    Enumerating available localized language resources in .NET

    • 16 Comments

    Marc Brooks asked in the Suggestion Box:

    In an ASP.Net 2.0 application, I want to fill a combobox (and my own internal lists) with the list of CultureInfo.

    That's easy.

    But how can I only include the ones whose localized resources (e.g. the correct version of the .Net runtime) has been installed?

    I want my application to automatically "offer" the ones that will have localized Framework UI available.

    It's funny, until I looked into Marc's question, I had no idea that there was not a way to enumerate the language resources in managed code (akin to the unmanaged EnumResourceLanguages, with the managed resource model in mind).

    But after spending a little time it was clear that there was nothing like this in the System.Resources or related namespaces (that I could find, at least). The resource model clearly seems to rely on more of a "let the user choose and fall back p.r.n." mechanism than an "enumerate and choose" mechanism.

    This seems mildly ironic to me given the fact that the CurrentUICulture in client apps relies on the UI language in Windows -- which is explicitly enumerated for the user, who gets to choose a language from that enumeration. :-)

    But I am stubborn, so I kept digging and came up with the following:

    using System;
    using System.IO;
    using System.Reflection;
    using System.Globalization;

    namespace Test
    {
        class ResourceEnum {

            [STAThread]
            static void Main() {
                // Grab a type that we know is in mscorlib
                Type type = Type.GetType("System.Object");
                Assembly assembly = Assembly.GetAssembly(type);
                Console.WriteLine(assembly.CodeBase);

                // Enum through all the languages .NET may be localized into
                foreach(CultureInfo ci in CultureInfo.GetCultures(
                    CultureTypes.SpecificCultures | CultureTypes.NeutralCultures)) {
                    try {
                        Assembly satellite = assembly.GetSatelliteAssembly(ci);

                        // If we made it this far, we have the resources
                        Console.WriteLine("\t" + ci.Name);
                    }
                    catch(FileNotFoundException) {
                        // Swallow this exception; it means no such
                        // resources exist for the given language
                    }
                }
            }
        }
    }

    Since I already had all of the 1.1 and 2.0 .NET language packs installed, the above code gave me a nice list:

    E:\test>csc resource.cs
    Microsoft (R) Visual C# 2005 Compiler version 8.00.50727.42
    for Microsoft (R) Windows (R) 2005 Framework version 2.0.50727
    Copyright (C) Microsoft Corporation 2001-2005. All rights reserved.


    E:\test>resource.exe
    file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
            ar
            zh-CHS
            cs
            da
            de
            el
            es
            fi
            fr
            he
            hu
            it
            ja
            ko
            nl
            no
            pl
            sv
            tr
            pt-BR
            pt-PT
            zh-CHT

    Now the code is tied to mscorlib.dll, but it could be generalized into any assembly (by picking a different type, or using GetExecutingAssembly to get the application itself).
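    One possible shape for that generalization is sketched below. The SatelliteEnum class and GetAvailableCultures method are names invented for this example; skipping the invariant culture and also catching FileLoadException are extra precautions assumed here, not something the original sample did.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Reflection;

static class SatelliteEnum {
    // Hypothetical helper: returns the cultures for which the given
    // assembly has a satellite resource assembly that can be loaded.
    public static List<CultureInfo> GetAvailableCultures(Assembly assembly) {
        List<CultureInfo> found = new List<CultureInfo>();

        foreach (CultureInfo ci in CultureInfo.GetCultures(
            CultureTypes.SpecificCultures | CultureTypes.NeutralCultures)) {

            // The invariant culture has no satellite assembly of its own.
            if (ci.Equals(CultureInfo.InvariantCulture)) continue;

            try {
                assembly.GetSatelliteAssembly(ci);
                found.Add(ci);
            }
            catch (FileNotFoundException) {
                // No resources for this culture.
            }
            catch (FileLoadException) {
                // A satellite was found but could not be loaded; treat as missing.
            }
        }
        return found;
    }

    static void Main() {
        // Use the application itself instead of mscorlib.
        foreach (CultureInfo ci in GetAvailableCultures(Assembly.GetExecutingAssembly())) {
            Console.WriteLine("\t" + ci.Name);
        }
    }
}
```

    For an application with no localized resources the list simply comes back empty.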

    There are probably more clever ways to do some of this; if someone knows what they are, they should point them out!

    The above is my first real foray into this area; I hope I did not embarrass myself too much. :-)

    Enjoy!

     

    This post brought to you by "ጂ" (U+1302, a.k.a. ETHIOPIC SYLLABLE JI)

  • Sorting it all Out

    Unicode Character Names

    • 4 Comments

    Andrew West has started a very interesting series entitled Unicode Character Names.

    Check out Part 1 (the Good the Bad and the Ugly) and Part 2 (a Name is for Life). And if you are like me you'll be waiting and hoping for additional parts in the future! :-)

    My favorite piece so far is at the end of Part 2:

    Since the merger between Unicode and ISO/IEC 10646 only two characters have ever changed their name, namely U+00C6 and U+00E6, which were originally called LATIN CAPITAL LETTER A E and LATIN SMALL LETTER A E in Unicode 1.0, then changed to LATIN CAPITAL LIGATURE AE and LATIN SMALL LIGATURE AE in Unicode 1.1 after the merger with ISO/IEC 10646, and finally changed to their current names LATIN CAPITAL LETTER AE and LATIN SMALL LETTER AE in Unicode 2.0. The latter change was due to representations by the Danish Standards Association who considered these two characters to be letters rather than ligatures; but this caused so much trouble and acrimony that the respective committees of Unicode and ISO/IEC 10646 resolved never again to make any name changes, regardless of the severity of the mistake or the triviality of the change required (see the Unicode Standard Stability Policy).

    Why is it my favorite?

    (perhaps I should say favourite given the British slant on names?)

    Well, it is an issue that comes up a lot and people simply don't appreciate what a nightmare it would be to deal with a constant flurry of name changing requests. There would be no time to encode any actual characters!

    In any case, I am glad that BabelStone is on the list of blogs I read, because if it were not I would miss gems like this....

     

  • Sorting it all Out

    I need my SPACE, symbolically speaking

    • 11 Comments

    (No, this is not a post about anyone breaking up with me and telling me that they need their space)

    In Microsoft's implementation of collation, we have several different categories of characters, and rules for dealing with each category.

    One of the interesting categories is the SYMBOL category. All of the miscellaneous odd symbols show up here. And they all come before the various letters and numbers.

    Of course, as a feature most of the symbols do not really have any linguistic meaning that would foster a set of rules for how to sort them. So as I pointed out in Not all characters are created equal: take SYMBOLS, for example, there must be some order within the symbols "block", and the order is usually arbitrary.

    And that gets us back on topic, to U+0020 (a.k.a. SPACE). It is a symbol, too.

    "But wait, Michael!" you may be crying now. "The space actually represents an absence of symbols, or numbers, or letters, or anything. So it should not be a symbol!"

    Well, this gets us kind of existential, which collation usually tries to avoid. It is based on expected behavior. So let's try some thought experiments to see where expected behavior leads us.

    If you were comparing the strings "Microsoft" and "Micro soft", would you expect them to be equal?

    Probably not.

    But if SPACE were given no weight in collation, then they would always be identical. And what is more, the name Ray Mond would show up in the Exchange global address book after Raye. And all kinds of other weirdnesses.

    So, it has to have some weight.

    As perhaps a psychic nod to those who are philosophically against treating SPACE as a symbol, it is the very lightest of the true symbols. And from a behavior standpoint everything works, as long as you do not pass that NORM_IGNORESYMBOLS flag to CompareString and LCMapString.

    This last paragraph may make some people wonder what I meant when I mentioned "true symbols" -- what are the symbols that are not true to us? Am I actually talking about relationships at this point, even though I said I was not?

    I did not change my mind on the subject, I promise. :-)  I am simply talking about a subcategory of symbols that are treated specially which weigh even less than the space -- the punctuation. They are the ones affected by word sort vs. string sort decisions (as I discuss here), and will weigh either less than the regular symbols (in the case of string sort) or less than even the difference between uppercase and lowercase letters (in the case of word sort).

    Let's see some of this in action. If we look at the sort keys for several of these situations, what is happening underneath becomes more obvious:

    Microsoft

    0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00

    Micro-soft (word sort)

    0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 80 1B 06 82 00

    Micro-soft (string sort)

    0E 51 0E 32 0E 0A 0E 8A 0E 7C 06 82 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00

    Micro soft

    0E 51 0E 32 0E 0A 0E 8A 0E 7C 07 02 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00

    Microsoft / Micro-soft / Micro soft (NORM_IGNORESYMBOLS)

    0E 51 0E 32 0E 0A 0E 8A 0E 7C 0E 91 0E 7C 0E 23 0E 99 01 01 12 01 01 00

    If you ignore symbols, they are all the same, otherwise the specific issues with the space, the hyphen, and word/string sort come into play.
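    For anyone who wants to dump sort keys like these from managed code, here is a rough sketch using CompareInfo.GetSortKey. It assumes that CompareOptions.None corresponds to word sort, CompareOptions.StringSort to string sort, and CompareOptions.IgnoreSymbols to NORM_IGNORESYMBOLS; the SortKeyDump class and Hex method are made-up names. The actual byte values depend on the platform and version doing the collation, so they should not be expected to match the listings above exactly.

```csharp
using System;
using System.Globalization;
using System.Text;

static class SortKeyDump {
    // Hypothetical helper: returns the sort key of s as space-separated hex bytes.
    public static string Hex(string s, CompareOptions options) {
        CompareInfo ci = CultureInfo.GetCultureInfo("en-US").CompareInfo;
        byte[] key = ci.GetSortKey(s, options).KeyData;

        StringBuilder sb = new StringBuilder(key.Length * 3);
        foreach (byte b in key) {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(b.ToString("X2"));
        }
        return sb.ToString();
    }

    static void Main() {
        Console.WriteLine("Microsoft\n{0}\n", Hex("Microsoft", CompareOptions.None));
        Console.WriteLine("Micro-soft (word sort)\n{0}\n", Hex("Micro-soft", CompareOptions.None));
        Console.WriteLine("Micro-soft (string sort)\n{0}\n", Hex("Micro-soft", CompareOptions.StringSort));
        Console.WriteLine("Micro soft\n{0}\n", Hex("Micro soft", CompareOptions.None));
        Console.WriteLine("Micro soft (ignore symbols)\n{0}\n", Hex("Micro soft", CompareOptions.IgnoreSymbols));
    }
}
```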

    Perhaps SPACE could have been a part of some bold new category that is not a symbol, but things are as they are -- and as it stands this returns intuitive results in most cases....

     

    This post brought to you by " " (U+0020, a.k.a. SPACE)

     

     

     

  • Sorting it all Out

    Getting all you can out of a keyboard layout, Part #3

    • 13 Comments

    Previous posts in this series: Part 0, Part 1, and Part 2.

    Ok, we are making some progress here, and we are at the very least no longer stomping on the user's own keyboard list.

    But we are ignoring dead keys and ligatures. Which, once again, is quite lame.

    The key here is to have a little more respect for the return value of ToUnicodeEx. Right now we do nothing with the resulting string unless the return value is 1. But there are three other possibilities:

    • If it's a dead key, the result will be -1;
    • If it's a ligature (by which I mean the keyboard definition, a string of 2-4 UTF-16 code units), the result will be the number of code units in that string;
    • If it fails, the result is 0.
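    Put another way, the return value can be sorted into four buckets. The sketch below is only an illustration of that classification; the KeyResult enum and ToUnicodeExResult class are invented names, and it assumes the keyboard state holds no pending dead key when the call is made.

```csharp
using System;

// Made-up names for the four outcomes described above.
enum KeyResult {
    NoTranslation,    // 0: nothing to do
    DeadKey,          // negative: one code unit in the buffer, state left pending
    SingleCharacter,  // 1: the simple case
    Ligature          // 2 or more: several code units from one keystroke
}

static class ToUnicodeExResult {
    // Classifies a ToUnicodeEx return code, assuming a clean keyboard state
    // (with a pending dead key, 2 can also mean "dead key plus next character").
    public static KeyResult Classify(int rc) {
        if (rc < 0) return KeyResult.DeadKey;
        if (rc == 0) return KeyResult.NoTranslation;
        if (rc == 1) return KeyResult.SingleCharacter;
        return KeyResult.Ligature;
    }

    static void Main() {
        int[] samples = { -1, 0, 1, 2, 4 };
        foreach (int rc in samples) {
            Console.WriteLine("{0}\t{1}", rc, Classify(rc));
        }
    }
}
```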

    Clearly the only case where we want to do nothing is when 0 is the return value; in all other cases we want to do something. So let's fix that....

    (As before, the older code is gray, the new code is black)

    using System;
    using System.Text;
    using System.Windows.Forms;
    using System.Runtime.InteropServices;

    namespace KeyboardLayouts {
        class Class1 {

            //  You'll want to insert that enumeration here!

            internal const uint KLF_NOTELLSHELL  = 0x00000080;

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="MapVirtualKeyExW", ExactSpelling=true)]
            internal static extern uint MapVirtualKeyEx(
                uint uCode,
                uint uMapType,
                IntPtr dwhkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="LoadKeyboardLayoutW", ExactSpelling=true)]
            internal static extern IntPtr LoadKeyboardLayout(string pwszKLID, uint Flags);

            [DllImport("user32.dll", ExactSpelling=true)]
            internal static extern bool UnloadKeyboardLayout(IntPtr hkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
            internal static extern int ToUnicodeEx(
                uint wVirtKey,
                uint wScanCode,
                KeysEx[] lpKeyState,
                StringBuilder pwszBuff,
                int cchBuff,
                uint wFlags,
                IntPtr dwhkl);

            [DllImport("user32.dll", ExactSpelling=true)]
            public static extern int GetKeyboardLayoutList(int nBuff, [Out, MarshalAs(UnmanagedType.LPArray)] IntPtr[] lpList);

            [STAThread]
            static void Main(string[] args) {
                int cKeyboards = GetKeyboardLayoutList(0, null);
                IntPtr[] rghkl = new IntPtr[cKeyboards];
                GetKeyboardLayoutList(cKeyboards, rghkl);
                IntPtr hkl = LoadKeyboardLayout(args[0], KLF_NOTELLSHELL);
                if(hkl == IntPtr.Zero) {
                    Console.WriteLine("Sorry, that keyboard does not seem to be valid.");
                }
                else {
                    KeysEx[] lpKeyState = new KeysEx[256];

                    for(uint sc = 0x01; sc <= 0x7f; sc++) {
                        uint vk = MapVirtualKeyEx(sc, 1, hkl);
                        if(vk != 0) {
                            StringBuilder sb = new StringBuilder(10);
                            int rc = ToUnicodeEx(vk, sc, lpKeyState, sb, sb.Capacity, 0, hkl);
                            if(rc > 0) {
                                StringBuilder sbChar = new StringBuilder(5 * rc);
                                for(int ich = 0; ich < rc; ich++) {
                                    sbChar.Append(((ushort)sb.ToString()[ich]).ToString("x4"));
                                    sbChar.Append(' ');
                                }
                                Console.WriteLine("{0:x2}\t{1:x4}\t{2:x2}\t{3}\t{4}",
                                    sc, 
                                    sbChar.ToString(0, sbChar.Length - 1), 
                                    vk, 
                                    ((KeysEx)vk).ToString(), 
                                    ((Keys)vk).ToString());
                            }
                            else if(rc < 0) {
                                Console.WriteLine("{0:x2}\t{1:x4}\t{2:x2}\t{3}\t{4}\t\t\tDEAD!!!",
                                    sc, 
                                    ((ushort)sb.ToString()[0]), 
                                    vk, 
                                    ((KeysEx)vk).ToString(), 
                                    ((Keys)vk).ToString());

                                // It's a dead key; let's flush out what's stored in the keyboard state.
                                ToUnicodeEx((uint)KeysEx.VK_SPACE, MapVirtualKeyEx((uint)KeysEx.VK_SPACE, 0, hkl), lpKeyState, sb, sb.Capacity, 0, hkl);
                            }
                        }
                    }

                    foreach(IntPtr i in rghkl) {
                        if(hkl == i) {
                            hkl = IntPtr.Zero;
                            break;
                        }
                    }

                    if(hkl != IntPtr.Zero) {
                        UnloadKeyboardLayout(hkl);
                    }
     
                }
            }
        }
    }

    Now a few different things happened here. First, any time the return of ToUnicodeEx is greater than zero, all of the code points are dumped out.

    Secondly, any time it is less than zero, the key is known to be a dead key, and as I point out in this post, dead keys are always limited to a single UTF-16 code unit. So we grab that one code unit and use it.

    Thirdly, in that dead key case a second call is made to clear out the buffer -- otherwise the next call will be contaminated by the dead key value and will return either a different character entirely or two separate characters. Neither of those situations is too terribly desirable, so the buffer is cleared out.

    (In an upcoming post I will explain why I chose VK_SPACE as the character for clearing out the buffer.)

    It is very important to pay attention to that return value and never read past the number of characters it reports, since the string is not guaranteed to be null terminated. In fact, let's look at the return values table from the documentation:

    -1 The specified virtual key is a dead-key character (accent or diacritic). This value is returned regardless of the keyboard layout, even if several characters have been typed and are stored in the keyboard state. If possible, even with Unicode keyboard layouts, the function has written a spacing version of the dead-key character to the buffer specified by pwszBuff. For example, the function writes the character SPACING ACUTE (0x00B4), rather than the character NON_SPACING ACUTE (0x0301).
    0 The specified virtual key has no translation for the current state of the keyboard. Nothing was written to the buffer specified by pwszBuff.
    1 One character was written to the buffer specified by pwszBuff.
    2 or more Two or more characters were written to the buffer specified by pwszBuff. The most common cause for this is that a dead-key character (accent or diacritic) stored in the keyboard layout could not be combined with the specified virtual key to form a single character. However, the buffer may contain more characters than the return value specifies. When this happens, any extra characters are invalid and should be ignored.

    Of course I am assuming people never mistype a dead key combination and thus tend to think of that "2 or more" case as being for ligatures -- certainly in the code provided the only case that applies is the ligature one (since the sample flushes each dead key right away and never tries to combine it with the next key!).
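
    To make the "never look past the return value" rule concrete, here is a minimal sketch of a helper that formats only the code units the return value vouches for. The names ToUnicodeResult and Describe are hypothetical (they are not part of the sample above), and the helper takes the buffer as a plain string so it can be tried without calling ToUnicodeEx at all:

    ```csharp
    using System;
    using System.Text;

    static class ToUnicodeResult {
        // Hypothetical helper, not from the original sample: formats only the
        // UTF-16 code units that the ToUnicodeEx return value says are valid.
        public static string Describe(int rc, string buffer) {
            // Zero means no translation; nothing in the buffer can be trusted.
            if (rc == 0) return "(no translation)";

            // A negative value marks a dead key, which is one code unit long.
            int count = rc < 0 ? 1 : rc;

            // Defensive clamp in case the caller hands over a shorter string.
            count = Math.Min(count, buffer.Length);

            StringBuilder sb = new StringBuilder(5 * count + 8);
            for (int ich = 0; ich < count; ich++) {
                if (ich > 0) sb.Append(' ');
                sb.Append(((ushort)buffer[ich]).ToString("x4"));
            }
            if (rc < 0) sb.Append(" DEAD");
            return sb.ToString();
        }

        static void Main() {
            // Anything after the first rc code units is ignored, junk or not.
            Console.WriteLine(Describe(1, "a\0junk"));
            Console.WriteLine(Describe(-1, "\u00b4"));
        }
    }
    ```

    The only design choice worth pointing out is that the count comes from the return value and never from the length of the buffer or from hunting for a null.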

    Ok, we are making progress now -- dead keys and ligatures. But we are still missing some important pieces like:

    • the base characters that go with the dead keys and the composite characters they create
    • the easy shift states
    • the CAPS LOCK key
    • the harder shift states
    • chained dead keys

    Now note that those last two go well beyond what even MSKLC supports, but that's okay; I am not limited in this sample by the same things that might limit functionality in MSKLC. :-)

    Obviously we'll need something a bit smarter in the way of an algorithm for some of these; this will be happening too. Remember that the main point of this sample is to show off some of those lessons that can be gleaned from this stuff....

     

    This post brought to you by "3" (U+0033, DIGIT THREE)
    A Unicode character that is in the very small family of those whose VK value is the same as its code point!

  • Sorting it all Out

    Getting all you can out of a keyboard layout, Part #2

    Previous posts in this series: Part 0 and Part 1.

    The most immediate and nagging problem for me is the fact that the UnloadKeyboardLayout call will unload the keyboard even if you had it loaded already as one of your many "installed" keyboard layouts.

    That stinks.

    But the problem is that there is really no way to know if a keyboard is already loaded other than using the following algorithm:

    • Load the installed keyboard list;
    • Load the keyboard you want to use;
    • Compare the HKLs in that list to the one you are using;
    • If it is not in the list you loaded, then it is okay to unload.
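
    Stripped of the Win32 calls, the steps above boil down to one comparison. Here is a minimal sketch of just that decision; the names LayoutBookkeeping and IsSafeToUnload are hypothetical, and the HKL values in Main are made-up placeholders rather than anything read from a real machine:

    ```csharp
    using System;
    using System.Collections.Generic;

    static class LayoutBookkeeping {
        // Hypothetical helper: returns true only when the HKL we got back from
        // loading a layout was NOT in the snapshot taken before we loaded it.
        public static bool IsSafeToUnload(IntPtr hkl, IEnumerable<IntPtr> loadedBefore) {
            // A zero handle means the load failed, so there is nothing to unload.
            if (hkl == IntPtr.Zero) return false;

            foreach (IntPtr existing in loadedBefore) {
                // The layout was already installed; leave it alone.
                if (existing == hkl) return false;
            }
            return true;
        }

        static void Main() {
            // Placeholder handles standing in for a GetKeyboardLayoutList snapshot.
            IntPtr[] before = { new IntPtr(0x04090409), new IntPtr(0x040c040c) };

            Console.WriteLine(IsSafeToUnload(new IntPtr(0x04090409), before));
            Console.WriteLine(IsSafeToUnload(new IntPtr(0x04070407), before));
        }
    }
    ```

    With those placeholder values the first call prints False (the layout was already in the list) and the second prints True. The snapshot has to be taken before the load, or every layout will look like it was already there.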

    To get that list, there are two different approaches -- one managed and one unmanaged. The unmanaged one uses the GetKeyboardLayoutList function, and the managed one uses the static InputLanguage.InstalledInputLanguages property.

    Let's take a look at two possible ways to do this (the older code from post #1 is gray, the new code is black).

    #1: The managed solution:

    using System;
    using System.Text;
    using System.Windows.Forms;
    using System.Runtime.InteropServices;

    namespace KeyboardLayouts {
        class Class1 {

            //  You'll want to insert that enumeration here!

            internal const uint KLF_NOTELLSHELL  = 0x00000080;

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="MapVirtualKeyExW", ExactSpelling=true)]
            internal static extern uint MapVirtualKeyEx(
                uint uCode,
                uint uMapType,
                IntPtr dwhkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="LoadKeyboardLayoutW", ExactSpelling=true)]
            internal static extern IntPtr LoadKeyboardLayout(string pwszKLID, uint Flags);

            [DllImport("user32.dll", ExactSpelling=true)]
            internal static extern bool UnloadKeyboardLayout(IntPtr hkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
            internal static extern int ToUnicodeEx(
                uint wVirtKey,
                uint wScanCode,
                KeysEx[] lpKeyState,
                StringBuilder pwszBuff,
                int cchBuff,
                uint wFlags,
                IntPtr dwhkl);

            [STAThread]
            static void Main(string[] args) {
                InputLanguageCollection rgil = InputLanguage.InstalledInputLanguages;
                IntPtr hkl = LoadKeyboardLayout(args[0], KLF_NOTELLSHELL);
                if(hkl == IntPtr.Zero) {
                    Console.WriteLine("Sorry, that keyboard does not seem to be valid.");
                }
                else {
                    KeysEx[] lpKeyState = new KeysEx[256];

                    for(uint sc = 0x01; sc <= 0x7f; sc++) {
                        uint vk = MapVirtualKeyEx(sc, 1, hkl);
                        if(vk != 0) {
                            StringBuilder sb = new StringBuilder(10);
                            int rc = ToUnicodeEx(vk, sc, lpKeyState, sb, sb.Capacity, 0, hkl);
                            if(rc == 1) {
                                Console.WriteLine("{0:x2}\t{1:x4}\t{2:x2}\t{3}\t{4}",
                                    sc, ((ushort)sb.ToString()[0]), vk, ((KeysEx)vk).ToString(), ((Keys)vk).ToString());
                            }
                        }
                    }
                    foreach(InputLanguage il in rgil) {
                        if(hkl == il.Handle) {
                            hkl = IntPtr.Zero;
                            break;
                        }
                    }

                    if(hkl != IntPtr.Zero) {
                        UnloadKeyboardLayout(hkl);
                    }
                }
            }
        }
    }

    And #2, the unmanaged one:

     

    using System;
    using System.Text;
    using System.Windows.Forms;
    using System.Runtime.InteropServices;

    namespace KeyboardLayouts {
        class Class1 {

            //  You'll want to insert that enumeration here!

            internal const uint KLF_NOTELLSHELL  = 0x00000080;

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="MapVirtualKeyExW", ExactSpelling=true)]
            internal static extern uint MapVirtualKeyEx(
                uint uCode,
                uint uMapType,
                IntPtr dwhkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="LoadKeyboardLayoutW", ExactSpelling=true)]
            internal static extern IntPtr LoadKeyboardLayout(string pwszKLID, uint Flags);

            [DllImport("user32.dll", ExactSpelling=true)]
            internal static extern bool UnloadKeyboardLayout(IntPtr hkl);

            [DllImport("user32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
            internal static extern int ToUnicodeEx(
                uint wVirtKey,
                uint wScanCode,
                KeysEx[] lpKeyState,
                StringBuilder pwszBuff,
                int cchBuff,
                uint wFlags,
                IntPtr dwhkl);

            [DllImport("user32.dll", ExactSpelling=true)]
            public static extern int GetKeyboardLayoutList(int nBuff, [Out, MarshalAs(UnmanagedType.LPArray)] IntPtr[] lpList);

            [STAThread]
            static void Main(string[] args) {
                int cKeyboards = GetKeyboardLayoutList(0, null);
                IntPtr[] rghkl = new IntPtr[cKeyboards];
                GetKeyboardLayoutList(cKeyboards, rghkl);
                IntPtr hkl = LoadKeyboardLayout(args[0], KLF_NOTELLSHELL);
                if(hkl == IntPtr.Zero) {
                    Console.WriteLine("Sorry, that keyboard does not seem to be valid.");
                }
                else {
                    KeysEx[] lpKeyState = new KeysEx[256];

                    for(uint sc = 0x01; sc <= 0x7f; sc++) {
                        uint vk = MapVirtualKeyEx(sc, 1, hkl);
                        if(vk != 0) {
                            StringBuilder sb = new StringBuilder(10);
                            int rc = ToUnicodeEx(vk, sc, lpKeyState, sb, sb.Capacity, 0, hkl);
                            if(rc == 1) {
                                Console.WriteLine("{0:x2}\t{1:x4}\t{2:x2}\t{3}\t{4}",
                                    sc, ((ushort)sb.ToString()[0]), vk, ((KeysEx)vk).ToString(), ((Keys)vk).ToString());
                            }
                        }
                    }
                    foreach(IntPtr i in rghkl) {
                        if(hkl == i) {
                            hkl = IntPtr.Zero;
                            break;
                        }
                    }

                    if(hkl != IntPtr.Zero) {
                        UnloadKeyboardLayout(hkl);
                    }
                }
            }
        }
    }

    Unfortunately, the managed solution cannot make good use of the InputLanguageCollection.Contains method, which would feel a bit more elegant to me -- that method expects an InputLanguage object, not an HKL. An overload that takes an HKL seems like a good one to consider adding.
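
    Until something like that exists, the missing piece is easy enough to approximate. This is only a sketch of what such a check might look like: InputLanguageHelper and ContainsHandle are hypothetical names of my own, not anything in the .NET Framework, and the body is just the same foreach loop from the managed sample moved into a method:

    ```csharp
    using System;
    using System.Windows.Forms;

    static class InputLanguageHelper {
        // Hypothetical stand-in for a Contains overload that takes an HKL.
        public static bool ContainsHandle(InputLanguageCollection languages, IntPtr hkl) {
            if (languages == null) return false;

            foreach (InputLanguage il in languages) {
                // InputLanguage.Handle is the HKL for that input language.
                if (il.Handle == hkl) return true;
            }
            return false;
        }

        [STAThread]
        static void Main() {
            // Check whether the current input language shows up as installed.
            IntPtr current = InputLanguage.CurrentInputLanguage.Handle;
            Console.WriteLine(ContainsHandle(InputLanguage.InstalledInputLanguages, current));
        }
    }
    ```

    In the managed sample this would let the cleanup step read as a single test: unload only if ContainsHandle(rgil, hkl) is false.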

    Though if they were thinking along those lines (or if they are reading this post and thinking about features for a future version!) I would not mind having them wrap the LoadKeyboardLayout and UnloadKeyboardLayout calls, too....

    In the meantime, which of the two approaches is better?

    Some people kind of religiously want to avoid p/invoke when they can, and the irony of their belief given the veritable buttload of p/invokes built into the .NET Framework is something they usually miss.

    I didn't look at the source in InputLanguage, but I suspect that it looks pretty similar to that call to GetKeyboardLayoutList anyway. And since I did not want to tie the sample to managed code only, I figured putting both in there as options would make it easier to look at the one you like best. :-)

    The important issue is to do your best to make sure the program does not affect the list of installed keyboards in any discernable way, and both of these methods are pretty much equivalent....

    Tune in to the next post in the series for our ever expanding keyboard interrogator....

     

    This post brought to you by "2" (U+0032, DIGIT TWO)
    A Unicode character that is in the very small family of those whose VK value is the same as its code point!
