January, 2007

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

    Mixing it up with bidirectional text

    So the question that Ziv asked was:

    Hi,

    I’m trying to display both English and Hebrew text in a single WinForms RichTextBox. Basically, the user types a string in one RichTextBox control (in either languages) and I’m appending it to the contents of another RichTextBox control.

    The problem is that “ambivalent” characters (such as “!” and “:”), while they get displayed correctly when the user types them in, are not displayed correctly once appended to the other RichTextBox.

    For example, if the user types the following two strings:

    Hello!
    שלום!‏

    Appending those strings to the existing RichTextBox yields the following display (if RightToLeft is set to “No”):

    Hello!
    שלום!

    And yields the following display (if RightToLeft is set to “Yes”):

    !Hello
    שלום!

    How can I trick the RichTextBox into behaving correctly?

    Thanks, Ziv.

    This is kind of like a problem I have discussed before in posts like this one, with a new twist -- the fact that one does not know what the text might be here, whether it will be Hebrew or English. If one knows, then one can properly use U+200e (LEFT-TO-RIGHT MARK) and U+200f (RIGHT-TO-LEFT MARK) before these potentially visually leading/trailing characters that have a more neutral directionality.

    If you have no idea whether things are LTR or RTL though, then you don't know what to insert.

    Either way, you probably need to get the data out about the various Bidi categories of all of the characters.

    To do that in the .NET Framework, you currently have to use reflection to get at an internal method that some others have found while spelunking through the IL of the .NET Framework. At one point there was discussion of making it public, but that did not end up happening. The method works, though, and enough people have puzzled this one out using reflection that I figured I would just post it now and perhaps keep the next 100 people from having to do it. :-)

    Here is a simplified example:

    using System;
    using System.Reflection;
    using System.Globalization;

    class CharUnicodeInfoReflection
    {
        [STAThread]
        static void Main() {
            string st = "Hello!\r\nשלום!";
            // CharUnicodeInfo itself is public, but GetBidiCategory is internal, hence the reflection.
            Type typeCharUnicodeInfo = Type.GetType("System.Globalization.CharUnicodeInfo");
            BindingFlags bf = BindingFlags.NonPublic | BindingFlags.Static | BindingFlags.Instance | BindingFlags.InvokeMethod;
            MethodInfo getBidiCategory = typeCharUnicodeInfo.GetMethod("GetBidiCategory", bf);

            for(int ich = 0; ich < st.Length; ich++) {
                // The internal method takes a string and an index rather than a single char.
                Object [] parameters = new Object[2] {st, ich};

                Object o = getBidiCategory.Invoke(typeCharUnicodeInfo, bf, null, parameters, CultureInfo.InvariantCulture);

                // Print the code point, the type of the returned value, and the bidi category name.
                Console.WriteLine("U+" + ((ushort)st[ich]).ToString("x4") + "    " + o.GetType().ToString() + "    " + o.ToString());
            }
        }
    }

    This code will return the following when run:

    U+0048    System.Globalization.BidiCategory    LeftToRight
    U+0065    System.Globalization.BidiCategory    LeftToRight
    U+006c    System.Globalization.BidiCategory    LeftToRight
    U+006c    System.Globalization.BidiCategory    LeftToRight
    U+006f    System.Globalization.BidiCategory    LeftToRight
    U+0021    System.Globalization.BidiCategory    OtherNeutrals
    U+000d    System.Globalization.BidiCategory    ParagraphSeparator
    U+000a    System.Globalization.BidiCategory    ParagraphSeparator
    U+05e9    System.Globalization.BidiCategory    RightToLeft
    U+05dc    System.Globalization.BidiCategory    RightToLeft
    U+05d5    System.Globalization.BidiCategory    RightToLeft
    U+05dd    System.Globalization.BidiCategory    RightToLeft
    U+0021    System.Globalization.BidiCategory    OtherNeutrals

    So we've learned that we can get the Unicode bidi class of any Unicode character. In fact, we can probably get the type explicitly and use it more directly than this quick example if we wanted to create a wrapper to make it easier to call while hiding the reflection stuff. Anyone want to take a stab at that? ;-)

    And now we have the key here to solving Ziv's issue -- any time one finds neutral characters at whichever end of the string is going to be stuck on another string, one has to add either an RLM or an LRM, matching the last character with some direction that we found, before the append. And for good measure we do it on the other end of the string too, so that a neutral on the other end is not misinterpreted.

    Thus in this case (for example), where the string ends with ! (U+0021, a.k.a. EXCLAMATION MARK), we have to walk backwards in the string to the first character that has some direction. We see it is U+05dd and that this character is RightToLeft, so we add a U+200f to the end before we append or prepend another string (and we do something similar if the string we are appending/prepending has neutral characters at its ends, too).
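    The walk described above might be sketched roughly as follows. This is only an illustration under stated assumptions: the Dir enum, the BidiGuard class, and especially the toy Classify method are hypothetical names invented here, and Classify knows only basic Latin and Hebrew letters. Real code would plug in a full bidi category lookup (such as the reflection call shown earlier) in its place.

```csharp
using System;

// Hypothetical three-way direction used only by this sketch.
enum Dir { Neutral, LeftToRight, RightToLeft }

static class BidiGuard
{
    const char Lrm = '\u200E'; // LEFT-TO-RIGHT MARK
    const char Rlm = '\u200F'; // RIGHT-TO-LEFT MARK

    // Assumption: a toy classifier that knows only basic Latin and Hebrew letters.
    // Everything else (punctuation, digits, whitespace) is lumped in as neutral.
    public static Dir Classify(char c)
    {
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) return Dir.LeftToRight;
        if (c >= '\u05D0' && c <= '\u05EA') return Dir.RightToLeft;
        return Dir.Neutral;
    }

    // Scans from 'start' in steps of 'step' and returns the mark that matches the
    // first character with a direction, or null if only neutrals are found.
    static char? NearestMark(string s, int start, int step)
    {
        for (int i = start; i >= 0 && i < s.Length; i += step)
        {
            Dir d = Classify(s[i]);
            if (d == Dir.LeftToRight) return Lrm;
            if (d == Dir.RightToLeft) return Rlm;
        }
        return null;
    }

    // Adds a mark at each end of the string that currently ends in a neutral character.
    public static string Protect(string s)
    {
        if (string.IsNullOrEmpty(s)) return s;

        string result = s;
        if (Classify(s[s.Length - 1]) == Dir.Neutral)
        {
            char? mark = NearestMark(s, s.Length - 1, -1);
            if (mark.HasValue) result = result + mark.Value;
        }
        if (Classify(s[0]) == Dir.Neutral)
        {
            char? mark = NearestMark(s, 0, +1);
            if (mark.HasValue) result = mark.Value + result;
        }
        return result;
    }
}
```

    With something like this in place, the append in Ziv's scenario would pass the typed string through BidiGuard.Protect before handing it to the second RichTextBox. A string made up entirely of neutrals is returned unchanged, since there is no direction to match.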

    Should this be built in?

    Well, maybe.

    It is hard to imagine the exact semantic of such a method or what we would call it (or even what object would it go on, exactly).

    In this world where the .NET Framework supports neither parsing nor formatting with LRM and RLM like Win32 does, it just seems a little premature to start adding code that will insert these characters so freely. Know what I mean? :-)

    One special note -- the GetBidiCategory method does not seem to have a method that takes a single char (a developer asked me about this a few weeks ago and wondered if he was missing something; he wasn't); it only has one that takes a string and an index (a signature I have discussed previously), which means if you pass a supplementary character in UTF-16 as a high surrogate and a low surrogate, you will get the bidi category of the supplementary character. This is what you would want for any code, but note that the code above would have to be modified so that any time one has a high and a low surrogate one knows to not get the bidi category of the low surrogate by itself....
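    One possible shape for that modification, as a hedged sketch: collect only the indexes at which a (string, index) style lookup should be made, stepping over the low half of each surrogate pair. The CodePointWalker class and LookupIndexes method are hypothetical names used just for this illustration; char.IsHighSurrogate and char.IsLowSurrogate are the standard .NET methods.

```csharp
using System;
using System.Collections.Generic;

static class CodePointWalker
{
    // Returns the UTF-16 indexes that start a character: every index except
    // the low surrogate of a well-formed high/low surrogate pair.
    public static List<int> LookupIndexes(string s)
    {
        var indexes = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            indexes.Add(i);

            // A high surrogate followed by a low surrogate is one supplementary
            // character, so skip the low surrogate rather than look it up alone.
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                i++;
            }
        }
        return indexes;
    }
}
```

    The loop in the earlier example could then iterate over these indexes instead of over every char position. Unpaired surrogates are deliberately left in the list so that they still get looked up (and reported) on their own.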

    If someone really wanted to take a stab at the generic function that would do all this, I think it meets the complexity level of a difficult interview question and I'd likely be impressed by code that would do the trick. :-)

     

    This post brought to you by ! (U+0021, a.k.a. EXCLAMATION MARK)

    Sorting The Old New Thing All Out

    A couple of months ago I got a phone call from someone at Addison-Wesley who wanted to send me some books. They ended up sending me two of them right away (more on those another day) but the third didn't show up because a week before someone else from Addison-Wesley had called me and asked if they could send me that same book. I believe I pointed this out and I assumed that they each figured someone else was sending a copy.

    The book was Raymond Chen's new book The Old New Thing: Practical Development Throughout the Evolution of Windows.

    I assumed that someone figured I was running some kind of book scam (which I wasn't, though I once got an AE to send me a book not even printed by her publishing company) but I figured I'd just pick up a copy anyway and I didn't need a free copy to want to buy one....

    Though after I saw Dennis E. Hamilton's comment the other day, I realized that perhaps the book might still make its way to me.

    Then yesterday, I got two copies of the book (one delivered to my apartment and the other to my office). Oops!

    I'll get over it, I'm sure. It is a truly amazing book, for a lot of reasons....

    There is the full chapter on International Programming (it's chapter 16), and the fact that various issues I would cover (or have covered) here like sorting in Shell and time zones are covered in some of the other chapters.

    There is even the specific mention of me and this blog at the beginning of Chapter 16, which was really quite nice and probably won't decrease the intrinsic value of the tome too much. As an interesting side effect of the way the indexing worked I am one of like four or five names in the index, and Sorting It All Out seems to be the only blog (it was just a quick glance, I may have missed something).

    There was even a bit on page 374 ("An anecdote about Improper Case Mapping") that I got to point out a mistake in last July, which was corrected for the book (and it is right there, giving a good example of the issue I mentioned here). It's funny, when the book finally had the final manuscript delivered Raymond told me he forgot to get the fix in -- I guess he got it done somehow anyway. :-)

    But what I like most of all is what he did when converting the blog into a great book that did not have me missing the links -- because the topics themselves were in the chapters with the other ones that I would have wanted to go to anyway.

    I actually found myself wanting to go back and forth between the book and the blog at times, due to the different ways that each one comes across, sometimes with the exact same words, just by virtue of the surrounding text. Plus I often find the comments distracting enough that a break from them was quite welcome -- I am starting to see some benefit to that separate comment feed that Scoble liked so much?

    I am wondering about how hard of a job it might be to take over 1000 posts and weave a consistent tapestry, representative of so many of the posts within that blog. I am astounded that any developer would try, and amazed at how good of a job he ended up doing, both for the quality and for the whole new job of organizing what may have originally felt to me like a random drunkard's walk through so many different topics, and actually showing that the two different approaches each have their own unique value to impart.

    Which is where the title of this post comes from -- Raymond has applied a principle of this very blog and produced a tome that has quite deftly proved to be a great thing for Sorting The Old New Thing All Out. :-)

    Raymond's blog can "sell" the book for this organized alternate view of the information, and the book can "sell" the blog for trying to get more of the same. It is pretty cool how it worked out if you ask me. Were I not so sure that I would not live up to the example, I'd be on the phone to my old AE or to Joan Murray of Addison-Wesley tomorrow to tell her I changed my mind about writing another book (I told her this at Tech Ed in 2005).

    This is the kind of book that I could see myself scooting over to try to get signed while not trying to act as silly as I would feel doing it. Maybe if I timed it with a meeting that happened to be going on in the same building? Nah, that won't work. I'll have to keep thinking on the way to do this that would be least embarrassing (or maybe as with Duncan I'll just not try!).

    In any case, whether you pick it up for the knowledge, for the history, for the humor, for the glimpse into the development of the most widely used operating system on the planet, for the glimpse into one of the most clever minds one could ever run across, or because you just have to read what Scott Hanselman's pull quote was, this is a book I would highly recommend.

    I think my extra copy might be a raffle item at this month's PNWADG meeting, just as an FYI to members who are reading here, and as a minor disclaimer I did point out that I got the book for free but actually would have bought it anyway. Odds are I'll end up buying it a few times for various folks I know who would appreciate it (I'll probably buy one if I ever do get it signed, it just seems wrong to me to get a signature on a free book?).

    Anyway, this book is definitely a keeper. The only flaw I can see is that I can't go out and buy "volume two" next week. :-)

     

    This post brought to you by Ｒ (U+ff32, a.k.a. FULLWIDTH LATIN CAPITAL LETTER R)

    Converting a project to Unicode: Part 9 (The project's postpartum postmortem)

    Previous posts in this series (including today's!):

    (If you are just tuning in and want to start now that we are done, you can grab the latest source from here)

    If you look at the source, you'll see I chickened out of always adding MSLU to Unicode builds, so there is makefile.mslu and a makefile.uni. :-)

    Now that we have gone through and taken an application that is actually useful and converted it to Unicode, I figured for the review it would be good to talk about it a bit.

    (I honestly did not look at the code until after deciding to do the series, so this is a true postmortem decision about the effort!)

    As projects go, this one was fairly tame, and although there were a few issues that were discussed, it was just a few. To compare briefly, the kbdtool.exe --> kbdutool.exe conversion I mentioned back in Part 0 made extensive use of the C Runtime for its extensive file handling, parsing, and creating operations. So the single example of strtoul being converted to _tcstoul that I talked about in Part 4 would have to be multiplied out to the 131 such changes that were required. The fact is that in the real world of app conversion you could find that the actual effort takes more time even if you do not run into any problems more complex than the ones we dealt with here.

    Another interesting comment that was made by Mike Dimmick to Part 3 talked about an issue related to printf-esque format specifiers, which have outrageous rules in relation to Unicode conversion:

     

    Character | Type | Output format
    c | int or wint_t | When used with printf functions, specifies a single-byte character; when used with wprintf functions, specifies a wide character.
    C | int or wint_t | When used with printf functions, specifies a wide character; when used with wprintf functions, specifies a single-byte character.
    hc, hC | int or wint_t | Specifies a single-byte character; it is always interpreted as type CHAR, even when the calling application uses the #define UNICODE compile flag.
    hs, hS | String | Specifies a string; it is always interpreted as type LPSTR, even when the calling application uses the #define UNICODE compile flag.
    lc, lC | int or wint_t | Specifies a wide character; it is always interpreted as type WCHAR, even when the calling application does not use the #define UNICODE compile flag.
    ls, lS | String | Specifies a string; it is always interpreted as type LPWSTR, even when the calling application does not use the #define UNICODE compile flag.
    s | String | When used with printf functions, specifies a single-byte–character string; when used with wprintf functions, specifies a wide-character string. Characters are printed up to the first null character or until the precision value is reached.
    S | String | When used with printf functions, specifies a wide-character string; when used with wprintf functions, specifies a single-byte–character string.

    Now I can completely understand why every single one of these format specifiers exist, but you can see why there is a potential for strange results as one moves a project to Unicode, since one is not only dealing with the conversion of the application but in some cases one is dealing with parsing and manipulating data from other sources that may or may not also be converted at the same time.

    In our case, the extensive use of formatting strings in the DebugMsg function was always used by callers with the %s type, so everything worked out. But if you are converting an application that is using anything other than %c and %s from the above table, you can have a much harder job deciding how to convert the project.

    Clearly the project was in many ways written in "the right way" to handle the conversion we did -- note especially the mostly consistent use of sizeof() in character buffer lengths, something often missing -- a fact that only came to bite us in a few specific cases that were clearly written later on by other developers.

    Because of such efforts, it is perhaps better to think of the setup bootstrap EXE project as a fair representative of the type of problems one will hit, if not necessarily the magnitude of those problems.

    And what has been "delivered" is an EXE that you may well see in the upcoming release of MSKLC. :-)

    Now I'll keep my eyes open, and if I run across another example like this of a project to convert that can be shared this way I'd love to do it again some time. I think it would be especially interesting to do one that turns out to be much harder in terms of the amount of effort, just to help give a good sense of how hard people might find the process, in general.

     

    This post brought to you by ᠹ (U+1839, a.k.a. MONGOLIAN LETTER FA)

    How can be changed the keyboard layout name label?

    Regular reader from Romania Christian Secara asks:

    How can be changed the keyboard layout names, the ones from the tray icon for example, from a programmatic point of view ?

    After installing LIP or MUI, the keyboard layout names have translated names, so there should be a way to change this. Is there a "public" mechanism (API, or smth), or this is reserved for language packs only ?

    Cristi

    Well, there is no way from a programmatic point of view.

    But if you look in the registry where the list of keyboard layouts lives, down in

    HKLM\SYSTEM\CurrentControlSet\Control\Keyboard Layouts\

    and look at any of the subkeys and values under it, e.g. under XP:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Keyboard Layouts\00000418
        Layout Display Name      @%SystemRoot%\system32\input.dll,-5037
        Layout File                      KBDRO.DLL
        Layout Text                     Romanian

    That "Layout Display Name" entry (which has been around since XP) is a string designed for the SHLWAPI SHLoadIndirectString function.

    The way that it changes when the UI language changes is the way all of the rest of the resources change -- they load the strings from the appropriate localized DLL...

    In versions prior to XP, the Layout Text value is the one that would be used, and its language would not change when the UI language did.

    For more info (including a function to enumerate the keyboards), you can look to the post I did this last May entitled Getting the real (localized) name of the keyboard, which I guess points to the best answer -- the search box for this blog! :-)

     

    This post brought to you by ţ (U+0163, a.k.a. LATIN SMALL LETTER T WITH CEDILLA)
    (because in the words of Buck Murdock, irony can be pretty ironic sometimes!)

    Converting a project to Unicode: Part 8 (Fitting MSLU into the mix)

    Previous posts in this series (including today's!):

    (If you are just tuning in and want to start now you can grab the current source from here)

    As I mentioned almost from the start, one of the big downsides to converting setup.exe to Unicode is that Win9x doesn't support Unicode.

    Lucky for us we have the Microsoft Layer for Unicode on Windows 95, 98, and Me Systems, huh?

    Anyway, just as regular reader Dean Harding suggested, a logical step at this point in the series is to add MSLU support to our project. :-)

    Now since MSLU makes no sense in the ANSI build, we will start with the makefile.uni that was added yesterday in Part 7 that you can find in the source code download above:

    # THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF
    # ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO
    # THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A
    # PARTICULAR PURPOSE.
    #
    # Copyright (c) Microsoft Corporation.  All Rights Reserved.
    #
    #
    # Processor independent Unicode makefile,
    # perhaps one day intended for Platform SDK
    #
    # Target only i386

    # FILE : MAKEFILE.UNI

    !include <ntwin32.mak>

    #include support for unicode
    cflags   = $(cflags) -D_UNICODE -DUNICODE

    !include <makefile> 

    Now the key to supporting MSLU is making sure to add unicows.lib to the link list prior to all of the .LIB files that it uses (and incidentally prior to the ones that contain source code that use it, though we do not have any of those in this case).

    Some casual spelunking through ntwin32.mak indicates that it is the baselibs variable that includes the basic libraries like kernel32.lib. So all we have to do is add unicows.lib prior to the other lib files in baselibs and then MSLU has been integrated!

    To verify this, you can use the following command to look at all of the functions that are imported by our setup.exe:

    link -dump -imports WIN2000_DEBUG\setup.exe

    (this command works a lot like dumpbin.exe does) 

    As I have mentioned before when talking about dumpbin, when a function that used to come from a regular OS lib starts coming from unicows.lib, it seems to disappear from the list of imported functions. It almost reminds me of that Douglas Adams story I quoted in When you think it couldn't get any harder, it gets easier, where you are using the fact that certain items are not where you expect them as proof of what is going on.

    If you run that line with the baselibs line in makefile.uni and compare it to the results of when it isn't there, you will see the bulk of the "W" functions disappear. As if they fell out of the hole in the ship caused by the meteorite or something. :-) 

    And there is one more interesting thing you might have to do here that the output indicates. We'll look at it really quickly to see what I am talking about (the interesting ones are VerifyVersionInfoW, URLDownloadToCacheFileW, InternetCanonicalizeUrlW, and DeleteUrlCacheEntryW):

    E:\setup.exe>link -dump -imports WIN2000_DEBUG\setup.exe

    Microsoft (R) COFF/PE Dumper Version 8.00.50727.762
    Copyright (C) Microsoft Corporation.  All rights reserved.


    Dump of file WIN2000_DEBUG\setup.exe

    File Type: EXECUTABLE IMAGE

      Section contains the following imports:

        KERNEL32.dll
                    451014 Import Address Table
                    45F434 Import Name Table
                         0 time date stamp
                         0 Index of first forwarder reference

                      257 LoadResource
                      25C LocalFree
                      1FF GlobalFree
                      1F8 GlobalAlloc
                      380 VerifyVersionInfoW
                      37D VerSetConditionMask
                      142 GetCurrentProcess
                       3A CompareStringA
                      17F GetModuleHandleA
                      17D GetModuleFileNameA
                      1F3 GetWindowsDirectoryA
                      1C1 GetSystemDirectoryA
                      252 LoadLibraryA
                      229 InterlockedExchange
                      265 LockResource
                      15A GetExitCodeProcess
                      171 GetLastError
                       34 CloseHandle
                       EE FlushFileBuffers
                      388 VirtualQuery
                       53 CreateFileA
                      3CC lstrlenA
                      135 GetConsoleOutputCP
                      399 WriteConsoleA
                      337 SetStdHandle
                      1E2 GetTimeZoneInformation
                      133 GetConsoleMode
                      122 GetConsoleCP
                      31B SetFilePointer
                      223 InitializeCriticalSection
                      381 VirtualAlloc
                      328 SetLastError
                       F8 FreeLibrary
                      21A HeapReAlloc
                      244 LCMapStringA
                      2EE SetConsoleCtrlHandler
                      28D OutputDebugStringA
                       78 DebugBreak
                      2A3 QueryPerformanceCounter
                      1DF GetTickCount
                      146 GetCurrentThreadId
                      143 GetCurrentProcessId
                      1CA GetSystemTimeAsFileTime
                      35E TerminateProcess
                      36E UnhandledExceptionFilter
                      34A SetUnhandledExceptionFilter
                      216 HeapFree
                      1E9 GetVersionExA
                      210 HeapAlloc
                      1A3 GetProcessHeap
                      22C InterlockedIncrement
                      228 InterlockedDecrement
                      239 IsDebuggerPresent
                      2D7 RtlUnwind
                       FD GetACP
                      193 GetOEMCP
                      365 TlsGetValue
                      363 TlsAlloc
                      366 TlsSetValue
                      364 TlsFree
                      145 GetCurrentThread
                      220 HeapValidate
                      233 IsBadReadPtr
                      2A7 RaiseException
                       81 DeleteCriticalSection
                       98 EnterCriticalSection
                      251 LeaveCriticalSection
                       C0 FatalAppExitA
                       B9 ExitProcess
                       F6 FreeEnvironmentStringsA
                      155 GetEnvironmentStrings
                      110 GetCommandLineA
                      111 GetCommandLineW
                      324 SetHandleCount
                      1B9 GetStdHandle
                      166 GetFileType
                      1B7 GetStartupInfoA
                      214 HeapDestroy
                      212 HeapCreate
                      383 VirtualFree
                      3A4 WriteFile
                      1E0 GetTimeFormatA
                      147 GetDateFormatA
                      1BA GetStringTypeA
                      174 GetLocaleInfoA
                      241 IsValidLocale
                       AF EnumSystemLocalesA
                      1E3 GetUserDefaultLCID
                      313 SetEnvironmentVariableA

        ADVAPI32.dll
                    451000 Import Address Table
                    45F420 Import Name Table
                         0 time date stamp
                         0 Index of first forwarder reference

                       1D AllocateAndInitializeSid
                       E2 FreeSid


        USER32.dll
                    45117C Import Address Table
                    45F59C Import Name Table
                         0 time date stamp
                         0 Index of first forwarder reference

                       E1 ExitWindowsEx
                      24D SetCursor
                      111 GetDlgItem
                      15D GetSystemMetrics
                      1ED MsgWaitForMultipleObjects
                       99 DestroyWindow
                      1EC MoveWindow
                      256 SetFocus
                      292 ShowWindow
                      257 SetForegroundWindow
                      2AA TranslateMessage
                      174 GetWindowRect

        COMCTL32.dll
                    45100C Import Address Table
                    45F42C Import Name Table
                         0 time date stamp
                         0 Index of first forwarder reference

                       5D InitCommonControlsEx

        urlmon.dll
                    4511BC Import Address Table
                    45F5DC Import Name Table
                         0 time date stamp
                         0 Index of first forwarder reference

                       49 URLDownloadToCacheFileW

        WININET.dll
                    4511B0 Import Address Table
                    45F5D0 Import Name Table
                         0 time date stamp
                         0 Index of first forwarder reference

                       65 InternetCanonicalizeUrlW
                        D DeleteUrlCacheEntryW

      Summary

            4000 .data
            F000 .rdata
            3000 .rsrc
           50000 .text

    The functions marked in red (the four "W" functions called out in the list below) are the ones that may either not exist or may only be present as stubs on Win9x (according to either Platform SDK docs or unclear issues in Platform SDK docs, e.g. the one I first mentioned in MSLU doesn't support wininet.dll). So in theory there may be a little more work here, whether that includes perhaps separately wrapping these functions in a delayload kind of wrapper or sending some email to the Platform SDK folks or just deciding not to go down the MSLU road in the case of a new version of MSKLC that doesn't run on Win9x anyway? :-)

    The small number of functions in this case will require some specific investigation:

    • VerifyVersionInfo (according to the VS2005 MSDN docs, only supported on Win2000 and later, no Win9x support)
    • InternetCanonicalizeUrl (according to the VS2005 MSDN docs, claims to support Unicode only through MSLU, but MSLU does not support this function)
    • DeleteUrlCacheEntry (according to the VS2005 MSDN docs, claims to support Unicode only through MSLU, but MSLU does not support this function)
    • URLDownloadToCacheFile (according to the VS2005 MSDN docs, claims to support Unicode only through MSLU, but MSLU does not support this function)

    In any case, although this step is one you should always do when adding MSLU to a project you are converting to Unicode, it is a step that you may not really need to do otherwise....

    Plus that first function, which becomes VerifyVersionInfoA on ANSI builds, would indicate that setup.exe won't run on Win9x anyway? More research is definitely needed, I think!

    On a slightly unrelated note, a question:

    If you have to support MFC or some other library it will change what you do with the baselibs a bit. Anyone care to guess what would have to change? :-)

     

    This post brought to you by ꂨ (U+a0a8, a.k.a. YI SYLLABLE HMUR)

    Whither intl.inf in Vista?

    Regular reader Ivan Petrov asked in the Suggestion Box:

    Hi Michael,

    I've two questions for you:

    1) What happened with the 'intl.inf' file in Windows Vista?

    and

    2) Where has disappeared most of its content?

    Regards,

    Ivan.

    Now Ivan has been asking questions about various features like code pages and how to modify them or which locales point to them or which ones are installed or keyboard choices and so forth (I can recall four off the top of my head and there are probably lots more that one could find, too!).

    Now intl.cpl, over the course of Windows NT 3.1 to Windows NT 3.5 to Windows NT 4.0 to Windows 2000 to Windows XP to Windows Server 2003, had really done quite a number on this huge file, which was actually built for each language SKU by combining a language-neutral intl.inx with a bunch of localized intl.txt files (one per SKU). Let me tell you, that file was a nightmare to maintain due to its fragile nature and its hugely complicated, hard-to-track dependencies (it was a running joke that a lead PM we had would be guaranteed to break the build every time a change was made back during the XP/Server 2003 timeframe).

    And of course there were huge mutual dependencies with layout.inf, font.inf, hivedef.inf, and others that made it even harder to track the nature of changes - for fonts, for locale information, and so on. Some of this data was even duplicated across different INF files or between INF files and data elsewhere in the system.

    It is true other INFs were also quite complicated in Windows, though ours was complicated in part because of all the efforts other teams put in to adding stuff to it any time their functionality might hinge off our configuration settings (from Index Server to fonts and so on and so on). So the breaks were also often quite hard to track down and sometimes even too subtle to notice.

    We were also tied to intl.inf for post-setup operations as the file acted as a data store for install/uninstall decisions one would make later in intl.cpl or input.dll.

    It was a nightmare.

    In Vista, with a whole new setup mechanism and an effort to make teams own their own components and the configurations thereof, we worked to get a lot of those more complex dependencies removed. Between that and the effort to install everything by default and not give uninstall options, most of the content went away, and the few things that were still needed found new homes (like LOCALE_SKEYBOARDSTOINSTALL data, described here).

    Now I won't claim that the new setup does not have complications of its own (how can anyone ship a product of this size that is not complicated?) though from our point of view there are many fewer complications even as we added a ton of locales and keyboards and fonts.

    For more specifics, you'd probably have to specify what data was being looked for, precisely? :-)

     

    This post brought to you by Ӿ (U+04fe, a.k.a. CYRILLIC CAPITAL LETTER HA WITH STROKE)

  • Sorting it all Out

    Converting a Project to Unicode: Part 7 (What does it mean to fit things to a 'T', anyway?)

    • 4 Comments

    Previous posts in this series (including today's!):

    (If you are just tuning in and want to start now you can grab the current source from here -- no changes since it was posted the day before yesterday)

    Like I said yesterday, if you have read Parts 2-5 then you know how we went from a purely ANSI application to a purely Unicode one.

    The binary itself has been tested with the MSKLC update and it resolves the bug I talked about back in Part 0. And the Unicode Bootstrap EXE works for the scenarios in which it will be used.

    Now for a moment I wanted to talk about the myth of applications compiled as both Unicode and ANSI. We say TCHAR but the truth is that most of the time the dev has just one in mind. For me it is Unicode (which leads to problems like the one Mihai pointed out here) and to be honest most developers think of it as ANSI, even when they talk about Unicode, which is why you get problems like those in the DrawThemeText function. Ignore the weird text for a moment:

    DrawThemeText uses parameters similar to the Microsoft Win32 DrawText function, but with a few differences. One of the most notable is support for wide-character strings. Therefore, non-wide strings must be converted to wide strings, as in the following example.

    You know, text handled by people who were not aware that DrawText has a Unicode version. And just look at the code sample:

    INT cchText = GetWindowTextLength(_hwnd);
    if (cchText > 0)
    {
      TCHAR *pszText = new TCHAR[cchText+1];
      if (pszText)
      {
        if (GetWindowText(_hwnd, pszText, cchText+1))
        {
          int widelen = MultiByteToWideChar(CP_ACP, 0, pszText, cchText+1, NULL, 0);
          WCHAR *pszWideText = new WCHAR[widelen+1];
          MultiByteToWideChar(CP_ACP, 0, pszText, cchText, pszWideText, widelen);

          SetBkMode(hdcPaint, TRANSPARENT);
          DrawThemeText(_hTheme,
                        hdcPaint,
                        BP_PUSHBUTTON,
                        _iStateId,
                        pszWideText,
                        cchText,
                        DT_CENTER | DT_VCENTER | DT_SINGLELINE,
                        NULL,
                        &rcContent);

           delete [] pszWideText;
        }

        delete [] pszText;
      }
    }

    This is code that won't even compile if you try to compile it as UNICODE!

    Clearly, there are times where even the people who are moving forward and only providing Unicode versions to their functions are not necessarily thinking of a TCHAR as a type that could be either a CHAR or a WCHAR.
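
    To make the "either a CHAR or a WCHAR" point concrete, here is a minimal sketch of a helper that compiles both ways. It is deliberately portable rather than Win32: tchar_t, TXT, and ToWide are made-up stand-ins (not the real TCHAR, TEXT, or any SDK function), and std::mbstowcs stands in for MultiByteToWideChar. The idea is simply that the narrow build converts and the wide build passes the text through.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Illustrative stand-ins for the TCHAR machinery so the sketch builds anywhere.
// These are NOT the real <tchar.h> definitions.
#ifdef UNICODE
typedef wchar_t tchar_t;
#define TXT(s) L##s
#else
typedef char tchar_t;
#define TXT(s) s
#endif

// Returns a wide copy of a TCHAR-style string.
// Only the narrow build has any conversion work to do.
std::wstring ToWide(const tchar_t* text)
{
    assert(text != nullptr);
#ifdef UNICODE
    // Already wide: nothing to convert.
    return std::wstring(text);
#else
    // First call asks how many wide characters are needed (terminator excluded).
    std::size_t needed = std::mbstowcs(nullptr, text, 0);
    if (needed == static_cast<std::size_t>(-1))
        return std::wstring();  // invalid multibyte sequence

    std::vector<wchar_t> buffer(needed + 1);
    std::mbstowcs(buffer.data(), text, needed + 1);
    return std::wstring(buffer.data(), needed);
#endif
}
```

    A caller would write ToWide(TXT("Hello")) and hand the result to a wide-only function, with or without UNICODE defined.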

    And I am not casting stones here or anything. After all, I made the same kind of mistake in the other direction -- one I may never have noticed, since I was probably only ever going to compile and run the code with UNICODE/_UNICODE (just as I suppose people are anticipating those samples will be used by people who don't).

    It makes the whole "T" thing really a myth most of the time, you know? :-)

    So I think we should go ahead and make sure it will compile both ways, and do the work in the makefile to make sure it happens. Let's break the myth, at least for this particular sample at this particular moment....

    One way that some Platform SDK samples do this (like the StrOut sample, for example) is in addition to the makefile, having a makefile.uni that looks something like this (this is the StrOut one):

    #*************************************************************#
    #**                                                         **#
    #**                 Microsoft RPC Samples                   **#
    #**                   strout Application                    **#
    #**         Copyright(c) Microsoft Corp. 1992-1996          **#
    #**                                                         **#
    #** This is the makefile used when compiling for UNICODE.   **#
    #** It sets the flags it needs, and then call the regular   **#
    #** makefile.                                               **#
    #** To compile for ANSI type nmake at the command line      **#
    #*************************************************************#
    # FILE : MAKEFILE.UNI

    !include <ntwin32.mak>

    #include support for unicode
    cflags = $(cflags) -D_UNICODE -DUNICODE
    midlflags = -D _UNICODE

    #include library for CommandLineToArgvW function
    conlibsdll = $(conlibsdll) shell32.lib

    !include <makefile>

    Well, no COM and we don't use CommandLineToArgvW, so we don't need exactly this. But it gives one example of how samples are doing this. We'll just go with it. :-)

    The cynical side of me believes that if this does end up in the Platform SDK that this will work up until the next time it is updated for some other particular feature, since the whole "dual compiling system" doesn't exactly fit us to a 'T'.....

    (Other techniques here might include different config settings in the same makefile or environment variable dependencies, but I am aiming for the Platform SDK, so doing it the way they seem to may work in my favor!)

    But in any case, the next source code drop will include an updated makefile and a new makefile.uni and instructions about using them.

     

    This post brought to you by ཅ (U+0f45, a.k.a. TIBETAN LETTER CA)

  • Sorting it all Out

    UTF-8 and GB18030 are both 'NT' code pages, they just aren't 'ANSI' code pages

    • 6 Comments

    Michael Entin asks in the Suggestion Box:

    Hi Michael.

    I want to revisit UTF-8 discussion.

    In several posts you wrote that it is impossible to support UTF-8 as NT code page, since there is a lot of legacy code that assumes maximum of 2 bytes per char. So it is impossible to fix all this code to support UTF-8.

    I don't quite understand how then does Windows support GB 18030 encoding? It appears it is a very similar encoding, where a character can be encoded by up to 4 bytes.

    What are the differences between these two encodings? How come Windows can support one, but not the other?

    I believe he is referring to this post and/or this post and/or the comments in this one....

    And it is still true that UTF-8 (code page 65001) cannot be an ACP ("ANSI" code page) for a locale.

    But from a technical standpoint, neither can GB-18030 (code page 54936) -- for pretty much the same reason.

    The GB-18030 question is a bit more interesting since I am pretty sure there was an official request that we change the default system code page of the zh-CN locale to GB-18030, but unfortunately the answer was the same.

    These code pages are present so that people can convert data a user might run across out of and into them; they are not for the legacy ("ANSI") support in the Win32 API which, since The Unicode train is leaving the station, is not being added to or updated. So they work great in MultiByteToWideChar and WideCharToMultiByte, but the core OS is not going to be updated to work internally off of either one.
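
    To see why the "at most 2 bytes per character" assumption mentioned in the question is the sticking point, here is a small portable sketch that reads the length of a UTF-8 sequence off its lead byte. The function names are made up for illustration and no Windows API is involved. GB 18030 has the same property from the legacy code's point of view, since its characters take 1, 2, or 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (continuation or invalid byte).
std::size_t Utf8SequenceLength(unsigned char lead)
{
    if (lead <= 0x7F) return 1;                  // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // U+0080..U+07FF
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // U+0800..U+FFFF
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // U+10000..U+10FFFF
    return 0;
}

// Longest sequence length announced by any lead byte in `text`.
// Only lead bytes are inspected; returns 0 on a bad lead byte or truncated text.
std::size_t MaxBytesPerCharacter(const std::string& text)
{
    std::size_t longest = 0;
    for (std::size_t i = 0; i < text.size();) {
        std::size_t length = Utf8SequenceLength(static_cast<unsigned char>(text[i]));
        if (length == 0 || i + length > text.size())
            return 0;
        if (length > longest)
            longest = length;
        i += length;
    }
    assert(longest <= 4);  // UTF-8 never needs more than four bytes
    return longest;
}
```

    Any buffer sizing or "lead byte plus one trail byte" stepping logic written for double-byte code pages goes wrong as soon as this returns 3 or 4.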

    Now the job of moving the core OS onto one of these code pages would not be entirely impossible, though I suspect it is fairly improbable. I say this as someone who has written a Unicode Layer for Win9x Systems (and who was asked once by another company to write a UTF-8 Encoding Layer for NT, or UELNT, I guess?): this would require a serious and non-trivial development effort, whether one is inside or outside of Microsoft. There simply isn't a specific reason or benefit to doing it that would outweigh the cost.

    Now if I ever retired, that UELNT project might be something interesting to take a shot at if someone really wanted to fund it. But I would probably have to run out of other stuff to do first, and that doesn't seem likely to happen any time soon. :-)

     

    This post brought to you by ໜ (U+0edc, a.k.a. LAO KO LA)

  • Sorting it all Out

    Converting a Project to Unicode: Part 6 (Upon the road not traveled)

    • 10 Comments

    Previous posts in this series (including today's!):

    (If you are just tuning in and want to start now you can grab the current source from here -- no changes since it was posted yesterday)

    Now if you have read Parts 2, 3, 4, and 5 then you know how we went from a purely ANSI application to a purely Unicode one.

    The binary itself has been tested with the MSKLC update and it resolves the bug I talked about back in Part 0. And the Unicode Bootstrap EXE works for the scenarios in which it will be used.

    (which, as the lessons of Part 5 hopefully taught everyone, means that there could still be bugs in the other scenarios like internet downloads -- these will have to be tested by somebody at some point!).

    But perhaps people who have done this type of thing before felt uncomfortable with the route I took -- all of those global changes in parts 2 and 3 might seem quite different from the way of just compiling with UNICODE and _UNICODE and fixing errors as they come. Certainly the experience I set people up for earlier in Part 1 was a lot uglier than what actually happened. So why would I have done it that way, and what is the experience like if it is not done that way?

    Well, you start with a lot more errors, obviously. And due to the many dependencies in the files (like the header files, and all the functions in util.cpp used throughout the code), you can easily find yourself revisiting the same files over and over again as you compile all and continually break files that you just fixed.

    As to why I prepared for much more dire experiences, the Bootstrap EXE sample project was a pretty tame one, with a reasonably small number of changes to make beyond datatypes. Some cases are not quite as clean as that and can have many more -- some project you may apply the same plan to could be a lot more brutal in terms of number of errors...

    I really prefer not to take the harder route though, since you can easily miss cases -- for example, think of all the times that you have to catch sizeof(char) or sizeof(CHAR) and change it to sizeof(TCHAR). All you have to do is miss one and you'll hit bugs like the one in Part 5, caused by your Unicode migration rather than by pre-existing bugs. Bugs like that are not found at compile time, so you have to pay the price later in terms of bugs or problems you catch in unit testing. And in the rush to make changes, compile, make more changes, compile again, and so on, it is easier to miss things.

    Life is just a lot easier if the global changes can be made upfront so you can focus on the special cases....

    Of course you are welcome to try it if you like -- just do Part 4 after skipping parts 2 and 3....

    Tomorrow, Part 7 will be going up to do more than just jabber about stuff like this post did!

     

    This post brought to you by ܛ (U+071b, a.k.a. SYRIAC LETTER TETH)

  • Sorting it all Out

    Right behavior, wrong scenario

    • 4 Comments

    So the other day when I posted By design? Well, not beautifully so.... it was nice to see that Aaron Stebner posted on some of the underlying setup issues behind it in his post Why .NET Framework 2.0 language packs will not install correctly on Windows Vista, as well as how to work around the issue if you'd really rather just install the little buggers and not have the .NET Framework lie and claim that they are there when they are not.

    Though for what it's worth I disagree with both Aaron's and Marc's (original) characterization that the code I posted in Enumerating available localized language resources in .NET is flawed or broken. I am actually quite proud of the fact that my code honestly returns the information on the machine regardless of the weird licensing rules of Windows and the fibs that .NET Framework language packs tell!

    Anyway, this post is not about any of that, really.

    It is actually about an issue that Marc keeps bringing up, whether it is here or here or his comments on the feedback site here: basically an unhappiness with the way the download pages for the various language packs are localized into the language of the pack.

    Now the scenario that makes this reasonable is the one I described in If you can't read it, don't switch to it!. And as far as that goes, it is a reasonable scenario.

    Anyone who does not see the logic of the scenario is probably in denial.

    However, there is another scenario here, one that impacts both managed and unmanaged applications, and which actually is the real limitation that Marc is hitting.

    The scenario relates to the times that you need to install all languages on a server (like a web server or a terminal server).

    In my mind I think of it as "The Shawn Scenario™" since my colleague Shawn Steele has consistently been the most likely person on the NLS dev team to mention the "what if you were on a web server and..." kind of scenario in a dev meeting. :-)

    Now in that case, the person who would reasonably be expected to be doing the installation of UI language resources is not likely to know all of the languages in question, and thus requiring the person installing all of the languages to know all of them is unrealistic.

    Beyond the language issue, is it really reasonable to have them all be separate installs in this case? That is just time consuming and annoying, right? So it is not just the language issue that is annoying here; the fact that the langpacks exist only as separate installs conflicts with a perfectly reasonable scenario.

    Since the scenario of a web server having all of the .NET Framework's UI languages available for ASP.NET is just as reasonable as a terminal server that people all over the world log into having all of the MUI languages installed, it is clear that in addition to the need for installs whose UI language is the same as the language being installed, there needs to be coverage of this other scenario where all of the languages can be installed at once.

    Windows has this covered for MUI (and if you have all of the MUI langpacks in a single flat location you can install all of them at once), but currently the .NET Framework does not. In the long run, I think it would be a good scenario to cover.

    Perhaps they could even fix the other problem where it claims the languages are installed -- I have some code that will tell them the real install story if they need it! :-)

     

    This post brought to you by װ (U+05f0, a.k.a. HEBREW LIGATURE YIDDISH DOUBLE VAV)

  • Sorting it all Out

    Converting a Project to Unicode: Part 5 (Are we there yet? Well, not *just* yet)

    • 6 Comments

    Previous posts in this series (including today's!):

    (If you are just tuning in and want to start now you can grab the current source from here.)

    I am delaying the "road not traveled" post until tomorrow. Hang in there, it is coming!

    Now we have a project that compiles and links and produces a setup.exe. Does that mean we're done?

    Well, no. Because the project hasn't been tested yet. Like the First Tester's Axiom states, if you have not tested it, assume it is broken.

    Of course testing a bootstrap EXE is a bit tougher than the average project; running the current project gives you a nice error:

    You can look at the readme.htm file for information on how to plug in the various special properties that this (and most) SETUP.EXE files look for, settable via msistuff.exe. You could probably even puzzle out how msistuff.exe works if you looked closely at the source of setup.exe with an eye to understanding the functionality (as opposed to trying to convert a project to Unicode!).

    So the fact that we see this dialog means at the very least that some of this code works!

    Ah, but when you click that OK button, what happens next suggests a bug is there:

    Looks like something crashed. Since this does not happen with the non-Unicode version, common sense forces us to assume it is our bug. Let's take a look....

    The call stack of the crash:

    setup.exe!operator delete(void * pUserData=0xfdfdfdfd)  Line 52 + 0x3 bytes C++
    setup.exe!wWinMain(HINSTANCE__ * hInst=0x00400000, HINSTANCE__ * hPrevInst=0x00000000, wchar_t * lpszCmdLine=0x0002069c, int nCmdShow=0x00000001)  Line 927 + 0x18 bytes C++
    setup.exe!__tmainCRTStartup()  Line 324 + 0x35 bytes C
    setup.exe!wWinMainCRTStartup()  Line 196 C
    kernel32.dll!7c816fd7()  

    Obviously the problem would be in our wWinMain, not in operator delete. Let's take a look at the source code right around the crash line:

    if (szInstallPath)
        delete [] szInstallPath;

    Hmmm.... the cleanup code is crashing while trying to delete a string that is allocated on line 539 of setup.cpp:

    // canonicalize the URL path
    cchInstallPath = cchTempPath*2;
    szInstallPath = new TCHAR[cchInstallPath];

    Yet the actual error happened when "DATABASE" (actually ISETUPPROPNAME_DATABASE) was not found in setup.exe, hundreds of lines earlier (lines that would be run in RED):

        // Determine if this is a patch or a normal install.
        if (ERROR_OUTOFMEMORY == (uiRet = SetupLoadResourceString(hInst, ISETUPPROPNAME_DATABASE, &szMsiFile, dwMsiFileSize)))
        {
            ReportErrorOutOfMemory(hInst, DownloadUI.GetCurrentWindow(), szAppTitle);
            goto CleanUp;
        }
        else if (ERROR_SUCCESS != uiRet)
        {
            // look for patch
            if (ERROR_OUTOFMEMORY == (uiRet = SetupLoadResourceString(hInst, ISETUPPROPNAME_PATCH, &szMsiFile, dwMsiFileSize)))
            {
                ReportErrorOutOfMemory(hInst, DownloadUI.GetCurrentWindow(), szAppTitle);
                goto CleanUp;
            }
            else if (ERROR_SUCCESS != uiRet)
            {
                PostResourceNotFoundError(hInst, DownloadUI.GetCurrentWindow(), szAppTitle, ISETUPPROPNAME_DATABASE);
                goto CleanUp;
            }

            fPatch = true;
        } 

    In other words, the code to allocate that buffer was never run!

    We had better take a closer look at the definition of that variable, shouldn't we? :-)

    The intent is straightforward enough:

    TCHAR *szInstallPath      = 0;

    (though I probably would set pointers to NULL rather than 0, just as a matter of personal preference). In any case, clearly something is not working -- something is overwriting stack here, as all of these variables are on the stack.

    Looking at the actual value at the time of the crash, it is 0xfdfdfdfd, which seems a bit too suspicious to be an actual pointer (especially since all of the surrounding string variables have the same value!). Looking at Funny Memory Values, this is:

    Microsoft Visual C++ compiled code with memory leak detection turned on. Usually, DEBUG_NEW was defined. Memory with this tag signifies memory that is in "no-mans-land." These are bytes just before and just after an allocated block. They are used to detect array-out-of-bounds errors. This is great for detecting off-by-one errors.  

    So it does look like perhaps memory leak detection is writing a bit past one place or another. Let's again assume it is us and just debug.

    Stepping through the code and looking at all the surrounding variables (the pointers themselves are on the stack, though any memory they point to, once allocated, is on the heap), szOperation has a bunch of memory set to 0xcdcdcdcd since there is no operation, and this is then deleted and set to 0xdddddddd and so on before it is finally set to the string "DEFAULT." We then go through the same dance for szProductName, which is eventually set to "the product", and szTitle, which is set to "Please wait while '%s' is downloaded...".
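
    As a quick reference for the three values seen so far, here is a small sketch that classifies a run of bytes against the fill patterns the Visual C++ debug heap uses. The enum and function names are made up for illustration; only the three byte values come from the debugging session above.

```cpp
#include <cassert>
#include <cstddef>

// Fill patterns written by the Visual C++ debug heap.
enum class DebugFill {
    None,        // not a recognized uniform fill
    CleanHeap,   // 0xCD: allocated but never initialized
    FreedHeap,   // 0xDD: memory that has been freed
    NoMansLand   // 0xFD: guard bytes just before and after a heap block
};

// Classifies `count` bytes (for example, the bytes of a suspicious pointer value).
// Works for any width, so it covers 32-bit and 64-bit pointers alike.
DebugFill ClassifyDebugFill(const unsigned char* bytes, std::size_t count)
{
    assert(bytes != nullptr && count > 0);

    // A fill pattern repeats one byte; anything else is probably real data.
    for (std::size_t i = 1; i < count; ++i) {
        if (bytes[i] != bytes[0])
            return DebugFill::None;
    }

    switch (bytes[0]) {
    case 0xCD: return DebugFill::CleanHeap;
    case 0xDD: return DebugFill::FreedHeap;
    case 0xFD: return DebugFill::NoMansLand;
    default:   return DebugFill::None;
    }
}
```

    A "pointer" that classifies as NoMansLand, like the 0xfdfdfdfd here, was almost certainly overwritten by guard bytes rather than ever being assigned an address.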

    And here is where we find the problem. It is in this line of code:

        StringCchPrintf(szBanner, sizeof(szBanner), szText, szProductName);

    In most of the project they are properly using expressions like sizeof(szBanner)/sizeof(szBanner[0]) or at least sizeof(szBanner)/sizeof(TCHAR) for these situations, but in this case the code wasn't. And yes, we have a cb vs. cch bug. Looking in the project there are 11 other occurrences with the StringCchPrintf function, so let's fix all of the ones that need it (about seven of them are using sizeof() incorrectly this way).
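
    The cb vs. cch difference is easy to demonstrate without any Windows headers. In this minimal sketch, ByteCount, CharCount, and CopyBounded are hypothetical helpers (CopyBounded is only a stand-in for a cch-style function, not the real StringCchCopy). It shows why the bug hid in the ANSI build: for char buffers the two counts are equal, and only for wide buffers does sizeof() overstate the room available.

```cpp
#include <cassert>
#include <cstddef>

// What sizeof(buffer) yields: a count of BYTES ("cb").
template <typename T, std::size_t N>
constexpr std::size_t ByteCount(const T (&)[N]) { return N * sizeof(T); }

// What sizeof(buffer)/sizeof(buffer[0]) yields: a count of CHARACTERS ("cch").
template <typename T, std::size_t N>
constexpr std::size_t CharCount(const T (&)[N]) { return N; }

// Hypothetical cch-style copy: writes at most cchDest wide characters,
// terminator included. Returns false if the text had to be truncated.
bool CopyBounded(wchar_t* dest, std::size_t cchDest, const wchar_t* src)
{
    assert(src != nullptr);
    if (dest == nullptr || cchDest == 0)
        return false;

    std::size_t i = 0;
    for (; i + 1 < cchDest && src[i] != L'\0'; ++i)
        dest[i] = src[i];
    dest[i] = L'\0';
    return src[i] == L'\0';
}
```

    Passing ByteCount(buffer) where CharCount(buffer) is expected tells the callee that a wchar_t buffer is larger than it really is, which is exactly how the neighboring variables got stomped.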

    Once we do this, the crash goes away.

    Root Cause Analysis of the bug: The introduction of the StringCchPrintf and related functions was after that of the bootstrap exe, so it is likely fair to say that this is a bug in the bootstrap sample caused by dev error in the port to the safe string functions and exposed by our little conversion project here. :-)

    Looking at the whole project, there are 38 occurrences of safe string handling functions starting with StringCch* and none starting with StringCb*, though all of the others are StringCchCopy and StringCchCat calls, and none have similar errors. It is worth looking at these cases, under the "where there is one problem there may well be others" theory that works well in so many situations.

    As this problem (which I knew about) and also the problem that regular reader Mihai pointed out the other day (which I didn't) indicate, it is obviously important to test prior to being sure that everything is correct....

    Plus we'll talk about a few other things in upcoming posts.

    Stay tuned for more on the alternate way parts 2, 3, and 4 could have been done, tomorrow!

     

    This post brought to you by ដ (U+178a, a.k.a. KHMER LETTER DA)

  • Sorting it all Out

    Report of an IME that splits and separates more Hangul by 9 am than most IMEs do all day

    • 9 Comments

    (apologies for the title; I fought it for days, but it just kind of had its way with me and refused to back down!) 

    The question that came in the other day:

    ok, i read about your theory thing about Hangul, and i was actually having a trouble with installing Korean on my keyboard... i did install Korean, but however, it comes out to be the second type of korean that you talked about where all the letters come out seperately, not connected... also the place of the letters are different from my previous computer... i need to talk to my family in Korea, but i can't type anything correctly in Korean... please help me. i want 한국어, not 한국어, you know? please help me. thank you

    I am pretty sure this was a reference to that post from this past July (We're off on the road to Korea! We certainly do get around...) or maybe the original one it linked to (Traditional versus modern sorts).

    But as far as I know, there is no IME that Microsoft ships in Windows that ever produces separated Jamo (conjoining or otherwise) on anything but incomplete sequences (and even then it is only the non-conjoining ones); they all produce Hangul. Which suggests three possible answers here:

    • A separate keyboard using MSKLC was produced, in the hopes that it would build up Hangul syllables like an IME might do (and like the Korean IME does);
    • An IME coming from someone other than Microsoft was installed, and it is this IME that is being referred to;
    • Some third possibility that did not occur to me at the time but which a follow up comment (from the original person asking or a reader of this blog) might make clear.

    For now, the solution in any case would be to switch to the built in IME and let it construct Hangul syllables as needed. :-)
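
    For anyone curious what "constructing a Hangul syllable" involves, the arithmetic defined by the Unicode Standard for composing conjoining Jamo is short. This is a minimal sketch of that arithmetic only; ComposeHangul is a made-up name, and nothing here is a claim about how the Korean IME is actually implemented (an IME starts from keystrokes, not from conjoining Jamo).

```cpp
#include <cassert>
#include <cstdint>

// Constants from the Unicode Hangul syllable composition algorithm.
constexpr char32_t kSBase = 0xAC00;  // first precomposed syllable
constexpr char32_t kLBase = 0x1100;  // first leading consonant
constexpr char32_t kVBase = 0x1161;  // first vowel
constexpr char32_t kTBase = 0x11A7;  // one before the first trailing consonant
constexpr char32_t kLCount = 19, kVCount = 21, kTCount = 28;

// Composes a leading consonant, a vowel, and an optional trailing consonant
// (pass 0 for none) into one precomposed syllable. Returns 0 if any input
// is outside the modern conjoining Jamo ranges.
char32_t ComposeHangul(char32_t lead, char32_t vowel, char32_t trail = 0)
{
    if (lead < kLBase || lead >= kLBase + kLCount) return 0;
    if (vowel < kVBase || vowel >= kVBase + kVCount) return 0;

    char32_t tIndex = 0;
    if (trail != 0) {
        if (trail <= kTBase || trail >= kTBase + kTCount) return 0;
        tIndex = trail - kTBase;
    }

    const char32_t lIndex = lead - kLBase;
    const char32_t vIndex = vowel - kVBase;
    const char32_t syllable = kSBase + (lIndex * kVCount + vIndex) * kTCount + tIndex;

    assert(syllable >= kSBase && syllable < kSBase + kLCount * kVCount * kTCount);
    return syllable;
}
```

    For example, ComposeHangul(0x1112, 0x1161, 0x11AB) gives U+D55C, the first syllable of 한국어.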

    On an only somewhat related note, I have gotten lots of positive feedback from native speakers of Korean both inside and outside of Korea who have expressed frustration at the position of standards organizations within Korea and the negative impact it has had within Unicode, ISO's WG2/10646, and within country. The distance between the people working on these standards and the actual users of the language grows every day.

    I suppose you could say I serve Microsoft (those less charitable would call me a shill for Microsoft), though since I have been strongly encouraged to keep the customer focus for the entire time I have been here, it is probably more accurate to say I serve customers, or at least try to. Given that, my frustration about the situation grows every day as well....

    I also have to wonder whether Apple gets into any trouble over their preference for Normalization Form D, given Korea's antagonism toward it as it applies to Hangul/Jamo? Wouldn't support on the Mac actually force this whole issue once and for all?

    Microsoft, as a mostly "Normalization Form C" shop that is kind of late to the Unicode Normalization game in terms of support, has less of a vested interest in the C vs. D debate here than Apple does, other than the whole customer thing I mentioned!

    I don't have a Mac or I'd probably be testing this all out more just because it is an interesting area to me (to be honest, at this point I don't know where I would put a Mac if I had one, though I suppose if I won one in a contest or something I'd probably go buy a bigger table!). But are there any Mac users out there who know about the story with Korean on a Mac and how it is supported?

     

    This post brought to you by ᄒ (U+1112, a.k.a. HANGUL CHOSEONG HIEUH)
