
April, 2007

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Rhymes with Amharic (a.k.a. How about a little breakfast embed, dear?)

    • 19 Comments

    I have a lot of ideas for blog posts that are on my generic "to do" list.

    In fact, any time someone suggests a potential topic these days, I already have the topic on my list of things to cover some day....

    I was looking at my blog summary page a moment ago and I realized that this is going to be blog post #1708.

    Apropos of nothing, you might be thinking. But I'll explain why this was interesting to me.

    You see, I tend to think that there are a few core posts that I do which have a lot more to do with real influencing/assistance, like The jury will give this string no weight or the Converting a project to Unicode series or the Private fonts: for members only post or the Getting all you can out of a keyboard layout series.

    I have been building this one up in my mind for a while now -- in fact, since I first talked about it last year in Font embedding -- the intro: a sample that really shows how font embedding can work. I hadn't gotten to it yet, but it was on the list.

    Then a few days ago Scott Hanselman asked me:

    I’ve seen the Custom Culture stuff, but I’m wondering if anyone’s done a sample (and with what font) showing Amharic on Vista? I’d like to post about it and enable some Ethiopians.

    I had to remind him that we actually added Amharic as a locale to Vista (as I sometimes have to do with Scott!), and it did suggest to him something that really might be important:

    Hm…I’ll try making a WinForms app in Amharic…I’ll let you know. Since Vista [h]as am-ET I guess we don’t need it…although, it’d be nice to talk about how to write a WinForms app that is ONE SOURCE, TWO OS’s. Meaning, it would know what to do on XP vs. Vista. Can we copy the font over to XP?

    That “straddling” sample would be VERY valuable for those languages that were added in Vista.

    Now copying the font file for Nyala is indeed a violation of the EULA, even to another Windows box. But it suddenly occurred to me that this might be the perfect time to provide a font embedding sample!

    After a bunch of work squeezed in between other work and meetings and email and such, the sample was ready (and by the way, special thanks to Sergey Malkin and David Brown for their assistance here!).

    Warning: do not violate the license for any font file from Microsoft or any other source. You can use the licensing information in the Font Properties Extension to find out if you are allowed to do it!

    First a few gratuitous screen shots of the sample, on Vista (with higher DPI settings):

    and on Server 2003 (which does not have the font or the locale or anything, and with ClearType turned off):

    and on XP SP2 (again without the font or the locale or anything, and with ClearType turned on):

    Notice how the bottom TextBox control does not show the text on the platforms that do not have the font, while all of them can display the text in that one on top.

    And in fact if you used a custom culture to add am-ET, also known as Amharic (Ethiopia) or አማርኛ (ኢትዮጵያ), one can get even more of the support running on both platforms, just as Scott was hoping for!

    Ok, enough with the build-up, let's jump in....

    You can download the project from here. It basically relies on a few of the font embedding API functions:

    • TTEmbedFont -- given a device context containing a specific font that is legal to embed, creates the compressed binary file that can be embedded;
    • TTLoadEmbeddedFont -- given that compressed file, uncompresses it and turns it into a font that can be used within the process;
    • TTDeleteEmbeddedFont -- removes the embedded font's information when you are done with it.

    The sample was a bit more involved as it had to make use of the PrivateFontCollection class to load the font within WinForms, because the load is only valid within the process but GDI+ does not load any font that is not known to it at the time it has started up. Luckily, by using a technique similar to the one I used in Private fonts: for members only, you can load up the font that is ready to go in GDI/Uniscribe and cause it to be available to your managed application controls as well!

    The logic is:

    • On >= Vista, if the embedded font file is not there, it is created.
    • On all platforms, it tries to use the font, loading it up into a private name so it won't have trouble loading on platforms that contain the font.

    NOTE: The sample download does NOT include the binary file containing the embedded font file. To get that file you have to run the sample on Vista and it will create a ~150kb file named "NyalaSIAO.bin" in the same directory as the EXE. From there you can put the EXE and the .BIN file on the downlevel machine and display Amharic in your application to your heart's content, provided you are just using it in your application.

    In the real world you probably would not set up your application the way I did the sample -- you would probably embed the font as a resource like that post about private fonts did, and you'd likely only create it during development, not on the user's machine later. But it should be enough to get you started....

    I will talk more about the code soon (and the embedding support and what happens with it) in an upcoming post. :-)

    And I'll probably do an unmanaged sample too, at some point. Because I knew even 15 years ago that when someone at Microsoft talks about how easy something is to use, if they provide no samples for it, even after years pass, then we might well be full of crap and it might actually be hard.

     

    This post brought to you by ኢ (U+12a2, a.k.a. ETHIOPIC SYLLABLE GLOTTAL I)

  • Sorting it all Out

    Rhymes with Amharic #4 (a.k.a. we're all [sub]set so turning out the lights and going to [em]bed!)

    • 14 Comments

    (see also parts 1, 2, and 3)

    OK, we are getting close to the end of this little mini-series....

    First there was a comment from Dennis E. Hamilton asking about the DPI in the screen shots of that first post:

    I notice two things here.  First, the impact of Cleartype is amazing.  Secondly, the Vista rendering seems fuzzy somehow and not as crisp as the XP SP2 Cleartype.  I realize the sizes are different, with different assumed resolutions, but the subjective experience at scale is important.  (I think I see this on my Vista-equipped Tablet PC too, so I really wonder ... )

    If you use the same DPI, how well does Vista match the XP Cleartype case?

    I decided to engage in a bit of experimental DPI viewing. I used the funniest string from Why that is positively Ethiopic! (፳፩፼፳፰፻፷፯፼፶፫፻፱) and took screen shots at 96, 120, 134, and 144 DPI (note that I did not change the sample application; I just pasted the string into the TextBox controls):

    I'll let you decide on your own about the quality (the code was unchanged so it was trying to use a 32pt size for the font in all four cases).... :-)

    Now the additional issues to keep in mind here for font embedding....

    First we'll take a look at Nyala's OpenType support from 10,000 feet:

    Notice that it does have some OpenType tables that provide support for Ethiopic, though the main reason to consider Ethiopic to be a complex script is that undocumented sixth reason to be considered a complex script that I described in Font Linking vs. Font Fallback, #2.

    The technology provides the selected [subset of the] font and allows you to embed it in your application if the font's licensing restrictions allow it. But it does not give you updates to shaping engines and it does not give you pieces of the font that are excluded by subsetting decisions you might make. In the end, this means that any time the language you are trying to display is a complex script, the proper display might be limited by what the machine itself can support (for an example of this imagine some of the complex scripts added in Vista like Khmer or Sinhalese or Tibetan and try to imagine displaying them in Windows 2000!).

    XP SP2 will actually do very well here, much better than one might expect at first. It turns out that the update to the Uniscribe shaping engines provided by the update I first described in Lions and tigers and bearsELKs, Oh my! included some of the (not completely finished but certainly in progress) updates that eventually made their way into Vista. So you may find you have better luck in XP SP2 than in most other downlevel platforms. But no matter what, you will always have the constraint of the platform's support to contend with.

    This also applies to keyboards (you'll have to provide them via MSKLC or whatever), locale support (custom cultures, anyone?), or collation (currently no solution for this one, sorry -- so beware this problem!).

    You will probably always want to be using .NET Framework >= 2.0 so that you can use Uniscribe and not be limited to what GDI+ supports.

    And then when you are done, be sure to call TTDeleteEmbeddedFont as the sample does in its FormClosing event. And in your real world samples you probably should embed the font in your application's resources and then just use a MemoryStream rather than a FileStream to read it (though even if you treat it as a file like the sample does, there is not really anything else that you can do with the file anyway....).
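
    As a rough sketch of the MemoryStream idea, here is what a read-side callback might look like if it is modeled on the sample's write callback. The name ReadEmbedProc, its parameter list, and the EmbeddedFontReader class are all hypothetical; the actual callback shape that TTLoadEmbeddedFont expects is not shown in this post, so treat this only as an illustration of copying from a managed Stream into an unmanaged buffer.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

internal static class EmbeddedFontReader
{
    // Hypothetical read callback: copies up to cbBuffer bytes from a managed
    // stream into the unmanaged buffer and returns the number of bytes copied.
    internal static uint ReadEmbedProc(Stream source, IntPtr lpvBuffer, uint cbBuffer)
    {
        byte[] chunk = new byte[cbBuffer];
        int total = 0;

        // Stream.Read may return fewer bytes than requested, so keep reading
        // until the request is satisfied or the stream is exhausted.
        while (total < chunk.Length)
        {
            int read = source.Read(chunk, total, chunk.Length - total);
            if (read == 0)
            {
                break;
            }
            total += read;
        }

        Marshal.Copy(chunk, 0, lpvBuffer, total);
        return (uint)total;
    }
}
```

    Because the parameter is a plain Stream, the same callback would work whether the bytes come from a FileStream or from a MemoryStream wrapped around a resource.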

    As a final note, let me once again remind people to follow the licensing restrictions of the font you want to use. Your font foundry will thank you for your attention to this particular detail!

     

    This post brought to you by ጵ (U+1335, a.k.a. ETHIOPIC SYLLABLE PHE)

  • Sorting it all Out

    If you find that GetLocaleInfo is driving you crazy, it may not be the right function to use

    • 14 Comments

    Aaron asks (via the Contact link):

    I apologize for this totally unsolicited email, but I'm starting to wonder if I'm crazy or not.  I'm using GetLocaleInfo to determine what language the user wants their UI strings displayed in.  I'm using LOCALE_SISO639LANGNAME and LOCALE_SISO3166CTRYNAME to get a string such as en-US. 

    The part where I'm confused is where GetLocaleInfo is grabbing that information from.  In the XP regions and languages control panel, there's a strangely worded option in the "Advanced" tab called "Language for non-Unicode programs."  When I set that to something like "French (france)", I still get en-US instead of fr-FR.

    The reason I'm doing this is because we have a custom localization scheme for our application, which runs on Windows 98 through Vista.  So we are not technically a Unicode application (we don't #define UNICODE), but we support Unicode in that we dynamically load the W version of every API we come across and prefer that to the A version (and our strings are encoded accordingly).  So does that option in the Advanced tab even apply to our application?  If it does, how would I get that information?

    How far gone is my misunderstanding of things?  ;-) 

    Thanks for your time!

    GetLocaleInfo does not grab its information from any setting in Regional Options. It grabs its info from its own internal database of information, based on the locale you pass it.

    The "Language for non-Unicode Programs" is also known as the "Default System Locale" and really is not a good setting upon which to base a localization strategy, for way too many reasons to enumerate fully (but the fact that the intent is to provide the locale to use for conversions between Unicode and "ANSI" ought to be reason enough on its own).

    If you really wanted to get the information from this setting via GetLocaleInfo, you could just pass LOCALE_SYSTEM_DEFAULT as the LCID. But like I said, you do not want to use that setting (you claimed to want to support Windows 98, Aaron -- this setting is not changeable in Windows 98).

    Now you could in theory use the "Standards and Formats" setting, also known as the default user locale (you would use LOCALE_USER_DEFAULT in that GetLocaleInfo call). It has the advantage of being settable on all platforms, if nothing else.

    Clearly though, it is not intended to drive the UI language, and thus if you made it drive UI language in your application you would be providing confusing UI to the user.

    But if you think about that locale list for a moment, the odds that it will match your list of UI languages for your application are probably pretty close to nil. So you do not miss much by not using that setting.

    Now I could claim that you should use the results of the user interface language functions provided by MUI, but to be honest, even though that has the advantage of being an accurate setting, it really isn't likely to match the UI languages of your application either.

    (Look on the right side of the page and expand the one that says Regional Options for more information on what each setting there is generally for.)

    And also, not every version of Windows supports the two-letter ISO codes (and LOCALE_SISO639LANGNAME and LOCALE_SISO3166CTRYNAME are often the three letter codes from which you cannot deterministically derive the two letter codes), not to mention the fact that from time to time some of them have been wrong. So calling Win32 NLS API functions at runtime to get the language tags to use on any version of Windows from Win98 to Vista just seems like a bad idea.

    If you are providing a localized copy of your application then you can default to the UI language of the operating system and then you should honestly provide your own user interface to let them change it, based on the list of localized versions of your application that you support. The various lists that Windows provides aren't actually good ways to choose your UI language (beyond that possible idea of the initial one you might choose via GetUserDefaultUILanguage when it is available -- that function is included in almost every version you need other than Win98; it even is there on WinME).
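
    To make that concrete, here is a minimal sketch of matching a starting language tag against the list of languages the application actually ships. Everything in it (the UiLanguagePicker class, the Pick method, the idea of trimming "fr-CA" down to "fr") is an illustrative assumption, not anything Windows provides. In a managed application the starting tag might come from CultureInfo.CurrentUICulture.Name; a native application would start from GetUserDefaultUILanguage, which returns a LANGID rather than a name, and that mapping is left out here.

```csharp
using System;
using System.Collections.Generic;

internal static class UiLanguagePicker
{
    // Returns the best match for 'preferred' among the languages the
    // application ships, trying "fr-CA" and then "fr" before giving up
    // and returning 'fallback'.
    internal static string Pick(string preferred, ICollection<string> supported, string fallback)
    {
        string candidate = preferred ?? string.Empty;
        while (candidate.Length > 0)
        {
            foreach (string s in supported)
            {
                if (string.Equals(s, candidate, StringComparison.OrdinalIgnoreCase))
                {
                    return s;
                }
            }

            // Drop the last "-xx" segment and try the more general tag.
            int dash = candidate.LastIndexOf('-');
            candidate = dash > 0 ? candidate.Substring(0, dash) : string.Empty;
        }
        return fallback;
    }
}
```

    The result is only the initial default; a choice the user makes in the application's own language picker would be stored by the application and take precedence over it.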

    Thus my guess in the title of this post, Aaron -- the reason GetLocaleInfo is driving you crazy is that it is really not the function your application should be using here. It is driving you crazy for the same reason that a pair of pliers would drive you crazy for fixing a hangnail....

     

    This post brought to you by ჯ (U+10ef, a.k.a. GEORGIAN LETTER JHAN)

  • Sorting it all Out

    A picture that can't be easily described with words

    • 13 Comments

    If only you could include screen shots in Microsoft Knowledge Base articles.

    I mean, could you imagine being in Product Support and trying to write up the text description for the following for a KB article?

    Actually, a customer named Pavel had the same problem trying to describe the above when trying to find out about the situation in a newsgroup posting from a month or two ago.

    After I expressed some confusion about the description, he finally put up a screen shot and we started digging....

    It was not too long before we had a repro -- though only on some machines, and with no indication yet as to the cause.

    Of course here we have access to checked builds; it turns out that just before this bug comes up, two asserts with huge callstacks come up -- they look something like this:

    Wow, now that is disturbing.

    But that assert gives just the hint that was needed -- if you look at the item third from the bottom, we are actually near the end of the main form's InitializeComponent call -- this is the method that Visual Studio puts in that does all the grunt work of creating all of the controls and setting all of the properties of the form.

    It seems there is a specific case where the Resize event is being called long before any of the code in MSKLC runs -- like before the code initializes the "start size" of a bunch of controls. So those values are zero and when calculating the ratio between the new size and the original one, we divide by zero (since the uninitialized value is zero).
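
    The MSKLC resize code itself is not shown here, so the following is only a sketch of the kind of guard that would make an early Resize event harmless. The ResizeMath class and ScaleFactor method are hypothetical names; the assumption is simply that an uninitialized "start size" is still zero and that leaving the layout unscaled is the right thing to do in that case.

```csharp
using System;

internal static class ResizeMath
{
    // Returns newSize / startSize, or 1.0 when startSize has not been
    // initialized yet (still zero), so an early Resize event scales nothing.
    internal static double ScaleFactor(int newSize, int startSize)
    {
        if (startSize <= 0)
        {
            return 1.0;
        }
        return (double)newSize / startSize;
    }
}
```

    A Resize handler would call something like this for each dimension before repositioning controls, instead of dividing directly.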

    Further digging, and we found that this only occurs with certain DPI values (120dpi or thereabouts), and only when the "XP compatible" mode is specified for the DPI.

    In fact, if you make sure not to click the "Disable display scaling on high DPI settings" for MSKLC or a shortcut to it (on the Compatibility tab):

    then you can avoid the problem too.

    As to why this event is being called prematurely (yes, technically resizing happens in InitializeComponent but this code has been around since 2001 without ever hitting this problem before so it appears to be a legitimate regression either in Vista or in WinForms in a set of circumstances narrow enough to mean that most people will never see the problem!), who knows?

    That will be followed up on in due course -- it might even make a KB article at some point, since it is much easier to put into words.

    Anyway, once you make sure you aren't using the XP compatible setting for MSKLC, everything looks good even on a machine that repros the problem:

    But I am suddenly really not liking DPI settings for some reason. :-(

    Anyway, with a whole bunch of workarounds available (and the relative paucity of reports -- just one so far! -- of the bug despite the huge number of MSKLC downloads), an immediate fix is presumably a hard sell. Especially since the real bug is in an event that is firing long before it is ever supposed to in a circumstance it is not documented as being supposed to fire.

    But even if it isn't going to be fixed until a future version of MSKLC or WinForms or whatever, it seemed worth a blog post, if only to show off that cool "keyboardless" MSKLC screen shot:

    It isn't like the KB could do much without adding screen shot support. :-)

     

    This post brought to you by ㉃ (U+3243, a.k.a. PARENTHESIZED IDEOGRAPH REACH)

  • Sorting it all Out

    Rhymes with Amharic #2 (a.k.a. Before you embed, you have to build something to embed)

    • 10 Comments

    (see here for the first part) 

    The first part of the code centers around a call to TTEmbedFont. It only runs on Vista and above (since no one else should have the font on their machine!):

    IntPtr hDC = CreateDC("DISPLAY", IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);
    if (hDC != IntPtr.Zero) {
        IntPtr hFont = CreateFont(MulDiv(Convert.ToInt16(siz), GetDeviceCaps(hDC, LOGPIXELSY), 72),
                                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, "Nyala");
        if (hFont != IntPtr.Zero) {
            IntPtr hFontOld = SelectObject(hDC, hFont);
            if (hFontOld != IntPtr.Zero) {
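                // Write out the embedded font data. FileMode.CreateNew throws if the
                // file already exists, so this runs only when the file is not there yet.
                // (rc, ulPrivStatus, siz, and FONTNAME are not declared in this excerpt;
                // presumably they are defined elsewhere in the sample.)
                uint ulStatus = 0;
                FileStream fsWrite = new FileStream(FONTNAME, FileMode.CreateNew);
                WRITEEMBEDPROC wep = new WRITEEMBEDPROC(this.WriteEmbedProc);
                TTEMBEDINFO ttie = new TTEMBEDINFO();

                // No root strings are supplied, so no link checking is requested.
                ttie.usStructSize = Convert.ToUInt16(Marshal.SizeOf(ttie));
                ttie.usRootStrSize = 0;
                ttie.pusRootStr = IntPtr.Zero;
                ulPrivStatus = 0;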
                rc = TTEmbedFont(hDC,
                                 TTEMBED.RAW | TTEMBED.TTCOMPRESSED,
                                 CHARSET.UNICODE,
                                 out ulPrivStatus,
                                 out ulStatus,
                                 wep,
                                 fsWrite,
                                 IntPtr.Zero,
                                 0,
                                 0,
                                 ttie);
                fsWrite.Flush();
                fsWrite.Close();
                if (rc != E.NONE) {
                    // Since creation of the file ultimately failed, delete whatever
                    // interim bits might have been written.
                    File.Delete(FONTNAME);
                }
                SelectObject(hDC, hFontOld);
            }
            DeleteObject(hFont);
        }
        DeleteDC(hDC);
    }

    You'll notice that I am passing the flags to include the raw font and not a subset of it. My initial reason for that was the suggestion from some people that the subsetting would not pick up any of the different forms of glyphs that might be available. But Sergey actually told me that the code is rather generous at including all of the alternate forms and glyphs that could potentially be derived from the ones that are specified, so in the situation where the text is static, subsetting the font may be worthwhile (and will certainly make for a smaller file!).

    If one were going to subset, it would be a matter of putting all the text in a string, changing the IntPtr in the pinvoke declare of pusCharCodeSet to a string, and then passing it (after all, what else is a string but an array of ushort values?). :-)
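
    If passing the string directly feels a little too clever, an explicit array is another option. The sketch below just collects the distinct UTF-16 code units from the text; the SubsetChars class and CodeUnits method are hypothetical names, and whether the subsetting code cares about duplicates is an assumption on my part rather than anything stated above. Using the result would also mean declaring pusCharCodeSet as a ushort[] in the pinvoke declare and switching the flags to request a subset.

```csharp
using System;
using System.Collections.Generic;

internal static class SubsetChars
{
    // Collects each distinct UTF-16 code unit in 'text', in first-seen order.
    internal static ushort[] CodeUnits(string text)
    {
        var seen = new HashSet<ushort>();
        var result = new List<ushort>();
        foreach (char c in text)
        {
            // char converts implicitly to ushort; Add returns false for repeats.
            if (seen.Add(c))
            {
                result.Add(c);
            }
        }
        return result.ToArray();
    }
}
```

    The array's Length would then be the character count to pass alongside it.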

    The key piece of code that does this part of the work is that WriteEmbedProc. To be honest, I am not entirely happy with it. You may see why if you look at it:

    [UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl, CharSet=CharSet.Unicode)]
    internal delegate uint WRITEEMBEDPROC(FileStream lpvWriteStream, IntPtr lpvBuffer, uint cbBuffer);

    internal uint WriteEmbedProc(FileStream lpvWriteStream, IntPtr lpvBuffer, uint cbBuffer) {
        byte[] rgbyt = new byte[cbBuffer];
        Marshal.Copy(lpvBuffer, rgbyt, 0, (int)cbBuffer);
        lpvWriteStream.Write(rgbyt, 0, (int)cbBuffer);
        return cbBuffer;
    }

    Okay, so because I am using the .NET FileStream class to do the writing, I am forced to do that extra bit of copying into a byte array that I'd rather avoid. You know, just something to write from that lpvBuffer pointer directly to the file. But the actual hit is small (it is a small file, after all!), so I just kind of thought it would be worth earmarking as an area to potentially revisit if performance became a problem. In the meantime, it does get the job done....

    I also chose not to get involved with the whole TTEMBEDINFO structure and its link checking, though people looking at the sample might see it as worthwhile to look into (this is why I bothered to define the struct rather than just making it an IntPtr and passing IntPtr.Zero in this case).

    Anyway, when everything is done you end up with a nice little binary file that can be used in your application that needs to display text that may not be available....

    In the next post I'll talk about the harder bit, which is actually loading that file....

     

    This post brought to you by ት (U+1275, a.k.a. ETHIOPIC SYLLABLE TE)

  • Sorting it all Out

    Some new Vista LIPs have been released!

    • 9 Comments

    (I am still on vacation, but some news just shouldn't have to wait!) 

    I promised that the हिन्दी (Hindi) LIP would be the first of many.... and now they have made good on this promise. There are in fact two LIPs now available. One for Català (Catalan) and one for српски / srpski (Serbian Cyrillic)!

    Some fun facts about each of them:

    Catalan

    Number of speakers: 6.6 million plus about 5 million second-language users

    Name in the language itself: Català

    Catalan is spoken in the Spanish regions of Catalonia, Valencia and the Balearic Islands but also in Andorra (a small independent nation between Spain and France), the French region of Roussillon and on the Italian island of Sardinia. It has the status of official regional language in Catalonia, Valencia and the Balearic Islands, and it is the official national language of Andorra.

    Catalan, probably developed by the 9th century, was a prominent language in the Western Mediterranean region in the 13th to 15th centuries. In the 16th century it was slowly replaced by Castilian Spanish for the urban elites while the rural areas and urban lower classes kept speaking it. In the early 19th century Catalan experienced a major revival (Renaixença) in the press and in literature. The use of Catalan, like that of Basque or Galician, was banned under the Franco Regime (1939-1975), but the language recovered very quickly from this: The percentage of young Catalan speakers in Catalonia (90% of the 15-29 years old) is higher than the percentage of speakers overall. There is a vivid book market, major newspapers, TV and radio stations in Catalan - and the first Windows Language Interface Pack ever released was for Catalan.

    Classification: Catalan is an Indo-European language and belongs to the Romance languages. Sharing many features with both Spanish and French it is often considered a transitory language between the Iberian and Gallic descendants of Latin.

    Script: Catalan is written in Latin script.

     Serbian (Cyrillic)

    Number of speakers:  11 million

    Name in the language itself:  српски / srpski

    Serbian, one of the varieties of the diasystem once known as Serbo-Croatian, is spoken by 11 million speakers, of whom 10 million live in Serbia. It is the official language there, as it is in Montenegro and the Srpska republic in Bosnia-Herzegovina. There are also considerable migrant communities in the United States, Canada and some European countries.

    Serbian is still mutually intelligible with Croatian and Bosnian. While Croatian and Bosnian develop away from the former Serbo-Croatian standard there has not been much change in Serbian.

    Classification: Serbian is a Southern Slavic language; it belongs to the Slavic branch of the Indo-European languages.

    Script: Serbian is written in two different scripts: Cyrillic and Latin. Cyrillic was used for Serbian traditionally until 1918 and is still the official alphabet, while Latin became popular in the Yugoslavian era and remains so, especially with the younger generation.

    Enjoy! :-)

     

    This post brought to you by с (U+0441, a.k.a. CYRILLIC SMALL LETTER ES)

  • Sorting it all Out

    Win9x keyboard file source?

    • 9 Comments

    Thorsten Glaser asked over in the Suggestion Box:

    Hi,

    thanks for MSKLC, now I'm able to have the same keyboard layout on the BSD wscons (text mode) console, under X-Window and on Windows NT/2k/… – with a “meta” key that just adds 0x80 to the value of the character (e.g. maps Meta-d to ä), emulated with AltGr on NT and Mode_switch on X11, and a few funny characters I'm occasionally needing (…€„™“”•–), and Ÿ for the sake of  completeness.

    Now I've seen Janko's Keyboard Generator for Win9x, and I wonder if the format of the .KBD files is publically documented. If so, I could create the same (almost) layout with a hex-editor, which sometimes is capable of doing more than some random UI programme. (Even MSKLC wouldn't let me re-map AltGr-Tab at first.) Maybe it's just not possible, but even then, I'd be interested if there's some kind of docs for that format which I couldn't find (probably because it's been 12+ years since “Chicago” was new).

    Thanks in advance!

    The source, header files, samples, and build environment for building keyboard layouts on Win9x have been a part of the Windows 98 DDK, even back all those years ago when it was the Windows 95 DDK.

    For documentation, the entire section of the documentation entitled Windows 95 Keyboard Driver is of particular use here, as is the subsection within entitled Keyboard Layouts.

    In fact, the only problem with this advice (which totally answers Thorsten's question!) is that the DDK no longer appears to be available for download (not entirely surprising since it is as old as it is and Windows 98 is no longer supported). So I hope Thorsten has a copy of the DDK installed somewhere, or knows someone who does....

    I may post more on this topic in the future, though I am probably more likely to talk about NT-based keyboards, all things considered. :-)

     

    This post brought to you by  (U+a1d9, a.k.a. YI SYLLABLE LYR)

  • Sorting it all Out

    Which one has the astigmatism?

    • 7 Comments

    We've been talking about DPI a whole bunch, including not disabling and disabling the high DPI support in Vista. These two screen shots taken on a machine with 192 DPI will not show the final word (that comes in another post, soon!), but they will provide the penultimate word....

    Which one has the astigmatism? :-)

    I can't be the only person who is tired of that ACUVUE commercial....

     

    This post brought to you by ◉ (U+25c9, a.k.a. FISHEYE)

  • Sorting it all Out

    The Notepad encoding detection issues keep coming up

    • 7 Comments

    A few days ago, Raymond was talking about the Notepad file encoding problem, again. And the comments were pretty funny, like watching a traffic accident as people started going off the rails in all kinds of directions.

    For the record, here is the official, UNDOCUMENTED, Notepad encoding detection story, only mildly changed from Windows 2000 Beta 2 through now (into Longhorn Server, which hasn't shipped yet):

    1. Check the first two bytes;
      1. If there is a UTF-16 LE BOM, then treat it (and load it) as a "Unicode" file;
      2. If there is a UTF-16 BE BOM, then treat it (and load it) as a "Unicode (Big Endian)" file;
      3. If the first two bytes look like the start of a UTF-8 BOM, then check the next byte and if we have a UTF-8 BOM, then treat it (and load it) as a "UTF-8" file;
    2. Check with IsTextUnicode to see if that function thinks it is BOM-less UTF-16 LE; if so, then treat it (and load it) as a "Unicode" file;
    3. Check to see if it is UTF-8 using the original RFC 2279 definition from 1998, and if it is, then treat it (and load it) as a "UTF-8" file;
    4. Assume an ANSI file using the default system code page of the machine.
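
    The order of those checks can be sketched roughly as follows. This is not Notepad's code: the EncodingSniffer class and DetectedEncoding enum are hypothetical, the looksLikeUtf16LE delegate merely stands in for IsTextUnicode, and step 3 uses the strict UTF8Encoding validation in .NET, which follows current UTF-8 rules rather than the 1998 RFC 2279 definition. One more caveat: ASCII-only input passes UTF-8 validation here and is reported as Utf8; what Notepad does in that case is not covered by the list above.

```csharp
using System;
using System.Text;

internal enum DetectedEncoding { Utf16LE, Utf16BE, Utf8, Ansi }

internal static class EncodingSniffer
{
    // looksLikeUtf16LE stands in for IsTextUnicode; pass null to skip step 2.
    internal static DetectedEncoding Detect(byte[] data, Func<byte[], bool> looksLikeUtf16LE)
    {
        // Step 1: byte order marks, starting from the first two bytes.
        if (data.Length >= 2)
        {
            if (data[0] == 0xFF && data[1] == 0xFE) return DetectedEncoding.Utf16LE;
            if (data[0] == 0xFE && data[1] == 0xFF) return DetectedEncoding.Utf16BE;
            if (data[0] == 0xEF && data[1] == 0xBB && data.Length >= 3 && data[2] == 0xBF)
            {
                return DetectedEncoding.Utf8;
            }
        }

        // Step 2: heuristic check for BOM-less UTF-16 LE.
        if (looksLikeUtf16LE != null && looksLikeUtf16LE(data))
        {
            return DetectedEncoding.Utf16LE;
        }

        // Step 3: BOM-less UTF-8 (strict modern validation, not RFC 2279).
        if (IsValidUtf8(data))
        {
            return DetectedEncoding.Utf8;
        }

        // Step 4: fall back to the default system code page.
        return DetectedEncoding.Ansi;
    }

    private static bool IsValidUtf8(byte[] data)
    {
        try
        {
            // throwOnInvalidBytes: true makes invalid sequences raise an exception.
            new UTF8Encoding(false, true).GetCharCount(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

    Keeping the heuristic as a delegate makes it easy to see how much of the outcome for BOM-less files hinges on step 2 alone.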

    Now note that there are some holes here, like the fact that step 2 does not do quite as well with BOM-less UTF-16 BE (there may even be a bug here, I'm not sure -- if so it's a bug in Notepad beyond any bug in IsTextUnicode).

    And frankly if people were happy with the IsTextUnicode behavior in general or with small files in particular then the big hubbub I mentioned here and here wouldn't have been such a mini-phenomenon (like as if people needed Notepad to comment on whether Bush hid the facts or not!).

    But then again I already mentioned I don't like IsTextUnicode, for roughly some of the same reasons that the whole Notepad "detection" thing is a pain.

    I also don't like step 3 above, either -- the code may be fast but it also is way behind the current algorithm used by MultiByteToWideChar, which has done a pretty good job keeping up with the ever changing Unicode conformance guidelines. I still haven't gotten my head around what it means for a file that meets the 1998 guidelines but not the latest UTF-8 conformance rules. Probably a lot of U+FFFD characters in the future, UTF-8 style (EF BF BD).

    But in the end I think it is unfair to pick on Notepad here. IsTextUnicode needs to be updated as I said over two years ago here and then after that is done someone needs to go update Notepad to use the new detection stuff that is added.

    In the meantime folks should not be so busy complaining about stuff before they understand it; as the above makes clear there is plenty of material to complain about accurately, later. :-)

     

    This post brought to you by (U+fffd, a.k.a. REPLACEMENT CHARACTER)

  • Sorting it all Out

    When methods use collation to 'disturb the peace' we charge them with being 'out of sorts'

    • 6 Comments

    You know how I talk about best practices here sometimes? And the worst ways to misuse various globalization/internationalization methods and functions?

    Well, believe it or not, sometimes even the Microsoft code gets it wrong.

    (Gasp!)

    Like the other day, when Balsu asked:

    Hi

    When I call System.Messaging.MessageQueue.Exists("ઽEFGH"), I am getting an ArgumentException saying ‘PathSynatx is invalid’. Looking in to the MessageQueue.Exists() method, I am finding that String.LastIndexOf method is causing the issue.

    int index1 = ".\\PRIVATE$\\ઽEFGH".LastIndexOf("\\PRIVATE$\\",StringComparison.CurrentCultureIgnoreCase);
    int index2 = ".\\PRIVATE$\\EFGH".LastIndexOf("\\PRIVATE$\\", StringComparison.CurrentCultureIgnoreCase);

    When I execute the above statements in en-US culture, I am getting index1 as -1  where it should be 1.

    I am getting index2 as 1 as expected.

    Has anyone faced similar issue? Any work around?

    The problem in the example can be traced to that ઽ, which is U+0abd (a.k.a. GUJARATI SIGN AVAGRAHA). As it happens, linguistic comparisons treat it as a combining character, given its tendency to make the previous character a little bit heavier.

    Regular readers might immediately spot the problem here, and immediately remember Put in on my Tab, please from last September. With the only real difference being that in this case the example is building up on a REVERSE SOLIDUS instead of a TAB.

    In both cases, a character that is not really a combining character is treated as if it were one in collation in order to get a specific linguistically appropriate result. Which is really not a bug, though it does end up causing one.

    (I'll talk more about THAT issue another day!)

    If we stay focused and try to figure out in a bit of Root Cause Analysis the reason for the problem in System.Messaging.MessageQueue.Exists, we are just getting started....

    There is of course the fact that Collation != Case (a.k.a. Collation <> Case).

    And the misuse of CurrentCulture, of course, since one would never want the behavior to change based on user settings.

    But even more important is the fact that when one is dealing with the file system as they are here, one should never be using a linguistic comparison method (something I first pointed out back in Comparison confusion: INVARIANT vs. ORDINAL). This scenario simply screams for the use of OrdinalIgnoreCase, to help match the behavior of the file system.

    So the fix here would be (in that String.LastIndexOf(String, StringComparison) call) to use StringComparison.OrdinalIgnoreCase, rather than StringComparison.CurrentCultureIgnoreCase....
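
    To make that concrete, here is a minimal sketch of the ordinal version of Balsu's two calls. This is not the actual System.Messaging source, just the same LastIndexOf pattern with the comparison swapped; the QueuePathDemo class and FindPrefix method are names invented for the example.

    ```csharp
    using System;

    public static class QueuePathDemo
    {
        public const string Prefix = "\\PRIVATE$\\";

        // Ordinal, case-insensitive search: no collation weights are involved,
        // so whatever follows the prefix cannot change whether it is found.
        public static int FindPrefix(string path)
        {
            return path.LastIndexOf(Prefix, StringComparison.OrdinalIgnoreCase);
        }

        public static void Main()
        {
            // U+0ABD (GUJARATI SIGN AVAGRAHA) directly after the backslash.
            Console.WriteLine(FindPrefix(".\\PRIVATE$\\\u0ABDEFGH")); // 1
            Console.WriteLine(FindPrefix(".\\private$\\EFGH"));       // 1
        }
    }
    ```

    The result no longer depends on the current culture or on which collation data the machine has, which is the point when the thing being parsed is a path and not text meant for a person to read.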

     

    This post brought to you by ઽ (U+0abd, a.k.a. GUJARATI SIGN AVAGRAHA)

  • Sorting it all Out

    Rhymes with Amharic #3 (a.k.a. Read and write a language w/o even getting out of my [em]bed? Kewl!)

    • 6 Comments

    (see also the first part and the second part)

    We now have that binary chunk that needs to be loaded, so let's go ahead and load it!

    The core bit of the code for this should have been:

    if (File.Exists(FONTNAME)) {
        // We are reading in the embed file info if the file exists (we may have just created it!)
        TTLOAD ulStatusRead = 0;
        FileStream fsRead = new FileStream(FONTNAME, FileMode.Open);
        READEMBEDPROC rep = new READEMBEDPROC(this.ReadEmbedProc);
        TTLOADINFO ttli = new TTLOADINFO();

        ttli.usStructSize = Convert.ToUInt16(Marshal.SizeOf(ttli));
        ttli.usRefStrSize = 0;
        ttli.pusRefStr = IntPtr.Zero;
        ulPrivStatus = 0;

        rc = TTLoadEmbeddedFont(out this.m_hFontReference, TTLOAD.PRIVATE,
                                out ulPrivStatus,
                                LICENSE.EDITABLE, out ulStatusRead,
                                rep, fsRead,
                                "NyalaSIAO", "NyalaSIAO",
                                ttli);
        fsRead.Flush();
        fsRead.Close();

        this.tb1.Font = new Font("NyalaSIAO", siz);
        if (this.tb1.Font.Name != "NyalaSIAO") {
            // We had everything but embedding failed anyway.
            this.lbl1.Text = "Embedding failed, font is: " + this.tb1.Font.Name;
        }
    }

    And of course the ReadEmbedProc (note the similarities and more importantly the differences when comparing to the WriteEmbedProc mentioned earlier, a ripe potential source of copy/paste codewriting errors!)

    [UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl, CharSet=CharSet.Unicode)]
    internal delegate uint READEMBEDPROC(FileStream lpvReadStream, IntPtr lpvBuffer, uint cbBuffer);

    internal uint ReadEmbedProc(FileStream lpvReadStream, IntPtr lpvBuffer, uint cbBuffer)
    {
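        byte[] rgbyt = new byte[cbBuffer];
        // Stream.Read may return fewer bytes than requested, so copy and report only what was actually read.
        int cbRead = lpvReadStream.Read(rgbyt, 0, (int)cbBuffer);
        Marshal.Copy(rgbyt, 0, lpvBuffer, cbRead);
        return (uint)cbRead;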
    }

    However, in the end it actually proved to be a lot harder than it should have been due to the way that GDI+/WinForms handles the work of fonts, refusing to recognize any font that was not available at application boot time. So even though there is a font available in the process, GDI+ is unwilling to believe it.

    The next thing I tried here was to just create it the old fashioned way and stick it into the device context, but that also failed because after all this is not plain old GDI doing the work here, this is either GDI+ using its notion of the font to use or the WinForms concept of the font to send to Uniscribe (via TextRenderer). Doesn't anyone respect a device context any more? :-)

    The solution was to add this last bit of code after succeeding in the call to TTLoadEmbeddedFont:

        IntPtr hdc = GetDC(this.tb1.Handle);

        this.m_hFontEmbedded = CreateFont(MulDiv(Convert.ToInt16(siz), GetDeviceCaps(hdc, LOGPIXELSY), 72),
                                          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, "NyalaSIAO");
        if (this.m_hFontEmbedded != IntPtr.Zero) {
            uint cb = GetFontData(hdc, 0, 0, IntPtr.Zero, 0);
            if (cb != GDI_ERROR) {
                byte[] rgbyt = new byte[cb];
                GetFontData(hdc, 0, 0, rgbyt, cb);
                this.m_pfc = new PrivateFontCollection();
                IntPtr pbyt = Marshal.AllocCoTaskMem(rgbyt.Length);
                Marshal.Copy(rgbyt, 0, pbyt, rgbyt.Length);
                this.m_pfc.AddMemoryFont(pbyt, rgbyt.Length);
                Marshal.FreeCoTaskMem(pbyt);
                this.tb1.Font = new Font(this.m_pfc.Families[0], siz);
            }
        }
    }

    What this code does is load the font that TTLoadEmbeddedFont has (if you think about it) reconstituted into an actual font and then put it into a memory font via the PrivateFontCollection class, just like what happened in the code from Private fonts: for members only.

    When I tested with subsetted font binaries it worked as well, which means that TTLoadEmbeddedFont really is doing a good job here at making what it puts together look like a font. :-)

    Now some of the things this code is ignoring include the license info that TTLoadEmbeddedFont returns, as well as the return value. More importantly, right now the code assumes that the call succeeds and then just tries to use the results. That strategy is fine in the constrained situation here, but if it is expanded to other fonts then you might want to consider altering it, since the CreateFont call will succeed even if it does not recognize the font name and you may specifically not like the font it gives instead....

    Next up, some other interesting issues to consider about embedding, and what happens when you are done....

     

    This post brought to you by ዮ (U+12ee, a.k.a. ETHIOPIC SYLLABLE YO)

  • Sorting it all Out

    I don't want you to go

    • 5 Comments

    (Absolutely positively nothing technical, whatsoever)

    My grandmother said those words as I stood in the kitchen, about to head out to the car taking me to the airport.

    Of course I still had to go.

    I had been in Ohio a week, probably one of the longest vacations I had taken since I first started working for Microsoft full time.

    Vacations have perhaps gotten less glamorous than they used to be.

    I mean, back in the day it might have been Bangkok or Grand Cayman or Hong Kong or Hawaii or Singapore or Amsterdam or Taipei or Little Cayman or Tokyo.

    Suddenly it was Beachwood.

    And now I am heading back to Redmond on a 757.

    It was just last week that I realized that I have lived in Redmond longer than any other place I have in my life. I guess the short term contract worked out okay in the end....

    But I think back to my grandmother's words again -- I don't want you to go.

    Now maybe it is just the music I am playing at the same time as I am writing this post, and with that in mind you can probably discount everything that follows to some extent.

    But I have heard those words before. And to be blunt the people who said them were people who were important to me.

    Well, at least more important than the people who have said "I do want you to go" (or less formally "get the hell out"!).

    A few of these people were girlfriends, or lovers. Some of them were very good friends, people I relied on (and vice versa). One of them was just four years old. And don't think for a minute that the last one on the list was the easiest of the bunch.

    Yet each time, at the point where someone was saying the words, I was not going to stay.

    Worse, each time, I think the person saying it knew nothing was going to change just because they said something.

    So what is being expressed, exactly?

    Sadness? Anger? Frustration?

    A general sense of pathos about a universe that would conspire to move two people away from each other?

    Perhaps all of those things. And more.

    Or maybe I am underestimating everyone's intentions.

    It could be just what Kathleen Edwards was thinking about in Old Time Sake, or maybe even what William Thacker was thinking when he answered Anna Scott's request to stay a bit longer with Stay Forever.

    It may just be that in some cases they were actually hoping I would stay. For a day, for a week, for a month, forever.

    Maybe by leaving (the situation, the place) I really was letting someone down, dashing a mad hope that someone who would smuggle a cat into Ankara on the way to a Jethro Tull show for no other reason than he promised he would return the cat to its owner might bend the universe for a moment and delay or dash the plans for no other reason than someone said the words.

    Would it make a difference? Hard to say....

    Now as I re-read this entry that I may just delete rather than posting it, I can recall one time that I did heed the words.

    A time that she said I don't want you to go that I stopped what I was about to do and asked her if she meant it. And when she said she did I changed the plan of ending a relationship and turned back to her so that I could hold her and tell her that I was hers.

    Not that it made much difference, though -- that relationship was over too, eventually. In fact, it might have been easier had I not turned around.

    Maybe I just decided to stop heeding the words. Maybe now I just take them as a very sweet expression of sadness in a world that can't change on the basis of six words, even for a little bit. So I nod and say I wish I didn't have to and I still leave.

    Perhaps I am just a cynic now.

    But I'll tell you a secret, though. I don't believe it.

    Because I said the words to someone once, and that someone is still in my life.

    And they smile around me just often enough that I believe they are happy about it.

    In other words, they didn't go. And in the process of all that staying, they showed a strength of character for which I am grateful, of which I am jealous, and to which I aspire.

    I mean, I believed in life's rich tapestry even before Modern English was singing about it. And I believe in it now.

    If you look Farther Down (apologies to Matthew Sweet!), I am an optimist, no matter how cynical I may seem at times.

    So the next time it happens, maybe I'll be braver. Maybe I will change the itinerary or the plans or the direction in life. Whichever might be appropriate.

    (Unless the person saying it actually read this post, in which case I might have to disqualify the words; readership may have its privileges around here but I have to draw the line somewhere!)

    You may not have any idea what this post is about right now. But maybe some day you will.... :-)

     

    This post brought to you by ˺ (U+02fa, a.k.a. MODIFIER LETTER END HIGH TONE)

  • Sorting it all Out

    Before jumping into the stream, you might want to peek at it

    • 5 Comments

    Chris asked:

    I am using the following constructor for StreamReader:

    StreamReader (String, Boolean)    Initializes a new instance of the StreamReader class for the specified file name, with the specified byte order mark detection option.

    However, when I look at the “.CurrentEncoding” property of the StreamReader class; it always appears to be UTF8 no matter what the encoding is of the file that was opened.  How can I get the encoding of the file that was opened?

    --Chris

    Before anyone had a chance to respond (like maybe 10 minutes later!), he answered his own question though:

    Never mind.  By executing the .Peek() method the .CurrentEncoding property gets set appropriately.

    I thought about it later and looked at the docs, which had this to say about that bool parameter:

    The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.

    A part of me is hoping that there is a small doc bug and that, given the existence of the UTF32Encoding class, it is the first four bytes that are being looked at here, and not just the first three. :-)

    But then I thought about the idea that the StreamReader.CurrentEncoding was not looking at even those first few bytes until after one started looking at the data in the stream. I couldn't really think of a case where this was weird other than code that was depending on the value to decide what to do and which therefore might be looking at the CurrentEncoding first. In which case one's code could make the wrong decision, right?

    The moral of the story? Be sure to take a quick peek before you make any big decisions with the StreamReader!
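
    Here is a small sketch of that moral in action. Chris was using the (String, Boolean) constructor with a file name; to keep the sample self-contained it uses the (Stream, Boolean) overload over a MemoryStream holding a UTF-16 LE byte order mark plus a little text, on the assumption that the detection timing is the same for both. PeekDemo and its methods are hypothetical names.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    public static class PeekDemo
    {
        // Build "BOM + text" as UTF-16 LE bytes, standing in for a file on disk.
        public static byte[] Utf16LeWithBom(string text)
        {
            byte[] bom = Encoding.Unicode.GetPreamble(); // FF FE
            byte[] body = Encoding.Unicode.GetBytes(text);
            byte[] all = new byte[bom.Length + body.Length];
            Buffer.BlockCopy(bom, 0, all, 0, bom.Length);
            Buffer.BlockCopy(body, 0, all, bom.Length, body.Length);
            return all;
        }

        // Returns the code pages CurrentEncoding reports before and after Peek().
        public static int[] CodePagesAroundPeek(byte[] fileBytes)
        {
            using (MemoryStream ms = new MemoryStream(fileBytes))
            using (StreamReader reader = new StreamReader(ms, true))
            {
                int before = reader.CurrentEncoding.CodePage; // nothing read yet
                reader.Peek();                                // forces the first buffer fill
                int after = reader.CurrentEncoding.CodePage;
                return new int[] { before, after };
            }
        }

        public static void Main()
        {
            int[] pages = CodePagesAroundPeek(Utf16LeWithBom("hi"));
            Console.WriteLine("before Peek: {0}, after Peek: {1}", pages[0], pages[1]);
        }
    }
    ```

    Code page 65001 is UTF-8 and 1200 is UTF-16 LE, so a "before" of 65001 followed by an "after" of 1200 is exactly the behavior Chris ran into.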

     

    This post brought to you by U+feff, a.k.a. ZERO WIDTH NO-BREAK SPACE, a.k.a. the BYTE ORDER MARK

  • Sorting it all Out

    Sprechen Sie IME?

    • 5 Comments

    The other day, Keith asked in the Suggestion Box:

    In creating an on screen keyboard for Korean, I began to notice that the Korean IME seems to do things differently than, say, the Japanese IME.  In Japanese, to get the characters from the ToUnicodeEx function is as simple as setting the VK_KANA virtual key to the on state when you pass in the Keyboard State array parameter.  However, in Korean it does not seem to behave in this simple a way.  More confusing, the Japanese IME has a Kana status button that turns this virtual key 'on' in the keyboard state to switch character sets.  However, the Han/Eng toggle seems to make no change to the keyboard state.  What happens internally when this button is clicked?  How would I get the correct Korean characters from the ToUnicodeEx function?  Why is this so confusing?

    He also hedged his bets in the microsoft.public.win32.programmer.international newsgroup:

    I am currently enhancing an on-screen keyboard adding the ability to enter Korean.  I am having problems getting the Korean characters to be displayed on the keyboard keys.  The code which does it correctly for other languages but doesn't work for Korean makes a call to the function ToUnicodeEx.  Further, I just noticed an old post on the newsgroups that said this method was problematic for Korean or other IMEs using TSF.  That being the case, how would I go about doing this then?  And why does ToUnicodeEx not work for certain IMEs?

    Thanks for any assistance, Keith.

    (I take no offense, there is definitely no promise of immediate response or anything!)

    Then a couple of months ago, ibon asked in that same newsgroup:

    HWND hWnd = GetForegroundWindow( );
    HIMC himc = ImmGetContext( hWnd );

    But "himc" always returns "NULL".

    If MS has blocked this, is there other ways to access info about the input language on a common IME?

    Sincerely,

    And a few months ago, Matthias asked in that same newsgroup:

    Hello

      i have a problem with korean. For a touch screen application we use a virtual keyboard to enter data. Unfortunatly does the driver generats a mouse click event when the user presses on the screen.
      If the input is korean, the IME interprets such a click as something as a cancel event and stops the composition. Can someone tell me how to surpress this so that a mouse click does NOT interrupts the composition?

    Thanky you
    Matthias

    Then there was another post on that same newsgroup early last month from Digital Ice:

    I try to retrieve the ime candidate list of Microsoft Pinyin IME.
    This code is working fine with Windows XP but not Windows Vista.

    if (msg->message == WM_IME_NOTIFY)
    {
        if (msg->wParam == IMN_OPENCANDIDATE || msg->wParam == IMN_CHANGECANDIDATE)
        {
            HWND hFocus = msg->hwnd;
            HIMC hImc = ImmGetContext(hFocus);
            _ASSERT(hImc);
            DWORD dwSize = ImmGetCandidateList(hImc,0,NULL,0);
            if (dwSize)
            {
                ..........leave out here.

    ImmGetCandidateList always returns zero when it wotrks with Windows Vista.
    ImmGetCandidateList should returns the size of CANDIDATELIST required.
    Why the behavior is different?

    On the whole, if you look into these problems deeply enough, you will find they have a few things in common:

    1. They all have to do with IMEs
    2. All of the IMEs in question are actually Text Services Framework (TSF) Text Input Processors (TIPs)
    3. In each case, there is one of two causes for the problem being reported, either:
      • The compatibility layer between TSF and the original Input Method Manager (IMM) API within imm32.dll has a bug, somewhat akin to What broke the input language messages? but without as good of a justification, or
      • The IME's own interaction with the keyboard handling API within user32.dll is not as full as it is with some other IME.

    Now obviously of those two cases the second one is the one for which there is no specific solution that will allow the code to work -- in those cases, you have to work more directly with the IME rather than the keyboard handling functions, as they simply do not provide the information where it is being requested. Every IME is made up of code and data and if they handle the situation differently then that is what they do -- how many times would you expect to see code written by different developers within different countries supporting the input of different languages where they all worked the same way?

    The first case is a bit less forgivable, though to be honest, after working with the IMM API in the past, I can understand why the legacy support to have TSF support the IMM programming interface would be incomplete -- it is not a terribly easy API to use.

    In the bulk of these cases, the answer is to look at the Text Services Framework and its myriad of classes, interfaces, methods, and properties to work with the IMEs. Starting in XP, where some of them were converted, up until Vista, where just about all of them were, it really is the only answer that is going to avoid frustration that does not have a chance of leading to success....

     

    This post brought to you by ៀ (U+17c0, a.k.a. KHMER VOWEL SIGN IE)

  • Sorting it all Out

    No Regex in the Unicode room! (and no sex in the champagne room, either!)

    • 5 Comments

    (apologies to Chris Rock for the title!)

    Ted first sent me mail years ago; he was asking some questions about MSLU and Julie (who knew Ted back from when he was working for Microsoft) sent him to me. If memory serves he actually pointed out an interesting bug or two in the course of answering those questions that I ended up fixing.... :-)

    Anyway, a few years later he came back to Microsoft and from time to time a question would come up about some random Unicode or internationalization thing and I'd often know the answer.

    Though with the question that came up yesterday from his colleague Kevin, I did not know for sure what was going on.

    The problem amounted to a Regex expression that should have returned the same results as char.IsLetter, but didn't. This code listed the characters with the problem:

    using System;
    using System.IO;
    using System.Text;
    using System.Globalization;
    using System.Text.RegularExpressions;
    namespace UnicodeCategory {
        class Program     {
            static void Main(string[] args)
            {
                StringBuilder sb = new StringBuilder();
                int cnt = 0;
                char c = char.MinValue;
                do {
                    const RegexOptions opt = RegexOptions.Compiled
                        | RegexOptions.CultureInvariant
                        | RegexOptions.IgnoreCase
                        | RegexOptions.ExplicitCapture;
                    Regex regex = new Regex(@"^([\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}]+)$", opt);
                    bool regexOK = regex.Match(c.ToString()).Success;
                    bool functionOK = Char.IsLetter(c);
                    if (regexOK != functionOK) {
                        cnt++;
                        sb.AppendLine(string.Format("regex: {0}\tfunction: {1}\tchar in hex: {2:x} - {3}",
                                                    regexOK, functionOK, (int)c, CharUnicodeInfo.GetUnicodeCategory(c)));
                    }
                    if (c == char.MaxValue) {
                        break;
                    }
                    c++;
                } while (true);
                sb.AppendLine(string.Format("TOTAL mismatches: {0}", cnt));
                File.WriteAllText("result.txt", sb.ToString());
            }
        }
    }

    The code was finding a total of 213 characters that char.IsLetter detected but that the Regex expression, which was literally searching for the same Unicode categories, did not find. The full list of characters this code was returning was:

    regex: False    function: True    char in hex: 130 - UppercaseLetter
    regex: False    function: True    char in hex: 1a6 - UppercaseLetter
    regex: False    function: True    char in hex: 1c5 - TitlecaseLetter
    regex: False    function: True    char in hex: 1c8 - TitlecaseLetter
    regex: False    function: True    char in hex: 1cb - TitlecaseLetter
    regex: False    function: True    char in hex: 1f2 - TitlecaseLetter
    regex: False    function: True    char in hex: 1f6 - UppercaseLetter
    regex: False    function: True    char in hex: 1f7 - UppercaseLetter
    regex: False    function: True    char in hex: 1f8 - UppercaseLetter
    regex: False    function: True    char in hex: 218 - UppercaseLetter
    regex: False    function: True    char in hex: 21a - UppercaseLetter
    regex: False    function: True    char in hex: 21c - UppercaseLetter
    regex: False    function: True    char in hex: 21e - UppercaseLetter
    regex: False    function: True    char in hex: 220 - UppercaseLetter
    regex: False    function: True    char in hex: 222 - UppercaseLetter
    regex: False    function: True    char in hex: 224 - UppercaseLetter
    regex: False    function: True    char in hex: 226 - UppercaseLetter
    regex: False    function: True    char in hex: 228 - UppercaseLetter
    regex: False    function: True    char in hex: 22a - UppercaseLetter
    regex: False    function: True    char in hex: 22c - UppercaseLetter
    regex: False    function: True    char in hex: 22e - UppercaseLetter
    regex: False    function: True    char in hex: 230 - UppercaseLetter
    regex: False    function: True    char in hex: 232 - UppercaseLetter
    regex: False    function: True    char in hex: 23a - UppercaseLetter
    regex: False    function: True    char in hex: 23b - UppercaseLetter
    regex: False    function: True    char in hex: 23d - UppercaseLetter
    regex: False    function: True    char in hex: 23e - UppercaseLetter
    regex: False    function: True    char in hex: 241 - UppercaseLetter
    regex: False    function: True    char in hex: 3d2 - UppercaseLetter
    regex: False    function: True    char in hex: 3d3 - UppercaseLetter
    regex: False    function: True    char in hex: 3d4 - UppercaseLetter
    regex: False    function: True    char in hex: 3d8 - UppercaseLetter
    regex: False    function: True    char in hex: 3da - UppercaseLetter
    regex: False    function: True    char in hex: 3dc - UppercaseLetter
    regex: False    function: True    char in hex: 3de - UppercaseLetter
    regex: False    function: True    char in hex: 3e0 - UppercaseLetter
    regex: False    function: True    char in hex: 3f4 - UppercaseLetter
    regex: False    function: True    char in hex: 3f7 - UppercaseLetter
    regex: False    function: True    char in hex: 3f9 - UppercaseLetter
    regex: False    function: True    char in hex: 3fa - UppercaseLetter
    regex: False    function: True    char in hex: 3fd - UppercaseLetter
    regex: False    function: True    char in hex: 3fe - UppercaseLetter
    regex: False    function: True    char in hex: 3ff - UppercaseLetter
    regex: False    function: True    char in hex: 400 - UppercaseLetter
    regex: False    function: True    char in hex: 40d - UppercaseLetter
    regex: False    function: True    char in hex: 48a - UppercaseLetter
    regex: False    function: True    char in hex: 48c - UppercaseLetter
    regex: False    function: True    char in hex: 48e - UppercaseLetter
    regex: False    function: True    char in hex: 4c0 - UppercaseLetter
    regex: False    function: True    char in hex: 4c5 - UppercaseLetter
    regex: False    function: True    char in hex: 4c9 - UppercaseLetter
    regex: False    function: True    char in hex: 4cd - UppercaseLetter
    regex: False    function: True    char in hex: 4ec - UppercaseLetter
    regex: False    function: True    char in hex: 4f6 - UppercaseLetter
    regex: False    function: True    char in hex: 500 - UppercaseLetter
    regex: False    function: True    char in hex: 502 - UppercaseLetter
    regex: False    function: True    char in hex: 504 - UppercaseLetter
    regex: False    function: True    char in hex: 506 - UppercaseLetter
    regex: False    function: True    char in hex: 508 - UppercaseLetter
    regex: False    function: True    char in hex: 50a - UppercaseLetter
    regex: False    function: True    char in hex: 50c - UppercaseLetter
    regex: False    function: True    char in hex: 50e - UppercaseLetter
    regex: False    function: True    char in hex: 1f88 - TitlecaseLetter
    regex: False    function: True    char in hex: 1f89 - TitlecaseLetter
    regex: False    function: True    char in hex: 1f8a - TitlecaseLetter
    regex: False    function: True    char in hex: 1f8b - TitlecaseLetter
    regex: False    function: True    char in hex: 1f8c - TitlecaseLetter
    regex: False    function: True    char in hex: 1f8d - TitlecaseLetter
    regex: False    function: True    char in hex: 1f8e - TitlecaseLetter
    regex: False    function: True    char in hex: 1f8f - TitlecaseLetter
    regex: False    function: True    char in hex: 1f98 - TitlecaseLetter
    regex: False    function: True    char in hex: 1f99 - TitlecaseLetter
    regex: False    function: True    char in hex: 1f9a - TitlecaseLetter
    regex: False    function: True    char in hex: 1f9b - TitlecaseLetter
    regex: False    function: True    char in hex: 1f9c - TitlecaseLetter
    regex: False    function: True    char in hex: 1f9d - TitlecaseLetter
    regex: False    function: True    char in hex: 1f9e - TitlecaseLetter
    regex: False    function: True    char in hex: 1f9f - TitlecaseLetter
    regex: False    function: True    char in hex: 1fa8 - TitlecaseLetter
    regex: False    function: True    char in hex: 1fa9 - TitlecaseLetter
    regex: False    function: True    char in hex: 1faa - TitlecaseLetter
    regex: False    function: True    char in hex: 1fab - TitlecaseLetter
    regex: False    function: True    char in hex: 1fac - TitlecaseLetter
    regex: False    function: True    char in hex: 1fad - TitlecaseLetter
    regex: False    function: True    char in hex: 1fae - TitlecaseLetter
    regex: False    function: True    char in hex: 1faf - TitlecaseLetter
    regex: False    function: True    char in hex: 1fbc - TitlecaseLetter
    regex: False    function: True    char in hex: 1fcc - TitlecaseLetter
    regex: False    function: True    char in hex: 1ffc - TitlecaseLetter
    regex: False    function: True    char in hex: 2102 - UppercaseLetter
    regex: False    function: True    char in hex: 2107 - UppercaseLetter
    regex: False    function: True    char in hex: 210b - UppercaseLetter
    regex: False    function: True    char in hex: 210c - UppercaseLetter
    regex: False    function: True    char in hex: 210d - UppercaseLetter
    regex: False    function: True    char in hex: 2110 - UppercaseLetter
    regex: False    function: True    char in hex: 2111 - UppercaseLetter
    regex: False    function: True    char in hex: 2112 - UppercaseLetter
    regex: False    function: True    char in hex: 2115 - UppercaseLetter
    regex: False    function: True    char in hex: 2119 - UppercaseLetter
    regex: False    function: True    char in hex: 211a - UppercaseLetter
    regex: False    function: True    char in hex: 211b - UppercaseLetter
    regex: False    function: True    char in hex: 211c - UppercaseLetter
    regex: False    function: True    char in hex: 211d - UppercaseLetter
    regex: False    function: True    char in hex: 2124 - UppercaseLetter
    regex: False    function: True    char in hex: 2126 - UppercaseLetter
    regex: False    function: True    char in hex: 2128 - UppercaseLetter
    regex: False    function: True    char in hex: 212a - UppercaseLetter
    regex: False    function: True    char in hex: 212b - UppercaseLetter
    regex: False    function: True    char in hex: 212c - UppercaseLetter
    regex: False    function: True    char in hex: 212d - UppercaseLetter
    regex: False    function: True    char in hex: 2130 - UppercaseLetter
    regex: False    function: True    char in hex: 2131 - UppercaseLetter
    regex: False    function: True    char in hex: 2133 - UppercaseLetter
    regex: False    function: True    char in hex: 213e - UppercaseLetter
    regex: False    function: True    char in hex: 213f - UppercaseLetter
    regex: False    function: True    char in hex: 2145 - UppercaseLetter
    regex: False    function: True    char in hex: 2c00 - UppercaseLetter
    regex: False    function: True    char in hex: 2c01 - UppercaseLetter
    regex: False    function: True    char in hex: 2c02 - UppercaseLetter
    regex: False    function: True    char in hex: 2c03 - UppercaseLetter
    regex: False    function: True    char in hex: 2c04 - UppercaseLetter
    regex: False    function: True    char in hex: 2c05 - UppercaseLetter
    regex: False    function: True    char in hex: 2c06 - UppercaseLetter
    regex: False    function: True    char in hex: 2c07 - UppercaseLetter
    regex: False    function: True    char in hex: 2c08 - UppercaseLetter
    regex: False    function: True    char in hex: 2c09 - UppercaseLetter
    regex: False    function: True    char in hex: 2c0a - UppercaseLetter
    regex: False    function: True    char in hex: 2c0b - UppercaseLetter
    regex: False    function: True    char in hex: 2c0c - UppercaseLetter
    regex: False    function: True    char in hex: 2c0d - UppercaseLetter
    regex: False    function: True    char in hex: 2c0e - UppercaseLetter
    regex: False    function: True    char in hex: 2c0f - UppercaseLetter
    regex: False    function: True    char in hex: 2c10 - UppercaseLetter
    regex: False    function: True    char in hex: 2c11 - UppercaseLetter
    regex: False    function: True    char in hex: 2c12 - UppercaseLetter
    regex: False    function: True    char in hex: 2c13 - UppercaseLetter
    regex: False    function: True    char in hex: 2c14 - UppercaseLetter
    regex: False    function: True    char in hex: 2c15 - UppercaseLetter
    regex: False    function: True    char in hex: 2c16 - UppercaseLetter
    regex: False    function: True    char in hex: 2c17 - UppercaseLetter
    regex: False    function: True    char in hex: 2c18 - UppercaseLetter
    regex: False    function: True    char in hex: 2c19 - UppercaseLetter
    regex: False    function: True    char in hex: 2c1a - UppercaseLetter
    regex: False    function: True    char in hex: 2c1b - UppercaseLetter
    regex: False    function: True    char in hex: 2c1c - UppercaseLetter
    regex: False    function: True    char in hex: 2c1d - UppercaseLetter
    regex: False    function: True    char in hex: 2c1e - UppercaseLetter
    regex: False    function: True    char in hex: 2c1f - UppercaseLetter
    regex: False    function: True    char in hex: 2c20 - UppercaseLetter
    regex: False    function: True    char in hex: 2c21 - UppercaseLetter
    regex: False    function: True    char in hex: 2c22 - UppercaseLetter
    regex: False    function: True    char in hex: 2c23 - UppercaseLetter
    regex: False    function: True    char in hex: 2c24 - UppercaseLetter
    regex: False    function: True    char in hex: 2c25 - UppercaseLetter
    regex: False    function: True    char in hex: 2c26 - UppercaseLetter
    regex: False    function: True    char in hex: 2c27 - UppercaseLetter
    regex: False    function: True    char in hex: 2c28 - UppercaseLetter
    regex: False    function: True    char in hex: 2c29 - UppercaseLetter
    regex: False    function: True    char in hex: 2c2a - UppercaseLetter
    regex: False    function: True    char in hex: 2c2b - UppercaseLetter
    regex: False    function: True    char in hex: 2c2c - UppercaseLetter
    regex: False    function: True    char in hex: 2c2d - UppercaseLetter
    regex: False    function: True    char in hex: 2c2e - UppercaseLetter
    regex: False    function: True    char in hex: 2c80 - UppercaseLetter
    regex: False    function: True    char in hex: 2c82 - UppercaseLetter
    regex: False    function: True    char in hex: 2c84 - UppercaseLetter
    regex: False    function: True    char in hex: 2c86 - UppercaseLetter
    regex: False    function: True    char in hex: 2c88 - UppercaseLetter
    regex: False    function: True    char in hex: 2c8a - UppercaseLetter
    regex: False    function: True    char in hex: 2c8c - UppercaseLetter
    regex: False    function: True    char in hex: 2c8e - UppercaseLetter
    regex: False    function: True    char in hex: 2c90 - UppercaseLetter
    regex: False    function: True    char in hex: 2c92 - UppercaseLetter
    regex: False    function: True    char in hex: 2c94 - UppercaseLetter
    regex: False    function: True    char in hex: 2c96 - UppercaseLetter
    regex: False    function: True    char in hex: 2c98 - UppercaseLetter
    regex: False    function: True    char in hex: 2c9a - UppercaseLetter
    regex: False    function: True    char in hex: 2c9c - UppercaseLetter
    regex: False    function: True    char in hex: 2c9e - UppercaseLetter
    regex: False    function: True    char in hex: 2ca0 - UppercaseLetter
    regex: False    function: True    char in hex: 2ca2 - UppercaseLetter
    regex: False    function: True    char in hex: 2ca4 - UppercaseLetter
    regex: False    function: True    char in hex: 2ca6 - UppercaseLetter
    regex: False    function: True    char in hex: 2ca8 - UppercaseLetter
    regex: False    function: True    char in hex: 2caa - UppercaseLetter
    regex: False    function: True    char in hex: 2cac - UppercaseLetter
    regex: False    function: True    char in hex: 2cae - UppercaseLetter
    regex: False    function: True    char in hex: 2cb0 - UppercaseLetter
    regex: False    function: True    char in hex: 2cb2 - UppercaseLetter
    regex: False    function: True    char in hex: 2cb4 - UppercaseLetter
    regex: False    function: True    char in hex: 2cb6 - UppercaseLetter
    regex: False    function: True    char in hex: 2cb8 - UppercaseLetter
    regex: False    function: True    char in hex: 2cba - UppercaseLetter
    regex: False    function: True    char in hex: 2cbc - UppercaseLetter
    regex: False    function: True    char in hex: 2cbe - UppercaseLetter
    regex: False    function: True    char in hex: 2cc0 - UppercaseLetter
    regex: False    function: True    char in hex: 2cc2 - UppercaseLetter
    regex: False    function: True    char in hex: 2cc4 - UppercaseLetter
    regex: False    function: True    char in hex: 2cc6 - UppercaseLetter
    regex: False    function: True    char in hex: 2cc8 - UppercaseLetter
    regex: False    function: True    char in hex: 2cca - UppercaseLetter
    regex: False    function: True    char in hex: 2ccc - UppercaseLetter
    regex: False    function: True    char in hex: 2cce - UppercaseLetter
    regex: False    function: True    char in hex: 2cd0 - UppercaseLetter
    regex: False    function: True    char in hex: 2cd2 - UppercaseLetter
    regex: False    function: True    char in hex: 2cd4 - UppercaseLetter
    regex: False    function: True    char in hex: 2cd6 - UppercaseLetter
    regex: False    function: True    char in hex: 2cd8 - UppercaseLetter
    regex: False    function: True    char in hex: 2cda - UppercaseLetter
    regex: False    function: True    char in hex: 2cdc - UppercaseLetter
    regex: False    function: True    char in hex: 2cde - UppercaseLetter
    regex: False    function: True    char in hex: 2ce0 - UppercaseLetter
    regex: False    function: True    char in hex: 2ce2 - UppercaseLetter
    TOTAL mismatches: 213

    I probably should have recognized the list, since I have dealt with it before. But off the top of my head I didn't, and in the meantime Ryan over on the CLR team stepped in to help explain what was going on:

    This appears to be a bug in the Regex class. If IgnoreCase is present we will translate Lu and Lt to just Ll since we call Char.ToLower for every character in the input. You would likely know more about this than I do but I verified that Char.ToLower for one of the characters returns the same character presumably because there is no lower case version of the character. So the expression fails to match because the Unicode category for the character is still uppercase letter and we are trying to match Ll.
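
    To make that description concrete, here is a rough model of it in Python rather than .NET. Everything in it is an assumption for illustration: regex_like_match, function_like_match and stale_to_lower are made-up names, the "stale" casing table is a stand-in for a table that lags behind the property table, and Python's own Unicode data is not the data that .NET or Windows uses.

```python
import unicodedata

def stale_to_lower(ch):
    """Hypothetical outdated casing table: no mappings at or above U+2C00."""
    return ch if ord(ch) >= 0x2C00 else ch.lower()

def regex_like_match(ch, to_lower=str.lower):
    """Model of the described IgnoreCase path: the Lu/Lt test is rewritten
    to an Ll test and applied to the lowercased input character."""
    lowered = to_lower(ch)
    # Guard: a few characters lowercase to more than one code point.
    return len(lowered) == 1 and unicodedata.category(lowered) == "Ll"

def function_like_match(ch):
    """Direct property-table check with no case conversion at all."""
    return unicodedata.category(ch) in ("Lu", "Lt")

# 'A' has a lowercase form, U+2115 has none, and U+2C00 has one that the
# hypothetical stale table does not know about.
for ch in ("A", "\u2115", "\u2c00"):
    print(f"{ord(ch):04x}",
          "regex:", regex_like_match(ch, stale_to_lower),
          "function:", function_like_match(ch))
```

    In this model the two checks only disagree when lowercasing hands back the same uppercase character, which is the situation Ryan describes.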

    Ah, now it all came together.

    Well, if you are running on Vista and have the updated casing table, then they will work. But when you are not running on Vista, the casing table does not cover all of Unicode 5.0, even though the property table in .NET 2.0 does.

    (If you run on .NET 1.1 then you will be missing even more characters, since not all characters are identified, though in that case they will not be listed as missing in the script since neither function knows about them!)
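
    The gap between a casing table and a property table is something you can audit on any platform. As a sketch only, and again in Python with its own Unicode data rather than the .NET or Windows tables, the hypothetical helper below lists the code points that the property data calls uppercase or titlecase letters but that lowercase to themselves. The function name and the default range are my assumptions, not anything from the original script.

```python
import unicodedata

def uppercase_without_lowercase(start=0x0000, stop=0x10000):
    """Return code points in [start, stop) whose category is Lu or Lt
    but whose lowercase conversion leaves them unchanged."""
    hits = []
    for cp in range(start, stop):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Lu", "Lt") and ch.lower() == ch:
            hits.append(cp)
    return hits

if __name__ == "__main__":
    found = uppercase_without_lowercase()
    for cp in found:
        print(f"char in hex: {cp:04x} - {unicodedata.category(chr(cp))}")
    print("TOTAL:", len(found))
```

    The exact list depends on the Unicode version that the runtime ships with, so it should not be expected to line up with the output above.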

    So if you are running on 2.0 or better, this Regex "optimization" is the cause of the bug.

    Strictly speaking, there was no need to pass RegexOptions.IgnoreCase since char.IsLetter is going to pick both of them up anyway. So there is a workaround here -- don't pass flags that slow down the Regex and break its functioning anyway, and you can then freely use the Regex if you like (though it did still seem kinda slow to me, maybe there are some optimizations here.... :-)

     

    This post brought to you by Ⰰ (U+2c00, a.k.a. GLAGOLITIC CAPITAL LETTER AZU)
