
Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

    Behind 'How to break Windows Notepad'

    Larry Osterman pointed me at an article entitled How to break Windows Notepad that makes for an interesting experiment:

    Here's how to do it:
    1. Open up Notepad (not Wordpad, not Word or any other word processor)
    2. Type in this sentence exactly (without quotes): "this app can break"
    3. Save the file to your hard drive.
    4. Close Notepad
    5. Open the saved file by double clicking it.

    Instead of seeing your sentence, you should see a series of squares. For whatever reason, Notepad can't figure out what to do with that series of characters and breaks.

    Now if you have East Asian language support installed, instead of seeing squares (NULL glyphs), you will see:

    桴獩愠灰挠湡戠敲歡

    And if you look at the code points under those characters, you will likely see what happened:

    6874 7369 6120 7070 6320 6e61 6220 6572 6b61

    Ah, each byte is a letter, and each pair of bytes, read as a single little endian UTF-16 code unit, just so happens to line up with a CJK ideograph!
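    The pairing is easy to reproduce outside of Notepad. What follows is a minimal sketch in Python (not anything Notepad itself runs) that takes the 18 ASCII bytes of the sentence and deliberately decodes them as UTF-16 Little Endian, which is the misreading described above. The variable names are just for illustration.

```python
# Sketch of the misdetection: ASCII bytes reinterpreted as UTF-16 LE.
text = "this app can break"
raw = text.encode("ascii")  # the 18 bytes that were saved to disk

# Reading the same bytes as UTF-16 Little Endian pairs them up:
# 't' (0x74) followed by 'h' (0x68) becomes the code unit 0x6874, and so on.
misread = raw.decode("utf-16-le")

code_points = [f"{ord(ch):04x}" for ch in misread]
print(misread)
print(" ".join(code_points))
```

    Nine code units come out of eighteen bytes, and the hex values are the same ones listed above.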

    I have talked about the encoding detection mechanisms that Notepad uses recently, and this is another example of the problem, one that is more entertaining since the repro steps are so much fun (in fact the only improvement would be text insulting Microsoft or one of its rivals, which Notepad appears to censor in an example of a big bad monopoly, etc.!).

    Now I have pointed out that I do not like the IsTextUnicode function in the past, and I suppose this could be considered a good reason (IsTextUnicode returns TRUE here, which is why Notepad guesses as it does).

     

    This post brought to you by "桴" (U+6874, a CJK ideograph)

    When is a backslash not a backslash?

    The character in question is U+005c, the REVERSE SOLIDUS, also known as the backslash or '\'. It is the path separator for Windows, which is encoded at 0x5c across all of the ANSI code pages.

    Since path separators are a pretty important requirement, the title of this post may seem a little scary -- how could it not be a backslash, a reverse solidus?

    Well, on Japanese code page 932, 0x5c is the YEN SIGN, and on Korean code page 949, 0x5c is the WON SIGN.

    Which is not to say that 0x5c does not act as a path separator -- it still does. And which is also not to say that the Unicode code points for the Yen and the Won (U+00a5 and U+20a9) do act as path separators -- because they do not.

    Of course the natural round trip mapping between U+005c and 0x5c happens on all code pages, and both U+00a5 and U+20a9 have one-way 'best fit' mappings to 0x5c on their respective code pages. This requirement technically went away with Unicode, when the characters were encoded separately.
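    A small sketch of the two claims so far, in Python rather than against the Windows tables themselves. It assumes Python's own cp932 and cp949 codecs (which are not the Windows NLS tables and do not necessarily reproduce the best fit mappings), and it uses the standard ntpath module as a stand-in for Windows path rules. The helper name last_component is made up for the example.

```python
import ntpath  # Windows-style path rules, importable on any platform

BACKSLASH, YEN, WON = "\u005c", "\u00a5", "\u20a9"

# Byte 0x5c round-trips with U+005C on the Japanese and Korean code pages.
for codec in ("cp932", "cp949"):
    assert b"\x5c".decode(codec) == BACKSLASH
    assert BACKSLASH.encode(codec) == b"\x5c"

def last_component(path):
    """Return the part of the path after the final separator."""
    return ntpath.basename(path)

# Only U+005C splits the path; the currency signs are ordinary characters.
print(last_component("temp" + BACKSLASH + "notes.txt"))
print(last_component("temp" + YEN + "notes.txt"))
print(last_component("temp" + WON + "notes.txt"))
```

    The first call finds a separator and the other two do not, which is the distinction between the code point and the glyph that happens to be drawn for it.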

    However, the issue is not simply that the old code pages lacked space while Unicode has plenty, so that customers would instantly move away from the non-backslash path separators.

    In practice, after many years of code page based systems in Japan and Korea using their respective currency symbols as the path separators, it is believed customers were simply used to this appearance. And there was therefore little interest in changing that appearance (when the system settings were Japanese or Korean) to anything but those symbols.

    To support this expectation, Japanese and Korean fonts, whenever the default system locale is set to Japanese or Korean, respectively, will display the currency symbol rather than the backslash when U+005c is shown.

    But whether or not this is really what customers want is still an open question. Andrew Tuck of PSS here at Microsoft noted:

    When one of my customers from Korea was visiting here, I asked him if it bothered him that the backslash doesn’t appear as a backslash. It did bother him, and he believes it bothers most of his countrymen. However, he was fatalistic about it, "What can we do to change it? It’s been this way for a long time. We are used to it."

    Hardly a glowing recommendation, is it?

    And as Norman Diamond noted in his comments on this very blog (in this post), there are plenty of people in Japan who may not care for the convention, either.

    Of course there is no 'right' answer here, and I would imagine that you would find plenty of people who would be unhappy with such a change, just as there are those who would be unhappy with the status quo. Which perhaps explains why the status quo seems to be as it is -- those people who would like a change are resigned to the idea that it may never happen. And so they are now used to it....

     

    This post brought to you by "\", "¥", and "₩" (U+005c, U+00a5, and U+20a9, a.k.a. REVERSE SOLIDUS, YEN SIGN, and WON SIGN)

    What are the fonts in Vista?

    Bettina asked via the Contact link:

    Hello Michael,

    do you have any idea if there is an entire list of system fonts that come with windows vista?

    I could find some of them, according to the articles, these are apparently not all of them:

        Calibri,
        Cambria,
        Candara,
        Consolas,
        Constantia,
        Corbel,
        Nyala,
        Segoe UI,
        Segoe print
        Segoe script

    would be great if you would know some more :-)

    thx, bettina

    I do not know of any official list anywhere, but I talked to some of the Typography folks and Simon Daniels gave me a list covering the 290 MB and over 712,000 glyphs contained in the Vista fonts.

    Here goes (I am not including the bold/italic info, for the sake of brevity):

    Core Fonts:

    • Arial
    • Courier New
    • Times New Roman
    • Symbol
    • Wingdings

    Core UI Fonts:

    • Microsoft Sans Serif
    • Segoe UI
    • Tahoma

    ClearType Collection Fonts:

    • Calibri
    • Cambria
    • Candara
    • Consolas
    • Constantia
    • Corbel

    Other Western Fonts:

    • Arial Black
    • Franklin Gothic Medium
    • Georgia
    • Impact
    • Lucida Console
    • Lucida Sans Unicode
    • Marlett
    • Palatino Linotype
    • Segoe Print
    • Segoe Script
    • Trebuchet MS
    • Verdana
    • Webdings

    East Asian Fonts:

    • Batang/BatangChe
    • DFKai-SB
    • Dotum/DotumChe
    • Fangsong
    • Gulim/GulimChe
    • Gungsuh/GungsuhChe
    • KaiTi
    • Malgun Gothic
    • Meiryo
    • Microsoft JhengHei
    • Microsoft YaHei
    • MingLiU_HKSCS/MingLiU_HKSCS-ExtB
    • MingLiU-ExtB/PMingLiU-ExtB
    • MS Gothic/MS PGothic/MS UI Gothic
    • MS Mincho/MS PMincho
    • SimHei
    • Simsun/NSimsun
    • SimSun-ExtB

    Arabic Fonts:

    • Andalus
    • Arabic Typesetting
    • Microsoft Uighur
    • Simplified Arabic
    • Traditional Arabic

    Hebrew Fonts:

    • Aharoni Bold
    • David
    • FrankRuehl
    • Gisha
    • Levenim
    • Miriam
    • Narkisim
    • Rod

    Thai Fonts:

    • Angsana New/AngsanaUPC
    • Browallia New/BrowalliaUPC
    • Cordia New/CordiaUPC
    • DilleniaUPC
    • EucrosiUPC
    • FreesiaUPC
    • IrisUPC
    • JasmineUPC
    • KodchiangUPC
    • Leelawadee
    • LilyUPC

    Indic Fonts:

    • Gautami
    • Iskoola Pota
    • Kalinga
    • Kartika
    • Latha
    • Mangal
    • Raavi
    • Shruti
    • Tunga
    • Vrinda

    Other Fonts:

    • DaunPenh
    • DokChampa
    • Estrangelo Edessa
    • Euphemia
    • Microsoft Himalaya
    • Microsoft Yi Baiti
    • Mongolian Baiti
    • MV Boli
    • Nyala
    • Plantagenet Cherokee
    • Sylfaen

    That should do for now. :-)

     

    This post sponsored by 𐐚 (U+1041a, a.k.a. U+d801 U+dc1a, a.k.a. DESERET CAPITAL LETTER VEE)

    Stripping diacritics....

    Well, Jochen Neyens asked:

    What's the easiest way to remove diacritic marks from characters using C#? I would like to have following function:

    string RemoveDiacriticMark(string c)

    Sample use:

    RemoveDiacriticMark("é") -> "e"

    RemoveDiacriticMark("ü") -> "u"

    RemoveDiacriticMark("à") -> "a"

    Well, there is not really an easy way to do it until Whidbey, but with Whidbey you can use normalization and Unicode character properties (discussed previously in "FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)" and "A little bit about the new CharUnicodeInfo class") to build something simple to do it all!

    WARNING: This code has been improved! Get the improved version from this other post.

    namespace Remove {
      using System;
      using System.Text;
      using System.Globalization;
      class Remove {
        [STAThread]
        static void Main(string[] args) {
          // Strip each command-line argument and print the result.
          foreach(string st in args) {
            Console.WriteLine(RemoveDiacritics(st));
          }
        }

        static string RemoveDiacritics(string stIn) {
          // Form D splits each accented letter into a base letter plus combining marks.
          string stFormD = stIn.Normalize(NormalizationForm.FormD);
          StringBuilder sb = new StringBuilder();

          for(int ich = 0; ich < stFormD.Length; ich++) {
            // Keep everything except the nonspacing marks (Unicode category Mn).
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            if(uc != UnicodeCategory.NonSpacingMark) {
              sb.Append(stFormD[ich]);
            }
          }

          return(sb.ToString());
        }
      }
    }

    Just put it in a file (remove.cs), compile it in Whidbey:

    c:\temp\samples>csc remove.cs

    and then run it!

    c:\temp\samples>remove âãäåçèéêë ìíîïðñòó ôõöùúûüý
    aaaaceeee
    iiiiðnoo
    ooouuuuy
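    For anyone who wants to try the idea without Whidbey handy, the same normalize-then-filter approach can be sketched in Python. This is a stand-in for the C# above and not a port of the improved version; it uses the standard unicodedata module, and the function name remove_diacritics is just chosen for the example.

```python
import unicodedata

def remove_diacritics(text):
    """Decompose to Form D, then drop the nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for sample in ("âãäåçèéêë", "ìíîïðñòó", "ôõöùúûüý"):
    print(remove_diacritics(sample))
```

    Note that ð survives here too, for the same reason it does above: it has no canonical decomposition, so there is no combining mark to strip.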

    Now in prior versions your options are more limited, though a p/invoke to the FoldString API with the MAP_COMPOSITE flag will do the decomposition. There is also no CharUnicodeInfo class for information on Unicode properties, but you could use a regular expression instead (using \p{Mn} will give you the equivalent category). I will leave doing the regular expression as an exercise for the reader....

    Enjoy!

    This post brought to you by "û" (U+00fb, a.k.a. LATIN SMALL LETTER U WITH CIRCUMFLEX) 

    The version of App Locale that runs on Vista?

    The question that Eiji asked me was simple enough:

    Do you know the apploc version that supports Vista?

    I got the question from Internal user (who needs it for Vista selfhosting on en-us and run Japanese ANSI apps with en system locale).

    I tried the existing apploc on Vista, but the installation failed with unknown error.

    And, in my opinion, the situation has not been changed since XP/WS03, I think we need the support for Vista.

    Of course my first instinct in such situations is to scream Use Unicode! but I figured I should contain myself. So I stopped for a moment.

    And then I thought about the issue that came up with Keyboards under LUA where an MSKLC-created install could not be installed on Vista unless it was run from an elevated command prompt....

    So I tried the elevated command prompt in this new scenario and yes, App Locale (AppLocale) installed just fine. And I did my typical quick test, running Notepad with a Japanese system locale and typing the Hiragana vowels:

    あいうえお

    saving the file, and then opening it up from Explorer and seeing the text showing up as:

    ‚ ‚¢‚¤‚¦‚¨

    but then looking correct again if I open it with the "Japanese'd" Notepad provided by the App Locale shortcut.
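    The garbled line is exactly what you would predict from the bytes. Here is a minimal sketch in Python, assuming that its cp932 and cp1252 codecs are close enough to the Windows Japanese and Western European code pages for these ten bytes (they are Python's implementations, not AppLocale or the Windows tables).

```python
vowels = "あいうえお"

# Saved under a Japanese system locale: two bytes per Hiragana letter.
stored = vowels.encode("cp932")

# Opened under an en-US system locale: every byte is read on its own.
as_western = stored.decode("cp1252")

# Opened again with the Japanese code page: the original text comes back.
as_japanese = stored.decode("cp932")

print(stored.hex())
print(as_western)
print(as_japanese)
```

    Each letter starts with the byte 0x82, which is why the same low quotation mark shows up five times in the garbled version.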

    So it looks to me like App Locale works fine under Vista. You just need the right permissions to install it. :-)

    Note that as Heath Stewart pointed out here and Robert Flaming pointed out here, an install program with a more conventional name like SETUP.EXE will try to be elevated all on its own, for backward compatibility with the thousands of other installers that came before it (and without needing manifests or other adjuncts). So you can think of the programs that hit this problem as a much less common scenario of ruggedly independent projects that can't use ordinary and accepted names for their installers....

    This may even inspire a decent idea for the next release of MSKLC -- adding a bootstrap program called setup.exe to start things up and work around the whole Keyboards under LUA issue simply? :-)

     

    This post brought to you by "ಌ" (U+0c8c, a.k.a. KANNADA LETTER VOCALIC L)

    Font substitution and linking #1

    Ok, there will be several posts on this topic, starting from the core support in GDI/Windows and moving concentrically outward to information on usage in Uniscribe, MLang, and Office.

    I'll start with font substitution.

    At the simplest level, this feature is what it sounds like -- simple substitution of one font name with another.

    It starts with a registry key (HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontSubstitutes). In Windows Server 2003 that key contains the following:

    "Arial Baltic,186"="Arial,186"
    "Arial CE,238"="Arial,238"
    "Arial CYR,204"="Arial,204"
    "Arial Greek,161"="Arial,161"
    "Arial TUR,162"="Arial,162"
    "Courier New Baltic,186"="Courier New,186"
    "Courier New CE,238"="Courier New,238"
    "Courier New CYR,204"="Courier New,204"
    "Courier New Greek,161"="Courier New,161"
    "Courier New TUR,162"="Courier New,162"
    "Times New Roman Baltic,186"="Times New Roman,186"
    "Times New Roman CE,238"="Times New Roman,238"
    "Times New Roman CYR,204"="Times New Roman,204"
    "Times New Roman Greek,161"="Times New Roman,161"
    "Times New Roman TUR,162"="Times New Roman,162"
    "Helv"="MS Sans Serif"
    "Helvetica"="Arial"
    "Times"="Times New Roman"
    "Tms Rmn"="MS Serif"

    "MS Shell Dlg"="Microsoft Sans Serif"
    "MS Shell Dlg 2"="Tahoma"

    These entries can be put into three categories (color-coded in the list above as black, green, and blue, and identified below by the entries they cover):

    BLACK (the fifteen "name,charset" entries, from "Arial Baltic,186" through "Times New Roman TUR,162") - these are entries that were formerly used by many applications to combine font family (name) choice with font character set choice (basically the lfFaceName and lfCharSet members of the LOGFONT struct). I'll talk more about lfCharSet and what it used to do (and sometimes still does) another day. But in any case these names are not really used much anymore. When they are used in applications, their presence in the FontSubstitutes subkey makes them work properly.

    GREEN ("Helv", "Helvetica", "Times", and "Tms Rmn") - these entries allow some common abbreviated names to work. Their usage is self-explanatory.

    BLUE ("MS Shell Dlg" and "MS Shell Dlg 2") - these entries are the ones behind the huge effort to support MS Shell Dlg as a UI font name (also described here and in article 282187 in the knowledge base). In fact, these two entries are the only ones that can be considered useful for more than just backward compatibility with no longer used methodologies. Raymond Chen also has good advice about getting the right font used via DS_SHELLFONT in the articles What's the deal with the DS_SHELLFONT flag? and What other effects does DS_SHELLFONT have on property sheet pages? for those who are interested.
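    To make the shape of these entries concrete, here is a small sketch in Python that parses a handful of them (copied from the list above) and looks up a face name. It only models the data; it is not how GDI does the substitution internally, it does not read the registry, and the helper names (split_spec, build_table, substitute) and the exact-match lookup are assumptions made for the example.

```python
# A subset of the FontSubstitutes entries quoted in the post.
RAW_ENTRIES = [
    '"Arial CE,238"="Arial,238"',
    '"Courier New Greek,161"="Courier New,161"',
    '"Helv"="MS Sans Serif"',
    '"Tms Rmn"="MS Serif"',
    '"MS Shell Dlg"="Microsoft Sans Serif"',
    '"MS Shell Dlg 2"="Tahoma"',
]

def split_spec(spec):
    """Split 'Face,charset' into (face, charset); charset is None if absent."""
    face, sep, charset = spec.rpartition(",")
    if sep and charset.strip().isdigit():
        return face.strip(), int(charset)
    return spec.strip(), None

def build_table(raw_entries):
    """Map each old face name to its (replacement face, charset) pair."""
    table = {}
    for entry in raw_entries:
        left, _, right = entry.partition("=")
        old_face, _ = split_spec(left.strip().strip('"'))
        table[old_face] = split_spec(right.strip().strip('"'))
    return table

SUBSTITUTES = build_table(RAW_ENTRIES)

def substitute(face):
    """Return the substituted (face, charset), or the name unchanged."""
    return SUBSTITUTES.get(face, (face, None))

print(substitute("Arial CE"))
print(substitute("MS Shell Dlg 2"))
print(substitute("Verdana"))
```

    The first category carries a character set number along with the name, while the other two are plain name-for-name swaps.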

    Of course it seems odd that MS Shell Dlg, documented as a version independent, language independent pseudo-font name seems to be hard coded to use fonts that do not support all languages. Wasn't it designed to get people away from hard-coding Tahoma or Microsoft Sans Serif? And the answer is yes -- it was. Luckily those font names are affected by a different font mapping technology, font linking, which I will describe in a future post in this series....

    There is another kind of font substitution that is occasionally seen in documentation, which relates to printer drivers and what they do to substitute fonts built into printer hardware when they can. My personal belief (with my admitted bias towards good international functionality) is that it is important to not use this feature since even printer fonts that accurately handle the basic glyphs seldom have the full support for all scripts (not to mention complex scripts!). In fact, one of the first things I do with each new version of Word is find out how to set the Print TrueType Fonts as Graphics setting that allows what is on the screen to be what gets printed rather than using device fonts....

     

    This post brought to you by "ڜ" (U+069c, a.k.a. ARABIC LETTER SEEN WITH THREE DOTS BELOW AND THREE DOTS ABOVE)
    A character not seen in most device fonts!

    What the hell does HTTP_ACCEPT_LANGUAGE mean?

    The question is a simple one: what the hell does HTTP_ACCEPT_LANGUAGE mean?

    The answer is also quite simple: IT DEPENDS.

    The user is sending information from their browser, and could mean any of the following things:

    • language/locale to use for formatting/collation preferences
    • language/locale to use for the UI
    • language/locale about which to provide content
    • location for which to provide information

    Now sometimes all of the settings will be the same. It is obviously more common for that to be the case. But it is a huge Internet and frankly there are a lot of times that they're not the same. It is unfortunate that all of these different items have to be filtered through a single setting across all of the browsers. But life is about dealing with things as they are, not as we want them to be.

    It is therefore crucial to recognize that a user may have any of these in mind, and to be careful not to assume too much based on the HTTP_ACCEPT_LANGUAGE, giving them an easy way to change the settings if you assumed more than they wanted you to....
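    One way to act on that advice, sketched in Python under some assumptions: the header is treated as a ranked hint only, and an explicit choice made by the user always wins. The function names, the supported list, and the fallback to the first supported language are all invented for the example, and the parsing covers just the common "tag;q=value" form.

```python
def parse_accept_language(header):
    """Return [(tag, quality)] ordered by quality, then by header position."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip()
        if not tag:
            continue
        quality = 1.0  # a tag without q= counts as fully preferred
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        prefs.append((tag, quality, position))
    prefs.sort(key=lambda pref: (-pref[1], pref[2]))
    return [(tag, quality) for tag, quality, _ in prefs if quality > 0]

def choose_language(header, supported, explicit_choice=None):
    """Use the header as a hint only; the user's own setting overrides it."""
    if explicit_choice in supported:
        return explicit_choice
    for tag, _ in parse_accept_language(header):
        if tag.lower() in supported:
            return tag.lower()
        primary = tag.split("-")[0].lower()
        if primary in supported:
            return primary
    return supported[0]

print(parse_accept_language("ja, en-US;q=0.8, en;q=0.5"))
print(choose_language("ja, en-US;q=0.8", ["en", "fr"]))
print(choose_language("ja, en-US;q=0.8", ["en", "fr"], explicit_choice="fr"))
```

    Nothing in the header says which of the four meanings the user had in mind, so the explicit_choice parameter is the important part.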

     

    This post brought to you by "Ǯ" (U+01ee, a.k.a. LATIN CAPITAL LETTER EZH WITH CARON)

    The game is over, people!

    NOTEPAD adds a BOM (Byte Order Mark) when you save a file in the UTF-8 encoding.

    Always[1].

    You'd think that since Windows Notepad has been doing this for over 319680000 seconds[2], and that the combined usage of Windows 2000[3], Windows XP, Windows Server 2003, Windows XP 64-bit, Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2 is so high that it may well blow your mind to calculate the number, that people would have gotten over this by now.

    But no.

    As recently as yesterday[4], people were making comments again on that "Why are UTF-8 encoded Unix shell scripts *ever* written or edited in Notepad?" blog, the one where I officially suggested that people who don't like the Notepad behavior of inserting a BOM in front of UTF-8 files had a simple remedy:

    STOP USING WINDOWS NOTEPAD!

    Yet for some reason people are still arguing it.

    Please give up, it is over. If you were in a contest or duel for this[5], then you have lost the contest, been bested in the duel. The game is over[6].

    A long time ago, someone decided that:

    • if your file was 100% ASCII[7] and
    • you chose to save it as UTF-8 and
    • you opened the file up again and
    • added some >0x007f character and
    • later saved again that

    you should not be prompted[8] in a way like this:

    This file contains characters in Unicode format which will be lost if you save this file as an ANSI encoded text file. To keep the Unicode information, click Cancel below and then select one of the Unicode options from the Encoding drop down list. Continue?

    and so that was the way the feature was coded.
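    The reasoning in that list can be shown with a few bytes. This is a sketch in Python, not Notepad's code: it assumes cp1252 as the "ANSI" code page and uses Python's utf-8-sig codec to stand in for a save that writes the signature. The sample text and the helper name are made up.

```python
import codecs

ascii_only = "#!/bin/sh\necho hello\n"

as_ansi = ascii_only.encode("cp1252")      # saved as ANSI
as_utf8 = ascii_only.encode("utf-8")       # saved as UTF-8, no signature
with_bom = ascii_only.encode("utf-8-sig")  # saved as UTF-8 with the BOM

def looks_like_utf8_with_bom(data):
    """True when the data starts with the three-byte UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)

# Without the signature, the two saves are byte-for-byte identical,
# so nothing records that the user asked for UTF-8.
print(as_ansi == as_utf8)
print(with_bom[:3].hex())
print(looks_like_utf8_with_bom(with_bom), looks_like_utf8_with_bom(as_utf8))
```

    With the signature in place, the reopened file is known to be UTF-8 before the first non-ASCII character is ever typed, so there is nothing to prompt about.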

    Game over.

    There is probably an alt.i.hate.microsoft newsgroup somewhere on USENET that would be happy to hear your complaint on the matter.

    But the world has moved on.

    And Notepad (the apparent premier tool of UNIX shell script authors throughout the world) has let down a segment of customers who could have updated whatever is reading the scripts in less than a day, rather than complaining about this on and off for the last ~3700[9] plus days.

    Your sacrifice is appreciated.

    But please, go home now.

    P.S. Isn't there some tool on UNIX that does this correctly[10]?

    P.P.S. I will not include a screenshot of my private Notepad; I'm not trying to tease you here that badly....

     

    1 - Well, not on the private Notepad I build from time to time from the Windows source, but that one is not one that is released to the public.
    2 - Over ten years, give or take
    3 - Where this first started happening.
    4 - The day before today.
    5 - Which none of you were, who are you kidding?
    6 - Even more over than the Canadians in that game last night.
    7 - Which ironically, most UNIX shell scripts are.
    8 - This is a cool feature too, by the way.
    9 - Over ten years, give or take.
    10 - By your definition of "correctness", at least - a BOM-less UTF-8 save.

    Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!)

    (The alternate title should be spoken with either a circa-1982 Jeff Spicoli or circa-1989 Theodore "Ted" Logan mannerism and accent)

    U+feff has two jobs in the Unicode standard:

    Job #1, and its namesake, is as a ZERO WIDTH NO-BREAK SPACE. The name kind of says it all. After all, we have U+00a0 (NO-BREAK SPACE) and U+200b (ZERO WIDTH SPACE). Yet we somehow needed to combine the two to create a character that has no width yet you should not break a line between what is on one side of it and what is on the other. Later they decided that a different character is the preferred one for this job (U+200d, ZERO WIDTH JOINER), in part due to U+feff's violation of the moonlighting agreement and its apparent lack of focus (see job #2). But at its heart a conformant Unicode application does not have to do anything special with U+feff because this is a character that has no width. If it is between two characters and you completely ignore it, then you will get identical results to it not being there at all.

    Update 21 January 2005 -- TLKH pointed out that the actual character that took over job #1 after U+feff was fired (well, deprecated) from this job is U+2060 (WORD JOINER) and not U+200d (ZERO WIDTH JOINER).

    Job #2 is to act as a BOM, a Byte Order Mark: a signature at the beginning of a file with no "wrapper" to indicate its encoding, so that someone could look at the byte stream and know by the pattern of bytes what the encoding might be:

    00 00 fe ff      UTF-32, Big Endian

    ff fe 00 00      UTF-32, Little Endian

    fe ff ## ##      UTF-16, Big Endian

    ff fe ## ##      UTF-16, Little Endian

    ef bb bf         UTF-8

    And that last line is where it starts to get weird for people. Because lots of the folks who support UTF-8 in other standards like XML note that you do not need the BOM when you have other means to document the encoding, and that those standards use UTF-8 as their default encoding when none is specified anyway. And lots of other folks who support Unix tools that did not have to be completely changed to support Unicode by using UTF-8 do not like these extra three bytes at the front of the file. Sometimes that is because they really only support ASCII or ISO-8859-1; other times it is because they just can't handle those three bytes right at the front, even though the same bytes later in the file would not matter.
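    Looking at the first few bytes for one of those signatures can be sketched in a few lines of Python. This is an illustration rather than what any particular tool does; it leans on the BOM constants in the standard codecs module (where the UTF-32 Little Endian signature is ff fe 00 00), and the function name and labels are made up.

```python
import codecs

# Longer signatures go first: ff fe 00 00 (UTF-32 LE) begins with ff fe,
# which on its own would be taken for UTF-16 LE.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32, Big Endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, Little Endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, Big Endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, Little Endian"),
]

def sniff_bom(data):
    """Return a label for the signature at the start of data, or None."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"\xff\xfeh\x00i\x00"))
print(sniff_bom(b"hello"))
```

    The ordering is a judgment call: a UTF-16 Little Endian file whose first real character is U+0000 would be reported as UTF-32 here, which is the usual price of guessing from a prefix.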

    Enter Microsoft.

    (Yes, I know -- boo, hiss, etc.)

    Microsoft has an application called Notepad. Application is perhaps an overstatement; it's just an uber-wrapper for a Win32 EDIT control. It stores the text in the edit control, which means if you open a file it must literally load all of the data to stick it into the control (which is by the way why it takes forever to open huge files). Over the years minor features and tweaks have been added. But in its soul it is just a plain old edit control.

    When it was ported to NT the option of saving Unicode files had to be there since Unicode was there, so they added it. It was Little Endian UTF-16 (that was all the platform really supported back then) but they just called it Unicode since it was vaguely more likely that someone might have heard of that. And that is what the rest of the platform was doing.

    Then in Windows 2000 they added the ability to save a file as Big Endian UTF-16 and since they were already calling Little Endian UTF-16 Unicode they decided to call the other form Unicode (Big Endian). I do not think this is so bad, certainly less controversial than calling it unicodeFFFE[1], but it definitely did irk some people who did not like one format being called Unicode as if others were not.

    Incidentally, I think those people are probably right. But the number of people who don't really care what it is called since they will never use it does outnumber all of the people who do care, so I kind of understand the logic behind the lack of detail that would confuse...

    But then the worst sin of all was committed -- Notepad also added UTF-8 support. And of course the issue with the BOM had to come up.

    The folks on the Shell team who did this recognized that if the file only had ASCII characters that it could be called UTF-8 or it could just be using the default system code page. So if a user intentionally saved it as UTF-8 then they would be confused if opening it again would not appear to remember that it had been saved in such a way. So they add a BOM when it is UTF-8, to tag it as UTF-8 in a way that is 100% conformant with Unicode.

    This is completely legal and since Notepad is just a simple "Hans & Franz" wrapper around an EDIT control, it has no other means of understanding "envelope" information to tell anyone what the encoding is. What else could they do? The bug is in the people who use Notepad to edit HTML and XML, because they do not require a BOM. People still use it as a convenient editor of files, but the caveats are pretty clear....

    People like Raymond Chen have posted about how "Some files come up strange in Notepad" but generally people do not have complaints about the way Notepad behaves.

    But every 4-6 months another huge thread on the Unicode List gets started about how bad the BOM is for UTF-8 and how it breaks UNIX tools that have been around and able to support UTF-8 without change for decades[2] and about how Microsoft is evil for shipping Notepad that causes all of these problems and how neither the W3C nor Unicode would have ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it, and so on, and so on.

    We are about 30+ messages into such a thread right now, believe it or not. That did not inspire this post so much as the image of Sean and Keanu talking about it like surfer dudes did, though. :-)

    No one ever has answers about the fact that if someone really supports Unicode, they should be able to handle a ZERO WIDTH NO-BREAK SPACE without breaking a sweat. If they can't, their tool or utility or application or whatever they have is broken, and it's their bug, not Microsoft's. At least, if they claim to support UTF-8, that is. Tools that support "all of UTF-8 as long as it starts with ASCII" and tools that cannot handle these three bytes at all are not really supporting UTF-8.
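    Handling those three bytes does not take much. A minimal sketch in Python, assuming the reader already knows the data is meant to be UTF-8: the standard utf-8-sig codec accepts input with or without the signature, and a leading U+feff that has already been decoded can simply be dropped. Both function names are invented for the example.

```python
def read_utf8_text(data):
    """Decode UTF-8 bytes, accepting input with or without a leading BOM."""
    return data.decode("utf-8-sig")

def strip_leading_bom(text):
    """Drop one leading U+FEFF from text that was decoded as plain UTF-8."""
    return text[1:] if text.startswith("\ufeff") else text

with_bom = b"\xef\xbb\xbf#!/bin/sh\n"
without_bom = b"#!/bin/sh\n"

print(read_utf8_text(with_bom) == read_utf8_text(without_bom))
print(strip_leading_bom(with_bom.decode("utf-8")) == "#!/bin/sh\n")
```

    Either route gives the same text for both inputs, which is all "supporting UTF-8" asks for in this case.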

    And by the way, that includes Microsoft applications, too. In my opinion Frontpage 2000 kinda stunk (all things considered) because of this problem, even though they added the cool "don't screw with my HTML" setting that I liked so much. I was very happy when Frontpage 2002 and 2003 fixed this problem. Just like I'm sure most others would be happy if people fixed their tools, as well....

    I thought I'd briefly quote one of the posts to the Unicode List that was just done by Peter Constable:

    As for whether plain text files can have a BOM, that is one of the few unending debates that arise with certain (fortunately not too frequent) regularity, each time with vociferous expressions of deeply-held beliefs but never any resolution. I'll just observe that the formal grammar for XML does not make reference to a BOM, yet the XML spec certainly assumes that a well-formed XML document may begin with a UTF-8 BOM (or a BOM in any Unicode encoding form/scheme). Rather than have a philosophical debate about the definition of "plain text file", I suggest a more pragmatic approach: for better or worse, plain text processes that support UTF-8 are going to encounter UTF-8 data beginning with a BOM: learn to live with it!

    I agree 100% with his words and wish I could summarize the issues as clearly and as effectively as he can. :-)

    For the record, it has occurred to me in the past that it would not be a bad idea to add an option to save files without the BOM. Of course that would mean having to document it for people who probably struggle with the difference between Unicode and Unicef[3]. That does make this something of an uphill battle (doc. changes are the hardest and most resource intensive in changes like this), but perhaps worthy of a try. Maybe they could take out some of that "UTF-8 is for legacy" stuff that is in Notepad help now while they are there. What do you think? :-)

     

    1 - Believe it or not, unicodeFFFE is actually documented as Internet Explorer's Preferred Charset Label for Unicode (Big-Endian). Periodically people report the name as a bug, since there is no such code point in Unicode as U+fffe. But the reason for the name is that if you look at a BOM of UTF-16 big endian on a system that is little endian, it will look like FFFE. Since that is not a valid character, it is easy to tell on a Little Endian system that the file must be Big Endian Unicode. The name is just acting as a sensible (if somewhat platformily provincial) labelling of what one sees on almost 100% of all Windows platforms.

    2 - Never mind that Unicode has not existed for that long, let alone UTF-8!

    3 - Someone once asked me at a conference how saving a file is able to contribute to a charity, and was it like one of those fake email chain letter things on her machine? And I did not laugh, though I admit I smiled pretty broadly as I explained to her about how Unicode was not Unicef. And I did laugh a bit afterward.

     

    This post is sponsored by "" U+feff (ZERO WIDTH NO-BREAK SPACE, of course)
    Though he was a little bitter about the lack of visible representation here, I was unable to find the little guy to spray paint him so that you could all see him here today. He is between those quotes, I can promise you that.

    A few of the gotchas of MultiByteToWideChar

    Like I mentioned yesterday, I have talked a bunch of times about the way that different forms of a string can exist in the world that are canonically equivalent according to Unicode and actually look identical visually.

    Yesterday, I mentioned it while I was talking about a few of the gotchas of WideCharToMultiByte. Today I thought I would talk about the other direction, the MultiByteToWideChar API.

    First of all, almost all code pages are in Normalization Form C (a.k.a. precomposed) at all times (I will talk about the exceptions in a second). Of course Unicode (by which I mean UTF-16 Little Endian, which Microsoft always calls Unicode) can be either Form C (a.k.a. precomposed) or Form D (a.k.a. composite).

    If you would like to choose, then you get that option; you can pass either the MB_PRECOMPOSED or MB_COMPOSITE flags. For the reasons of having data that is consistent with the rest of the platform, I would recommend the MB_PRECOMPOSED flag, but either one is legal (just not both).
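    The difference between the two results is easy to see on a single character. This sketch is in Python and does not call MultiByteToWideChar at all; it assumes cp1252 as the source code page and uses unicodedata.normalize to show what the precomposed and composite forms of the same byte look like.

```python
import unicodedata

source = b"\xe9"  # e with acute accent on Windows code page 1252

# Precomposed (Form C): one code point, U+00E9.
precomposed = unicodedata.normalize("NFC", source.decode("cp1252"))

# Composite (Form D): base letter U+0065 followed by U+0301.
composite = unicodedata.normalize("NFD", precomposed)

print([f"U+{ord(ch):04X}" for ch in precomposed])
print([f"U+{ord(ch):04X}" for ch in composite])
print(precomposed == composite)
```

    The two strings look the same on screen and are canonically equivalent, yet they do not compare equal code point by code point, which is the reason to stay consistent with what the rest of the platform produces.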

    There is also an MB_USEGLYPHCHARS flag. Now I already beat that particular horse to death when I answered the question what the &%#$ does MB_USEGLYPHCHARS do? So if you want to know more you can look there. You probably do not, at least I hope you do not....

    Finally, there is the MB_ERR_INVALID_CHARS flag. The documentation says it all on this flag:

    If the function encounters an invalid input character, it fails and GetLastError returns ERROR_NO_UNICODE_TRANSLATION.

    Now after the MultiByteToWideChar topic covers these four flags, it gets confusing. It says:

    For the code pages in the following table, dwFlags must be zero, otherwise the function fails with ERROR_INVALID_FLAGS.

    50220
    50221
    50222
    50225
    50227
    50229
    52936
    54936
    57002 through 57011
    65000 (UTF7)
    65001 (UTF8)
    42 (Symbol)

    Windows XP and later: MB_ERR_INVALID_CHARS is the only dwFlags value supported by Code page 65001 (UTF-8).

    Call me crazy, but there probably was not a need to have the sentence before the table and the table conflict with the sentence after the table. It is kind of understandable, but as topics go it has the flavor of a WTF sentence, if you ask me!
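
    One way to read the quoted rules without the contradiction is sketched below in Python. The function flags_look_valid is a hypothetical helper, not anything Windows provides, and it only encodes the documentation text quoted above (including the "just not both" rule for MB_PRECOMPOSED and MB_COMPOSITE); it has not been checked against the behavior of any particular Windows version. The flag values are the ones defined in the Windows headers.

```python
# dwFlags values for MultiByteToWideChar, as defined in the Windows headers.
MB_PRECOMPOSED = 0x1
MB_COMPOSITE = 0x2
MB_USEGLYPHCHARS = 0x4
MB_ERR_INVALID_CHARS = 0x8

# Code pages from the quoted table, where dwFlags is documented as "must be zero".
ZERO_FLAGS_ONLY = {50220, 50221, 50222, 50225, 50227, 50229,
                   52936, 54936, 65000, 65001, 42} | set(range(57002, 57012))


def flags_look_valid(code_page, flags):
    """Hypothetical check that mirrors the quoted documentation, nothing more."""
    if (flags & MB_PRECOMPOSED) and (flags & MB_COMPOSITE):
        return False                      # either one is legal, just not both
    if code_page == 65001:
        # The sentence after the table: UTF-8 also accepts MB_ERR_INVALID_CHARS.
        return flags in (0, MB_ERR_INVALID_CHARS)
    if code_page in ZERO_FLAGS_ONLY:
        return flags == 0
    return True
```

    Read this way, the sentence after the table is an exception to the table rather than a contradiction of it.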

    It does end on a better note by defining what an invalid character is:

    The function fails if MB_ERR_INVALID_CHARS is set and encounters an invalid character in the source string. An invalid character is either, a) a character that is not the default character in the source string but translates to the default character when MB_ERR_INVALID_CHARS is not set, or b) for DBCS strings, a character which has a lead byte but no valid trailing byte. When an invalid character is found, and MB_ERR_INVALID_CHARS is set, the function returns 0 and sets GetLastError with the error ERROR_NO_UNICODE_TRANSLATION.
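
    Case (b) is easy to reproduce outside of Windows. The sketch below uses Python's cp932 codec as an analogy, with errors="strict" playing the role of MB_ERR_INVALID_CHARS; the helper names are hypothetical, and Python substitutes U+FFFD in lenient mode where Windows would use its own default character.

```python
def decode_strict(data, encoding):
    """Return the decoded text, or None if the bytes are invalid (like MB_ERR_INVALID_CHARS)."""
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None


def decode_lenient(data, encoding):
    """Decode and silently substitute a replacement character for invalid bytes."""
    return data.decode(encoding, errors="replace")


# 0x81 is a lead byte in code page 932; here it has no trailing byte.
dangling_lead = b"A\x81"
# The same lead byte followed by a valid trail byte (0x81 0x40 is IDEOGRAPHIC SPACE).
complete_pair = b"A\x81\x40"

print(decode_strict(dangling_lead, "cp932"))
print(repr(decode_lenient(dangling_lead, "cp932")))
```

    The strict call is the one that tells you your data is damaged; the lenient one hides the problem inside the output string.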

    Oh, and before that it talks about some security considerations (more on these another day).

    I am forgetting something now. What was it?

    Oh yeah, I was going to talk about the code pages that are not Normalization Form C.

    Obviously there is UTF-7 (65000), UTF-8 (65001), and GB-18030 (54936). Since each of these code pages covers the entire Unicode repertoire, each can have characters in Unicode normalization Form C, Form D, or any combination thereof. Some of the other code pages in the table above also fall into this category, but in the case of these three and all the rest, the MB_PRECOMPOSED and MB_COMPOSITE flags are both at best ignored and at worst will cause an ERROR_INVALID_FLAGS to be returned. So you will want to not pass either flag with any of them.

    But there is one code page that can have data in either composite or precomposed form -- it is the Vietnamese ACP, code page 1258. It has all of the following entries:

    CC = U+0300 : COMBINING GRAVE ACCENT
    D2 = U+0309 : COMBINING HOOK ABOVE
    DE = U+0303 : COMBINING TILDE
    EC = U+0301 : COMBINING ACUTE ACCENT
    F2 = U+0323 : COMBINING DOT BELOW

    The reason for doing this is that there was really not enough room in the code page, otherwise. Unfortunately, there are also some precomposed characters with these accents:

    C0 = U+00C0 : LATIN CAPITAL LETTER A WITH GRAVE
    C1 = U+00C1 : LATIN CAPITAL LETTER A WITH ACUTE
    C8 = U+00C8 : LATIN CAPITAL LETTER E WITH GRAVE
    C9 = U+00C9 : LATIN CAPITAL LETTER E WITH ACUTE
    CD = U+00CD : LATIN CAPITAL LETTER I WITH ACUTE
    D1 = U+00D1 : LATIN CAPITAL LETTER N WITH TILDE
    D3 = U+00D3 : LATIN CAPITAL LETTER O WITH ACUTE
    D9 = U+00D9 : LATIN CAPITAL LETTER U WITH GRAVE
    DA = U+00DA : LATIN CAPITAL LETTER U WITH ACUTE
    E0 = U+00E0 : LATIN SMALL LETTER A WITH GRAVE
    E1 = U+00E1 : LATIN SMALL LETTER A WITH ACUTE
    E8 = U+00E8 : LATIN SMALL LETTER E WITH GRAVE
    E9 = U+00E9 : LATIN SMALL LETTER E WITH ACUTE
    ED = U+00ED : LATIN SMALL LETTER I WITH ACUTE
    F1 = U+00F1 : LATIN SMALL LETTER N WITH TILDE
    F3 = U+00F3 : LATIN SMALL LETTER O WITH ACUTE
    F9 = U+00F9 : LATIN SMALL LETTER U WITH GRAVE
    FA = U+00FA : LATIN SMALL LETTER U WITH ACUTE

    So it looks like maybe you could have mixed "Form C" and "Form D" code page 1258 text, doesn't it?

    Unfortunately, it's not that perfect. There are two error patterns, each flagged below with "<-- error":

    0xc0 with MultiByteToWideChar/MB_PRECOMPOSED --> U+00c0
    0xc0 with MultiByteToWideChar/MB_COMPOSITE --> U+0041 U+0300
    0x41 0xcc with MultiByteToWideChar/MB_PRECOMPOSED --> U+0041 U+0300   <-- error
    0x41 0xcc with MultiByteToWideChar/MB_COMPOSITE --> U+0041 U+0300

    and going the other way:

    U+00c0 with WideCharToMultiByte/WC_COMPOSITECHECK --> 0xc0
    U+00c0 with WideCharToMultiByte --> 0x41 0xcc
    U+0041 U+0300 with WideCharToMultiByte/WC_COMPOSITECHECK --> 0xc0
    U+0041 U+0300 with WideCharToMultiByte --> 0xc0   <-- error

    The pattern is clear, right? MultiByteToWideChar is not quite smart enough to precompose in Unicode what is composite in cp1258, and WideCharToMultiByte is not quite smart enough to keep composite what is composite in Unicode.
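
    If you need both spellings to come out the same, one option is to normalize after converting. The sketch below is an assumption-laden illustration in Python: it uses Python's own cp1258 codec (a plain byte-to-code-point table, not the Windows conversion) and the hypothetical helper decode_cp1258 to show the mixed forms and the fix.

```python
import unicodedata


def decode_cp1258(data, form="NFC"):
    """Decode cp1258 bytes, then normalize so both spellings agree (hypothetical helper)."""
    return unicodedata.normalize(form, data.decode("cp1258"))


# A straight table lookup keeps whatever form the bytes happened to use.
raw_precomposed = b"\xc0".decode("cp1258")   # one code point: U+00C0
raw_composite = b"A\xcc".decode("cp1258")    # two code points: U+0041 U+0300

# After normalization the two byte sequences yield the same string.
same_in_c = decode_cp1258(b"\xc0") == decode_cp1258(b"A\xcc")
same_in_d = decode_cp1258(b"\xc0", "NFD") == decode_cp1258(b"A\xcc", "NFD")

print(same_in_c, same_in_d)
```

    The design choice here is to treat the code page conversion and the choice of normalization form as two separate steps, so that neither has to be "smart" about the other.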

    Ah well, nothing is perfect -- the Vietnamese code page is missing some characters used in Vietnamese, anyway.

    But the real reason for these combining characters is to handle the many letters used in Vietnamese that have double diacritics on them -- the cases of dual representations are somewhat accidental, all things considered, in the face of the need to support letters like "ẳằẵắặầẩẫấậ" and so forth....

     

    This post brought to you by "À" (U+00c0, a.k.a. LATIN CAPITAL LETTER A WITH GRAVE)

  • Sorting it all Out

    What is the difference between Big Endian and Little Endian Unicode?

    • 13 Comments

    A very common question that comes up has much to do with the meaning of the suffixes in UTF-16LE and UTF-16BE.

    It all comes back to the way processors work. When you look at a byte (like 0x41) it is easy to say you know what it is. But when looking at two bytes in a row (like 0x41 0x00) as if it were a single 16-bit WORD you have to decide if you are looking at the number 0x4100 or the number 0x0041.

    I always found the clearest description came from Bruce McKinney's Hardcore Visual Basic:

    Endian refers to the order in which bytes are stored. The term is taken from a story in Gulliver’s Travels by Jonathan Swift about wars fought between those who thought eggs should be cracked on the Big End and those who insisted on the Little End. With chips, as with eggs, it doesn’t really matter as long as you know which end is up.

    And indeed, it is pretty crucial to know which end is up. This is especially interesting for UTF-16, which in the end is a bunch of arrays of WORDs that happen to correspond to characters in Unicode. The difference between U+0041 ("A", a.k.a. LATIN CAPITAL LETTER A) and U+4100 ("䄀", a.k.a. an ideograph in CJK Extension A that refers to calamity, disaster, evil, or misfortune) is quite striking!
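
    The same two bytes can be read both ways in a few lines of Python. This is only an illustration of the byte-order question; the variable names are invented for the example.

```python
pair = bytes([0x41, 0x00])   # the two bytes from the example above

# Read the pair as one 16-bit number in each byte order.
as_little = int.from_bytes(pair, "little")   # 0x0041
as_big = int.from_bytes(pair, "big")         # 0x4100

# Read the pair as one UTF-16 code unit in each byte order.
text_le = pair.decode("utf-16-le")           # LATIN CAPITAL LETTER A
text_be = pair.decode("utf-16-be")           # the CJK Extension A ideograph U+4100

print(hex(as_little), hex(as_big), text_le, text_be)
```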

    On Windows platforms, which are mostly little endian, UTF-16LE is just called "Unicode" and UTF-16BE is just called "Unicode (Big Endian)". Which is much less confusing for the majority of people who do not work cross-platform.

    (Speaking frankly, this does not bother me much -- anyone smart enough to be annoyed by the terminology is smart enough to know that not everyone is as smart as they are in these matters)

    For more information, simple web searches with the following search string:

    "big endian" "little endian"

    will return enough results to keep one busy for some time...

     

    This post brought to you by "䄀" (U+4100, a.k.a. an ideograph in CJK Extension A that refers to calamity, disaster, evil, or misfortune)

  • Sorting it all Out

    Sometimes it pays to be on drugs

    • 18 Comments

    (Nothing technical in this post, sorry!)

    I swear that none of what I am about to talk about has been intentional. I am merely a victim of circumstance.

    I have been taking Lipitor for a borderline cholesterol level which, when combined with my lack of discipline about diet, made folks in the medical establishment feel like I should perhaps try and be safe rather than sorry.

    And I have been taking Copaxone daily for my MS for the last few years, mainly because although I preferred the once-a-week Avonex, I was one of the small number of people who suffered flu-like symptoms, and I was tired of being sick once a week. I used to hate the notion of 'shooting up' daily, but I decided to get over it and just pretend it was like I was actually shooting up something illicit -- so I could have all the fun of being a druggie without any of the downsides of a life of crime and poverty....

    And since August 23rd I have been taking Novantrone, as I have mentioned in this blog before. And so far the echocardiogram is still looking good. So, Bob willing I'll be on it for a couple more years.

    Now I did not stop taking Copaxone during the time I have been taking Novantrone. I talked about it with my neurologist and at first she pointed out that if I was not tolerating the Novantrone that I'd just be back on the Copaxone anyway. And later I just never got around to stopping it, so I didn't.

    I also have lots of friends who send me new articles every time they see something on the web about Multiple Sclerosis. It is almost always sensationalistic, mainly because the people reporting on these things don't understand them, and even if they did the truth is never as sexy as they need it to be to get people interested. So I usually take what they send with a grain of salt.

    But two news items in particular were interesting to me:

    Lipitor-Copaxone Combo May Fight MS -- despite its upbeat nature and the fact that the positive results are with the animal model for MS, Experimental Autoimmune Encephalomyelitis (EAE) -- since MS cannot itself occur in mice -- and many EAE cures do not actually help with MS, it may well be good news.

    Drug combo fuels hope for multiple sclerosis -- the positive results in this three-year open label Copaxone/Novantrone combination therapy are fairly exciting (and I look forward to the article that should be in the upcoming issue of Neurology), though once again one has to be careful not to look too positively at popular news reports.

    It seems that I have unintentionally been involved with two interesting combination therapies? :-)

    I'll probably talk more about the second one after I read the article in Neurology. It will be years before anybody comes up with anything on the first one, but I'll just suggest no changes in my drug regimen for now....

  • Sorting it all Out

    Typing in random Unicode code points

    • 27 Comments

    People ask all the time how they can type in random Unicode data.

    Some people point out the vast array of supported Keyboard Layouts on Windows.

    Others point out how you can create your own keyboards with MSKLC.

    Still others talk about fancy things you can do with the numeric keypad.

    And then still others like to go on about typing a code point value in Word, highlighting it, and then hitting <Alt+X>.

    Personally, I like to just install the Unicode IME, first added for Traditional Chinese in Windows 2000 and available in every version of Windows since then. Just install it:

    and then it will be on your list of available input languages....

    Simple to use -- just switch to it with <Left Alt+Shift> and start typing hex numbers in any application....

    and then when you type a full Unicode code point, it will commit the character automatically!

    A very cool stealth feature available in all even moderately recent versions of Windows! :-)
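
    For the programmatically inclined, the hex-digits-to-character step looks roughly like this in Python. This is a sketch of the idea only, not a description of how the IME is implemented, and char_from_hex is a hypothetical helper name.

```python
def char_from_hex(hex_digits):
    """Turn typed hex digits such as '01b7' into the character at that code point."""
    value = int(hex_digits, 16)
    if not 0 <= value <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % hex_digits)
    return chr(value)


print(char_from_hex("01b7"))   # LATIN CAPITAL LETTER EZH
```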

     

    This post brought to you by "Ʒ" (U+01b7, a.k.a. LATIN CAPITAL LETTER EZH)
    A character that was feeling a little cheated by the small post it ended up sponsoring earlier -- thus the second sponsorship!

  • Sorting it all Out

    What is my locale? Well, which locale do you mean?

    • 29 Comments

    A few years back (some time before Windows XP shipped) when we were located in Building 9 and much smaller than we are now, someone else in the building was having a problem. Our kind of problem. An international problem. I don't remember what it was -- something to do with code pages, maybe?

    Anyway, Wei Wu, one of our cool development leads, asked a few configuration questions, and at the end of the message, asked him "what is your default system locale?".

    His response, which I cannot find a copy of now, was priceless. Assuming that Wei was going to stop by to look at the machine, this guy started to describe the location of his computer.... :-)

    One of those jokes that most people won't quite get, and jokes are never funny if they have to be explained. Ah, the life of an international geek.

    But it was pretty funny, I think. We all had a good laugh at the time.

    There is a not-so-hidden truth in there -- our terminology story is weak. It really ought to be better. So, once again, here is a quick glossary of the four most common types of locales:

    DEFAULT USER LOCALE (Windows XP term: "Standards and Formats"):

    This setting controls the way information is presented -- the sort order in list boxes, the format of date, time, number, and currency values, the calendar you prefer to use. The list of locales can be thought of as a big group of defaults that are grouped by many language/region pairings, which you can see on the first tab of the Regional and Language Options Control Panel applet. Several of the settings are customizable, particularly the various formats.

    The setting is per-user and when you change the setting, it is effective immediately, and all top level windows in the user's windowstation will get a WM_SETTINGCHANGE message indicating that the change has happened so that they too can reflect the change immediately.

    Developers will commonly use LOCALE_USER_DEFAULT as their LCID of choice, whether by doing so directly or by calling functions in SHELL, USER, or elsewhere that do so for them. Using this setting and behaving appropriately with the results thereby is a sure sign that the developer respects the user's settings.

    DEFAULT SYSTEM LOCALE (Windows XP term: "Language for non-Unicode Programs"):

    This setting has three major purposes:

    1. Specifies the default ANSI, OEM, MAC, and EBCDIC code pages to use for non-Unicode programs.
    2. Specifies some of the font linking preferences for CJK fonts and for legacy bitmap fonts.
    3. Specifies application behavior when developers incorrectly use this setting rather than the DEFAULT USER LOCALE.

    This setting is found on the third tab of the Windows XP/Server 2003 Regional and Language Options dialog and in the "Default" button on the first tab of the Windows 2000 Regional Options dialog.

    Changing the setting changes it for the entire machine and it requires a reboot to take effect. No notification is sent, nor is one needed since no change happens until the reboot does. A small number of misbehaving applications which check the registry rather than using the APIs will get wrong results after the setting change but before the reboot.

    Unfortunately, developers will sometimes use LOCALE_SYSTEM_DEFAULT for purposes other than #1 and #2, and by doing so they manage to simultaneously show their users disrespect and cause yet another compatibility weirdness with functionality tied to the default system locale. Given the fact that a reboot is required, you would think people would avoid this, but even developers are not always perfect.

    The XP name should be a big hint, though sometimes it adds confusion.

    DEFAULT USER INTERFACE LANGUAGE (Windows XP Term: "Language used in menus and dialogs"):

    This setting controls the language in which the UI is presented. It is only present if you have the MUI version of Windows (which is to say that you have Windows with the multilanguage files installed).

    This setting is found on the second tab of the Windows XP/Server 2003 Regional and Language Options dialog and in the middle of the first tab of the Windows 2000 Regional Options dialog.

    The setting is per-user and changing it requires a logoff to take effect.

    There is no constant for it but the GetUserDefaultUILanguage API will retrieve the setting quickly enough. Given the changes to the resource model that the changes to support MUI inspired, it is easy for applications to plug into the very same setting automatically. I'll talk more about this another time....
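
    The three settings described so far can be read with a few kernel32 calls. Below is a rough sketch in Python using ctypes; query_default_locales and split_langid are hypothetical helper names, the kernel32 functions are only reachable on Windows (the helper returns None elsewhere), and the arithmetic in split_langid is assumed to mirror the PRIMARYLANGID and SUBLANGID macros from the Windows headers.

```python
import sys

# Pseudo-LCIDs that ask the API for "whatever the user/system default is".
LOCALE_USER_DEFAULT = 0x0400
LOCALE_SYSTEM_DEFAULT = 0x0800


def split_langid(lcid):
    """Split the low 16 bits of an LCID into (primary language, sublanguage)."""
    langid = lcid & 0xFFFF
    return langid & 0x3FF, langid >> 10


def query_default_locales():
    """Return the user locale, system locale and UI language, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes
    kernel32 = ctypes.windll.kernel32
    kernel32.GetUserDefaultLCID.restype = ctypes.c_uint32
    kernel32.GetSystemDefaultLCID.restype = ctypes.c_uint32
    kernel32.GetUserDefaultUILanguage.restype = ctypes.c_ushort
    return {
        "user_locale": kernel32.GetUserDefaultLCID(),
        "system_locale": kernel32.GetSystemDefaultLCID(),
        "ui_language": kernel32.GetUserDefaultUILanguage(),
    }


print(query_default_locales())
```

    On a machine where the three values differ, printing them side by side is a quick way to see which "locale" a misbehaving application is actually keying off.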

    DEFAULT INPUT LOCALE (Windows XP Term: "Default Input Language"):

    This setting controls the initial input language used for all newly created threads.

    This setting is found on the second tab of the Windows XP/Server 2003 Regional and Language Options dialog (hit the "Details..." button) and on the last tab of the Windows 2000 Regional Options dialog.

    The setting is per-user and it takes effect immediately. But obviously it will not change the input language on any existing threads; only new threads get the new default.

    Developers can find out what the current setting is by calling the SystemParametersInfo API with the SPI_GETDEFAULTINPUTLANG parameter. You can even set it with the SPI_SETDEFAULTINPUTLANG parameter but this is almost always something that a developer should not be doing -- it is a user preference. Since proper application behavior is mostly about respect, this really is a constant that you should avoid. :-)

    For more on this topic, Dr. International has bigger lists here and here, and there is more information here.

    But when you are a developer, respect is the key here -- respect of the user's preferences and settings.

    When you are a user, consider which applications respect your settings and which do not. Because while doing so may not be convenient for an application, it is certainly possible....

     

    This post sponsored by "ð" (U+00f0, a.k.a. LATIN SMALL LETTER ETH)

  • Sorting it all Out

    Not the most sensible post to riff on, but we do deal with GD here at SIAO

    • 0 Comments

    It may not be the most sensible post for me to riff on, but it is the one that got me thinking about the issue, so into the breach I go.... 

    I did almost lose a keyboard since I had just taken a big sip of Limonata before looking over at the monitor when out of nowhere Chris Pirillo's Viagra vs. Cialis post popped up in FeedDemon.

    And it succeeded in catching my attention, if nothing else. Which was his stated goal, so well done there!

    My first thought was that the spammers had perfected the "spam post" technology and put a whole post up full of links -- since he uses Google for ads, in a way they did. :-)

    But I thought I'd try to look a bit deeper here....

    Because whether the conversation is comfortable or not, we are talking about GD (globalization dysfunction) here in this blog.

    There are many "cures" for it, but finding the right one for the circumstances can be a challenge.

    When dealing with two technologies that are mostly the same but have some sometimes subtle differences that really can drive a software developer toward one or the other, it may seem like trying to choose between two drugs that everybody knows something about (even if not quite so many of us know the differences between them).

    And I often get those questions when people have to choose between Win32 and MLang, between CultureInfo and GetLocaleInfo, between LCIDs and locale names, and so on.

    It's easy to be a cheerleader and say rah! rah! use the new stuff! and tell them what to choose. But that isn't helpful.

    I mean usually I might suggest Win32 over MLang, but what if we're talking about code page detection?

    I might usually suggest GetLocaleInfo (since I am on the Windows team) but what if we're talking about a managed code project?

    And I might usually suggest locale names, but what if you're just using LOCALE_USER_DEFAULT, or what if you are running downlevel?

    The real answer to the question "which technology should I use?" is really complicated -- pushing it into a sound bite Hillel style (where he told the heathen who wanted to learn the whole Torah while standing on one foot "What is hateful to you, do not do to your neighbor: that is the whole Torah while the rest is commentary; go and learn it.") might have some visceral appeal, but that usually just isn't going to work.

    And not just because I am no Hillel.

    (In fact, after talking to like three or four folks in Office and SQL Server over the last few weeks who had never known me before but were told to contact me for some help or answers by their colleagues, it is very clear that I have developed something of a reputation at Microsoft, and if one had to choose the rabbi it comes closest to in description, I am clearly closer in reputation to a Shammai than a Hillel, though in fairness I would never chase someone off with my builder's cubit, cane, or one iron for asking me a question!)

    But like I said, it is not just because of that.

    I really need to know what the actual requirement is -- what one wants the code to do.

    Blindly saying "use managed code" or "use names, not LCIDs" without any consideration for needs may not make one a complete and utter moron, but it does not make one a genius, by any means.

    And without knowing what you were doing, there is no way to answer the question. If there were, then I would have only needed like maybe 20 posts in this blog instead of the tangled mass it has become!

    So, just as you should probably go to the doctor if you needed an answer to the whole erectile dysfunction question and whether to use Viagra or Cialis, coming somewhere like here for questions related to globalization dysfunction just makes sense.

    Especially with Dr. International in such a non-responsive state (in a coma since February of 2006!). :-)

     

    This post brought to you by "∆" (U+2206, a.k.a. INCREMENT)
