
September, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    How to turn off the CAPS LOCK key

    • 43 Comments

    Earlier today, I talked about the fact that some people hate the CAPS LOCK key. Well, Jeff D. posted the steps to turn the CAPS LOCK off in Windows XP and Server 2003, and I thought that I would repost them so that others who are not looking at the comments would benefit:

    The other solution I've used is to simply change the cancellation of CAPS LOCK to be done with the Shift key instead. sO THAT i CAN'T ACCIDENTALLY DO THIS.

    1. Control Panel | Regional and Language Options applet | Languages tab | Details button, which gets you to the "Text Services and Input Languages" dialog.
    2. On that page "Add" a second input language (you don't have to make it the default). The two (or more) I usually have installed are "English (Canada) - US" and "English (United States) - US". Once you've got more than one input language installed, the "Key Settings..." button becomes enabled: click it. That opens the "Advanced Key Settings" dialog. Right at the top you'll see:
    3. To turn off CAPS LOCK: [ ] Press the CAPS LOCK key [x] Press the SHIFT key

    Voila! Now if you have accidentally hit the CAPS LOCK key, as soon as you start typing the next sentence, which presumably starts with a capital letter, you'll cancel the CAPS LOCK immediately.

    Very cool.... thanks, Jeff!

     


    When is a backslash not a backslash?

    • 36 Comments

    The character in question is U+005c, the REVERSE SOLIDUS, also known as the backslash or '\'. It is the path separator for Windows, which is encoded at 0x5c across all of the ANSI code pages.

    Since path separators are a pretty important requirement, the title of this post may seem a little scary -- how could it not be a backslash, a reverse solidus?

    Well, on Japanese code page 932, 0x5c is the YEN SIGN, and on Korean code page 949, 0x5c is the WON SIGN.

    Which is not to say that 0x5c does not act as a path separator -- it still does. And which is also not to say that the Unicode code points for the Yen and the Won (U+00a5 and U+20a9) do act as path separators -- because they do not.

    Of course the natural round trip mapping between U+005c and 0x5c happens on all code pages, and both U+00a5 and U+20a9 have one-way 'best fit' mappings to 0x5c on their respective code pages. This requirement technically went away with Unicode, when the characters were encoded separately.
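    As a rough illustration, here is a minimal sketch in Python. Its cp932 and cp949 codecs are Python's own tables, not the Windows ones, so the one-way 'best fit' mappings are not modeled here. The helper name round_trips is made up for the example. It only shows the round trip between U+005c and 0x5c, and that the three characters are distinct code points:

```python
import unicodedata

# Python's codec names for the Japanese and Korean Windows code pages.
CODE_PAGES = ("cp932", "cp949")

def round_trips(ch, code_page):
    """Return True if ch encodes to the single byte 0x5c and decodes back to itself."""
    encoded = ch.encode(code_page)
    return encoded == b"\x5c" and encoded.decode(code_page) == ch

# U+005c round-trips through byte 0x5c on both code pages.
backslash_ok = all(round_trips("\\", cp) for cp in CODE_PAGES)

# The three characters are distinct in Unicode, whatever a font draws for U+005c.
names = [unicodedata.name(ch) for ch in ("\u005c", "\u00a5", "\u20a9")]
print(backslash_ok, names)
```

    Nothing here says anything about what a Japanese or Korean font will draw for U+005c; that part is purely a rendering decision.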

    However, the issue is not simply that the old code pages lacked space while Unicode has plenty, such that customers would instantly move away from path separators that do not look like backslashes.

    In practice, after many years of code page based systems in Japan and Korea using their respective currency symbols as the path separators, it is believed customers were simply used to this appearance. And there was therefore little interest in changing that appearance (when the system settings were Japanese or Korean) to anything but those symbols.

    To support this expectation, Japanese and Korean fonts, whenever the default system locale is set to Japanese or Korean, respectively, will display the currency symbol rather than the backslash when U+005c is shown.

    But whether or not this is really what customers want is still an open question. Andrew Tuck of PSS here at Microsoft noted:

    When one of my customer’s from Korea was visiting here, I asked him if it bothered him that the backslash doesn’t appear as a backslash. It did bother him, and he believes it bothers most of his countrymen. However, he was fatalistic about it, "What can we do to change it. It’s been this way for a long time. We are used to it."

    Hardly a glowing recommendation, is it?

    And as Norman Diamond noted in his comments on this very blog (in this post), there are plenty of people in Japan who may not care for the convention, either.

    Of course there is no 'right' answer here, and I would imagine that you would find plenty of people who would be unhappy with such a change, just as there are those who would be unhappy with the status quo. Which perhaps explains why the status quo seems to be as it is -- those people who would like a change are resigned to the idea that it may never happen. And so they are now used to it....

     

    This post brought to you by "\", "¥", and "₩" (U+005c, U+00a5, and U+20a9, a.k.a. REVERSE SOLIDUS, YEN SIGN, and WON SIGN)


    Is Excel CSV misusing NLS functionality?

    • 17 Comments

    In the newsgroups, Yi Zhang asked:

    Hi all

    I've encountered the following situation in my work: a customer of my company reported that he can't read a file created in German user locale correctly from a English user locale. This is because German locale uses ',' as the decimal symbol while English uses '.' I tried to experiment with .csv files in Microsoft excel, and it seems that excel will always use current user locale to parse the .csv files, which could end up with incorrect result if the .csv file is created in a different user locale. So this seems to suggest that user should be responsible for setting the correct user locale before he tries to open a file. But this might be too much for the user. Is there a better way to do this? Or user should always be responsible for setting the correct user locale?

    Thanks in advance

    Yi Zhang

    I had vaguely remembered people mentioning something like this before, but I had always been using the Access wizard since it does a better job here.

    Anyway, I decided to look in Excel's help to see if it could provide any assistance. Here is what I found in the Excel 2003 help:

    You can change the separator character used in both delimited text files and comma separated values (CSV) text files.

    Change the delimiter in a delimited text file

    For a delimited text file, you can change the delimiter from a Tab character to another character in the second page of the Text Import Wizard. From the same wizard page, you can also change the way consecutive delimiters, such as consecutive quotes, are handled.

    Change the separator in a CSV text file

    1. Click the Windows Start menu.
    2. Click Control Panel.
    3. Open the Regional and Language Options dialog box.
    4. Click the Regional Options Tab.
    5. Click Customize.
    6. Type a new separator in the List separator box.
    7. Click OK twice.

    Note  After you change the list separator character for your machine, all applications will use the new character. You can change the character back to the original character by using the same procedure.

    Hmmmm. Well, I guess this is not quite as bad as instructions for drivers that actually show a picture of the "This driver is not signed..." dialog as part of the instructions to install their driver, but it is about the same order of magnitude. Also it is a little obnoxious to instruct people to change the settings that will affect all applications so that an individual file that may have been sent by someone else with different settings can be imported....

    I guess it is why I prefer the Access import text wizard, which lets you specify a different separator (after trying to initially guess at the one to use by looking at the actual file!)

    Regardless, Mihai responded to the post where I put the info from help:

    Sorry MichKa, but this is really bad. It is not your fault, I know, it is Excel (or the Excel team)!

    The help teaches one to mess-up the settings for all application because Excel does not understand some basic concepts:

    - The 10 years old bullet in Nadine Kano's book "All language editions can read one another's documents"

    - CSV (Comma Separated Values), means COMMA separated, not "separated by the locale-dependent list separator"

    =================

    Yi Zhang: my advice would be to try some formats other than CSV. I have done some preliminary tests, and Excel 2003 seem to work ok with XML, Tab delimited text, and Unicode Text (which is also tab delimited). I would vote with XML or Unicode Text (using Unicode, so you will have no problems with code pages and lost characters).

    I admit that Mihai has a point here -- one has only to read the name. But short of that (and to retain backward compatibility), I could live with a configurable setting, so that one can change the behavior for the one import without affecting everything else the user runs. And ideally something that will look at the data and make an intelligent guess would be a good thing....
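    To make that wish a little more concrete, here is a minimal sketch in Python of a per-import setting plus a naive guess made by looking at the data. It is not what Excel or Access actually does. The function names and the sample data are invented for the illustration, and the guess is deliberately simple: pick the candidate character that appears the same non-zero number of times on every line.

```python
import csv
import io

def guess_delimiter(text, candidates=(";", ",", "\t")):
    """Pick the candidate that occurs equally often (and at least once) on every line."""
    lines = [line for line in text.splitlines() if line]
    best = None
    for cand in candidates:
        counts = {line.count(cand) for line in lines}
        if len(counts) == 1:
            n = counts.pop()
            if n > 0 and (best is None or n > best[1]):
                best = (cand, n)
    if best is None:
        raise ValueError("could not guess a delimiter")
    return best[0]

def to_number(cell, decimal_sep):
    """Convert a cell to float using the given decimal separator; keep text as text."""
    try:
        return float(cell.replace(decimal_sep, "."))
    except ValueError:
        return cell

def read_rows(text, delimiter, decimal_sep):
    """Parse delimited text with explicit, per-import settings (no global locale)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[to_number(cell, decimal_sep) for cell in row] for row in reader]

# Hypothetical file written under German settings: ';' between fields, ',' in numbers.
german = "Name;Wert\nA;1,5\nB;2,25\n"
german_rows = read_rows(german, guess_delimiter(german), ",")
print(german_rows)
```

    The point of the design is only that both separators are arguments to the one import, so nothing else the user runs is affected.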

    The thread ended with some input from Louis Solomon:

    try open office ...

    Now Louis included no information on what Open Office does to make this situation different, for better or worse. So I honestly cannot say what the purpose of his three-word response was. I decided to search for his name connected to Open Office in newsgroups to see if he had perhaps said more about it somewhere else. But the only one I found was:

    Louis Solomon [SteelBytes] wrote:
    > The only thing that I can see so far that stops me switching to OpenOffice
    > (from MS Office), is lack of English UK (Australian in particular) support.

    Not entirely relevant, but it makes for an interesting counterpoint if he is not using the product (he may be now, but if so he has not been posting on it much under that name).

    Anyway, does anyone know if the recommendation can be salvaged? Does Open Office have a particular feature in this area that is easier and less involved than just changing the file extension to .TXT from .CSV?


    Some people hate the CAPS LOCK key

    • 23 Comments

    It would seem that I touched a nerve when I talked about various MSKLC Suggestions that had been forwarded to me. Something about the caps lock key, which apparently some people hate like poison, at least since 1989:

    I don't know why it hasn't made it to alt.peeves yet, but one of my most cherished and encouraged pet-peeves is the existence of such a useless, stupid, nasty, treacherous and evil menace to the progress of humankind is the CAPS LOCK KEY.

    Anyone who has used vi will know that there is no easier way to fuck up a document than to JJJJJ when you meant to jjjjj. And then there's those times whEN YOU ACCIDENTALLY HIT THE SONOFABITCH IN THE MIDDLE OF A SENTENCE. Or if you program in a language other than FORTRAN. Or if you want to type command names in UNIX. Or.....

    In my estimation, the only sensible place for a Caps Lock key would be in a fairly deeply recessed hole in the back of the monitor, or, preferably, in another room altogether.

    I mean, what the hell is the point of equipping every goddamn keyboard on the planet with a key that is only used deliberately about 1% of the times it gets hit? Why not just a "Fuck Things Up Randomly" key? What kind of stupid moron jackass would sell a piece of equipment with anything so thoroughly evil right there on the important part? And while I'm in a mood to name names, I think Sun should be shot for not only having this abomination in the sight of the Lord on every keyboard, but it doesn't have tactile-feedback (i.e. click-down/click-up like sensible keyboards), has no LED to notify the unsuspecting typist of the shit s/he's just gotten him/herself into. Just because they have nice workstations, people are willing to put up with this nonsense. Christ! What the hell kind of twisted mind would perpetrate a crime of such immense proportions on an unsuspecting public? I suspect Foul Play, and corruption and treachery at very high levels.

    Then again, I just tell my Xdesktop to change the keyboard-mapping to return to God's cherished mode in which hitting that stupid little fucker DOES NOTHING AT ALL. SEE HOW NICE IT IS?

    -Johnny When-I-want-caps-I'll-hit-the-fucking-Shift-key

    Well, I won't say much about Johnny Zweig's rant above (or the subsequent responses) other than to point out that I am very glad to be in here while all of these other people are out there. :-)

    But I will say a few things about the CAPS LOCK key:

    1) Every typewriter I have ever seen from manual Olympia to most modern has had one. It is great that some people want to ignore history, intuitive usage, backcompat expectations, and reality. But they should likely get over themselves and just disable their own somehow. The rest of us don't need to be deprived....

    2) Some languages use the SGCAPS functionality that puts a whole new pair of shift states that one may want to use, presumably in sequences where having to hold down multiple shift keys would be difficult. This may not be your language; if so, consider yourself lucky and stop trying to deprive others of features that are useful to them. If you look (for example) at the Hebrew keyboard, the SGCAPS feature allows two things -- (a) the ability to put either punctuation or numbers over the English text, depending on which makes it easiest to type what you want to type, and (b) the points that are sometimes but not often added to Hebrew text. So, can you reach all of those keys without SGCAPS? No. But even the ones you can reach have better/easier ways to be reached depending on what you are trying to do....

    3) Suggesting that a tool like MSKLC lead the irrational charge to get rid of the CAPS LOCK is like asking two ants to start pushing the earth off its axis so it will roll into the sun. Even if I were insane enough to think this was a good idea and the people who I report to were crazy enough to agree, it would have as much actual impact on the CAPS LOCK key as that pair of ants did. The CAPS LOCK key's presence in the world is significantly more tangible than a rubber tree plant....

    4) Put another way, have you ever heard of the tail wagging the dog as an expression of trying to get the thing that is acted upon to do the acting? Well, MSKLC controlling the way keyboards work on the platform is like the tail wagging the whole dog kennel! :-)

    Anyway, hopefully we can get back to being productive now. Anyone who wants to go and use MSKLC to create a keyboard where the CAPS LOCK has no effect is welcome to do so; it is a per-key setting that only influences the keys you want it to. And anyone who wants to convince the myriad of governments and OEMs and hardware producers and users who have had a CAPS LOCK where it is is also welcome to do that. MSKLC will follow whatever the bulk of the user base is doing, so if you convince everyone, MSKLC will eventually follow. :-)

     

    This post brought to you by "۩" (U+06e9, a.k.a. ARABIC PLACE OF SAJDAH)


    How many days in a weekend?

    • 17 Comments

    The other day, someone sent me some mail with an interesting question:

    Hi Micheal, was googling for an answer to this dev question "how do you detect the first day of the weekend and how many days there are in the weekend"

    Which is a good question. Cause GetLocaleInfo doesn't help there except to give you the first day of the week - but its not safe to assume the last two days are weekend days when your outside Europe etc.

    Any thoughts? Thought I'd ask you after reading your blog...

    My first thought was of Ursa Minor Beta, of which has been said:

    It is a West Zone planet which by an inexplicable and somewhat suspicious freak of topography consists almost entirely of sub- tropical coastline. By an equally suspicious freak of temporal relastatics, it is nearly always Saturday afternoon just before the beach bars close.

    But then I realized that I was not being tested about my Douglas Adams knowledge. :-)

    My second thought was that perhaps Microsoft should pay to send me to various exotic places to see if I can find anywhere that has longer than a two-day weekend every week!

    Then, back on Planet Earth I realized I should just come clean.

    There is no locale data carried around by Windows that tracks the first day of the weekend, and there is nothing to represent the number of days in the weekend. As far as I know, every locale calls LOCALE_SDAYNAME6 and LOCALE_SDAYNAME7 the weekend, and it is always two days long.

    Of course, given that LOCALE_IFIRSTDAYOFWEEK exists and that there are many locales that have different days that users would consider the "first day of the week", the whole term "weekend" is kind of unfortunate. And it does make me wonder if they even call it a weekend in every locale. And if they do not, then what happens to songs like Everybody's Working for the Weekend in those places? Or maybe Loverboy just didn't do as well in some locales....
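    Since Windows carries no such data, an application that cares would have to bring its own. Here is a minimal sketch in Python of what that might look like. It assumes the application supplies a "first day of the weekend" and a length itself; both parameters and both function names are hypothetical and do not correspond to any LCTYPE. Days are numbered the way Python's date.weekday() numbers them (0 = Monday through 6 = Sunday).

```python
from datetime import date

def weekend_days(first_day, length=2):
    """Weekday numbers (0 = Monday ... 6 = Sunday) that make up the weekend."""
    # The modulo lets a weekend wrap around the end of the week (e.g. Sunday + Monday).
    return {(first_day + offset) % 7 for offset in range(length)}

def is_weekend(day, first_day=5, length=2):
    """True if the given date falls on one of the configured weekend days."""
    return day.weekday() in weekend_days(first_day, length)

saturday = date(2005, 9, 24)
print(is_weekend(saturday))                 # default Saturday/Sunday weekend
print(is_weekend(saturday, first_day=6))    # a hypothetical Sunday/Monday weekend
```

    The defaults reproduce the Saturday and Sunday assumption described above, and anything else is whatever the caller claims it is.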

    Does anyone know better here? Is there some place that has a 3-day weekend every week, or somewhere that starts the weekend on Wednesday? I can fly there within the month. :-)

     

    This post brought to you by "ʄ" (U+0284, a.k.a. LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK)


    Fonts that are 'fixed-width' even if they do not claim to be

    • 18 Comments

    When you obtain a font in Windows (whether by CreateFont, CreateFontIndirect, CreateFontIndirectEx, any of the font enumeration functions, or whatever your preferred method is), you will almost always have the option of asking for a FIXED_PITCH font in the fdwPitchAndFamily parameter (or the lfPitchAndFamily member of the LOGFONT structure). The purpose of this is to give you a font with fixed width letters, such as the inestimable Courier New. The most striking characteristic of these fonts is that every letter has the same width, from the very thin lowercase i to the thickest uppercase W.

    Now of course there are many complex script languages that can be quite a challenge when it comes to designing a fixed width font due to the re-ordering, shaping, and/or stacking behavior in the underlying script, but that can be a topic for another day....

    Today I am going to talk about a whole class of fonts that are fixed width in a different way -- some of the fonts for Chinese, Japanese, and Korean. They cannot be requested via a FIXED_PITCH since they are technically not fixed width fonts -- they have two widths:

    • One for "full width" characters such as ideographs, Hangul, Jamo, Kana, and full width letters and digits;
    • One for "half width" characters such as all of the non-ideographic language characters and the halfwidth Jamo and Kana.

    By convention, the characters in the full width category are twice as wide as those in the half width category, and as far as I know there is not an intrinsic property of fonts that identifies these "split level" fixed width fonts.
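    The font may not tell you which characters fall in which bucket, but the Unicode East Asian Width character property gives a reasonable approximation. Here is a minimal sketch in Python, assuming the "twice as wide" convention above and ignoring combining marks, ambiguous-width characters, and whatever a particular font really does. The function name is made up for the example.

```python
import unicodedata

def display_columns(text):
    """Count columns, treating Wide ('W') and Fullwidth ('F') characters as two cells."""
    total = 0
    for ch in text:
        # east_asian_width returns one of 'W', 'F', 'H', 'Na', 'N', or 'A'.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

print(display_columns("abc"))               # plain ASCII letters
print(display_columns("\u65e5\u672c\u8a9e"))  # three ideographs
print(display_columns("\uff76\uff85"))        # two halfwidth katakana
```

    This looks at character data only, so it says nothing about whether the font in use actually honors the two-width convention.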

    As I mentioned here and here, while this was once more of a legacy, encoding based issue (the half width characters were literally the one-byte characters in an otherwise double byte code page for languages like Japanese and Korean), this is obviously no longer the case in Unicode (where all of the relevant characters will need the same number of bytes). However, it is very much an aesthetic issue, where in many contexts characters such as the full width Katakana are described as less pleasant visually.

    The first time I ran across that particular issue is when I was working on the Microsoft Access team, where all of the Access property descriptions are listed twice. These dual descriptions are identical for almost all languages but in Japanese the strings that show up in the property sheet use halfwidth katakana, while in other places the fullwidth katakana strings are used. At the time I asked why they did not just use LCMapString to convert the string, and it was pointed out to me that this would potentially convert ASCII letters, digits, and punctuation to their fullwidth equivalents, which was not really a good idea....

    (As an aside, Jesper Holmberg once told me a story about a localizer who found that none of the inserts were working in strings after they were localized for a managed application. It turns out that they were taking the curly braces used for the inserts (e.g. {0}, {1}, etc.) and converting them to their full width equivalents, converting U+007b and U+007d to U+ff5b and U+ff5d. No wonder the inserts did not work!)
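    The same failure is easy to reproduce outside of managed code. Here is a minimal sketch in Python, whose str.format placeholders happen to use the same {0} syntax. The sample strings and the checker function are invented for the illustration. It shows that fullwidth braces are just ordinary text to the formatter, and how a localization check might flag them.

```python
# Fullwidth look-alikes mapped to the ASCII braces a formatter actually understands.
FULLWIDTH_BRACES = {"\uff5b": "{", "\uff5d": "}"}

def find_fullwidth_braces(localized):
    """Return (index, character) pairs for every fullwidth curly bracket found."""
    return [(i, ch) for i, ch in enumerate(localized) if ch in FULLWIDTH_BRACES]

original = "{0} files copied"
localized_bad = "\uff5b0\uff5d files copied"   # braces widened during localization

print(original.format(3))        # the insert is replaced
print(localized_bad.format(3))   # nothing is replaced; the string comes back unchanged
print(find_fullwidth_braces(localized_bad))
```

    A check like this is cheap to run over localized resources before they ever reach a formatter.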

    Anyway, I thought about posting all of this after reading the post that Buck Hodges did earlier on figuring out how wide is a string when displayed in the console window, which is a practical application of when you may well need to know this information. Note Buck's final function:

        static int CalculateConsoleWidth(String text)
        {
            Encoding encoding = Console.Out.Encoding;

            if (encoding.IsSingleByte)
            {
                return text.Length;
            }
            else
            {
                return encoding.GetByteCount(text);
            }
        }

    Funny how it ends up using those encoding roots where this half/full width stuff all started, huh? :-)
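    For anyone who wants to see why the byte count works, here is a minimal sketch in Python of the same idea. It is not a port of Buck's function: it assumes the console is using the Japanese code page (Python's cp932 codec), and the function name is made up. In that encoding the half width characters take one byte and the full width ones take two, which is exactly the column count.

```python
def byte_width(text, encoding="cp932"):
    """Approximate console columns as the number of bytes in a double-byte code page."""
    return len(text.encode(encoding))

ideographs_and_ascii = "\u65e5\u672c\u8a9eabc"   # three ideographs plus "abc"
halfwidth_kana = "\uff76\uff85"                  # two halfwidth katakana

print(byte_width(ideographs_and_ascii))
print(byte_width(halfwidth_kana))
# The trick depends on the legacy encoding; UTF-8 byte counts do not line up with columns.
print(byte_width("\u65e5\u672c\u8a9e", encoding="utf-8"))
```

    Which is to say the technique is only as good as the assumption that the output encoding is one of these legacy code pages.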

     

    This post brought to you by "｛" and "｝" (U+ff5b and U+ff5d, a.k.a. FULLWIDTH LEFT AND RIGHT CURLY BRACKETS)


    Every character has a story #14: U-BOOP (BETTY BOOP)

    • 4 Comments

    Regular reader Maurits asked the following in the Suggestion Box:

    Every character has a story - even Betty

    Monday, May 09, 2005 5:00 PM by Maurits

    How about a post on the Betty Boop character?
    http://www.unicode.org/charts/PDF/UBOOP.pdf

    Funny you should ask (even funnier that you found that one!).

    The story goes something like this....

    There are many people who consider Joe Becker to be the father of Unicode.

    And what can be better than a father with a sense of humor?

    Perhaps that is the way to explain how U+BOOP ended up online, right?

    I asked Rick McGowan about it, and his reply explained how it ended up there, and maybe even why:

    This chart I think originated with Joe Becker, and you could ask him.

    At one time he installed U+BOOP into the character index as a test to see who was actually proofreading the index. I guess it stuck.

    In any case, you'll notice the date on which this chart was last modified...

    Indeed -- April 1, 2002!

    Joe decided not to comment too explicitly on it:

    Looks familiar ...!

    So even if she is not a part of the standard, the chart can hold a special place in our hearts. :-)


    I 𐿿 Unicode

    • 16 Comments

    In the true spirit of hilarious T-shirts for geeks involved with internationalization and/or computer typography, CafePress/NuclearTacos provides the

    I {entity} Unicode T-shirts!

    Now available in both Windows and Mac versions:

             

    Celebrating what happens when the font that you need is either not there or not linked properly. :-)

    When I wore the Windows one in the hallways of Microsoft, an impressive number of people got the joke (though like me, they may have realized they really are truly internationalization geeks because they got it!).

    Highly recommended, and for the record: no I do not get any kickback for the recommendation. :-)


    Every character has a story #15: CAPITAL SHARP S (not encoded)

    • 13 Comments

    Regular reader Maurits asked, in the Suggestion Box:

    Can you comment on Andreas Stötzner's 2004 proposal for an upper-case ß code point, which was rejected by the Unicode consortium?

    http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2888.pdf

    The proposal in question underwent a great deal of (not always entirely civil) conversation on the "member's only" list of Unicode....

    I have also posted about the Sharp S before, on this blog, for example here.

    The initial post about the proposal came from Markus Scherer of IBM:

    Purely personal opinion:

    I would have expected a proposal like this to see the light of day in a little less than 5 months...

    Aside from few "discussions" and other curiosities, the majority of the document's samples shows clear _lowercase_ ß in otherwise uppercase text. Using normal ß in this way (like applying simple case mappings rather than full ones) is reasonably common. While German school children might at first scratch their heads about this irregularity, I am pretty sure that there is no pressure at all for introducing an uppercase variant - other than possibly by a local font vendor in search of a market.

    It might be more likely for Germans to give up on ß than to add an uppercase version.

    markus

    http://www.daujones.com/comments_all.php?usrid=3504
    http://faql.de/eszett.html
    http://www.eibe-online.de/schulen/bfs_bensheim/darstellung_bensheim/FachbereichFarbtechnik.htm

    Michael Everson then weighed in:

    Look again. It shows capital sharp esses, though it does show small sharp esses in capital use because nothing else was available. The Duden evidence is not to be ignored.

    People have been discussing this issue for a century. I think Stötzner has shown clear evidence for a capital sharp s.

    Nobuyoshi Mori took a more technical approach to the analysis of the proposal:

    My understanding is:
        1) Technically toupper( U+00DF ) should be defined.  It is currently defined as : toupper( U+00DF ) -->  U+00DF
        2) There are several ways to "display" an "uppercase ß" in German:
          2-1) "SS"    This is what German orthography says. It is also the most usual way to handle it.
          2-2) "ß"     This is used in exceptional cases when either there is no space for two characters, or for typographic reasons, or by ignorance of the correct orthography.
          2-3) "SZ"     This is an old variant of 2-1, only very rarely used.

    The change of the current definition 1) breaks many existing Unicode implementations and data, and will cause compatibility issues.  The major issue is that the result of toupper( U+00DF ) becomes Unicode standard version dependent. 

    I know huge amount of Unicode implementations and Unicode customer data which will run into problems with the suggested Standard change.  Most of the database implementations, OS and PC products, Computer language implementations such as Java, C#, etc would be some of the examples. 

    ...

    I therefore would like to request UTC to refuse the proposal.

    Mark Davis agreed but had one small correction:

    See http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G33992

    The Unicode data supports two types of case operations: full and simple. The simple operations are for restricted environments where the number of characters cannot be changed. For any other situations the full mappings should be used. And when a full mapping is used, toUppercase( U+00DF ) --> "SS"
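    To see the full mapping in action, here is a minimal sketch in Python, whose str.upper() applies the full case mappings. Python has no built-in for the simple mappings, so only the full behavior is shown, and the sample words are just illustrations.

```python
sharp_s = "\u00df"                     # LATIN SMALL LETTER SHARP S
full_upper = sharp_s.upper()           # full mapping: one character becomes two

# Two different lowercase spellings collapse to the same uppercase string,
# so the original spelling cannot be recovered by lowercasing again.
with_sharp_s = "Wei\u00dfe".upper()
with_double_s = "Weisse".upper()

print(full_upper, len(full_upper))
print(with_sharp_s, with_double_s, with_sharp_s == with_double_s)
print(with_sharp_s.lower())
```

    The change in length is exactly why the simple mappings exist for environments where the number of characters cannot change.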

    Markus Scherer was also unconvinced by Michael Everson's response:

    I did look at the whole document including at each and every sample. Most of them are clearly lowercase ß between uppercase letters.

    "People" may have been talking about it depending who "people" is. I spent my first 27 years in Germany and have never heard of any serious discussion of an uppercase ß. (Not sure I even need to qualify this with "serious".) Unless there has since then been an outcry in the population that I missed while visiting about once a year or while talking with my relatives, I don't see that this is on anyone's mind.

    Real issues in discussion included the spelling of Kaiser (Keiser?) and other beloved words when the spelling reform was published.

    Michael Everson responded thusly:

    Which is what you would expect to find, in the absence of a more widespread availability of fonts with capital sharp esses. The evidence, and the author's arguments, suggest that the sharp ess is certainly acquiring case, and indeed has done so, whether the use of it is widespread or not. In my view, the Universal Character Set should encode such entities where they exist. They are facts.

    Ken Whistler had to respond to that argument, though (I was tempted to respond myself, but I am glad he did instead since he argued the case more convincingly):

    You are getting caught up in your own rhetoric about the UCS. The Universal Character Set is *not* the Universal Encyclopedia of Writing Systems -- it is a practical attempt at an engineering solution that everyone can use for digital representation of text.

    Introduction of a capital ess-tzet just because it "is a fact", and despite the manifest evidence of overwhelming German practice and implementation to the contrary, while utterly ignoring all the kinds  of implementation problems that would result -- just hinted at by Nobu -- is just foolish.

    The problem you are trying to deal with, namely the appearance of an majuscule design in some fonts for an ess-tzet in an all-uppercase context, can be dealt with by other techniques, specific to fonts and to word-processing systems (if there even proves to be a demand for it, which I doubt, given Markus' testimony). It does not require a muley insistence that because somebody shows in some context that it *might* be treated as a distinct uppercase letter, that that resolves all issues and makes it obvious that separate encoding is required in the Unicode Standard for this "thing".

    I am getting *really* impatient with the kind of rhetorical stance you have taken here. It is not your job nor the job of the UTC nor WG2 to reform the German writing system. And it certainly is not in the UTC's interest to introduce, on spec, a dubious new German character, without a demonstrated need, but with a horrendous downside potential for screwing up German casing implementations. My prediction is that the UTC is quite likely to turn this one down flat, without a single member in favor of it.

    As I said, much more convincing.... :-)

    But it looks like everyone dug their heels in; Michael responded to Ken:

    As a student of the world's writing systems, I maintain that what I said is true. The evidence, and  the author's arguments, suggest that the sharp ess is certainly acquiring case, and indeed has done so, whether the use of it is widespread or not.

    This may be an issue for some, or many, or most, current implementations. That's a concern for industry today. My work on the Universal Character Set, as you know, looks to tomorrow.

    Not being a complete idiot, my response to Stötzner on this particular character is "Get Germany and Austria behind the proposal."

    But facts are facts. You recently wrote a piece where you acknowledged that many of Unicode/10646's current "mistakes" will one day be purged. A hack for casing sharp-ess would seem to be one such. Stötzner's Weise, Weisse, Weiße casing to WEISE, WEISSE, WEISSE/WEIßE is a problem German implementations have to deal with now. I strongly suspect that the "solutions" are not AT ALL uniform or satisfactory to the Mr Whites out there. A capital sharp-ess would allow a consistent solution, and would, in my view, be superior to some sort of smart-font hack where a sharp-ess preceded by a capital letter would take on a different shape. That is not very portable, and, if I may remind you, from the 10646 side we are concerned with data preservation and transfer, not just implementation by big companies.

    Yes, these are philosophical differences in the two standards, but they are there nonetheless.

    >I am getting *really* impatient with the kind of  rhetorical stance you have taken here. It is not your job nor the job of the UTC nor WG2 to reform the German writing system.

    No, the Germans have been looking at that themselves. In 1902 they did, and Stötzner is doing it again today. That's also a fact.

    >And it certainly is not in the UTC's interest to introduce, on spec, a dubious new German character, without a demonstrated need, but with a horrendous downside potential for screwing up German casing implementations.

    I wouldn't encode the character on foot of this one proposal either. But there is a case to be made for this character, and it would be wrong to reject the proposal out of hand.

    Others such as Asmus Freytag and Benson Margulies also chimed in agreeing that an answer of 'proposal insufficient' seemed best at this point.

    John Hudson then mentioned:

    By the way, I met Andreas Stötzner at the recent ATypI conference in Prague, and am familiar with his journal. He is an intelligent and reasonable man, and I doubt if he would be insistent about the encoding of a Capital Double S if the text encoding and processing impact were explained to him. He has documented, in an admirably thorough way, a development in *some* German typography, which needs to be addressed at some level of text encoding or display. It is not obvious that the best way to do this is to encode a new character.

    There was a bit more discussion, but it kind of petered out without any real sense of consensus.

    Shortly thereafter, at the November 2004 UTC meeting in Cupertino, CA, a bunch of discussion ensued, but in the end Consensus 22 happened:

    [101-C22] Consensus: The UTC concurs with Stoetzner that Capital Double S is a typographical issue. Therefore the UTC believes it is inappropriate to encode it as a separate character.

    and it was added to the Rejected Characters list with the following comment:

    LATIN CAPITAL LETTER DOUBLE S (existence as character not demonstrated; would cause casing problems for legacy German data)

    I probably would not have worded the consensus in just that way, but the end result would have been the same....

    Three months later, a thread came up on the Unicode List about the Sharp S and uppercasing it, which mainly dwelt on issues other than adding the character. So I will spare everyone the further conversation. :-)

  • Sorting it all Out

    A shorter 'shortest published sentence of the year' ?

    • 16 Comments

    A little while ago, Geoffrey K. Pullum sent up to the Language Log a post entitled Shortest published sentence of the year.

    And I do agree that the sentence in question:

    Z.

    is indeed a very short one.

    But it got me thinking.

    There were all those old IBM manuals that would start new sections by being sure to print on a page that would allow you to take a section out without robbing the previous or the next section of their pages.

    Now that is a very noble and sensible goal for manuals that are published in three ring binders obviously intending to facilitate such practices, but that would mean that on a regular basis there would be blank pages.

    Of course if you look at technical material, a blank page is a little scary -- perhaps it did not get printed, and important material is lost. So they would helpfully print somewhere on the page the following witty text:

    (This page left intentionally blank.)

    Of course the irony of the situation is inescapable (they made the page decidedly non-blank by printing the sentence, after all). It certainly has a shakier basis than the explanation of why you have to click the Start button to shut down.

    Perhaps there is a construct that could be used that would allow what is printed to fulfill its purpose without giving the lie to its own claim. The current state of affairs is simply far too Gödelian!

    So it got me thinking....

    In the spirit of a Hofstadterian self-referential sentence, I could say:

    The next sentence is left intentionally blank. .

    In case you wish to object about the setup required for it to be in any way sensible, I would counter that "Z." needed even more of a setup than the one preceding sentence....

    Makes for a smaller sentence, no? :-)

     

    This post brought to you by "." (U+ff0e, a.k.a. FULLWIDTH FULL STOP)
    (a character that would perhaps not be a part of the shortest sentence of the year if its halfwidth cousin was also printed somewhere)

  • Sorting it all Out

    Extending collation support in SQL Server and Jet, Part 0 (HISTORY)

    • 14 Comments

    It was some time back in the late 90s.

    Folks on the Microsoft Jet [red] team were really worried about the Jet reliance on the OS collation functions for CJK sorts, since they would give different results across several of the different supported versions of Windows.

    To solve this problem, a Program Manager on the Microsoft Jet [red] team staged a raid on the collation data used by Windows. The data was used to create a solution that would give consistent results across all platforms. It would fold together all of the collations that gave identical results (thus no need for separate entries for both Norwegian and Danish, or for Swedish and Finnish, etc.). The project code name was known as Unicorn, and it first shipped in Access 2000 as a dll named MSWSTR10.DLL.

    The folks in SQL Server, who were facing the same problem, took the Jet [Red] Unicorn solution and in SQL Server 2000 shipped with something they called SQLSORT.DLL.

    In the fancy tradition of the Jet [red] API, the exports for both of these DLLs had their names stripped. And both versions of the solution made their way out into the world, in Office 2000 and SQL Server 2000.

    The problem? Well, the solution had many problems.

    First of all, when I said that the PM staged a raid, I was not in any way exaggerating the point. It was done without our knowledge. I know that because he did it just after Windows 2000 Beta 1 and before Windows 2000 Beta 2. It was during that weird period just after then SDE Julie Bennett had once again tried to fix the Turkic 'I' problem and just before she was forced to take the fix back out due to backcompat breaks. And it was also just after all of the DEFAULT TABLE work for Indic and other languages was added but just before most of the necessary EXCEPTION and COMPRESSION data was added for those languages. But since we were not told about the fact the information was being borrowed, we could not warn them to pick up the update to the data that would make the proper results available.

    The kicker (for me) was some words in one of the specs:

    Because high speed of the underlying sorting functions is essential to the efficient operation of Database products, the Windows NT code was substantially optimized when it was ported.  For most cases the MSWSTR10.DLL functions are about 50% faster than the Windows NT equivalent functions, but for some languages such as Thai the speed improvement is much, much higher.

    I am sure that folks who wanted a 50-300% speed improvement in languages that use compressions (which is where most of the optimization was done) would have appreciated having the issue communicated back to the team that provided the code and the data. However, when you swipe a wallet you probably don't warn the victim that their fly was open.... :-)

    In other words, the whole project can probably serve as a textbook example of why teams need to work together, in collaboration with each other. Because if they do not, then in the end everyone suffers....

    Well, FWIW, that situation has since been long fixed on the SQL Server side -- we now do work in collaboration with folks to provide proper solutions, and in part because of that cooperative spirit SQL Server 2005 will ship with many updated collations based on the Windows Server 2003 data (including the proper support for all of those languages that were missed the first time around in Windows 2000).

    Now it does not help with (for example) all of the new ELK language support that has been added (as discussed here and here) -- none of that is in Yukon (more on that in a second).

    For the Jet side things are not as good as that, since there is no Jet [Red] update (even Access 2003 still ships with Jet 4.0, just like Access 2000 and 2002 did) to pick up fixes to those problems. So Access/Jet basically has those older tables, missing support for at least 40 languages as of those two ELK releases.

    (ASIDE: I do keep calling it Jet [Red] to distinguish it from the Jet [Blue] engine that actually still does call our collation functions and never went in for that snapshot stuff -- they ship with the OS and need to support every language the OS does. And I promise that if Brett Shirley ever starts blogging that I will be reading it!)

    In the meantime, people have noticed this problem. We claim that Hindi has been supported since Windows 2000, but a Hindi speaker tries to use Access or SQL Server and sees that there is no good collation support for it. Or they get excited about the Quechua or Mapudungun or Maltese support we added in ELKs but again neither Access nor SQL Server seems to show that such support exists. And there is no way to look at the upcoming language list for Longhorn and the new locales being added and not get downright depressed about this whole issue, and the fact that as we get more agile in Windows we are starting to make these other products look worse thereby.

    Anyway, I have had several people talk to me lately, since these new ELK languages have been coming out and since even Vista Beta 1 has an impressive list of languages added (at the recent Internationalization and Unicode Conference, Ning Jin-Grisaffi and Kieran Snyder covered this, including lots of detail on the Tibetan, Mongolian, Uighur, and Yi support!). They want to know how to get support for these languages in either Jet or SQL Server (or both), when running on Vista.

    Now most of the above was written over the last few months as I worked to provide the answer to that question -- which was going to be that there was no answer, unfortunately. Sorry, go complain to those products, it is their mess.

    However, I then figured out a solution (well, several possible solutions) that would actually be able to provide assistance in these scenarios. And thus this post series (Extending collation support in SQL Server and Jet) was born. You can consider this post to be -- as the title indicates -- Part 0, the historical aspects. As I am sure you can imagine, since I am promising solutions (and particularly considering how bleakly the historical picture has been painted) there is nowhere to go but up.... and the going back up part is going to be a lot of fun.

    So if you care about being able to extend the language support of these database products, then stay tuned and prepare to have your socks knocked off!

     

    This post brought to you by "ख" (U+0916, a.k.a. DEVANAGARI LETTER KHA)

  • Sorting it all Out

    Extending collation support in SQL Server and Jet, Part 1 (the broad strokes)

    • 3 Comments

    Prior posts in the series:

    Extending collation support in SQL Server and Jet, Part 0 (HISTORY)

    What makes this problem so much easier in the soon to be released Yukon (SQL Server 2005) is that several different technologies are coming together that, when combined together, make it all easier. You can work without them (I will explain how in future posts) but it is never so easy as when it all comes together....

    Those features are:

    1. Binary collations in SQL Server (these have been available for many versions)
    2. The binary and varbinary datatypes in SQL Server (these have also been available for many versions)
    3. Windows-only CultureInfo objects (available in Whidbey)
    4. SQLCLR integration features, allowing one to create functions in .NET that can be called from stored procedures (available in Yukon)

    I will explain each of them individually.

    Binary collations in SQL Server

    One of the most important characteristics of some of the new locales in Windows is that there are no collation weights for many of them (a-la-the jury will give this string no weight). When every row of data in a particular column is equal to every other row, it can be quite un-nerving (for obvious reasons). But using a binary collation means you can not only work around this problem, but that you can do so in a very fast way. Not much that needs to be said about this one, other than that. You may never even need to use this collation, but you may. It is worth doing.

    As for which one to choose -- now that SQL Server has its own version of .NET "ordinal" comparisons (as I have mentioned before), the specific collation language choice only matters for the sake of legacy data that you may need to convert into or out of Unicode. And most of these new locales have characters that are not on any code page. So it truly does not matter which you choose here in most cases. When it does matter for your application, just be sure to choose keeping that conversion requirement in mind.

    The binary and varbinary datatypes in SQL Server (these have also been available for many versions)

    One of the things that we will be doing to make sure that your application can work as quickly as if collations for these other locales were built into SQL Server is to build indexes on these columns. Stored in these columns will be sort keys that you will generate any time a new string value is inserted or the string value in question is changed. You can then build an index on this binary column and sort by it or use it to search for information (sorting is easy but the search can be intriguingly tricky so I will be talking about some of the more obscure details of using the index if you need to do so another day).

    For now, just keep in mind that you will be adding one of these columns for every text column that might contain strings that will require the custom collation. You are mostly redoing what SQL Server does anyway, but you are doing it with much more knowledgeable functions, for these new languages....
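
    To make the sort key idea a bit more concrete, here is a minimal sketch in C. It is not the SQLCLR solution this series is building toward; it leans on the C standard library's strxfrm as a stand-in for whatever produces the sort key, and it assumes the program's current locale (the plain "C" locale unless setlocale is called). The helper names make_sort_key and compare_by_sort_key are made up for illustration. The point is only that once a key exists, a plain binary comparison of two keys orders the original strings the way the collation would.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a sort key for s in the current locale.
   The caller frees the result. Returns NULL if allocation fails. */
char *make_sort_key(const char *s)
{
    assert(s != NULL);
    size_t needed = strxfrm(NULL, s, 0) + 1;   /* key length plus the NUL */
    char *key = malloc(needed);
    if (key != NULL)
        strxfrm(key, s, needed);
    return key;
}

/* Compare two strings through their keys, the way an index built on a
   key column would order them. Returns <0, 0 or >0 like strcmp. */
int compare_by_sort_key(const char *a, const char *b)
{
    char *ka = make_sort_key(a);
    char *kb = make_sort_key(b);
    assert(ka != NULL && kb != NULL);
    int result = strcmp(ka, kb);               /* binary compare of the keys */
    free(ka);
    free(kb);
    return result;
}
```

    In the database version of this, the output of make_sort_key is what would land in the binary column, and the index on that column does the job of compare_by_sort_key.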

    Windows only CultureInfo objects (available in Whidbey)

    In the manner of great timing, the article Cathy and I wrote for the October 2005 MSDN Magazine is available today! I'll excerpt from the article to explain Windows Only CultureInfo objects.

    I do want to take a moment to call out the efforts of the Software Design Engineer who did the bulk of the work to support these special culture types: Tarek Mahmoud Sayed. His efforts here not only make the story of running the .NET Framework on platforms that are newer and have more features than it possible, but they make that support seamless and as easy to use as any of the built-in cultures that have always been available. There are lazy developers (like me) who can blog and come up with clever ideas now and again, and there are incredibly smart and hardworking developers like Tarek who make me happy to go to work each day, to one of the smartest development teams around. Thanks, Tarek!

    So, without further ado, here is that excerpt:

    Cultures in Windows and the .NET Framework

    The .NET Framework was released at an interesting point in the history of Windows: after Windows XP was released, but before Windows Server™ 2003. As a result, the list of cultures available in the .NET Framework matched the locales included in Windows XP (and provided a superset of the locales included in previous versions of Windows). Developers didn't have to consider the consequences of new locales on Windows. There would be no issue with how these new Windows locales would interact with a version of the .NET Framework that tries to, for example, base its culture settings on the choices available in the operating system. The .NET Framework has always maintained its own data so that it could return the same results on all possible platforms, and until Windows XP SP2, this had never caused any difficulties.

    The globalization development team had to address this problem, however, after Windows XP SP2 shipped with 25 new locales. For a listing of the new locales for SP2 only, see New Locale and Language Features in Windows XP. Imagine our surprise when one of our testers discovered that you could not even start a managed application when installing an early build of SP2 and using one of those new locales as the default user locale! This was clearly an issue we needed to address immediately in earlier versions of the .NET Framework, and fix more fully in the .NET Framework 2.0.

    Future Windows service packs may include additional locales. Windows Vista™ (formerly codenamed "Longhorn") is expected to ship with additional locales above and beyond what have been supported to date; so that presents a very possible situation where an installed version of Windows could include locales that are not recognized cultures in the .NET Framework. Therefore, it's imperative that the .NET Framework gracefully handle Windows locales in a managed environment. Figure 3 shows Francois Liger's Culture Explorer (available for download from www.gotdotnet.com), which illustrates how the .NET Framework 2.0 picks up the new locales in Windows Vista through the Windows-only cultures.

    The .NET Framework can now handle previously unidentified Windows locales by using the Win32® API to synthesize a CultureInfo object any time a locale supported in Windows has no corresponding culture in the .NET Framework. These cultures can be created either by name or by LCID, just like any other culture. The following code enumerates the Windows-only cultures available on the current machine, each of which can then be created by name (new cultures on Windows XP SP2 include mt-MT, bs-BA-Latn, smn-FI, smj-NO, smj-SE, sms-FI, sma-NO, sma-SE, quz-BO, quz-EC, quz-PE, ml-IN, bn-IN, cy-GB, and more):

    ' Visual Basic 
    For Each ci As CultureInfo In CultureInfo.GetCultures( _
            CultureTypes.WindowsOnlyCultures)
        Console.WriteLine(ci.Name)
    Next
    
    // C#
    foreach(CultureInfo culture in CultureInfo.GetCultures(
            CultureTypes.WindowsOnlyCultures))
    {
        Console.WriteLine(culture.Name);
    }
    

    This is obviously a break from the typical practice in the .NET Framework of giving the same results independent of the platform. However, given the choice between failing completely and succeeding when there is a way to retrieve the data, the option of handling Windows-only cultures successfully provides a better solution for developers who expect some type of culture data returned by the .NET Framework for these Windows-only locales.

    You'll notice in the previous code snippet that the CultureInfo.GetCultures static method was used to retrieve a collection of Windows-only cultures. While GetCultures and the CultureTypes enumeration existed in previous versions of the Framework, the .NET Framework 2.0 rounds out the enumeration with more options in order to provide better support for custom and replacement cultures. One of these new values is WindowsOnlyCultures. Figure 4 provides a comparison of the various culture types.

    I'll show that great screenshot of the Culture Explorer here, everyone loves good art:

    How we plan to use these objects is hopefully obvious -- if you get the CultureInfo in this way, any sort key we create by using CultureInfo.CompareInfo.GetSortKey, any comparison done via CultureInfo.CompareInfo.Compare, will all use the support built into Windows for the language.

    SQLCLR integration features, allowing one to create functions in .NET that can be called from stored procedures (available in Yukon)

    In the same way that we effortlessly added Oriya support to the Culture Explorer by just sitting on a Windows Vista box and the or-IN (Oriya - India) culture that was made available thereby, you can effortlessly add support for an Oriya collation if your SQL Server 2005 indexes are being built on a Windows Vista machine!

    I will definitely be covering the details of doing this in future articles, but the basic model should be fairly obvious -- a managed assembly containing some basic procedures for creating indexes and making comparisons, and some triggers for inserts and changes to the text columns that will in turn update the indexes. You can of course sort the data via that index column. Easy!

    Most search operations can be done by using the original binary index described above, and a future column will describe how to do it with the homemade index (that part is a bit harder but if I provide the code it will hopefully be easy enough to use).

    But the most important thing to keep in mind is that this integrated solution gives you the power of SQL Server 2005, extended to the 25 locales added in the first ELK release, the 11 added in the second ELK release, or the many being added to Vista (you can see the beta if you have it, or you can sneak a peek at that screen shot to see some of the culture names in the list of cultures in the dialog).

    In a word, amazing.

    Future posts in this series can be put into two categories:

    • the details of doing these various steps, with sample code and stored procedures
    • supporting all of the downlevel scenarios, up to and including support in Access/Jet

    Feel free to let me know if you are impressed. :-)

     

    This post brought to you by "ଣ" (U+0b23, ORIYA LETTER NNA)

  • Sorting it all Out

    The basics of supplementary

    • 20 Comments

    I thought I would explain a bit more about how surrogates work in Unicode, since it does not seem very well described in a whole lot of places. First, some definitions (all from the Unicode Glossary and the Unicode Roadmap sites):

    Ok, it is all as clear as mud now, right? :-)

    The problem is that even if the definitions are applied consistently, there is no good feel for exactly how they work, how high and low surrogates combine, and so on.

    (Other questions, like why high surrogates have lower numbers than low surrogates, are covered in other posts.)

    Let's see if we can't do something about that....

    (Warning: some MATH content ahead!)

    We start with the Basic Multilingual Plane -- it is the code points from U+0000 to U+FFFF. Some of these code points are assigned, and a large subset of those are assigned characters. In all there are 65,536 code points in this and every other plane; you can also think of this as 0x10000 or just 2^16 code points. Whatever you find easiest, conceptually.

    Now what happens with those high surrogate code points is that the block of 1024 of them is divided into 16 blocks of 64 each. And each one of those blocks is used for a plane:

    • U+d800 - U+d83f (Plane 1, Supplementary Multilingual Plane)
    • U+d840 - U+d87f (Plane 2, Supplementary Ideographic Plane)
    • U+d880 - U+d8bf (Plane 3, Reserved)
    • U+d8c0 - U+d8ff (Plane 4, Reserved)
    • U+d900 - U+d93f (Plane 5, Reserved)
    • U+d940 - U+d97f (Plane 6, Reserved)
    • U+d980 - U+d9bf (Plane 7, Reserved)
    • U+d9c0 - U+d9ff (Plane 8, Reserved)
    • U+da00 - U+da3f (Plane 9, Reserved)
    • U+da40 - U+da7f (Plane 10, Reserved)
    • U+da80 - U+dabf (Plane 11, Reserved)
    • U+dac0 - U+daff (Plane 12, Reserved)
    • U+db00 - U+db3f (Plane 13, Reserved)
    • U+db40 - U+db7f (Plane 14, Supplementary Special-purpose Plane)
    • U+db80 - U+dbbf (Plane 15, Supplementary Private Use Area A)
    • U+dbc0 - U+dbff (Plane 16, Supplementary Private Use Area B)
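
    The table above boils down to one subtraction and one shift, since each plane owns 64 consecutive high surrogates. Here is a small sketch in C; the function name plane_of_high_surrogate is made up for illustration and is not from any SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: which supplementary plane (1..16) does a high
   surrogate lead into? Each plane owns 64 consecutive high surrogates. */
int plane_of_high_surrogate(uint16_t hi)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);   /* must be a high surrogate */
    return ((hi - 0xD800) >> 6) + 1;        /* 64 == 1 << 6 */
}
```

    So U+d800 through U+d83f give 1, U+db40 gives 14, and U+dbff gives 16, matching the list.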

    By convention, U+[##]FFFE and U+[##]FFFF of each plane are set aside and reserved, never to be assigned. This allows internal processes to use them as sentinels. Note that they should never be interchanged with any other process!
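
    Because the two reserved values sit at the very end of every plane, spotting them takes a single mask. A minimal sketch in C follows; the name is_plane_end_noncharacter is hypothetical, and the check covers only the two code points per plane described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for the last two code points of any plane,
   i.e. U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. */
bool is_plane_end_noncharacter(uint32_t cp)
{
    assert(cp <= 0x10FFFF);             /* stay inside the Unicode range */
    return (cp & 0xFFFE) == 0xFFFE;     /* low 16 bits are FFFE or FFFF */
}
```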

    Now the way things are numbered, each high surrogate is used, serially, combining with every possible one of the 1024 low surrogates before moving onto the next high surrogate. Thus for supplementary characters you see the following order:

    • U+d800 U+dc00 -> U+10000
    • U+d800 U+dc01 -> U+10001
    • U+d800 U+dc02 -> U+10002
    • ...
    • U+d800 U+dffd -> U+103fd
    • U+d800 U+dffe -> U+103fe
    • U+d800 U+dfff -> U+103ff
    • U+d801 U+dc00 -> U+10400
    • U+d801 U+dc01 -> U+10401
    • U+d801 U+dc02 -> U+10402
    • ...
    • U+dbff U+dffd -> U+10fffd
    • U+dbff U+dffe -> U+10fffe
    • U+dbff U+dfff -> U+10ffff

    (I skipped some spaces in there for obvious reasons!)

    This mechanism allows for many things such as simple range checking and easy conversions between code point and surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).
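
    For the curious, the conversion can be sketched in a few lines of C. These are illustrative functions with made-up names (pair_to_code_point, code_point_to_pair), not the macro mentioned above; they simply follow the numbering shown in the list, where each high surrogate covers 1024 consecutive code points.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for this sketch only. */

/* Combine a high/low surrogate pair into a supplementary code point. */
uint32_t pair_to_code_point(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

/* Split a supplementary code point into its surrogate pair. */
void code_point_to_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t offset = cp - 0x10000;              /* 20 significant bits */
    *hi = (uint16_t)(0xD800 + (offset >> 10));   /* top 10 bits */
    *lo = (uint16_t)(0xDC00 + (offset & 0x3FF)); /* bottom 10 bits */
}
```

    Feeding in U+d801 U+dc00 gives U+10400, and going the other way from U+10400 gives the same pair back.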

    When combined with the way that scripts are assigned in blocks, it is easy to notice things like the following (not a complete list, just a sample!):

    So when you combine the BMP's 2^16 code points with the 16 planes of 64 * 1024 (which is also 2^16 code points!), you get 17 * 2^16 or 1,114,112 code points in total -- which is where that interestingly arbitrary-looking number comes from!

    Unicode's Roadmap site has a lot of information about the potential placement of future character allocations in Unicode, for those who are interested.

    And for a more reality-based set of links, if you look ahead to Windows Vista three macros have been added to the winnls.h that comes with the Vista SDK:

    I would expect that the meanings are pretty self-explanatory, but if not you can look at the VSDK topics to which I linked. :-)
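
    If you are not on Windows, the same kind of range check is easy to sketch in portable C. The functions below are hypothetical stand-ins with made-up names, not the winnls.h macros themselves, and the little counting routine is just one example of what such checks are good for: walking a UTF-16 buffer and treating each valid pair as a single code point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the winnls.h macros themselves. */
bool is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Count code points in a UTF-16 buffer; returns -1 on an unpaired surrogate. */
long count_code_points(const uint16_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    long count = 0;
    for (size_t i = 0; i < len; i++) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 >= len || !is_low_surrogate(s[i + 1]))
                return -1;          /* high surrogate with no partner */
            i++;                    /* consume the low surrogate too */
        } else if (is_low_surrogate(s[i])) {
            return -1;              /* low surrogate with no lead */
        }
        count++;
    }
    return count;
}
```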

    (On a side note, I find it very cool that the Windows Vista SDK is available right now to everyone, whether they are on the Vista beta or not. It really does help to explain features and functions!)

    Now in future posts I could perhaps get into other topics, like algorithmic conversion between UTF-16 and UTF-32....

     

    This post brought to you by all of the supplementary planes of Unicode

  • Sorting it all Out

    No charset meta tag?

    • 8 Comments

    I just got a message from Mihai yesterday:

    Digging into the indented links issues a couple of days ago, I got to see the source of your blog.

    To my surprise, there is no charset meta in the head section. Deadly sin! :-) Anything to say in your defense? :-)

    Hmmmm, Mihai is right. There is no charset meta tag there, so the page gives no assistance when it comes to Character Set Recognition.

    I have nothing to say in my defense. And it looks like a long-standing problem -- .Text (the predecessor to Community Server) has it too.

    I don't think Scott Watermasysk actually reads this blog. I guess I could ask on his or something.

    I am impressed at how well my blog has worked all that time with all the various characters I put up here!

    (I'll show off some of that a bit now!)

     

     

     

     

    ॥ਂਅਆਇਈਉਊਏਐਓਔਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਲ਼ਵਸ਼ਸਹ਼ਾਿੀੁੂੇੈੋੌ੍ਖ਼ਗ਼ਜ਼ੜਫ਼੦੧੨੩੪੫੬੭੮੯ੰੱੲੳੴ

    ।॥ઁંઃઅઆઇઈઉઊઋઍએઐઑઓઔકખગઘઙચછજઝઞટઠડઢણતથદધનપફબભમયરલળવશષસહ઼ઽાિીુૂૃૄૅેૈૉોૌ્ૐૠ૦૧૨૩૪૫૬૭૮૯

    ԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿՀՁՂՃՄՅՆՇՈՉՊՋՌՍՎՏՐՑՒՓՔՕՖՙ՚՛՜՝՞՟աբգդեզէըթժիլխծկհձղճմյնշոչպջռսվտրցւփքօֆև։֊

    აბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰჱჲჳჴჵჶ჻

    ।॥ಂಃಅಆಇಈಉಊಋಌಎಏಐಒಓಔಕಖಗಘಙಚಛಜಝಞಟಠಡಢಣತಥದಧನಪಫಬಭಮಯರಱಲಳವಶಷಸಹಾಿೀುೂೃೄೆೇೈೊೋೌ್ೕೖೞೠೡ೦೧೨೩೪೫೬೭೮೯

    ।॥ঁংঃঅআইঈউঊঋঌএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ়ািীুূৃৄেৈোৌ্ৗড়ঢ়য়ৠৡৢৣ০১২৩৪৫৬৭৮৯ৰৱ৲৳৴৵৶৷৸৹৺

    กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศสหฬอฮฯะัาำิีึืฺุู฿เแโใไๅๆ็่้๊๋์ํ๎๏๐๑๒๓๔๕๖๗๘๙๚๛

    ఁంఃఅఆఇఈఉఊఋఌఎఏఐఒఓఔకఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళవశషసహాిీుూృౄెేైొోౌ్ౕౖౠౡ౦౧౨౩౪౫౬౭౮౯‌

    ംഃഅആഇഈഉഊഋഌഎഏഐഒഓഔകഖഗഘങചഛജഝഞടഠഡഢണതഥദധനപഫബഭമയരറലളഴവശഷസഹാിീുൂൃെേൈൊോൌ്ൗൠൡ൦൧൨൩൪൫൬൭൮൯

    ஂஃஅஆஇஈஉஊஎஏஐஒஓஔகஙசஜஞடணதநனபமயரறலளழவஷஸஹாிீுூெேைொோௌ்ௗ௧௨௩௪௫௬௭௮௯௰௱௲

    ँंःअआइईउऊऋऌऍऎएऐऑऒओऔकखगघङचछजझञटठडढणतथदधनऩपफबभमयरऱलळऴवशषसह़ऽािीुूृॄॅॆेैॉॊोौ्ॐ॒॑॓॔क़ख़ग़ज़ड़ढ़फ़य़ॠॡॢॣ।॥०१२३४५६७८९॰

    ༀ༁༂༃༄༅༆༇༈༉༊་༌།༎༏༐༑༒༓༔༕༖༗༘༙༚༛༜༝༞༟༠༡༢༣༤༥༦༧༨༩༪༫༬༭༮༯༰༱༲༳༴༵༶༷༸༹༺༻༼༽༾༿ཀཁགགྷངཅཆཇཉཊཋཌཌྷཎཏཐདདྷནཔཕབབྷམཙཚཛཛྷཝཞཟའཡརལཤཥསཧཨཀྵཪཱཱཱིིུུྲྀཷླྀཹེཻོཽཾཿ྄ཱྀྀྂྃ྅྆྇ྈྉྊྋྐྑྒྒྷྔྕྖྗྙྚྛྜྜྷྞྟྠྡྡྷྣྤྥྦྦྷྨྩྪྫྫྷྭྮྯྰྱྲླྴྵྶྷྸྐྵྺྻྼ྾྿࿀࿁࿂࿃࿄࿅࿆࿇࿈࿉࿊࿋࿌࿏

    ꀀꀁꀂꀃꀄꀅꀆꀇꀈꀉꀊꀋꀌꀍꀎꀏꀐꀑꀒꀓꀔꀕꀖꀗꀘꀙꀚꀛꀜꀝꀞꀟꀠꀡꀢꀣꀤꀥꀦꀧꀨꀩꀪꀫꀬꀭꀮꀯꀰꀱꀲꀳꀴꀵꀶꀷꀸꀹꀺꀻꀼꀽꀾꀿꁀꁁꁂꁃꁄꁅꁆꁇꁈꁉꁊꁋꁌꁍꁎꁏꁐꁑꁒꁓꁔꁕꁖꁗꁘꁙꁚꁛꁜꁝꁞꁟꁠꁡꁢꁣꁤꁥꁦꁧꁨꁩꁪꁫꁬꁭꁮꁯꁰꁱꁲꁳꁴꁵꁶꁷꁸꁹꁺꁻꁼꁽꁾꁿꂀꂁꂂꂃꂃꂄꂅꂆꂇꂈꂉꂊꂋꂌꂍꂎꂏꂐꂑꂒꂓꂔꂕꂖꂗꂘꂙꂚꂛꂜꂝꂞꂟꂠꂡꂢꂣꂤꂥꂦꂧꂨꂩꂪꂫꂭꂮꂯꂰꂱꂲꂳꂴꂵꂶꂷꂸꂹꂺꂻꂼꂽꂾꂿꃀꃁꃂꃃꃄꃅꃆꃇꃈꃉꃊꃋꃌꃍꃎꃏꃐꃑꃒꃓꃔꃕꃖꃗꃘꃙꃚꃛꃜꃝꃞꃟꃠꃡꃢꃣꃤꃥꃦꃧꃨꃩꃪꃫꃬꃭꃮꃯꃰꃱꃲꃳꃴꃵꃶꃷꃸꃹꃺꃻꃼꃽꃾꃿꄀꄁꄂꄃꄄꄅꄆꄇꄈꄉꄊꄋꄌꄍꄎꄏꄐꄑꄒꄓꄔꄕꄖꄗꄘꄙꄚꄛꄜꄝꄞꄟꄠꄡꄢꄣꄤꄥꄦꄧꄨꄩꄪꄫꄬꄭꄮꄯꄰꄱꄲꄳꄴꄵꄶꄷꄸꄹꄺꄻꄼꄽꄾꄿꅀꅁꅂꅃꅄꅅꅆꅇꅈꅉꅊꅋꅌꅍꅎꅏꅐꅒꅓꅔꅕꅖꅖꅗꅘꅙꅚꅛꅜꅝꅞꅟꅠꅡꅢꅣꅤꅥꅦꅧꅨꅩꅪꅫꅬꅭꅮꅯꅰꅱꅲꅳꅴꅵꅶꅷꅸꅹꅺꅻꅼꅽꅾꅿꆀꆁꆂꆃꆄꆅꆆꆇꆈꆉꆊꆋꆌꆍꆎꆏꆐꆑꆒꆓꆔꆔꆕꆖꆗꆘꆙꆚꆛꆜꆝꆞꆟꆠꆡꆢꆣꆤꆥꆦꆧꆨꆩꆪꆫꆬꆭꆮꆯꆱꆲꆳꆴꆵꆶꆷꆸꆹꆺꆻꆼꆽꆾꆿꇀꇁꇂꇃꇄꇅꇆꇇꇈꇉꇊꇋꇌꇍꇎꇏꇐꇑꇒꇓꇔꇕꇖꇗꇘꇙꇚꇛꇜꇝꇞꇟꇠꇡꇢꇣꇤꇥꇦꇧꇨꇩꇪꇬꇭꇮꇯꇰꇱꇲꇳꇴꇵꇶꇷꇸꇹꇺꇻꇼꇽꇾꇿꈀꈂꈂꈃꈄꈅꈆꈇꈈꈉꈊꈋꈌꈍꈎꈏꈐꈑꈒꈓꈔꈕꈖꈗꈘꈙꈚꈛꈜꈝꈞꈟꈠꈡꈢꈣꈤꈥꈦꈧꈨꈩꈪꈫꈬꈭꈮꈯꈰꈱꈲꈴꈵꈶꈷꈸꈹꈺꈻꈼꈽꈽꈾꈿꉀꉁꉂꉃꉄꉅꉆꉇꉈꉉꉊꉋꉌꉍꉎꉏꉐꉑꉒꉓꉔꉕꉖꉗꉘꉙꉚꉛꉜꉝꉞꉟꉠꉡꉢꉣꉤꉥꉦꉧꉨꉩꉪꉫꉬꉭꉮꉯꉱꉲꉳꉴꉵꉶꉷꉸꉹꉺꉻꉼꉽꉾꉿꊀꊁꊂꊃꊄꊅꊆꊇꊈꊉꊊꊋꊌꊍꊎꊏꊐꊑꊒꊓꊔꊕꊖꊗꊘꊙꊚꊛꊜꊝꊞꊟꊠꊡꊢꊣꊤꊥꊦꊧꊨꊩꊪꊫꊬꊭꊮꊯꊰꊱꊲꊳꊴꊵꊶꊷꊸꊹꊺꊻꊼꊽꊾꊿꋀꋁꋂꋃꋄꋅꋆꋇꋈꋉꋊꋋꋌꋍꋎꋏꋐꋑꋒꋓꋔꋕꋖꋗꋘꋙꋚꋛꋜꋝꋞꋟꋠꋡꋢꋣꋤꋥꋦꋧꋨꋩꋪꋫꋬꋭꋮꋯꋰꋱꋲꋳꋴꋵꋶꋷꋸꋹꋺꋻꋼꋽꋾꋿꌀꌁꌂꌃꌄꌅꌆꌇꌈꌉꌊꌋꌌꌍꌎꌏꌐꌑꌒꌓꌔꌕꌖꌗꌘꌙꌚꌛꌜꌝꌞꌟꌠꌡꌢꌣꌤꌥꌦꌧꌨꌩꌪꌫꌬꌭꌮꌯꌰꌱꌲꌳꌴꌵꌶꌷꌸꌹꌺꌻꌼꌽꌾꌿꍀꍁꍂꍃꍄꍅꍆꍇꍈꍉꍊꍋꍌꍍꍎꍏꍐꍑꍒꍓꍔꍕꍖꍗꍘꍙꍚꍛꍜꍝꍞꍟꍠꍡꍢꍣꍤꍥꍦꍧꍨꍩꍪꍫꍬꍭꍮꍯꍰꍱꍲꍳꍴꍵꍶꍷꍸꍹꍺꍻꍼꍽꍾꍿꎀꎁꎂꎃꎄꎅꎆꎇꎈꎉꎊꎋꎌꎍꎎꎏꎐꎑꎒꎓꎔꎕꎖꎗꎘꎙꎚꎛꎜꎝꎞꎟꎠꎡꎢꎣꎤꎥꎦꎧꎨꎩꎪꎫꎬꎭꎮꎯꎰꎱꎲꎳꎴꎵꎶꎷꎸꎹꎺꎻꎼꎽꎾꎿꏀꏁꏂꏃꏄꏅꏆꏇꏈꏉꏊꏋꏌꏍꏎꏐꏑꏒꏓꏔꏕꏖꏗꏘꏙꏚꏛꏜꏝꏞꏟꏠꏡꏢꏣꏤꏥꏦꏧꏨꏩꏪꏫꏬꏭꏮꏯꏰꏱꏲꏳꏴꏵꏶꏷꏸꏹꏺꏻꏼꏽꏾꏿꐀꐁꐂꐃꐄꐅꐆꐇꐈꐉꐊꐋꐌꐍꐎꐏꐐꐑꐒꐓꐔꐕꐖꐗꐘꐙꐚꐛꐜꐝꐞꐟꐠꐡꐢꐣꐤꐥꐦꐧꐨꐩꐪꐫꐬꐭꐮꐯꐰꐱꐲꐳꐴꐵꐶꐷꐸꐹꐺꐻꐼꐽꐾꐿꑀꑁꑂꑃꑄꑅꑆꑇꑈꑉꑊꑋꑌꑍꑎꑏꑐꑑꑒꑓꑔꑕꑖꑗꑘꑙꑚꑛꑜꑝꑞꑟꑠꑡꑢꑣꑤꑥꑦꑧꑨꑩꑪꑫꑬꑭꑮꑯꑰꑱꑲꑳꑴꑵꑶꑷꑸꑹꑺꑻꑼꑽꑾꑿꒀꒁꒂꒃꒄꒅꒆꒇꒈꒉꒊꒋꒌ

    ꒐꒑꒒꒓꒔꒕꒖꒗꒘꒙꒚꒛꒜꒝꒞꒟꒠꒡꒤꒥꒦꒧꒨꒩꒪꒫꒬꒭꒮꒯꒰꒱꒲꒳꒵꒶꒷꒸꒹꒺꒻꒼꒽꒾꒿꓀꓂꓃꓄꓆

    ᠀᠁᠂᠃᠄᠅᠆᠇᠈᠉᠊᠋᠌᠍᠎᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙ᠠᠡᠢᠣᠤᠥᠦᠧᠨᠩᠪᠫᠬᠭᠮᠯᠰᠱᠲᠳᠴᠵᠶᠷᠸᠹᠺᠻᠼᠽᠾᠿᡀᡁᡂᡃᡄᡅᡆᡇᡈᡉᡊᡋᡌᡍᡎᡏᡐᡑᡒᡓᡔᡕᡖᡗᡘᡙᡚᡛᡜᡝᡞᡟᡠᡡᡢᡣᡤᡥᡦᡧᡨᡩᡪᡫᡬᡭᡮᡯᡰᡱᡲᡳᡴᡵᡶᡷᢀᢁᢂᢃᢄᢅᢆᢇᢈᢉᢊᢋᢌᢍᢎᢏᢐᢑᢒᢓᢔᢕᢖᢗᢘᢙᢚᢛᢜᢝᢞᢟᢠᢡᢢᢣᢤᢥᢦᢧᢨᢩ

    ،؛؟ހށނރބޅކއވމފދތލގޏސޑޒޓޔޕޖޗޘޙޚޛޜޝޞޟޠޡޢޣޤޥަާިީުޫެޭޮޯް

    〱〲〳〴〵〶〷ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔ゛゜ゝゞァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ・ーヽヾ、・ヲァィゥェォャュョッーアイウエオカキクケコサシスセソタチツテトナニヌネノハヒフヘホマミムメモヤユヨラリルレロワン゙゚

    ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒᄓᄔᄕᄖᄗᄘᄙᄚᄛᄜᄝᄞᄟᄠᄡᄢᄣᄤᄥᄦᄧᄨᄩᄪᄫᄬᄭᄮᄯᄰᄱᄲᄳᄴᄵᄶᄷᄸᄹᄺᄻᄼᄽᄾᄿᅀᅁᅂᅃᅄᅅᅆᅇᅈᅉᅊᅋᅌᅍᅎᅏᅐᅑᅒᅓᅔᅕᅖᅗᅘᅙᅟᅠᅡᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵᅶᅷᅸᅹᅺᅻᅼᅽᅾᅿᆀᆁᆂᆃᆄᆅᆆᆇᆈᆉᆊᆋᆌᆍᆎᆏᆐᆑᆒᆓᆔᆕᆖᆗᆘᆙᆚᆛᆜᆝᆞᆟᆠᆡᆢᆨᆩᆪᆫᆬᆭᆮᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂᇃᇄᇅᇆᇇᇈᇉᇊᇋᇌᇍᇎᇏᇐᇑᇒᇓᇔᇕᇖᇗᇘᇙᇚᇛᇜᇝᇞᇟᇠᇡᇢᇣᇤᇥᇦᇧᇨᇩᇪᇫᇬᇭᇮᇯᇰᇱᇲᇳᇴᇵᇶᇷᇸᇹ

  • Sorting it all Out

    unicodeFFFE... is Microsoft off its rocker?

    • 13 Comments

    This is an issue that has been around for a long time.

    Back in February (geez, I really have been blogging almost a year now, haven't I?), I explained the difference between Big Endian and Little Endian Unicode. In January I also talked about the Byte Order Mark.

    Neither of them is what this post is about.

    This post is about the Preferred Charset Label for web pages that are encoded with Big Endian Unicode (or 'Unicode big endian' as Notepad likes to call it).

    It is indeed unicodeFFFE.

    "But Michael, that is not a valid Unicode code point!" cry some.

    "But Michael, that is not what the big endian BOM looks like in memory if one is looking at the bytes!" cry others.

    "But Michael, that is not what the big endian BOM looks like on Big Endian systems!" cry some of those remaining.

    "Michael, is Microsoft off its rocker?" exclaim a few of the rest (their language is at times less polite, but one email used this language so I decided to go with it).

    And believe it or not, there are actually bugs raised by people on several different product teams over the years, who are unhappy with one or more of the following:

    And then some of the words from people at Microsoft....

    "The byte-order mark for big-endian unicode is FEFF, so this should be UnicodeFEFF. This seems like a valid complaint, but I was wondering if it'd break something else to change it." explains Shawn Steele, the development owner of encodings in Windows and the .NET Framework.

    "I think this a mistake in the original MLang data, but we have to keep it for compatibility." explains the developer who used to own MLang, now the MUI Development Lead.

    "Yes, it was a misnomer that we inherited from MLang.  It’s too late to change that." explains the NLS Development Lead.

    "Yes, this was wrong in the initial implementation. But now that apps are coded to it, we cannot change anymore." explains Software Architect Chris Lovett on the SQL Server team.

    But the original truth about why it was in MLang in the first place is not quite this insidious. Basically, Windows (and Microsoft) are predominantly Little Endian shops (even when Windows ran on platforms that supported BE, like Alpha, the installs used LE). And when someone on a little endian system reads it in as if it were a WCHAR (thinking it to be a UTF-16 LE code unit), they see 0xFFFE, which is of course not a valid Unicode code unit. Thus it is easy to see it as a big endian file.

    The BOM is always U+FEFF. Always. ALWAYS. But that means that in memory it is (in BYTEs):

    • 0xff 0xfe when it is the little endian BOM on any system;
    • 0xfe 0xff when it is the big endian BOM on any system.
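
    Since the byte view is the same on every system, checking for a UTF-16 BOM is just a matter of looking at the first two bytes in order. Here is a minimal sketch in C; the enum and the function name sniff_utf16_bom are made up for illustration, and the sketch deliberately ignores UTF-8 and UTF-32 signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum bom_kind { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE };

/* Look at the first two bytes of a buffer, exactly as they sit in memory. */
enum bom_kind sniff_utf16_bom(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    if (len < 2)
        return BOM_NONE;
    if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        return BOM_UTF16_LE;    /* U+FEFF stored low byte first */
    if (bytes[0] == 0xFE && bytes[1] == 0xFF)
        return BOM_UTF16_BE;    /* U+FEFF stored high byte first */
    return BOM_NONE;
}
```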

    This is because big endian systems take the first (big) byte first, where little endian systems take that second byte first. Which means that in memory it is (in WORDs):

    • 0xfeff when it is the little endian BOM on little endian systems;
    • 0xfffe when it is the big endian BOM on little endian systems;
    • 0xfffe when it is the little endian BOM on big endian systems;
    • 0xfeff when it is the big endian BOM on big endian systems.

    Try it yourself on any platform you happen to have handy if you don't believe me. :-)

    The semantic is clear and unambiguous, just not documented very well, and perhaps some would call it a rather silly way to think of it. The name is just acting as a somewhat sensible (if somewhat platformily provincial) labelling of what one sees on almost 100% of all Windows platforms.

    And as people already pointed out, it is a bit late to be talking about changing it....

     

    This post brought to you by U+fffe, a permanently reserved code unit in Unicode so that BOM determination can remain easier....
