Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    GEOID -- The LCID's maligned little brother....

    (Matt, never believe I don't hear you when you are saying things. This one is for you!)

    Inside of Microsoft, there is a huge database of location data. That database is used to support products like MapPoint, Passport, Windows, and others with important information that each can use when they provide data about locations -- countries, regions, provinces, states, cities, and so on. How important it is varies from product to product (of course). But on Windows it is particularly dissed, and that ought to change.

    In Windows, the primary key of this database is the GEOID, and many of the core GEOID values can be found in the Table of Geographic Locations.

    (Lest you think I was talking out of my hat when I was talking about a huge shared database, compare the values in that table with those in MapPoint's GeoMapRegion values table!)

    We have a program manager who started with us a while back and who has owned our Regional Options control panel applet for most of that time. One of the first things he noticed about Regional Options was that it was not driven by region, but by locale (which is itself primarily driven by language and secondarily by region as a tiebreaker any time the language alone would not give sufficient information). He was struck by how flawed that was as a concept, and has been trying to get us to think a bit beyond our current model, which gives regional differences less importance than they deserve.

    Although it is in my fundamental nature as a developer to disagree with him as a program manager (insert grin here), I will try to rise above that nature as a technical lead and admit that he is 100% correct. It does not make a lot of sense.

    (teams throughout Microsoft start to get nervous)

    However, the fundamental support of the locale model has been a central means of international support in almost every product at Microsoft. And there is no good way to reverse that and make everyone rewrite their functionality.

    (teams throughout Microsoft breathe sighs of relief)

    But this is no excuse for bad decisions in functionality!

    APIs like GetUserGeoID are the most intuitive and obvious way to find out where someone is located. The value returned is the answer a user gives in Regional and Language Options when asked for their location. The text in that dialog claims that the setting is used "To help services provide you with local information, such as news and weather." But how can we make those words true if applications like Internet Explorer base their fundamental decisions about regional content on the default user locale (an LCID) rather than the geographical region (a GEOID)? As I said the other day when I asked "What is my locale? Well, which locale do you mean?" the default user locale is meant to impart the user's choice in format and collation preferences, NOT their location. So IE should be using the GEO settings for that!

    Some will freak out at this point -- what would such a change do to users?

    Well, let us say that we made the two settings act as being tied together by default, so that changing the locale would, by default, change the GEOID with it.

    This is easy since there is an under-documented LCType for GetLocaleInfo that retrieves the GEOID that is associated with a locale [1] -- LOCALE_IGEOID (the value is 0x0000005B and the functionality has been available since Windows XP). Any time the user explicitly changes the location after seeing the dialog text quoted above, wouldn't the user be expecting IE to follow such a request?
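
    One way to picture the "tied together by default" idea is the minimal sketch below. It is a model rather than Windows code: the RegionalSettings class, the LOCALE_TO_GEOID table, and its three hand-picked entries are illustrative assumptions standing in for what GetLocaleInfo would return for LOCALE_IGEOID.

```python
# Illustrative LCID -> GEOID pairs (a tiny hand-picked subset, assumed for this sketch).
LOCALE_TO_GEOID = {
    0x0409: 244,  # English (United States) -> United States
    0x0407: 94,   # German (Germany)        -> Germany
    0x0411: 122,  # Japanese (Japan)        -> Japan
}


class RegionalSettings:
    """Hypothetical model: the location follows the locale until the user picks one."""

    def __init__(self, lcid):
        self.lcid = lcid
        self._explicit_geoid = None  # None means "never chosen by the user"

    @property
    def geoid(self):
        # An explicit choice always wins; otherwise derive the location from the locale.
        if self._explicit_geoid is not None:
            return self._explicit_geoid
        return LOCALE_TO_GEOID[self.lcid]

    def set_locale(self, lcid):
        self.lcid = lcid

    def set_location(self, geoid):
        self._explicit_geoid = geoid


settings = RegionalSettings(0x0409)
settings.set_locale(0x0407)   # the location follows the locale by default
default_geoid = settings.geoid
settings.set_location(122)    # the user explicitly says "Japan"
settings.set_locale(0x0409)   # later format changes no longer move the location
explicit_geoid = settings.geoid
```

    The design choice worth noticing is that an explicit location is stored separately from the locale-derived default, so a user who never touches the location keeps getting a sensible first guess.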

    Such a solution would solve the real problem for almost all users, and perhaps other folks for whom region is important (Windows Media Player, Passport, and news services for starters) would be able to start using the GEOID preference of the user as a starting point. Even the time zone settings would get an effective first guess with such a system (even if tweaking is required to obtain the best answers).

    Because GEOIDs kind of rock. They provide a lot of potentially interesting information, as the SYSGEOTYPE enumeration defines:

    • GEO_NATION - GEOID of a nation. This value is stored in a long integer.
    • GEO_LATITUDE - The latitude of the GEOID. This value is stored in a floating point number.
    • GEO_LONGITUDE - The longitude of the GEOID. This value is stored in a floating point number.
    • GEO_ISO2 - The ISO 2-letter country/region code. This value is stored in a string.
    • GEO_ISO3 - The ISO 3-letter country/region code. This value is stored in a string.
    • GEO_RFC1766 - An RFC1766-style string derived from the locale and GEOID (for nations only).
    • GEO_LCID - A locale ID (LCID) derived from the language and the GeoID (for nations only).
    • GEO_FRIENDLYNAME - The friendly name of the nation. Example: Germany. This value is stored in a string.
    • GEO_OFFICIALNAME - The official name of the nation. Example: Federal Republic of Germany. This value is stored in a string.
    • GEO_TIMEZONES - The time zones associated with the GEOID. These values are stored in an array of IDs.
    • GEO_OFFICIALLANGUAGES - The official languages of the nation at the GEOID. These values are stored in an array of LCIDs.
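
    For the curious, here is a rough sketch of asking Windows for a couple of these values from Python through ctypes. It assumes the kernel32 exports GetUserGeoID and GetGeoInfoW, and the numeric values 16 (GEOCLASS_NATION), 4 (GEO_ISO2) and 8 (GEO_FRIENDLYNAME); passing 0 as the language identifier is also an assumption on my part. The user_location helper is a made-up name, and it simply returns None when not run on Windows.

```python
import ctypes
import sys

GEOCLASS_NATION = 16   # GEOCLASS value for nations
GEO_ISO2 = 4           # SYSGEOTYPE: ISO 2-letter country/region code
GEO_FRIENDLYNAME = 8   # SYSGEOTYPE: friendly name of the nation


def user_location():
    """Return (geoid, iso2, friendly_name), or None when it is not available."""
    if sys.platform != "win32":
        return None  # these APIs only exist on Windows
    kernel32 = ctypes.windll.kernel32
    geoid = kernel32.GetUserGeoID(GEOCLASS_NATION)
    if geoid == -1:  # GEOID_NOT_AVAILABLE
        return None

    def geo_string(geo_type):
        buffer = ctypes.create_unicode_buffer(256)
        # The final argument is the language identifier; 0 is assumed to be
        # acceptable for these string types.
        if kernel32.GetGeoInfoW(geoid, geo_type, buffer, len(buffer), 0) == 0:
            return None
        return buffer.value

    return geoid, geo_string(GEO_ISO2), geo_string(GEO_FRIENDLYNAME)


location = user_location()
```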

    Shouldn't we all consider using the result of GetUserGeoID when the question is "where am I?" rather than any of the various locale settings?

    I think we have had enough of this "I don't want to make a change because what if no one else does?" problem -- let's break the logjam and do the right thing, and work to make sure it is a better situation for everyone!

    (Though with that said I will go in to work and still be just as cautious about any attempt at a suggestion to undo the current architecture, beyond the (suggested above) step of having LOCALE_USER_DEFAULT drive the GetUserGeoID value!)

     

    1 - There is also a RegionInfo.GeoID property that has been added to the Whidbey release of the .NET Framework, for a lot of the same reasons that inspired that LCType. The actual regions that fill the RegionInfo collection are locale-based, but at least now you can tie them together a little more than previously. Getting GEO information as listed in the SYSGEOTYPE enumeration is still not possible using the .NET Framework, though. Maybe something for next time....

     

    This post brought to you by "✈" (U+2708, a.k.a. AIRPLANE)

  • Sorting it all Out

    English only! (or how to misuse NLS APIs)

    Remember my conversation about What is my locale? Well, which locale do you mean? And the part where I talked about how people tended to misuse the locales?

    There is a new beta available for the MSN Toolbar suite, up at http://beta.toolbar.msn.com (though eventually this link will probably stop working).

    If you have an English Windows 2000 or Windows XP, you are all set. If not, then you may see something like this:

    (From a machine with a Czech system locale)

    Let's ignore the fact that over 50% of Microsoft's products are bought outside of the US, and that between 70% and 100% of those people do not want to use the English language.

    And let's also ignore the fact that this is on the INTERNET, which is a world-wide enough fad to suggest that a huge number of people around the world will see it.

    Both of those points cause this install to have problems, but we'll deal with those kinds of problems another day.

    Let's also ignore the fact that they do not allow it to install on Windows Server 2003 (so I could not install on my main machine).

    That's a problem too, as far as I am concerned -- especially since they said "XP or later" when Server 2003 is the only shipping OS that is later!

    For now, let's deal with the fact that they detected the LANGUAGE VERSION OF THE OPERATING SYSTEM using a totally inappropriate NLS setting. A setting that anyone can change, only to find themselves unable to install without being given a somewhat scary warning.

    I'm not going to install for now -- I'll wait until they are a bit more internationally friendly.

    Thumbs down for the MSN Toolbar Suite Beta from this developer. :-(

     

    This post brought to you by "N" (U+004e, a.k.a. Latin Capital Letter N)
    (A letter that has been a longtime sponsor of Sesame Street and which is used to seeing content outside of the US excluded)

  • Sorting it all Out

    It says 'Unleash the Power of OneNote'

    Unleash the Power of OneNote

    This is the title of a book that I just got sent, by FedEx.

    It took me a minute to understand why, though -- I did not remember ordering it, and it has been many years since I have had enough alcohol to spend money without realizing where I had spent it. :-)

    Then I looked at the authors of the book (Kathy Jacobs and Bill Jelen), and I understood.

    Bill Jelen (a.k.a. MrExcel of http://mrexcel.com) is one of the co-authors, and he had sent me mail not too long ago asking whether he could use the VBA code I put together for dealing with GUIDs, five years ago.

    Of course I said fine -- the code was on my site to use, and my only rule about that code has been: don't say that you wrote it. Some people have done that in the past with functions that were more interesting than this one (like the stuff I reverse engineered for an article Ken Getz and I wrote on AddressOf in VBA).

    People who are polite enough to not only leave the attribution in but to ask first? They don't make many people with that much class, these days.

    I mentioned in passing that I ought to get a copy of the book. :-)

    But I was definitely not expecting top billing in the acknowledgements, or such prominent mention on pp262-3 (though my website was misspelled on the first page, it is correct on the second!).

    I'm actually going to read the book this weekend, and maybe start using OneNote a bit more (if Chris Pratley is involved, then I know it has to be worth a look).

    Thanks, Bill!

  • Sorting it all Out

    What makes a string meaningful?

    Yesterday, I said that CompareString prefers meaningful strings, and that while the (rare) inconsistencies are always bugs, we have to prioritize such bugs based on whether or not the data is actually valid/meaningful.

    Many people stopped and wondered how one defines the word 'meaningful' here. Is it a definition that is useful for developers?

    I'll ignore the cross-script strings that have little clear semantic or pragmatic meaning and focus on the strings that have code points not defined by the MS collation tables (sometimes not even by Unicode!) as discussed in The jury will give this string no weight.

    Some developers might think they could use the CompareString API and compare characters to a zero length string. Others think about using LCMapString looking for a "no weight" sort key. But both of these ideas share two problems that keep them from acting as practical solutions:

    1. Checking for one character at a time is unwieldy, and more than one at a time can miss individual characters with no weight.
    2. Some Unicode code points intentionally have no weight and are valid as they are, such as U+2060 WORD JOINER.

    So, what can you do? You can use the IsNLSDefinedString API! You pass it a string and it will tell you if every character in a string has a defined result (which in this case is exactly what you may need).

    It is intimately related to the GetNLSVersion API, which also helps out with the question of stability in collation.

    Both APIs were added in Windows Server 2003, and the Whidbey release of the .NET Framework includes a method analogous to IsNLSDefinedString (CompareInfo.IsSortable, you will see it starting in Beta 2).

    GetNLSVersion is used by major databases like Active Directory in order to know when they need to re-index their data. Basically, looking at the NLSVERSIONINFO struct, the dwNLSVersion member will be incremented any time a major version sort of change happens (existing results change), and the dwDefinedVersion member will be incremented any time a minor version sort of change happens (newly defined characters are added).

    Now looking at IsNLSDefinedString, if you have a database and create indexes based on sort keys from LCMapString or B-Trees built from CompareString calls:

    1. Any time the major version is incremented, you should re-index no matter what, and
    2. Any time the minor version is incremented, you should re-index for any entry where IsNLSDefinedString used to return FALSE (in case it now returns TRUE or different results due to part of the string now being defined)
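
    Those two rules can be sketched in a few lines. This is only an illustration of the decision, not of the API calls: the (major, minor) tuples stand in for whatever GetNLSVersion reported when the index was built and what it reports now, and IndexEntry, was_defined and entries_to_reindex are made-up names (was_defined being the IsNLSDefinedString result you stored at indexing time).

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    text: str
    was_defined: bool  # what IsNLSDefinedString reported when this entry was indexed


def entries_to_reindex(entries, stored_version, current_version):
    """Versions are (major, minor) tuples; returns the entries that must be rebuilt."""
    stored_major, stored_minor = stored_version
    current_major, current_minor = current_version
    if current_major != stored_major:
        # Rule 1: a major change invalidates every existing index value.
        return list(entries)
    if current_minor != stored_minor:
        # Rule 2: a minor change only affects strings that were not fully defined before.
        return [entry for entry in entries if not entry.was_defined]
    return []


index = [IndexEntry("apple", True), IndexEntry("\u1b37pple", False)]
after_minor_bump = entries_to_reindex(index, (1, 0), (1, 1))
after_major_bump = entries_to_reindex(index, (1, 1), (2, 0))
```

    Note that the sketch only works if the flag is stored alongside each entry, which is exactly why keeping track of unsortable strings matters.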

    Obviously, major version changes are expensive and would be expected to be rare -- not even every major release of Windows requires a new major version.

    Why is that? Well, usually a new version would just mean a whole bunch of new characters added, and thus there is no need to re-index strings that are already indexed -- which suggests a minor version. Minor version changes would be much more common. With them you can trust all existing index values, and only need to re-index strings that previously contained one or more unsortable elements.

    If you follow rules (1) and (2) above and always store information about unsortable strings, you can use these APIs to get the most out of the collation support for meaningful strings on Windows.

     

    This post brought to you by "ქ" (U+10e5, a.k.a. GEORGIAN LETTER KHAR)

  • Sorting it all Out

    Doing a little more in Sri Lanka....

    It is all public now....

    Read all about it, from two papers in Sri Lanka:

    The Daily Mirror (28 Jan 2005): Microsoft partners Govt. for technology support for tsunami recovery effort (pay site)

    The Daily News (29 Jan 2005): Microsoft Develops Sinhala Software (free site)

    The idea was simple enough. Sinhalese was always on Microsoft's language roadmap (previously discussed here) as one of the many languages to ship in Longhorn. And not everything was completely done. But given the circumstances, support for Sinhalese was needed a bit more urgently to help with the healing process in areas affected by the ravages of the tsunami.

    Obviously much more has happened along the lines of financial relief efforts from people around the world and people at Microsoft (plenty of coverage has been happening about that all around the blogosphere).

    But working to make sure it will be easier to display, render, type, search, sort, save, and print information in Sinhalese is yet another small way that we were able to help the long road to recovery.

    According to Lalith Weeratunga (Secretary to the Prime Minister of Sri Lanka), "Microsoft's contribution is of enormous value to the reconstruction and rehabilitation efforts which are currently under way in Sri Lanka."

    I can't take credit for this effort; my role in the project was a minor one. But to the people who did the real work here (and you know who you are!), I am in awe of your efforts. And I am proud to be working for the company that helped to make it happen. You are truly amazing, and I am proud to be working with all of you....

     

    This post brought to you by "ඔ" (U+0d94, a.k.a. SINHALA LETTER OYANNA)

  • Sorting it all Out

    CompareString prefers meaningful strings

    Another reason why international test is not for amateurs....

    Like they say at despair.com: "When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do."

    It does not tend to be a problem on this team. But when other teams call our APIs, they somehow get it in their heads that, as a part of testing their component, they should test the API. And not understanding how the APIs work, they start building random Unicode strings and passing them to CompareString.

    Now CompareString is an API that was built to handle actual linguistically meaningful strings, not whatever random crap is generated. And while I will not claim that such a process cannot find problems, I can claim that this is not the sort of core scenario that causes me to lose sleep at night the way genuine bugs that might affect customers will....

    An example of this happened over a year ago, in the newsgroups:

    I've found that with certain Unicode strings, CompareStringW seems to be acting very strangely - you get behavior like this:

    Strangely is a relative term, especially in a case where you are randomly generating strings....

    A < B 
    B < C 
    C < A

    or even:

    A < B 
    B < A

    I will admit that both are not so great. But you have to understand how the collation data is created and what it represents.

    The goal is to give a way to sort every part of the Unicode BMP (Basic Multilingual Plane), according to some particular selected locale. Any time a code point is not usefully defined in the table (e.g. it is not defined in Unicode, it is not a language/script that Windows has useful data for, or it is intentionally not given weight), it will not give useful linguistic information.

    In other words, comparing random crap can give random crap results. :-)

    These strings are randomly generated Unicode strings, so it may be that the problematic strings contain characters that are either unused or in certain parts of the Unicode space that are reserved (something similar to the private use space, maybe).  So it may be that CompareStringW works fine for all real-world strings that we'd ever encounter.  Still, it's a bit unsettling to see CompareStringW return values that are so obviously wrong.

    See above. But I will plow through the examples too, below.

    A specific example - all three calls to CompareStringW return CSTR_LESS_THAN:

    A = 1B37 1D96 4516 
    B = 30FE 4113 67BE 
    C = 0747 4443 40E6

    Are there any errors here?  From what I can tell, the three strings are all legal (null-terminated) UTF-16 strings - they're not ill-formed.

    Well, string A is two code points not in Unicode (which thus have no weight) and an Extension A ideograph (no weight prior to XP, near the end of the table in XP and later).

    String B starts with a Katakana iteration mark that affects the character before it and which would never start a string, followed by another Extension A ideograph and a standard CJK ideograph.

    String C is made up of a Syriac letter and two more Extension A ideographs.

    SUMMARY: All three are nonsense strings and nothing useful can come from testing with them.

    Another example:

    A = 0D42 65F9 
    B = 1111 1B4F

    String A is a Malayalam character and a CJK ideograph -- two characters one would never really expect to be together.

    String B is a Hangul character and an undefined code point -- again not a valid test.

    CompareStringW returns that A<B and B<A if I pass in -1 as the lengths (the documentation states that "if this parameter is any negative value, the string is assumed to be null terminated and the length is calculated automatically").  But if I calculate the lengths of the strings myself and pass those in, then it works properly (A>B and B<A).  Passing in the string lengths does not help the case above, however.

    Well, this is a type of situation that really is a bug, something that I have been working to correct for future versions -- there simply are many cases where if you pass invalid data we handle it oddly, specifically between the -1 and cch cases (which are basically two different code paths).

    The -1 case is designed to not require a string walk on the part of the caller (it literally plows through the string one sort element at a time and stops when it knows the answer), and any time the two calls give different results, it is technically a bug (one that I am charged with trying to fix! <grin>). The mitigation for the time being is that it takes invalid input to get invalid results....

    Now these ARE bugs. And I will look into them, at some point. But it is fair to say that invalid strings really are the last frontier. All of the meaningful bugs come first, though. Because any day where the only people I frustrate are the testers who do not understand what they are testing, I will have no problems looking in the mirror in the morning....

    The key? If you want to test CompareString, do so with actual word lists -- made up of actual useful strings in the target languages. Take an article in a target language and the first 200 or 500 words from it. Or get a list from a dictionary. Or from customers. Never generate random word lists that do not match the rules of the language or of Unicode (thinking about those illegal characters!). Work to pass appropriate flags that make sense for the application and the API itself. Do not pass code points not included in the Unicode standard if you are expecting back meaningful results.
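
    As a rough sketch of what testing with a real word list might look like, the helper below checks that a comparison function is consistent (no "A < B and B < A", no "A < B, B < C, C < A") over a list of actual words. The names find_inconsistencies and casefold_compare are made up, and the case-folding comparator is only a stand-in assumption; on Windows you would wrap CompareString with the flags your application really uses.

```python
from itertools import combinations, permutations


def sign(value):
    return (value > 0) - (value < 0)


def find_inconsistencies(words, compare):
    """Report pairs and triples for which `compare` contradicts itself."""
    problems = []
    for a, b in combinations(words, 2):
        # Swapping the arguments must flip the sign of the result.
        if sign(compare(a, b)) != -sign(compare(b, a)):
            problems.append(("asymmetric", a, b))
    for a, b, c in permutations(words, 3):
        # a < b and b < c must imply a < c.
        if compare(a, b) < 0 and compare(b, c) < 0 and compare(a, c) >= 0:
            problems.append(("intransitive", a, b, c))
    return problems


def casefold_compare(a, b):
    """Stand-in comparator: compares case-folded strings by code point."""
    x, y = a.casefold(), b.casefold()
    return (x > y) - (x < y)


# A handful of real German words rather than random code points.
words = ["Straße", "Strasse", "Apfel", "Zebra", "über"]
problems = find_inconsistencies(words, casefold_compare)
```

    The triple check is cubic in the number of words, so for a list of several hundred words you would probably want to sample triples rather than try them all.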

    And most importantly, know what you are testing. If you need to test what the API does to typical strings in your application to understand if it is the right API to call, then that is a good idea. But you do not need to test the API itself, unless Microsoft is paying you to do that. The API works, and the important question is whether or not it works for your scenarios.

    Another day I will give a good example of a scenario where it does not return the best possible results, and where another API is best considered....

     

    This post brought to you by "ß" (U+00df, a.k.a. LATIN SMALL LETTER SHARP S)
    (which is treated as equal to "SS" on all platforms, so that German can use the default table with a ton of other languages....)

  • Sorting it all Out

    Handling shortcuts and accelerators in a real world situation

    Not too long ago (well, I have not been doing this very long!) I talked about shortcuts and accelerators. Let's take that information 'round the dancefloor for a bit....

    Mark T. posted in the newsgroups recently:

    lets assume that we are speaking about a game where the supposed 'Z' key triggers an action.

    I would like to know if this is appropriate for instance for people using Arabic, Hebrew, Hindi, Russian or Thai keyboard.

    I think that there is no ‘Z’ printed in any key of those keyboards.

    Would it be convenient for those users to click the ‘Z’ key to do something?

    If not, what would you suggest for a situation where there are not many trigger keys while all Windows supported scripts are used?

    I had seen that Chinese versions of Windows use ASCII for menu shortcuts. Also found that Thai applications use Thai shortcuts, maybe because it is more convenient.

    The supposition is correct -- there will often be no "Z" printed on the keyboard; users will not know what to type. So it definitely will not qualify as convenient. :-)

    Generally speaking, they would need to have instructions that told them what keys to press for their language. While you could localize the instructions into target languages, the best worldwide solution would be to build this into the game on a help screen that used the ToUnicode() API to find out what character was indeed on the specific keys (this is a suitable method even if the user changes their layout, since if they know how to type on their current keyboard, they know where the letters would be, even if it is not what is printed on the keys).

    Thinking about the Chinese and Thai examples for a second -- in Chinese, the keyboard is usually a US keyboard (most of the IMEs use the US keyboard as a base), which is why accelerators and shortcuts both tend to look that way (well, that and the fact that an ideograph takes multiple keystrokes and both shortcuts and accelerators need just one). For Thai the same rules do not apply and the Thai keyboard layout will have other keys there, thus for Thai using the Thai language for the instructions makes sense. That ToUnicode() based solution I mentioned may still be the best world-wide solution to the problem, and a small additional investment now will definitely save money later as you expand into more markets....

     

    This post brought to you by "ﭮ" (U+fb6e, a.k.a. ARABIC LETTER PEHEH ISOLATED FORM)

  • Sorting it all Out

    What is my locale? Well, which locale do you mean?

    A few years back (some time before Windows XP shipped), when we were located in Building 9 and were much smaller than we are now, someone else in the building was having a problem. Our kind of problem. An international problem. I don't remember what it was -- something to do with code pages, maybe?

    Anyway, Wei Wu, one of our cool development leads, asked a few configuration questions, and at the end of the message, asked him "what is your default system locale?".

    His response, which I cannot find a copy of now, was priceless. Assuming that Wei was going to stop by to look at the machine, this guy started to describe the location of his computer.... :-)

    One of those jokes that most people won't quite get, and jokes are never funny if they have to be explained. Ah, the life of an international geek.

    But it was pretty funny, I think. We all had a good laugh at the time.

    There is a not-so-hidden truth in there -- our terminology story is weak. It really ought to be better. So, once again, here is a quick glossary of the four most common types of locales:

    DEFAULT USER LOCALE (Windows XP term: "Standards and Formats"):

    This setting controls the way information is presented -- the sort order in list boxes, the format of date, time, number, and currency values, the calendar you prefer to use. The list of locales can be thought of as a big group of defaults that are grouped by many language/region pairings, which you can see on the first tab of the Regional and Language Options Control Panel applet. Several of the settings are customizable, particularly the various formats.

    The setting is per-user and when you change the setting, it is effective immediately, and all top level windows in the user's windowstation will get a WM_SETTINGCHANGE message indicating that the change has happened so that they too can reflect the change immediately.

    Developers will commonly use LOCALE_USER_DEFAULT as their LCID of choice, whether by doing so directly or by calling functions in SHELL, USER, or elsewhere that do so for them. Using this setting and behaving appropriately with the results thereby is a sure sign that the developer respects the user's settings.

    DEFAULT SYSTEM LOCALE (Windows XP term: "Language for non-Unicode Programs"):

    This setting has three major purposes:

    1. Specifies the default ANSI, OEM, MAC, and EBCDIC code pages to use for non-Unicode programs.
    2. Specifies some of the font linking preferences for CJK fonts and for legacy bitmap fonts.
    3. Specifies application behavior when developers incorrectly use this setting rather than the DEFAULT USER LOCALE.

    This setting is found on the third tab of the Windows XP/Server 2003 Regional and Language Options dialog and in the "Default" button on the first tab of the Windows 2000 Regional Options dialog.

    Changing the setting changes it for the entire machine and it requires a reboot to take effect. No notification mechanism is done, nor is one needed since no change happens until the reboot does. A small number of misbehaving applications which check the registry rather than using the APIs will get wrong results after the setting change but before the reboot.

    Unfortunately, developers will sometimes use LOCALE_SYSTEM_DEFAULT for purposes other than #1 and #2, and by doing so they manage to simultaneously show their users disrespect and cause yet another compatibility weirdness with functionality tied to the default system locale. Given the fact that a reboot is required, you would think people would avoid this, but even developers are not always perfect.

    The XP name should be a big hint, though sometimes it adds confusion.

    DEFAULT USER INTERFACE LANGUAGE (Windows XP Term: "Language used in menus and dialogs"):

    This setting controls the language in which the UI is presented. It is only present if you have the MUI version of Windows (which is to say that you have Windows with the multilanguage files installed).

    This setting is found on the second tab of the Windows XP/Server 2003 Regional and Language Options dialog and in the middle of the first tab of the Windows 2000 Regional Options dialog.

    The setting is per-user and changing it requires a logoff to take effect.

    There is no constant for it but the GetUserDefaultUILanguage API will retrieve the setting quickly enough. Given the changes to the resource model that the changes to support MUI inspired, it is easy for applications to plug into the very same setting automatically. I'll talk more about this another time....

    DEFAULT INPUT LOCALE (Windows XP Term: "Default Input Language"):

    This setting controls the initial input language used for all newly created threads.

    This setting is found on the second tab of the Windows XP/Server 2003 Regional and Language Options dialog (hit the "Details..." button) and on the last tab of the Windows 2000 Regional Options dialog.

    The setting is per-user and it takes place immediately. But obviously will not change the input language on any existing threads; only new threads get the new default.

    Developers can find out what the current setting is by calling the SystemParametersInfo API with the SPI_GETDEFAULTINPUTLANG parameter. You can even set it with the SPI_SETDEFAULTINPUTLANG parameter, but this is almost always something that a developer should not be doing -- it is a user preference. Since proper application behavior is mostly about respect, this really is a constant that you should avoid. :-)

    For more on this topic, Dr. International has bigger lists here and here, and there is more information here.

    But when you are a developer, respect is the key here -- respect of the user's preferences and settings.

    When you are a user, consider which applications respect your settings and which do not. Because while doing so may not be convenient for an application, it is certainly possible....

     

    This post sponsored by "ð" (U+00f0, a.k.a. LATIN SMALL LETTER ETH)

  • Sorting it all Out

    Why that is positively Ethiopic!

    A little over a week ago, when I was mentioning that In Tamil -- sometimes, they are digits; other times, just numbers, Scott Hanselman suggested "That would ROCK if you would do Ethiopic sometime." Well, rock on Scott -- today is the day.

    For the record I am not an expert in these things, just a geek who finds alternate number systems to be really interesting (whether Roman numbers, Tamil numbers, or Ethiopic numbers).

    Ready? Here we go....

    Factoid -- there is no Ethiopic zero. There are some numbers that have zeros in them (10, 20, 30, etc.) but no zero. It makes the number system quite fascinating.

    We'll start with a small quote from the Unicode Standard on the subject, found in Chapter 12, Section 1 (available for viewing online in PDF format, here):

    Numbers. Ethiopic digit glyphs are derived from the Greek alphabet, possibly borrowed from Coptic letterforms. In modern use, European digits are often used. The Ethiopic number system does not use a zero, nor is it based on digital-positional notation. A number is denoted as a sequence of powers of 100, each preceded by a coefficient (2 through 99). In each term of the series, the power 100^n is indicated by n HUNDRED characters (merged to a digraph when n = 2). The coefficient is indicated by a tens digit and a ones digit, either of which is absent if its value is zero.

    For example, the number 2345 is represented by

    2,345 = (20 + 3)*100^1 + (40 + 5)*100^0
          = 20 3 100 40 5
          = TWENTY THREE HUNDRED FORTY FIVE
          = 1373 136b 137b 1375 136d 
          = ፳፫፻፵፭

    If you are like me then your eyes may have crossed when you read this, even though the example seemed clear enough. Maybe they should have put in a bigger example....

    Personally, I find Daniel's Ethiopic Number Algorithm #4 to be much clearer from a conceptual standpoint. If you prefer something a bit more cerebral with code samples, then you can look at http://www.geez.org/Numerals/ for a slightly different algorithm (using the same number, I suspect a shared source, maybe? <grin>). The page even has links to demonstrations of the algorithm in Perl, C, Java, and C#.

    So let us take the resulting number that both sites talk about (፯፻፷፭፼፵፫፻፳፩) and try to convert it back from Ethiopic to our familiar European digits:

    = ፯፻፷፭፼፵፫፻፳፩

    = 136f 137b 1377 136d 137c 1375 136b 137b 1373 1369

    = DIGIT SEVEN; NUMBER HUNDRED; NUMBER SIXTY; DIGIT FIVE; NUMBER TEN THOUSAND; NUMBER FORTY; DIGIT THREE; NUMBER HUNDRED; NUMBER TWENTY; DIGIT ONE

    (I removed the word ETHIOPIC from each character name to allow more to fit per line)

    At this point, even knowing what the number is, the words on the site ("Conversion from Ethiopic numerals into western form is trivial") do not seem quite as true, do they? :-)

    Though it actually is easy; it just looks hard. Keep in mind those "sentinels" that ETHIOPIC NUMBER HUNDRED and ETHIOPIC NUMBER TEN THOUSAND represent (with two digits in each group, between them) and we have:

    = DIGIT SEVEN; NUMBER HUNDRED;
          NUMBER SIXTY; DIGIT FIVE; NUMBER TEN THOUSAND;
          NUMBER FORTY; DIGIT THREE; NUMBER HUNDRED;
          NUMBER TWENTY; DIGIT ONE

    Notice how the sentinels keep swapping between the TEN THOUSAND and the HUNDRED? Interesting...

    Picking at the pieces:

    = 7
          65
          43
          21

    or more conventionally

    7654321

    Not too hard, right? Let's try another one:

    = ፳፩፼፳፰፻፷፯፼፶፫፻፱

    = 1373 1369 137c 1373 1370 137b 1377 136f 137c 1376 136b 137b 1371

    = NUMBER TWENTY; DIGIT ONE; NUMBER TEN THOUSAND; NUMBER TWENTY; DIGIT EIGHT; NUMBER HUNDRED; NUMBER SIXTY; DIGIT SEVEN; NUMBER TEN THOUSAND; NUMBER FIFTY; DIGIT THREE; NUMBER HUNDRED; DIGIT NINE

    A little harder this time, but let's do the grouping where those grouping sentinels are and see what we have:

    = NUMBER TWENTY; DIGIT ONE; NUMBER TEN THOUSAND;
        NUMBER TWENTY; DIGIT EIGHT; NUMBER HUNDRED;
        NUMBER SIXTY; DIGIT SEVEN; NUMBER TEN THOUSAND;
        NUMBER FIFTY; DIGIT THREE; NUMBER HUNDRED;
        DIGIT NINE

    We seem to be missing a digit right before that nine -- what happened to two numbers in each group? Ah, that's easy -- look at the sentinel! A zero goes there. So we have:

    = 21 
        28 
        67 
        53 
        09

    And as Tommy Tutone knows, Jenny's New York phone number is indeed 212-867-5309.

    Ok, one more that shows a bit more of that missing zero stuff:

    = ፶፻፭፼፭

    = 1376 137b 136d 137c 136d

    = NUMBER FIFTY; NUMBER HUNDRED; DIGIT FIVE; NUMBER TEN THOUSAND; DIGIT FIVE

    Ooh, a tough one. I'll insert some fake zeros in where they seem to belong based on those sentinels:

    = NUMBER FIFTY; NUMBER HUNDRED; 
        DIGIT ZERO; DIGIT FIVE; NUMBER TEN THOUSAND;
        DIGIT ZERO; DIGIT ZERO; NUMBER HUNDRED;
        DIGIT ZERO; DIGIT FIVE

    So we have:

    = 50 
        05
        00
        05

    Or more conventionally 50,050,005.

    Now of course I am not saying that you would write code that is quite this silly. But it is reasonably straightforward to write an algorithm that can handle these numbers. A bit more background required than I would try to give for an interview question (though someone who could understand it in such a short time and come up with a good answer might have impressed me).

    Anyone want to take a stab at it? :-)
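
    One possible stab at it, as a minimal sketch in Python rather than a definitive implementation. The names ethiopic_value and ethiopic_to_int are made up for illustration, and the code assumes well-formed input (it does not reject things like two tens characters in a row). It keeps a running two-digit pair, folds it into hundreds at each ETHIOPIC NUMBER HUNDRED, and shifts everything seen so far by 10,000 at each ETHIOPIC NUMBER TEN THOUSAND:

```python
# Sketch only: helper names are hypothetical, input is assumed well-formed.
HUNDRED = 0x137B        # ETHIOPIC NUMBER HUNDRED
TEN_THOUSAND = 0x137C   # ETHIOPIC NUMBER TEN THOUSAND


def ethiopic_value(ch):
    """Value of a ones (U+1369..U+1371) or tens (U+1372..U+137A) character."""
    cp = ord(ch)
    if 0x1369 <= cp <= 0x1371:
        return cp - 0x1368           # DIGIT ONE .. DIGIT NINE
    if 0x1372 <= cp <= 0x137A:
        return (cp - 0x1371) * 10    # NUMBER TEN .. NUMBER NINETY
    raise ValueError("not an Ethiopic ones or tens character: %r" % ch)


def ethiopic_to_int(text):
    total = 0      # everything already shifted by a TEN THOUSAND sentinel
    hundreds = 0   # the part of the current group in front of HUNDRED
    pair = 0       # the current tens + ones coefficient
    for ch in text:
        cp = ord(ch)
        if cp == HUNDRED:
            # A missing coefficient means 1 (a bare HUNDRED is 100).
            hundreds = max(pair, 1) * 100
            pair = 0
        elif cp == TEN_THOUSAND:
            # Shift everything seen so far; a bare TEN THOUSAND is 10000.
            total = max(total + hundreds + pair, 1) * 10000
            hundreds = pair = 0
        else:
            pair += ethiopic_value(ch)
    return total + hundreds + pair


EXAMPLES = [
    "\u1373\u136b\u137b\u1375\u136d",
    "\u136f\u137b\u1377\u136d\u137c\u1375\u136b\u137b\u1373\u1369",
    "\u1373\u1369\u137c\u1373\u1370\u137b\u1377\u136f\u137c\u1376\u136b\u137b\u1371",
    "\u1376\u137b\u136d\u137c\u136d",
]
for sample in EXAMPLES:
    print(sample, "=", ethiopic_to_int(sample))
```

    The four strings are the ones worked through above (2345, 7654321, Jenny's number, and the one with all of the missing zeros), so the output can be checked against the hand conversions.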

    Side note #1 -- the Unicode Technical Committee voted in UTC#98 to change the general category of the ETHIOPIC DIGITS from Nd (Number, Digit) to No (Number, Other) due in large part to the fact that the Ethiopic numbers are not generally used as digits. This change was effective as of Unicode 4.0.1. As such, the update will not be seen in Windows until Longhorn or in the .NET Framework until the version after Whidbey.

    Side Note #2 -- Ethiopic is in the category of scripts I defined in The jury will give this string no weight (a fact that will not be changing until coincidentally around the same time -- Longhorn and the .NET Framework in the version after Whidbey).

     

    This post brought to you by "፼" (U+137c, a.k.a. ETHIOPIC NUMBER TEN THOUSAND)

  • Sorting it all Out

    Do they not even *use* Automatic Updates?!?

    • 29 Comments

    I have been reading people all over the internet who hate that Microsoft is perhaps in the future going to limit Windows Update to legal copies of Windows (Automatic Update would be their only option) with the Windows Genuine Advantage program (more info in the Windows Genuine Advantage FAQ).

    Many are on the bandwagon, from Greg Hughes to Mitch Wagner to a hundred of whoever your favorites are; everyone is talking about how evil Microsoft is for something that they have not even done yet.

    Most think Microsoft is being irresponsible by not patching these machines. Those people do not even realize that all security patches and Service Packs are still available via Automatic Update, even for illegal copies of Windows. This acts as a convincing proof to the theory that you do not need to know how to read in order to know how to write.

    The gist of the typical argument of those who are smart enough to at least recognize the "Automatic Updates" option is that people who pirate software will not choose to automatically update since they would be afraid that Microsoft would shut them down remotely for not being a legal user of Windows. They would rather use Windows Update where they have the choice for what they will or will not install.

    But have these people even used Automatic Updates before? Have they even looked at the dialog?

    Well, let's look at it now, shall we? Here it is, both in Windows XPSP2 and Windows Server 2003:

        

    Notice how I have them set -- the XPSP2 box will automatically update every day at 3:00am, and the Server 2003 box will simply let me know if there are updates and then let me know again before installing. Is that a hint as to why I think these people have not used the feature?

    Notice how both of them have an option to look at the updates previously declined (currently disabled, I do not tend to refuse updates!)? Is that another hint?

    Look at all of the options I have here!

    People have total control over whether they install the security updates or not. Even if they are using a pirated version of Windows! The same choice they have in Windows Update for Critical Updates and Service Packs. If they are willing to use the latter, then why would the former be less appealing?

    Wouldn't using Automatic Updates lead to a safer internet for all users since it does not require an explicit visit to a web site to get patches installed? The only reason I do not install automatically on my Server 2003 boxes is that I may be building something and would prefer to control when I install. It is still very cool to get the reminder that there is something to install, and I am a huge fan of that sort of feature.

    So what are these people complaining about, exactly?

  • Sorting it all Out

    FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)

    • 9 Comments

    Last Friday, Jochen Kalmbach, in response to A little bit about the new CharUnicodeInfo class, asked the following:

    By the way: is there some equivalent to FoldString, especially "MAP_PRECOMPOSED" and "MAP_COMPOSITE"? Neither StringInfo nor TextInfo provide such a function, or?

    My answer was:

    The .NET Framework has something even better than FoldString here -- I'll post on it tomorrow....

    But I got busy this weekend and never got around to posting the answer to the question. Sorry about that! I'll do it now (I hope Jochen did not give up on me in the interim!).

    The description of FoldString from the Platform SDK: The FoldString function maps one string to another, performing a specified transformation option.

    There are many different supported transformations:

    MAP_FOLDCZONE
       Fold compatibility zone characters into standard Unicode equivalents. For information about compatibility zone characters, see the following Remarks section.

    MAP_FOLDDIGITS
       Map all digits to Unicode characters 0 through 9.

    MAP_PRECOMPOSED
       Map accented characters to precomposed characters, in which the accent and base character are combined into a single character value. This value cannot be combined with MAP_COMPOSITE.

    MAP_COMPOSITE
       Map accented characters to composite characters, in which the accent and base character are represented by two character values. This value cannot be combined with MAP_PRECOMPOSED.

    MAP_EXPAND_LIGATURES
       Expand all ligature characters so that they are represented by their two-character equivalent. For example, the ligature 'æ' expands to the two characters 'a' and 'e'. This value cannot be combined with MAP_PRECOMPOSED or MAP_COMPOSITE.

    Digit folding functionality is covered by the methods I described in CharUnicodeInfo, especially GetDecimalDigitValue. Some of the other methods will do an even fuller job, supporting many of the non-decimal digit numbers, which FoldString never handled....

    The ligature functionality does not really exist right now, though that does work well in comparisons, whenever it needs to.

    But the other three mapping types see new life in Whidbey, with tables that cover the Unicode 4.0 version of normalization, as described in UAX #15, UNICODE NORMALIZATION FORMS.

    How does it work? Well, in the Whidbey release of the .NET Framework, two new methods were added to System.String:

    bool IsNormalized(NormalizationForm normalizationForm)

    string Normalize(NormalizationForm normalizationForm)

    The functionality of the methods is obvious enough from the names -- the first checks if the string is in a specified normalization form, and the second puts it in a specified form.

    The enumeration with the forms (NormalizationForm) has four members:

    public enum NormalizationForm
    {
        FormC    = 1,
        FormD    = 2,
        FormKC   = 5,
        FormKD   = 6
    }

    The normalization forms, which are described much more fully in the UAX#15 spec, have easy analogues to their FoldString counterparts:

    FormC      MAP_PRECOMPOSED
    FormD      MAP_COMPOSITE
    FormKC     MAP_PRECOMPOSED | MAP_FOLDCZONE
    FormKD     MAP_COMPOSITE | MAP_FOLDCZONE

    In fact the only real difference is that FoldString only does part of the job, because the FoldString tables do not have all of the mappings that are in Unicode, a point I discussed previously. But these normalization methods do. So you can do all the mapping you need to in order to take equivalent forms of the same string and put them into one consistent form.

    Since the "default" method used in most situations is Form C, there are also overloads of the two methods with no NormalizationForm parameter that use Form C automatically. In many cases, that is the one you may want to use. Making Form C the "default" normalization form is not an arbitrary decision -- almost all of the keyboards that ship in Windows input text in Form C already (though of course keyboards created by MSKLC, being user-created, can be in whatever form).

    Another thing to keep in mind is that text may not be in any of these forms -- for example an arbitrary string like õĥµ¨ (U+00f5 U+0068 U+0302 U+00b5 U+00a8). This string combines a precomposed character, a composite character, and two characters with compatibility decompositions (the MICRO SIGN and the DIAERESIS). It is therefore not in any one form at all. Thus this string would see an IsNormalized return of false for all forms. But it can be normalized to return the appropriate result for each of them:

    õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form C  ---> õĥµ¨ (00f5 0125 00b5 00a8)
    õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form D  ---> õĥµ¨ (006f 0303 0068 0302 00b5 00a8)
    õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form KC --> õĥμ ̈  (00f5 0125 03bc 0020 0308)
    õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form KD --> õĥμ ̈  (006f 0303 0068 0302 03bc 0020 0308)
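
    For anyone without a Whidbey beta handy, the table above can be checked with any library that implements UAX #15. Here is a minimal sketch using Python's standard unicodedata module as a stand-in for String.Normalize and String.IsNormalized (its NFC, NFD, NFKC, and NFKD correspond to FormC, FormD, FormKC, and FormKD). The helper names are made up for illustration, and Python may ship a later version of the Unicode tables than Whidbey does, though that should not matter for these particular characters:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# The sample string from the text: U+00f5 U+0068 U+0302 U+00b5 U+00a8
SAMPLE = "\u00f5\u0068\u0302\u00b5\u00a8"


def code_points(text):
    """Render a string as space-separated four-digit hex code points."""
    return " ".join("%04x" % ord(ch) for ch in text)


def normalize_all(text):
    """Map each form name to the code points of the normalized string."""
    return {form: code_points(unicodedata.normalize(form, text)) for form in FORMS}


def is_normalized(form, text):
    """Stand-in for IsNormalized: unchanged by normalization means normalized."""
    return unicodedata.normalize(form, text) == text


for form, result in normalize_all(SAMPLE).items():
    print(form, is_normalized(form, SAMPLE), result)
```

    Every row should report that the original string is not normalized, followed by the same code points as the corresponding line in the table.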

    Ideally they would always compare as being equal even if the forms are different, but this is definitely not a 100% of the time result, as I pointed out a few months ago when I answered the question Normalization and Microsoft -- whats the story? Therefore normalization is the one way you can use to make sure that you will always get the right comparison, especially in some cases that may not ever be fully supported in comparison, like "ﷺ" (U+fdfa, a.k.a. ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM), which decomposes to:

    صلى الله عليه وسلم

    (0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648 0633 0644 0645)

    (Since most fonts do not support U+fdfa, if you can see the string above then it points to at least one time that normalization Form KD helped out for a lot of people! The decomposition of U+fdfa is a compatibility one, so it is the K forms that expand it.)
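
    The expansion of U+fdfa can be checked the same way. This is only a sketch, again assuming Python's unicodedata module as a stand-in for the .NET methods; the function name decompose is made up for illustration. Because the decomposition is a compatibility mapping, the canonical forms leave the character alone and only the K forms expand it:

```python
import unicodedata

LIGATURE = "\ufdfa"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM


def decompose(text, form):
    """Return the code points of text after normalizing to the given form."""
    return ["%04x" % ord(ch) for ch in unicodedata.normalize(form, text)]


print("NFD: ", " ".join(decompose(LIGATURE, "NFD")))
print("NFKD:", " ".join(decompose(LIGATURE, "NFKD")))
```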

    You can also see the Beta documentation for the IsNormalized method, the Normalize method, and the NormalizationForm enumeration.

     

    This post brought to you by "ﷻ" (U+fdfb, a.k.a. ARABIC LIGATURE JALLAJALALOUHOU)
    A ligature that decomposes to "جل جلاله" or
    062c 0644 0020 062c 0644 0627 0644 0647.

     

  • Sorting it all Out

    We broke CharNext/CharPrev (or, bugs found through blogging?)

    • 14 Comments

    (special thanks to James for pointing out this bug)

    It is amazing how sometimes one can be so busy trying to make a point that one can miss the point.

    A few days ago, I pointed out that CharNext(ch) != ch+1, a lot of the time.

    That ought to be true. It is true if you are running Windows NT 3.51, Windows NT 4.0, or Windows 2000.

    But in XP, things seem to have changed a bit.

    It used to be that if one took combining characters like U+0308 (COMBINING DIAERESIS) and passed them to the GetStringTypeW or GetStringTypeEx APIs with the CT_CTYPE3 dwInfoType, it would return (C3_NONSPACING | C3_DIACRITIC). If you look at the Platform SDK topics for these APIs, the types are defined as follows:

    Name             Value     Meaning
    C3_NONSPACING    0x0001    Nonspacing mark.
    C3_DIACRITIC     0x0002    Diacritic nonspacing mark.

    Starting with Windows XP and continuing on with Windows Server 2003, it now just returns C3_DIACRITIC. Looking at the definitions, this makes sense -- C3_DIACRITIC claims it is for nonspacing marks, too. So the relevant part of the change is:

    1. There used to be no characters marked with just C3_DIACRITIC.
    2. There are no characters that are marked with just C3_NONSPACING now (there used to be several).

    This would all be fine given the above definitions (well, not really -- but we'll let that lie for a bit). The problem is that the CharNext and CharPrev APIs are relying on that C3_NONSPACING definition to figure out when to skip characters.
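
    To make the failure mode concrete, here is a toy model in Python. It is emphatically not the Windows implementation: char_next, ctype3_before_xp, and ctype3_xp are hypothetical names, the two lookup functions only know about U+0308, and the real APIs work on pointers rather than indexes. It just shows how a skip loop keyed on C3_NONSPACING stops skipping once the data only says C3_DIACRITIC:

```python
C3_NONSPACING = 0x0001
C3_DIACRITIC = 0x0002


def ctype3_before_xp(ch):
    """Toy CT_CTYPE3 lookup modelled on the older behavior described above."""
    return (C3_NONSPACING | C3_DIACRITIC) if ch == "\u0308" else 0


def ctype3_xp(ch):
    """Toy CT_CTYPE3 lookup modelled on the XP / Server 2003 behavior."""
    return C3_DIACRITIC if ch == "\u0308" else 0


def char_next(text, index, ctype3):
    """Step past one character plus any following C3_NONSPACING marks."""
    if index >= len(text):
        return index
    index += 1
    while index < len(text) and ctype3(text[index]) & C3_NONSPACING:
        index += 1
    return index


text = "a\u0308b"  # 'a' + COMBINING DIAERESIS + 'b'
print(char_next(text, 0, ctype3_before_xp))  # lands on 'b'
print(char_next(text, 0, ctype3_xp))         # lands on the combining mark
```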

    I'm not sure what scares me more -- that this bug has been around since October of 2000, or that it was found due to a blog post that I might not have thought to do had not someone suggested it to me.

    I'll see about making sure this bug gets put in on Monday.

    So, between this one and the one I found myself (described in the answer to Guess #3 in Why I don't like the IsTextUnicode API), two longstanding bugs in Windows have been found through the act of blogging.

    This answers the question I posted in OT -- They taste like chicken, don't they? once and for all. Blogging may annoy me, but it's not really relevant anymore. Blogs help me make the product better. So I think I'd better keep doing it....

    Scoble, you reading this? :-)

     

    This post sponsored by all 792 of the nonspacing marks in Unicode

  • Sorting it all Out

    Why I don't like the IsTextUnicode API

    • 12 Comments

    The IsTextUnicode API has been around since NT 3.5, according to the Platform SDK histories. According to the PSDK, its purpose is as follows:

    The IsTextUnicode function determines whether a buffer is likely to contain a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi.

    It then goes on to describe the many different tests that it can do when the appropriate flags are passed:

    IS_TEXT_UNICODE_ASCII16
       The text is Unicode, and contains only zero-extended ASCII values/characters.

    IS_TEXT_UNICODE_REVERSE_ASCII16
       Same as the preceding, except that the Unicode text is byte-reversed.

    IS_TEXT_UNICODE_STATISTICS
       The text is probably Unicode, with the determination made by applying statistical analysis. Absolute certainty is not guaranteed. See the following Remarks section.

    IS_TEXT_UNICODE_REVERSE_STATISTICS
       Same as the preceding, except that the probably-Unicode text is byte-reversed.

    IS_TEXT_UNICODE_CONTROLS
       The text contains Unicode representations of one or more of these nonprinting characters: RETURN, LINEFEED, SPACE, CJK_SPACE, TAB.

    IS_TEXT_UNICODE_REVERSE_CONTROLS
       Same as the preceding, except that the Unicode characters are byte-reversed.

    IS_TEXT_UNICODE_BUFFER_TOO_SMALL
       There are too few characters in the buffer for meaningful analysis (fewer than two bytes).

    IS_TEXT_UNICODE_SIGNATURE
       The text contains the Unicode byte-order mark (BOM) 0xFEFF as its first character.

    IS_TEXT_UNICODE_REVERSE_SIGNATURE
       The text contains the Unicode byte-reversed byte-order mark (Reverse BOM) 0xFFFE as its first character.

    IS_TEXT_UNICODE_ILLEGAL_CHARS
       The text contains one of these Unicode-illegal characters: embedded Reverse BOM, UNICODE_NUL, CRLF (packed into one WORD), or 0xFFFF.

    IS_TEXT_UNICODE_ODD_LENGTH
       The number of characters in the string is odd. A string of odd length cannot (by definition) be Unicode text.

    IS_TEXT_UNICODE_NULL_BYTES
       The text contains null bytes, which indicate non-ASCII text.

    IS_TEXT_UNICODE_UNICODE_MASK
       This flag constant is a combination of IS_TEXT_UNICODE_ASCII16, IS_TEXT_UNICODE_STATISTICS, IS_TEXT_UNICODE_CONTROLS, IS_TEXT_UNICODE_SIGNATURE.

    IS_TEXT_UNICODE_REVERSE_MASK
       This flag constant is a combination of IS_TEXT_UNICODE_REVERSE_ASCII16, IS_TEXT_UNICODE_REVERSE_STATISTICS, IS_TEXT_UNICODE_REVERSE_CONTROLS, IS_TEXT_UNICODE_REVERSE_SIGNATURE.

    IS_TEXT_UNICODE_NOT_UNICODE_MASK
       This flag constant is a combination of IS_TEXT_UNICODE_ILLEGAL_CHARS, IS_TEXT_UNICODE_ODD_LENGTH, and two currently unused bit flags.

    IS_TEXT_UNICODE_NOT_ASCII_MASK
       This flag constant is a combination of IS_TEXT_UNICODE_NULL_BYTES and three currently unused bit flags.

    Sound impressive and interesting enough yet?

    A bit of trivia -- the code for a flag that used to be documented (IS_TEXT_UNICODE_DBCS_LEADBYTE) is still there (and it is still in the header file, obviously -- the PSDK never breaks people like that). But the flag does not work well, so it is probably just as well that it is not documented any more. I highly recommend not passing it. Or ignoring when it is returned. The flag is not dangerous or anything; it's just not too terribly useful for its intended purpose (detecting text that is actually DBCS).

    As I mentioned, the API has been around since NT 3.5. It was written by someone else, outside of the NLS team (such as it was in those days). That is fairly cool since there was not as much Unicode awareness/acceptance back then as there is now....

    In those heady days when to most developers Unicode was little more than a foreign word that translated to "twice the memory and space required for strings", this function was mostly used as a way to know when to call WideCharToMultiByte to convert strings out of Unicode[1], and there were very few callers even for that not-so-noble purpose. NT 4.0 did not see much of a usage explosion, although Windows 2000 did, where the number of callers throughout the entire Windows source tree just about tripled (to 65 or so callers). Not much movement on the caller side in XP or Server 2003, either. I don't mind this fact much, given why it mostly seemed to be used.

    Some time between XP and Server 2003, I did add it to MSLU, as a nice gesture to developers who were frustrated by NT-only APIs[2].

    Nevertheless, as the title of this post indicates, I don't like the IsTextUnicode API.

    You may think you know why -- go ahead, I'll give you three guesses.

    Guess #1: Because I do not own it?

    Sorry, that's not it -- but your opinion about my ego is noted. :-)  Strike one!

    I'll give you a hint.

    Hint #1: Look at the Platform SDK description (I'll add emphasis to enhance the hint):

    The IsTextUnicode function determines whether a buffer is likely to contain a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi.

    Guess #2: Excuse me, I meant because the NLS team does not own it?

    Hmm, sorry. I figured that was what you meant the first time. Strike Two!

    I'll give you another hint.

    Hint #2: There has only been one substantive change made to this API from the time of its creation until Server 2003 shipped -- a const was added to the lpBuffer parameter.

    Got it now? Think carefully now, this is your last guess.

    Guess #3: Because it considers "CRLF (packed into one WORD)" to be illegal, even though U+0d0a is MALAYALAM LETTER UU?

    Ooh, good one -- that looks like a bug in the IS_TEXT_UNICODE_ILLEGAL_CHARS flag detection. Even cooler that you properly figured out the byte reversal issue. Or maybe you did not notice that part, since both that ASCII CRLF packed into a WORD and the character would reverse on little-endian systems to look like 0x0a0d in memory, and if you did not allow for byte reversal you would have been right then anyway.

    Given the support for Malayalam described previously in the post Lions and tigers and bearsELKs, Oh my!, this is kind of embarrassing. Or maybe given the fact that the code point has been allocated since Unicode 1.1 (according to DerivedAge.txt) which was released in June of 1993 (according to enumeratedversions.html), this is particularly embarrassing. Though that does make the comment over its use in the API source pretty amusing:

                //  The following is not currently a Unicode character
                //  but is expected to show up accidentally when reading
                //  in ASCII files which use CRLF on a little endian machine.

    If you think about it, most UTF-16 big endian files would be from other operating systems and have just a CR or just an LF for their line breaks, even if they were just ASCII. I guess we know why there is no big-endian check for illegal characters? :-)  Makes the whole IS_TEXT_UNICODE_ILLEGAL_CHARS check weird even if it were not totally busted anyway.

    For MSLU fans, yes I ported this bug there as well, though not on purpose. Sorry about that, I am not used to reading code points as reversed bytes....

    Of course, since I did not know about this problem before, it can't be why I started this post not liking the API. Hell, if not for this imaginary conversation I put together, I still wouldn't know about it. Lucky for everyone that I have displayed this psychological dysfunction in public and thus cannot be further embarrassed by reporting the bug on it, right? Strike 3!

    Or we could call it a foul tip, since you found a decade-old bug and all. Ok, it is still Strike 2. :-)

    One more hint:

    Hint #3: There has been no change to this API's underlying mechanics since at least NT 3.51 (and probably since the original NT 3.5 release).

    Any more guesses?

    Guess #4: Because it only seems to test the first 256 bytes, no matter how big of a string I pass?

    Well, no. I never cared too much for that one, even before I came to Microsoft. But I never really found a file where it made a difference. It would be nice if someone were to change this, but I wouldn't lose any sleep over it -- so it's definitely not a reason to dislike an API. Strike 3!

    Ok, I'll just tell you now. Because as an API intended to verify whether a string is following a standard, it wins an award for its obtusitality. Why on earth would the following not have been added, over the years if not in the initial release?

    IS_TEXT_UNICODE_UNPAIRED_SURROGATES
       Since it is invalid to have a high surrogate without a low surrogate following it, or a low surrogate not preceded by a high surrogate, why not detect such non-conformant cases?

    IS_TEXT_REVERSE_UNICODE_ILLEGAL_CHARS
       It seems only fair to round out the checks for UTF-16BE by including the reverse version of this flag, doesn't it?

    IS_TEXT_UNICODE_INVALID_FOR_4_00
       Obviously new flags could be added for each major version -- what better way to check for what is invalid than to check against an official "valid" list?

    IS_TEXT_UNICODE_INVALID_SCRIPT_USAGE
       There are all kinds of sequences that would indicate bad usage, from combining marks from one script used in an unrelated script to illegal sequences to text with invalid ordering per the canonical combining classes, and so on.

    IS_TEXT_UNICODE_VALID_UTF8_PER_RFC2279
       The initial description of UTF-8 in RFC 2279, which I think is the method used by Notepad[3].

    IS_TEXT_UNICODE_VALID_UTF8_PER_UNICODE
       The more strict definition of UTF-8, which disallows surrogate code sequences and non-shortest forms.

    IS_TEXT_UNICODE_VALID_UTF32 / IS_TEXT_UNICODE_VALID_REVERSE_UTF32
       These flags could be combined with some of the older signature detection flags if a UTF-32 LE or BE signature is found.

    IS_TEXT_UNICODE_UCS2_32 / IS_TEXT_UNICODE_REVERSE_UCS2_32
       Analogous to the IS_TEXT_UNICODE_ASCII16/IS_TEXT_UNICODE_REVERSE_ASCII16 flags, they would detect UTF-32 that looks like it could all be represented as UTF-16 without needing surrogate pairs.

    You get the idea -- Unicode is a dynamic standard, getting more interesting and more complicated all the time, not just for its own sake but in how the platform uses it. How can an API which is written a decade ago and never updated, whose job is to ask "is this flipping buffer full of Unicode text?" ever hope to keep up with such a standard?
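
    Just to show that a couple of these wished-for checks are not a lot of code, here is a rough sketch in Python. The function names are made up, the flags above are imaginary to begin with, and none of this claims to be how IsTextUnicode (or anything else in Windows) would actually do it. The first function looks for unpaired surrogates in a list of UTF-16 code units; the second leans on Python's own strict UTF-8 decoder, which rejects encoded surrogates and non-shortest forms, as a stand-in for the stricter UTF-8 definition:

```python
def has_unpaired_surrogates(units):
    """True if a sequence of 16-bit code units has a surrogate with no partner."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return True
        i += 1
    return False


def is_strict_utf8(data):
    """True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return False
    return True


print(has_unpaired_surrogates([0xD83D, 0xDE00]))  # a proper pair
print(has_unpaired_surrogates([0xD83D, 0x0041]))  # high surrogate, then 'A'
print(is_strict_utf8(b"\xe0\xb4\x8a"))            # U+0d0a encoded as UTF-8
print(is_strict_utf8(b"\xc0\x80"))                # non-shortest form of NUL
print(is_strict_utf8(b"\xed\xa0\x80"))            # an encoded lone surrogate
```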

     

    1 - Notepad being a noteworthy exception to this rule, since it used the API to try to detect when a text file was Unicode without a BOM.

    2 - Similar to why BeginUpdateResource, UpdateResource, and EndUpdateResource were added, though I must admit that for the *UpdateResource APIs it was mainly due to the fact that former MSFTie Matt Curland did all the work to make the functions Win9x-friendly. :-)

    3 - These are the rules that have been used by MultiByteToWideChar in later years. Ironically, the MultiByteToWideChar API is used by Notepad to convert files that it detected as UTF-8 by using RFC 2279 rules, meaning that any illegal sequences will be dropped without so much as a warning. Better keep those CESU-8 files away from recent enough versions of Notepad!

     

    This post sponsored by our much maligned little brother "ഊ" (U+0d0a, a.k.a. MALAYALAM LETTER UU)
    Who, like the rest of the Malayalam script, felt very supported by XPSP2, only to find out that the IsTextUnicode API did not share that opinion....

  • Sorting it all Out

    Every character has a story #5 (U+262b FARSI SYMBOL)

    • 5 Comments

    This character has an interesting history. As noted by Roozbeh Pournader:

    Neither Farsi, nor a symbol. In real life, it is the official emblem of the goverment of the Islamic Republic of Iran.

    Technically that would make it a logo and thus not a suitable candidate for encoding. But Roozbeh also noted:

    Exactly. The funny fact is that it has been in Unicode since 1.0...

    Truer words have ne'er been spoken. Luckily Ken Whistler stepped in to help explain the inconsistency:

    And in Unicode 1.0 it was called "SYMBOL OF IRAN", which was closer to your description of its use. It was WG2 that insisted on renaming it "FARSI SYMBOL" to get "IRAN" out of the name...

    P.S. I can feel another "Every Character Has a Story" story coming on...

    Of course this does seem to violate the stability rules, which claim that once a character is encoded, its name will not be changed. Luckily Ken once again stepped up to explain:

    Ancient history. Hundreds -- maybe thousands -- of Unicode 1.0 character names were changed in 1993 for Unicode 1.1 as part of the merger between the repertoires of Unicode and ISO/IEC 10646-1:1993. (The Great Compromise) The gory details of all the changes can be found in UTR #4, The Unicode Standard, Version 1.1. It was *after* that point (which was *very* painful for some people) that we put in place the never change a character name rule.

    The whole reason for having a Unicode 1.0 Name field in the UnicodeData.txt file was to track that name change.

    Now of course UTR #4 has been superseded and is not available online, though one would probably not learn much of interest since most of the fun/interesting parts about "The Great Compromise" are in the history and stories from those who were there, and that is not really captured. Think of it as being like the book of Acts in the New Testament -- many of the stories that would (in my very humble opinion) be really interesting about that particular period of time were not recorded, because the processes of change and compromise always tend to record information that speaks much more kindly about the experience than those who are there would themselves recall if you sat them down and bought them a beer....

    Anyway, back to U+262b. Roozbeh gave some more information in a different thread:

    ...U+262B, the so-called FARSI SYMBOL, which is nothing but the official symbol of the (government of) Islamic Republic of Iran, with no known usage but this. It was specifically designed in 1979 or 1980 for this purpose, and also appears in the flag of the Islamic Republic of Iran adopted at the same time.

    One insteresting [sic] point is that it is not Farsi (Persian) in any way! It is a logo form of the Arabic word "Allah", also encoded at U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM. Another interesting point is no one remembers exactly how it has got into Unicode! It has been there since the Unicode 1.0 days, so the source is definitely not an Iranian representative in SC2.

    Another interesting point is that when the very final session for approving a very recent Iranian national standard, defining a minimum subset of Unicode for Persian information interchange, was being held, the committee experts voted for removing this character from the optional characters list (characters which need not be supported but their use should be according to the text if they are), telling that it's really not a character, but a logo: "It's not used in text, but just in letterheads".

    Is anyone collecting notes to write that "Every Character Has a Story" book some time? It's a good case for such a research! ;)

    When the idea of that "Every Character Has a Story" book was being floated around, I remember suggesting a subtitle of "The Dark Underbelly of Unicode". Amazing how easy it is to get there when you look into the history of some characters....

    And to date, no one (as far as I know) has come forward purporting to know how the "SYMBOL OF IRAN" was added to Unicode 1.0 (who proposed it, or why). Its source remains a mystery to this day....

     

    This post brought to you by who else but "☫" (U+262b, a.k.a. FARSI SYMBOL, a.k.a. SYMBOL OF IRAN in Unicode 1.0)

  • Sorting it all Out

    Sorry folks, MSKLC cannot trap CONTROL+ALT+DELETE

    • 3 Comments

    A few days ago, Larry Osterman pointed out Why is Control-Alt-Delete the secure attention sequence (SAS)?

    It is funny, but one popular topic that comes up in supporting MSKLC is people wanting to develop a keyboard layout that blocks that keystroke combination. So they are just looking for a version of MSKLC that does not have the DELETE key disabled, because the other two keys are there and they are so close to their goal....

    Sorry, but they are not close. A keyboard layout cannot be made to take away this functionality. CTRL+ALT+DEL is still the one safe combination, even if the DELETE key were enabled in the user interface of MSKLC.

    Of course, there is a dialog that Outlook puts up when it feels the need to reauthenticate (network hiccups?) which I never type into since the spoofing potential is so obvious. But I am sure most people do type in their credentials again anyway and would consider me to be paranoid. Maybe they should recommend in the dialog that the user hit CTRL+ALT+DEL and type in their password there rather than trying to prompt for it directly?

    The problem with trying to make a system foolproof is that the designers will always underestimate the ingenuity of complete fools. Unfortunately, those with evil intent do not underestimate complete fools; they thrive on such people....

    This post sponsored by "¶" (U+00b6, PILCROW SIGN)
