
September, 2007

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    The lasting effects of interns, aka Can you fix my Vista install, aka Can they blow the Shofar, aka I should have split up this post!

    • 0 Comments

    Last week I was talking with Cathy about my post from late last month -- Every character has a story #29: U+1000^H^H^H^H0f40, (TIBETAN or MYANMAR LETTER KA, depending on when you ask) -- and she pointed out that although I did not specifically name her, people would know that she did the deed and added the old Tibetan that was never taken out, since she did a lot of that original work.

    It gave me pause for a minute -- I admit at first I hadn't really been thinking about it in those terms. Was it her?

    Okay, I guess it was. Though I think more people know now that she has pointed it out than ever would have realized from my little story....

    Of course I had to point out some important truths if we were really going to get the whole story out.

    Like the fact that she was an intern at the time.

    And it was after she left as an intern yet before she came back as a contractor that the whole "Tibetan Switcheroo" exercise happened, so it wasn't like there was a whole lot she could do when she wasn't even here.

    It's also hard to pin the whole Unicode side of the mess on her, since it was really much more an ISO/Unicode issue as part of the merge.

    Does anyone ever think that an intern project will stay in the source and be used almost a decade later? That's actually fairly impressive, and the work held up remarkably well over that time (a tribute to both her and Julie, though I daresay few people really noticed that either!).

    Anyway, in the end it is one of those embarrassing warts that a complex software project picks up while working to implement pieces of a complex (and at the time massively changing) standard, which is kind of funny in retrospect (plus a little funny that Jet and SQL Server still do the old broken thing right now).

    But as a project it has held up rather well!

    Of course last night I had a lot of family members pounce on me about problems with Vista; I should probably talk about those issues too at some point?

    It appears to mainly be some appcompat issues, plus the issue mentioned in the MS Knowledge Base article 929734 (You may experience problems after you resume a Windows Vista-based computer from sleep or from hibernation) though even after being given hot fixes for the problem and troubleshooting with people from HP and Dell (for different machines with the same problem) they are still finding the fix elusive.

    I guess I am mentally distancing myself from the problems a little since (a) I have never hit them in my own dogfooding on various machines, and (b) it does not appear to be in any code I own or have owned.

    But I still want to help them so I am going to see if I can find out what is going on, and whether anyone is tracking potential problems with any hot fixes that are out there now. Just because it isn't my fault doesn't mean it isn't my problem, right? :-)

    (Warning -- geeking out on Yom Kippur minutiae for a moment!)

    I also was talking with my father about this tendency many synagogues seem to have where the final blowing of the Shofar that is supposed to happen during Ne'ilah is delayed until Ma'ariv, mainly to keep people from leaving after Ne'ilah. It just seems kind of shoddy to me; I am sure they would still have ten men, and if not then they had already said the Amidah four times that day! Plus there is the fact that they do not blow the Shofar in Ne'ilah even though it is after sundown, and if they can make an exception to Sabbath rules to say Avinu Malkainu in Ne'ilah to emphasize the repentance, then blowing the Shofar that last time would make a great bit of punctuation on the argument! But I doubt I have the strength in me to convince anyone about that argument so I guess it is just me grouching off. :-)

    By the way, the spell checking software does not recognize Shofar but has an alternate suggestion of Shiva for the word. I don't know quite what to do with that!

    Okay, there will be some additional technical posts going up soon, this one turned non-technical awful fast!

     

    This post brought to you by ෘ (U+0DD8, a.k.a. SINHALA VOWEL SIGN GAETTA-PILLA)

  • Sorting it all Out

    Docs can whet SiaO's appetite, but where's the blog?

    • 4 Comments

    As I mentioned back in How do I feel about lstrcmpi? I think it blows...., the Mac CFString stuff has some fascinating issues related to collation that I thought I'd chat about, with me owning a MacBook Pro and with Microsoft making Silverlight run on it and all. :-)

    Here are the CFString methods of interest that I am thinking about:

    • Searching Strings
      • CFStringCreateArrayWithFindResults 
      • CFStringFind 
      • CFStringFindCharacterFromSet 
      • CFStringFindWithOptions 
      • CFStringGetLineBounds 
    • Comparing Strings
      • CFStringCompare 
      • CFStringCompareWithOptions 
      • CFStringHasPrefix 
      • CFStringHasSuffix 

    First of all, do you see how they split between comparing and searching? In our FindNLSString/FindNLSStringEx functions, since prefix and suffix matching are a special case of a find operation, they are bundled together with the find, too. Though I can see the argument for the split in this way too....

    But what actually interested me the most was the list of String Comparison Flags that all of these functions seem to take:

    typedef enum {
       kCFCompareCaseInsensitive = 1,
       kCFCompareBackwards = 4,
       kCFCompareAnchored = 8,
       kCFCompareNonliteral = 16,
       kCFCompareLocalized = 32,
       kCFCompareNumerically = 64
    } CFStringCompareFlags;

    There seem to be some odd interactions between some of the flags and some of the methods, which makes me suspect that not all of them work together in every case, and in other cases the combinations seem redundant (for example, what's the point of CFStringHasPrefix or CFStringHasSuffix alongside functions that take the kCFCompareAnchored flag? Like there is some other way to be a prefix or a suffix that isn't "anchored" by their definition?).

    And some of the definitions seemed off, like their analogue for StrCmpLogicalW:

    kCFCompareNumerically
            Specifies that represented numeric values should be used as the basis for comparison and not the actual character values.
            For example, “version 2” is less than “version 2.5”. Does not work if kCFCompareLocalized is specified on systems before 10.3.

    There is a version where "2" > "2.5"? Maybe they meant one where "version 10" is greater than "version 2" here, like sorting digits as numbers is meant to do? Or did they also extend it to decimals here, as well (in which case the example is still dumb but the functionality is pretty damn cool and I'd love to know how well it works and what else it can do)?
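
    To make the question concrete, here is a minimal sketch in C of the "digits as numbers" reading of kCFCompareNumerically (and of StrCmpLogicalW). It is an assumption-laden illustration, not how CoreFoundation or the Shell implement it: compare_numerically is a hypothetical helper that only understands ASCII digits and compares everything else byte by byte.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a CoreFoundation or Win32 API).
 * Compares two ASCII strings, treating each run of digits as an
 * unsigned integer and every other character by its byte value.
 * Returns -1, 0, or 1.
 */
int compare_numerically(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    while (*a != '\0' && *b != '\0') {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            size_t len_a = 0, len_b = 0, i;

            /* Leading zeros do not change the numeric value. */
            while (*a == '0') a++;
            while (*b == '0') b++;

            /* Measure what is left of each digit run. */
            while (isdigit((unsigned char)a[len_a])) len_a++;
            while (isdigit((unsigned char)b[len_b])) len_b++;

            /* More significant digits means a larger number. */
            if (len_a != len_b)
                return len_a < len_b ? -1 : 1;

            /* Same length: the first differing digit decides. */
            for (i = 0; i < len_a; i++) {
                if (a[i] != b[i])
                    return a[i] < b[i] ? -1 : 1;
            }
            a += len_a;
            b += len_b;
        } else {
            if (*a != *b)
                return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }

    /* One string is a prefix of the other: the shorter one sorts first. */
    if (*a != '\0') return 1;
    if (*b != '\0') return -1;
    return 0;
}
```

    With this sketch, compare_numerically("version 2", "version 10") is negative, where a plain strcmp would put "version 10" first. "version 2" also sorts before "version 2.5", but only because it is a prefix, which is why the documentation's example proves so little. The sketch does not treat "2.5" as a decimal ("version 2.5" comes out before "version 2.10"), so whether CFString does is still the open question.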

    Plus some of the encoding stuff looked unusual, I thought that might be worth a look too.

    Anyway, after I saw it all, I realized the documentation was really insufficient, and I wanted a Sorting the Mac all Out blog to read so I could find out what was really up with these methods. I couldn't find one in my cursory search, so at some point I may have to start digging in and playing with stuff (unless someone knows of such a blog, of course!).

     

    This post brought to you by ⓐ (U+24d0, a.k.a. CIRCLED LATIN SMALL LETTER A)

  • Sorting it all Out

    If it isn't Unicode, it isn't ANY code!

    • 0 Comments

    Raymond makes a good point in What happens if you pass a source length greater than the actual string length? about the potential dangers of the NLS semantic for length parameters....

    I make a similar point about the potential problems in Encoding APIs and Security Concerns, APIs and Security Decisions, though I stand by my point in API Consistency and Developer Comfort about how, since you cannot rely on consistency across the Win32 API, at least you can often rely on consistency within families of functions (usually).

    In the end, the real errors are fourfold from the NLS point of view in the scenario Raymond talks about, in addition to the actual semantic differences with how length parameters are handled:

    • You should never convert the case of file names -- preserve the case, always, as I talk about here;
    • You should never pass file names to a function that uses CompareString since that is a linguistic comparison, which the file system does not use. CompareStringOrdinal is much more appropriate as I mention here and in other posts;
    • Even if the theoretical invariant_strnicmp Raymond posited were calling the right function, it assumes the two strings passed to CompareString are of the same length, which they don't have to be in that function (the simplification used in many Shell, CRT, and other functions is a dangerous over-simplification);
    • Unicode is not being used!

    How is the title as a new advertising slogan? :-)

     

    This post brought to you by ओ (U+0913, a.k.a. DEVANAGARI LETTER O)

  • Sorting it all Out

    [In/Out]laws can be a great news source (aka Maybe Alaska Air should sacrifice more goats?)

    • 0 Comments

    (With a title like that, you're expecting a technical topic???)

    So I was over at Faye's breaking fast and brother-in-law Zack shared a fun little news story with us care of Reuters entitled Nepal airline sacrifices goats to appease sky god.

    Sister-out-law Jenny had also seen the story and they were both talking about it.

    Now of the last 16 Alaska Air flights I have been on, only 4 have left on time -- and several of those delayed flights were caused by mechanical/equipment failures.

    I have to wonder whether that means William Ayer (chairman, president, and CEO of Alaska Airlines) should have a more liberal policy on goat sacrifices. It might improve their on-time departure stats....

    In case you were wondering, I am clearly not speaking for Microsoft here; as far as I know no goats were sacrificed to help with any recent legal issues in Redmond, which, come to think of it, may be one of the problems.... :-)

    (Hat tip to both Zach and Jenny, my unusual news and ginger gurus!)

     

    This post brought to you by 𐂈 and 𐂉 (U+10088 and U+10089, a.k.a. LINEAR B IDEOGRAM B107F SHE-GOAT and LINEAR B IDEOGRAM B107M HE-GOAT)

  • Sorting it all Out

    DropDownWidth not dropped down with a localizable attribute?

    • 0 Comments

    The purpose of marking properties in WinForms as localizable is to make sure that properties that localizers would possibly need to change can be exposed to them.

    All well and good, but it is easy to miss properties that are in retrospect obvious omissions.

    Like just the other day when Kollen mentioned:

    We are in the process of localizing for Spanish and noticed that the DropDownWidth property on the ComboBox is not showing up in WinRes so we are unable to resize drop-downs to accommodate longer Spanish items.

    While the workaround is to subclass, override the property and add the attribute, that can be difficult especially since this probably wouldn't get noticed by teams until they start the localization process after code freeze.

    Kollen is completely right, and it is an obvious omission that will hopefully be addressed in some future version of WinForms....

    I'll talk more about the general issue another day (I have to get to Shul now, sorry!).

     

    This post brought to you by ͅ (U+0345, a.k.a. COMBINING GREEK YPOGEGRAMMENI)

  • Sorting it all Out

    A high holiday that goes to 11

    • 2 Comments

    When visiting family in Cleveland over Yom Kippur, some interesting issues take place.

    They go to an orthodox congregation now, and the scooter question is a big one.

    Many of the prohibitions about work on the Sabbath apply to Yom Kippur, and even if they did not it also falls on the Sabbath this year.

    I did try going to כל נדרי (Kol Nidre) without the scooter but I quickly regretted it and let people know in no uncertain terms that I couldn't do it again....

    Perhaps not surprisingly, the issue is different for different people....

    • Some are of course troubled at the idea of driving to the Shul that day (though even if I were not there my >90 grandmother would have influenced that one since she would be unable to walk there);
    • Some are troubled at the assembly of the scooter after one drives there (though my parents live on a hill and leaving it assembled for 25 hours and scooting back and forth is not possible);
    • Others are troubled at the notion of charging the battery of the scooter;
    • Still others are troubled by the battery powered device being used at all (medical reason or no), though that number ended up being small;
    • And of course other random variations thereof.

    I suspect people who were unhappy (and who will be today) got over it quickly so I was not too worried, it gave them something to talk about if nothing else....

    But it led me to think about issues like I pointed out in If the porcine is טְרֵפָה then the fact that the bovine probably is too ought to count for something, though more extended where you can likely have 12 or more opinions for any 10 people you talk to about it....

    The law can be clear, but not all the people who follow the law are clear on how they feel about it. The dynamic is interesting, as one would expect it to be.

    Of course as I sit here and type this on my laptop I am clearly not troubled by the notion of using electricity and the computer and the Internet on Yom Kippur, and if people wanted to find me sacrilegious I would think that this makes a much more appropriate target.

    Not that I am afraid of a little controversy. :-)

    I saw my old math teacher Mr. Snodgrass, who was quite unamused about me doing calculations in base 5 in his math class so many years ago (while I suspect being delighted that he inspired the idea in the first place).

    And I also saw the Sklars, which is always interesting (years ago in high school, I was specifically told that I was not allowed to date their daughter, you see; it now manifests in the conversation as an odd familiarity about the eagerness of youth and the jokes about how she was visiting for ראש השנה (Rosh Hashanah) and we had "just" missed each other, something we had technically been doing since I last saw her in New York some 7+ years ago!).

    Plus several others, all of which was very nice. It is in part those personal connections that make trips back to where one used to live an interesting experience.

    It makes for a high holiday that rates a bit higher in my book, in fact that has been cranked to 11. :-)

    I was going to talk more about Yom Kippur and its historical connections with both Christianity and Islam but the Wikipedia article probably says more than I could so I'll let you read that if you are interested....

     

    This post brought to you by ✡ (U+2721, a.k.a. STAR OF DAVID)

  • Sorting it all Out

    A&P of Sort Keys, part 10 (aka I've kana wanted to start talking about Japanese)

    • 4 Comments

    Previous posts in this series:

    Today's post is going to be a first look at some of the Japanese support that is there in Windows....

    Well, not a first look, since this post and the seven others it links to have talked about it already. :-)

    In general, given the challenges faced in trying to handle a Kanji sort correctly, at present (and for the last decade!) only the Kana are handled (but anyone could add the kana for the pronunciation in a database and use that additional column for collation). This will have to do until a more intelligent type of pronunciation-based Kanji sort is given....

    Also, kana sorts properly in all locales, which can come in handy.

    So we take a nice word like ramen (a loan word from Chinese) and look at it in katakana, narrow katakana, and hiragana (using LCMapString/LCMapStringEx to do the various conversions, of course).

    Note that the word would usually be spelled using katakana, so the other forms are just illustrative for us:

    ラーメン   22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 c4 c4 c4 c4 ff ff 01 00
    ラーメン      22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 c4 c4 c4 c4 ff c4 c4 c4 c4 ff 01 80 17 06 03 00
    らーめん  22 42 22 02 22 35 22 80 01 01 01 ff 03 05 02 ff ff 01 80 17 06 03 00

    Clearly all three of them will sort near each other with identical primary weights.
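
    One way to check the "identical primary weights" claim mechanically is to compare only the bytes in front of the first 01 separator. The C sketch below does that for the three keys quoted above. The helper names are hypothetical, and it assumes (as the keys in this series suggest) that 01 separates the weight sections and 00 terminates the key.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed separator between the weight sections of a sort key. */
#define SORTKEY_SEPARATOR 0x01

/* The three ramen sort keys quoted above, byte for byte. */
const unsigned char ramen_katakana[] = {
    0x22, 0x42, 0x22, 0x02, 0x22, 0x35, 0x22, 0x80, 0x01, 0x01, 0x01,
    0xff, 0x03, 0x05, 0x02, 0xc4, 0xc4, 0xc4, 0xc4, 0xff, 0xff, 0x01, 0x00
};
const unsigned char ramen_narrow_katakana[] = {
    0x22, 0x42, 0x22, 0x02, 0x22, 0x35, 0x22, 0x80, 0x01, 0x01, 0x01,
    0xff, 0x03, 0x05, 0x02, 0xc4, 0xc4, 0xc4, 0xc4, 0xff, 0xc4, 0xc4,
    0xc4, 0xc4, 0xff, 0x01, 0x80, 0x17, 0x06, 0x03, 0x00
};
const unsigned char ramen_hiragana[] = {
    0x22, 0x42, 0x22, 0x02, 0x22, 0x35, 0x22, 0x80, 0x01, 0x01, 0x01,
    0xff, 0x03, 0x05, 0x02, 0xff, 0xff, 0x01, 0x80, 0x17, 0x06, 0x03, 0x00
};

/* Hypothetical helper: number of bytes before the first separator. */
size_t primary_weight_length(const unsigned char *key)
{
    size_t n = 0;

    assert(key != NULL);
    while (key[n] != SORTKEY_SEPARATOR && key[n] != 0x00)
        n++;
    return n;
}

/* Hypothetical helper: 1 if both keys have byte-identical primary weights. */
int same_primary_weights(const unsigned char *k1, const unsigned char *k2)
{
    size_t n1 = primary_weight_length(k1);
    size_t n2 = primary_weight_length(k2);

    return n1 == n2 && memcmp(k1, k2, n1) == 0;
}
```

    Calling same_primary_weights on any pair of the three arrays returns 1: each key starts with the same eight bytes (22 42 22 02 22 35 22 80), and everything that differs sits in the later sections.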

    But some interesting havoc is being wreaked here in both the special weights and punctuation weights areas, definitely worthy of some investigation and discussion....

    The order being strived for is something I talked about a little bit in Knock knock! Who's there? Kana! Kana Who?, which you may have seen before. And indeed if you look at the sort keys, this was accomplished for the simple example, though it does not look as rigorous as it maybe could, at first glance.

    Now I could cheat and give it to you by looking at the source and the data and explaining it, but I think we should do it the interesting hacking kind of way, don't you? :-)

     

    This post brought to you by ⑽ (U+247d, a.k.a. PARENTHESIZED NUMBER TEN)

  • Sorting it all Out

    If you had gotten there first, you might have staked your claim too!

    • 3 Comments

    I was reading Raymond Chen's blog post Find the Flowers vs. Minesweeper, which is a pointer to David Vronay's post over on the Shell Blog entitled The UI design minefield - er... flower field??, and it got me thinking about the most recent version of a question that gets asked by someone to one of the various lists I am on at least twice a month, if not more often....

    A question that the Shell post did not mention but which does end up getting involved in many cases.

    The question?

    Is there a difference between running a localized version of Vista and running with the same particular user interface language?

    This question is really ill-formed, though. And I don't just mean in the way that the two clauses don't seem to be balanced very well; that part is my fault. :-)

    It is just that comparing a UI language to the localized SKU is comparing your right hand to your left index finger to ask if they are the same!

    To the conversation I will add the concept of a "mother tongue" for the copy of Windows -- the first language that is installed.

    THAT language is special, because there are several items like folder names and account names and so forth that NEVER change from that mother tongue.

    In versions of Windows prior to Vista it was almost always English1 but in Vista English really is just another language, so the mother tongue really can be any localized SKU.

    Now that mother tongue gets to make a lot of decisions that are pretty much irreversible for the install, things that will be different for any other mother tongue with the matching UI language installed on top of it.

    It is fun to think about issues like games and such, but the same technologies come into play when decisions like R.O.C. date formats in Traditional Chinese or other truly contentious issues come up, as well. One never knows where the next geopolitical issue will come from, so being able to have lots of choices is important!

    Plus there are additional features cued by user interface language like speech (discussed previously here and here) and some mostly by mother tongue but subtly altered in part by user interface language (like localized paths, as I discussed here).

    You can probably find the odd bug here and there where the documentation (which is largely going to be user interface language based) runs afoul trying to document features that change based on SKU.

    When you see default font decisions based on that bizarre combination of user interface language, default system locale, and the UI language of the LocalSystem account (which almost never shows UI), it becomes well nigh impossible to know what font will be used sometimes unless you either choose it yourself or are on the typography team like Si.

    So in the context of the Flower Field vs. Minefield kind of decision, someone has to decide if the decisions are to be SKU based or user interface language based, and then all of the documentation issues pop up as well if decisions are not based on the user interface language.

    So the answer to the perennial question

    Is there a difference between running a localized version of Vista and running with the same particular user interface language?

    is simple to answer:

    Why yes, there is. Because the localized language was there first!

     

    1 - As I pointed out in Microsoft, you giving us some LIP?, some of the worst scenarios with the English base solution were addressed in some of the XP SP2 LIPs.

     

    This post brought to you by ⚘ (U+2698, a.k.a. FLOWER)

  • Sorting it all Out

    A&P of Sort Keys, part 9 (aka Not always transitive, but punctual and punctuating)

    • 7 Comments

    Previous posts in this series:

    Today's contribution to this series is one that has been the source of several misunderstandings.

    It has also led to more than a few bug reports, though in truth most of the bugs were duplicates of the very same issue. :-)

    It is about certain types of punctuation, and the impact of WORDSORT (as opposed to STRINGSORT) in linguistic comparisons.

    I first talked about the feature in A few of the gotchas of CompareString and I will quote the relevant bit here:

    SORT_STRINGSORT - Treat punctuation the same as symbols. For example, a STRING sort treats co-op and co_op as strings that should sort together since the hyphen and the underscore are both treated as symbols. On the other hand, a WORD sort treats the hyphen and apostrophe differently, so that co-op and co_op would not sort together but co-op and coop would. The real documentation for this is built into the winnls.h header file:

    //
    //  Sorting Flags.
    //
    //    WORD Sort:    culturally correct sort
    //                  hyphen and apostrophe are special cased
    //                  example: "coop" and "co-op" will sort together in a list
    //
    //                        co_op     <-------  underscore (symbol)
    //                        coat
    //                        comb
    //                        coop
    //                        co-op     <-------  hyphen (punctuation)
    //                        cork
    //                        went
    //                        were
    //                        we're     <-------  apostrophe (punctuation)
    //
    //
    //    STRING Sort:  hyphen and apostrophe will sort with all other symbols
    //
    //                        co-op     <-------  hyphen (punctuation)
    //                        co_op     <-------  underscore (symbol)
    //                        coat
    //                        comb
    //                        coop
    //                        cork
    //                        we're     <-------  apostrophe (punctuation)
    //                        went
    //                        were
    //
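
    The two orderings in that header comment can be reproduced with a small C sketch. This is a simplified illustration and not the actual CompareString algorithm: it assumes lowercase ASCII input, uses plain strcmp as a stand-in for a STRING sort (which happens to give the right order for this list because - ' and _ all precede the letters in ASCII), and models a WORD sort by skipping hyphen and apostrophe on a first pass and only using them to break ties. All function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The two characters the header comment says WORD sort special-cases. */
static int is_word_punct(char c)
{
    return c == '-' || c == '\'';
}

/* Tie-break for strings that are equal once - and ' are skipped. */
static int compare_punct_positions(const char *a, const char *b)
{
    size_t i = 0, j = 0;
    size_t seen_a = 0, seen_b = 0;   /* ordinary characters passed so far */

    for (;;) {
        while (a[i] != '\0' && !is_word_punct(a[i])) { i++; seen_a++; }
        while (b[j] != '\0' && !is_word_punct(b[j])) { j++; seen_b++; }

        if (a[i] == '\0' || b[j] == '\0') {
            if (a[i] != '\0') return 1;    /* only a has punctuation left */
            if (b[j] != '\0') return -1;   /* only b has punctuation left */
            return 0;
        }
        if (seen_a != seen_b)              /* earlier punctuation sorts first */
            return seen_a < seen_b ? -1 : 1;
        if (a[i] != b[j])
            return (unsigned char)a[i] < (unsigned char)b[j] ? -1 : 1;
        i++;
        j++;
    }
}

/* WORD sort stand-in: ignore - and ' first, use them only to break ties. */
int word_sort_compare(const char *a, const char *b)
{
    size_t i = 0, j = 0;

    assert(a != NULL && b != NULL);
    for (;;) {
        while (is_word_punct(a[i])) i++;
        while (is_word_punct(b[j])) j++;
        if (a[i] != b[j])
            return (unsigned char)a[i] < (unsigned char)b[j] ? -1 : 1;
        if (a[i] == '\0')
            return compare_punct_positions(a, b);
        i++;
        j++;
    }
}

/* STRING sort stand-in: every character takes part in the comparison. */
int string_sort_compare(const char *a, const char *b)
{
    int r;

    assert(a != NULL && b != NULL);
    r = strcmp(a, b);
    return (r > 0) - (r < 0);
}

/* qsort adapters for arrays of const char *. */
int by_word_sort(const void *p, const void *q)
{
    return word_sort_compare(*(const char *const *)p, *(const char *const *)q);
}

int by_string_sort(const void *p, const void *q)
{
    return string_sort_compare(*(const char *const *)p, *(const char *const *)q);
}
```

    Passing by_word_sort to qsort over the nine words puts coop directly ahead of co-op and were directly ahead of we're, while by_string_sort moves co-op to the front of the list and we're ahead of went, matching the two listings in the header.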

    The reasons for this feature are fairly clear -- in many contexts, the default word sorting is more useful and much more intuitive.

    Of course among the reasons that it is not always expected:

    • When - (U+002d, a.k.a. HYPHEN-MINUS) is being less of a hyphen and more of a minus, the results are not only unintuitive, they also cause interesting problems such as I discussed in The problem of string comparisons, WORD sorts, and the minus that is treated like the hyphen;
    • Sometimes, for example in situations like the one discussed in Sort the words, sort the strings, even for nominal hyphens the behavior can be less than intuitive (mainly when they are a part of symbolic identifiers);
    • When ' (U+0027, a.k.a. APOSTROPHE) is used as a variation of " (U+0022, a.k.a. QUOTATION MARK) then the fact that the behavior is intuitive for contraction usage is sullied by how unintuitive it is for other uses.

    The most recent report of the first issue (by far the most common) was just a week ago, and it led to one of the developers over in MSN concluding that:

    It just shows how evil String.Compare is (for being completely counter-intuitive.)

    Hard to argue with that!

    It almost makes one want to do something smarter in the function to try and detect the two cases and handle them differently -- perhaps not too hard since the actual cases are so very different?

    Plus it would be cool to add a SORT_SMARTSORT constant here. :-)

    Let's take a step back and see what the sort keys say.

    Remember that sort keys are only run on a single string and are thus not subject to the "a < b but b < a" type bugs that can lead to real problems. Although as a general principle deciding which one is wrong can vary any time transitivity is not there, or when behavior is different between CompareString/CompareStringEx and LCMapString/LCMapStringEx with LCMAP_SORTKEY, in practice it is usually not the sort keys -- thus they make the best baseline for us, functionally1.

    (WS) -0.67:-0.33:0.33    0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 1b 06 82 00
    (WS) -0.67:0.33:-0.33    0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 07 06 82 80 2f 06 82 00
    (WS) 0.67:-0.33:0.33     0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 80 1b 06 82 00

    (SS) -0.67:-0.33:0.33    06 82 0c 03 07 33 0c 7d 0c 90 07 37 06 82 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 00
    (SS) -0.67:0.33:-0.33    06 82 0c 03 07 33 0c 7d 0c 90 07 37 0c 03 07 33 0c 46 0c 46 07 37 06 82 0c 03 07 33 0c 46 0c 46 01 01 01 01 00
    (SS) 0.67:-0.33:0.33     0c 03 07 33 0c 7d 0c 90 07 37 06 82 0c 03 07 33 0c 46 0c 46 07 37 0c 03 07 33 0c 46 0c 46 01 01 01 01 00

    Clearly, there is a specific preferred ordering for the sort keys based on the earlier hyphen trumping the later one, so in choosing which CompareString/CompareStringEx to prefer, one has a way to go. This is thankfully much less controversial of a decision than when one has to choose between CompareString/CompareStringEx and LCMapString/LCMapStringEx with LCMAP_SORTKEY, since in those cases clients like SQL Server are already contending with index corruption due to the "not so very transitive" results, so as long as one is correct and the other is treated as a bug, they can fix it without waiting for new collations, etc.

    At this point, many people thinking about the fillers one sees in the DW and CW values will wonder exactly how these punctuation weights are being defined. I'll give you a very big hint:

    -012345     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 07 06 82 00
    0-12345     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 0b 06 82 00
    012-345     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 13 06 82 00
    0123-45     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 17 06 82 00
    01234-5     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 1b 06 82 00
    012345-     0c 03 0c 21 0c 33 0c 46 0c 58 0c 6a 01 01 01 01 80 1f 06 82 00

    See what is going on? It still is acting as a filler -- it is just compressing it some. This is much less feasible to do for DW and CW portions of the weight given the additive nature of the values in them, but at least you know punctuation does not waste too much space in sort keys!
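
    For anyone who wants the hint spelled out, here is a small C sketch of the pattern as it can be read off the sample keys in this post. It is an inference from those samples only, not the real NLS algorithm: the byte after 80 looks like 0x07 plus 4 for every ordinary (non-hyphen, non-apostrophe) character in front of the hyphen. punct_position_byte is a hypothetical helper, and nothing here says what happens for positions beyond the ones shown.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper inferred only from the sample keys in this post:
 * the byte after 80 looks like 0x07 + 4 * (number of ordinary characters
 * that precede the hyphen). Other hyphens and apostrophes are not counted,
 * which matches the -0.67:-0.33:0.33 keys earlier in the post.
 */
int punct_position_byte(const char *s, size_t punct_index)
{
    size_t i, ordinary = 0;

    assert(s != NULL);
    for (i = 0; i < punct_index; i++) {
        assert(s[i] != '\0');            /* index must be inside the string */
        if (s[i] != '-' && s[i] != '\'')
            ordinary++;
    }

    /* Past 62 the value no longer fits in one byte. The samples say
       nothing about that range, so this sketch refuses to guess. */
    assert(ordinary <= 62);
    return 0x07 + 4 * (int)ordinary;
}
```

    Under that assumption, the hyphen at index 3 of 012-345 gives 0x13 and the second hyphen of -0.67:-0.33:0.33 (index 6, with five ordinary characters in front of it) gives 0x1b, both of which agree with the keys shown.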

    There are other characters that have this special weight value; to find them all, you could grab the sort key of each character from 0x0000 to 0xFFFF when the SORT_STRINGSORT flag is included -- these "special punctuation" values all have a first byte value of 06.

    In Vista, these values (in weight order, with their other weights present as well) are:

    CODEPOINT SM   AW   DW   CW   COMMENT
    0x0001    6    3    2    2    ;Start Of Heading
    0x0002    6    4    2    2    ;Start Of Text
    0x0003    6    5    2    2    ;End Of Text
    0x0004    6    6    2    2    ;End Of Transmission
    0x0005    6    7    2    2    ;Enquiry
    0x0006    6    8    2    2    ;Acknowledge
    0x0007    6    9    2    2    ;Bell
    0x0008    6    10   2    2    ;Backspace
    0x000e    6    11   2    2    ;Shift Out
    0x000f    6    12   2    2    ;Shift In
    0x0010    6    13   2    2    ;Data Link Escape
    0x0011    6    14   2    2    ;Device Control One
    0x0012    6    15   2    2    ;Device Control Two
    0x0013    6    16   2    2    ;Device Control Three
    0x0014    6    17   2    2    ;Device Control Four
    0x0015    6    18   2    2    ;Negative Acknowledge
    0x0016    6    19   2    2    ;Synchronous Idle
    0x0017    6    20   2    2    ;End Of Transmission Block
    0x0018    6    21   2    2    ;Cancel
    0x0019    6    22   2    2    ;End Of Medium
    0x001a    6    23   2    2    ;Substitute
    0x001b    6    24   2    2    ;Escape
    0x001c    6    25   2    2    ;File Separator
    0x001d    6    26   2    2    ;Group Separator
    0x001e    6    27   2    2    ;Record Separator
    0x001f    6    28   2    2    ;Unit Separator
    0x007f    6    29   2    2    ;Delete
    0x0027    6    128  2    2    ;Apostrophe-Quote
    0xff07    6    128  2    3    ;Fullwidth Apostrophe-Quote
    0x07F4    6    129  2    2    ;NKO HIGH TONE APOSTROPHE
    0x07F5    6    129  20   2    ;NKO LOW TONE APOSTROPHE
    0x002d    6    130  2    2    ;Hyphen-Minus
    0xff0d    6    130  2    3    ;Fullwidth Hyphen-Minus
    0xfe63    6    130  2    8    ;Small Hyphen-Minus
    0x2212    6    131  2    2    ;Minus Sign
    0x208b    6    131  2    4    ;Subscript Hyphen-Minus
    0x207b    6    131  2    14    ;Superscript Hyphen-Minus
    0x2010    6    132  2    2    ;Hyphen
    0x058a    6    132  21   2    ;Armenian Hyphen
    0x2011    6    133  2    2    ;Non-Breaking Hyphen
    0x2027    6    134  2    2    ;Hyphenation Point
    0x2043    6    135  2    2    ;Hyphen Bullet
    0x2012    6    136  2    2    ;Figure Dash
    0x2013    6    137  2    2    ;En Dash
    0xfe32    6    144  2    12    ;Glyph For Vertical En Dash
    0x2014    6    144  21   2    ;Em Dash
    0xfe58    6    144  21   8    ;Small Em Dash
    0xfe31    6    144  21   12    ;Glyph For Vertical Em Dash
    0x2015    6    146  2    2    ;Quotation Dash
    0x301c    6    147  2    2    ;Wave Dash
    0x3030    6    148  2    2    ;Wavy Dash

    The control characters are there for compatibility with prior versions and are used by some very low level pieces of Windows as sentinels (when we tried to change the values we quickly made the system unbootable!).

    The only other thing to note is that 08 case weight bug I mentioned in Part 3 is here too for a few characters. Did anyone from the NLS test team put that bug in yet? :-)

    As a bonus, can anyone explain why it doesn't matter in the default case, and under what combination of circumstances it would matter? I'll put the answer in a comment eventually if no one else figures it out....

     

    1 - The de facto decision to consider sort key results to be definitive is quite ironic given that SQL Server2 uses CompareString to build its sort keys.
    2 - It is fascinating though perhaps not unexpected to note that the single biggest reporter of CompareString anomalies is in fact SQL Server.

     

    This post brought to you by 9 (U+0039, a.k.a. DIGIT NINE)

  • Sorting it all Out

    Avoiding entitlement (aka Don't bother, it's okay)

    • 6 Comments

    Absopositively nothing technical whatsofreakingever. 

    Over the last few months I have probably taken the bus more than any previous time in my life since the bus was an American Flyer taking me to Beachwood Middle School.

    I just found myself needing to get to places where I wanted to have the scooter but did not want to deal with paying to park the car -- like downtown Seattle.

    Fer instance at TypeCon. And a bunch of other places. The most recent one just the other day, as I headed home after dropping off my car to get the roof fixed.

    But I found out something weird during these times.

    No one ever asked me to pay a fare. Not once the entire time.

    Now I had a FlexPass which might occasionally have been in plain sight, but usually it was in my pocket. And I'm just trying to figure out why no one even asked....

    It's not like I am a no-frills passenger, either. The bus has to lose some seats because of the whole accessibility thing, and the driver has to stop to help strap the scooter in (I usually try to do it myself but it's awkward and the driver ends up coming over before I am finished).

    The three times I actually went out of my way to show the FlexPass, each time the driver said some variation of "don't bother, it's okay."

    Don't bother, it's okay.

    Hmmm.

    Don't bother, it's okay.

    I am really not sure how I feel about this.

    On the one hand, I'm annoyed.

    I mean, I have the damn FlexPass and even if I didn't I wouldn't have minded paying the fare anyway.

    But on the other hand, maybe the drivers figured a favor is being done. Something that probably makes them feel better.

    If I correct them (or really in any way take out my anger about them assuming whatever they were assuming) then they will feel worse, may be embarrassed, and maybe they won't do it the next time.

    Do I even know what they are assuming? Probably not.

    And maybe somebody worse off, living off SSDI and barely scraping by, really was depending on this favor, this weird unwritten (till this blog post now?) entitlement that I never asked for and never needed.

    You know what it's about? 

    It's PRIDE.

    My trouble here is pride.

    I mean, I don't want to take advantage. And more importantly, I don't want to feel like someone is giving me some kind of charity.

    And because of that, which is totally about me, I was not seeing someone trying to be nice, I was seeing pity that I doubt was really even there.

    Totally whack, and I mean that in the worst possible sense (assuming it still means something bad? I lose track of that kind of crap, I probably should have asked Claire or Trisha while they were around!).

    I have to get this chip off my shoulder, though. I am living my life, and whenever I take the time to notice I am actually enjoying it.

    And I'm hardly living the life of Walter Mitty in this blog -- I am doing stuff then writing about it.

    What was the lesson from Ferris Bueller's Day Off again? Oh yeah, "Life goes by pretty fast. If you don't stop and look around once in a while, you could miss it."

    I'm not entitled, at all -- but if people want to act like I am, that's cool. As long as I realize that isn't about me, that's about them. And just enjoy what life offers....

     

    This post brought to you by Ⅹ (U+2169, a.k.a. ROMAN NUMERAL TEN)

  • Sorting it all Out

    Every character[ sequence] has a story #30: The SMILEY (a 25-year old story, in fact)

    • 2 Comments

    Given all of the blather about emoji and emoticons and symbols, the mail I got from Sergey earlier today puts it all in perspective.

    It had the following in it:

    Note the date and time, and when this post goes live.

    For more on Scott and Smiley lore, see here and here.

    Considering how often I use this particular item, I feel like I owe him a serious thank you!

     

    This post brought to you by ⌣ (U+2323, a.k.a. SMILE)

  • Sorting it all Out

    Don't just delete registry keys!

    • 9 Comments

    The other day, someone from product support was working with a customer whose Add button on the Text Services and Input Languages dialog was grayed out.

    You know, the Add button in this dialog:

    It turns out this can only happen in some pretty catastrophic circumstances, like the registry key being missing.

    I found myself intrigued about what the behavior might be like, so I decided to test things out here. So that you won't have to! :-)

    First, let me point out the usual warning that the Microsoft Knowledge Base does:

    WARNING: If you use Registry Editor incorrectly, you may cause serious problems that may require you to reinstall your operating system. Microsoft cannot guarantee that you can solve problems that result from using Registry Editor incorrectly. Use Registry Editor at your own risk.

    Let me add one thing to that: if you want to do this yourself, back up the registry subkey first, as the menu option below does:

    PLEASE follow this advice -- if you don't then you are kind of semi-screwed without independently fixing the key back up....
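If you would rather script the backup than click through the menu, something like the following would do it. Treat it as a sketch only: it assumes the stock reg.exe tool (with its export and import verbs) is on the path, and the helper names and the backup file name are made up for illustration.

```python
import subprocess
import sys

# The subkey this post goes on to delete.
KEYBOARD_LAYOUTS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\Keyboard Layouts"


def build_export_command(key, backup_file):
    """Build the reg.exe command line that saves `key` into `backup_file`."""
    # /y overwrites an existing backup file without prompting.
    return ["reg", "export", key, backup_file, "/y"]


def build_import_command(backup_file):
    """Build the reg.exe command line that restores a saved .reg file."""
    return ["reg", "import", backup_file]


def run_reg_command(command):
    """Run a reg.exe command; refuses to do anything when not on Windows."""
    if sys.platform != "win32":
        raise RuntimeError("reg.exe is only available on Windows")
    subprocess.run(command, check=True)
```

On a Windows machine, run_reg_command(build_export_command(KEYBOARD_LAYOUTS_KEY, "keyboard_layouts_backup.reg")) would take the backup, and the matching import command would put it back (restoring into HKLM will most likely need an elevated prompt).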

    Ok, now I will delete the "HKLM\SYSTEM\CurrentControlSet\Control\Keyboard Layouts" subkey.

    So what happens when you try to launch the dialog?

    Well, something very sad:

    The Add... button is definitely disabled. It seems that since the dialog needs that data to populate itself, lack of any info pretty much makes it empty.

    That IME is a Text Services Framework TIP, which is mostly stored elsewhere....

    Of course that begs the question of what would have happened if I only had regular keyboards and no TSF IMEs, more like this situation:

    If I deleted the key then, what would happen?

    You don't want to know, trust me. :-)

    Well, if you are going to twist my arm:

    Yikes!

    Incompatible keyboard driver detected.  This dialog has been disabled.

    In case someone was confused about the blank dialog. :-)

    Of course you know what's next, right?

    I have to try all of this on Vista.

    Backing up the registry key first, of course!!!

    Here is what the deleted keyboards registry key does:

    The Add... button is not disabled!

    Maybe it will work somehow?

    Though the fact that it lists none of my keyboards makes me nervous, if you know what I mean....

    Here goes:

    Doesn't inspire much confidence, huh? :-)

    Okay, enough excitement for one day, let's add those backed up registry keys back.

    Hold our breath, launch the dialog again, and

    Whew! Everything is back.

    And the Add... button?

    Good, everything is back.

    Let's not delete sections of the registry any more. Some of those dialogs are plain disturbing!

     

    This post brought to you by ḵ (U+1e35, a.k.a. LATIN SMALL LETTER K WITH LINE BELOW)

  • Sorting it all Out

    Depending on when/where/who you ask, that character may not be your [c]type

    • 7 Comments

    The customer question was:

    The whole story is about saving in XML format foreign symbols – from time-to-time it fails for some of them.
    We found a way to bypass it by filtering them out with isprintable function.
    The problem was solved, but some of good symbols are gone…

    Japanese customers complaining that we are filtering out some of their characters.
    They sent us some string containing such chars.

    I've prepared short test to see what happens.
    Actually second, third and some other character is recognized as not printable, but they – Japanese – say it's perfectly OK…

    The sample string in question:

    ﾎﾞｰﾘﾝｸﾞ工具

    Let's look at the GetStringTypeW CT_CTYPE3 values for each character.

    In Vista (where the bug does not repro), the values are:

    •   U+ff8e  C3_ALPHA | C3_HALFWIDTH | C3_KATAKANA
    •   U+ff9e  C3_ALPHA | C3_HALFWIDTH | C3_KATAKANA | C3_DIACRITIC
    •   U+ff70  C3_ALPHA | C3_HALFWIDTH | C3_KATAKANA | C3_HIRAGANA | C3_DIACRITIC
    •   U+ff98  C3_ALPHA | C3_HALFWIDTH | C3_KATAKANA
    •   U+ff9d  C3_ALPHA | C3_HALFWIDTH | C3_KATAKANA
    •   U+ff78  C3_ALPHA | C3_HALFWIDTH | C3_KATAKANA
    •   U+ff9e  C3_ALPHA | C3_HALFWIDTH | C3_KATAKANA | C3_DIACRITIC
    •   U+5de5  C3_ALPHA | C3_IDEOGRAPH
    •   U+5177  C3_ALPHA | C3_IDEOGRAPH

    In XP/Server 2003, U+ff9e (HALFWIDTH KATAKANA VOICED SOUND MARK) and U+ff70 (HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK) did not have the C3_ALPHA on them and were therefore not considered alphabetic by the CRT function.

    So the problem will not happen in Vista, and it will not happen (if memory serves) in Win2000 and earlier....

    Though to be honest I wonder why the CRT would require something to be a letter to consider something printable, it seems strange (being a diacritic should be enough to make something printable, if you ask me). But after the feedback of what the change was actually doing, the NLS change was essentially reverted (in the process of adding all the Unicode 5.0 characters)....
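To make the "why require a letter?" point concrete, here is a small sketch of a filter that drops only characters that are clearly not text instead of asking whether each one is alphabetic. It uses Python's unicodedata module, which is a different data source than the NLS tables or the CRT, so it says nothing about what GetStringTypeW returns on any Windows version; the choice of categories to drop and the function names are my own assumptions.

```python
import unicodedata

# The customer's string, written out by code point
# (halfwidth katakana followed by two ideographs).
SAMPLE = "\uff8e\uff9e\uff70\uff98\uff9d\uff78\uff9e\u5de5\u5177"

# General categories to drop: control, surrogate, and unassigned code points.
DROPPED_CATEGORIES = {"Cc", "Cs", "Cn"}


def describe(text):
    """Return (code point, general category, name) for each character."""
    return [
        (f"U+{ord(ch):04x}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def keep_printable(text):
    """Drop only characters whose general category is in DROPPED_CATEGORIES."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in DROPPED_CATEGORIES
    )


for row in describe(SAMPLE):
    print(*row)
print(keep_printable(SAMPLE) == SAMPLE)
```

A filter along these lines keeps the two sound marks without caring whether anyone considers them letters, which is the behavior the Japanese customers were asking for.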

    The underlying issue? Well, since NLS made a change which NLS essentially reverted, I guess you can blame NLS. Though I honestly prefer to think of it as a misguided attempt to be more properly descriptive of Unicode properties in the [boneheaded] NLS character property descriptions, which later caved to the realities of the [equally boneheaded] C runtime character categories. :-)

     

    This post brought to you by ﾞ (U+ff9e, a.k.a. HALFWIDTH KATAKANA VOICED SOUND MARK)

  • Sorting it all Out

    A&P of Sort Keys, part 8 (aka You can often think of ignoring weights as a form of ignorance)

    • 7 Comments

    Previous posts in this series:

    You should think of Part 8 as kind of a seventh inning stretch in this series, where I sit back and you sit back and I impart some of the wisdom I have acquired over the years I have used, worked on, owned, and then assisted with as a "developer emeritus" the collation functionality in Windows.

    I'm also going to take potshots at some product designs, mainly pointing at a single Microsoft product when in reality it applies to many MS and non-MS products....

    Feel free to disregard it.

    Feel free to disagree with it.

    But don't skip reading it, because I am right. :-)

    The issue is one that causes at least half of the possible collation operations done in products like SQL Server to be wrong.

    It is one I have discussed previously, in posts like 'Which comes first?' vs. 'Are they equal?'.

    The central problem is that IGNORING WEIGHTS in ordering decisions is just plain ignorant.

    Just like I pointed out how If you don't always preserve case, you don't always preserve meaning, distinctions in the data should not be ignored, or folded away, or treated as un-needed. They are distinctions -- in case, in width, in diacritics, in Kana. When you order the data, you should keep those distinctions in mind, as ignoring them is requesting that the items be put in random, non-deterministic order.

    In short, it is ignorant.

    Of course, when doing identity checks, it can make sense to ignore distinctions.

    Even when querying for a subset of the data, it can also make sense as an option.

    Neither of those two uses is ignorant.

    But ordering data (even the data that comes from that subset query) and ignoring distinctions is ignorant.

    So why was I picking on SQL Server a minute ago?

    Well, I guess I am picking on most databases here, not just SQLS. Since most of them:

    • do not provide a good way to look at the data in these two distinct ways, or
    • if they do provide them both, they don't provide a way to do both operations in an indexed way, or
    • they do provide them both in an indexed way but the way is not intuitive and requires one to build two different indexes.

    Let's ignore the IGNORANT nature of the first and second of those bullet points, since the reasons should be pretty obvious.

    And also they don't apply to SQL Server except perhaps in some earlier, lamer versions that came along when it was less of a product than it has managed to become.

    Instead let's think about that third bullet point.

    This one and my answer don't (strictly speaking) apply to SQL Server since they don't use our sort keys for their indexes (as I pointed out in When good SQL queries have trouble...), they build their own using their borrowed CompareString call. Though you could always hold them responsible too if their solution does not allow what I am about to suggest. :-)

    Now if you look back on that very first post, Part 0, where I gave how the sort key looked:

    [all Unicode sort weights] 01 [all Diacritic weights] 01 [all Case weights] 01 [all Special weights] 01 [Punctuation weights] 00

    Passing NORM_IGNORENONSPACE or NORM_IGNOREWIDTH/NORM_IGNORECASE or NORM_IGNOREKANATYPE is telling LCMapString/LCMapStringEx with the LCMAP_SORTKEY flag to basically take the diacritic or case or special weights and just skip them.

    In other words, when you pass those flags, everything relevant between the appropriate 01 sentinels is removed.
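A rough way to picture that is to split a finished sort key on its 01 sentinels and blank out a section. The sketch below works on raw key bytes only and never calls LCMapString; it assumes 01 never shows up inside a weight and that the key ends in 00, and the section names and helper functions are mine, not anything in NLS.

```python
# Section order from the Part 0 layout.
SECTION_NAMES = ("unicode", "diacritic", "case", "special", "punctuation")


def split_sort_key(key):
    """Split a sort key into its five sections, keyed by section name."""
    if not key.endswith(b"\x00"):
        raise ValueError("sort key must end with the 00 terminator")
    parts = key[:-1].split(b"\x01")
    if len(parts) != len(SECTION_NAMES):
        raise ValueError("expected five sections separated by 01 sentinels")
    return dict(zip(SECTION_NAMES, parts))


def drop_sections(key, ignored):
    """Rebuild the key with the named sections emptied out."""
    sections = split_sort_key(key)
    for name in ignored:
        sections[name] = b""
    return b"\x01".join(sections[name] for name in SECTION_NAMES) + b"\x00"


# A key with a single case weight (12) and nothing in the other optional sections.
key = bytes.fromhex("0e 02 01 01 12 01 01 00")
print(split_sort_key(key))
print(drop_sections(key, ["case"]).hex(" "))
```

The real flags are a bit more selective than "empty the whole section" (width and case share a section, for one thing), so take this as the shape of the idea rather than a reimplementation.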

    Now if in SQL Server you use the method I talked about in Making SQL Server index usage a bit more deterministic, which is essentially the only way around the problem of supporting two different methods of indexing, you are STILL required to have separate indexes, even if one of the indexes is a literal and complete subset of another (i.e. an index that ignores case and diacritics vs. one that ignores neither).

    When there is a clear way to use the same index value and not create the huge space burden of mainly duplicated indexes via a non-intuitive and not very well documented syntax with no good user interface support?

    Ignoring the multi-language issue for a moment (since those indexes are usually not subsets), providing an intuitive way to support the different views of the data for identity/subset vs. ordering operations via a single index is actually more work to decide how to expose the feature to customers than it would be to provide the technical solution....
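Here is a toy version of the "one stored value, two views" idea. It keeps only the full sort key per row, orders by that key, and derives the looser identity key on the fly by emptying the diacritic and case sections. This is purely an illustration under my own assumptions (the loose_key helper and the tiny row table are hypothetical) and it is not how SQL Server or any other shipping product builds its indexes.

```python
def loose_key(full_key):
    """Derive an 'ignore diacritics, case, and width' key from a full sort key."""
    parts = full_key[:-1].split(b"\x01")
    parts[1] = b""  # diacritic weights
    parts[2] = b""  # case weights (which also carry width)
    return b"\x01".join(parts) + b"\x00"


# One stored sort key per row, labeled by the code point it came from.
ROWS = {
    "U+ff21": bytes.fromhex("0e 02 01 01 13 01 01 00"),
    "U+ff41": bytes.fromhex("0e 02 01 01 03 01 01 00"),
    "U+0041": bytes.fromhex("0e 02 01 01 12 01 01 00"),
}

# Ordering view: compare the full keys so every distinction breaks ties.
ordering = sorted(ROWS, key=ROWS.get)

# Identity view: rows whose loose keys match are treated as "the same".
groups = {}
for label, key in ROWS.items():
    groups.setdefault(loose_key(key), []).append(label)

print(ordering)
print(groups)
```

All three rows land in one identity group, yet they still come back in a stable order, and nothing had to be stored twice.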

    Hmmm.... some features to think about for future versions, huh? :-)

    On the other hand, they are still at least half a hundred collations behind Windows on the language side with some terrible defaults in the Unicode collation side, let alone providing proper results for scenarios that are only well-understood by regular readers of collation/case posts in this blog. So I don't think we're talking about next version here.

    But one day, ignorance of collation can perhaps be cured in products....

     

    This post brought to you by 8 (U+0038, a.k.a. DIGIT EIGHT)

  • Sorting it all Out

    A&P of Sort Keys, part 7 (aka You're very thin now, but I can still recognize you)

    • 6 Comments

    Previous posts in this series:

    Here at Microsoft, there are a whole bunch of people who do the 20/20 program to help them lose a lot of weight. And a lot of people finish up the program a lot thinner than when they started. Of course even if you become tons thinner you are still the same person (it is almost like one of those Comcast phone service commercials everyone hates!), and everyone can still recognize you. It is to all the people I know who have gone through that program that this post in the series is dedicated....

    Prior to Unicode, the concept of the width of a character was, at least in the sense of the amount of space needed to store the character, pretty fundamental -- wide characters took literally twice as much space to store as their narrower counterparts (2 bytes vs. 1 byte).

    Now over time the difference also tended to show visual distinction as well, thus

    ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺ

    would be about twice as wide (if the same font was used) as 

    ABCDEFGHIJKLMNOPQRSTUVWXYZ

    Though of course the magic of font linking can make the two look less distinct depending on your browser and OS language settings. :-)

    Now although the two sets of characters are far apart in Unicode, no one ever expected anyone to not sort them together -- there is something fundamentally A-like1 about both Ａ and A, and there are very good reasons to expect both of them to sort before B.

    But obviously they are not completely equivalent to each other.

    And thus the notion of character width was brought to sorting -- it used to affect (but no longer affects) the storage size, it still affects fonts if you use the same font, and it affects collation in regards to the "width" piece of the sort weight.

    Our sample strings are just going to be plain old ASCII compared to the fullwidth versions of the same characters (future posts will get into Korean, Japanese, and Chinese examples that also deal with width, never fear!).

    You can choose to ignore width differences via the NORM_IGNOREWIDTH flag, which literally just removes the bit on the sort key that indicates the character is full width. This has the additional benefit of shrinking sort key size if wide characters are there.

    Also, generally speaking, in some cases but not all we tend to move the full width versions of characters when the half-width versions are moved -- as the examples below will show. There is some debate as to whether what we do here is in fact correct, since the main difference between them in the eyes of most people is display, and thus they are either not used in a language or are used in some way that the correct sorting behavior would be expected.

    I tend to believe that what we do is incorrect any time we don't move them (in other words I believe moving is the correct thing to do), but no one is really complaining loudly enough at this point to make the change worthwhile. It is a similar point to whether one should move every letter with a particular base if one moves the base, and I am inclined to think we ought to, consistently.

    I'll show examples where it gets weird just so you see what I am talking about....

    And here are some samples:

    U+ff21  Ａ  0e 02 01 01 13 01 01 00
    U+ff41  ａ  0e 02 01 01 03 01 01 00
    U+ff21  Ａ  0e 02 01 01 12 01 01 00   (w/NORM_IGNOREWIDTH)
    U+ff21  Ａ  0e 02 01 01 03 01 01 00   (w/NORM_IGNORECASE)
    U+0041  A   0e 02 01 01 12 01 01 00

    Now a few things are immediately obvious -- like that WIDTH is stored in the CASE weight, but NORM_IGNORECASE has no effect on it. And also that it just adds 01 to the case weight any place one has a full width character.
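The sample keys suggest the case weight is acting like a small bit field, which a few lines of code can check. The two bit values below are inferred only from the samples above (13 vs. 12 for width, 13 vs. 03 for case); they are my guess at the layout rather than anything documented, and the sketch makes no attempt to model what happens to weights once they drop to the minimum value.

```python
# Inferred from the samples, not from documentation.
FULLWIDTH_BIT = 0x01  # 13 with the bit, 12 without
UPPERCASE_BIT = 0x10  # 13 with the bit, 03 without


def clear_case_bit(key, bit):
    """Return the key with `bit` cleared in every weight of the case section."""
    parts = key[:-1].split(b"\x01")
    parts[2] = bytes(weight & ~bit for weight in parts[2])  # third section = case
    return b"\x01".join(parts) + b"\x00"


fullwidth_upper_a = bytes.fromhex("0e 02 01 01 13 01 01 00")  # U+ff21

# Mimics what NORM_IGNOREWIDTH did to this key in the samples.
print(clear_case_bit(fullwidth_upper_a, FULLWIDTH_BIT).hex(" "))
# Mimics what NORM_IGNORECASE did to this key in the samples.
print(clear_case_bit(fullwidth_upper_a, UPPERCASE_BIT).hex(" "))
```

Clearing one bit leaves the other alone, which lines up with the observation that NORM_IGNORECASE does not touch the width information.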

    How about with different languages? Well here is a Danish example:

    en-US U+ff21 U+ff21  ＡＡ  0e 02 0e 02 01 01 13 13 01 01 00
    en-US U+0041 U+0041  AA    0e 02 0e 02 01 01 12 12 01 01 00
    da-DK U+ff21 U+ff21  ＡＡ  0e 02 0e 02 01 01 13 13 01 01 00
    da-DK U+0041 U+0041  AA    0e b1 01 03 01 1a 01 01 00

    And here is a Lithuanian example:

    en-US U+ff29  Ｉ  0e 32 01 01 13 01 01 00
    en-US U+0049  I   0e 32 01 01 12 01 01 00
    en-US U+ff38  Ｘ  0e a6 01 01 13 01 01 00
    en-US U+0058  X   0e a6 01 01 12 01 01 00
    en-US U+ff39  Ｙ  0e a7 01 01 13 01 01 00
    en-US U+0059  Y   0e a7 01 01 12 01 01 00
    lt-LT U+ff29  Ｉ  0e 32 01 01 13 01 01 00
    lt-LT U+0049  I   0e 32 01 01 12 01 01 00
    lt-LT U+ff38  Ｘ  0e a6 01 01 13 01 01 00
    lt-LT U+0058  X   0e a6 01 01 12 01 01 00
    lt-LT U+ff39  Ｙ  0e 33 01 01 13 01 01 00
    lt-LT U+0059  Y   0e 33 01 01 12 01 01 00

    See how in the case of Lithuanian the Y-like characters were moved including the fullwidth ones, while for Danish they were not?
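If you want to see the move without squinting at hex, you can just sort by the key bytes. The sketch below hard-codes the sort keys from the Lithuanian table above and compares them as plain byte strings, which is how sort keys are meant to be compared; the dictionary layout and the labels are only for illustration.

```python
# Sort keys copied from the table above, keyed by locale and then by a label.
KEYS = {
    "en-US": {
        "fullwidth I": "0e 32 01 01 13 01 01 00",
        "I": "0e 32 01 01 12 01 01 00",
        "fullwidth X": "0e a6 01 01 13 01 01 00",
        "X": "0e a6 01 01 12 01 01 00",
        "fullwidth Y": "0e a7 01 01 13 01 01 00",
        "Y": "0e a7 01 01 12 01 01 00",
    },
    "lt-LT": {
        "fullwidth I": "0e 32 01 01 13 01 01 00",
        "I": "0e 32 01 01 12 01 01 00",
        "fullwidth X": "0e a6 01 01 13 01 01 00",
        "X": "0e a6 01 01 12 01 01 00",
        "fullwidth Y": "0e 33 01 01 13 01 01 00",
        "Y": "0e 33 01 01 12 01 01 00",
    },
}


def sorted_letters(locale):
    """Order the labels for a locale by comparing their sort key bytes."""
    table = KEYS[locale]
    return sorted(table, key=lambda label: bytes.fromhex(table[label]))


for locale in KEYS:
    print(locale, sorted_letters(locale))
```

Under lt-LT both forms of Y land right after both forms of I, while under en-US they stay at the end after X.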

    Now I know this is part of a bigger philosophical issue of what to do with letters that are generally not used in a language but which look a bit or more than a bit like ones that are. In general, whether it comes to width, diacritic, or alternate case forms we have no consistent story -- some we move and some we do not.

    Is it a bug? Well, maybe not, but it seems like it ought to be, since we kind of halfway do it. I doubt that the fullwidth Y moving was done at the behest of the Lithuanian Microsoft subsidiary. :-)

     

    1 - I have been encouraged to stop using the term A-ness in public to avoid what I will henceforth refer to as the 'Beavis Effect', preferring instead the term A-like.

     

    This post brought to you by 6 and 7 (U+0036 and U+0037, a.k.a. DIGIT SIX and DIGIT SEVEN)
