
July, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

    Why do the high surrogates have the low numbers?

    • 3 Comments

    This is a question with a true 'drive on a parkway, but park on a driveway' feel to it, but one that I have been asked by many people.

    If you look at the surrogate range and its definition in Section 3.8 of the Unicode Standard:

    3.8 Surrogates

    D25 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.

    D25a High-surrogate code unit: A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair.

    D26 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.

    D26a Low-surrogate code unit: A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair.

    • High-surrogate and low-surrogate code points are designated only for that use.

    you may not find the conformance definitions to be too terribly useful here -- they confirm what we already knew. So what is the story?

    Well, a lot of it has to do with the way human beings try to equate what we understand to what we do not.

    I remember trying to explain to someone about our collation weighting system, and the way we gave the items that come earlier 'less weight' so that they come first. He was confused because he was thinking about it like an indicator that went from 0 to 100 -- the items you wanted to have first would thus be 'heavier' so they would sink to the bottom of the list.

    Now this person was not 'wrong' conceptually, it was just that his model did not match ours. :-)

    So it is with the high and low surrogates. The high ones, which come first any time you have a legal surrogate pair, are assigned first. Since they are assigned earlier in the range of possible code points, they have lower numbers (0xd800 is a lower number than 0xdc00 in any computer language I have ever heard of), but no one was really thinking about the low/high surrogate thing in terms of code point values, they were thinking of the 'high that comes first' instead.
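
    The arithmetic behind this can be sketched in a few lines. This is only an illustration of the standard UTF-16 encoding rule; the function name to_surrogate_pair is made up for the example. The 'high' surrogate carries the high-order bits of the supplementary code point, which is where the name comes from, even though its own numeric value is the lower of the two.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 surrogate pair.

    Hypothetical helper for illustration; it follows the standard UTF-16 rule.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # top 10 bits -> leading code unit
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> trailing code unit
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # the high surrogate is the smaller number
```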

    In case you are still rebelling against the conceptual disconnect, keep in mind that people say "WE'RE #1" to mean that they have a higher ranking than #2 and #3 and so on, despite the fact that the numbers are lower. That may help people to see that we each have our own assumptions about ranking and ordering that do not always use the same model....


    Why my syndication links were broken....

    • 5 Comments

    The other day I complained about how My syndication links are broken.

    We can blame it all on surrogates! :-)

    You see, when I complained that There is no such thing as a surrogate character (dammit!), the post was sponsored by the first high surrogate, U+d800. And I made the mistake of putting that lone surrogate in quotes as the sponsor.

    Community Server was using XML functionality that assumed standards-conformant text -- which of course unpaired surrogates are not. So it was choking on the invalid text.
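
    The same kind of failure is easy to reproduce outside of Community Server. The sketch below uses Python rather than the XML stack the site actually used (an assumption made purely for illustration), and is_well_formed is a hypothetical helper name. A strict UTF-8 encoder refuses a lone surrogate, which is roughly what a conformant serializer has to do.

```python
def is_well_formed(text):
    """Return True if text can be encoded as UTF-8.

    A Python str holds code points, so any surrogate code point in it is
    by definition unpaired, and the strict encoder rejects it.
    """
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


sponsor = "\ud800"                      # the lone high surrogate from the post
print(is_well_formed(sponsor))          # the encoder rejects this
print(is_well_formed(chr(0x10000)))     # a real supplementary character is fine
```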

    Earlier, when I was saying that blogs.msdn.com was broken, too? Well, my post was in there too - until it scrolled off of the most recent 25. That is also why it came back up so quickly.

    The moral of the story? Don't use unpaired surrogate code points. My site was down to RSS aggregators for days!

    (the Community Server folks will also try to handle this case a bit more gracefully in future versions, too!)


    What's wrong with the SortKey class?

    • 0 Comments

    You may be wondering what I am talking about here....

    Well, it is about the SortKey class in the .NET Framework, which you get to by calling the GetSortKey method of the CompareInfo class.

    The class contains everything you need -- the source string and the binary sort key. So what could be missing?

    Well, for starters the flags used to create the instance. The original string is available, as is the final result, but there is no way to know how you move between the two. Now I am not saying that this is necessary, but it is a little odd given that an object happens to be there.

    In practice, it is hard to know when you would ever need the string -- wouldn't you know what the string is, and thus not need to have another copy of it stored? It seems kind of wasteful to me....

    Now there is a Compare method that you can use to compare two sort keys, and that is great. But that means if you want to use the functionality to create sort keys to build a database index, you are on your own unless you either (a) serialize the class and thus duplicate the string and the indexing, or (b) take care of the binary comparison yourself. I do not mind (b) and that is easy to do, but wouldn't it be better to have a static method to do the work for me? I don't see the serialization here to be a realistic implementation option.

    So, these are things that are (in my opinion) wrong with the SortKey class. The real people using it pretty much need to grab out the sort key and then throw it away, since it does not really provide anything else that is needed. I'd be happy if there were just a comparison method available for the byte arrays that make up the sort key data....
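
    Option (b) really is small. Here is a minimal sketch in Python (not the .NET API), assuming the sort key bytes have already been pulled out of the object; compare_sort_keys is a hypothetical name and the byte values are made up for the example, not real sort key data.

```python
def compare_sort_keys(key_a, key_b):
    """Compare two raw sort keys byte by byte, as unsigned values.

    Returns -1, 0, or 1. Python compares bytes objects lexicographically
    by unsigned byte value, with a shorter prefix sorting first.
    """
    return (key_a > key_b) - (key_a < key_b)


# Made-up byte strings standing in for stored sort keys in an index.
index = {
    b"\x0e\x02\x0e\x21\x01\x01": "row 1",
    b"\x0e\x02\x0e\x0a\x01\x01": "row 2",
    b"\x0e\x02\x01\x01": "row 3",
}
for key in sorted(index):
    print(key.hex(), index[key])
```

    Because the comparison is purely binary, the original string never has to be stored alongside the key.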


    New in Vista Beta 1: FindNLSString (an 'internationalized' strstr)

    • 8 Comments

    This is an example of the kind of features that we in NLS can add to a product -- not as fancy as transparency and other cool Vista stuff that gets all of the press coverage. But there is a certain class of people, a class with a big overlap with those who read this blog, who may find it to be quite interesting. I am not going to leak anything that is not available in the legitimate beta that may be in your hands right now (or could be some time soon!), so don't get too excited. But there are some very cool features that are going into Windows Vista that may be fun for geeks like me, so consider this the first of many such notices. :-)

    The strstr function has been a part of the C Runtime for ages. Its simple job is explained in the docs: "Returns a pointer to the first occurrence of a search string in a string."

    But of course that function (or its Unicode cousin, wcsstr) would never do any of the interesting fun things that CompareString is so famous for, from ligature equivalences (U+00e6 æ being equal to the letters ae for most locales) to Unicode canonical equivalences and more.

    So for Vista we have added an NLS version of this long-existing functionality -- the FindNLSString function!

    The Vista Beta 1 SDK will be available soon, so consider this a marketing preview of the new function. :-)

    If you are a developer who has already picked up Beta 1 of Vista off of the MSDN servers, this function is exported from kernel32.dll and gives you all of the functionality of the managed methods off of CompareInfo (i.e. IsPrefix, IsSuffix, IndexOf, and LastIndexOf).

    The new FindNLSString has one extra bit of functionality that neither wcsstr nor those managed methods have ever had before -- an OUT param that will allow the caller to find out the length of the string that was found (which may not be the same size as the search string!). Now if you think about what the FindNLSString function may be used for (a good example is someone using the ReplaceText common dialog to replace one string with another), what better way to mess up an operation than to not know the length of the string that was actually found? I mean, it is all well and good for the Unicode standard to say that U+00e5 (LATIN SMALL LETTER A WITH RING ABOVE) is canonically equivalent to U+0061 U+030a (LATIN SMALL LETTER A + COMBINING RING ABOVE), but if your replace operation starts improperly detecting the subset then it will not be a very effective replace operation, now will it? :-)
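
    To see why the found length matters, here is a rough Python sketch. It is not FindNLSString and does nothing locale-aware; it only handles Unicode canonical equivalence by comparing NFD forms, and the name find_canonical is made up for the example. A real implementation would also have to avoid stopping in the middle of a combining sequence, which this naive version does not attempt.

```python
import unicodedata


def find_canonical(haystack, needle):
    """Find the first substring canonically equivalent to needle.

    Returns (index, length_found), or (-1, 0) if there is no match.
    Naive quadratic search, for illustration only.
    """
    target = unicodedata.normalize("NFD", needle)
    for start in range(len(haystack)):
        for end in range(start + 1, len(haystack) + 1):
            candidate = unicodedata.normalize("NFD", haystack[start:end])
            if candidate == target:
                return start, end - start
            if len(candidate) > len(target):
                break  # extending the slice can only make it longer
    return -1, 0


text = "sm\u00e5 bitar"                       # contains precomposed U+00E5
index, length = find_canonical(text, "a\u030a")  # search string is two code points
print(index, length)                          # the match found is one code point long
replaced = text[:index] + "a" + text[index + length:]
print(replaced)
```

    Using the length of the search string (2) instead of the found length (1) in that last step would eat the space after the match, which is exactly the broken replace described above.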

    Now one feature that has not been added is that there are no separate 'A' and 'W' functions -- there is just one Unicode version, without decoration. The trend that started in Windows Server 2003 with IsNLSDefinedString to only add Unicode versions of functions clearly looks to be the way things will be going forward for NLS. If you are not using Unicode, then you will want to realize that you are not going to see some of the features coming out in products.

    One obvious question.... why not just call the function FindString to go along with CompareString, LCMapString, FoldString, and so on?

    Well, I did try to do that, and managed to break our private build with the change since there were so many cases of internal functions in components and utilities and Platform SDK samples named FindString. Maybe if we had reserved the name 15 years ago, we'd be all set. But even if I changed all of those cases, it is obviously something that would be a problem for users as well. Anything that is in our source code once is in customers' code hundreds of times, and I don't even want to think about how many times it would be in customers' code. Calling it FindNLSString keeps that overlap from being a problem....

     

    This post brought to you by "å" (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)
    A letter that is anxiously awaiting Vista Beta 1 so that all of its different normalization forms can finally be considered equal!


    Microsoft Company Picnic

    • 0 Comments

    Yes, I went to the company picnic yesterday, scooter and all....

    It was an interesting time, though it does seem to be especially made for kids (and parents with kids). There were many events there that I was not going to try but Teresa thought she would give a shot. I even took some pictures for you to enjoy if you like that sort of thing. I took some liberties with temporal issues for the sake of the narration, hopefully I can be forgiven that for the sake of the story....

    (Note to self: remember the spare battery next time, doofus!)


    That 'international' aimeemann.com site I was talking about

    • 1 Comments

    A while back I mentioned how I was once involved a little bit with Aimee Mann's web site. I mentioned it to some people at work who did not even believe me until it finally went live -- when I showed them the local version I had running they thought I just needed to get a life. Which was true, but the work was still for the official site. :-)

    You can even see a credit that then-webmaster Russ Nordmeyer put in to the links page (see the end of the Links page on the old site for the proof that this happened). Now this was back in late 2001, and as soon as the at-the-time new album (Lost in Space) was put out, a new site went up, and it was Flash based, so these three minutes in my 15 minutes of fame came to an end. And it was a real end -- although the credit still exists, the actual site was archived and moved off of a Microsoft server and over to one that did not understand ASP.

    Ah well, there was not too much there, just localized versions of a list of tour dates and a bio page that were both translated into several other languages (Matt would have loved it, since the links were region-based!). It was fun to do and fun to work with both professional localizers and fans on what proved to be a fairly challenging localization project, given the nature of the biography.

    (Interestingly the Internet Archive does understand ASP, so some of the site has been "revived" there -- you can see a few of the languages in their original form on the tour dates and bio page -- though later folks did inject some English on that tour dates page!)

    And the European tour dates were postponed until late 2002 after all that happened on September 11th, 2001, which of course made some of the project's urgency and interest get lost (since the upcoming European tour seemed like a great reason to get many different languages posted).

    Now that the site is three years archived, it occurred to a few people that the whole thing might make an interesting comparative localization project. :-)

    Now this particular localization was more difficult than usual for a software project for several reasons:

    • There were not a huge number of localizers who were Aimee Mann fans;
    • There were not a huge number of Aimee Mann fans who understood the principles that guide good localization;
    • Merely speaking a language well enough to localize professional software was not enough -- one would have to also have a good command of both the source and target languages to not just know what words to use in the target language but to be able to understand all of the slang, idiom, and allusions of the original source;
    • The project was an all-volunteer effort -- there was no money being offered at all (this was not as much of a problem as you might guess);
    • Fans have a hard time being corrected, even when other fans are certain that there are errors in translation.

    But it was nevertheless a fun project for many different people, and it was live for a little while, and excited fans were asking to add their own translations right up until the day that the new site was put up. Anyway, here are the languages that various volunteers submitted and/or commented on and/or edited over those months (items in bold are countries that were a part of both the earlier planned European tour and the one that happened the next year).

    United States      Germany      Japan      Greece      Sweden      France      Portugal      Georgia      Thailand      Great Britain

    If you speak any of these languages, enjoy the issues in the above bullet list! :-)

     

    This post brought to you by "€" (U+20ac, EURO SIGN)


    My syndication links are broken

    • 6 Comments

    I am posting this so that maybe people will no longer feel like they need to contact me to tell me that the RSS 2.0 and Atom 0.3 links are out of order. I know they are.

    I have actually been looking at my stats for the first time -- it is simultaneously fascinating and infuriating to realize how many people view the site from aggregators (the former because the stats are cool and the latter because there are clearly a whole bunch of people who can't see the site from the aggregator and are not bothering to look further).

    None of the category RSS links seem to be affected.

    But http://blogs.msdn.com/ has the same problem with its syndication link. And maybe some other people do, too. Perhaps this will be fixed soon....

    You do not need to either send email or contact me via the site. :-)


    Disagreeing with minimsft again

    • 0 Comments

    Well, our old pal minimsft has proven once again that not everyone at Microsoft will agree with what he or she has said.

    But I guess I am used to that. We all are at Microsoft (well those who read that blog from time to time).

    So what am I disagreeing with? Well, mainly his Great! Amazing! Innovate! Huge! post that went live a little while ago....

    It is convenient, when one wishes to place Microsoft in the role of "evil demon", to think of it as just one single company.

    And at a very high level, I suppose we are. But at that very high level, there is only so much that can be said, really.

    Example -- when the .NET craze first started (back in the middle of 2000), there was a company-transforming vision about this new thing called .NET. There was a lot of excitement, somewhat tempered by people like Joel Spolsky, who were smart enough to see back in July of 2000 (didn't I say middle of 2000?) that it did not say very much.

    And do you want to know what? Joel is right, it didn't. Duh?

    It is really not until the message moves down into levels below Steve Ballmer that it says more. Because the odds of having details that can excite 60,000 people are just about nil -- our jobs (those who have conventional jobs!), our products (those who have conventional products!), our lives (those who have conventional lives!), are simply way too different. About the only thing I have in common with someone who is in marketing for Microsoft Works or user education for Office or an operator for the main phone line in Redmond or development for ActiMates Barney is a blue badge, and I doubt I even have that in common with a subsidiary program manager who works in Microsoft Singapore or someone in PSS (ISS?) for Microsoft Thailand. So how can Steve Ballmer as the CEO send out an email that is less than 1000 printed pages that has relevant detail for every single person, every single product in Microsoft?

    He can't, obviously.

    He is the freaking CEO of a company that puts out a hugely diverse line of products and services, for crying out loud!

    The only way he can talk to everyone is to talk in generalities, and try to get people excited about the broad goals that affect the whole company and everyone in the company. It then becomes the job of the vice presidents and others below him to take that message and apply it to more specific areas -- the products, the services, and so on. So (to take that middle of 2000 .NET example) they can talk about what .NET will mean in Works, or Office, or Thailand, or the operators who will be taking calls from customers about .NET.

    Or even Technical Leads reporting to Development Leads in Globalization Services, who report to Development Managers in GIFT (Globalization Infrastructure, Fonts, and Tools), a big part of GPTS (Global Platform Technologies and Services) headed by Lori Brownell, which is a small but important part of COSD (Windows Core Operating System Division) headed by Brian Valentine, a crucial piece in the Platforms Group headed by Jim Allchin, which is under Steve Ballmer.

    (I included names and links for the people I could find public links for, and added adjectives to describe relative importance as I see it; I am sure other groups will see it differently!)

    Now of all of these people, who do you expect to have the most relevant details to apply the plan of what Microsoft's big vision has to do with me personally? Steve? Or someone down in the chain who maybe even knows my first name and what I do when I am working?

    Every cynical person who is smart enough to recognize this (like me) takes what Steve Ballmer says with a grain of salt, perhaps guesses at what it may mean for him or her, and waits for someone to fill in the spaces.

    Every cynical person who is not smart enough to recognize this (like Mini-Microsoft, though I suspect he or she may actually be smart enough -- but likes to complain!) complains about how the emperor has no clothes for them, not realizing (or not admitting) that it is not the emperor who fits their clothes -- it is the local tailor.

    Think about it....


    There is no such thing as a surrogate character (dammit!)

    • 8 Comments

    The title of this post, including the parenthetical note, is something that people associated with the Unicode Standard have to tell people all the time (of course generally people only say that parenthetical note to themselves, and really only because they have to say it so many times!).

    The issue is clear in both the Unicode Glossary:

    Surrogate Character. A misnomer. It would be an encoded character having a surrogate code point, which is impossible. Do not use this term.

    and the Unicode FAQ:

    Q: Are surrogate characters the same as supplementary characters?

    A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

    There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point).

    In fact, if you look to the Unicode Roadmap, each plane has its own name:

    • Plane 0: BMP (Basic Multilingual Plane)
    • Plane 1: SMP (Supplementary Multilingual Plane)
    • Plane 2: SIP (Supplementary Ideographic Plane)
    • Plane 14: SSP (Supplementary Special-Use Plane)

    They are supplementary characters, one and all. They are not surrogate characters. Truly.
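
    The distinction is mechanical enough to write down. The following Python sketch is only an illustration (classify and plane are made-up helper names); the ranges are the ones quoted from the FAQ above, and the plane number is simply the code point shifted right by 16 bits.

```python
import unicodedata


def classify(code_point):
    """Say which kind of code point this is, using the ranges from the FAQ."""
    if 0xD800 <= code_point <= 0xDFFF:
        return "surrogate code point"
    if 0x10000 <= code_point <= 0x10FFFF:
        return "supplementary code point"
    if 0 <= code_point <= 0xFFFF:
        return "BMP code point"
    raise ValueError("not a Unicode code point")


def plane(code_point):
    """Return the plane number (0 = BMP, 1 = SMP, 2 = SIP, 14 = SSP, ...)."""
    return code_point >> 16


for cp in (0x00E5, 0xD800, 0x1D11E, 0x20000, 0xE0001):
    print(f"U+{cp:04X}: {classify(cp)}, plane {plane(cp)}")

# General category Cs marks surrogate code points, not characters.
print(unicodedata.category(chr(0xD800)))
```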

    This is easy, right?

    Of course even the clearest intention will not always find itself communicated properly, which is why the Char.IsSurrogate method will have text like "Indicates whether a Unicode character is categorized as a surrogate character" or when the Windows CE docs say "For sorting, all surrogate pairs are treated as two Unicode code points. Surrogates are sorted after other Unicode code points, but before the PUA (private user area). Sorting for a standalone surrogate character (that is, either the high or low character is missing) is not supported.". I do mind the not-entirely-accurate statement about the collation, but I will talk about that another day!

    I do not mind the surrogate character usage like that in the previous paragraph so much, as it is a more benign error -- when people say surrogate character in this context, they mean to say surrogate code point. Harmless error and it even shows up as a NULL glyph as if it were a character of some sort, and we can just fix the documentation language at some point (hopefully soon, but I will not lose sleep if they do not).

    The real problem case is when they try to equate the term surrogate character with the term surrogate pair. They compound it by naming the method that way, like the XmlWriter.WriteSurrogateCharEntity method, whose docs, in addition to the evil method name, say things like:

    When overridden in a derived class, generates and writes the surrogate character entity for the surrogate character pair.

    This is a bit harder to fix (not the doc. portion, but the method name, which obviously cannot be removed).

    But we'll figure something out. Eventually.

    Until then, please remember what the title of this post is telling you -- there is no such thing as a surrogate character!

     

    This post brought to you by U+D800, the first surrogate code point -- not a surrogate character!
    (This code point has come to terms with his lack of character-ness, but has mentioned that the fact that no one else has may put him into therapy)


    Windows Vista Beta 1 is released!

    • 0 Comments

    Windows Vista Beta 1 is available from MSDN Subscriber downloads

    Also check out the nice looking Windows Vista developer centre on MSDN:

     http://msdn.microsoft.com/windowsvista/about/

    I will be posting about Longhorn NLS features that are available in Beta 1 over the next while -- some pretty exciting stuff, all things considered!

    You can also hear Chris Jones talking about Windows Vista Beta 1....


    Aimee Mann is on David Letterman tonight

    • 0 Comments

    Just thought I would mention this so people could set their Replay TVs or their Tivos appropriately!

    Some usually unreliable sources have told me that they still are not sure which new song they want to play from The Forgotten Arm (the new CD, which I mentioned previously). I would personally vote for King of the Jailhouse but they will probably go for something more upbeat musically and/or thematically (plus the band seldom calls me to ask what they ought to play on nationalized television appearances, believe it or not!).

    She is also doing the whole Pacific Northwest, and I will probably be at 50-75% of these shows:

    Fri Aug 5 '05      Vancouver, BC     Commodore Ballroom
    Sat Aug 6 '05   Seattle, WA    South Lake Union Park
    Sun Aug 7 '05   Portland, OR  The Aladin
    Tue Aug 9 '05   Eugene, OR  McDonald Theater

    I think there are still some tickets left for Seattle, and it is general admission so all seats can be good ones. There is even a chance you will see me there if I can get my scooter on the grass! :-)


    Not everyone likes Unicode

    • 4 Comments

    It is true -- not everyone likes Unicode. This includes a guy by the handle tyomitch who was trying to post a long comment to a post here and hit some kind of length limitation in Community Server. I did not want to appear unwilling to post negative comments about Unicode so I'll put it here. :-)

    Can't post this comment to http://blogs.msdn.com/michkap/archive/2005/05/21/420666.aspx, but maybe you care anyway...

    When I first came across OCS, I was stumbled by the very limited support for it in Unicode. Then I went discussing it with a person whose job is typesetting books in OCS, and he told me that in his opinion, noone at all in the whole Unicode group cares about _usable_ OCS support, since those OCS users aren't a business target group who can 'sponsor' necessary procedures for registering characters. Sadly, it looks like he's right: what Unicode group has got to this point is some 'pro forma' support, just so they can't say "we're not supporting OCS characters".

    Actually, among the people who make the real OCS typesetting software (that is, fonts, on-screen keyboards, Word macros etc.) noone uses Unicode (at least I haven't found anyone). They all use some custom 8-bit charsets, which don't even agree on what is a character and what isn't, complicating translation from one encoding to others even further. That is, there are dozens of totally incompatible OCS charsets used, and there is Unicode, which noone even considers usable. Does this satisfy that Unicode group?

    Now, to be done with the preamble :-) The person mentioned in the second sentence agreed that UK has five variant shapes: that is, small/capital ligatures, and small/title/capital digraph. Here are the reasons which I can come up with _against_ considering the ligature suitable for U+0478/U+0479:

    1) the digraph isn't two letters stacked together, because the first part is OCS letter ON (identical to U+041E/U+043E) and second part isn't a valid OCS letter at all. The second part is just a glyph that has no meaning outside the context of this digraph. Separating the digraph into two characters is as crazy as would be separating U+2116 "Numero Sign" into combination of U+004E and some character for underlined O on the only base that U+2116 looks like two letters together.

    So, if digraph is to be included in Unicode, it can only be represented with a single character.

    2) as based on previous Unicode revisions, those fonts that claim to support Unicode OCS actually have the digraph located at U+0478/U+0479. Those fonts include even the stock "Microsoft Sans Serif" of WinXP. Changing the character assignment would break all the existing documents which happen to use this letter (assuming there are any). Also, the Unicode fonts that have 'old-style' glyphs have the ligature at U+0423/U+0443, and the titlecase digraph somewhere in the unused slots of U+0400 block.

    3) as already stated in the comments of your post, those shapes are position-based variant forms and not separate letters. Maybe they don't even deserve any more than three characters (for the three cases)? Unlike the 'final sigma' case, now we don't have a legacy charset to keep compatibility with, and compatibility with previous Unicode revisions has already been broken. Maybe it's worth going to the very end and _removing_ the characters U+0478/U+0479?

    What I personally would be completely happy with is leaving the two digraph cases at U+0478/U+0479 where they are, allocating a character for the titlecase, and making fonts responsible for displaying U+0423/U+0443 as either the old-style ligature or the modern Y-shaped letter. Is this too simple? Why should there be a (based solely on glyph shape) identity between Cyrillic Letter U and (not ever used alone) second part of the digraph of Cyrillic Letter UK?

    Not that I'm expecting a detailed reply, but now I've expressed my opinion... Can this message be at least appended to the comments at http://blogs.msdn.com/michkap/archive/2005/05/21/420666.aspx?

    The post is correct about the need to not change fonts around since it would change the way documents have been written. And I am in cases like this pointing to the real dark underbelly of Unicode. But it is very likely (bordering on almost certainly) true that if the Old Church Slavonic experts are not using Unicode then they are piling up future problems for themselves. If there are missing characters they should be added, and there is certainly no desire on the part of Unicode to not support a plain text requirement....

    I would be interested in knowing what communications were rebuffed or glossed over -- and who was doing the apparent glossing. It is certainly not any kind of official or unofficial Unicode policy to do such a thing....

     

    This post brought to you by "Ѹ" (U+0478, a.k.a. CYRILLIC CAPITAL LETTER UK)


    All code page architectures are created equal

    • 10 Comments

    Yes, I said it -- all code page architectures are created equal. But in the most Orwellian sense, some are more equal than others....

    First I will digress into a favorite Ogden Nash poem of mine, which is very short. I pretty much memorized it:

    Let's talk about eggs:
    Eggs have no legs.
    Let's talk about chickens:
    Chickens do have legs.
    The plot thickens --
    eggs come from chickens!
    But they have no legs under 'em
    What a conundrum!

    Why this poem popped into my head may become apparent shortly. If not then it is still a nice poem (Ogden Nash at his finest!).

    Anyway....

    If you look at the official, sanctioned encoding architectures owned by the GIFT team, there are three of them:

    • The Win32 NLS API model, used by the unmanaged universe and which sports a very C-focused model;
    • The MLang model, used by Internet Explorer and which sports a COM-based model;
    • The .NET Framework model, used by the managed universe and which sports a managed code model.

    (there is a fourth model for Kernel mode and the Rtl* functions that can be used in both kernel and user mode, but I will cover that another day -- for my purposes here just consider it for now like Win32 but more limited!)

    If these were three entirely separate models, it all might be easier. However:

    • For MLang, many code pages call the Win32 code, occasionally in edge cases returning HRESULTs that in many cases cause exceptions to be thrown
    • There are several code pages which have bugs in edge cases in Win32 that were fixed in MLang
    • The 1.0 and 1.1 versions of the .NET Framework code are thin wrappers around the Win32 code (maybe sometimes using MLang to try and work around bugs)
    • The 2.0 managed code started over and tried to fix many of the problems in the two other models (along the way becoming smaller and dare I say a bit faster!), yet in many ways based on the original work

    Talk about conundrums -- these three models are so interrelated even though there are so many times that their behavior differs that I doubt anyone will ever be able to sort out the behavioral differences.

    It represents complex pieces of code in three code bases written across nine versions of Windows, three versions of IE, and three versions of the BCL, using unmanaged, managed, and COM based code. It is very hard to figure out what is a bug to fix, what is a bug we are stuck with for backcompat reasons, and what is an intentional feature that only looks like a bug because the behavior was not documented well enough. You can get a headache trying to figure it out sometimes (and many have!).

    So what does it all mean?

    Well, as Shawn Steele, the owner of the bulk of this complex set of code bases, likes to say, people ought to just be using Unicode. And Shawn is spot on here -- the more complex the code page work you do, the more likely you are to run into problems along the way.

    Now I do not include UTF-8 (or even UTF-32 in the .NET Framework) with the rest of those code pages, since it is a Unicode encoding form and all, but just about everything else ought to be a "use if you have to convert something, but then once it is converted stop using!" model.
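
    To make the "convert it once, then stop using the code page" idea concrete, here is a minimal sketch in Python (which is not one of the three models discussed above). The helper name to_unicode and the sample bytes are made up purely for illustration; the only real calls are Python's built-in bytes.decode and str.encode.

```python
def to_unicode(raw: bytes, code_page: str) -> str:
    """Hypothetical helper: decode legacy code-page bytes at the boundary.

    errors="strict" makes bad input fail loudly instead of being
    silently replaced, so conversion problems show up right here.
    """
    return raw.decode(code_page, errors="strict")


# Illustrative input: "café" as it would be stored in Windows-1252.
legacy = b"caf\xe9"

# Convert exactly once, where the data enters the program...
text = to_unicode(legacy, "cp1252")

# ...and from here on deal only in Unicode (UTF-8 when bytes are needed).
utf8 = text.encode("utf-8")
```

    The point of the design is simply that the code page name appears in one place; everything downstream of to_unicode never needs to know which code page the data came from.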

    But please just try to use Unicode, like the operating system and the .NET Framework prefer, and were basically designed for....

     

    This post brought to you by "ೡ" (U+0ce1, a.k.a. KANNADA LETTER VOCALIC LL)

    A subkultur iz a shprakh mit an armey un a flot

    • 7 Comments

    The title of this post is inspired by a quote from Max Weinreich, a Yiddish linguist -- A shprakh iz a dialekt mit an armey un a flot. I think it can be understood by many without knowledge of Yiddish, especially if they know German (as German-knowledgeable Cathy likes to tell me, in a lot of ways Yiddish is like 16th century German with Hebrew letters). I knew what it meant but I don't know any German at all. Basically it can be translated as "A language is a dialect with an army and a navy."

    He was speaking somewhat ironically when he said this, since obviously Yiddish has neither but nobody would presume to call it a dialect at this point.

    But it does raise an interesting question about one of the difficulties of creating locales -- what would be the location of a Yiddish locale if one were to be added? There isn't one (though I think it might be fun to call it Yiddish - Shtetl, I doubt that would get past the lawyers!). And then of course we would need a Yiddish - Shtetl (Latin) and a Yiddish - Shtetl (Hebrew) to account for the fact that both scripts are used in these times. And the question of what to do with collation is a fascinating one for the Latin script (though fairly obvious for the Hebrew script one).

    Thus my modified quote, to cover the Windows requirement for cultures and locales as they are defined -- A subkultur iz a shprakh mit an armey un a flot (a culture is a language with an army and a navy). :-)

    Or using the Hebrew script for the Yiddish phrase, something like:

    אײ סובקולטור איז אַ שפּראַך מיט אַן אַרמײ און אַ פֿלאָט

    The same problem exists for Esperanto, and really any language that crosses so many borders and lacks a specific origin location. It is just too hard to figure out how they fit into the model of locales that Microsoft ships in Windows and the .NET Framework.

    This is one of the REAL benefits to both opening it all up and getting out of the way, since the difficulties that Microsoft would run into in trying to define a specific locale should not block an individual customer or even a community of customers from defining one that they would like to use.

     

    This post brought to you by  "װ" (U+05f0, HEBREW LIGATURE YIDDISH DOUBLE VAV)

    Kristin Connell at Paragon tonight

    • 7 Comments

    I first saw Kristin Connell when she opened for Jim Boggia in the Green Room, and I even bought her CD Second Chances there since I enjoyed several of the songs she played, and she said most of them were on the CD. I had her autograph it, too -- why not? :-)

    Anyway, this talented lady is at Paragon in Queen Anne tonight, where the food is great and the entertainment is free even when it is really good, like when it is Kristin. Highly recommended!

    I never did post the story about the CDs that night. Kristin opened the Green Room show (where they have a nice scooter-friendly elevator) and afterwards I bought the CD. I had only $25 on me, but the CD was just $15 so I figured I would go to the bank tomorrow.

    But then Jim got onstage and did a great show (he even closed with that hilarious Prince imitation I had heard about but never seen). I wanted to ask him to do Mr. Harris (an Aimee Mann song that he once got up on stage with Aimee to do, the night after I had to leave town), but I lost my nerve and did not ask while he was onstage. After the show I started to tell him this and it turns out he remembered the drive out to the show and Mr. Harris the next night and everything (he even remembered me which amazed me even though he did not remember my name, it was still very cool!).

    I decided to buy his new CD (Safe in Sound), too -- even though his management had sent me one already (I did some stuff with flyers for them), I wanted to give one to a coworker. As a rule I like to make sure more money goes to the artists than to the store, so this seemed perfect. But then I remembered that I only had $10 on me. :-(

    But then I found a 5 euro note in the wallet, which is technically worth more than $5 (exchange rate being what it is). Would they accept it? I have known cab drivers to refuse them here, not realizing what they were worth (the ignorance about some things in this country is staggering!).

    As it turns out, he had no problem taking the euro. :-)

    I kind of wanted to get rid of the euros anyway, since I am back in the US now. But I'll bet you he won't forget the guy who bought his album in Seattle with euros, even if he still doesn't remember my name. Though in fairness he remembered the name of every female in the car from the original trip. Which is probably more gentlemanly than remembering the gentlemen's names, in any case.

    Anyway, hope to see you at the show tonight. I promise to give more notice about this sort of thing in the future (nobody reads this blog on the weekends, right? <grin>).
