Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
Let me start by saying that MVPs are awesome. I have been talking to many of them here at the Summit and will be talking to many more. They are great contacts and they really are an invaluable help to customers of Microsoft products.
Now even MVPs know of other experts who are not MVPs. And there is plenty of mutual respect there, too.
Obviously some of those people would like to become MVPs (I actually can name three different people I know who are hoping they will get the nod at some point). And just as obviously there are others who would prefer to stay independent of Microsoft and not feel like they have to be as careful about what they say (which is not to say that MVPs are *that* restricted, but I think they generally recognize the prudence in not biting the hand that feeds them!).
I have talked to several of the MVP leads, and they have the same basic feelings. And none of them claim that all of the smarts are exclusive to the MVP program, by any means.
Anyway, it was recently brought to my attention that there seems to be a new policy at Wiley (the publishing company that bought most of the Wrox titles) where authors of technical books about Microsoft technologies must be Microsoft MVPs. If they are not, then they cannot author a book on MS technologies, even if it is a reprint or new edition of a book and they were authors of the original. Even if the original is selling well and the authors have been making names for themselves.
All for the privilege of putting the MVP logo on the book.
Huh?
No offense, Wiley, but such a policy strikes me as both ridiculous and ultimately unsupportable -- and several people at Microsoft have expressed the same opinion to me. Including people involved with the MVP program.
I am hoping that they will clarify this policy and explain that it is not the way things are. Or that they will reconsider this plan if it is true. It seems like a surefire way to alienate authors and potential buyers of their books.
If you are a Wiley author, feel free to say something to them about this policy. And if you are someone who buys Wiley books, then perhaps a complaint from you too will help them come to their senses. Before they succeed in scaring everyone away....
In my post About [not] writing books I said the following:
And there was one great idea I had for a book and I even pitched it to my former acquisitions editor Sharon when she started working for Hungry Minds (when it was its own company). But it was a slightly radical idea and her boss said no, and none of the editors I have talked to since then have been interested either. So perhaps it was a little too radical (or maybe just a bad idea). Perhaps I'll blog about it some day and readers here can tell me if I was on track or on crack.
The book idea was a simple one. It started one day when I realized that MSDN was several gigabytes in size, and that is even allowing for the fact that most (all?) releases trim information out to keep the size down. Telling someone to "read the manual" is unrealistic hopefulness at best and ignorant optimism at worst. There is simply way too much information out there!
Further to that, the indexing system(s) of the data, both the internal indexes and the external ones built by search engines, all work in different ways. And it is hard to know which ones to use and how best to use them. Even those internal indexes have their content built by many different people, with all of the info folded together like shuffled cards.
Clearly, the indexes have not been keeping up with the indexed.
So how can one find the information one needs?
This is where the book idea came from.
The idea was a book entitled:
RTFM: How to make sure Microsoft's help actually helps
The main title of course stands for READ THE FREAKING MANUAL (that is the PG-rated version!). The subtitle speaks for itself....
The book would work to give all the tricks to getting the information you need from the gigabytes of information in MSDN, from the smaller files that ship with VS and SQL Server and Windows and Office, down to the smallest files that are not adequately indexed at all. From the ones that are on the web to the ones that never have been and probably never will be.
It would have had assistance from several UE/UA writers/leads/managers who I had talked to, all of whom were interested in being involved with the project. We are talking about a very motivated group of people who truly want the words produced by User Education to assist and the words produced by User Assistance to educate.
There would even be info on building indexes for your own help files, and the mistakes that people make doing this....
Some of the strategies described could even perhaps be automated and used by future versions of the help compiler products!
This is the kind of book that I imagined could be like a smaller-scale version of tomes like Abbie Hoffman's Steal This Book, where people would buy it whether they were going to read it or not. Even if they did not really work on MS platforms they might buy it, just to have on their shelves -- especially since I seem to hear the term RTFM used more often in relation to UNIX and Linux than to Windows!
It just struck me as a book that might really have helped people who did read it and might really have done whatever it was supposed to for the people who bought it and never bothered to open it.
Anyway, the folks at Hungry Minds felt it was too risky of an endeavor, and others were against it because they thought it would be an MS-bashing book despite my proven track record for being very pro-Microsoft even when I am posting about flaws in MS technologies and products.
So in the end, it did not happen.
Well, what do you think? Worthwhile concept or worthless tripe? Was I on track, or on crack?
Ivan just contacted me with the following message:
Hello,
I've found a bug in a native NT API, specifically NtQueryDirectoryObject, who should I report it to?
The report a bug link on microsoft.com just gives a phone number and I don't want to pay an intercontinental phone call.
Ivan.
Of course, NtQueryDirectoryObject is not a documented function in either the SDK or the DDK, so it is hard to know how one could report a bug against it. For all one knows, there could be a comment that describes the usage limitations that could cover the very scenario causing a bug, right? :-)
But you can describe the bug right here in a comment, and I can forward the link to someone. Genuine bugs are obviously always a concern, though I would always suggest trying to avoid calling these functions directly....
Warning: this post picks on Scott Hanselman a bit. But it is only in good fun because I happen to think he is an awesome developer/RD/MVP even if he messes up the occasional string comparison. Note his awesome formulation of how to do string comparisons appropriately in that link: Scott's Rule Number 0x5F: Think about your string compares and their context. Make sure you've expressed your true intent correctly.
Remember when I talked about the 25 locales we added in ELK v.1 in the post Lions and tigers and bearsELKs, Oh my! with the following language list?
And then later when I talked about 11 locales we added in ELK v.2 in the post ELK stampede! with the following language list?
And remember when I posted about the Mitigation tools for IDN security problems and how I mentioned that the normalization stuff was being shipped along with the package?
Anyway, Scott and I have had an interesting dynamic in the times we have been in the same place -- he points out the stuff that is missing that he wants in the .NET Framework or Windows (or both) OR the stuff that he wants to see in future versions and then I get to point out that what he wants is either going to be added in Vista or that it is already in the product today. :-)
So today as the MVP Expo was wrapping up, I ran into Scott who had several of those questions:
Clearly, Scott is a man who should read this blog, since 80% of his questions would have been answered before he even made it to the MVP Summit! :-)
...about the size of the 'Dr. International' sign I was sitting under -- if you will recall, I claimed it would be a big sign.
Well, here it is, compared to a monitor to give some context:
Clearly the people who told me to expect a big sign have some problem with depth perception and/or size estimation. :-)
The Expo itself was very cool and I got to talk to several different MVPs, 50% of whom specifically told me that they loved this blog. Cool!
I'll post more about some of the issues later....
Inspired by Sara Ford's post, I thought I would mention that I too was going to be at the MVP Summit.
I will be at the Microsoft Programs & Services Expo on Wednesday, September 28th at 1:00pm to 5:00pm. I'll be at a booth with a big sign over it that says "Dr. International" on it, though I am not, in fact, Dr. International, who is over here. But I will be one of several internationally minded people at the Expo.
I will also be around at various other events during the Summit, wearing either a Blue or a Red Microsoft events shirt.
If you are an MVP and you see me (and you aren't avoiding me due to owing me money or whatever), then feel free to say hi!
Bill Poser was talking about Multilingual Google earlier, and I noted an interesting bit toward the bottom:
Using Google in another language is a fun way to try out a language you don't know real well. It's easy to switch to a language you do know well if you get stuck and it isn't all that complicated.
I do have one small complaint (beyond the fact that they don't yet have all of my favorite languages), which is that they are evidently sorting the list of languages the same way no matter what language they are in, in the order of the Unicode codepoints. This yields unexpected results.
For example, on the Catalan list Arabic comes last, after Zulu, because the Catalan word for Arabic is Àrab and the À, whose Unicode codepoint is 0x00C0, follows all of the ASCII letters. Z is 0x005A. If Google really wanted to do things right, they would sort the names using the appropriate collating rules for each language.
This does indeed seem a little unfortunate to me. I know how hard I worked to get the language list on trigeminal.com to be sorted according to the chosen UI language (as we all know, browsers only have one language setting, so like everyone else I overloaded!).
Of course that may be wrong too if one knows what the user locale is, and the setting is different from the UI language. But both are preferable to Unicode code point order!
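To make Bill's Catalan example concrete, here is a minimal sketch in Python. The list of names is just an illustrative sample, and the accent-stripping sort key is only a rough stand-in for real Catalan collation rules (my assumption, chosen to keep the example self-contained; it is not how Google or trigeminal.com does it).

```python
import unicodedata

# Illustrative sample of language names as they might appear in a Catalan UI.
# "\u00c0rab" is "Àrab", the Catalan word for Arabic quoted above.
names = ["Zulu", "\u00c0rab", "Basc", "Franc\u00e8s"]


def rough_collation_key(name):
    """Rough stand-in for a collation key (an assumption, not real Catalan rules):
    decompose, drop combining marks, then case-fold."""
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


# Plain code point order: U+00C0 sorts after every ASCII letter, so Àrab lands last.
by_code_point = sorted(names)

# Accent-insensitive order: Àrab sorts with the other names starting with A.
by_rough_collation = sorted(names, key=rough_collation_key)

print(by_code_point)
print(by_rough_collation)
```

A real implementation would use the collation rules for the chosen language rather than stripping accents, but even this crude key puts Àrab where a Catalan reader would expect to find it.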
I will talk more about this particular user expectations topic soon -- it is an interesting one to me. :-)
It all seemed so simple -- that whole 'uppercase and binary comparison' semantic. Used by NTFS, by Windows in so many places like named pipes, mutexes, environment variables, and so on.
But then there is FAT and FAT32. :-(
Take the two characters:
Note that they are the compatibility and combining forms of the same Jamo (see this post for more info on the purposes of each).
Stick them both in filenames in the same directory in a FAT or FAT32 drive (ㄱ.txt and ᄀ.txt).
Works just fine.
Now change the default system locale to Korean and reboot.
If you just try to create the same files in a new directory, then it will give you an error:
There are all kinds of weirdnesses though:
Now if you look at code page 949, U+3131 is there and can roundtrip (it is 0xa4a1), but U+1100 has a best fit mapping to the same character. I would at first have thought that this had something to do with the problem (it is certainly the cause for problems on Windows 98 and Me!), but in Win2000 I can create filenames in Unicode-only languages (where everything would map to ? on the code page) on these drives and I have no problems at all. So this is not just a simple code page issue.
It is also not a simple collation being used incorrectly issue, since these two characters are not considered equal there, either.
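The code page facts above can be checked outside of Windows. The sketch below uses Python's `cp949` codec, which is strict and has no best-fit fallback, so it only shows the round-trip side of the story; the `str.upper()` comparison is merely my approximation of the 'uppercase and binary comparison' semantic, not the table that any file system actually uses.

```python
import unicodedata

COMPAT_KIYEOK = "\u3131"    # HANGUL LETTER KIYEOK (compatibility Jamo)
CHOSEONG_KIYEOK = "\u1100"  # HANGUL CHOSEONG KIYEOK (combining Jamo)


def encode_cp949_or_none(text):
    """Return the code page 949 bytes for text, or None if the strict
    codec has no round-trip mapping for it (no best fit is attempted)."""
    try:
        return text.encode("cp949")
    except UnicodeEncodeError:
        return None


compat_bytes = encode_cp949_or_none(COMPAT_KIYEOK)
choseong_bytes = encode_cp949_or_none(CHOSEONG_KIYEOK)

# Approximation of "uppercase, then compare binary": the two names stay distinct.
same_under_upper_binary = (
    (COMPAT_KIYEOK + ".txt").upper() == (CHOSEONG_KIYEOK + ".txt").upper()
)

# Compatibility normalization, on the other hand, folds the two together.
same_after_nfkc = (
    unicodedata.normalize("NFKC", COMPAT_KIYEOK)
    == unicodedata.normalize("NFKC", CHOSEONG_KIYEOK)
)

print(compat_bytes, choseong_bytes, same_under_upper_binary, same_after_nfkc)
```

So the two characters only collide if something maps them through the code page with best fit turned on, or applies a compatibility normalization along the way.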
Luckily it also does not repro on Windows XP, Server 2003, or Vista. So whatever is going on here, they fixed it.
But it does keep the filesystem thing from being simple, especially since the newer machines will have the same behavior if you access those Win2000 drives over the network....
(Special thanks to Gregg Miskelley and Dylan Lingelbach for pointing out some of the anomalies here!)
This post brought to you by "ㄱ" and "ᄀ" (U+3131 and U+1100, a.k.a. HANGUL LETTER KIYEOK and HANGUL CHOSEONG KIYEOK)
Back in the end of April I talked a little about the Hijri calendar, and back in the beginning of April I posted some more.
In that last post I talked about the date advance functionality in Regional and Language Options that allows small alterations to the beginning of the month depending on when the moon was spotted, and mentioned that there were plenty of places where the algorithm is supported but the ability to make the changes is not. For example VB/VBA/COM, and also SQL Server.
I was thinking about all of this the other day when I saw this post in Mohamed Sharaf's blog. He was pointing out the best way to treat Hijri dates as dates, and to sort them appropriately. It does deal well with the SQL Server limitations that do not allow dates with years prior to 1753 by assisting with the conversion of the Hijri dates to Gregorian dates.
Now this does not help with the limitation of the date advance setting not being present, but since that setting would only correctly apply to the current month, it is hard to really know how one would apply it. Mohamed's post does deal well with the solvable portion of the limitations in SQL Server related to Hijri dates....
This post brought to you by "۾" (U+06fe, ARABIC SIGN SINDHI POSTPOSITION MEN)
A few years ago, there was someone internal at Microsoft who was asking about storing binary data in a string. Since they were using VB, I pointed out the many problems with a lot of VB (the intrinsic controls, the built-in functions, the Win32 and other API calls, and more), any of which could corrupt the binary data.
In fact, VB's easy conversions between byte arrays and strings are in many cases somewhat evil, given the chance of data corruption not too long after the conversion.
They were insistent that none of that would happen.
So I finally gave in and said maybe it will be fine, only to have someone else point out a fairly elemental issue -- that UTF-16 has an even number of bytes, while binary data may not. Plus a bunch of other reasons why this particular misuse of strings could really be a problem.
You know what? That person was right.
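A quick way to see the point is to hand arbitrary bytes to a UTF-16 decoder. This is a Python sketch rather than VB, so it only illustrates the underlying encoding constraint; the lone surrogate case is my own added example of another way that binary data can fail to be valid text.

```python
def try_decode_utf16(data):
    """Return the decoded string, or None if the bytes are not valid UTF-16LE."""
    try:
        return data.decode("utf-16-le")
    except UnicodeDecodeError:
        return None


valid_text = b"\x41\x00\x42\x00"  # two complete code units: "AB"
odd_length = b"\x41\x00\x42"      # one code unit plus a stray byte
lone_surrogate = b"\x00\xd8"      # U+D800, a high surrogate with no partner

for blob in (valid_text, odd_length, lone_surrogate):
    print(blob, "->", try_decode_utf16(blob))
```

Only the first blob survives; the other two are perfectly reasonable binary data that simply are not strings.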
Then, a few years later, the question came up again, this time in VBScript. Of course everything is a Variant there, but people had problems where they wanted to try to interact with the bytes. So could they put it into a String and then use functions like AscB, LenB, and so on to work with it?
I honestly forgot about the earlier conversation.
So I gave the same warnings, they insisted that no operations involving conversion out of Unicode would be happening. And I said cautiously that if they follow those rules, then it might not hurt the data too much.
Luckily, Eric Lippert saved me from myself and once again pointed out why it was a bad idea.
I spent a little time trying to understand why I had forgotten the earlier conversation. Why the lessons I learn myself are so much easier to remember than the ones someone else points out. It was not embarrassment at being wrong or anything like that -- being wrong is how one learns!
Maybe it would be easier to remember if I did get embarrassed when I was wrong -- I never have trouble remembering times that I am embarrassed, after all. It is almost like my memory requires some kind of emotional tag -- good or bad -- to be effective.
Anyway, yesterday Shawn pointed out that you do not want to treat binary data like a string, and I did not have any problem with the advice, it all makes sense. But if you look at his post, the people who he is talking about are specifically trying to treat the binary data as a string and convert it out of Unicode, which is something I have warned against any time it came up. So I realized that I may not have learned anything yet, at least not to have a deeply internalized answer to the people who insist they will avoid converting the data.
So I finally did internalize a particular notion to cover both the practical and the theoretical aspects of the problem: it is not that strings are sacred or anything; they never were. It is that data is sacred.
Or at the very least that corrupting data (of any sort) is profane. :-)
We'll see if thinking that will help me remember the next time that comes up (of course the act of posting this to my blog may have contaminated the process; posting to the blog might make it memorable, too!).
I used to hear a story from a colleague about an Admin who would talk about how someone with status was a Very "V.I.P." person (from the description she was someone who you could tell was putting the quotes there, even if she was speaking).
This is obviously silly since V.I.P. stands for Very Important Person. It simply makes no sense to talk about a very very important person person!
Anyway, tonight I went to see Ari Bixhorn and Richard Turner talk about Indigo. It was a cool talk and even came with a T-shirt I might wear at some point due to the cool logo image thingy:
Of course they explained how it was not called Indigo anymore; it was actually called W.C.F., or Windows Communication Foundation. They were unable to repress the shudder that I am sure we all felt.
Which is (in my opinion) the worst product name for a cool code name since Internet Studio became Visual Interdev. But I'll talk about code names some time soon. I do like Ari's thought that the cooler the code name, the worse the product name -- this could henceforth be codified as Bixhorn's Law, if no one had yet stated it for the record....
The problem, however, was that toward the end of the talk, they said (and what was said was on the slide), in true Mickey Blue Eyes The La Trattoria style:
WCF Foundation
Now obviously there is a language phenomenon where acronyms like RADAR, LASER, and SONAR become words to many people who do not even know what they stand for, or even if they stand for anything. In computers it happens with BIOS, NT, and DOS and a million other terms just as readily. I have heard people say Light Amplified LASER and BIOS system and not shuddered more than momentarily.
But WCF Foundation?
That awful name for what looks to be the coolest thing since sliced bread (or more accurately, the coolest thing since the delivery service that can get the sliced bread delivered in the distributed environment!) is in no way mature enough to have lost its acronym status!
Anyway, I am taking the day off for my birthday, though I may end with a post or two here that may or may not be technical. In the meantime, do your best to remember what silly acronyms stand for so you can correct those from the camp of The La Trattoria, if you know what I mean! :-)
Not everything is in Unicode.
I mean, it is close, but there are still lots of things that are not there. They fall into two categories:
There are some areas of the Unicode code space that have been set aside for supporting such situations. The title for them is the Private Use Area. From the Unicode Glossary:
One of the most important points is in that first definition: "whose use may be determined by private agreement among cooperating users"
This would seem to be no better than the hack font solutions that often go with non-Unicode solutions out there. And to be honest, it is not really any better at all, being basically a shtetl (or, if one prefers, ghetto) for characters that are not in Unicode at this time.
Now that I think about it, I guess it is a little better than a font hack in that it does act like a shtetl and keeps the custom stuff out of pure character data that uses assigned Unicode code points.
But it also means that none of the Unicode property data, or fonts/font linking, or shaping engines can be used on big operating systems since they do not know what the characters are. Without arguing or even attempting to argue the semantic content of a term like "private use" it is just plain common sense that if it is in Windows, then it is hardly private....
And this is why (for example) the PUA is not considered "sortable" according to the IsNLSDefinedString function or the CompareInfo.IsSortable method, even though they are (in fact) given weight.
Web standards tend to think along the same lines, and the PUA is not considered acceptable for ideal use in identifiers in many contexts, such as XML. They are definitely second class citizens.
With that said, they will work. You have to build your own font, and you cannot hope to have the shaping done for you by Uniscribe and you have to build your own keyboard with a tool like MSKLC, but if you do all of that and arrive with no unrealistic expectations (and of course if you distribute your fonts and keyboards and so on to users yourself!) then things should work well for you.
But never forget that you are in the "do it yourself" or the "some assembly required" area of Unicode. The area where the only thing that is done for you is to give your characters a segregated space where they will not be mistaken for other, valid characters....
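For anyone who wants to see what "the only thing that is done for you" looks like from code, here is a small Python sketch. The helper names are mine (hypothetical, for illustration only); the one thing the Unicode character database tells you about a PUA code point is its general category, `Co`.

```python
import unicodedata

# The Private Use Area on the Basic Multilingual Plane.
BMP_PUA_FIRST = 0xE000
BMP_PUA_LAST = 0xF8FF
bmp_pua_size = BMP_PUA_LAST - BMP_PUA_FIRST + 1


def is_private_use(ch):
    """True if the character's general category is Co (private use)."""
    return unicodedata.category(ch) == "Co"


def describe(ch):
    """Return (code point, general category, name) for a character.
    PUA characters have no name, so a placeholder is returned instead."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<no name>"),
    )


print(bmp_pua_size)
print(describe("\uf8ff"))
print(describe("A"))
```

No name, no script, no casing, no shaping behavior: everything else about the character is up to the private agreement.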
This post brought to you by "" (U+f8ff, a.k.a. the last PUA character on the Basic Multilingual Plane)
Regular reader Maurits asked, in the Suggestion Box:
Can you comment on Andreas Stötzner's 2004 proposal for an upper-case ß code point, which was rejected by the Unicode consortium?
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2888.pdf
The proposal in question underwent a great deal of (not always entirely civil) conversation on the "member's only" list of Unicode....
I have also posted about the Sharp S before, on this blog, for example here.
The initial post about the proposal came from Markus Scherer of IBM:
Purely personal opinion:
I would have expected a proposal like this to see the light of day in a little less than 5 months...
Aside from few "discussions" and other curiosities, the majority of the document's samples shows clear _lowercase_ ß in otherwise uppercase text. Using normal ß in this way (like applying simple case mappings rather than full ones) is reasonably common. While German school children might at first scratch their heads about this irregularity, I am pretty sure that there is no pressure at all for introducing an uppercase variant - other than possibly by a local font vendor in search of a market.
It might be more likely for Germans to give up on ß than to add an uppercase version.
markus
http://www.daujones.com/comments_all.php?usrid=3504
http://faql.de/eszett.html
http://www.eibe-online.de/schulen/bfs_bensheim/darstellung_bensheim/FachbereichFarbtechnik.htm
Michael Everson then weighed in:
Look again. It shows capital sharp esses, though it does show small sharp esses in capital use because nothing else was available. The Duden evidence is not to be ignored.
People have been discussing this issue for a century. I think Stötzner has shown clear evidence for a capital sharp s.
Nobuyoshi Mori took a more technical approach to the analysis of the proposal:
My understanding is:
1) Technically toupper( U+00DF ) should be defined. It is currently defined as : toupper( U+00DF ) --> U+00DF
2) There are several ways to "display" an "uppercase ß" in German:
2-1) "SS" This is what German orthography says. It is also the most usual way to handle it.
2-2) "ß" This is used in exceptional cases when either there is no space for two characters, or for typographic reasons, or by ignorance of the correct orthography.
2-3) "SZ" This is an old variant of 2-1, only very rarely used.
The change of the current definition 1) breaks many existing Unicode implementations and data, and will cause compatibility issues. The major issue is that the result of toupper( U+00DF ) becomes Unicode standard version dependent.
I know huge amount of Unicode implementations and Unicode customer data which will run into problems with the suggested Standard change. Most of the database implementations, OS and PC products, Computer language implementations such as Java, C#, etc would be some of the examples.
...
I therefore would like to request UTC to refuse the proposal.
Mark Davis agreed but had one small correction:
See http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G33992
The Unicode data supports two types of case operations: full and simple. The simple operations are for restricted environments where the number of characters cannot be changed. For any other situations the full mappings should be used. And when a full mapping is used, toUppercase( U+00DF ) --> "SS"
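The difference Mark describes is easy to see in a language whose string uppercasing uses the full mappings. In the Python sketch below, `full_upper` is just `str.upper()`, while `length_preserving_upper` is a hypothetical helper of mine that imitates a restricted, one-character-in, one-character-out environment. It is not the real Unicode simple case mapping table, though it agrees with what Nobuyoshi describes for U+00DF.

```python
def full_upper(text):
    """Uppercase using full case mappings (the string may grow)."""
    return text.upper()


def length_preserving_upper(text):
    """Hypothetical approximation of a 'simple' mapping: uppercase each
    character, but leave it unchanged if its uppercase form is longer
    than one character."""
    out = []
    for ch in text:
        upper = ch.upper()
        out.append(upper if len(upper) == 1 else ch)
    return "".join(out)


word = "wei\u00dfe"  # "weiße"

print(full_upper(word), len(full_upper(word)))
print(length_preserving_upper(word), len(length_preserving_upper(word)))
```

The full mapping changes the length of the string, which is exactly why the simple mapping exists for environments that cannot tolerate that.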
Markus Scherer was also unconvinced by Michael Everson's response:
I did look at the whole document including at each and every sample. Most of them are clearly lowercase ß between uppercase letters.
"People" may have been talking about it depending who "people" is. I spent my first 27 years in Germany and have never heard of any serious discussion of an uppercase ß. (Not sure I even need to qualify this with "serious".) Unless there has since then been an outcry in the population that I missed while visiting about once a year or while talking with my relatives, I don't see that this is on anyone's mind.
Real issues in discussion included the spelling of Kaiser (Keiser?) and other beloved words when the spelling reform was published.
Michael Everson responded thusly:
Which is what you would expect to find, in the absence of a more widespread availability of fonts with capital sharp esses. The evidence, and the author's arguments, suggest that the sharp ess is certainly acquiring case, and indeed has done so, whether the use of it is widespread or not. In my view, the Universal Character Set should encode such entities where they exist. They are facts.
Ken Whistler had to respond to that argument, though (I was tempted to respond myself, but I am glad he did instead since he argued the case more convincingly):
You are getting caught up in your own rhetoric about the UCS. The Universal Character Set is *not* the Universal Encyclopedia of Writing Systems -- it is a practical attempt at an engineering solution that everyone can use for digital representation of text.
Introduction of a capital ess-tzet just because it "is a fact", and despite the manifest evidence of overwhelming German practice and implementation to the contrary, while utterly ignoring all the kinds of implementation problems that would result -- just hinted at by Nobu -- is just foolish.
The problem you are trying to deal with, namely the appearance of a majuscule design in some fonts for an ess-tzet in an all-uppercase context, can be dealt with by other techniques, specific to fonts and to word-processing systems (if there even proves to be a demand for it, which I doubt, given Markus' testimony). It does not require a muley insistence that because somebody shows in some context that it *might* be treated as a distinct uppercase letter, that that resolves all issues and makes it obvious that separate encoding is required in the Unicode Standard for this "thing".
I am getting *really* impatient with the kind of rhetorical stance you have taken here. It is not your job nor the job of the UTC nor WG2 to reform the German writing system. And it certainly is not in the UTC's interest to introduce, on spec, a dubious new German character, without a demonstrated need, but with a horrendous downside potential for screwing up German casing implementations. My prediction is that the UTC is quite likely to turn this one down flat, without a single member in favor of it.
As I said, much more convincing.... :-)
But it looks like everyone dug their heels in; Michael responded to Ken:
As a student of the world's writing systems, I maintain that what I said is true. The evidence, and the author's arguments, suggest that the sharp ess is certainly acquiring case, and indeed has done so, whether the use of it is widespread or not.
This may be an issue for some, or many, or most, current implementations. That's a concern for industry today. My work on the Universal Character Set, as you know, looks to tomorrow.
Not being a complete idiot, my response to Stötzner on this particular character is "Get Germany and Austria behind the proposal."
But facts are facts. You recently wrote a piece where you acknowledged that many of Unicode/10646's current "mistakes" will one day be purged. A hack for casing sharp-ess would seem to be one such. Stötzner's Weise, Weisse, Weiße casing to WEISE, WEISSE, WEISSE/WEIßE is a problem German implementations have to deal with now. I strongly suspect that the "solutions" are not AT ALL uniform or satisfactory to the Mr Whites out there. A capital scharp-ess would allow a consistent solution, and would, in my view, be superior to some sort of smart-font hack where a sharp-ess preceded by a capital letter would take on a different shape. That is not very portable, and, if I may remind you, from the 10646 side we are concerned with data preservation and transfer, not just implementation by big companies.
Yes, these are philosophical differences in the two standards, but they are there nonetheless.
>I am getting *really* impatient with the kind of rhetorical stance you have taken here. It is not your job nor the job of the UTC nor WG2 to reform the German writing system.
No, the Germans have been looking at that themselves. In 1902 they did, and Stötzner is doing it again today. That's also a fact.
>And it certainly is not in the UTC's interest to introduce, on spec, a dubious new German character, without a demonstrated need, but with a horrendous downside potential for screwing up German casing implementations.
I wouldn't encode the character on foot of this one proposal either. But there is a case to be made for this character, and it would be wrong to reject the proposal out of hand.
Others such as Asmus Freytag and Benson Margulies also chimed in agreeing that an answer of 'proposal insufficient' seemed best at this point.
John Hudson then mentioned:
By the way, I met Andreas Stötzner at the recent ATypI conference in Prague, and am familiar with his journal. He is an intelligent and reasonable man, and I doubt if he would be insistent about the encoding of a Capital Double S if the text encoding and processing impact were explained to him. He has documented, in an admirably thorough way, a development in *some* German typography, which needs to be addressed at some level of text encoding or display. It is not obvious that the best way to do this is to encode a new character.
There was a bit more discussion, but it kind of petered out without any real sense of consensus.
Shortly thereafter, at the November 2004 UTC meeting in Cupertino, CA, a bunch of discussion ensued, but in the end Consensus 22 happened:
[101-C22] Consensus: The UTC concurs with Stoetzner that Capital Double S is a typographical issue. Therefore the UTC believes it is inappropriate to encode it as a separate character.
and it was added to the Rejected Characters list with the following comment:
LATIN CAPITAL LETTER DOUBLE S (existence as character not demonstrated; would cause casing problems for legacy German data)
I probably would not have worded the consensus in just that way, but the end result would have been the same....
Three months later, a thread came up on the Unicode List about the Sharp S and uppercasing it, which mainly dwelt on issues other than adding the character. So I will spare everyone the further conversation. :-)
Ok, I am now three posts into this series:
Extending collation support in SQL Server and Jet, Part 0 (HISTORY)
Extending collation support in SQL Server and Jet, Part 1 (the broad strokes)
Extending collation support in SQL Server and Jet, Part 2 (generating sort keys)
But I really have not heard much of anything yet from readers. Is this something that just does not seem useful for developers? Or is everyone waiting for more code? Or am I just describing it in too confusing a way?
Just wondering.... think of this as a ping. :-)
I thought I would explain a bit more about how surrogates work in Unicode, since it does not seem very well described in a whole lot of places. First, some definitions (all from the Unicode Glossary and the Unicode Roadmap sites):
Ok, it is all as clear as mud now, right? :-)
The problem is that even if the definitions are applied consistently, there is no good feel for exactly how they work, how high and low surrogates combine, and so on.
(Other questions, like why high surrogates have lower numbers than low surrogates, are covered in other posts.)
Let's see if we can't do something about that....
(Warning: some MATH content ahead!)
We start with the Basic Multilingual Plane -- it consists of the code points from U+0000 to U+FFFF. Some of these code points are assigned; and a large subset of those are assigned characters. In all there are 65,536 code points in this and every other plane; you can also think of this as 0x10000 (10000 in hexadecimal) or just 2^16 code points. Whatever you find easiest, conceptually.
Now what happens with those high surrogate code points is that the block of 1024 of them is divided into 16 blocks of 64 each. And each one of those blocks is used for a plane:
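For anyone who wants to see that split spelled out, here is a small Python sketch. It is mine rather than anything from the Unicode site, the function name is made up for illustration, and it assumes the high surrogates run from U+D800 to U+DBFF. It computes which block of 64 high surrogates goes with each of the 16 supplementary planes:

```python
HIGH_FIRST = 0xD800   # first high surrogate (assumed range U+D800..U+DBFF)
BLOCK_SIZE = 64       # 1024 high surrogates / 16 supplementary planes


def high_surrogate_block(plane):
    """Return (first, last) high surrogate used for supplementary plane 1..16.

    Hypothetical helper, for illustration only.
    """
    if not 1 <= plane <= 16:
        raise ValueError("supplementary planes are numbered 1 through 16")
    first = HIGH_FIRST + (plane - 1) * BLOCK_SIZE
    return first, first + BLOCK_SIZE - 1


# Print the block of high surrogates that belongs to each plane.
for plane in range(1, 17):
    first, last = high_surrogate_block(plane)
    print(f"Plane {plane:2d}: U+{first:04X}..U+{last:04X}")
```

Plane 1 gets the first 64 high surrogates, Plane 2 the next 64, and so on until the 1024 are used up at Plane 16.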
By convention, U+[##]FFFE and U+[##]FFFF of each plane are set aside and reserved, never to be assigned. This allows internal processes to use them as sentinels. Note that they should never be interchanged with any other process!
Now the way things are numbered, each high surrogate is used, serially, combining with every possible one of the 1024 low surrogates before moving on to the next high surrogate. Thus for supplementary characters you see the following order:
(I skipped some spaces in there for obvious reasons!)
This mechanism allows for many things such as simple range checking and easy conversions between code point and surrogate pair (it is a simple algorithmic macro to do the conversion when/if it is ever needed).
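To make the range checking and the conversion concrete, here is a minimal Python sketch of both directions. The function names are my own invention for this example (they are not taken from any SDK), and the code assumes the usual UTF-16 ranges: U+D800 to U+DBFF for high surrogates and U+DC00 to U+DFFF for low surrogates.

```python
def is_high_surrogate(unit):
    """Range check: is this 16-bit code unit a high surrogate?"""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """Range check: is this 16-bit code unit a low surrogate?"""
    return 0xDC00 <= unit <= 0xDFFF


def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits pick the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits pick the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The numbering order described above: one high surrogate runs through all
# 1024 low surrogates before the next high surrogate starts.
for cp in (0x10000, 0x10001, 0x103FF, 0x10400, 0x10FFFF):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:06X} -> {high:04X} {low:04X}")
```

The loop at the end walks across the boundary between the first and second high surrogate, and then jumps to the very last supplementary code point.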
When combined with the way that scripts are assigned in blocks, it is easy to notice things like the following (not a complete list, just a sample!):
So when you combine the BMP's 2^16 code points with the 16 planes of 64 * 1024 each (which is also 2^16 code points!), you get 17 * 2^16 or 1,114,112 code points in total -- which is where that interestingly arbitrary-looking number comes from!
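If you would rather let the computer do the math, the same count can be checked in a few lines of Python. This is just a sketch with variable names of my own choosing, again assuming the standard surrogate ranges:

```python
BMP_SIZE = 2 ** 16                      # U+0000..U+FFFF
HIGH_COUNT = 0xDBFF - 0xD800 + 1        # 1024 high surrogates
LOW_COUNT = 0xDFFF - 0xDC00 + 1         # 1024 low surrogates

# Every high surrogate pairs with every low surrogate.
supplementary = HIGH_COUNT * LOW_COUNT  # 16 planes of 64 * 1024 each
total = BMP_SIZE + supplementary

print(f"{supplementary:,} supplementary + {BMP_SIZE:,} BMP = {total:,}")
print(f"highest code point: U+{total - 1:06X}")
```

The total works out to 17 * 2^16, and one less than that is the familiar upper limit of U+10FFFF.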
Unicode's Roadmap site has a lot of information about the potential placement of future character allocations in Unicode, for those who are interested.
And for a more reality-based set of links, if you look ahead to Windows Vista three macros have been added to the winnls.h that comes with the Vista SDK:
I would expect that the meanings are pretty self-explanatory, but if not you can look at the VSDK topics to which I linked. :-)
(On a side note, I find it very cool that the Windows Vista SDK is available right now to everyone, whether they are on the Vista beta or not. It really does help to explain features and functions!)
Now in future posts I could perhaps get into other topics, like algorithmic conversion between UTF-16 and UTF-32....
This post brought to you by all of the supplementary planes of Unicode