December, 2005

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    Popularity hurts objectivity

    • 0 Comments

    I have been fairly critical of both Google's and Microsoft's search engines in the area of searching on many different language and Unicode issues (e.g. the Sorting It All Out to Search Engines post). I personally look forward to the day when I can praise the work that either or both search engines do to better support the work that their corporate entities pay USD $12,000 and USD $2,000 a year for, respectively. It is a little embarrassing that neither one of them is doing better here....

    Now folks over on Language Log have on many occasions used search engines like Google to quickly look at frequency of use and other linguistic issues. They have never lost sight, however, of the fact that there are various limitations to this approach (good examples are Benjamin Zimmer's post entitled Googlinguistics: the good, the bad, and the ugly and Mark Liberman's post entitled More arithmetic problems at Google).

    I just bumped into another limitation in the last month though, one I thought I'd blog about....

    In this last month I have run across people reporting potential limitations/issues/bugs in language/locale-specific formatting, keyboards, locale data, calendars, and/or collation for Georgian, Armenian, Latvian, Japanese, Korean, Macedonian, and others. And although I am not a linguist, I do have those pesky delusions of linguistic aptitude, so I tried to do a little research on many of these issues.

    What I found was that it is hard to separate what is 'done in the wild' to see if Microsoft is doing the right thing since Microsoft's products are such a large part of 'the wild' in this context. Search engines, which index the web, can't really make that separation since there really is no explicit marking of content to know the difference and even if there were it is not a meaningful distinction since there is no way to separate 'correct' usage from 'Microsoft' usage since the Microsoft usage may in fact be correct!

    I was a bit staggered by the fact that the very popularity of the platform made it more challenging to research questions about the platform.

    Isn't William Shatner the one who said "Irony can be pretty ironic, sometimes" ? :-)

    In any case, I do not mind that I had to do a bit more formal of a job in an actual library to do some of the research; it took me back to when I was in school. And I have kept my delusions, since when I reported back on what I had found, people were both receptive to it and encouraging about it.

    Though I realize that the days where the library will work could also be numbered, as libraries fight to stay relevant in the eyes of people who find a Google or a Wikipedia search to be simply easier than using a library.

    In the wider sense, I realized this is not just a Microsoft problem. I mean, can any search engine hope to answer generic questions about whether Google is finding the correct result sets? If it were not for issues such as Unicode canonical equivalence that are beyond the reach of just the result set, then I wouldn't really have a reason to criticize except on individual search results. And that would seem just plain silly to most people.

    I guess there is a wider truth here that everyone realizes -- popularity makes it difficult to find objectivity.

    If you had told me that a month ago, a year ago, ten years ago I would have said "Duh!" so why it is such a shock dressed in other clothing is beyond me....

     

    This post brought to you by "ओ" (U+0913, a.k.a. DEVANAGARI LETTER O)

  • Sorting it all Out

    Some thoughts about the Indic keyboard layouts

    • 9 Comments

    Last month, Suzanne was talking about The All-India Keyboard and she made several good points related to its usability when compared to some other layouts.

    She specifically talked about the INSCRIPT keyboards:

    This keyboard works well for Devanagari, with its 34 consonants and 12 vowels. The vowels are encoded as both initials and diacritics so that makes 58 letters altogether and a few more symbols. No upper and lower case so all is well.

    I should interrupt to point out that although the INSCRIPT keyboard may "work well" for Devanagari, in the Windows 2000 timeframe Microsoft got a lot of feedback from native speakers that it was not the one that they preferred. This led to the inclusion of the Hindi Traditional keyboard layout, a layout that is generally considered to be more intuitive (it does have a lot in common with the INSCRIPT layout, but is not identical to it -- and little things can mean a lot).

    (For illustrative purposes, here is the Devanagari INSCRIPT layout followed by the Hindi Traditional one so you can see what I mean.)

    [Images: Devanagari INSCRIPT keyboard layout; Hindi Traditional keyboard layout]

    (Every shift state has differences for the two keyboards -- in many cases due to what looks like additional characters being added that are not used for Hindi. That makes me wonder about the use of the Devanagari INSCRIPT keyboard for Devanagari script languages other than Hindi)

    In any case, she then went on to talk about a place where the INSCRIPT layout did not even really seem to meet that minimum bar of "working well" for a language:

    Tamil, on the other hand, has only 18 consonants and 12 vowels. These vowels have two forms, as in Devanagari. Because these forms are context dependent there is an argument that the two forms could both be input with the same keystroke. That would make 30 letters altogether. In that case, the basic Tamil writing system could be represented on the keyboard in the unshifted state.

    Using the Inscript keyboard for Tamil means using a keyboard with 4 blank spaces in the unshifted state, while 3 more keys in the unshifted state have Grantha letters on them. These are letters for writing Sanskrit and are not part of the basic Tamil alphabet. Likewise 7 of the basic Tamil consonants are in the shift state.

    Of course I am forced to disagree with the premise that the original keyboard was the right design (based on the customer feedback that led to an alternate layout being preferred!). One of the most common problems that keyboard 'standards' suffer from is attempting to capture a perfect technical solution for a language without trying to capture the usability concerns at the same time. :-)

    But Suzanne was not referring to user expectations so much as appropriate layout for a language from a technical linguistic viewpoint. And note that if you are not a native speaker of a language you do not have the baggage of those user expectations, so from that point of view one could even claim that the use of a standard like the INSCRIPT layout across multiple Indic scripts has the advantage of making typing across the various languages easier. However, it is worthwhile to point out that the layouts themselves are considered to be non-optimal by native users of the languages in many cases, even if they are a good technical solution for non-native users of a language.

    Now there is a special usability problem with trying to describe one language in terms of another -- it is easy if you primarily know the original language, but this ease is to the detriment of the target language in many cases.

    (I hinted about similar issues in Korean in this post; I may talk more about that another day.)

    This is the situation with the Tamil keyboard layout in Windows.

    So although I disagree with Suzanne's chain of logic, in the end we agree about the conclusion that the keyboard is not optimal for Tamil. We just took two very different paths to get to the answer. :-)

    All of this does of course lead to some additional questions, which will be topics to post about in the upcoming year....

     

    This post brought to you by "ஔ" (U+0b94, a.k.a. TAMIL LETTER AU)

  • Sorting it all Out

    More on cursor movement

    • 28 Comments

    James Brown asked in the microsoft.public.win32.programmer.international and microsoft.public.win32.programmer.gdi newsgroups:

    Suppose I have the following two Arabic codepoints:

    U+0648 "arabic letter waw"
    U+0650 "arabic letter kasra"

    These render as a single glyph with Uniscribe

    When pasted into Notepad, the cursor (and selection highlight) can traverse into the middle of the cluster.

    When pasted into Wordpad, the cursor _cannot_ move into the middle of these characters.

    Which is the correct (or desirable) behaviour?

    Maybe someone can even explain, what significance does it have for the cursor to move into the *middle* of a grapheme cluster - how does the user know which character he/she has selected??

    thanks,
    James

    Excellent question, James!

    The desirable behavior is what you are describing as the WordPad behavior, to a point. Although if I paste a string of 12 of these pairs of characters (وِوِوِوِوِوِوِوِوِوِوِوِ) into WordPad, it will treat them as a single unit, which is not what I would call desirable. :-(

    The Notepad behavior you describe is also not preferred; in all cases other than the BACKSPACE character (for the reasons I describe here), you would want to have movement jump the text element boundaries, which would be those two characters you mentioned....
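
    To make "jumping the text element boundaries" concrete, here is a minimal Python sketch. It assumes a deliberately simplified rule (a base character plus any following combining marks forms one text element), which is not the full Unicode grapheme cluster algorithm and is not how Uniscribe is implemented. The function names are hypothetical.

```python
import unicodedata

def text_elements(text):
    """Group each base character with the combining marks that follow it.

    Simplified rule for illustration only: any character with a non-zero
    canonical combining class is attached to the preceding element.
    """
    elements = []
    for ch in text:
        if elements and unicodedata.combining(ch):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements

def caret_stops(text):
    """Return the code point offsets where a caret would be allowed to rest."""
    stops = [0]
    for element in text_elements(text):
        stops.append(stops[-1] + len(element))
    return stops

# U+0648 ARABIC LETTER WAW followed by U+0650 ARABIC KASRA
pair = "\u0648\u0650"
elements = text_elements(pair * 12)
```

    Under this rule the twelve pasted pairs come out as twelve separate elements, and the caret never lands between a WAW and its KASRA.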

    The bad news is that I can reproduce the behavior you describe in Windows Server 2003 SP1:

    The worse news is that I can reproduce the WordPad behavior I describe above in Windows Server 2003 SP1 and XP SP2.

    But the good news is that in XP SP2, Notepad behaves correctly and the cursor does not appear in the middle of the character....

    In IE6, I currently get the character splitting behavior. You can test out your own browser and version with the textbox below -- put the cursor in and move back and forth to see what happens:

     

    At least products are getting better though (the Vista version of Uniscribe has all of the XP SP2 updates and more!).

     

    This post brought to you by "و" (U+0648, a.k.a. ARABIC LETTER WAW)

  • Sorting it all Out

    Best practices for keyboards #0: MSKLC warnings in 1.0/1.1

    • 4 Comments

    When MSKLC was first developed, it had no validation step, and no warnings. But then as Simon Earnshaw, Cathy Wissink, and I were getting more involved with the process of creating this tool that was going to make it so much easier to create new layouts, we reached a realization.

    While it may be a bit too draconian to simply block people from doing things that do not fall in the category of our "best practices" (especially when scenarios may alter those practices!), at least adding in warnings would allow a little bit of social engineering to take place. People would be gently encouraged to avoid creating layouts that could cause problems out in the wild.

    Here are the warnings that exist in MSKLC today (this is from the MSKLC help file, in the topic entitled 'Validation Reference'):

    Warnings

    Warnings will not block the building of the keyboard but indicate potential problems that should be reviewed and corrected as appropriate. Warnings include:

    Defined character(s)

    • Character defined exclusively on a key that may not be present on all keyboards.
    • An unpaired surrogate code point was defined.

    Caps Lock/AltGr+CapsLock

    • Marked to be the same as SHIFT, but the characters do not appear to be cased versions of each other
    • Marked to be different than SHIFT, but the characters appear to be cased versions of each other

    Caps Lock/SGCAPS

    • SGCaps and CapsLock are both specified on the same Virtual Key

    Layout table

    • Character is not contained in the Windows code page of the specified language
    • Character is defined more than once in the layout

    Ligatures

    • Not all characters in ligature are contained in the Windows code page of the specified language

    Dead Keys

    • Dead key table is empty
    • Dead key defined on a ligature
    • Dead key without defined name
    • Dead key's base character does not exist in the keyboard layout
    • Combining character is not contained in the Windows code page of the specified language
    • Last entry in a dead key table should use a space as its base character

    CTRL/SHIFT + CTRL shift states

    • Unicode control characters defined in CTRL shift states
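
    As a rough illustration of what two of these checks amount to (a character defined more than once, and an unpaired surrogate), here is a small Python sketch. The layout representation, a dictionary from (key name, shift state) to a single character, is purely an assumption for the example; it is not the .KLC format and not how MSKLC itself validates.

```python
from collections import Counter

def layout_warnings(layout):
    """Return warning strings for a hypothetical {(key, shift_state): char} layout."""
    warnings = []
    counts = Counter(layout.values())
    for ch, count in sorted(counts.items()):
        code = ord(ch)
        if count > 1:
            warnings.append(f"U+{code:04X} is defined more than once in the layout")
        if 0xD800 <= code <= 0xDFFF:
            warnings.append(f"U+{code:04X} is an unpaired surrogate code point")
    return warnings

# Hypothetical sample layout with one duplicate and one lone surrogate.
sample_layout = {
    ("VK_A", "base"): "a",
    ("VK_A", "shift"): "A",
    ("VK_Q", "base"): "a",
    ("VK_Z", "base"): "\ud800",
}
found = layout_warnings(sample_layout)
```

    Like the warnings described above, nothing here blocks anything; the function only reports what should be reviewed.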

    Looking at things today, MSKLC has revolutionized keyboard layouts internally as well!

    These days when subsidiary contacts have new keyboards that they want to be added to Windows based on new standards they have to also provide the .KLC file with the layout. And as we have been picking up keyboards for the Vista release, we have been getting even more ideas for validation warnings -- because when mistakes are easy to make, people will make them and that can hardly be considered their fault.

    All of these issues apply to customers outside of Microsoft just as much as they do to these internal contacts, so they do make sense to add.

    Future posts in this series will talk about some of these new issues that have come up, and which may well be added to the validation process in upcoming releases of MSKLC....

     

    This post brought to you by "ଊ" (U+0b0a, a.k.a. ORIYA LETTER UU)

  • Sorting it all Out

    What's a secondary distinction?

    • 2 Comments

    It was over a year ago that I talked about how They ask me "why is my Korean text in random order?"

    It is a pretty important concept in collation to have items collate with multiple levels. What is interesting about this concept is how it is so hard to describe to people yet how easily and intuitively those same people will recognize the results.

    The differences are often very subtle rather than the more obvious case of Swedish (which I talked about here). Whether you treat a letter like ā (U+0101, LATIN SMALL LETTER A WITH MACRON) as a LATIN SMALL LETTER A with a diacritic on top of it or as an entirely separate letter right after a is a difference you will only notice if you are sorting several words whose differences include the two letters. Easy to see in a dictionary or a (potentially contrived) word list, to be sure, but not quite as obvious in everyday situations, even when the letters are commonly used.

    To give an example, it is simply easier to see the difference between

    • a
    • ā
    • ābols
    • ala
    • auns
    • āzis
    • b

    and

    • a
    • ala
    • auns
    • ā
    • ābols
    • āzis
    • b

    but a simple listing of letters like

    • a
    • ā
    • b
    • c

    would look identical in the two cases. When you are not dealing with large lists such as dictionaries, you may not notice the difference.

    Now there is one case where you often would have a large list on a computer, and that is an address book. However, even if you are looking at the list, in most cases you are typing the name which will shorten the list. In common usage, you may never notice that the computer is not meeting up to your expectations.

    The end result (in cases where the computer does not match the user expectations) is usually either not noticing at all or a subconscious sense that the computer has it wrong without an explicit understanding of what might be incorrect. If the differences end up being significant enough, users may eventually try to figure out what's wrong. But in most cases they will just report the bug rather than trying to dig into it. Because no matter how intuitive user expectations are, they're not very easily explained.

    In collation terms, the difference between those first two lists that I gave above is whether U+0101 has a primary distinction from the letter a or a secondary one. But if somebody is giving you a list that makes up the ALPHABET for a language that type of distinction is usually absent. So if somebody tells you that their alphabet is:

    aāàáâãäåbcdeēèéêëfghiīìíîïjklmnoōòóôõöpqrstuūùúûüvwxyýÿz

    then you would not have enough information to decide how the letters should sort. Because no real information is being given about the primary and secondary distinctions. In many languages that have letters with diacritics, you can't assume that they are all even handled the same way!
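
    The two orderings shown earlier can be reproduced with a small Python sketch. Both key functions are toy stand-ins for illustration (they are not how Windows collation tables work), and the 27-letter alphabet string is an assumption that only covers the sample words.

```python
import unicodedata

WORDS = ["ala", "āzis", "b", "a", "auns", "ā", "ābols"]

def secondary_key(word):
    """ā differs from a only at a secondary level: compare base letters first."""
    nfc = unicodedata.normalize("NFC", word)
    decomposed = unicodedata.normalize("NFD", nfc)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, nfc)  # the accented form only breaks ties

# Toy alphabet in which ā is its own letter, placed right after a.
ALPHABET = "aābcdefghijklmnopqrstuvwxyz"

def primary_key(word):
    """ā differs from a at the primary level: it is a separate letter."""
    nfc = unicodedata.normalize("NFC", word)
    return [ALPHABET.index(ch) for ch in nfc]

by_secondary = sorted(WORDS, key=secondary_key)
by_primary = sorted(WORDS, key=primary_key)
```

    Sorting with secondary_key interleaves the ā words with the a words (a, ā, ābols, ala, auns, āzis, b), while primary_key puts every ā word after every a word (a, ala, auns, ā, ābols, āzis, b). Given only the four entries a, ā, b, c, the two keys produce the same order, which is exactly why a bare alphabet listing cannot settle the question.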

    In collation, this does become crucial in situations like the one I pointed out in You can't ignore diacritics when a language does not give them diacritic weight, because of a difference between the user's and the computer's expectations, usually because the language settings are incorrect on the computer....

    Perhaps the best reason to make sure that your default user locale is set properly? :-)

     

    This post brought to you by "ā" (U+0101, a.k.a. LATIN SMALL LETTER A WITH MACRON)

  • Sorting it all Out

    Getting rid of your extra yen

    • 14 Comments

    Professor Robert Garfias asked me:

    I did a search and found your page which comes closest to describing my real life problem. I installed a Microsoft Multi Media keyboard and it found some Japanese language materials somewhere on my computer and then installed on its own and without asking, a Japanese language version of Intellipoint which replaced the \ with the yen sign. I got rid of the keyboard and after great effort got rid of the Intellipoint software (It does not allow an uninstall). The yen sign however remains.

    I am running Windows XP Pro and almost everything is OK. I think the missing backslash may be causing problems with some programs. It is disconcerting to look at one's files and see the folders and filenames separated by the yen sign. I can find no way of getting rid of it now that the offending software has been uninstalled.

    Do you have any ideas?

    Yep, I have gone on about various issues with both the yen and the won on several different occasions, haven't I? :-)

    Now the different path separator will not cause problems, though it can cause some confusion. But there is an easy fix.

    Well, the key thing to do here is to change the default system locale (a.k.a. the language for non-Unicode programs, on the third tab of Regional and Language Options). Once you change it out of Japanese (or Korean if the problem is some extra won!), then you will get back the regular REVERSE SOLIDUS for your path separator.

    It definitely seems like a bad idea for any software to change this setting without even explaining why; that is a problem I will try to follow up on with the Intellipoint people....

    If you still are having a problem with way more yen or won than you know what to do with and it was actually affecting your wallet rather than your computer, you could always send me a check c/o One Microsoft Way.... :-)

     

    This post brought to you by "\" (U+005c, a.k.a. REVERSE SOLIDUS)

  • Sorting it all Out

    globalization vs. localization (a new answer)

    • 8 Comments

    Not sure whether to laugh or cry (or both!):

    what is Globalization/Localization in Microsoft Framework perspective for building multi-cultural applications?
    Answer: If u declare a variable locally, it's localization. If you declare it globally, its globalization.

    (from Adnan Masood, via Mike Gunderloy)

  • Sorting it all Out

    When the roof got raised, and why

    • 7 Comments

    Documentation for API functions can sometimes fall behind and not keep up with the way functions work.

    I know, that is a huge newsflash for everyone.... :-)

    Eventually we fix it though.

    If you look at the Locale Information topic that contains the LCTYPE values used by GetLocaleInfo/SetLocaleInfo, you will notice that for all of the string fields that can be altered by SetLocaleInfo we list the maximum legal size (including the NULL character). For example:

    LOCALE_S1159
    String for the AM designator (first 12 hours of the day). The maximum number of characters allowed for this string is different for different releases of Windows:

    • Windows 95/98/Me, Windows NT 4/2000: nine.
    • Windows XP: thirteen for SetLocaleInfo, fifteen for GetLocaleInfo.
    • Windows Server 2003 or later: fifteen

    LOCALE_S2359
    String for the PM designator (second 12 hours of the day). The maximum number of characters allowed for this string is different for different releases of Windows:

    • Windows 95/98/Me, Windows NT 4/2000: nine.
    • Windows XP: thirteen for SetLocaleInfo, fifteen for GetLocaleInfo.
    • Windows Server 2003 or later: fifteen

    Now obviously this was not always what was there. In fact, if you look at the Platform SDK documentation in the October 2005 MSDN release, you will find a different story:

    LOCALE_S1159
    String for the AM designator. The maximum number of characters allowed for this string is nine.

    LOCALE_S2359
    String for the PM designator. The maximum number of characters allowed for this string is nine.

    (that number includes the NULL character)

    So the new documentation has certainly worked to better describe a problem that has existed in the documentation for years after Windows XP and Server 2003 have shipped.

    So how is it that the GetLocaleInfo and SetLocaleInfo limits are different?

    Well, I guess it was a combination of how the functions work and what they are trying to do in Windows.

    Prior to Vista, the locale data is always what we ship, which we trust. So GetLocaleInfo does not have to verify the length, it can just pick up its null-terminated string from its cache and return it. This is different than SetLocaleInfo, which has to do things with that string like put it in the registry and the cache. So we have to care about the maximum length a bit more.

    Now some time before XP shipped, the official limit was raised to 13 (12 plus the NULL) because some locales had strings longer than 8 characters.

    Unfortunately, no one noticed that Gujarati had strings that were longer than this:

    • પૂર્વ મધ્યાહ્ન
    • ઉત્તર મધ્યાહ્ન

    They are both 14 characters each. So you can retrieve them, but you can't set them!
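
    The arithmetic is easy to check with a short Python sketch. It assumes the limits are counted in UTF-16 code units plus one terminating NULL, which matches the counts given in this post but is my assumption about how the limit is measured; the helper name is hypothetical.

```python
AM_GUJARATI = "પૂર્વ મધ્યાહ્ન"
PM_GUJARATI = "ઉત્તર મધ્યાહ્ન"

XP_SET_LIMIT = 13  # SetLocaleInfo on Windows XP, including the NULL
XP_GET_LIMIT = 15  # GetLocaleInfo on Windows XP, including the NULL

def fits(text, limit_including_null):
    """Assumed rule: UTF-16 code units plus one NULL must fit in the limit."""
    utf16_units = len(text.encode("utf-16-le")) // 2
    return utf16_units + 1 <= limit_including_null

lengths = [len(AM_GUJARATI), len(PM_GUJARATI)]
can_get = [fits(s, XP_GET_LIMIT) for s in (AM_GUJARATI, PM_GUJARATI)]
can_set = [fits(s, XP_SET_LIMIT) for s in (AM_GUJARATI, PM_GUJARATI)]
```

    Each string is 14 characters, so with the NULL it needs 15: inside the GetLocaleInfo limit, outside the SetLocaleInfo one.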

    Sometimes we might look less strange/mysterious when we explain what is happening, at the cost of seeming a little more foolish. It is definitely a trade off!

     

    This post brought to you by "મ" (U+0aae, GUJARATI LETTER MA)

  • Sorting it all Out

    Administrator vs. Administrateur, et. al.

    • 7 Comments

    Yesterday, Ashutosh Galande blogged a bit about the dangers of using the string version of the Administrator account.

    The reasons that it is a bad idea are numerous, but I thought I might explain a bit more about the particular problem behind bugs like the one that is described in MSKB article 258163 and how MUI has an interesting effect here on some code.

    Now the MUI version of Windows always has some base language (usually English but for some languages, other base languages are used).

    And I am sure you can imagine the havoc that it would wreak on a computer if literal account names used for logging into the machine could change just by changing the UI language.

    So, there are at least three alternate potential implementation choices here:

    • Windows could accept all language versions of the string and treat them equally;
    • Windows could accept only the language versions that correspond to installed UI languages;
    • Windows could accept only the language version that corresponds to the original installed base language.

    Now there are many reasons that the first two choices can be incredibly problematic and even dangerous from a security perspective, so the third choice is the one that is done.

    HOWEVER, the localized account names (e.g. Administrateur for Administrator) are in the localized resources in many cases, which is probably a bad idea since the localized strings are chosen by UI language even though the account names are not. This mismatch is indeed the cause behind the problem described in 258163, although it probably could have been worded a little more clearly. :-)

    The safest answer is just as Ashutosh indicated -- using the SID to get the name rather than assuming a particular localized string....
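
    As a language-neutral illustration of why the SID is the safer handle, here is a small Python sketch that recognizes the built-in Administrator account by its well-known relative ID (500) instead of by its display name. The helper name and the sample SID values are made up for the example; real code on Windows would ask the system (for instance via LookupAccountSid) rather than pattern-match strings.

```python
import re

# Built-in Administrator: S-1-5-21-<three domain sub-authorities>-500
_BUILTIN_ADMIN_SID = re.compile(r"^S-1-5-21-\d+-\d+-\d+-500$")

def is_builtin_administrator(sid):
    """True if the SID string has the well-known Administrator RID (500)."""
    return bool(_BUILTIN_ADMIN_SID.match(sid))

# Illustrative values only; the domain part differs on every machine.
admin_sid = "S-1-5-21-1004336348-1177238915-682003330-500"
user_sid = "S-1-5-21-1004336348-1177238915-682003330-1001"

checks = (is_builtin_administrator(admin_sid), is_builtin_administrator(user_sid))
```

    The check gives the same answer whether the account is displayed as Administrator, Administrateur, or anything else.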

    For the actual bug, it is a simple case of misunderstanding one of the subtle (and usually obscure) differences between the MUI version of Windows and the localized version, and of course the problem that happens any time you rely on localization content to control what happens when code executes.

     

    This post brought to you by "ᢚ" (U+189a, MONGOLIAN LETTER MANCHU ALI GALI GHA)

  • Sorting it all Out

    COleDateTime's ParseDateTime and locales....

    • 2 Comments

    Some time ago, Mike asked:

    I need to convert some strings to dates and then use the dates for some calculations. I was looking at COleDateTime::ParseDateTime() and it had the statement - "Note that the locale ID will also affect whether the string format is acceptable for conversion to a date/time value." Can you give me some direction on what I need to do? Is there an article(s) I can read to get a handle on this?

    A very interesting question.

    The COleDateTime::ParseDateTime() method is an interesting one. The remarks in the documentation give a hint as to what is going on:

    The lpszDate parameter can take a variety of formats. For example, the following strings contain acceptable date/time formats:

    "25 January 1996"
    "8:30:00"
    "20:30:00"
    "January 25, 1996 8:30:00"
    "8:30:00 Jan. 25, 1996"
    "1/25/1996 8:30:00"  // always specify the full year,
                         // even in a 'short date' format

    Note that the locale ID will also affect whether the string format is acceptable for conversion to a date/time value.

    It may remind some of the 'Evil date parsing', Parse, and ParseExact post from last year. And with good reason -- it is indeed the evil date parsing logic in COM that is largely responsible for the uncertainty here.

    When you attempt to parse a date/time value, the method assumes that the data is valid and will do its best to convert it (using the LCID parameter as a 'hint'), even if the conversion is inappropriate. The method would probably be more stable and generally useful if the LCID were used for more than just a hint about issues like the order of date/month/year versus month/date/year, but at this point the behavior cannot be changed.

    So the LCID is just a hint and the hope is that you are passing a date to parse that falls within what is reasonable for that locale....
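
    To see why the day/month order matters, here is a minimal Python sketch of the ambiguity. It does not call COleDateTime or reproduce its behavior; the locale-to-pattern table and the function name are hypothetical stand-ins for whatever the locale actually implies.

```python
from datetime import datetime

# Hypothetical table: which field order a locale's short date implies.
SHORT_DATE_FORMATS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-AU": "%d/%m/%Y",  # day first
}

def parse_short_date(text, locale_name):
    """Parse a short date strictly according to the given locale's order."""
    return datetime.strptime(text, SHORT_DATE_FORMATS[locale_name])

as_us = parse_short_date("1/2/1996", "en-US")  # January 2, 1996
as_au = parse_short_date("1/2/1996", "en-AU")  # 1 February 1996
```

    The same string yields two different dates depending only on the locale. Note that this sketch is strict: a string that does not fit the pattern raises ValueError, instead of being converted on a best-effort basis as described above.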

     

    This post brought to you by "ඤ" (U+0da4, a.k.a. SINHALA LETTER TAALUJA NAASIKYAYA)

  • Sorting it all Out

    Locale dependencies in the managed debugger?

    • 0 Comments

    Ian Thomas asked in the Suggestion Box:

    The person who found the problem, and probably one or more who verified it, have probably taken this to the Visual Studio teams (even though it's 2003 - I understand that SPs will be issued from time to time). It's a little obscure, but the thread on the Australia .NET discussion list can be checked to get detail - eg, Piers Williams on 7Dec2005, thread "Visual Studio .NET" http://www.stillhq.com/aus-dotnet/archives2/msg13489.html

    Here's the problem (found by "SG"):

    "The problem is that Visual Studio seems to ignore any localization settings that are present in the machine configuration, Web.config file and aspx page directives when it comes to dates (and possibly other things I don't know about). This is evident in the Watch and Autos windows, and when you hover your mouse over a DateTime variable in break mode."

    The problem (as explained in the thread, URL given above) is that "it's not consistent across languages :-/

    As far as I can tell C# gets it right, whilst VB displays the non-localised date in the watch window. As expected both get it right in the immediate window. I had to look several times at this before I believed it (you'd assume that the visualisers in the debugger would be language-neutral, surely), and this doesn't appear to back up what you're saying because you're using C#." (quoting Piers Williams).

    NOW ..
    for the blog topic:

    "TDD (Test Driven Development) and Internationalization"

    WHY?

    Many developers (including many in the various Microsoft teams, whose blogs I seem to encounter frequently) are pushing the TDD barrow, from those in the Patterns & Practices arena through to those who are trying to get us to develop secure code. I find that it's a difficult practice / skill to perform - but I'm an average coder, not a Microsoftie.

    How difficult is it to add another strand to the testing, that includes internationalization? I don't know. But I'd think that getting the dates right for a range of cultures (like us Aussies and Poms) would not be too hard.

    I realise that the bug described is quite rare, and very obscure.

    It is true that TDD is something that many people are talking about these days.

    I'll talk about that first, then about the problem Ian mentioned (the one he is calling an obscure bug).

    I remember a long time ago, when I was doing work for the development team for Microsoft Access, that not everyone who was on the development team was running the product regularly.

    Now this was not true of everybody (it was certainly not true of the developers who came up 'through the ranks' from tester to wizard developer to core developer!). But there were actually a few devs on the team who were somewhat proud of the fact that they were able to do so much work on the product without really doing work in it.

    When I think of the principal benefit of Test Driven Development, it is indeed (in my mind at least) driven by a pride in nearly the opposite philosophy -- in making working in the product and verifying its functionality as a core part of the development process.

    However, for TDD to work, there has to be a really good understanding of the issues that one is trying to test. And here is where the problems can start....

    It is perhaps easy to imagine that anyone who is writing the code for a feature has a good understanding of the feature and how to use it.

    There are nuances, such as in the MS Access case -- where one may not be as knowledgeable in the scripting language VBA so one may not be as effective in testing some aspects of the functionality one is authoring.

    It is also perhaps easy to imagine that inside of Microsoft, where security training is such an important basic requirement and where threat models help people really evaluate potential risks of features, people could integrate knowledge of potential security threats into a TDD plan.

    However, as I pointed out in Why international test is an art (and why there are few fine artists), the talent involved in being able to really look at any feature and see the dimensions that internationalization can add to it is one that not even all testers have, let alone all developers. So I am not sure how effective developers would be at being able to add such testing to a core part of a TDD regimen.

    With that said, there are specific elements that probably could be added, as Ian suggests. If one is dealing with internationalizable aspects of a feature such as date formatting, one could imagine adding that aspect to the testing, and to the test driven development.
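
    A minimal sketch of what such a test could look like, in Python. The culture names are real, but the pattern table is a simplified assumption for the example and not actual locale data, and format_short_date is a hypothetical function under test.

```python
from datetime import date

# Simplified, assumed short date patterns (not real locale data).
SHORT_DATE_PATTERNS = {
    "en-US": "{m}/{d}/{y}",
    "en-AU": "{d}/{m}/{y}",
    "en-GB": "{d}/{m}/{y}",
}

def format_short_date(value, culture):
    """Hypothetical function under test: format a date for a culture."""
    pattern = SHORT_DATE_PATTERNS[culture]
    return pattern.format(d=value.day, m=value.month, y=value.year)

def test_short_date_per_culture():
    """One test, run once per culture, with the expectation written first."""
    sample = date(2005, 12, 7)
    expected = {
        "en-US": "12/7/2005",
        "en-AU": "7/12/2005",
        "en-GB": "7/12/2005",
    }
    for culture, text in expected.items():
        assert format_short_date(sample, culture) == text

test_short_date_per_culture()
```

    The sample date is chosen so that day and month are both 12 or less, since that is exactly the case where a swapped order goes unnoticed.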

    With that said, the particular bug Ian mentions (the non application-locale-dependent way in which the debugger for VB.Net handles formatting date and number variables) is an interesting one.

    The honest truth is that if it were up to me, it would be a configurable setting.

    I may want the dates and such to be formatted in the way they would be in the actual application, or I may want them in my own preferred formats.

    In my ideal debugger, I would have a configurable and easily changeable choice for this, just like I do for whether integers should be in decimal or hexadecimal.

    So I guess from my point of view, both experiences as reported may be somewhat broken. :-)

    Of course we do not know from the report what the user locale settings are on the machine, and all we know for sure about the dates in VB.NET is that they are using the same pound-delimited format that has been used in VBA/VBScript/COM forever to be the invariant format used for dates and such in code. So the exact behavior here has still not really been described. So this is not really a bug report so much as an incomplete attempt to describe a problem that may be a bug....

    In any case, I do not think that my ideal feature as I state above would work completely, or that I would even want it to -- after all, I would not expect (for example) the SortedList class to follow my preferences and ignore the underlying order that the object itself is using. So it may be a slippery slope to decide which features follow the fictional 'developer's preference' and which ones follow the application settings.

    However, what I would like to see here is less of a burden:

    • a bit more consistency when possible/feasible between programming languages, and
    • a bit more documentation on how things are expected to work in these situations in any case.

    I think that would lead to a better experience, in any case. :-)

     

    This post brought to you by "ミ" (U+30df, KATAKANA LETTER MI)

    Fictional could make things less functional

    • 5 Comments

    I have a list of things that I plan to write about at some point. That list includes ideas of my own, questions in email, items in the suggestion box, things floating around in my head, and posts that are half-written but not yet ready for the blog (and if you think for a moment about the variability of what I post, the concept of 'not yet ready for the blog' is somewhat frightening!).

    Sometimes I have a post that I had no plans to post immediately and then a suggestion from someone upgrades its importance.

    This post is about just such a topic.

    Maurits asked me the following question in the Suggestion Box:

    Could you comment on either or both of the two proposals to add JRR Tolkein's "Tengwar" alphabet into Unicode?

    http://std.dkuug.dk/jtc1/sc2/wg2/docs/n1641/n1641.htm
    http://www.evertype.com/standards/csur/tengwar.html

    And this is just such a topic.

    Disclaimer: I am speaking for neither Microsoft nor Unicode here -- these are only my own thoughts and speculations, which should therefore be weighed accordingly....

    I remember a few years back, when talking to a member company representative of The Unicode Consortium about his company's decision to scale back from FULL to ASSOCIATE. It seems that the higher-ups at his company had taken a good look at the work going on now and said that from a corporate point of view, Unicode was 'not complete, but complete enough for them' and that they did not need to be as fully involved in the goings-on.

    Although he did not personally agree with the implied judgment of the work that was going on, he had to admit that their expansion plans did not really have a requirement to cover all that was consistent with what Unicode was doing.

    More and more companies may come to the same (or a similar) conclusion eventually. I mean, it is easy for a smaller company to assume that companies like IBM and Microsoft will stay members, and that the big ticket members together with the synchronization with ISO 10646 will keep Unicode in a good enough place no matter whether the smaller member companies are full members or not.

    As this thought starts to occur to other companies, add to that the way that a corporate entity might look at proposals like the [rejected] one for Klingon, or the still pending proposals for Tolkien's Elvish scripts (Cirth and Tengwar).

    I mean, if and when we reach the point that Unicode has the time to seriously consider Tengwar and Cirth, any company may come to the same conclusion that Unicode is 'not done, but done enough for them'.

    Member companies could even think that now, since both scripts are on the Unicode roadmap, in the Supplementary Multilingual Plane. Though hopefully they are keeping their eye on the work that is actually going on in meetings. Though some may have trouble seeing the relevance of Egyptian Hieroglyphics to their business models, where the use is possibly more obvious.

    Even large companies like Microsoft will need to have a framework beyond the current locale model if they ever want to support historic scripts in any kind of built-in way, because the honest truth is that a copy of Windows localized into such a language or even using the date formats in the system tray is just a novelty, it is not a serious requirement. Even when the need to support the scripts themselves is seriously required by scholars.

    So in their own way, companies like Microsoft have the same problem -- they have to extend beyond what they currently do to move to such a model.

    Now Microsoft is actually doing that when it comes to features such as MSKLC and the Text Services Framework and OpenType -- by supporting the ability of people outside of Microsoft to input and display text before Microsoft gets around to the capability (if we plan to at some point), it becomes easier (or at the very least possible) to support such scenarios.

    You may have noticed that I avoided giving my own opinion on the importance of encoding Cirth and Tengwar into Unicode. :-)

    That is intentional -- I have no contact with the community that needs/wants to support them, and thus no way for me to find out

    • if encoding in the ConScript Unicode Registry is sufficient for the user communities
    • if the user needs of the user communities are met by the proposals
    • how urgent for them would be the need to encode the scripts into Unicode
    • whether there are assumptions about Unicode products picking up support for the scripts

    And even if I did make such contacts and was convinced of all of the above (which I have to admit is a huge IF), I would have to be convinced that those needs outweigh the possible PR problems for Unicode I previously mentioned.

    So, having no expertise in the Elvish scripts of Tolkien or in the needs of those who use them, and having a strong desire to see Unicode's reputation as a relevant standard for computer software preserved in future versions, it is easy for me to see no need for urgency in these proposals.

    Which of course raises another interesting point -- why not reject them and be done with it?

    I would actually be against doing this, to some extent. This is not so much for the sake of the proposals themselves as for the fact that for just about every argument I or anyone else could make for such a thing, an example of a script that actually has a serious need and which was accepted (or will be) could probably be produced.

    In my opinion that is because the best reason to reject such arguments is the one thing that would not (and could not) be in the record. Every other argument is pretty much an excuse, not a reason.

    When one tries to make excuses to get what one wants, it is easy to prove that one is not making sufficiently reasonable arguments.

    Leaving them in the roadmap strikes that balance that it seems everyone can live with.

     

    This post brought to you by "" (U+f8e4, a.k.a. a private use character, or KLINGON LETTER TLH in the CSUR)

    What Unicode version do you support?

    • 14 Comments

    When I was in my mid-20s, I lived in Columbus, Ohio. Living next door to me was a nice couple (Robert and Wendy) who were trying to start a family, and they were really having a tough time with it.

    (I promise there is a point to this particular recollection!)

    After a lot of effort and clinic visits and so forth (details are of course not relevant here), they finally managed it; she was pregnant.

    Any time people asked Wendy "Are you having a boy or a girl?", something that was reportedly happening a lot, her answer was invariably "Yes, I hope so! Having a boy or a girl would be great!".

    It has been many years since that time, but let me tell you that I think about Wendy and her answer any time someone asks me the question:

    What version of Unicode does MS [Windows|.NET|SQL Server|Office|Bob] support?

    I think that Wendy, if she is reading this right now, might be proud to hear my new answer to that eternal (or should I say infernal?) question:

    The version released by The Unicode Consortium.

    :-)

    Because there really is no definitive answer to this very non-specific question. The answer always depends entirely on the [usually one] specific issue that the person asking is looking for the answer to. For example:

    • If they are looking for conformance to a particular part of the standard such as normalization or UTF-8, then there may or may not be a specific answer, and we seldom put numbers on versions we support for that very reason (Unicode and our products are on 20 or more very different shipping cycles).
    • If they are looking for UCA support, the answer is that Microsoft does not use the Unicode Collation Algorithm, so they are definitely asking the wrong question (though they have made some changes to be a little more like us).
    • If they want to know whether ________ is supported (they fill in the blank with the language or script of choice) then the answer is that any version of Unicode supports subsets which means supporting a version does not mean supporting any particular character or characters -- they should ask whether the language or script is supported.
    • If they want to know whether ________ is supported (they fill in the blank with a particular character) then that subset answer I just pointed out applies, as does the fact that it really depends on what they mean -- do they mean in fonts, in Unicode properties, in collation, in fallback/linking/shaping, or what?
    • If they are looking to know about our Bidi support, then they should try to understand that Microsoft's support of bidirectional scripts in products predates UAX #9 (The Bidirectional Algorithm), and that in truth UAX #9 has been moving toward our implementation rather than the other way around!
    • If they want to know about Unicode properties, then shucks, I don't know what to tell you -- depends on product version. Though Whidbey is 4.1 not 3.2 and Vista hasn't shipped yet but the latest CTP is 4.1.
    • If they actually have no idea what they mean but are trying to fill in a space in a line item on a form that leaves a space for a number then they can make one up, since there is no answer anyway....
    • (I could go on but you get the point, I think!)
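
    To make the 'it depends' a little more concrete, here is a small sketch using Python's standard unicodedata module (Python is only a stand-in here; this is not how any Microsoft product answers the question). It shows that a version number really describes one thing, the character property tables of one runtime, and says nothing about fonts, collation, or shaping. The describe helper is a made-up name for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Summarize what this runtime's property tables say about one character."""
    category = unicodedata.category(ch)
    # A category of "Cn" would mean the code point is unassigned in the
    # Unicode version behind these tables; "Co" means private use.
    name = unicodedata.name(ch, "<no name in this version>")
    return f"U+{ord(ch):04X} category={category} name={name}"


# The version of the Unicode Character Database behind the tables above.
print("property data version:", unicodedata.unidata_version)

for ch in ("\u30df", "\u01a9", "\U0001d356", "\uf8e4"):
    print(describe(ch))
```

    The version string printed on the first line varies with the Python release, which is rather the point: the same question gets a different answer per component and per release.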

    So the polite answer in the end is IT DEPENDS ON WHAT YOU MEAN. CAN YOU ELABORATE A BIT?

    But for now, I am going to stick with my new answer.

    Perhaps it is ornery.

    But I think Wendy would want it this way.... :-)

     

    This post brought to you by "𝍖" (U+1d356, a.k.a. TETRAGRAM FOR FOSTERING)

    New in Windows Vista: OrdinalIgnoreCase for Win32

    • 9 Comments

    I have talked many times in the past about Ordinal and OrdinalIgnoreCase sorting behavior and when you might want to use it (such as here and here and especially here, for example).

    I went to great lengths to point out that the managed 'OrdinalIgnoreCase' functionality is important because it helps mimic the OS behavior with regard to symbolic identifiers.

    Of course there was one problem: no Win32 function existed that supported this type of comparison. So on the actual platform that needed the behavior, it took some extra work to get it.

    Now it isn't much work, but still...

    Anyway, that all changes in Vista, with the new CompareStringOrdinal function we have added to the NLS API.

    The documentation notes all of the various linguistic types of comparisons that the function does NOT handle, so that people can work to get the behavior they want, depending on what they are trying to do.
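
    For anyone who has not run into the idea before, the following is a rough Python sketch of what an ordinal comparison with an optional ignore-case step does. It is not the Win32 implementation: the function names are made up, it returns -1/0/1 rather than the CSTR_ constants, and using Python's str.upper() restricted to one-to-one mappings is only an approximation of the uppercase table the OS uses.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def simple_upper(ch: str) -> str:
    """Uppercase one character, but only when the mapping is one-to-one."""
    upper = ch.upper()
    return upper if len(upper) == 1 else ch


def compare_ordinal(a: str, b: str, ignore_case: bool = False) -> int:
    """Compare two strings code unit by code unit; return -1, 0, or 1."""
    if ignore_case:
        a = "".join(simple_upper(ch) for ch in a)
        b = "".join(simple_upper(ch) for ch in b)
    units_a, units_b = utf16_units(a), utf16_units(b)
    return (units_a > units_b) - (units_a < units_b)


# Symbolic identifiers that differ only in case compare as equal...
print(compare_ordinal("file.TXT", "FILE.txt", ignore_case=True))
# ...but nothing linguistic happens: precomposed and decomposed forms differ.
print(compare_ordinal("\u00e9", "e\u0301", ignore_case=True))
```

    Note what is missing on purpose: no normalization, no expansions (so "straße" and "STRASSE" are not equal), and no locale sensitivity of any kind.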

    So now that we can get the answer for Windows and the .NET Framework, all we need is a faster way to make the SQL Server binary collations have this admittedly un-natural comparison behavior as well!

     

    This post brought to you by "Ʃ" (U+01a9, a.k.a. LATIN CAPITAL LETTER ESH)

    We can't win when it comes to those pesky standards, can we?

    • 11 Comments

    The title of this post is a little tongue-in-cheek; I do not actually think of standards as pesky in a generic sense like that. :-)

    But the other day, someone inside Microsoft was having trouble with the XmlSerializer class in the .NET Framework. The problem was something like this:

    I’m serializing a string using the XmlSerializer.  It’s serializing this input:

    "Line\r\nBreak\r\n\tTab"

    And deserializing it into this string:

    "Line\nBreak\n\tTab\n"

    So when I put it back into my Textbox it loses its linebreaks.

    This behavior is actually by design for the XmlSerializer though -- and based on the standard!

    The behavior is defined in the XML Spec, right here:

    2.11 End-of-Line Handling

    XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).

    To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
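
    The rule quoted above is small enough to write down directly. Here is a sketch in Python (the function name is made up; a real XML processor does this internally while reading its input):

```python
def normalize_line_breaks(text: str) -> str:
    """Apply the XML 1.0 section 2.11 rule: CR LF and a lone CR both become LF."""
    # Replace the two-character sequence first so it yields one LF, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")


original = "Line\r\nBreak\r\n\tTab"
normalized = normalize_line_breaks(original)
print(repr(normalized))  # 'Line\nBreak\n\tTab'
```

    The tab survives and every line break collapses to a single line feed. (The trailing line feed in the report above is not something this rule produces by itself.)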

    Funny how a standard only annoys us when the one behavior they choose is what we are using. :-)

    Luckily very cool MSFTie Elena Kharitidi (who I met years ago when she was working on the Jet team) pointed out how you can make sure that a serialize/deserialize will roundtrip a little better:

    Normalizing string values is the default XmlSerializer behavior, but you can override it by configuring your XmlWriter before calling the XmlSerializer.Serialize() method.

    You need to use XmlWriter.Create() with XmlWriterSettings.NewLineHandling = NewLineHandling.Entitize.

    There are other unfortunate side effects of choosing XML as your persistence format: there are ranges of characters (most notably the ones from 0x0 to 0x1F without TAB, CR, LF) that are considered illegal in XML 1.0.   The default XmlWriter will write them, but default XmlReader will throw on read.

    But if you use XmlTextReader, you can work around this by setting Normalization=false; on the reader.
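
    For reference, the character ranges that XML 1.0 allows are short enough to check by hand. The following Python sketch (function names are made up for illustration) finds the characters in a string that would be illegal in an XML 1.0 document, which can be handy to run before choosing XML as the persistence format for arbitrary strings:

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # TAB, LF, CR
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_illegal_xml10_chars(text: str) -> list:
    """Return (index, code point label) pairs for characters XML 1.0 forbids."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if not is_xml10_char(ch)
    ]


print(find_illegal_xml10_chars("ok\x00\tfine\x1f"))
```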

    Okay, so there are ways to get the standards conformant behavior, and ways to get the other behavior when you need it. Of course somebody will still be unhappy with the defaults we choose, which is why we can't seem to win these situations! :-)
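
    Both behaviors are easy to see outside of .NET too. The sketch below uses Python's standard library as a stand-in (it is not the XmlSerializer or XmlWriter API, and the serialize/deserialize helpers are made-up names): writing the carriage return as the character reference &#xD; plays the role that NewLineHandling.Entitize plays above, since a character reference is not subject to line break normalization when the document is read back.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize(text: str, entitize: bool) -> str:
    """Wrap text in a <value> element, optionally writing CR as &#xD;."""
    extra = {"\r": "&#xD;"} if entitize else {}
    return "<value>" + escape(text, extra) + "</value>"


def deserialize(xml_text: str) -> str:
    """Parse the element back and return its text content."""
    return ET.fromstring(xml_text).text or ""


original = "Line\r\nBreak\r\n\tTab"

# Default: the parser normalizes CR LF to LF, as the standard requires.
plain = deserialize(serialize(original, entitize=False))

# Entitized: the CR travels as a character reference and survives the trip.
kept = deserialize(serialize(original, entitize=True))

print(repr(plain))
print(repr(kept))
print(kept == original)
```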

     

    This post brought to you by "ඇ" (U+0d87, a.k.a. SINHALA LETTER AEYANNA)
