
July, 2007

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
    Pluralization(s) can be singularly difficult

    A tribute to plurals, with fondest memories of the first comedian I ever enjoyed, Allan Sherman (original inspiration of Weird Al Yankovic for those who don't know the name):

    One Hippopotami

    One hippopotami cannot get on a bus,
    Because one hippopotami is two hippopotamus.
    And if you have two goose, that makes one geese.
    A pair of mouse is mice. A pair of moose is meese.

    A paranoia is a bunch of mental blocks.
    And when Ben Casey meets Kildaire, that's called a paradox.
    When two minks fall in love, with all their heart and soul,
    You'll find the plural of two minks is one mink stole.

    Singulars and plurals are so different, bless my soul.
    Has it ever occurred to you that the plural of "half" is "whole"?

    A bunch of tooth is teeth. A group of foot is feet.
    And two canaries make a pair--they call it a parakeet.
    A paramecium is not a pair.
    A parallelogram is just a crazy square.

    Nobody knows just what a paraphernalia is.
    And what is half a pair of scissors, but a single sciz?
    With someone you adore, if you should find romance,
    You'll pant, and pant once more, and that's a pair of pants

    Pluralization is hard.

    Even in English you need a huge dictionary with all of the weird and interesting exception cases (once you convince yourself that sticking an "s" on the end of every word won't do it).

    It came up again in multiple comments by Centaur on In a much better position to handle inserts, like this one and this other one.

    And his examples were not really overstatements, believe it or not.

    We'll start with the obligatory Wikipedia link on pluralization, which will help to scratch the surface enough to make one realize what one has just gotten oneself into.

    Languages like Spanish are considered pretty simple but can still fill a page with the explanation of them.

    English is reasonably simple with its cases for one, many, and uncountable (where the uncountables are usually in singular form, and zero items take the plural form). But all of the rules with subject/verb agreement are the main force behind me not paying attention in English class as a youth and being much more interested in the weird rules of other languages than the rules of my own. You could almost blame my linguistic notions on the crazy orgy of inconsistencies embodied in my native tongue.

    And then in French things are pretty simple (ref: here, here, and here). Then again those pages list exceptions up the wazoo. Oh, and zero items take a singular form, which also sounds weird to me though someone from France would find the converse to be true.

    Then there is Hebrew, whose uncountable words tend to take plural form. Oh, and they add gender to the mix as many others do, each with a different suffix. Then there are those bisexual words like "one" which have both a masculine and feminine suffix form. And some words that are feminine yet take a masculine plural suffix.

    Most Indic languages have singular, dual, and plural forms, though Hindi only has singular and plural while Sanskrit has the dual form too.

    Lots of other Indo-European languages also have a dual form.

    Polish has singular and plural like most of them, but then it also has a paucal form for when the last digit is 2, 3, or 4 (not including 12, 13, or 14).
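    To make the Polish rule above a bit more concrete, here is a minimal sketch in Python of picking a form from the count. The function name, the category labels, and the sample word forms for "plik" (file) are my own choices for illustration, and the sketch only handles non-negative whole numbers; real localization libraries cover far more cases.

```python
def polish_plural_category(n: int) -> str:
    """Pick a plural category for a non-negative whole-number count."""
    if n == 1:
        return "singular"
    last_digit = n % 10
    last_two = n % 100
    # Last digit 2, 3, or 4 takes the paucal form, except for 12, 13, and 14.
    if last_digit in (2, 3, 4) and last_two not in (12, 13, 14):
        return "paucal"
    return "plural"


# Illustrative word forms for "file"; assumed here purely for the example.
FORMS = {"singular": "plik", "paucal": "pliki", "plural": "plików"}


def count_files(n: int) -> str:
    """Format a count with the matching form of the sample noun."""
    return f"{n} {FORMS[polish_plural_category(n)]}"


if __name__ == "__main__":
    for n in (1, 2, 5, 12, 22, 112):
        print(count_files(n))
```

    Even this tiny rule needs two modulo checks, which is a decent hint at why a single "add an s" function never survives contact with a second language.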

    Persian (or is it Farsi? Or maybe not!) has many rules a lot like English, other than the influx of Arabic loan words that come with their plurals and make up a lot of exceptions -- which, come to think of it, is also a lot like English. Though with different loan words (and of course the different script).

    And Slovenian has a special purpose "dual" that is used for all numbers ending in two.

    To put it into programming terms a bit, Jeff Boulter has talked about it in his 5 way(s) to pluralize, and I just noticed that Tom White also quoted a bit of that Allan Sherman song and has even made a plea for people with knowledge of other languages to get involved with Java solutions here.

    But C# is out there too -- see dmitryr's Simple English Noun Pluralizer in C#, for example, which has a couple of great comments that delve into additional exceptions and other languages.

    Or fun ones like Bradley Tetzlaff's C# 2.0 Ninety Nine Bottles of Beer Example, which shows a very important practical implementation. :-)

    Even my own IStemmer'ed the tide talks about how stemming is involved with pluralization (among other things).

    The rules are very complex even to get any one language done perfectly, so doing lots of languages is staggering.

    Definitely a hard problem to consider. I think I'll leave that one alone, myself, and just try and stem some tides (leaving the stemmering to others!).

     

    This post brought to you by S (U+0053, a.k.a. LATIN CAPITAL LETTER S)

    Look out for Font Rage

    It is a known fact that some people hate Comic Sans MS (why else have a website like http://www.bancomicsans.com/ if everyone loved it?).

    Though as Mark Liberman pointed out yesterday in Language Log in his post entitled Font Rage, some people are choosing to be pretty extreme.

    I happen to like the font -- it is my main font in email in Outlook, and just like last year I still pine for the day that they make Comic Sans Fixed a reality:

     

    This despite the fact that holding my breath waiting would likely prove fatal....

    I'll dig up another instance of font rage in a few hours.

     

    This post brought to you by ආ (U+0d86, a.k.a. SINHALA LETTER AAYANNA)

    What does DAO have that ADO/ADOx/JRO do not?

    BLOG OWNER'S NOTE: All comments to this post are now moderated to keep the volume down. You can post to another blog if you have further comments....

    If you don't care about Jet or DAO or Access then this is a post you can skip! 

    This post is a reposting of an article written over seven years ago, and is still entirely true.

    A few versions ago, the Access team started formally backing away from de-emphasizing DAO and once again was adding it to default references in new databases.

    As a culmination of this slow change, the updated version of DAO that is specifically used by Access 2007 and ACE is now once again under active development.

    Proof positive that rumors of DAO's (and Jet's) death were greatly exaggerated? :-)


    Microsoft has clearly positioned ADO as the replacement to DAO.... many Microsoft representatives have gone so far as to state that DAO is DOA (Dead On Arrival, a term used in the US to describe people who are dead when an ambulance arrives hoping to take them to be saved). HOWEVER a lot of core functionality is supported in DAO that ADO/ADOx/JRO do not support, and might never actually support since Microsoft seems to be pushing customers in other directions. While Jet itself will not "die" it is clear that it is no longer a strategic platform, so there simply does not seem to be enough interest to make things work more effectively in Jet.

    Here, for the full record, is a list of all of the capabilities DAO has that ADO does not:

    • Running transactions that use multiple databases (works in DAO since transactions are at the Workspace level, fails in ADO since transactions are at the Connection level -- and Connections only support one database)
    • Opening a table in a mode that keeps others from opening it in read-write mode (works in DAO through use of the dbDenyWrite constant, fails in ADO at the table level since its closest analogue adModeShareDenyWrite only can be set at the Connection level).
    • Opening a table in a mode that keeps others from opening it at all (works in DAO through use of the dbDenyRead constant, fails in ADO at the table level since its closest analogue adModeShareDenyRead only can be set at the Connection level).
    • Creating users and groups in a way that allows you to recreate them in case an MDW file is lost (works in DAO using CreateUser/CreateGroup which allow you to specify PIDs, fails in ADO which does not allow you to specify PIDs).
    • Securing Access project objects such as forms, reports, or macros (works in DAO through the Permissions property on Document objects, fails in ADOx because it does not properly map the expected constants for permissions to execute, read changes, and write changes to these object types).
    • Ability to create a linked ODBC table that is updateable (works in DAO through its call to the SQLStatistics function, fails in ADO which makes no such call).
    • Ability to create "Prevent Deletes" replicas (works in DAO by passing the value of &H4 to the CreateReplica call, fails in JRO, which has no such analogue).
    • Method for determining folder information from Exchange/Outlook folders and columns (works in DAO through the Attributes of the TableDef/Field objects, fails in ADO since this information is not passed on).
    • Capability to set and change Jet options without making registry changes (works in DAO through DBEngine.GetOption and DBEngine.SetOption, fails in ADO, which has no such analogue).
    • Allowing the creation/change/deletion of any and all properties through the JPM -- also known as the Jet Property Manager (works in DAO through CreateProperty/Properties.Append, fails in ADO/ADOx/JRO for almost all properties since there is no hookup of the JPM to ADO).
    • Forcing the locking mode of a database when working from within Access (works in DAO through the DAO.LockTypeEnum constants while using CurrentDb, fails through the ADO.LockTypeEnum constants while using CurrentProject.Connection).
    • Retrieving implicit permissions on an object (works in DAO through the AllPermissions properties, fails in ADO which has no AllPermissions property and requires you to separately enumerate the user and all of their groups).
    • Allowing a separate Jet session to run using a special object in the object model (works in DAO through the PrivDBEngine object, fails in ADO due to no analogous object).  

     

    This post sponsored by Ｄ (U+ff24, a.k.a. FULLWIDTH LATIN CAPITAL LETTER D)

    Losing a title I'm fond of?

    (Nothing technical, other than some blathering about a technical lead technicality) 

    From a very young age, I have been a fan of pipes.

    Though I admit it was mainly because my father would periodically quit smoking cigars to move back to pipes and then eventually quit smoking pipes to move back to cigars, and back again....

    And the pipes smelled significantly better than the cigars he was smoking! :-)

    But this most recent iteration of pipes -- the "pipe" model of management that things have re-organized into, I am not as big of a fan of.

    Mainly because it means I will probably lose my Technical Lead title soon.

    I sort of described the role here for those who haven't been around for years. The thing I have not really mentioned is why I care.

    It is not because losing the title would be a demotion; it wouldn't. Just like getting the title was not a promotion.

    It was just a recognition that what I was doing was not just development but also other work that spanned disciplines....

    It is fun to point out how most devs think of it as program management while most PMs think of it as development, but it does outline a recognition that some of what I do is outside of what they do, so no one has to feel like I am stepping on toes or working outside of my job.

    But once I lose the title, I'll be back to feeling like all of those things that I do that are not thought of as traditional development tasks are somehow not a core part of my job. The title just seemed like symbolic recognition that I was doing the job everybody wanted me doing, so fitting into the pipe will seem like a bit of a symbolic backslide.

    So the job won't change because of it, my commitments will read the same way, and I won't hesitate to do what I believe is the right thing.

    It just seems like it will be a little bit more of a struggle, and a little bit less of a feeling that my job is understood (or at least accepted).

    Anyway, I am probably worried for nothing, to the extent that one could actually say I am worried. Which is not much. Just something I'm not especially looking forward to....

    I have to wonder how I'd feel if some group offered me a job that wouldn't involve a title change, whether I'd take it. Does it bother me that much? I can honestly say I'm really not sure. Though it may be silly to read into that too deeply, since I also wonder whether I'd seriously consider a job offer from Apple given how they stock Limonata in the cafeteria!

    By the way, I did try to smoke the actual pipe once. I decided the smell was not nearly as good when it wasn't second hand! :-)

     

    This post brought to you by ǀ (U+01c0, a.k.a. LATIN LETTER DENTAL CLICK, a.k.a. LATIN LETTER PIPE)

    Why can't everyone just speak German?

    Remember when I pointed out how One day, your huddled masses, yearning to breathe free, might have to speak English?

    Well, as Raymond points out in this post, Germany looks to be even further along in their process of smacking down on immigrants who aren't learning the native language....

    Of course the amazing part is the nature of the EU/non-EU distinction being made, especially with countries that are currently candidate countries, such as Turkey.

    So let me get this straight.

    If Sarkozy calms down and Turkey becomes a member, then suddenly Turkish is okay, but for now despite the huge immigrant population that was invited, they have no choice but to go learn German?

    Sigh....

     

    This post brought to you by U+206e (a.k.a. NATIONAL DIGIT SHAPES)

    They don't [font] associate with font linkers, among others

    Back in this post and this other one, I have been kind of hinting around at the functionality known as FONT ASSOCIATION.

    Now in the post you are reading, I am going to explain what it is.

    First we'll go with a dictionary definition. I'll go with the American Heritage definition since the book is close by:

    Association (n, ə-sō'sē-ā'shən) -- An organized body of people who have an interest, activity, or purpose in common; a society.

    One thing you notice about many such associations is a sense of exclusivity -- by joining the association you get privileges that others do not. But the members themselves are quite independent and have their own purposes and goals.

    Well, font association fits into this kind of definition quite well. :-)

    Basically, to join the association, you must

    1. be a font (duh!), and
    2. be used on a machine with a Chinese, Japanese, or Korean system locale, and
    3. have font association initialized (as discussed here)

    If you meet these guidelines, then GDI will have a general preference for the preferred font of the main script of the particular East Asian locale (e.g. SimSun for Simplified Chinese, Gulim for Korean and so on).

    And this association will happen even if one does not use a font in the GDI font link chain, which one cannot always rely on otherwise. If it is not turned on then people who are used to it will complain about the bug they hit, and sometimes if the characters are not in the font given to them by their "association" then they will get NULL glyphs rather than characters.

    And once again the LOGFONT lfCharSet member becomes the only way one can break out of the association for a little while (unless your font already supports the script, of course). Basically, one is better off setting lfCharSet to something that one is sure does not contain the characters; that is the only good escape here.

    Note also from that second post that Gulim has font linking behavior when Korean is the default system locale. But like all good associations, the Fellowship of the Font Linkers are not their partners, and the font link chain that exists for a given font is not respected if the font is chosen through association....

    That is font association -- a legacy technology that leads to a specific popular behavior that customers in East Asia may be relying on now -- so we are kind of stuck with it. Even if it can often be lousy....

    I wish we weren't, and that I could say "I don't associate with those kinds of fonts anymore."

     

    This post brought to you by 덯 (U+b36f, a.k.a. HANGUL SYLLABLE TIKEUT EO HIEUH)

    Wait til you see my 'O'[EMCP based technology]

    (no, this post is not about a rap or hip hop song, or its lyrics, though I admit the title may have been inspired by one, just like last time)

    Looking at Larry Osterman's post yesterday entitled How do I compare two different NetBIOS names?....

    (A nice brief history of a feature with piss-poor international support that even to this day resists all efforts to improve, by the way!)

    The actual question that prompted the "simplified" question that Larry covered was interesting on its own and I wanted to talk about it a bit.

    There was a need to compare two computer names, but one of them was UTF-16 and the other (like all NetBIOS names) is in CP_OEMCP. And the question was how to do the comparison....

    Obviously there are two possible ways:

    1. Convert the UTF-16 name to CP_OEMCP and compare them;
    2. Convert the CP_OEMCP name to UTF-16 and compare them.

    In both comparisons, one is wanting to use that whole "uppercase+binary" kind of ignore case comparison that we all know and love.

    Given the complications in doing the case insensitive binary comparison in an arbitrary code page, it is much better to go with choice #2 (where there is just the one case table to deal with and there is a handy CompareStringOrdinal function to do the actual comparison with).
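    As a rough sketch of choice #2 (not the actual Win32 code), here is the same idea in Python: decode the OEM bytes to Unicode first, then do the "uppercase + binary" comparison on the Unicode side. The function name and the cp437 default are assumptions of mine for illustration, and Python's str.upper() is only a stand-in for the OS uppercase table that CompareStringOrdinal would use; the two are not guaranteed to agree on every character.

```python
def names_equal(oem_name: bytes, unicode_name: str, oem_codepage: str = "cp437") -> bool:
    """Compare an OEM code page name with a Unicode name, ignoring case."""
    # Convert the OEM bytes to Unicode so only one case table is involved.
    decoded = oem_name.decode(oem_codepage)
    # Uppercase both sides, then compare code point by code point.
    # str.upper() approximates, but is not identical to, the Windows table.
    return decoded.upper() == unicode_name.upper()


if __name__ == "__main__":
    print(names_equal(b"MYSERVER", "myserver"))
    print(names_equal("CAFÉ".encode("cp437"), "café"))
    print(names_equal(b"HOST1", "host2"))
```

    Going the other way (choice #1) would mean uppercasing inside whatever the OEM code page happens to be, and would also lose any character in the UTF-16 name that the code page cannot represent.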

    In this particular case they were in a position to instead consider calling RtlEqualUnicodeString directly rather than CompareStringOrdinal or even RtlCompareUnicodeString, which has the bonus of being even faster (though it would be hard to call it enough times to notice the difference, making the performance issue most likely theoretical). The function also has the benefit of doing exactly what they are looking for (this is the whole issue I talk about in Is RtlCompareUnicodeString used correctly?, where the answer is that by and large, it isn't!).

    Of course fixing the NetBIOS/computer name story to do better than an OEMCP world would be even better, but that seems a lot less likely. :-(

     

    This post brought to you by ỗ (U+1ed7, a.k.a. LATIN SMALL LETTER O WITH CIRCUMFLEX AND TILDE)

    GB18030 isn't an ACP, either

    The question went something like this:

    I'm trying to display GB18030 text (say unicode 0x3400 character) using DrawTextA and WideCharToMultiByte. I am using the code page for GB18030 which is 54936.

    Why doesn't this work?  Originally, I thought it had to do with font linking.
     
    Thanks for the help.

    You can see what is going on here -- the general assumption that the non-Unicode Win32 API will handle any/every code page that isn't Unicode. Which we know isn't true from the many times UTF-8 support in "A" functions has been discussed (if you look at Raymond's recent post that points out so many of the times I have talked about it, the subject has come up way too many times!).

    Once I pointed out UTF-8 and GB18030 at the same time (in UTF-8 and GB18030 are both 'NT' code pages, they just aren't 'ANSI' code pages).

    Now GDI is fundamentally a Unicode thing internally and was even back in Windows 95, mainly because most of the plumbing is Unicode anyway.

    The issue of which code page is used is not a simple answer like CP_ACP, as I pointed out in What code page does MSLU convert with?. The MS Layer for Unicode was designed to map to what the OS does in so many cases, including the GDI ones that were kind of based on the charset of a device context.

    But all of the underlying code pages that the charset values map to are ACPs, and GB18030 cannot be an ACP, for much the same reasons that UTF-8 cannot.

    Obviously, the quick answer is to use DrawTextW with the original Unicode text that isn't converted at all, rather than converting it and not being able to display all of the data that DrawTextA won't recognize....
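    For the curious, a small Python sketch shows what the character from the question looks like in each encoding. It relies only on Python's built-in gb18030 codec, not on any Windows API, so treat it as an illustration of the byte counts rather than of what GDI does. As far as I know, the ANSI code pages that the charset values map to use at most two bytes per character, which is the kind of assumption a four-byte sequence breaks.

```python
# U+3400 is the character from the question (first ideograph in CJK Extension A).
text = "\u3400"

# Encode with Python's built-in codecs; no Windows code page API is involved.
gb_bytes = text.encode("gb18030")
utf16_bytes = text.encode("utf-16-le")

print("GB18030 bytes:", gb_bytes.hex(), "length", len(gb_bytes))
print("UTF-16 bytes: ", utf16_bytes.hex(), "length", len(utf16_bytes))

# The conversion itself round-trips; the trouble is handing the result to an "A" function.
round_tripped = gb_bytes.decode("gb18030")
print("Round trip OK:", round_tripped == text)
```

    So WideCharToMultiByte with 54936 is doing its job; it is the step after it, passing those bytes to DrawTextA, that has no way to work.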

     

    This post brought to you by 㐀 (U+3400, the first CJK ideograph in CJK Extension A)

    Not so Lao[d], at least not until Vista

    The other day I got a mail via the contact link:

    Mr. Kaplan,

    My name is Anousak Souphavanh, a Lao software developer with the Science and Technology Department. I am recently trying to develop Lao database using MS SQL but understood that Lao is missing from the MS SQL support list of languages. I really like to add Lao UNICODE but needs your help and support. I read your presentation titled 'Unicode and Collation Support in Microsoft SQL Server' which held in Prague on 23-26 March, 2003. There seems to a doable for Lao language but I need to understand what actually that I need to implement. Please guide me in the right direction, ie. docs, urls to docs and info, and etc.

    I appreciated in advance for your help.

    Well, the proper locale and collation support for Lao was not done until Vista, and has not yet been done in SQL Server. But both pre-Vista Windows and every version of SQL Server that supports Unicode (7.0 to 9.0) gave some weight to Lao so the sort won't be right but it will at least be something.

    Now another option one could consider is a binary collation, which will also assure that characters are findable.

    The third option: if one does have Vista and does not want to wait for an unknown future version of SQL Server, the solution given in the Extending collation support in SQL Server and Jet series can be used to generate sort key values, and then you can get the right collation behavior, too!

    Finally, there is no font that covers Lao really well on the Windows platform itself until Vista (the font provided in Vista for Lao is DokChampa) but hopefully Anousak already has a font handy.... :-)

     

    This post brought to you by ລ (U+0ea5, a.k.a. LAO LETTER LO LOOT)

    Arial Unicode MS effectively [bites|sucks|blows]

    MVP Omi Azad likes to send people from Microsoft email when he runs into bugs.

    Usually they are our bugs, so it all works out.

    Though this last mail was a bit different....

    First he sent a screenshot of some Bengali text:

    The bits in green were the problem. His words:

    Don't go through the content of the image I attached. :-)

    Just have a look at the green marked characters. They are not same as the other one. Green ones are not bottom aligned with the other ones, also not top aligned.

    Is that a font conflict? I have more than one font installed in the system. This doesn't happen with IE7 but it happens with Firefox.

    If face is not declared in the html, one screen should bring all the characters from the same default font. So why they are not rendering correctly in Vista? This doesn't happen on XP/2000.

    But same problem happens with most of the Non-MS products. That is why I'm concern about it.

    Just now I removed Arial Unicode MS font and things become perfect once again. What's is wrong?

    Do you have any idea?

    Well, I do have some ideas.

    First, uninstall Arial Unicode MS!

    I mean, really.

    There are way too many applications that are using it as a default font, since they figure that will get them coverage of things.

    And that is just a really bad idea, given that the coverage of this font in terms of both characters and features will seldom be identical to the OS since it is not an OS font (it's an Office font).

    Even Omi noticed that removing Arial Unicode MS made the problem go away in those various non-MS applications.

    When you get down to it, you need to try and trust what Uniscribe is doing rather than trying to start with a font that may not be able to get it all done....

    As a last resort font, it is okay. But as a first resort font? It bites. It sucks. It blows. For a whole lot of other reasons in addition to this one that ISVs are forcing on everyone....

    And if you are an ISV, please consider not using it. Your users may really thank you for it!

     

    This post brought to you by 𒁁 (U+12041, a.k.a. CUNEIFORM SIGN BAD)

    Avoiding an international mailto maelstrom

    The mail I got the other day from Wes Miller (yes, that Wes Miller!) forwarding someone else's question:

    Hi all,

    We encountered a problem when localizing the subject of mailto hyper link. The sample html is below:

    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Contact Us</title>
    </head>
    <body>
    <a href="mailto:someone@domain.com?subject=最終更新">Contact Us</a>
    </body>
    </html>

    When user clicks the link, the subject in outlook can not be rendered correctly. How can we resolve/workaround this problem? Your comment will be very much appreciated.

    Thanks in advance.

    (For people who want to test the results in their own browser/mail client, I'll reproduce the link here: Contact Us)

    As regular readers can probably guess (especially if they remember this post!), the text (Japanese for something like "Final Renewal" I think?), looks like: 最終更新

    So the browser was perhaps being smart enough to take the charset info of the page and use it but then the mail client was not being smart enough to see it as UTF-8.

    It can be tempting to look to the new IE7 "International" setting about using UTF-8 for mailto links:

    (more on this setting in the IEBlog post International Mailto URIs in IE7 for those who are curious)

    But I don't think that setting will always solve the problem here since, as the post indicates:

    First, you must have a mail program that correctly handles these mailto URIs, such as Outlook 2007 or Mozilla Thunderbird. Second, you must ensure that the ‘Use UTF-8 for mailto links’ checkbox on the ‘Advanced’ tab of your Internet Options is checked.

    So you have to be running a browser that does the right thing, along with the setting in the browser configured properly, plus a mail client that can read it.

    Though to be honest I am seeing the right results in IE6 too even without that setting (though I still get the wrong results in some mail clients!).

    Since the text in question was UTF-8 being read as if it were not, it appears that the first two conditions were being met (or the browser was doing the right thing anyway), and that the problem was that the mail client did not understand what to do with the results it was being asked to make use of.
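    Here is a hedged little Python sketch of both halves of that story, using only the standard library. The first part builds the link with the subject percent-encoded as UTF-8, which keeps the URI pure ASCII; whether a given mail client then decodes it as UTF-8 is exactly the open question above. The second part simulates a client reading the UTF-8 bytes as if they were Windows-1252, which is my assumption about the kind of misreading involved, not something I verified for any specific client.

```python
from urllib.parse import quote, unquote

subject = "最終更新"

# Percent-encode the UTF-8 bytes of the subject so the link stays ASCII-only.
encoded_subject = quote(subject, safe="")
link = "mailto:someone@domain.com?subject=" + encoded_subject
print(link)

# A client that understands UTF-8 gets the original text back.
decoded_ok = unquote(encoded_subject, encoding="utf-8")
print(decoded_ok == subject)

# A client that treats the same bytes as Windows-1252 shows garbage instead.
mojibake = subject.encode("utf-8").decode("cp1252")
print(mojibake)

# The damage is reversible as long as nothing was dropped along the way.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == subject)
```

    Note that the garbled form has twelve characters where the original had four, which is usually the quickest visual clue that UTF-8 is being read as a single-byte code page.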

    Well, I guess the article will help if it encourages you to upgrade your mail client to a program that understands UTF-8, so you can avoid the international mailto maelstrom!

    And it is very nice to see products working to better follow standards, too....

     

    This post brought to you by ཙ (U+0f59, a.k.a. TIBETAN LETTER TSA)

    She typed in 'God damn clippy'

    I had a friend complain to me the other day (the way that all folks who have friends working at Microsoft tend to do) about Clippy and how to turn him off in Office 2003.

    Now I have mentioned before that Clippy is off in the default install and has been for a few versions now.

    But I figure if even Charles Simonyi can be confused by it then I suppose anyone can. :-)

    So I remembered an old trick someone had mentioned to me and asked my friend "Have you tried being rude to him?"

    "What do you mean?" she asked me. "How can you be rude to a talking paper clip?"

    "Well," I suggested, "try venting your anger at him. Tell him in a few concise words how you feel about him."

    Many of you may know this trick. I was being a bit vague intentionally, but she actually came up with the same language for her complaint that many others have:

    After telling Clippy this, the first item on the list explains how to change the Office Assistant, and the second item explains how to hide or show it.

    Now this is obviously not the only way to find the message, but I find three different language issues amusing here:

    • An amazing number of people use this exact phrase;
    • There are reportedly many other expressions of negative Clippy feelings that will have the same effect on search in help;
    • There are disadvantages to a formal education that make this method of finding a solution less obvious.

    I wonder how sophisticated the "unhappy user" detection is here in language. And whether it has been appropriately localized.

    Any other language versions of Office users have any data they'd like to share? I'd love to know if Clippy can detect people being unhappy with him beyond his English users....

     

    This post brought to you by ⁂ (U+2042, a.k.a. ASTERISM)

    No UTF-8 in a VARCHAR column

    Francisco Moraes asks in the Suggestion Box:

    Mike,

    Is it possible on MS SQL Server to have columns defined as CHAR (or VARCHAR) and actually store the data in UTF-8 without corruption due to code page convertions?

    I know there is the NCHAR/NVARCHAR type but we'd like to avoid the 2-bytes per character used to store them.

    Francisco

    This is actually quite a common request but the answer is not one that will make Francisco happy -- there isn't such a way.

    UTF-16 via the "N" data types is really the only way things will work here.

    I will forward the feedback on to folks on the SQL Server team (or they might just read about the request here since I think some of them are regulars now!), though it is worth pointing some things out:

    First of all, for pure ASCII text, UTF-8 will always be smaller, but for any other text in Unicode it will be the same size or bigger, which really hurts the size argument.

    I discuss this point a bit in You may want to rethink your choice of UTF, #1 (if the size matters), and I gave the distribution:

     U+0000 -   U+007f        1 byte        (128 code values)
     U+0080 -   U+07ff        2 bytes       (1,920 code values)
     U+0800 -   U+ffff        3 bytes       (63,488 code values)
    U+10000 - U+10ffff        4 bytes       (1,048,576 code values)

    The above is imperfect since it includes unassigned code values and it also includes high/low surrogates in the three byte group when they are not legal there, but you get the point that UTF-8 is not much of a space saver unless you stay in ASCII, and if you do stay in ASCII then any old code page would work....
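    To see the distribution above in action, here is a quick Python sketch that measures the same strings in UTF-8 and UTF-16. The sample strings and the helper name are my own picks for illustration; the byte counts come straight from Python's built-in codecs and say nothing about how SQL Server itself stores anything.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return (len(text.encode("utf-8")), len(text.encode("utf-16-le")))


# Sample strings chosen to land in each row of the distribution.
samples = {
    "ASCII": "hello",
    "Greek": "\u03b3\u03b5\u03b9\u03ac",   # four letters from the 2-byte range
    "Japanese": "最終更新",                  # four characters from the 3-byte range
    "Cuneiform": "\U00012041",              # one supplementary character
}

if __name__ == "__main__":
    for label, text in samples.items():
        utf8_size, utf16_size = encoded_sizes(text)
        print(f"{label:10} UTF-8: {utf8_size:2} bytes   UTF-16: {utf16_size:2} bytes")
```

    The ASCII string is the only one where UTF-8 wins; the Greek and supplementary samples tie, and the Japanese sample is half again as large in UTF-8.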

    Second of all, even if there were a way to make UTF-8 text work, it would not be free: since the underlying engine and collation tables use UTF-16 for Unicode, there would be an associated performance hit due to the conversions. Now conversion is optimized as much as it can be, but it is still a non-zero cost.

    And finally, though we do not change code pages, there have regularly been updates to the UTF-8 conversion to conform to the latest Unicode Standard guidelines around conformant UTF-8, which means that if such a feature existed there would be worries about index corruption if faulty, non-conformant UTF-8 was being stored. This would have to be a real concern, enough to make sure that invalid data was never stored in the database (suggesting a validation pass, which means additional work that would have to happen).

     

    This post brought to you by ！ (U+ff01, a.k.a. FULLWIDTH EXCLAMATION MARK)

  • Sorting it all Out

    Your LCID sucks

    • 14 Comments

    (Apologies to Roger Ebert, whose book, which I parodied for the sake of this blog post, was wonderful, and not just for the great 0 star Deuce Bigalow: European Gigolo review!)

    As a by the way, this post does NOT represent anything beyond my own personal recommendations based on the way I think things are going. I am not even on the team that decides these things any more, and I wasn't in charge of the strategy when I was. Anyone who quotes me with prefacing words like "According to Microsoft..." is a complete and utter moron.

    Anyway, your LCID sucks.

    (This assumes you are still using LCIDs when you could be using locale names, of course!)

    How do I know that your LCID sucks?

    Easy.

    They all do!

    It started back in 2002, when Cathy and I each did one-hour presentations for a big group of Microsoft employees about standards and our plans related to them.

    Of the whole twenty-two-slide, two hours' worth of presentation, one slide I wrote probably got the most email afterward.

    It was entitled The Death of LCIDs.

    Point after point as to why these things are so awful. From the lack of scalability to the fact that custom locales wouldn't work to their proprietary nature and so on. Not to mention all those problems with Walking off the end of the eighth bit.

    And if you'll recall, other people hate them too -- like Bill Poser whose valid criticisms of these things I talked about earlier this month in I think they might kind of get it now.

    And then in 2004 I did a presentation to GIFT about locales, titled "Intro to Locales (and what's wrong with LCIDs)", that practically had more slides explaining why LCIDs suck than anything else.

    LCIDs are just beastly.

    Truly.

    It is time to move to names instead of LCIDs. Because going forward who knows how many new features will only exist in the "name based" NLS API functions rather than the LCID based ones (just as has already happened with all of the v.next versioning stuff and custom locales and so on).

    Because locale names are based on RFCs that are based on international standards.

    Because SAIO would really prefer it that way.

    And the ninth bit too!

    I'll close with some reworded Smash Mouth lyrics:

    Somebody once told me Microsoft's gonna roll me
    I ain't the sharpest tool in the shed
    The code's looking kind of dumb with its finger and its thumb
    In the shape of an "L" on its forehead

    Well the numbers start coming but will they always keep coming
    Don't ignore me now and hit the ground running
    Doesn't make sense not to just use names
    C'mon you're smart so please don't be lame.

    So much to do so much to see
    So what's wrong with following the RFC
    You'll never know if you don't try
    Just tell those LCIDs goodbye.

    (Chorus)
    Hey now, names are All Stars got their game on, go play
    Hey now, LCIDs suck so, they're so '90s, they'll fade.
    And all that glitters is gold
    Only locale names break the mold....

    Ok, I can't do any more of this. You get the point. :-)

    But truly, you should start moving toward using locale names instead of LCIDs. It's time to stop associating with those shady sparsely allocated DWORD kinds of types....
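    For anyone who has never looked inside one of those DWORDs, here is a small sketch of how an LCID breaks apart: a 10-bit primary language, a 6-bit sublanguage, a 4-bit sort ID, and reserved bits on top. The helper name and the tiny name table are hand-written for illustration only; on Windows the real conversions are done by functions like LCIDToLocaleName and LocaleNameToLCID.

```python
def split_lcid(lcid):
    """Break a 32-bit LCID into its packed fields."""
    lang_id = lcid & 0xFFFF              # low word: the language identifier
    return {
        "primary_language": lang_id & 0x3FF,  # low 10 bits of the LANGID
        "sublanguage": lang_id >> 10,         # high 6 bits of the LANGID
        "sort_id": (lcid >> 16) & 0xF,        # 4 bits above the LANGID
        "reserved": lcid >> 20,               # top 12 bits
    }


# A few well-known pairings, typed in by hand for illustration.
ILLUSTRATIVE_NAMES = {
    0x0409: "en-US",
    0x0407: "de-DE",
    0x0411: "ja-JP",
}

for lcid, name in ILLUSTRATIVE_NAMES.items():
    print(hex(lcid), name, split_lcid(lcid))

# German with the phone book sort: same LANGID as de-DE, sort ID 1.
print(split_lcid(0x00010407))
```

    Nothing in the number tells you what language you are looking at without a lookup table, while the name carries its meaning with it.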

     

    This post brought to you by L (U+004c, a.k.a. LATIN CAPITAL LETTER L)

  • Sorting it all Out

    Character Map Plus?

    • 15 Comments

    The other day when I wrote We've got a style of glyphs, yes we do; we've got a style of glyphs, how 'bout you?, regular reader Mihai commented:

    ...Character Map is not consistent.

    Select "Angsana New" (or Arial, or "Lucida Sans" or whatever) and using "Character Set: Unicode" you will notice that only the glyphs that exist in the font are shown.

    So I guess the expectancy is that Character Map does not do any font fallback/linking/substitution.

    This is an excellent point; Character Map is really a tool that is built for the display of fonts, not for the display of other font technologies.

    Though as I think about it, wouldn't a Character Map Plus tool that did all of the fallback/linking/substitution and showed you what you could expect to actually GET if you asked for a given font be a really cool idea?

    Or maybe it could be an additional checkbox that would expand the view to this much wider one.

    I imagine that I would use that one more often than I use the one that is there, to tell the truth.
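    To make the difference between the two views concrete, here is a toy sketch. Everything in it is hypothetical: the font names, the coverage sets, and the fallback order are invented, and real fallback/linking/substitution on Windows is a good deal more involved than a simple ordered list.

```python
# Hypothetical coverage data: which code points each made-up font has glyphs for.
COVERAGE = {
    "RequestedFont": {0x0041, 0x0042},
    "FallbackThai": {0x0E01, 0x0E02},
    "FallbackCJK": {0x4E00},
}


def glyphs_in_font(font, code_points):
    """The current Character Map view: only what the font itself contains."""
    return [cp for cp in code_points if cp in COVERAGE[font]]


def resolve_font(code_point, requested, fallback_chain):
    """The 'Plus' view: which font would actually supply the glyph, if any."""
    for font in [requested, *fallback_chain]:
        if code_point in COVERAGE.get(font, set()):
            return font
    return None


wanted = [0x0041, 0x0E01, 0x4E00, 0x0AA2]
chain = ["FallbackThai", "FallbackCJK"]

print(glyphs_in_font("RequestedFont", wanted))
for cp in wanted:
    print(f"U+{cp:04X}", resolve_font(cp, "RequestedFont", chain))
```

    The first view shows one character out of the four; the second shows where three of them would really come from and flags the one that nothing covers.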

    Anyone else think this would be worth thinking about? :-)

     

    This post brought to you by ઢ (U+0aa2, a.k.a. GUJARATI LETTER DDHA)
