September, 2008

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    How to miss the point through translation (and on italicizing, or not)

    It was a little more than two years ago, in When the font is the boss of you, that I talked about how the ClearType font Meiryo took on the battle between people who try to italicize Japanese text and people who think that this makes Japanese really really ugly, and took steps to make it no longer a problem.

    Through the relatively simple trick of providing an italic version of the font which does not slant the Kana or Kanji in it (in a .TTC so that glyphs could be shared with the non-italicized version), good Japanese typography could be safeguarded, and protected from those who seek to tilt everything over a few degrees even if this is generally not considered to be a good idea.

    But not everyone feels the same way about this feature.

    First of all, there are various testers who throughout the Vista and Office 12 betas noticed that selecting text in Word and hitting CTRL+I did not seem to italicize everything.

    Second, there were some external folks on the betas who reported the same problem.

    Generally, the most interesting trait shared by the people reporting this as a bug was that they were not Japanese, so for them it was more about reporting the difference in behavior than anything else.

    The remaining category of reports, however, comes from actual Japanese users, who tend to have the self-discipline to not italicize indiscriminately but who have decided that for some specific purposes italics can be just dandy for Japanese.

    Thus as reported in blogs like メイリオ (Meiryo) の斜体, this is not necessarily the most appreciated of all possible designs.

    And then in later sites like メイリオフォントで斜体を表示 (hat tip to Slashcolon /: 何かとエラーばかり起こしているプログラム屋の日常) the directions for how to get one's slanted Japanese back while using this new font have been provided.

    I can't really condone or even recommend the process, as it does involve breaking the font apart; in fact, if you really prefer italics, even the above article points out that MS Gothic is perhaps a better choice.

    Though it does close on a more philosophical note on the nature of Italicization:

    補足:斜体とイタリック体の違い

    上でイタリック(斜体)という表記がありますが、細かくいうと斜体には2つの書体があり、文字を単純に斜めにしたものをオブリーク体と、筆記体を元にした右に傾いた文字をイタリック体と呼びます。イタリック体 - Wikipediaあたりに2つの書体が載っているので、これを見てみるとわかりやすいかもしれません。

    イタリック体はアルファベットの筆記体を元にしている以上、当然日本語にイタリック体は存在しません。そのため、日本語のフォントでイタリック体を表示する必要があるときは、オブリーク体を表示することで対応していました。そんなことがあるため、イタリックと斜体はよく混同されますが、正確には間違いだったりします。まあ、ウェブサイトとホームページを混同したり、似たような例はたくさんあるので、気にしても疲れるだけなのかなあ、と感じます。

    メイリオでひらがななどの斜体、正確にいえばイタリック体を表示しないのは、日本語の書体にイタリック体がないからだそうです。斜体を表示できないようにするのなら、Internet Explorerでウェブページを表示する際に、強調のために<em>と</em>で囲った部分(普通、イタリックで表示されます)の日本語の表示を何とかしろよと、少し思ってしまいます。

    Of course you have to know Japanese to understand what Nakayama-san is saying here, or at least look at an online translation site. :-)

    Comparing the translations from Google, Yahoo (SysTran), and WorldLingo was also kind of amusing. Google's in particular, which takes "上でイタリック(斜体)" as "On italics (italics)" and so leaves out an obviously important distinction that was really the whole point of the text. Not that either Yahoo or WorldLingo, which both claim -- "At on italic (non-commutative field)" -- provides much more insight here for non-Japanese folk, which made me sad to see the whole point of the conclusion kind of disappear via the process of translation. :-(

    It actually kind of reminded me of being handed the same problem given to Slovenians in the previously discussed Does bear shit in woods^H^H^H^H^HSlovenia?, only this time I could directly feel the frustration of seeing two words, believing they have different meaning, and not necessarily knowing what the two meanings are to make them different!


    This blog brought to you by 上 (U+4e0a, aka a CJK Unified Ideograph)

  • Sorting it all Out

    Sorting the DPRK all Out

    This blog post is not about trying to "sort out" the political issues in the DPRK -- just the sort, ma'am!

    In true SiaO fashion, this blog is more information about something Microsoft (and most companies) are not really able to support!

    Probably the first time I blogged about North Korean was back in March of 2006, in Traditional versus modern sorts. I contrasted the way that Hangul is collated when you compare the DPRK (Democratic People's Republic of Korea, aka North Korea) and ROK (Republic of Korea, aka South Korea).

    I principally talked about how the biggest difference was that the "SSANG'ed" (doubled) Jamo were placed at the end rather than after the single Jamo that they were the double of.

    And yes, the preposition at the end of the previous sentence is intentional, and something I have decided to be proud of! :-)

    I did always intend to come back to the topic, but I have been busy.

    I was only recently reminded about it again after Richard Ishida's tutorial at the recent IUC, when I had the practically once-in-a-lifetime opportunity to know something about language/script that he did not -- the North Korean/South Korean collation difference! :-)

    Many sources talk about the issue, though perhaps the clearest is in Chapter 9 (Information Processing Techniques) of Ken Lunde's CJKV Information Processing (the red emphasis added by me):

    An example that illustrates different sorting requirements for the same writing system is Korean hangul. North and South Korea (DPRK and ROK, respectively), although they use the same set of jamo for constructing hangul, sort them differently. Table 9-17 illustrates the sequence in which jamo are sorted in the two Korean locales, subcategorized by the position in which they appear in hangul: initial (consonants), medial (vowels), and final (consonants).

    Table 9-17 Korean Jamo Sorting Sequences

    Initial  DPRK  ᄀᄂᄃᄅᄆᄇᄉᄌᄎᄏᄐᄑᄒᄁᄄᄈ
             ROK   ᄀᄁᄂᄃᄄᄅᄆᄇᄈᄉᄊᄋᄌᄍᄎᄏᄐᄑᄒ
    Medial   DPRK  ᅡᅣᅥᅧᅩᅭᅮᅲᅳᅵᅢᅤᅦᅨᅬᅱᅴᅪᅯᅫᅰ
             ROK   ᅡᅢᅣᅤᅥᅦᅧᅨᅩᅪᅫᅬᅭᅮᅯᅰᅱᅲᅳᅴᅵ
    Final    DPRK  ᆨᆪᆫᆬᆭᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆼᆽᆾᆿᇀᇁᇂᆩᆻ
             ROK   ᆨᆩᆪᆫᆬᆭᆯᆰᆱᆲᆳᆴᆵᆶᆷᆸᆹᆺᆻᆼᆽᆾᆿᇀᇁᇂ

    In general, North Korean sorts double consonants after all other consonants. The vowels, in medial positions, are also sorted differently.

    Let's ignore the vowels for a moment, I'll talk about those another time (I have different linguistic theories to draw in for them!).

    Should we call them Chosŏn'gŭl instead of Hangul since we're talking about North Korean? We can't change the character names to use the more neutral term urigeul, though that would have probably been a good idea, in retrospect. :-)

    One could wonder whether the repositioning of a small number of Jamo could really make such a difference.

    But remember, this relatively small number of Jamo are the component pieces of 11172 Hangul syllables (19 Choseong x 21 Jungseong x 28 Jongseong possibilities, counting "no Jongseong" as one of the 28).

    If you take the first 28 syllables in the block (you'll see why I chose 28 in a second):

    Hangul USV  Choseong USV  Jungseong USV  Jongseong USV  Name
    0xac00 1100 1161       Hangul syllable Kiyeok A
    0xac01 1100 1161 11a8  Hangul syllable Kiyeok A Kiyeok
    0xac02 1100 1161 11a9  Hangul syllable Kiyeok A Ssangkiyeok 
    0xac03 1100 1161 11aa  Hangul syllable Kiyeok A Kiyeoksios
    0xac04 1100 1161 11ab  Hangul syllable Kiyeok A Nieun
    0xac05 1100 1161 11ac  Hangul syllable Kiyeok A Nieuncieuc
    0xac06 1100 1161 11ad  Hangul syllable Kiyeok A Nieunhieuh
    0xac07 1100 1161 11ae  Hangul syllable Kiyeok A Tikeut
    0xac08 1100 1161 11af  Hangul syllable Kiyeok A Rieul
    0xac09 1100 1161 11b0  Hangul syllable Kiyeok A Rieulkiyeok
    0xac0a 1100 1161 11b1  Hangul syllable Kiyeok A Rieulmieum
    0xac0b 1100 1161 11b2  Hangul syllable Kiyeok A Rieulpieup
    0xac0c 1100 1161 11b3  Hangul syllable Kiyeok A Rieulsios
    0xac0d 1100 1161 11b4  Hangul syllable Kiyeok A Rieulthieuth
    0xac0e 1100 1161 11b5  Hangul syllable Kiyeok A Rieulphieuph 
    0xac0f 1100 1161 11b6  Hangul syllable Kiyeok A Rieulhieuh
    0xac10 1100 1161 11b7  Hangul syllable Kiyeok A Mieum
    0xac11 1100 1161 11b8  Hangul syllable Kiyeok A Pieup
    0xac12 1100 1161 11b9  Hangul syllable Kiyeok A Pieupsios
    0xac13 1100 1161 11ba  Hangul syllable Kiyeok A Sios
    0xac14 1100 1161 11bb  Hangul syllable Kiyeok A Ssangsios
    0xac15 1100 1161 11bc  Hangul syllable Kiyeok A Ieung
    0xac16 1100 1161 11bd  Hangul syllable Kiyeok A Cieuc
    0xac17 1100 1161 11be  Hangul syllable Kiyeok A Chieuch
    0xac18 1100 1161 11bf  Hangul syllable Kiyeok A Khieukh
    0xac19 1100 1161 11c0  Hangul syllable Kiyeok A Thieuth
    0xac1a 1100 1161 11c1  Hangul syllable Kiyeok A Phieuph
    0xac1b 1100 1161 11c2  Hangul syllable Kiyeok A Hieuh

    Aha. So in every block of 28, two of them will be in a different order (corresponding to the two doubled Jongseong). Add to which for each of the five doubled Choseong (six if you count the initial IEUNG at U+110b) there are entire additional blocks of 28 that have to be reordered, and before you know it a huge chunk of the 11172 will end up somewhere different. Then when you add the vowels you will again have large blocks that would be repositioned (like I said I'll get more into the vowels another time). In the end, large chunks will be moved, and clearly the ROK sort will look quite wrong to someone expecting the DPRK sort....
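    To put a rough number on "a huge chunk", here is a minimal Python sketch (not anything Microsoft or the CLDR ships) that uses the standard arithmetic layout of the precomposed Hangul syllable block to split a syllable into its Choseong/Jungseong/Jongseong indices and then counts the syllables containing a doubled consonant. The names decompose and has_ssang are made up for the illustration, and the count ignores the IEUNG and vowel differences entirely, so treat it as a lower bound on what moves rather than an exact figure.

```python
# Sketch: arithmetic decomposition of precomposed Hangul syllables into jamo
# indices, used to count syllables that contain a doubled (SSANG) consonant.

S_BASE = 0xAC00                          # first precomposed syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # Choseong, Jungseong, Jongseong (+ "none")
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # 11172

# Indices of the doubled consonants in the standard Unicode jamo order.
SSANG_INITIAL = {1, 4, 8, 10, 13}   # U+1101, U+1104, U+1108, U+110A, U+110D
SSANG_FINAL = {2, 20}               # U+11A9, U+11BB


def decompose(syllable):
    """Return (initial, medial, final) indices; final 0 means no Jongseong."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return l_index, v_index, t_index


def has_ssang(syllable):
    """True if the syllable has a doubled initial or a doubled final consonant."""
    l_index, _, t_index = decompose(syllable)
    return l_index in SSANG_INITIAL or t_index in SSANG_FINAL


affected = sum(has_ssang(chr(S_BASE + i)) for i in range(S_COUNT))
print(S_COUNT, affected)   # 11172 3528
```

    So just under a third of the block contains one of the doubled consonants before the vowels are even considered.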

    As I said way back in Traditional versus modern sorts, both orderings have a kind of a linguistic basis.

    Though in my opinion the fact that the 11 non-SSANG doubled Jongseong (and the Jongseong IEUNG unlike the Choseong one) are not ordered differently might hurt the argument a little bit -- there is a clear interest in keeping one type of doubled Jamo interspersed and not the other.

    Another regular reader told me that the CLDR did not include either a locale or a UCA tailoring for North Korean, which might be due to the same reasons that Microsoft doesn't have one (DPRK is not a Wassenaar Arrangement member).

    I do wonder what happens in North Korea (which presumably has pirated copies of lots of software) or for expatriate North Koreans -- is their preferred collation being shaken out of them due to all of these other matters?

    I guess we're in politics again. :-)

     

    This blog brought to you by 까 (U+ae4c, aka HANGUL SYLLABLE SSANGKIYEOK A)

  • Sorting it all Out

    UCS-2 to UTF-16, Part 3: It starts with cursor movement (where MS simultaneously gets better and worse)

    Previous blogs in this series of blogs on this Blog:

    Okay, so far we have introduced the topic, pointed out that 9/10ths of what a person was going to run off and do is probably too much, and then jumped in to define the things that are sequences of storage characters that meet the definition of what a user calls a character.

    So what are the things you can do with them, if you are armed with this knowledge?

    Well, since we are focusing on "user" characters, we'll start there -- with users moving through a text stream. You know, using the arrow keys to move either forward or backward through the text, and watching the cursor as they go.

    The ideal behavior that the user expects without thinking about it is not too complicated: if they think of something as a single character, then they do not expect it to take multiple arrow keypresses to move through it.

    In other words, they want the computer to understand the text in the same way that they do.

    Now although that is simple conceptually, it is not always supportable by software today -- especially when one considers sort elements, where there is no easy function to call that finds those boundaries. The underlying data exists in collation algorithms (for example Microsoft's and the UCA's) and is used in order to define the sorting behavior of those elements, but when they are independent letters like the Traditional Spanish ch or the Hungarian dzs, there generally isn't an easy way to query for the information.

    Now in this case there has not been such a method for as long as computers or even typewriters have been there, so it may be stretching the definition of expectation to assume that people would expect computers to understand the boundaries of a sort element when it comes to cursor movement. At best they would be pleasantly surprised if this happened, and at worst they would think of it as a bug.

    Finding out whether this is learned behavior or an intuitive expectation would make for fascinating study if the parameters for determining the truth could be defined. It makes me jealous of the ClearType folks when I think of the number of studies they do related to reading when I think about how large the budget is to commission such a study in Windows International.

    $0.

    So realistically we can put that third type of linguistic character aside for now.

    And possibly forever. :-)

    Looking at the other two types of linguistic characters -- surrogate pairs (aka supplementary characters) and grapheme clusters (aka text elements) -- generally users don't want or expect to require multiple keypresses to get through what they think of as a single character.

    This raises an interesting question for a developer performing an operation that is enumerating a string.

    Should the developer ever care about the answer on the length of a string or substring when they are scrolling through a character?

    I mean, take the word 𐎀𐎇𐎖 (this is not a word so much as a stream of Ugaritic letters).

    𐎀𐎇𐎖

    Now this is three "characters":

    U+10380 U+10387 U+10396

    Using a modern browser like Firefox I have no problem seeing the string treated as three characters, despite the fact that under the covers it is actually:

    U+d800 U+df80 U+d800 U+df87 U+d800 U+df96

    And if you try to click in the middle of a letter you are never given the opportunity -- it always picks a side and puts you in one spot or the other.
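    For anyone who wants to check the arithmetic, here is a small Python sketch of how those three code points turn into six UTF-16 code units. The helper names (to_surrogate_pair, utf16_units) are just for this illustration; Python strings count code points, so the code encodes to UTF-16 to see what a UTF-16 based string type would actually be storing.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf16_units(text):
    """Return the UTF-16 code units that a UTF-16 based string would store for text."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


ugaritic = "\U00010380\U00010387\U00010396"
units = utf16_units(ugaritic)

print(len(ugaritic), len(units))                  # 3 6
print(" ".join(f"U+{u:04x}" for u in units))      # U+d800 U+df80 U+d800 U+df87 U+d800 U+df96
```

    The "length" a developer gets back depends entirely on which of those two counts the platform reports, which is exactly the question being asked above.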

    So clearly there are times that a developer might need to care about this fact, and therefore there should be a good way to provide this.

    Unfortunately, generally speaking, there isn't one inline with the text.

    But there are things like .NET's StringInfo class, which will help map the storage characters to the linguistic characters -- something I have talked about before.

    Though as I pointed out in Sometimes you need more than StringInfo, there are cases in between the second and third category that actually do have data somewhere.

    Thus in Assamese ম্পা is four Unicode code points (U+09ae U+09cd U+09aa U+09be), two text elements according to StringInfo, but we know from prior "Virama-esque" posts like this one that this is actually a conjunct. So as Sometimes you need more than StringInfo points out, there is a construct that the computer understands that is not being provided as easily to developers.
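    The gap can be sketched in a few lines of Python. This is not StringInfo and not Uniscribe: text_elements is a rough stand-in that groups a base character with the combining marks that follow it, and conjunct_clusters is a purely hypothetical refinement that glues an element ending in a virama (canonical combining class 9) onto a following letter. It ignores ZWJ/ZWNJ and every script-specific rule, so it only shows the shape of the missing construct.

```python
import unicodedata

VIRAMA_CLASS = 9  # canonical combining class used by viramas


def text_elements(text):
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in text:
        if elements and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


def conjunct_clusters(text):
    """Hypothetical refinement: join an element ending in a virama to a following letter."""
    clusters = []
    for element in text_elements(text):
        ends_in_virama = bool(clusters) and (
            unicodedata.combining(clusters[-1][-1]) == VIRAMA_CLASS
        )
        if ends_in_virama and unicodedata.category(element[0]).startswith("L"):
            clusters[-1] += element
        else:
            clusters.append(element)
    return clusters


sample = "\u09ae\u09cd\u09aa\u09be"   # the four code points from the text above
print(len(sample), len(text_elements(sample)), len(conjunct_clusters(sample)))   # 4 2 1
```

    Four code points, two mark-based elements, one conjunct: a cursor that stops on the middle boundary is stopping inside something the user sees as a single unit.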

    Now I am tempted to call this yet another category, and it really is a grapheme cluster that is not a text element.

    I think in the long run it would be better if Microsoft treated this as a limitation/bug in StringInfo and its definition of text element and either fixed it or added a new construct to handle this additional understanding of "characterness" that Uniscribe clearly understands even if StringInfo does not.

    In other words, Microsoft ought to provide the mechanisms that it actually does expose in easier ways here.

    Because no method should break up a conjunct, or put a cursor in the middle of one. But how is a developer supposed to support all that without a way to get at the data?

    Now in Sometimes you need more than StringInfo I actually asked if samples for this kind of data would be desirable and nobody responded, but I'll ask again to see if I have inspired interest. Any takers? :-)

     

    This post brought to you by ত (U+09a4, a.k.a. BENGALI LETTER TA)

  • Sorting it all Out

    Behind the Proposed Change to Tamil in Unicode (five different ways)

    So, I had that Behind the Proposed Change to Tamil in Unicode presentation:

    The encoding of Tamil within Unicode has been the subject of displeasure by the government of Tamil Nadu for as long as it has been there. It has led to a proposal (built up over the last decade) to try to change the way that Unicode looks at Tamil, and the very real questions of why this effort has been so persistent and what will eventually happen have not really been discussed overtly in all of this time. This presentation's goal is to talk about why the proposal exists, why it will ultimately fail, and why the language itself can survive that fact. The broader issues of the view of languages and the "rights" of language owners will also be discussed in this case study of a language that has been both wronged and righted as few others have in modern times.

    This is the presentation about which John Cowan, in a comment to this blog, asked:

     Are there slides for your talk available?

    Anyway, here they are, the many forms of the presentation, in the form available to attendees which is I think the PDF, and four other forms, for yucks (you can pick your favorite, sizes can help guide your decision here!):

    Format                                Size (zipped)  Size (unzipped)
    Portable Document Format (PDF)        1,119kb        1,257kb
    PowerPoint 97-2003 Show (PPT)         6,508kb        7,235kb
    PowerPoint 2007 Show (PPTX)           5,906kb        6,119kb
    PowerPoint 2007 OpenXML Show (XML)    6,066kb        8,591kb
    XML Paper Specification (XPS)         1,750kb        2,025kb

    Now this presentation tries to cover 11 years of history that are actually about lots of opinions and beliefs about up to two millennia of history, and how it impacted a proposal that has been around for over seven years that I have had some connection to. Although it attempts to be as lighthearted as possible, the source material is almost embarrassingly dense. I'll likely blog about some of the more expandable bits of this in the future, including some of the following, as there are a host of issues relating to:

    • keyboards
    • fonts
    • encodings
    • Unicode
    • linguistics
    • standards
    • politics
    • death threats and threats of violence
    • love affairs
    • corporate interests
    • language
    • and more!
    Only some of which I had time to cover in the 50 minutes and in the slides, thus almost begging for some more blogs with the outtakes. :-)

    Enjoy!


    This blog brought to you by P (U+0050, aka LATIN CAPITAL LETTER P)

  • Sorting it all Out

    About the people who work for companies, and *their* fonts

    On a completely unrelated note, can anything ever be really Wright again now that Rick is gone? :-(

    People who read my blog from the other day titled About companies and their fonts and who know something about fonts on Vista might be torn between two different feelings.

    You know, the feeling of "wondering if I have any idea what I am talking about" versus the feeling of "being cheated of the full answer."

    I suppose in fairness to that latter group that still have some faith in me, I should come clean and finish the story. :-)

    Now it is true that, as I said:

    Of course, to start with there is that whole On installing and removing fonts series that has laid out many of the real problems with and added some suggestions about the installation, updating, and removal of fonts from Windows.

    And of course from the information in that series one can see the exact permissions that must be allowed if one wants permissions to be broadly granted to users to manage their own fonts:

    • Write permissions to the Fonts folder, and
    • Permissions to add values to the HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts subkey in the registry.

    but if you find a company full of users who know how to do all of that manually then....

    Never mind, that has never happened.

    On the other hand if you find a company that would lock systems down but then go out of its way to open them up like this, then....

    Never mind, that company will fold soon, so what their corporate network security and policy practices may or may not be is only interesting from a forensic analysis of who took the company down. :-)

    What people are looking for any time they talk about managing their own fonts is not the ability to randomly write registry keys via function call, copy files via the DOS prompt, and reboot. They are talking about how to use the tools that are there -- the Fonts folder, dragging files in and out, etc.

    That no longer works the same way -- having permission is not enough now, because of the way UAC/LUA works.

    One example from the people who asked me about the rest of the story I was skipping there came from developer Matthew:

    Indeed it is more complicated than that--at least as best as I can determine.

    You'll recall there was a thread recently (end of August) about pretty much this issue.  I looked at the relevant code (the font dialog and the functions it uses), and, from what I saw, the only way to allow a user to manage their own fonts, across sessions, using documented UI or API, is for that user to be a member of the Administrators group.  This stems from the way the font dialog uses UAC--it requires elevation to Administrator (not just to an elevated token) before attempting any font manipulation, even if the user would have access.

    That's just it -- the handler on top of the virtual Fonts folder, both from the Add Fonts... dialog and from drag/drop with the folder, is expecting more.

    It runs through the "Elevate Me, Please?" code which has some pretty high expectations, as Matt indicates above -- higher than you may have set them to yourselves since in this case that code is not designed to try it and try to elevate on failure.

    Which is just about the only way it could work well in this case.

    And what is worse is that if you turn UAC off (which some people do!) then because of the way the code is structured, you might still be unable to add fonts, due to the way the code was structured to fail the attempt if elevation could not happen (and not properly handling the case where elevation failed due to it being turned off!). Luckily this is not a typical scenario for large companies to be allowing, it is just an additional bad example which can easily hit customers in the non-LORG case....

    So what to do here, exactly?

    Well, I suppose the space is ripe for third party utilities that will do all of the things that can be done programmatically when the main user interface to do the work has been shut down, either by intentionally having the UAC code blocking their way, or by turning UAC off and blocking this case anyway.

    Hopefully someone is looking into making this space better for next version.

    And also hopefully someone is looking at a utility like the one above to handle the cases today.

    And finally, hopefully application developers who have to design applications that do not require Administrative permissions have looked at resources like that On installing and removing fonts series here, especially that last part that goes into the benefits of using font embedding and private fonts to remove the need for Administrative permission entirely.

    Of course it is easy to start going on at length about how this kind of thing makes the OS seem unfinished. Though taking the wider view, I think the better locking down of the Fonts folder IS finished. What is left is working on the overall usability features of the Fonts folder, something that has been lacking for a long time anyway, even outside the problems noted here....

    In the meantime, the pieces are in place for Microsoft to make all of this better; the pieces are even in place for others to step up in the meantime. The glass is, if not half full, then at least not half empty -- and both the bartender and a server behind the bar are on their way back to do the refill. :-)


    This blog brought to you by ઘ (U+0a98, aka GUJARATI LETTER GHA)

  • Sorting it all Out

    Vérité (add MUI support in a service pack) ou oser (tell me whether returning 'vrai' was intentional)?

    Do French teenagers play "Truth or Dare" while growing up in France? Or is that just an American thing?

    The actual question seemed innocuous enough. It wasn't:

    We are noticing that on non-us XPSP3 system, when "Regional and Language Settings" (first tab) ->Set standards and formats is set to English, setting HTML input element value to true, still sets it to localized string “Vrai”.

    We found that this happens because with XPSP3, there is a new folder (fr-fr on my French OS) under system32, where the vbscript.dll.mui is present which causes the localization of the string. If I remove this folder, the issue doesn’t happen. I am wondering if this was an intended change for XPSP3. However, the fact that this is inconsistent with how jscript behaves(although there is a jscript.dll.mui as well) and how this doesn’t happen with local variables, leads me to believe that this is a bug.

    Boy, that takes me back.

    It took me back to that software project with the bug in it -- this one.

    Lest anyone doubt that this has indeed been the VB/VBA behavior for so very long, let me personally guarantee that it was.

    Now I was busy being somewhere else getting ready to talk about Tamil, so luckily Paul Dempsey was around to explain what was going on, and why it is new behavior in XP SP3:

    In XPSP3, all of scripting shifted to using a MUI-style model for localized resources, instead of the old satellite DLLs.

    JScript and VBScript, being different languages, use different rules for the conversion between VARIANT_BOOL values and strings. JScript follows its ECMA standard (locale-invariant), and VBScript follows the longstanding VB tradition of a localized result. This explains the difference between jscript and vbscript.

    vbscript UI strings follow the system User Interface language, not the conversion locale (Standards and Formats setting). Conversion locale will affect money, date, and time conversions to/from strings. UI locale is used for types and error messages. This explains your other observation.

    By deleting the file with the localized resources, you trigger the internal fallback to English resources. On pre-XPSP3 systems, you would achieve the same effect by deleting the vbsfr.dll (the old satellite equivalent of the .mui file).

    As far as I can tell, what you’re seeing is all by design.

    Very interesting!

    And essentially true, especially given the goal in VBScript of having it be as VB/VBA-like as possible. He did clarify the timing issue, why it was new behavior in XP SP3:

    This is a side effect of fixing something different. In fact, we were enabling MUI more broadly. This specific behavior change was not intentional.

    And obviously this one is an interestingly tough call -- if you think of the old behavior as a bug, then this was just a long awaited (if not entirely anticipated) bug fix. Though it confirms that the design is sound that you get what is overall the expected behavior (even if individual cases were not realized beforehand).

    But it makes for an interesting twist on the correctness/compatibility debate I go back and forth on here -- and a great attempt to confuse me by putting an internationalization and MUI issue in there too!

    Now I am a huge fan of MUI, and of international support.

    But given my history with this particular behavior and how big a fan I am of backcompat, I'm going to side with treating this one as a bug....

    And I'll admit that somewhere between 15% and 20% of the reason there is personal. :-)

     

    This blog brought to you by é (U+00e9, aka LATIN SMALL LETTER E WITH ACUTE)

  • Sorting it all Out

    Is the flaw in the constructs, or in the one who constructs? (aka I shoulda schwa that one coming)

    Over in the Suggestion Box, Gregory asked:

    Hi Michael,

    I stumbled across a piece of code that implements a .NET HttpModule to remove whitespace and other junk from pages as you hit them (as can be seen here).

    Speaking purely about encoding issues in the code, are there any? The particular lines I am worried about are:

      class PageCleanFilter : Stream
      {
          public override void Write(byte[] buffer, int offset, int count)
          {
              byte[] data = new byte[count];
              Buffer.BlockCopy(buffer, offset, data, 0, count);
              string html = Encoding.Default.GetString(buffer);
              ...
          }


    This code makes me uncomfortable. Specifically:

    Is it okay to assume that the request response has the encoding specified by Encoding.Default?

    This code assumes (I think) that the byte buffer range it works with has all the character data - What if a two byte character happens to be split up by the calling method? I can't imagine this could be a good thing...

    Thanks in advance

    Well, Gregory has good instincts, the kind that I like to think SiaO helps to teach people about. :-)

    This code basically takes a chunk of text and assumes it is in the default system code page.

    If you follow that link, you'll see that between the time I got the message and now, the author took some feedback (from Gregory!) from a comment that he left there.

    The code now gets rid of an extraneous copy operation.

    But it still has the same bad code page assumption, which could easily break depending on the server's settings; really, the whole operation should be using UTF-8 here anyway.

    On the one hand it does not matter much since the developer, and the site, are going to be in a single code page.

    But on the other hand, this is advice for a useful bit of code you can download, with descriptive information such as:

    This article details a HttpModule that removes white space, certain javascript comments, as well as optimising ASPX post-back javascript. This is useful when trying to save on the bandwidth your blog is using, or just plain and simply trying to decrease load time of your pages. I've also implemented a custom configuration section to allow the consumer to enable only the functionality required - you can look here for information on creating your own custom sections.

    And the sample is being posted on a blog on the World Wide Web, which means anyone looking for a code sample might find it, and want to use it.

    Plus with methods like RemoveWhiteSpace and RemoveLineBreaks it is worth considering how incomplete their work might be without all of Unicode to work with (not to mention the additional string copies that each of these methods do but let's stay focused on the international stuff!).

    And then, when all is said and done, the current code is still converting stuff in and out of the default system code page a lot when it does not have to. At the top of the Write method:

               string html = Encoding.Default.GetString(buffer, offset, count);

    and then at the bottom of the method:

               byte[] outdata = Encoding.Default.GetBytes(html);
               _sink.Write(outdata, 0, outdata.GetLength(0));

    instead of (like I said) going through UTF-8 here -- which will only lose invalid data, rather than potentially losing anything off the (comparatively small) code page.
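    Both of Gregory's worries can be shown outside of ASP.NET. The following is a Python sketch rather than a fix for the module itself: it decodes a UTF-8 byte stream that has been split across two writes, once chunk by chunk (as a per-Write GetString-style call would) and once with a decoder that carries incomplete sequences over to the next chunk, and then round-trips the text through a small code page. The function names are made up for the example, and cp1252 is only a stand-in for whatever the server's default code page happens to be.

```python
import codecs


def decode_naively(chunks, encoding="utf-8"):
    """Decode each chunk on its own, with no memory of the previous chunk."""
    return "".join(chunk.decode(encoding, errors="replace") for chunk in chunks)


def decode_incrementally(chunks, encoding="utf-8"):
    """Carry incomplete byte sequences over from one chunk to the next."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))   # flush anything left dangling
    return "".join(parts)


page = "caf\u00e9 \u0259"           # e-acute and schwa each take two bytes in UTF-8
data = page.encode("utf-8")
chunks = [data[:4], data[4:]]       # the split lands in the middle of the e-acute

print(decode_naively(chunks) == page)         # False: both halves become U+FFFD
print(decode_incrementally(chunks) == page)   # True

# Round trip through a small code page (cp1252 as an assumed "default").
lossy = page.encode("cp1252", errors="replace").decode("cp1252")
print(lossy == page)                          # False: the schwa comes back as '?'
```

    The .NET counterpart of the incremental decoder is the stateful Decoder returned by Encoding.GetDecoder, as opposed to calling Encoding.GetString on each buffer independently.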

    But (taking a step back) where is the flaw?

    Is it truly in the people who misuse the tools, or is it in those who design tools that so easily suggest usages that are not ideal?

    Or as I put it in the title:

    Is the flaw in the constructs, or in the one who constructs?

    Which then puts me in an odd sentence, where identical words are being used and the only difference is in the pronunciation, you know:

    kən-strŭkt' vs. kŏn'strŭkt'

    I like reminders of this when people act skeptical after I mention that Han ideographs have multiple pronunciations, since it clearly happens in English too. It helps keep people humble.

    But to get back to the encoding question.

    Gregory is right, the code is wrong on a few levels, though primarily the concerns I have would be:

    • The encoding passed to the method and then converted back out from it
    • The operations inside the method that are employed 

    The lots-of-copying stuff is something that others probably are much more interested in. I mean, I care -- but not for the purposes of this blog...

    I think I am probably going to have to jump into ASP.NET a bit here, and see if I can put some samples together. I'll probably have to dig up a website I have a bit more control over that runs managed code, but I think it might be a useful exercise to do, if described later from soup to nuts.

    Anyway sorry to dēkŏn'strŭkt' things so much. Or actually, to dēkən-strŭkt' them. I'll try to kən-strŭkt' a better kŏn'strŭkt' if I can, later.... :-)

     

     This blog brought to you by ə (U+0259, aka LATIN SMALL LETTER SCHWA)

  • Sorting it all Out

    UCS-2 to UTF-16, Part 2: A&P of a 'linguistic character'

    • 14 Comments
    Previous blogs in this series of blogs on this Blog:

    A&P in the title stands for Anatomy and Physiology, since in some alternate universe I went ahead and got a medical degree and made a good friend (a friend who, in that alternate universe, is still alive) proud of me. Ignore it, the deeper meaning of the title, even when it exists, isn't really important. :-)

    Now that I've told everyone thinking "Let's update to support UTF-16 instead of UCS-2" that they need to just back the hell off a few steps (in the previous blog), I thought it might be good to go a little deeper so you can see that even though you may have been completely and totally wrong, there is a good basis for thinking the way you were, and that you can use that knowledge to feel better about future steps. :-)

    In theory, there is very little difference between the general case of linguistic character as I defined it last time and the specific case that got everyone freaking out about UCS-2 (surrogate pairs).

    In practice, all linguistic characters fall into one of three categories:

    • A Surrogate Pair (two code units, a high and a low surrogate), neither of which is itself a character, linguistic or otherwise. The cheese may stand alone, but surrogate code units didn't teach it how, if you know what I mean;
    • A Grapheme Cluster (to use Unicode's term) aka Text Element (to use Microsoft's in the .NET Framework), made of two or more code units, at least some though not all of which can be independently thought of as being linguistic characters themselves;
    • A Sort Element (to use my term, via this blog) aka Compression (to use Microsoft's term) aka Contraction (to use Unicode's), made up of two or more code units, all of which can be independently thought of as being linguistic characters themselves.
    To show an example of each:
    • 𐎀, aka UGARITIC LETTER ALPA, aka U+10380, aka U+d800 U+df80 -- this one is four bytes in UTF-8, two code units in UTF-16, and one code unit in UTF-32 -- interestingly, always four bytes!
    • Ṹ, aka the fully decomposed form of U+1e78 (LATIN CAPITAL LETTER U WITH TILDE AND ACUTE), aka U+0055 U+0303 U+0301 -- this one is five bytes in UTF-8, three code units or six bytes in UTF-16, and three code units or 12 bytes in UTF-32;
    • dzs, a sequence of letters that collates together in Hungarian, aka U+0064 U+007a U+0073 -- this one is three bytes in UTF-8, three code units or six bytes in UTF-16, and three code units or 12 bytes in UTF-32.
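
    The counts in the three examples above can be checked with a short Python sketch. The `sizes` helper and the dictionary labels are names made up for this illustration.

```python
import unicodedata

def sizes(text):
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2
    utf32_units = len(text.encode("utf-32-le")) // 4
    return utf8_bytes, utf16_units, utf32_units

examples = {
    "surrogate pair": "\U00010380",                              # UGARITIC LETTER ALPA
    "grapheme cluster": unicodedata.normalize("NFD", "\u1e78"),  # U+0055 U+0303 U+0301
    "sort element": "dzs",
}

for label, text in examples.items():
    code_points = [f"U+{ord(c):04X}" for c in text]
    print(label, code_points, sizes(text))
```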

    Now one can argue at length on relative consequences of truncation of any of these sequences of code units. You might even make an argument that truncation is most serious in the first case and then gets less and less serious as you go down the list.

    Truncation in this case is a superset of any operation that splits apart the component pieces before a user's eyes, including cursor movement through the string, deletion of a single "character" via the delete key, cutting off the end to fit in a buffer, or whatever. Anything that would show a lack of respect for a linguistic character's boundaries. Everyone gets involved here -- fonts, keyboards, you name it...

    From one point of view you might be right if that is your argument.

    But as long as we are choosing to call them linguistic characters I am going to channel that Spock-with-a-beard version of me that managed to avoid the scandal with the Dean's daughter and got a PhD in linguistics, and claim that they each have the potential to have unique meaning to a user who took the time and effort to put them into data.

    In my opinion, you get no points for vicious truncation just because it doesn't look as bad.

    In which case anyone with an eager willingness to truncate should consider themselves to be a bloodthirsty linguistic character murderer. Sentence suspended by me since there really isn't a competent court with the authority to punish for this crime. :-)

    Because if you are working on or using a computer program displaying or storing or in any way using data then you have a right to not have someone change the meaning of that data in the name of expediency.

    And truncating a linguistic character has the potential to do just that.

    Okay, now that I have been all crazy about this, I'll point out that only the first two of these three categories have any supported way for a program looking for safe truncation points to detect them.
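
    As a rough sketch of what detection can look like for the first two categories, here is some Python. The helper names are hypothetical, and the combining-mark check is a deliberate simplification: it only looks at nonzero canonical combining classes, which is an approximation of real grapheme cluster boundaries and not a full implementation of them.

```python
import unicodedata

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def safe_cut_utf16(units, limit):
    """Largest cut point <= limit that does not fall inside a surrogate pair."""
    cut = min(limit, len(units))
    if 0 < cut < len(units):
        high = 0xD800 <= units[cut - 1] <= 0xDBFF
        low = 0xDC00 <= units[cut] <= 0xDFFF
        if high and low:
            cut -= 1
    return cut

def safe_cut_marks(text, limit):
    """Largest cut point <= limit (in code points) that leaves no combining mark stranded.

    Simplification: only characters with a nonzero combining class count as marks.
    """
    cut = min(limit, len(text))
    while 0 < cut < len(text) and unicodedata.combining(text[cut]) != 0:
        cut -= 1
    return cut

units = utf16_units("a\U00010380b")
print(safe_cut_utf16(units, 2))               # backs up rather than split the pair
print(safe_cut_marks("aU\u0303\u0301", 3))    # backs up to before the U
print(safe_cut_marks("dzs", 2))               # nothing here knows dzs belongs together
```

    The last call is the third category in action: without collation data for the language in question, the cut lands in the middle of the sort element and nothing complains.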

    Which means if I made you feel guilty, you can take some solace in the fact that just about everyone is going to be doing it some of the time....

    But it is worth considering that fact when carefully doing one's best to avoid problems with the categories that one can easily help with.

    Okay, that's it for now, next time I'll talk about those various operations and how to go about them....


    This blog brought to you by Ṹ (U+1e78, aka LATIN CAPITAL LETTER U WITH TILDE AND ACUTE)

  • Sorting it all Out

    Hi, I'm a PC. And I have a MAC. Wait, isn't that backwards? No worries, we're talking Bidi here!

    • 5 Comments

    The other kind of MAC, in this case. :-)

    Another from the list of bugs from that cool presentation from the folks over in Intel localization....

    This one is kind of about Bidi.

    Wait. Scratch that.

    It is all about Bidi.

    First some info from Loïc describing the screenshots:

    There are 2 screenshots using Hebrew settings, 1 using Arabic. I must say those have so far puzzled our developers, and I wish we could set some sort of override for those MAC address fields. They're passed properly among applications, but their display can confuse the most experienced user ;-)
     
    Please do not hesitate to share any suggestion about a possible fix.

    I'll start with the screenshots, which were also part of the presentation itself, the examples of problems:

    Okay, just so we all see it -- do you see what is happening to the MAC address?

    It is easy to say that these are application-specific bugs, bugs that you should fix with Bidi control characters.

    But that is really taking the easy way out.

    Yes, you guessed it, I am thinking about those same thoughts from the recent blog The Bidi Algorithm's own SEP Field.

    Let's explore the space a bit.

    We'll start off with plain old English US settings.

    And then to plain old Notepad, looking at some MAC addresses.

    With both a left-to-right and then a right-to-left reading order:

    Everything looks good. Cool.

    Now we'll mix it up a bit, and go to Saudi Arabia:

    We'll look at both orientations again.

    Ready?

     

    Well, we're half good. And half not so much.

    Oh wait, I just remembered....

    Context sensitive digit substitution.

    We should also look at it with National digits:

    This will even change the main dialog's look once it is all applied:

    And we'll see what it does to our Notepad stuff, too:

     

    Now of course by rights there is another set of cases to look at -- with an Arabic UI language.

    But I'll spare you the trouble, you can probably imagine -- or just use the original three screenshots and extrapolate what would happen.

    The key here?

    In all of these cases, we are talking about plain text, the kind of scenario that Unicode was designed for. Did you know that according to Google, the words "plain text" appear a shade under 4,000 times?

    Now for this bug, sure -- make the control have the attributes you want, and embed the control characters you need.

    But what about documentation talking about the MAC addresses? What about there?

    Is it the job of documentation writers to embed Unicode control characters in the text, worldwide? After all, one never knows what the user locale or the user UI language will be later on for the one reading the documentation.

    We need something better here, for the whole world.

    Because once again, Bidi is not doing so well in the world of mixed language text. Easier solutions are needed here....

    For now, Loïc's request for a workaround? I suppose a forced LRE/PDF surrounding the text should do nicely enough here for now.  But this will hardly scale to the general case, or the documentation case, or any of the other myriad of plain text cases. So what do we do?
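
    A minimal sketch of that workaround, in Python. The `force_ltr` helper and the MAC address are made up for illustration; the only real ingredients are the two control characters, U+202A LEFT-TO-RIGHT EMBEDDING and U+202C POP DIRECTIONAL FORMATTING. The Bidi classes printed first hint at why the address is fragile: digits and separators are weak types that take their direction from context.

```python
import unicodedata

LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def force_ltr(text):
    """Wrap text in an LRE/PDF pair so it is laid out as a left-to-right run."""
    return LRE + text + PDF

mac = "00-1A-2B-3C-4D-5E"  # made-up address, for illustration only

# Bidi classes of the first few characters: European numbers, a separator, a letter.
print([unicodedata.bidirectional(c) for c in mac[:5]])
print(ascii(force_ltr(mac)))
```

    The wrapped string is what a control would display; any code that parses or compares the address has to strip the two control characters again first, which is part of why this does not scale to documentation or other plain text.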


    This blog brought to you by މ (U+0789, aka THAANA LETTER MEEMU)

  • Sorting it all Out

    About companies and their fonts

    • 2 Comments

    Over in the Suggestion Box, Chris Nichols suggested and opined:

    From the comments in response to your post About the Fonts folder in Windows, Part 3 (aka What changes in Vista?), it seems that there are some things to be desired in the way that Vista handles fonts.  Like, lots of things.

    I suggest a post that talks about way for companies to manage installed fonts.  Can we enable standard users to do this?  Are there some scripts available to help manage this.

    From a corporate perspective XP font management sucked,  Vista font management REALLY sucks.  The tools for managing this area are non-existent, what is available seems to come from Windows 3.

    Hmmm....I guess I'll avoid the more judgmental side of the observation (I prefer to make sure that when I am expressing things judgmentally that I limit it to my own personal opinions!).

    But even so, there are some interesting technical issues here.

    Of course, to start with there is that whole On installing and removing fonts series that has laid out many of the real problems with and added some suggestions about the installation, updating, and removal of fonts from Windows.

    And of course from the information in that series one can see the exact permissions that must be allowed if one wants permissions to be broadly granted to users to manage their own fonts:

    • Write permissions to the Fonts folder, and
    • Permissions to add values to the HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Fonts subkey in the registry.

    But it really is more complicated than that.

    Fonts, as complex data structures that many operations have to be performed upon, are in essence almost like little bits of code one puts on one's machine. Because although they aren't exactly code themselves, they lead code around and tell the code what to do, what decisions to make, what branches to execute.

    Just like one who owns a dog and who chooses to walk him every day can be said from one point of view to be owned by that dog given the way one acts as an indentured servant each day in order to meet his needs. :-)

    Like with any code that can be put on a machine, those with evil intent can take advantage of an "open" system to allow a malicious font to be installed. And thus a company opening the floodgates to allow any font to be installed is really not in the best interests of that company....

    Though that knowledge, combined with knowledge of:

    • security certificates one could place on installation packages;
    • certificates one could place on machines in the company to cause those installation packages to be considered trustworthy;
    • programs typically used by large corporations to manage machines in the enterprise like SMS.

    can certainly lead a smart person in the IT consulting staff to come up with a solution to allow the safe installation of fonts in the corporate environment....

    Could this be made easier?

    Certainly!

    But really the "easier" bit is just to package up the options here so that all of these steps can be taken without knowing as much about the details; the steps themselves are there for very good reasons, and it is much better to avoid tripping on these steps....

    One could charge the Typography team with this task, but in my opinion this is kind of silly -- since all of the above steps actually already have experts who know more about how to do them (especially in the IT space!), making another team learn them just for the hell of it rather than just getting these other smart people working on better solutions and better tools just sounds wasteful.

    Certainly one of the best places to start is in the tools surrounding the installer technology and making sure all of those pieces are in place. Does anyone know if they are yet, or if there are improvements that should be happening here?


    This blog brought to you by ໜ (U+0edc, aka LAO HO NO)

  • Sorting it all Out

    Johab to be kidding me!

    • 0 Comments

    From the list of bugs from that cool presentation from the folks over in Intel localization....

    The bug? Well, it seemed that Korean was "randomly" not working!

    By randomly I mean it was not working on some machines but everything was just fine on others.

    No real discernible pattern at first glance, but they tracked it down eventually -- having a particular font installed was causing it.

    Wow, talk about the power of typography, huh? :-)

    The font they found behind the problem was Arial Unicode MS, which I have mentioned before as not being the best possible choice for font in blogs like this one and this one and especially this one.

    Though to be fair, it takes more than just having Arial Unicode MS installed to cause troubles. In fact a unique constellation of attributes is required to cause problems!

    • First, it requires code in the application that is setting the LOGFONT.lfCharSet to JOHAB_CHARSET;
    • Second, it requires that you have a font on your machine that claims support for code page 1361 (Korean Johab);
    • Third, it requires the appropriate support for code page 1361 to be present on the machine.

    It turns out that Arial Unicode MS is just such a font:



    The way that a font gets this is by setting bit 21 in the Code Page Bitfields, and if a font has this set and the code specifically requests the JOHAB_CHARSET then it is almost unfair to blame the Font Mapper in GDI for finding a font that matches...
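
    To make the bit test concrete, here is a small Python sketch. It assumes you already have the first 32 bits of the font's Code Page Bitfields (the ulCodePageRange1 field of the OS/2 table) as an integer; reading that value out of a font file is not shown, and the two sample values below are invented for illustration, not taken from any real font.

```python
# A few East Asian bit assignments in the first code page range field (a subset).
CODE_PAGE_BITS = {
    17: "932 (Japanese)",
    18: "936 (Chinese, Simplified)",
    19: "949 (Korean Wansung)",
    20: "950 (Chinese, Traditional)",
    21: "1361 (Korean Johab)",
}
JOHAB_BIT = 21

def claimed_code_pages(code_page_range1):
    """List the code pages from CODE_PAGE_BITS whose bits are set."""
    return [name for bit, name in sorted(CODE_PAGE_BITS.items())
            if code_page_range1 & (1 << bit)]

def claims_johab(code_page_range1):
    """True if the font claims support for code page 1361."""
    return bool(code_page_range1 & (1 << JOHAB_BIT))

# Invented sample values.
typical_korean_font = 1 << 19
paint_sprayer_font = sum(1 << bit for bit in CODE_PAGE_BITS)

print(claims_johab(typical_korean_font), claimed_code_pages(typical_korean_font))
print(claims_johab(paint_sprayer_font), claimed_code_pages(paint_sprayer_font))
```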

    Of course there are probably other fonts out there that have this bit set, though note that none of the Korean fonts that ship in any version of Windows do this.

    In fact out of the 863 fonts on this machine, only one font other than Arial Unicode MS has this bit set: Code2000 from James Kass:

    I don't know why mega fonts would do this specifically, though I have a guess.

    For mega fonts, setting bits in the Code Page Bitfields and the Unicode Subset Bitfields with a paint sprayer seems to be their way of saying "we support a lot of stuff!" even though at this point nothing whatsoever is encoded in this specific weird code page....

    To be fair to these two fonts, they are just being promiscuous, and that is not a sin in and of itself.

    For sin to take place, you have to request the JOHAB_CHARSET in your code that is loading the font, which I suppose (to continue the less than appropriate metaphor) requires your code to put the $100 between its teeth looking for a promiscuous typographical partner, which the GDI Font Mapper then facilitates -- it is only doing its best to see that provider and customer are both satisfied, after all. :-)

    And there are machines out there that have that code page included on them (I guess this would be an available back seat somewhere or something?):

     

    Interestingly, this can happen with Unicode applications as well.

    Because the whole point of the JOHAB_CHARSET processing is more than just a code page or a charset, though please feel free to add 1361 to the list of code pages that suck:

     

    But like the ISO 6937-based code page 20269, the Johab code page actually works under a different character encoding philosophy, as described in several places, from Ken Lunde's CJKV Information Processing to Richard Gillam's Unicode Demystified. It basically has the intent of breaking down Hangul into its constituent Jamo in ways that don't really tend to completely match the way Jamo work in Unicode (in the latter case I speak of the code page on Windows, which is conveniently left out of the list on either Microsoft's or Unicode's sites, except one link under obsolete code pages, here).

    Suffice to say that anything that GDI does here, it is only doing because the application has specifically requested it.

    At this point Johab is widely deprecated, though one can suppose that some text editor might still be using it, which makes it harder to just remove the code page from Windows (either on the NLS side or the GDI side -- since the latter can impact Unicode applications, too!).

    But at a minimum, you should never specify it, either in your dialog resources (via the FONT statement) or in code (using the LOGFONT structure), unless you are specifically expecting things to be processed Johab-style (which is not the same as Unicode Normalization Forms D or KD).

    In any case, a great obscure globalization issue, made harder to track down due to the lack of good documentation describing the behavior of the seldom-used Johab support on Windows....

     

    This blog brought to you by 갉 (U+ac09, aka HANGUL SYLLABLE KIYEOK A RIEULKIYEOK)

  • Sorting it all Out

    Where the boys aren't garden path sentences

    • 0 Comments

    Regular readers (and devoted couple!) Don and Tammy asked via the Suggestion Box:

    We found another interesting language issue in adult films. This time it is in the title.

    There is a long-running series from Vivid entitled _Where_the_Boys_Aren't_. As you can probably guess if you have never seen any of them, it is an "all girls" series. It started some time in the early '90s and new ones are still coming out to this day. Some are good, some not. Just like with mainstream movies.

    But the point is that there are 19 different releases. The meaning of "Where the Boys Aren't 16" and "Where the Boys Aren't 17" and "Where the Boys Aren't 18" and so forth seems like a confusing use of, doesn't it? We were joking about this  while watching the latest one, and decided to forward it to you since I think you haven't covered it before.

    Don (and Tammy)

    Well, as Don allowed for, I was unaware of the specific series. My experience in these things is much more limited than theirs, but what is the sense of reader feedback if not to afford me the chance to live vicariously through them? :-)

    Of course if the goal is to allow for the confusion (you know -- to have fun with it) then perhaps it is intentional, though if every one of them since the beginning had a number on it, I doubt that it started that way -- no one is really expecting anything sordid out of a title like Where the Boys Aren't 3, for example. Plus with lower numbers it is much more common to use the spelled out number which would be even less confusing, I think? Again, I'm kind of guessing on some of this.

    Though this is perhaps kind of a garden path sentence, like in this blog or this one or this one or this one or this one or even this one. Though in this case for many it is not so much a garden path at all -- it is just a regular old path that one may choose to be walking down if one likes. Either one successfully parses it and thinks of

    Where the Boys Aren't 20

    as if it were

    Where the Boys Aren't, #20

    and not, as if it were

    Where the Boys Aren't 20 years old

    depending (I suppose) on one's frame of mind or movie preference.

    The two paths here suggest two completely different types of movies with two completely different target markets -- each of which might contain members very willing to skip the movie if they read the sentence the other way, which would suggest that Vivid would be better off putting the # in the title since the "pun" is dumb and the misunderstanding doesn't really help them make extra sales (in fact, customers expecting a "no guys" feature are almost certainly going to run away from a "guys of a specific age" feature!).

    Though I doubt that the company producing the series has on-staff linguists to control title quality, and not only am I not a linguist but even if it were okay under the moonlighting clause it would probably fail for "moral turpitude" reasons. So there is no sense trying to ask for the job....

    With a series running for so long that (assuming 1992 to 2008 with #18 as the last one, a little over one a year) people might really know the series well enough to have no confusion, it may never come up as an issue in practice.

    I know that it might be awkward to go to the video counter to complain about being misled, at least....

    Either way, it is sort of a garden path sentence kind of thing I guess, though the path may not be heading toward the most appropriate destination for all of us (as is the title of this blog, only moreso)! :-)

     

    This blog brought to you by ⚢ (U+26a2, aka DOUBLED FEMALE SIGN)

  • Sorting it all Out

    One small step for me, and one big leap for Microsoft?

    • 6 Comments

    Okay, I know I promised I was done, but after It is the [unexpected] gratitude from people you respect that makes a [stubborn] Bulldog feels best!, where I mentioned:

    Rick then said some very nice things and I accepted the award that I think I was too dazed to remember (I'll wait till they update the Bulldog page to find out for sure -- I now know why Oscar winners forget stuff in their speeches!)

    Sarasvati manifested and in Her divine effulgence took pity on her lowly servant and forwarded the text used by Rick when the award was presented.

    Here it is, for the curious:

    It's been a few years since we have given out a Unicode bulldog award. 
    This year, we would like to honor someone who has been a long-time 
    presenter at these conferences, and many of you probably know him.

    Michael Kaplan is such a true friend of Unicode, and so dedicated to 
    supporting the cause, that his tiny company once joined at the full member 
    level, so that he could have a vote in the UTC on topics of importance. 
    (This makes him one of two individuals who have ever joined as full 
    members... For tonight's trivia quiz: who was the other person?)

    Even without his vote, Michael's voice and his passion would be hard to 
    miss. He has blogged about Unicode and supported the Consortium for many 
    years, first through his own company and later through his work at 
    Microsoft.

    Please join me in welcoming Michael to the ranks of the Unicode Bulldogs.

    For the record, the cost of full membership in Unicode was USD$12,000 at the time; it is now USD$18,000.

    The maximum approved spending limit to prove a point in my comparatively smaller company is USD$14,995, so it is lucky that Unicode didn't raise the rate until recently, or I'd have only had a half vote. :-)

    Another interesting bit here.

    Something that I had not realized initially (a colleague down the hall pointed it out to me later) and that I thought at first was an error....

    If you look at the Unicode Bulldog Award page that lists the past recipients:

     

    Martin Dürst Presented September 1997, San Jose, CA
    Misha Wolf Presented September 1997, San Jose, CA
    Ed Hart Presented September 1998, San Jose, CA
    Matsuoka Eiji Presented September 1999, San Jose, CA
    Tatsuo Kobayashi Presented September 1999, San Jose, CA
    Thomas Milo Presented March 2000, Amsterdam, The Netherlands
    Michael Everson Presented September 2000, San Jose, CA
    Isai Scheinberg Presented September 2000, San Jose, CA
    Arnold Winkler Presented January 2002, Washington DC
    Markus Scherer Presented September 2002, San Jose, CA
    Eric Muller Presented September 2003, Atlanta, Georgia
    Tex Texin Presented March 2004, Washington DC
    Sandra O'Donnell Presented September 2004, San Jose, CA

    It appears that I'm the only person from Microsoft to be granted the award. Well, so far, at least.

    Kind of cool!

    For that reason, I think I'll mention it officially now, since a lot of the effort took place while I was an employee and Microsoft has been very supportive of it for the bulk of that time -- this means that they deserve some of the glory too as much of it would have been if not impossible then at least less feasible without Microsoft's involvement here.

    Okay, now I am officially done. I am way over quota for this kind of thing! :-)

     

    This blog brought to you by 𐂍 (U+1008d, aka LINEAR B IDEOGRAM B109M BULL)

  • Sorting it all Out

    U+1d31d -- "She's So High" lyrics, as requested

    • 2 Comments

    It came about as a result of The issue became moot as I decided to stand mute, when I commented about the songs that I was thinking about singing:

    • Tal Bachman's She's So High -- the theme was going to be about a Unicode character with a lower number that feels unworthy of the romantic attention of a character with a higher code point value;
    • Len's Steal My Sunshine -- with the help of a friend I met at a previous IUC who has a wonderful singing voice, the key tag line in the chorus "If you steal my sunshine" would have been replaced by "If you show my letters".

    but then decided not to do either after sizing up the competition and deciding I couldn't really top them. But then as I mentioned in Getting a globe through the airport, with Phylyp and John asking for some lyrics I figured I'd post at least one of them.

    Now this is the one I was probably not going to do without help -- the Tal Bachman song. His vocal range is much better than mine.

    You can look/hear it on YouTube here. His original lyrics:

    She's blood, flesh and bone
    No tucks or silicone
    She's touch, smell, sight, taste and sound

    But somehow I can't believe
    That anything should happen
    I know where I belong
    And nothing's gonna happen

    'Cause she's so high
    High above me, she's so lovely
    She's so high, like Cleopatra, Joan of Arc or Aphrodite
    She's so high, high above me

    First class and fancy free
    She's high society
    She's got the best of everything

    What could a guy like me ever really offer?
    She's perfect as she can be, why should I even bother?

    'Cause she's so high
    High above me, she's so lovely
    She's so high, like Cleopatra, Joan of Arc or Aphrodite
    She's so high, high above me

    She calls to speak to me
    I freeze immediately
    'Cause what she says sounds so unreal

    'Cause somehow I can't believe
    That anything should happen
    I know where I belong
    And nothing's gonna happen

    'Cause she's so high
    High above me, she's so lovely
    She's so high, like Cleopatra, Joan of Arc or Aphrodite
    She's so high, high above me

     Now the themes here are fascinating from a romance kind of standpoint.

    In fact I suspect we've all been there before -- like when you have a crush on someone but you figure you don't stand a chance so you don't try. But then she suddenly approaches you and you are amazed and delighted though you know that in the end you may never get comfortable. Like the way Rob Gordon thought of Charlie Nicholson in High Fidelity.

    Well I've been there at least. If you haven't then take my word for it -- stressful. :-)

    Anyway, without further delay here are the lyrics I put together for the contest:

    From v4.0, a sight to see
    one -- dee -- three -- one -- dee [1]
    ess oh, zero, and oh enn [2]

    But somehow I can't believe
    That she'd hang out with a G [3]
    I'm from the BASIC LATIN block
    oh-oh-four-seven's in ASCII

    And she's so high
    High above me, 1d31d
    She's so high, like DESERET, CUNEIFORM, or even KHAROSHTHI
    She's so high, like in LINEAR B

    Plane 1 and fancy free
    She's supplementary
    With attributes in the UCD

    What could a char like G ever really offer?
    I'm in code pages but she's not. Why should I even bother?

    'Cause she's so high
    High above me, 1d31d
    She's so high, like AEGEAN NUMBERS, LYDIAN, or MUSICAL SYMBOL FINGERED TREMOLO-3
    She's so high, like DIGRAM FOR HEAVENLY [4]

    The call comes for her to render with me
    But we're not in the same fonts you see
    I am so used to being in ASCII

    So somehow I was surprised
    That we'd be seen together
    I must thank Unicode
    Cause we can be together

    And she's so high
    High above me, 1d31d
    She's so high, like OLD ITALIC, CARIAN, or OLD PERSIAN NUMBER TWENTY
    She's so high, like COUNTING ROD TENS DIGIT THREE

    And she's so high
    High above me, 1d31d
    She's so high, like OSMANYA, SHAVIAN, or CYPRIOT SYLLABLE E
    She's so high, like MATHEMATICAL BOLD FRAKTUR CAPITAL P

    Yes she's so high
    High above me, 1d31d
    She's so high, like MUSICAL SYMBOLS, MAHJONG TILES, or TETRAGRAM FOR ETERNITY
    What do they call that? Interoperability!


    1 - A sounded out U+1D31D, aka TETRAGRAM FOR JOY
    2 - A sounded out bit from UnicodeData.txt -- 1D31D;TETRAGRAM FOR JOY;So;0;ON;;;;;N;;;;;
    3 - Our hero, U+0047, aka LATIN CAPITAL LETTER G
    4 - Actual name is DIGRAM FOR HEAVENLY EARTH, but the earth is silent to allow the rhyme to succeed.
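
    For anyone who wants to read the line from footnote 2 rather than sing it, here is a small Python sketch that splits it into a few of its semicolon-separated fields. The `parse` and `plane` helpers are names invented for this illustration, and only some of the fields are pulled out.

```python
LINE = "1D31D;TETRAGRAM FOR JOY;So;0;ON;;;;;N;;;;;"

def parse(line):
    """Pick a few fields out of one UnicodeData.txt line."""
    fields = line.split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "general_category": fields[2],   # So = Symbol, other
        "combining_class": int(fields[3]),
        "bidi_class": fields[4],         # ON = Other Neutral
        "mirrored": fields[9],
    }

def plane(code_point):
    """Plane number: 0 is the BMP, 1 and up are supplementary."""
    return code_point >> 16

she = parse(LINE)
print(she)
print(plane(she["code_point"]), plane(0x0047))  # her plane, then G's
```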

     There you have it. Play the video and you'll see what I meant about the voice range stuff -- Tal goes pretty high on some of it.

    High above me, ironically enough! :-)

    Anyway, the LEN song was more likely (I've even done it for geeks before, you may recall!), especially since it is easier to have someone do the high parts without as much planning ahead required. But between them I thought the above one was actually cooler and more fully formed.

    There you have it....

     

    The identities of the sponsors of this post are left as an exercise for the reader

  • Sorting it all Out

    Getting a globe through the airport

    • 4 Comments

    So anyhow...

    You may recall how I won that award the other day.

    It came in a box.

    Well, actually, it came as an award. Magda put it in a box so as to make taking it home easier.

    Here is the box:

    I was thinking about how lucky I was that I couldn't find a smaller bag to hold my stuff in -- I was pretty sure it would fit in my bag and I could avoid checking anything.

    I should have realized that if everything seems like it is working out easily, then I don't fully realize what's going on....

    You see, there is TSA.

    The security folks at the airport.

    We were approaching September 11th, back on the evening of September 10th, and I put my backpack into the x-ray machine.

    The lady asked if the bag was mine, and after I admitted as much she told me she'd have to look in it.

    "Before I do that, is there anything sharp in there?"

    No, I assure her. Though it is very full....

    So she opens the box to look inside:


    Turns out the image in the box inside the backpack made them nervous.

    The base is apparently made of some kind of lead-based acrylic that the machine can't see through.

    This tends to make them nervous.

    She eyes the globe suspiciously, lifting it carefully while holding it with the care one would use to hold up a bird soon bound for the land where birds are eternally blessed.

    "It looks like there's liquid in there. A lot, maybe?"

    Uh oh. I see where this is going.

    "Not very much," I offer. "Just enough to leave the sphere inside able to float free."

    She is looking for seams in the sphere. I can tell she is wanting to open it up, but realizing this might be a one-way trip.

    She looks at the globe again and mentions this seems like an unusual item to be traveling with.

    I explain about the award and how I wouldn't usually be traveling with it; I'm heading home with it, in fact.

    She nods.

    "Are you Michael Kaplan?" she asks.

    Oh crap, I think. If the TSA frisker reads my blog, my head will explode. In an airport. On the day before the anniversary of September 11th.

    No, it turns out she just read it on the inscription on the base.

    Thank God. And god. And all of the other gods and goddesses. My head really would have exploded, and an airport is a terrible place for that, since that would make me get used in some kind of al-Qaeda conspiracy theory or something.

    She decided to let me go with the award.

    I was sure she was going to make me check it or something.

    I made it home okay.

    The next day I unpack the award and place it on the desk.

    I decide to take a picture of it.

    This proves to be difficult, though. I want the inscription but I can't get it without getting caught in the reflection:

    or else missing the bottom of the inscription:

    Ah well, I give up.

    Here is the globe part.

    You know, just so you can see it for a moment:

    And there you have it.

    You might see how I subtly made sure India would show up; it turns out that this is harder than you might expect with this globe that floats inside this sphere.

    You kind of have to sneak up on it.

    Notice how the globe has lines for country borders? That's awfully brave, ain't it? :-)

    Anyway here ends the story of getting the globe from point S to point R.

    I decided to blog about this since I had no real idea if anyone inside MS was gonna mention the award, since I was kind of "off the clock" when it was presented (if not when it was earned!). This way some people in the group get to hear about it either way!

    Now I will get back to work. And blogging about actual stuff.

    Though maybe I'll post some song lyrics over the weekend, what with Phylyp and John moving for it to happen; there are advantages to avoiding the vote....


    This blog brought to you by ◓ (U+25d3, aka CIRCLE WITH UPPER HALF BLACK)
