September, 2007

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!
  • Sorting it all Out

    First the music, then the lyrics -- and make it rhyme!

    The question I received the other day by email:

    This is probably not your kind of question, but it has to do with music and with language so I decided it would be worth a try.

    Why do song lyrics always have to rhyme when poetry doesn't?

    Well, it is the kind of question I find interesting, though I suspect someone over on Language Log might be able to do more here with it than I could, since I'm not really sure.

    I guess I could start with the throwaway answer -- that most poetry does tend to rhyme, as a part of the structure in which it sits. And, even more throwaway, that the vast majority of people seem to have no real knowledge of poetry that doesn't rhyme (or at least are not as familiar with it as they are with the rhyming sort!).

    But obviously that says nothing, since we know that poetry that does not rhyme does in fact exist, even if it is less common. So what about songs?

    Well, one could argue that songs with lyrics are not aimed at the crowd who would appreciate ones that do not rhyme. But that feels kind of throwaway too, doesn't it?

    Now after talking to a lot of different singer/songwriters over the years and reading the words of even more in interviews, at least one relevant pattern seems to consistently come from damn near all of them:

    • It is almost always music first and lyrics later
    • It is occasionally music and lyrics at the same time
    • It is just about never lyrics before music (occasional exceptions around a few couplets in someone's head or a poem later put to music)

    When I think about poets who often don't rhyme and get away with it (e.g. Cummings, Eliot, Whitman, Pound) I also tend to see structure that is much more subtle (and I'll admit sometimes so subtle that I can't even discern it without reading someone else's analysis) and structures in songs certainly do not seem to be as deep within a single song.

    I think about songs in movies or television shows from artists like Aimee Mann or Rachael Yamagata, and the thing I am almost always struck by is that the themes in the lyrics of the song almost entirely fail to match what is going on in the movie or television program. Often the music fits, but the lyrics do not. And this makes a kind of sense, since most people do not listen closely to the song lyrics while they are watching a movie or TV anyway (someone who was once watching me stay in a theater to watch for a song title after a movie pointed out to me that it takes a real "Dawson Leery" to stay and watch the credits, since nobody cares about that stuff).

    So in a different medium (where people ARE listening to the song), maybe they respond only in a setting like a club for dancing, in which case the same rules apply -- if the words are not important except as a part of the structure, so that there is no glaring break like couplets without a rhyme, then the only thing important to some listeners is in fact that they do rhyme.

    When people choose to delight in random factoids like the one about Cyndi Lauper's "She Bop" actually being about masturbation and then closely listen to the lyrics to verify this, again they aren't thinking about the structure of them. When my ex-girlfriend's kids wore out a Jagged Little Pill CD by Alanis Morissette through continuous rewinding/replaying because "You Oughta Know" has the question "And are you thinking of me when you fuck her?" while totally ignoring the perhaps equally suggestive earlier question "Would she go down on you in a theater?" it is again clear that the lyrics only reach the listener's notice in very bizarre circumstances.

    Does 'Til Tuesday's "Voices Carry" and its "Hush hush, keep it down now, Voices Carry" line make more sense in its original never released form (about a woman who is not out of the closet taking to task her female lover who is perhaps not too secretive) than in the song's released form (about a married man lecturing the woman with whom he is having an affair)? The reasoning behind a record label feeling more comfortable with the latter is depressingly obvious and is certainly a theme that more people can personally identify with, societal pressures being what they are, even if you ignore the fact that I am very far from ever being able to understand what it would feel like to be in the former. But most people never even read too deeply into the themes in the lyrics or even knew about the original lyrics, they just saw an MTV video (you know, back in the 80's when MTV used to have videos) and they just listened to the music, and of course the lyrics rhymed. Don't they all?

    No one really knows what they are listening to in the music at a deep level, either -- who but a musician or an obsessed fan pays attention to what key the song is in? Or what chord progressions? Or whether the song is being driven on a bass or a guitar or a keyboard? Who but a Led Zeppelin fan knows about the unusual way that the guitars and the drums are playing to different beats and actually heard the difference before they read about John Bonham getting credit as a co-writer for the song for it? Almost no one....

    We hear a generally pleasing structure and go with it.

    Okay, maybe you see what I'm getting at. For most people lyrics are just another piece of the song like the meter, one that they do not listen to at a deep level. And it does not allow one the time to quietly contemplate like a poem does and to look for those deeper meanings that may be in the one that does not rhyme.

    Maybe someone who is an accomplished poet or songwriter or linguist might have an answer that is slightly more elegant/organized/astute/accurate. Plus they might even spellcheck their result before they publish it. But I figure for a draft of an answer in a blog post, it'll do. :-)

     

    This post brought to you by ♬ (U+266c, a.k.a. BEAMED SIXTEENTH NOTES)

  • Sorting it all Out

    Upgrade no machine[ to Vista] before its (or your) time

    Years ago, I worked on the Access team.

    Even though the world had moved to NT4 Server at the time, one of the servers with build shares ran NT 3.51 back then, and it did so even once Windows 2000 Server builds were available and most machines were being upgraded. Maybe there were critical processes running, maybe no one wanted to take the time to do the upgrade, maybe the hardware was old and people were afraid there were no drivers and no one wanted to replace the box. Whatever -- there was some reason no one updated the box, for years.

    The simple fact is that as much sense as upgrades make, there are times that upgrades have to wait. For whatever reasons.

    Now I like Vista.

    I worked on it as a product from the very beginning, I have a Ship-It chiclet for it, it is on 75% of the eight machines I use, 80% of the five machines I own, and it makes up more than half of the 21 partitions I have set up across those eight machines.

    But at the same time, notice that none of those categories above are 100% and there used to be only one machine that ran it all the time (it now spends half its time running Server 2008, which is also a product I like).

    And there is a very simple corollary to those numbers.

    I do not run Vista on all of my machines all of the time.

    If there are specific reasons you have to not be running Vista (you like Mac OS X/Server 2008/Server 2003/XP better for what you have to do, you have a hardware conflict, you have some really awful bug affecting resume from hibernation (as my brother-in-law is dealing with on a few machines with no patches fixing the problem) or getting security updates (as I am dealing with on one machine that a Tier 3 MSIT engagement has done nothing for), or your ability to automatically sync with your cellphone never works (as my sister is dealing with, no info on updates anywhere), or you are finding games that fail consistently due to the inability to turn Advanced Text Services off (as I discussed in Vista turns on everything), as Thanendar told me in a recent contact link message, or ANY OTHER PROBLEM that makes Vista usage painful beyond the normal "learning a new product" stuff, or even within that range if you don't have time yet to learn it), then I have only one piece of advice:

    Don't Install Vista yet.

    It seems like obvious advice, and I managed to shock a few family members when I told them this, but the truth is simple. Computers are made to make our lives easier and more productive; if Vista is not doing so yet due to some blocking issue then I would recommend that you report the issue(s) to Microsoft or the OEM you got Vista from and then install the operating system that works.

    I am a fan of Vista.

    But I am an even bigger fan of customers getting what they want out of their computers.

    So if Vista is not doing that but you wish you could run it since you like some of what it has, then wait to upgrade and let people know what the problem is. Maybe it can get fixed so that one day you can run the latest version of the OS.

    And to be perfectly honest, feel free to be a little suspicious of the motives of anyone who is insistent that you must do anything else here prior to those fixes being provided.

    I know I do....

     

    This post brought to you by U (U+0055, a.k.a. LATIN CAPITAL LETTER U)

  • Sorting it all Out

    I didn't write it, sorry!

    IDisposable asked via the Suggestion Box:

    Regarding this very helpful KB page, http://support.microsoft.com/default.aspx?scid=kb;en-us;939949

    Why, OH DEAR WHY, couldn't those CopyCulture stubs been automatic?  The only ones that could EVER be unsafe are the div-MV and en-CB, and  those are only unsafe should either of those EXACT codes be used.  For div-MV, that's just not going to happen... and when en-CB returns, it's likely going to _BE_ what en-029 represents.

    In short, why introduce such HORRIBLE breaking behaviour when the workaround is obvious, necessary, and non-intrusive?

    p.s. I assume you didn't right the article, since you didn't blog about it... but if you did, EXCELLENT work. It reads really well.

    I didn't write the code in the KB article.

    If someone asked me, I would say that I find the notion of solving this problem with custom cultures to be along the lines of curing dandruff by keeping your head shaved bald!

    (In both cases the overhead of the solution is overkill and may well even be worse than the problem!)

    I agree with IDisposable that this should have been solved in a way that caused no client to break and that requires no code change -- if not in the immediate release causing the break then in the soonest release thereafter.

    I didn't write the code in the KB article though -- I think Shawn did (I know he is a proponent for using custom cultures to solve problems such as this).

    The code seems okay to me, though I probably would have bullet-proofed it for the case where the update had not been loaded and thus only the old names were valid -- since a non-specific catch for obvious, known and explainable errors is never ideal. But that may have affected the readability so I have no strong feelings for code that shouldn't ever need to be run more than once per machine....
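    To make the "bullet-proofing" idea concrete, here is a rough sketch in Python rather than in the KB article's own code. It tries the new culture name first, falls back to the old one, and catches only the one error it expects. The registries, the CultureNotFound exception and the function names are all hypothetical stand-ins; only the en-CB/en-029 pair comes from the question above.

```python
# Hypothetical registries standing in for the culture names a machine knows.
KNOWN_BEFORE_UPDATE = {"en-CB"}
KNOWN_AFTER_UPDATE = {"en-029"}


class CultureNotFound(LookupError):
    """Raised when a culture name is not registered (illustrative only)."""


def lookup(name, registry):
    """Return the name if the registry knows it, otherwise raise CultureNotFound."""
    if name not in registry:
        raise CultureNotFound(name)
    return name


def resolve(new_name, old_name, registry):
    """Prefer the new name, fall back to the old one, catching only the expected error."""
    try:
        return lookup(new_name, registry)
    except CultureNotFound:
        # Known, explainable case: the update is not installed yet.
        return lookup(old_name, registry)


print(resolve("en-029", "en-CB", KNOWN_BEFORE_UPDATE))
print(resolve("en-029", "en-CB", KNOWN_AFTER_UPDATE))
```

    Any other kind of failure is left to propagate, which is the point of avoiding a non-specific catch.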

    But even now I would rather the problem was addressed in the product at all times, automatically.

     

    This post brought to you by 𝑖 (U+1d456, a.k.a. MATHEMATICAL ITALIC SMALL I)

  • Sorting it all Out

    Wouldn't you bet, Fret (aka You've got 50 ways to fix your characters)

    Via the Contact link, 'Fret' comments:

    I've been debugging why my app doesn't accept 0x0218 and 0x021B characters from the keyboard on Vista.

    You say it's not a windows bug. What I'm seeing in the debugger is that first I get a WM_KEYDOWN, which I translate to 0x0218 via ToUnicode and then after that I get a WM_CHAR, and the wParam is 0x003f, where is where our question mark comes from. So I'm thinking that windows is trying to convert the WM_KEYDOWN into a WM_CHAR and failing because it's not in the ANSI codepage? Applications that process just the WM_KEYDOWN work properly, but if you happen to like processing WM_CHAR's then your out of luck.

    At least thats what I think is going on. I could be wrong.

    I'm seriously considering rewriting my apps so they don't use WM_CHAR at all. But maybe that is overkill and there are less drastic fixes available.

    Regards

    No need to fret, er... Fret! 

    Of course it isn't a Windows bug, and you definitely do not have to stop processing WM_CHAR notifications and move processing to WM_KEYDOWN.

    All you have to do is compile your applications to use UNICODE/_UNICODE, and then you won't see everything being converted out of Unicode and into a code missing the characters you want to use.
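    For anyone wondering where the 0x003f comes from, here is a minimal sketch in Python (not the actual Win32 code path) of what happens when a character is forced through a code page that lacks it. It assumes Windows-1250, the usual ANSI code page for Romanian system locales, and uses Python's cp1250 codec as a stand-in for the conversion done on behalf of a non-Unicode window. The helper name to_ansi is made up for illustration.

```python
def to_ansi(text, codepage="cp1250"):
    """Encode text to a legacy code page, substituting '?' for anything it lacks."""
    return text.encode(codepage, errors="replace")


# The older cedilla letters exist in Windows-1250 and survive the trip.
print(to_ansi("\u015e\u0163"))

# The comma-below letters do not, so each one collapses to 0x3F.
print(to_ansi("\u0218\u021b"))
```

    A Unicode build never makes that trip through the code page, which is why the question marks go away.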

    Now if you really don't want the applications to support Unicode then you are out of luck a little and have to go to those extreme measures (I say extreme since you either have to define the mappings yourself or you have to still call a bunch of Unicode keyboarding functions within your non-Unicode applications to be able to query what characters the keystrokes map to -- a painful fallback plan, all things considered)....

    It is one of the reasons that the issues in Not everyone does the right thing for Romanian are so painful and why the moves in usage from

    • ş (U+015f, LATIN SMALL LETTER S WITH CEDILLA) to ș (U+0219, LATIN SMALL LETTER S WITH COMMA BELOW)
    • Ş (U+015e, LATIN CAPITAL LETTER S WITH CEDILLA) to Ș (U+0218, LATIN CAPITAL LETTER S WITH COMMA BELOW)
    • ţ (U+0163, LATIN SMALL LETTER T WITH CEDILLA) to ț (U+021b, LATIN SMALL LETTER T WITH COMMA BELOW) 
    • Ţ (U+0162, LATIN CAPITAL LETTER T WITH CEDILLA) to Ț (U+021a, LATIN CAPITAL LETTER T WITH COMMA BELOW)

    have problems for so many applications that were never moved to Unicode since people figured they never had to make the move....
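    The four moves listed above can be written down as a small translation table. The Python sketch below shows the mapping and a quick check of whether a string still fits in a legacy code page; the choice of cp1250 and the function names are assumptions for illustration only.

```python
# Cedilla forms mapped to the comma-below forms from the list above.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # small s
    "\u015e": "\u0218",  # capital S
    "\u0163": "\u021b",  # small t
    "\u0162": "\u021a",  # capital T
})


def modernize(text):
    """Replace the cedilla letters with their comma-below counterparts."""
    return text.translate(CEDILLA_TO_COMMA)


def fits_codepage(text, codepage="cp1250"):
    """Report whether every character in text exists in the given code page."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


old = "\u015fi"            # "and" spelled with the cedilla form
new = modernize(old)
print(fits_codepage(old), fits_codepage(new))
```

    The text written with the older letters fits in the code page and the modernized text does not, which is the squeeze that non-Unicode applications find themselves in.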

    So moving the fonts and the keyboards to use the newer characters due to the linguistic pressure but not putting that same pressure on the applications is bound to lead us to where we are now.

    It is time to fix those applications!

    In the end, people in Romania are as unlikely to speak fluent question mark (which is that handy 0x3F that Fret is seeing) as anyone else in the world, which is why it is unfortunate that there are so many common applications out there that still aren't doing the right thing in regard to Unicode....

     

    This post brought to you by ? (U+003f, a.k.a. QUESTION MARK)

  • Sorting it all Out

    Don't look directly at the 951 code page if you can avoid it

    K. M. Leung asks over the Suggestion Box:

    Big5 Unicode conversion in .net 2.0.

    I have read your article "Kowloon 951" and Ji Cheng's question. I know that there are a couple of ways to twist .net 2.0's encoding, such as changing the EncoderFallback. What if I am using BizTalk 2006's flatfile PipeLine encoding?

    From the latest unicode version, Big5 9563 should converted to UTF16 8137 (As Cheung said, it is the case with Framework 1.1 and HKSCS). As you have said, .net 2.0 came with its own encoding tables. Can we say that it is a bug for .net 2.0 to convert Big5 9563 to UTF16 E77F and Microsoft should provide a fix? Below is the code that you can run under (1.1 + HKSCS) and (2.0)

               using System;
               using System.Globalization;
               using System.Text;

               class Program {
                   static void Main() {
                       // Big5 lead and trail bytes, written as space-separated hex.
                       string tsource = "95 63";
                       string[] ahex = tsource.Split(' ');
                       byte[] source = new byte[ahex.Length];

                       // Parse each hex pair into one byte.
                       for (int i = 0; i < ahex.Length; i++) {
                           source[i] = byte.Parse(ahex[i], NumberStyles.HexNumber);
                       }

                       // Convert from Big5 to big-endian UTF-16.
                       byte[] target = Encoding.Convert(Encoding.GetEncoding("Big5"), Encoding.BigEndianUnicode, source);

                       // Format each result byte as two hex digits.
                       string tTarget = "";
                       foreach (byte bb in target) {
                           tTarget += bb.ToString("X2") + " ";
                       }

                       Console.WriteLine("Big5: " + tsource + " is converted to UTF16: " + tTarget);
                       Console.ReadLine();
                   }
               }

    (The Kowloon 951 post can be found here.) 

    This is actually by design.

    The whole "code page 951" hack for HKSCS was just that, a hack, and not at all intended to be the way that HKSCS should be supported in Windows (the real HKSCS support needs, and has, its own solution that does not involve returning results that are not in the original code page 950, which treats the code point in question as part of the Big5 and Unicode private use areas for EUDC, respectively)....
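    As a small sketch of why U+E77F is a private use placeholder rather than a "real" mapping: it falls in the Basic Multilingual Plane's private use area (U+E000 through U+F8FF), while U+8137 is an ordinary ideograph. The Python below only classifies code points and prints them as big-endian UTF-16 bytes in the same style as the reader's program. It does not call any Big5 codec, and the function names are hypothetical.

```python
def is_bmp_private_use(code_point):
    """True if the code point lies in the BMP private use area U+E000..U+F8FF."""
    return 0xE000 <= code_point <= 0xF8FF


def utf16be_hex(code_point):
    """Format a code point as space-separated big-endian UTF-16 bytes."""
    return " ".join(f"{b:02X}" for b in chr(code_point).encode("utf-16-be"))


for cp in (0x8137, 0xE77F):
    kind = "private use (EUDC)" if is_bmp_private_use(cp) else "not private use"
    print(f"U+{cp:04X} -> {utf16be_hex(cp)} : {kind}")
```

    A check like this is one way to spot text that picked up EUDC code points during a conversion, without trying to guess what they were meant to be.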

    It was really a short term mistake in Windows that .NET 1.0 and 1.1 inherited by accident of all the code page support coming from Windows.

    I would definitely recommend against trying to use the Encoding support in .NET >= 2.0 to munge any code page....

    On a not entirely unrelated note (and perhaps partially in recognition of issues I pointed out in posts like this one!), just this last month, the Chinese Language Interface Advisory Committee (CLIAC) made it clear that their intent is to in the future only assign code points to HKSCS when they have been assigned to Unicode (you can read about this in the NOTICE entitled Revised Principles for the Inclusion of Characters in the HKSCS, or you can read the 通告 in Chinese entitled 修訂《香港增補字符集》字符增收原則).

     

    This post brought to you by 脷 (U+8137, a CJK Unified Ideograph that appears to be in neither the Taiwan CNS-11643 standard nor Microsoft's Big5 code page)

  • Sorting it all Out

    Where does the time go? And when I said where, I meant when?

    The question I was asked via the Comment link by William was:

    Very offtopic since this isn't an internationalization question, but I see you have answered questions about time zones before. So maybe you can answer mine.

    Why does the time change happen at 2am?

    Well, I could start by making it more of an internationalization question, and point out that it does not change at 2:00 AM for all time zones!

    The biggest degree of difference is a time zone that has its daylight savings change happen at noon, and there are several that happen right at midnight or an hour or less from it.

    Thankfully neither happens in the US, for very good reasons in both cases:

    • Having it happen during normal waking hours is a nightmare for the working world, with one work day an hour shorter and another an hour longer. And how are store hours handled? A real awful situation twice a year, even if you make it less of a pain by doing it on the weekend.
    • Having it happen an hour or less from midnight means that normal calculations for what day it is can easily either skip days/repeat them, or oscillate between days (since the day will change, then DST can make it change back for a bit).

    Assuming that 2 AM is the cutoff everywhere can cause problems as serious as those caused by It doesn't always happen on the hour problems, due to an algorithm making a bad assumption about the day or the time....
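    Here is a rough Python sketch of the "oscillating day" problem for a change that happens close to midnight. The zone is entirely made up for illustration (UTC+1 in summer, UTC+0 in winter, falling back when the summer wall clock reaches 00:30), and it is modeled with fixed offsets rather than real time zone data.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical zone for illustration: UTC+1 in summer, UTC+0 in winter,
# with clocks falling back one hour when the summer wall clock reaches 00:30.
SUMMER = timezone(timedelta(hours=1))
WINTER = timezone(timedelta(hours=0))
# 00:30 summer time on 2007-10-28 is 23:30 UTC on 2007-10-27.
TRANSITION_UTC = datetime(2007, 10, 27, 23, 30, tzinfo=timezone.utc)


def local_time(utc_instant):
    """Wall-clock time in the hypothetical zone for a given UTC instant."""
    tz = SUMMER if utc_instant < TRANSITION_UTC else WINTER
    return utc_instant.astimezone(tz)


def local_dates(start_utc, steps, step=timedelta(minutes=15)):
    """Local calendar dates seen while real time moves steadily forward."""
    return [local_time(start_utc + i * step).date().isoformat() for i in range(steps)]


start = datetime(2007, 10, 27, 22, 45, tzinfo=timezone.utc)
for i in range(6):
    print(local_time(start + i * timedelta(minutes=15)).strftime("%Y-%m-%d %H:%M"))
```

    Over those six quarter-hour steps the local date runs 27, 28, 28, 27, 27, 28, so any code that assumes the date only ever moves forward gets a surprise.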

    Although this does not follow the usual definition of a localizability bug, it is not too far off of one given that its primary cause is code assuming that the time zone rules could never be different.

    That and the fact that such issues care in a more literal way about remembering that the world is round (the kind of phrase that is used more speculatively to talk about localizability/internationalization issues!), and it ends up feeling entirely relevant and on-topic as a localizability issue! :-)

     

    This post brought to you by Ť (U+0164, a.k.a. LATIN CAPITAL LETTER T WITH CARON)

  • Sorting it all Out

    If you're reading this as soon as it's live, you're not only wasting your time but you might not have been watching the Indians beat the Mariners

    Nothing technical and if you enjoy any of what I usually write about then I doubt this will interest you. Truly, you will probably want to skip.... 

    One never knows when I'll write something in this blog about sports -- it is pretty rare and kind of unusual in what sport it might be, like remember the golf thing last year?

    It is hard to avoid sports when I am around my parents though, since they are such huge fans of the Indians and the Browns.

    For fun, the Indians are playing a double header against the Mariners, and the Indians get the home field advantage in the Mariners home field (the magic of make-up games; you lose a bit on the home field crowd but who cares if you get to bat last?).

    Watching their sports interest always makes me wonder if I was switched at the hospital or something? :-)

    Anyway, you get the point here, I'm sure -- even though I have no interest in either baseball or football, it is like when Jules said My girlfriend is a Vegetarian, which pretty much means I'm a Vegetarian, too -- it is like in the air here, so I end up having to soak it all in. I know more about the Indians right now than I probably did when I lived near Cleveland.

    So the day before my birthday (today) they play this double header. If they lose, I hope everyone's mood is not affected (if I had a choice I would probably root for the Mariners but I doubt I have a choice in this case if I don't want to sleep in the car or maybe in the Sukkah tonight). :-(

    In unrelated news1, the story of Antonio Heston and his $19 adventures on the corner of North High and 6th (less than a block south of my old apartment when I was living in Columbus over a decade ago) caught my eye for a totally different view on sports (conceptually no worse than a weekend in Cincinnati with the Bengals). It is weird, when I was around OSU it never seemed like the football players had trouble finding willing companionship, which made the story seem that much stranger to me. I admit I don't know much about the prices for such things but this is like when Bart Simpson sold his soul for $5 -- it seems like both should have been for more.

    In even less related news2, the fact that Case Western Reserve assistant professor and Fulbright scholar Marixa Lasso is not being allowed to renew her visa or to come back from Panama really sucks, and is a trend that I hope is not being expanded. Granted this country is unpopular enough throughout the world that wanting to be here might well be grounds to suspect people3, but in this case she has a job and a husband here. Who are they kidding?

    (I have two friends who went to school at Case, neither of whom knew the professor but both of whom interestingly enough liked her name, one even called her name fabulous though this may indicate she does not get out enough4!)

    Back on topic (well, sort of!), my own disenchantment with sports was less exciting, though. It actually started with football, when back in high school the fact that people thought I looked a bit like Bernie Kosar used to mean that I would get my ass kicked from time to time in a season when every play was a handoff to Kevin Mack and there were many more losses than some fans were willing to put up with -- it is easy to blame a very short running game. It is when I decided to find something other than sports to hold my interest, most of the time. If you know what I mean.

    I guess I find myself hoping that the Indians win both games today, not because I want them to (I don't), but because I don't really care that much either way and it will make folks around here happier if they do.

    I could start talking about Chief Wahoo and the negative stereotypes of Native Americans but that likely deserves its own blog post which I probably won't write; it is bad enough that some yahoo will erroneously claim that Microsoft thinks hookers aren't charging enough5 after reading my oblique reference to the OSU news story.

    I'll post this up for tonight after one (and probably both) of the games is over even though it was written before either started, which might help put it all in perspective....

    Good luck, Antonio. I hope it all works out5. And please let Marixa back in5.

     

    1 - The downside of being in Cleveland is that I end up reading the paper.
    2 - I really need to stop reading the paper, soon.
    3 - This is a joke, not a serious statement.
    4 - Hopefully she will realize I am joking about this or I may owe a dove bar or a latte here!
    5 - To my knowledge, this is an official position of neither Microsoft nor any of its officers.

     

    This post brought to you by ½ and ½ (U+00bd and U+00bd, a.k.a. VULGAR FRACTION ONE HALF and VULGAR FRACTION ONE HALF, a.k.a. HALF AND HALF)

  • Sorting it all Out

    It is better when geopolitical issues turn out to not be a Thorn in one's side

    There are many out there who think I am some sort of paid shill for Microsoft, but I am not.

    I will admit that posts like How to avoid stepping in it help these people prove their point, since I felt free to poke a bit of fun at Google for making a bit of a geopolitical boo-boo in Google Earth.

    But on the other hand, I also have been known to point out now and again when Microsoft products make mistakes, so when regular reader Mike pointed out in the Suggestion Box:

    I just installed Family Tree Maker 2008, which has a new Places feature linking births/deaths/marriages etc to Microsoft Virtual Earth map data. I was somewhat bemused to see that suggestions for placenames (presumably derived from the MS geodata) included "Wales, England" and "Ireland, United Kingdom". I think it would come as a shock to both Wales and Ireland to find that they are parts of these entities. Wales and NORTHERN Ireland are part of the United Kingdom, and Wales and all of Ireland are part of the British Isles, but it's a bit of a concern to see the data generating such politically-sensitive nonsense.

    I am not afraid to say that if it repros then it is a bug that should be fixed (no matter whose bug it is), though to be honest I don't know if it is true since I cannot make either show up for me in http://maps.live.com/ .

    Now I can type Wales, England into the search box with the Map option and see Wales, but the search string is no longer there so it could just be them trying to be helpful for someone ignorant of the political situation. They certainly find a map either way so for the site at least it looks like they are just being polite and not calling someone who visits ignorant? :-)

    I do note that a similar search of Ireland, United Kingdom can't find anything while Northern Ireland, United Kingdom gives me a map, so they do seem to have fixed this problem and not added any "helpful yet maybe offensive to a few people" hints!

    (I did get a single tack without a map that said Ireland, Shetland Islands, Scotland, United Kingdom in it -- clearly it recognized the ill-formed nature of my query?)

    So without knowing how to query the data the way Family Tree Maker 2008 does, I do not honestly think I can determine with certainty whose bug this in fact is (if it still repros?).

    Though to be honest, as long as the map that was displayed was of Northern Ireland rather than all of it, I would have a hard time calling the Ireland, United Kingdom offensive -- since it kind of silently corrected you. It is definitely worse to show the incorrect text than to respond to a query, though if Family Tree Maker 2008 is doing this then it should be fixed no matter whose bug it is (and if it is Microsoft's I still think fixing it would be a good idea!).

    The Virtual Earth side of things from the site seems to be working well now, in any case. But if something isn't then I am certainly not above the occasional nudge in the right direction now and again, whether in a Microsoft product/service, or the products/services of anyone else. :-)

     

    This post brought to you by þ (U+00fe, a.k.a. LATIN SMALL LETTER THORN)

  • Sorting it all Out

    Some random off-topic, non-technical, and generally obnoxious observations from a man on vacation (aka A largely symbolic gesture)

    New reader William Overington asked over in the Suggestion Box:

    In various recent articles in your blog you have mentioned lots of postings and discussions about the encoding of emoji and emoticons in Unicode.

    Yet those postings are not in the Unicode public mailing list.

    Could you please consider writing a blog article about what is in fact happening in this field as I, and maybe some other readers who are not representing an organization which is a member of the Unicode Consortium, would like to become aware of what is the present situation.

    William Overington
    20 September 2007

    Some of the conversations in question are happening on the "core" Unicode list for members, though I usually do not quote from there because of a general "trying to respect some kind of implied confidentiality of stuff there" thing, which I only feel comfortable violating when people multipost and/or crosspost with a bunch of other lists.

    Luckily people do that a lot, especially on the silly mails of the sort that a person such as myself might quote. :-)

    On the other hand I am not really a list secretary or ambassador and only talk about pieces that I think might be interesting to cover. My interests, though wide and varied, are neither consistent nor predictable, so it is hard to really know what one might get on a given day.

    Also sometimes I say things like "tomorrow I'll talk about _________" where the blank is filled in with underwater basket-weaving or international happy fun ball tryouts or a woman I went out with years ago or whatever. What I actually mean is that I will write about it tomorrow -- and the actual post could be tomorrow or in a few days or next week or even never. And I am only mildly ashamed to admit that the biggest factors gating this are

    • how many posts are in there already (vacation is building up a backlog that is increasing the count while perhaps decreasing the quality and entertainment value!), and
    • whether or not I have thought of a title yet.

    Back to William's question, if you look on the List of UTC Subcommittees, you may see the Symbols subcommittee. I do not know the precise rules by which membership happens, though I do know that each subcommittee has their own mailing list. Thus far its principal defining characteristic has apparently been that every single message has been posted to the core Unicode members list as well, meaning twice the mail for me since the list server does not consolidate those duplicates.

    You know the expression "If you take out the curse words, he didn't say anything at all." ? No? Well, it exists. 

    I have been posting periodically about the work there when issues seemed interesting, but it has been a while since they have (in my opinion) so if I filter out all of the uninteresting stuff then I have nothing to report. :-)

    If that changes I'm sure I'll say a few words....

    I am probably being much less respectful about symbols than I could, probably out of fear for the really tortuous things that at least the conversations are attempting to do to (if they have their way formerly) ironclad encoding rules. I feel that the non-letters need to be kept in their place, which includes sitting in the back of the bus and entering via the back door to the kitchen if they want food. The front of the bus and the tables in the diner are reserved for real letters and numbers!

    Or what did the barkeep at the cantina in Star Wars say? Something about how we don't serve their kind here. Your symbols? They'll have to wait outside" or somesuch.

    You get the point, I'm sure. You probably did many metaphors ago and just kept reading to humor me on this one.

    I do apologize if the words offend some symbols, but I have to assure you that it was almost certainly intentional. :-)

    But that's just my opinion, I could be right....

     

    This post brought to you by (U+25cc, a.k.a. DOTTED CIRCLE)

  • Sorting it all Out

    What's Mac 2008 got, you wonder?

    • 1 Comments

    Regular reader Tom asks:

    I see the new Office for Mac will have 3 additional localizations, which is great. But how about input/display support for Arabic, Hebrew, and Devanagari -- any hope of seeing that?

    Regards, Tom

    To be honest, I wish I knew. I tried to find a beta copy of Office Mac 2008 to install on my MacBook Pro on Microsoft's internal corporate network and have been unable to find it. I have literally no idea what is there, at all.

    (If any internal MS folks know where I ought to be looking they should feel free to send me a bit of email on this -- I am admittedly still finding my way around on the CorpNet with the Mac!)

     

    This post brought to you by (U+0936, a.k.a. DEVANAGARI LETTER SHA)

  • Sorting it all Out

    They say 'No news is good news', right? Unfortunately, here is some news

    • 0 Comments

    Not too long ago, Ivan Petrov had a banner day in the Suggestion Box (I answered two other questions from that day here and here). His very first question was about something slightly different:

    Hi Michael :-)

    Did you know something about any future release of the Keyboard Convert Service?. I mean will it have future version that will add Windows Vista Support?

    Regards,

    Ivan.

    I guess I could call today Answering Ivan day or something. :-)

    Now I have talked about the Keyboard Convert Service many times before, like in:

    And I think I have proven that any time something is going on with this tool I'll be talking about it...

    Because I'm a fan, so if I know about it, all of you get to know about it (whether you care or not).

    There is nothing quite like a captive audience, and all of you -- all 30 of you -- are it!

    But I have my ear to the ground and have heard no rumblings about update plans for this fine tool -- so the news for right now is that there is no exciting news about future plans.

    Though more info about what you mean specifically would be feedback I could give to the team that owns the tool, so feel free to add such info to the comments. :-)

     

    This post brought to you by (U+0e1c, a.k.a. THAI CHARACTER PHO PHUNG)

  • Sorting it all Out

    Also *not* new in Vista SP1 -- Bulgarian keyboard layouts

    • 14 Comments

    After mentioning this issue, regular reader Ivan Petrov asked over in the Suggestion Box:

    Hi Michael,

    Is Windows Vista SP1 going to add support for the Traditional (Legacy) Bulgarian PHONETIC Keyboard Layout?

    Regards,

    Ivan.

    Kind of the same answer as the other question, but Vista SP1 has a very hardcore philosophical principle -- no features, only bugs, and even then only serious regression bugs with no backcompat consequences.

    This definitely leaves adding new keyboard layouts of unknown lineage out of the mix.

    Sorry about that.... :-(

     

    This post brought to you by (U+0e1b, a.k.a. THAI CHARACTER PO PLA)

  • Sorting it all Out

    *Not* new in Vista SP1 -- the Add font dialog....

    • 7 Comments

    Regular reader Ivan Petrov asked over in the Suggestion Box:

    Hi Michael :-)

    Are there going to be any changes in Vista SP1 to the UI of the Add font dialog?

    Regards,

    Ivan.

    Sorry Ivan, Vista SP1 has a very hardcore philosophical principle -- no features, only bugs, and even then only serious regression bugs with no backcompat consequences.

    Given that, the Add Font dialog (which has for the most part sucked since its design was finalized back in Windows 3.1, although some updates have happened, as I talked about in About the Fonts folder in Windows, Part 2 (aka Adding Fonts)) hardly qualifies as a serious regression (or indeed as any kind of regression) or as anything but a feature.

    The whole folder and the dialog actually belong to the Shell team (not the Typography team), and as such have really not had much opportunity for growth or change other than some of the behind-the-scenes changes mentioned in that post above. Until and unless that ever changes, it is hard to see how the dialog really could do much....

     

    This post brought to you by (U+0e1a, a.k.a. THAI CHARACTER BO BAIMAI)

  • Sorting it all Out

    Documented, schmockumented! It's still kind of cool....

    • 5 Comments

    (No, this post is not about my social life or anything related to it, though I suppose there may have been times the title might have been partially descriptive¹; this is a technical post and also a world premiere discussion of an obscure but accidentally not-yet-documented-but-nevertheless-included-in-the-SDK flag for two of the most important GDI functions for rendering text!)

    I still have a few posts left in that series I've been working on, though being out of town has caused a minor break in the rhythm there. It will be resuming soon.

    Meanwhile over in the Suggestion Box, posts have been building up, and I figured I should pick a few of them off....

    Fer² instance, there's Tihiy's question:

    Hello Michael!

    Can you suggest me a way to covert font character glyphs string back to Unicode string? I'm intercepting ExtTextOut to be able to read word under the cursor and i want to have human-readable string when ETO_GLYPH_INDEX is passed!

    Funny question there, one that just came up almost in the end game of Vista before it shipped.

    The answer won't exactly be a direct answer to Tihiy's question, but it will supply the method and go from there....

    Regular readers may remember when I was talking about device fonts in posts like Printing TrueType as graphics and Device fonts are people too.

    Well, one of the things that happened in Vista is that a lot more printing via glyph ID values was happening, a factor which (among other things) forces device fonts to not be used.

    Glyph ID values are strongly tied to specific fonts, you see -- so device fonts were being taken out of the running.

    Now while this is all well and good for Uniscribe and especially complex scripts, it is not so good for many of the East Asian scripts that I was talking about in that second post, where people were relying on device fonts that were fully loaded (containing all the glyphs that were needed) and were more performant than the system that had now become the default in so many circumstances.

    The difference in performance for some cases was bad enough to be considered a legitimate regression, and therefore something had to be done.

    I took a look in the Vista SDK header files and saw that the constant being used to trigger this effort was not documented, and so with the permission of the folks behind the code³ in Uniscribe and GDI, I am going to break the news here -- I am sure it will be in some future update to the SDK docs.

    The constant is in WinGdi.h, circa line 185:

    #if (_WIN32_WINNT >= _WIN32_WINNT_LONGHORN)
    #define ETO_REVERSE_INDEX_MAP        0x10000
    #endif

    This new-in-Vista constant, ETO_REVERSE_INDEX_MAP, will basically (in a TextOut/ExtTextOut call to a device like a printer) try to convert a bunch of glyph ID values back to characters again, using a map that it builds up from the font's "Format 4" Microsoft/Unicode subtable of the CMAP.

    This code works great with simple fonts that don't do other mappings, but only with those -- because

    • any time a string you pass has no such simple mapping back to a character for every glyph ID, the code will just cut out and use the glyph ID values;
    • any time a string you pass has multiple characters mapping to the same glyph ID⁴, the code will detect this ambiguous case and again use the glyph ID values as they are;
    • any time the glyph ID values would actually have been obtained via other means (like the GSUB glyph substitution table or via more advanced features like VERT for vertical writing), no characters will be found.
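
    The fallback rules in the list above can be sketched in portable C. This is only an illustration of the idea, not the actual GDI code: the names (CmapEntry, build_reverse_map, glyphs_to_chars), the tiny MAX_GLYPHS limit and the flat character-to-glyph array are all hypothetical, and the sketch does not model the space/hyphen exception from footnote 4 or anything that comes from GSUB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GLYPHS 64          /* hypothetical tiny font, for illustration only */
#define NO_CHAR    0xFFFFFFFFu /* no character maps to this glyph ID */
#define AMBIGUOUS  0xFFFFFFFEu /* more than one character maps to this glyph ID */

/* One forward (character -> glyph ID) mapping, as a simple cmap would give it. */
typedef struct {
    uint32_t ch;
    uint16_t glyph;
} CmapEntry;

/* Build the reverse (glyph ID -> character) map; reverse must hold MAX_GLYPHS entries. */
void build_reverse_map(const CmapEntry *entries, size_t count, uint32_t *reverse)
{
    assert(entries != NULL && reverse != NULL);
    for (size_t i = 0; i < MAX_GLYPHS; i++)
        reverse[i] = NO_CHAR;
    for (size_t i = 0; i < count; i++) {
        uint16_t g = entries[i].glyph;
        if (g >= MAX_GLYPHS)
            continue;                   /* outside the toy font */
        if (reverse[g] == NO_CHAR)
            reverse[g] = entries[i].ch; /* first character seen for this glyph */
        else if (reverse[g] != entries[i].ch)
            reverse[g] = AMBIGUOUS;     /* a second, different character */
    }
}

/* Convert n glyph IDs back to characters.
 * Returns 1 and fills out[] only if every glyph has exactly one character;
 * returns 0 otherwise, meaning the caller should keep the glyph IDs as they are. */
int glyphs_to_chars(const uint32_t *reverse, const uint16_t *glyphs,
                    size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (glyphs[i] >= MAX_GLYPHS)
            return 0;
        uint32_t ch = reverse[glyphs[i]];
        if (ch == NO_CHAR || ch == AMBIGUOUS)
            return 0;
        out[i] = ch;
    }
    return 1;
}
```

    The all-or-nothing return value mirrors the behavior described above: one unmapped or ambiguous glyph ID anywhere in the string means the whole string stays as glyph ID values.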

    Of course for the specific scenarios that inspired the work to be done, the feature is sufficient, but for even mildly complex cases involving ligatures or glyph substitutions or complex scripts in general, it will not assist at all, really.

    And of course Tihiy's question of how to obtain the results of the mapping is not helped at all (unless one is a printer driver!).

    But for all the limitations here, it certainly does provide a roadmap to how one might do the work to try to reverse the process and convert glyph ID values to characters if one wanted to handle more complex cases, whether the ones I suggested above or more complex ones like digit substitution based on settings, reordering found in some Indic scripts, or even accepting ambiguous mappings and taking the first mapping as it is, etc.

    One would have to dig into OpenType a bit, but a few GetFontData calls and some code that starts as a reverse to the code in KB241020, opening up additional OpenType tables and subtables as desired for the text in question, and one is in business!
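
    As a rough starting point for that kind of project, here is a minimal sketch in portable C of reading a "Format 4" subtable and walking it backwards. It assumes the subtable bytes are already in memory (on Windows they could come from GetFontData, and locating the subtable inside the cmap table is not shown), and the function names cmap4_lookup and cmap4_first_char_for_glyph are hypothetical. It takes the first character found for a glyph ID, so it accepts ambiguous mappings rather than rejecting them, and it knows nothing about GSUB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 16-bit value, which is how OpenType tables store them. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Forward lookup in a cmap format 4 subtable of len bytes.
 * Returns the glyph ID for ch, or 0 (.notdef) if unmapped or the table looks bad. */
uint16_t cmap4_lookup(const uint8_t *t, size_t len, uint16_t ch)
{
    assert(t != NULL);
    if (len < 14 || be16(t) != 4)
        return 0;
    size_t segCount = (size_t)be16(t + 6) / 2;
    if (14 + 8 * segCount + 2 > len)
        return 0;                             /* the four arrays do not fit */
    size_t endCode = 14;
    size_t startCode = endCode + 2 * segCount + 2; /* skip reservedPad */
    size_t idDelta = startCode + 2 * segCount;
    size_t idRangeOffset = idDelta + 2 * segCount;

    for (size_t i = 0; i < segCount; i++) {
        if (ch > be16(t + endCode + 2 * i))
            continue;                         /* segments are sorted by endCode */
        uint16_t start = be16(t + startCode + 2 * i);
        if (ch < start)
            return 0;                         /* falls in a gap between segments */
        uint16_t delta = be16(t + idDelta + 2 * i);
        uint16_t ro = be16(t + idRangeOffset + 2 * i);
        if (ro == 0)
            return (uint16_t)(ch + delta);    /* arithmetic is modulo 65536 */
        /* Otherwise ro is a byte offset from this idRangeOffset slot into glyphIdArray. */
        size_t off = idRangeOffset + 2 * i + ro + 2 * (size_t)(ch - start);
        if (off + 2 > len)
            return 0;
        uint16_t glyph = be16(t + off);
        return glyph ? (uint16_t)(glyph + delta) : 0;
    }
    return 0;
}

/* Reverse lookup by brute force over the BMP: the first character that maps
 * to glyph, or 0xFFFF if there is none. glyph 0 (.notdef) is never reversed. */
uint16_t cmap4_first_char_for_glyph(const uint8_t *t, size_t len, uint16_t glyph)
{
    if (glyph == 0)
        return 0xFFFF;
    for (uint32_t ch = 0; ch < 0xFFFF; ch++)
        if (cmap4_lookup(t, len, (uint16_t)ch) == glyph)
            return (uint16_t)ch;
    return 0xFFFF;
}
```

    A real implementation would build the whole reverse table once per font instead of scanning per glyph, and would then layer the GSUB and script-specific handling on top of it.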

    Now in the long run, this is the kind of thing I would love to see built in, but obviously features can't be decided solely on the basis of what people like me (or maybe people like you!) think is cool; there have to be real scenarios like measurable performance issues found with Japanese device fonts not being utilized. Until then, this could be an exciting project for some ISV to work on, or maybe a sample I could try to put together at some point (my last bit of digging into OpenType stuff was over 18 months ago in Getting all of the localized names of a font, I think I might be due at some point!).

    Also, keep the concepts behind the post in mind, they will provide assistance in a quite unrelated feature I will be posting about over the next few weeks sometime....

     

    1 - Perhaps fodder for future, non-technical blog posts if there is sufficient interest!
    2 - Intentionally misspelled to try to give the illusion of SiaO being "jus country"⁵,⁶!
    3 - Thanks, Sergey and Mike!
    4 - An exception to this rule is some specific characters commonly mapped to the same glyph, like U+0020/U+00a0/U+2002/U+2003/U+3000 mapping to the space and U+002d/U+00ad/U+2010/U+2011/U+2012 for the hyphen; in such cases, the mappings will not be considered ambiguous, even though the text mapped may or may not match the original code points that were converted to glyph ID values.
    5 - Another intentional misspelling.
    6 - Also, an unintentional misuse of the idiom due to lack of knowledge of how to represent it and spell it!

     

    This post brought to you by  (U+2012, a.k.a. FIGURE DASH)

  • Sorting it all Out

    A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

    • 2 Comments

    Previous posts in this series:

    Now that I have been talking about collation in Windows across ten separate blog posts, I thought it might make sense to talk about the characters in Unicode that take up more space in the standard than any others -- ideographs.

    Whether you call them Han or Hanja or Kanji, they are all basically Chinese characters used in either Chinese, Korean, or Japanese.

    The story for collating these items was not created in a vacuum, but there also were no simple, uncomplicated sources that were used in its creation.

    The collation story is in fact kind of a messy one, due to many different factors:

    • The tables were mostly not updated for multiple versions of Windows despite the fact that more and more characters were coming into general use;
    • Most of the characters that were not added to the tables had some weight, just not the one to put them in the correct order;
    • Some of the characters actually had no weight, with the predictable results thereof;
    • In the case of pronunciation based sorts, the "most common" pronunciation of some characters actually changed over the course of the last 10+ years.

    But the goal is quite simple:

    1. In the default table, put all of the ideographs after almost everything else in Unicode -- first regular CJK, then Extension A, then Extension B, in code point order for each section.
    2. For each specific East Asian language, put the relevant ideographs in the expected order for the expected sort in question.
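
    The first goal can be illustrated with a small comparison function in C. This is a sketch under stated assumptions, not the actual Windows sort tables: the function names are hypothetical, the ranges are the Unicode block ranges for CJK Unified Ideographs, Extension A and Extension B, and "almost everything else in Unicode" is simplified to "every non-ideograph, in code point order".

```c
#include <assert.h>
#include <stdint.h>

/* Section rank for the default table sketch:
 * 0 = not an ideograph, 1 = regular CJK, 2 = Extension A, 3 = Extension B. */
int ideograph_section(uint32_t cp)
{
    assert(cp <= 0x10FFFF);                             /* valid Unicode range */
    if (cp >= 0x4E00 && cp <= 0x9FFF)   return 1;       /* CJK Unified Ideographs */
    if (cp >= 0x3400 && cp <= 0x4DBF)   return 2;       /* Extension A */
    if (cp >= 0x20000 && cp <= 0x2A6DF) return 3;       /* Extension B */
    return 0;
}

/* Compare two code points: first by section, then by code point within the
 * section. Returns a negative, zero, or positive value like strcmp does. */
int compare_default(uint32_t a, uint32_t b)
{
    int sa = ideograph_section(a);
    int sb = ideograph_section(b);
    if (sa != sb)
        return sa < sb ? -1 : 1;
    if (a != b)
        return a < b ? -1 : 1;
    return 0;
}
```

    Note that U+4E00 sorts before U+3400 here even though its code point is higher, because regular CJK comes before Extension A. The language-specific tables in the second goal would replace the within-section code point order with the expected order for that sort.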

    There are many different collations across the various locales, and I have talked about various issues in many different posts, from Why is there no pronunciation-based sort for Japanese? to Supporting a pronunciation based sort for East Asian languages... to Is it Macau or is it Macao? to 'Acceptable' Japanese sort order? and more.

    The simple fact is that trying to order over 70,000 items is going to be complicated, though hopefully as intuitive as it can be....

    Now prior to Vista there were several specific problems in the tables:

    • Missing ideographs
    • Some overlap between the language specific table and the extras meant to be put at the end
    • A few mistakes
    • A few changes in official source data (like for most common pronunciation)
    • Missing support of the expected repertoire in several national standards.

    Though even with addressing all of these problems, there was a problem (in some people's minds) starting with Vista -- an issue I hypothetically discussed in If you add enough characters to a sort, intuitive distinction can suffer and then more directly in On distinctions that are primarily with [and without] difference. That latter post even had a nice high-level narrative of several of the various East Asian sorts.

    In the next post I'll dig in a bit and provide some examples with different sort keys across different locales....

     

    This post brought to you by (U+247e, a.k.a. PARENTHESIZED NUMBER ELEVEN)
