Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
Two of the funnest parts of speaking at this year's Internationalization and Unicode Conference were:
1) I got to talk about some of the things I do now as a part of my work.
2) I got to talk about some of the cool things that are a part of the upcoming version of Windows.
I imagine talking about some of the more interesting facets of these two things might keep me busy for some time!
You can think of this as the pre-blog for many of the future blogs discussing the very features that you can see more of if you have the //Build Developer Preview.
One important issue to note is that although both the presentation and discussion over the coming months will in some ways seem quite organized, a lot of the actual work was much more disjointed, based on the needs of many different teams and partners and customers.
The apparent organization has much more to do with applying a consistent set of rules and principles across a true Hungarian Goulash of different requests and features and bug reports....
Also, while I'll be able to speak of some of the rules we're following as obvious, if I simply look at many of the items as I did just a few years ago, I can promise you that we learned a lot of the core methods that define the nature of this critical sliver of the job in this very version. Many of the lessons learned are in retrospect obvious and would have been invaluable to know five years ago, and I am completely prepared to forgive every single member of the original NLS team for not recognizing these "obvious truths" since they weren't terribly obvious to me either, at the time!
I'll also set down some really important ground rules:
I will not be talking about any of this work as innovative.
While I am not judging the individual uses of The "I" Word by many people involved with Windows 8, it is hard not to feel like the word is being at least overused if not misused. So I'm going to emphasize the common sense nature of much of the work, and since the only thing innovative about it is being right where I and others used to be wrong, it feels inappropriate to use that word in the context of what I'll be discussing.
I am not going to talk about creating experiences or sharing experiences or anything like that.
Perhaps the nature of this Blog is in some sense a form of marketing of International and World-Ready features, but my view of some of the basic facets of locales and keyboards and formatting and parsing and collation does not allow me to re-purpose the way I talk about these things. It may be a lack of creativity on my part, but it would feel fake to me to talk about things using these particular "buzzwords". I think I'm a bit too grounded in what I think of as "the experience of what feels broken" to really feel comfortable talking that way.
Perhaps these ground rules cause my presentation to feel very different than many others, even those given by others at the recent IUC. I don't find them to be insincere, but I know I'd feel insincere if I tried to emulate them....
One last rule:
I have neither inclination nor desire to violate either non-disclosure agreements or marketing news cycles related to Windows 8.
This last rule seems obvious, but I don't want anyone to misunderstand my intent here, or what I want to accomplish. Any time I talk about stuff you haven't heard before, it is only due to the fact that they are doing other things right now, not because I am disclosing anything that you couldn't have found yourself by spelunking through the //Build Developer Preview, or eventually the Beta.
I hope you enjoy the things I talk about, and the things that I point out. My goal is to enjoy the trip, and further I hope you enjoy it, too.
Okay, we can now let the games begin!
Previous posts in the series are Improving genitive. Or not.... (part 1) and Improving genitive. Or not.... (part 2): Explaining the point of Part 1. and Improving genitive. Or not.... (part 3): The hazards of "off label" usage.
Now in response to that third part, Van commented:
So I guess the question is why we can't just leave the genitive month algorithm as is, but incorporate two new month name variables - an invariant nominative and genitive - that will implement regardless of or contrary to the genitive month algorithm. I mean, if your old assumptions don't work anymore, keep the old stuff around for when it's needed and implement a new way that will work, no?
Okay, so in that one question there are actually several ideas:
Obviously point #1 is what we have been doing to date. Unfortunately.
Point #2, if the eventual fix was to be done by Microsoft, has no specific purpose since all that data is already available. The exception would be if the goal was to give customers or ISVs an easier way to do their own parsing and formatting, but that also has little point since it is possible now by formatting just "MMMM" vs. "d MMMM".
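To make the "MMMM" vs. "d MMMM" point concrete, here is a minimal Python sketch. The two Russian month tables and the `format_date` helper are hypothetical names made up for this illustration, and the rule it applies (use the genitive name whenever a day token appears in the pattern) is a simplifying assumption, not the actual Windows algorithm.

```python
import re

# Hypothetical sample data: Russian month names in two grammatical cases.
NOMINATIVE = {1: "январь", 5: "май"}
GENITIVE = {1: "января", 5: "мая"}

# Only the two picture tokens needed for this illustration are recognized.
TOKEN = re.compile(r"MMMM|d")


def format_date(day, month, pattern):
    """Format a date using only the 'd' and 'MMMM' picture tokens."""
    # Assumption for this sketch: a day token anywhere means genitive.
    use_genitive = "d" in TOKEN.findall(pattern)
    names = GENITIVE if use_genitive else NOMINATIVE

    def render(match):
        return names[month] if match.group() == "MMMM" else str(day)

    return TOKEN.sub(render, pattern)


print(format_date(5, 1, "MMMM"))    # month name on its own
print(format_date(5, 1, "d MMMM"))  # month name next to a day number
```

A caller who wants both forms can simply format with each pattern, which is why a separate pair of "invariant" name variables would add little.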
The central idea of trying to fix the bug would be to extend the current support, in ways that do not break expectations.
It seems unlikely that they will do that, given the wide variety of uses that are happening (which is a lot of what this series is about). Changes would have to exist for both parsing and formatting, and perhaps the only hope of that happening is some future version where there are compelling scenarios to fix.
Up until now, the judge has consistently dismissed the case due to lack of substantive evidence of a problem that needs fixing.
Put simply, there has to be a "smoking gun" to make that happen.
At some point, I'll talk more about what probably would be required here, were such a campaign to be launched. At this point, I'm not launching one, or even advocating it happen. But it can be a useful exercise to walk through what would be needed to get resources allocated....
Previous posts in the series are Improving genitive. Or not.... (part 1) and Improving genitive. Or not.... (part 2): Explaining the point of Part 1.
I thought I'd do something different today.
Now all the way back in 2004, near the end of the second month of this Blog, I wrote What the %$#! are genitive dates?, which was remarkable for several reasons:
The first and third parts are connected -- despite the fact that I did well on every test in school that explained the genitive case, the fact is that I really was unable to usefully do anything with it until I learned something about a language that had a separate spelling for the separate case (Russian). It just seemed too theoretical, you know?
My experience with reflexive verbs was the same; until I learned about them for Hebrew, passing tests didn't prove I understood the concepts. It only proved that I took tests well.
Many people have described their introduction similarly, so I know it wasn't just me....
Anyway, there are several locales representing languages that, like English, did not have spelling changes associated with genitive month names.
Some of these locales found another use for the "genitive months" feature.
Like if they wanted the month name on its own to be capitalized but in a sentence they wanted it to be lower-cased.
Locales like Portuguese, for example.
And as you can imagine, this "off label" usage of the Windows "genitive months" feature doesn't always work as people would perhaps wish for.
It is even occasionally reported as a bug, this "inconsistent" capitalization.
Just as genitive months don't always work as customers might hope, this "off-label" use of the feature has significant limitations, too. Since it wasn't designed for the scenario, etc....
But just as in the pharmaceutical industry, it can make sense to figure out whether these other usages can be helpful too, and they can tailor the "prescription" to work better. If the Dev team knew about this alternate usage, they wouldn't have invented the next Rogaine or Viagra (conceptually speaking, I mean - we are just talking about software!), but they would have had yet another reason to try and make it work better!
Previously, in Improving genitive. Or not.... (part 1), I didn't say very much.
But I suppose in a way I didn't have to, if I wanted you to read between the lines!
By pointing to an example of a specific limitation in the way that the "detect if genitive is required" algorithm works, one can craft a format string that is mis-detected as requiring the genitive form.
And the supposition I made in that blog was that no prior format ran into this problem in any locale we shipped and thus was previously a non-issue.
I mean, a custom format could cause it, but such customizations are really really really really really really really rare.
And the fact that no one ever reported it before supports my supposition.
My further supposition was that once a format existed that runs into the problem, that no one wanted to fix the algorithm to make it better at detecting the need for genitive months.
I think this second supposition is in some measure more provably true since the algorithm is largely unchanged from the original one written back in 1993.
The first supposition might be incorrect if people had been complaining but we never heard about it, right? :-)
I guess I'm making the argument that the algorithm should be improved. The current test is (essentially - I'm not going to post source but for people with access who want to follow along it is in base\Win32\winnls\nlslib\datetime.c) that if:
Then it decides you're genitive.
Okay, so Latvian trips them up here with its format.
They have a period, which is an obvious way to say "new sentence" to a small Latvian child, though this algorithm that is old enough to buy liquor in some jurisdictions can't tell the difference when the two items near each other span a sentence boundary.
Of course that character is sometimes also a date separator, so it is easy to criticize but much harder to fix!
Ultimately no one ever considered whether to tweak the algorithm to better catch this case.
It seems like it would be a worthy bug to fix at some point, right? :-)
Previous parts in this series:
I have a T-shirt with the caption:
There's no place like 127.0.0.1
which is kind of fun, and a nice way to help distinguish geeks in some situations.
It makes the topic today even more topical, which I guess just makes it a better topic?
The limitations I mentioned in Part 10 above related to the way the hope of Unicode support in DNS is dashed by the requirements of NetBIOS, and more to the point by the way the DNS and NetBIOS names are kept in sync by a convention that many people depend on.
That limitation is a pretty bitter one, since the whole point of IDN is really to allow names that support a significant chunk of Unicode, yet the "default" name for just about everything you might do with a machine is largely kept in the world of code pages.
Moving forward, it is easy to assume that backcompat requirements will keep the situation from ever changing.
Thankfully that assumption ignores the one way we can make sure we can get out from under Microsoft's implementation of NetBIOS's ugly limitations.
(gratuitous embedding of the song performed by the secret-wg in the closing plenary of the RIPE 55 conference follows)
There is one very cool thing about IPv6.
Well actually, there are a buttload of cool things about IPv6, but most of those things aren't relevant to this Blog and only one is relevant to this blog.
It is the fact that the standard defines no direct relationship to NBT, aka NetBIOS over TCP/IP, aka NetBIOS.
So, in a pure IPv6 world, unless Microsoft specifically adds such a relationship, and thereby snatches defeat from the jaws of victory, an IPv6 only world is a DNS only world.
And thus an IPv6 only machine does not need to run NetBIOS -- or at least not connect the two together.
And the machine name no longer has to be dependent on the default system locale, or more specifically the CP_OEMCP.
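As a rough illustration of why that dependency hurts, here is a small Python sketch. The helper names and the sample machine name are hypothetical, and Python's `cp437` and `cp866` codecs are only stand-ins for whatever CP_OEMCP happens to be on a given machine; the `idna` codec is Python's built-in IDNA encoder, not anything Windows uses.

```python
def fits_code_page(name, code_page):
    """Return True if every character of name can be encoded in code_page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def to_dns_label(name):
    """Encode one label to its ASCII-compatible (Punycode) form."""
    return name.encode("idna").decode("ascii")


host = "почта"  # hypothetical machine name, Russian for "mail"

for code_page in ("cp437", "cp866"):
    # The same name is valid or invalid depending on the OEM code page.
    print(code_page, fits_code_page(host, code_page))

# The DNS side has no such dependency; any such label gets an ASCII form.
print(to_dns_label(host))
print(to_dns_label("bücher"))
```

The name that is perfectly fine for DNS only survives the code page check when the machine happens to be configured for the right one.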
Now there's "many a slip twixt a cup and a lip", as they say.
But one of the new commitments I will now have as a part of my larger IDN responsibilities is to make sure no one adds back that dependency....
I drew a picture of you
you and your anchor tattoo
and saw the face that I knew
covered in shame.
You drew a bird that was here
a kind of sweet chanticleer
but with a terrible fear
that the cage couldn't tame.
That's how I knew this story would break my heart --
when you wrote it.
That's how I knew this story would break my heart.
So, like a ghost in the snow
I'm getting ready to go
'Cause, baby, that's all I know --
how to open the door.
And though the exit is crude
it saves me coming unglued
for when you're not in the mood
for the gloves and the canvas floor.
That's how I knew this story would break my heart --
When you wrote it.
That's how I knew this story would break my heart
This blog you are reading here right now is, I will admit, not one I am entirely comfortable with.
It isn't about the song lyrics above, at all really. It is just what was playing when I was finishing up the blog.
Though there is a passive thing that is vaguely topical for the song near the end, though it would be a huge stretch to call the story a heart breaker.
In any case, it isn't the thing I'm uncomfortable about that this blog is referring to.
It's about the targeted ads that are used by Facebook.
Now I admit I mostly find them amusing, though this is due to a personal policy of mine:
The first policy is to make sure that my friends don't use the unusual and inconsistent nature of my "likes" as proof that I'm deranged when they would otherwise see lots of my likes that the second policy inspires.
It's an imperfect system, mind you. I mean, some of my weird likes will bleed over and people will see them from time to time.
On the other hand, my actual likes can sometimes catch people unaware, so I don't worry about it too much.
That isn't what is making me uncomfortable.
I have noticed that it is doing more than using my "likes" to find out about me.
It pays attention to networks I belong to, like when Amazon job ads point out how much more convenient work would be in South Lake Union than my commute to the East side.
No doubt this is gleaned imperfectly since I live on the East side but come downtown a lot, and I work for Microsoft which is mostly on the East side.
Silly, to be sure, especially since a little info exists on my profile that clearly indicates I live on the East side, no matter how many nights I try to avoid ending up there sometimes.
But hardly a big deal.
That also isn't what is making me uncomfortable.
From a schooling perspective, I list two high schools, only one of which I graduated from, but both of them list the same graduating year, despite them being 363 miles away from each other.
And although I have some college (spread across three states, four schools, and ten years), I have exactly *zero* college degrees, though I do have some (now long expired) credentials as a phlebotomist, CMA, and most of an R. EEG/EP T. Plus for purposes of Facebook, none of this post high school schooling/credentials are listed.
I also have Multiple Sclerosis, as regular readers no doubt know.
And I obviously work for Microsoft, thus the location of my Blog.
So I don't know entirely what to make of the most recent bunch of ads that seemed targeted on three different possible meanings of M.S.:
This is frankly kind of creeping me out.
Now I do claim to like some universities and colleges, though mostly they are schools that people I was dating were at, or ones I visited to see friends. Maybe it was guessing I had a degree that I wasn't claiming? It still seems odd.
But those who see my events I go to know that I go out a lot, and have lots of female friends, a few of (but not most of!) whom I have even dated or wanted to date, and none of whom claim any sort of disability.
Of course I'm still single, so maybe Facebook's targeted ad system is suggesting I'll have more luck changing my hunting grounds?
Perhaps it is just the dating service's website that thinks so.
The closest I perhaps ever came to dating someone who also has a disability, I didn't answer an email sent more than five years prior and didn't connect her to the woman I later met until almost 48 hours later.
I'm pretty sure I let that one slip away!
Anyway, for some reason I find some of these ads make me pretty uncomfortable, and not just because they don't seem to apply at all to me but because a combination of Facebook's targeting and the advertisers' targets all seem to make it appear several others are thinking I'm someone I'm not....
So, the movement from LCIDs to names continues apace (ref Your LCID sucks and It is true that your LCID sucks, but your LANGID sucks more).
Of course not everyone looks at LCIDs the same way.
Thus Decimal vs. hexadecimal LCIDs, backcompat, and being weird, for example.
Now I know Office and DevDiv and SQL Server tend to use decimal values for LCIDs and LANGIDs.
Which I do not.
And ordinarily I am not judgmental about alternate world views.
Again, I'm not.
Though in this case I think they are definitely wrong.
I realized just a few days ago when Murray told me that over in Office they had to change the size of some data structures that they were storing LCIDs as strings.
Because they were going off the end of the four character buffer they were using to store them in with some of the bigger LANGID values.
If they were just using hexadecimal digits then the buffer would have been enough.
On the other hand, they could have stored them as numbers and it all would have just worked irregardless. :-)
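The arithmetic behind that overflow is easy to check. A LANGID is a 16-bit value, so it never needs more than four hexadecimal digits but can need five decimal ones. The sketch below is only an illustration: the sample values and helper names are my own picks, not the ones Office actually ran into.

```python
BUFFER_CHARS = 4  # the four-character buffer from the story


def decimal_text(langid):
    """Render a LANGID the way the decimal camp would store it."""
    return str(langid)


def hex_text(langid):
    """Render a LANGID as hexadecimal digits, without a prefix."""
    return format(langid, "X")


def fits(text):
    """True if the text fits the buffer (terminator not counted here)."""
    return len(text) <= BUFFER_CHARS


# Arbitrary sample 16-bit values, from small to the largest possible.
for langid in (0x0409, 0x0C0A, 0x2809, 0xFFFF):
    dec, hexa = decimal_text(langid), hex_text(langid)
    print(f"{dec:>5} fits={fits(dec)!s:<5}  {hexa:>4} fits={fits(hexa)}")
```

Anything from 10000 (0x2710) upward is where the decimal string stops fitting, while the hexadecimal form fits for every possible 16-bit value.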
Today's blog's title should be read with a Groucho Marx accent....
Over on the Unicode List, Andreas Prilop asked:
There are three so-called "Yiddish digraphs" in Unicode:
U+05F0 wawayim
U+05F1 waw yod
U+05F2 yodayim
What is specifically Yiddish about these digraphs? They can be used in the same way in Hebrew. But this isn't done. Why not?
Why should Yiddish be written with special digraphs but Hebrew with sequences of two letters?
But even in Yiddish, the digraphs are not really used:
The Unicode Standard says:
| ... to distinguish the digraph double vav from an occurrence
| of a consonantal vav followed by a vocalic vav.
By that reasoning you would need an English digraph "sh" to distinguish "sh" in "***" from "s-h" in ***hole. ;-)
Ah yes, the Yiddish digraphs!
Lots of people jumped in and the consensus was along the lines of "I'm not sure, but I think it's legacy".
Thankfully the guy who should be writing the "Every Character Has a Story" book jumped in to add some surety:
On 10/19/2011 12:08 PM, Mark E. Shoulson wrote:
> I think the issue here is (probably) a matter of legacy encodings,
> though someone else would need to confirm that.
O.k., as self-appointed historian of the standard, I guess I need to be the one to answer that. ;-)
The Yiddish digraphs were added to the basic set of Hebrew letters for Unicode 1.0 on behalf of the Research Libraries Group, for compatibility with their existing usage on the Research Libraries Information Network (RLIN).
Digging very deep in the old mailbox, I located email from Joan Aliprand of the Research Libraries Group, dating from July 11, 1991 confirming this, and noting that "I pushed very hard for inclusion of the Yiddish digraphs tsvey vovn and tsvey yudn."
It is my recollection that the 3rd digraph was added during the discussion of the addition of those two.
At any rate, there is your legacy encoding source for these. Whether or not the digraphs are used in *current* Yiddish data (or would even be recommended for such use) is not relevant to reasons for the original inclusion.
And there we go -- every digraph has a story, too!
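For anyone who wants to poke at the three characters, here is a small Python sketch using the standard `unicodedata` module. The pairing of each digraph with a two-letter sequence is my own assumption based on the character names, and the helper name is made up for the example. It shows that normalization does not fold a digraph into the corresponding pair of letters, so the two spellings stay distinct in data.

```python
import unicodedata

# Assumed pairing of each digraph with its two-letter spelling
# (U+05D5 is VAV, U+05D9 is YOD).
DIGRAPHS = {
    "\u05F0": "\u05D5\u05D5",
    "\u05F1": "\u05D5\u05D9",
    "\u05F2": "\u05D9\u05D9",
}


def same_after_normalization(digraph, letters, form="NFKC"):
    """True if the digraph and the letter pair normalize to the same text."""
    normalized_digraph = unicodedata.normalize(form, digraph)
    normalized_letters = unicodedata.normalize(form, letters)
    return normalized_digraph == normalized_letters


for digraph, letters in DIGRAPHS.items():
    print(
        f"U+{ord(digraph):04X}",
        unicodedata.name(digraph),
        same_after_normalization(digraph, letters),
    )
```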
This blog is sponsored by our three Hebrew/Yiddish Digraph friends....
Okay, first the joke. My dad sent it to me....
An Englishman, a Scotsman, an Irishman, a Welshman, a Latvian, a Turk, a German, an Indian, several Americans (including a southerner, a New Englander, and a Californian), an Argentinean, a Dane, an Australian, a Slovakian, an Egyptian, a Japanese, a Moroccan, a Frenchman, a New Zealander, a Spaniard, a Russian, a Guatemalan, a Colombian, a Pakistani, a Malaysian, a Croatian, an Uzbek, a Cypriot, a Pole, a Lithuanian, a Chinese, a Sri Lankan, a Lebanese, a Cayman Islander, a Ugandan, a Vietnamese, a Korean, a Uruguayan, a Czech, an Icelander, a Mexican, a Finn, a Honduran, a Panamanian, an Andorran, an Israeli, a Venezuelan, a Fijian, a Peruvian, an Estonian, a Brazilian, a Portuguese, a Liechtensteiner, a Mongolian, a Hungarian, a Canadian, a Moldovan, a Haitian, a Norfolk Islander, a Macedonian, a Bolivian, a Cook Islander, a Tajikistani, a Samoan, an Armenian, an Aruban, an Albanian, a Greenlander, a Micronesian, a Virgin Islander, a Georgian, a Bahaman, a Belarusian, a Cuban, a Tongan, a Cambodian, a Qatari, an Azerbaijani, a Romanian, a Chilean, a Kyrgyzstani, a Jamaican, a Filipino, a Ukrainian, a Dutchman, an Ecuadorian, a Costa Rican, a Swede, a Bulgarian, a Serb, a Swiss, a Greek, a Belgian, a Singaporean, an Italian, a Norwegian and 47 Africans walk into a fine restaurant....
The maître d' scrutinizes the group one by one and bars their entrance saying, "Sorry, you can't come in here without a Thai."
So, now that you have Thai on your mind....
I wrote it just the other day near the end of June.
An irresistible force walks into an immovable object (aka the Thai that binds us), I mean.
You know, the thing about the PUA character hidden on the Thai Pattachote keyboard:
After due and careful consideration, the campaign to remove the PUA looks like it will win over the rule about changing a keyboard.
But the important question then is what to put there instead....
Like there is the link to the Oracle site (http://download.oracle.com/docs/cd/E19253-01/817-2521/asian.supported.locales-246/index.html) that appears to be Phinthu -- U+0e3a.
Or Peter found another possibility:
This site, which sells keycap stickers, shows a Pattachote layout with basic keys plus three extras not found on all hardware. One has nikahit with nothing on its shifted state. There isn’t any phintu or lakkangyao, though both are very rarely used characters....I’d probably replace the PUA code point with lakkangyao: I think there’s more likelihood of someone typing that than phintu. E.g., you need to type it if you’re enumerating the complete alphabet, but not phintu.
Fair enough, perhaps lakkangyao sounds reasonable.
I'll run it up the flagpole and see if anyone salutes it....
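If you want to look at the candidates yourself, a few lines of Python with the standard `unicodedata` module will do. This is just a sketch: the `describe` helper is made up for the example, and U+E000 is an arbitrary Private Use Area code point picked to show what a PUA character looks like to the tools, not the one that is actually on the keyboard. Note that the Unicode character names spell the candidates PHINTHU and LAKKHANGYAO.

```python
import unicodedata


def describe(code_point):
    """Return (label, name, general category, is_private_use)."""
    ch = chr(code_point)
    category = unicodedata.category(ch)
    # PUA characters have no name, so supply a placeholder default.
    name = unicodedata.name(ch, "<unnamed>")
    return (f"U+{code_point:04X}", name, category, category == "Co")


# The two candidates discussed above, nikhahit, and a sample PUA value.
for code_point in (0x0E3A, 0x0E45, 0x0E4D, 0xE000):
    print(describe(code_point))
```

The general category "Co" is the private use marker, which is exactly why a PUA code point on a shipping keyboard is a problem: nothing in the standard says what it means.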
I'm at the 35th Internationalization and Unicode conference in Santa Clara this week.
A great "Day 1" (aka Tutorial Day) which among other things included an awesome set of conversations with Thomas Milo, as well as several others.
Anyway, we're at Grand Hyatt in Santa Clara.
We are just a few blocks away from Yahoo, which has a special attraction for me on the first night of the conference.
They host a welcome reception with free food and drinks!
How better to support the abusive relationship I'm in with my liver? :-)
Anyhow, the time arrived, I started to roll over to Yahoo, up on two wheels (me I mean, not Yahoo!).
It's a funny thing rolling at 3.5mph, the way that I was slightly faster than the other people. I got to catch up with Rick McGowan from Unicode, and the keynote speaker (Laura Welcher, Director of Operations of The Rosetta Project). I saw another Berkeley PhD linguist Unicode colleague (Debbie Anderson), and anyway we all headed into the party.
Shortly after I got there, and before I even had my first drink (thus proving alcohol had nothing to do with the incident!), I accidentally ran over a power strip, and suddenly the slides being projected stopped.
Brave Yahoo employees tried to fix the situation, daring to step in at great risk to themselves since they had no idea if I might run over them too, but they were unable to resolve the situation.
Summary: an employee of Microsoft went to a reception at Yahoo and broke Yahoo hardware.
I want to assure Yahoo that I am responsible for the broken hardware and will fully reimburse for it.
I mean, if I can dun Google for failing to pay expenses1,2, I need to be willing to pay when I make Yahoo have to pay to fix stuff that was my fault!
So Yahoo, I'll pay. Just let me know via the "Contacting Me" link, and I'm sorry for any inconvenience!
1 - ref Attn: Google - Amount due: USD$307.50 (FOURTH AND FINAL NOTICE)
2 - by the way, Google did reimburse me shortly after that blog. They didn't even ask for the receipts and if I had been a less honest person I could have claimed more. Lucky for them. :-)
I will be the first to admit that my job is kind of unusual.
Right now? Officially, I'm a Program Manager.
But I am not on a feature team -- I'm on a "central" team. One that works with colleagues in test to own the World-Readiness tenet in Windows 8.
If you look at the CSP definitions of a Program Manager, the atypical nature of the team's role would ordinarily be a bit of a liability.
Thankfully, our management has worked to define much of the team's role and they do not just ignore those differences at review time (this has happened to me on other teams in the past, from time to time!).
Okay, so we are an odd bunch.
Then I also have other work on the Language Enablement team.
I essentially own the website that provides a view to the locale data -- and that web server sits in my office.
There are even bugs assigned to me periodically to fix when they come up. A web developer?!? Yikes!
I and one of my colleagues are the principal owners of the locale data in Windows 8. This data eventually ends up being widely distributed across the company.
There is also the checkin of locale data changes/updates/additions -- I own those checkins.
This means I can often expect to have code bugs assigned to me.
In many cases we are waiting for information from language and market experts. So sometimes I have to sit on bugs while waiting for an answer.
And that means that I can often expect to see mail coming in suggesting the problem of "active code bugs not assigned to development."
I guess they're talking about me. Sigh.... :-)
Oh yeah, I also assist some of the PMs from my former team who wanted to increase language coverage in keyboards.
Basically I'm their developer. Between that work and new languages we've added more keyboards to this version of Windows than have been added in some time....
More "code bugs not assigned to developers" mails!
Also, I answer a lot of questions -- to folks all over the company. Guru?
And since I slip in, answer quickly, and leave, some kind of Ninja Guru!
This takes up a lot of time....
Oh and I'm speaking at the IUC this week, too.
Okay, so somewhere between program manager and webmaster and web developer and speaker and developer? I guess that's the job.
It's a living....
Now that I'm done, I remember that there are a bunch of other things, like blogger and guy doing the IDN plan and so on. Jeez I'm glad my review appeared to be a bit more organized!
Genitive month name usage is a feature that has been in Windows for a while.
But the original feature hasn't completely done the trick. I mean, in the long run.
Put simply, we simply didn't "grow" the feature the way we were growing other areas when new locales challenged our older perceptions.
Like we had a maximum documented size for a field in a locale but then added a new locale that needs a bigger value?
That has happened many times.
But then when you consider genitive date support, what about bugs like the one in Latvian. Genitive. Oops, specifically?
Now in looking at the earlier versions of Windows when the whole "genitive months" feature was added, it appears that the first versions of it worked properly.
But over time we ended up with new locales, and in some cases those locales commonly used formats that were harder to detect.
Ideally the algorithm would be tweaked (this was my #1 in that earlier blog).
But as we get further and further out and no one reports the bugs, we lose track of things. And it is easy to have people give us the two different data items -- the genitive month names and the various formats -- and not pay as much attention to the interaction between them....
As a feature I don't expect our architectural support to get worse, but I worry that we don't invest enough to keep up with new issues in the existing support. Like this bug....
I have a few other issues I'm curious and/or concerned about here, which I'll cover in future parts. Stay tuned!
Over on the Unicore List, the question was a familiar one:
I am converting text in an ANSI-encoded document to Unicode using search and replace in Notepad on Windows Vista SP2. The source document contains text in the 8-bit CSX+ encoding for Indic transliteration. A chart of the CSX+ encoding is available at: http://homepage.ntlworld.com/richard.wordingham/10646/CSX+.htm
In the CSX+ encoding, ASCII 254 'þ' represents the character 'ḥ' (h-underdot). When I perform a search and replace of ASCII 254 'þ' to Unicode 'ḥ' U+1E25 LATIN SMALL LETTER H WITH DOT BELOW, the operation not only converts all instances of 'þ' to 'ḥ', but also all instances of 'th' to 'ḥ'. For example, the word 'rathaþ' gets caught in the replace and is changed to 'raḥaḥ'.
This is rather unexpected behavior. I would consider this an error, but perhaps a very well-intentioned one, given that the phonetic representation of 'þ' in Old English is in fact /th/.
Is there some internal Windows mechanism that treats ASCII 254 þ as being canonically equivalent to 'th'? Or, perhaps is the equivalent rule the dastardly deed of some Old English language enthusiast turned techie? :)
Best,
Anshuman
Yet another "misuse" of Notepad beyond the old UTF-8 BOM? :-)
This one is kind of my fault, too.... not directly since I am not the one who changed Notepad, but I am the one who added the function and then pushed them to use it in Notepad (fixing the problems I pointed out in blogs like When Notepad's Find doesn't and The fallacy of comparing out of context and so on more than half a decade ago).
In Vista, bringing FindNLSString brought the full power of Windows collation to the Find/Replace capabilities of Notepad.
So all of the various Unicode canonical forms will always be equal and so on.
This is a good thing.
Unfortunately for Anshuman, it also brings our EXPANSIONS along.
In particular, the following two entries:
0x00de 0x0054 0x0048 ;TH
0x00fe 0x0074 0x0068 ;th
The only locale whose sorting negates this equivalence is Icelandic.
Perhaps if one is running on Vista or later, switching to an Icelandic user locale (aka "Standards and Formats") will provide a workaround for the Thorn in your side.
Well, this one, at least!
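To see how two little expansion entries produce Anshuman's result, here is a toy Python model. It is only a sketch under loud assumptions: `EXPANSIONS`, `expand` and `linguistic_replace` are names invented for the example, the table holds nothing but the two entries quoted above, and the matching is a crude stand-in for real collation-based searching rather than what FindNLSString does internally.

```python
# Toy expansion table mirroring the two entries quoted above.
EXPANSIONS = {"\u00de": "TH", "\u00fe": "th"}


def expand(text):
    """Expand text; also return, per expanded char, its source index."""
    chars, origin = [], []
    for index, ch in enumerate(text):
        for piece in EXPANSIONS.get(ch, ch):
            chars.append(piece)
            origin.append(index)
    return "".join(chars), origin


def linguistic_replace(text, needle, replacement):
    """Replace matches found by comparing expanded forms, not code points."""
    expanded, origin = expand(text)
    key, _ = expand(needle)
    if not key:
        return text
    result, copied, pos = [], 0, 0
    while True:
        hit = expanded.find(key, pos)
        if hit < 0:
            break
        end = hit + len(key)
        # Accept only matches that cover whole source characters.
        starts_clean = hit == 0 or origin[hit - 1] != origin[hit]
        ends_clean = end == len(expanded) or origin[end] != origin[end - 1]
        if starts_clean and ends_clean:
            result.append(text[copied:origin[hit]])
            result.append(replacement)
            copied = origin[end - 1] + 1
            pos = end
        else:
            pos = hit + 1
    result.append(text[copied:])
    return "".join(result)


word = "ratha\u00fe"
print(word.replace("\u00fe", "\u1e25"))             # ordinal: only the thorn
print(linguistic_replace(word, "\u00fe", "\u1e25")) # expansion-aware: 'th' too
```

The ordinal replace is what Anshuman expected; the expansion-aware one is what a search that treats þ and 'th' as equal will give you.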
Over in the Suggestion Box, long time reader and friend Ted asked:
Metro/Windows 8 - actually I'm kinda surprised that neither Metro nor WinRT show up here yet (at least when I use the search). There must be something interesting relating to i18n to talk about in these areas. At least that's what it looks like (judging from the BUILD conference).
Well then how about Visual Studio vNext for Win32 desktop - like in msvcr110 they've finally moved to using locale names instead of locale IDs internally.
Thinking back to the earlier days of this Blog and how there were so many things I talked about here first before anyone else, it may seem odd that I am taking a back seat now as Windows 8 is now available in the form of the Developer Preview.
Okay, maybe it is a little odd.
But since then I've grown up a bit!
Perhaps it's a subconscious reaction to the fact that I had received mail from Frank Shaw (VP in charge of corporate communications) since then. Nothing accusatory or anything, which might have been cause for a freak out I suppose. But even worse, it represented intelligent questions. This is worse because it implies I am a little on the radar, which makes being a little less of a cowboy a responsible idea.
Or maybe it is that Chris Capossela who I first knew back in 1997 when he was a junior program manager (and he owned the Access Setup Wizard, which I was the dev on) is now the company's Chief Marketing Officer -- and Frank's boss. No directives once again, but the idea of being a little more responsible just comes to mind because I'd rather not be mucking in the message and randomizing him or anyone who works for him.
More likely there are good explicit reasons to stand back and let things unfold as my division President Steven Sinofsky is blogging himself and there aren't necessarily good reasons to be writing about things before there is enough to write about that people can use. I mean, I'm hardly against sliding in the mustard, but I need some context in which to slide it.
Another big difference -- we aren't talking about slow platform evolution like adding a new function or even a few new functions - we are talking about a new platform whose best practices are best to talk about after people are confident they have a good first cut of what is best.
Not to mention we are in that phase where DCRs and such can easily happen as small gaps in scenarios are identified. While such gaps might be natural for me to cover here ordinarily, it would require me to be a jerk to use the Blog as a way to try to force DCRs to happen in a certain way. And perhaps I really would like to not be a jerk if I can avoid it.
I have had a role in Windows 8 related to locales and keyboards, and at the upcoming Internationalization and Unicode conference in Santa Clara, I'll be talking about some of those things -- things that are available now if you have the Developer Preview, though I'll be able to discuss things in a more focused way since there isn't much of an externally available roadmap.
I won't be talking so much about Metro or WinRT since that's pretty far afield of my topics, but there are many things that are natural results of the way things have been working and how things should be working.
Everyone else is excited about that new platform -- and they should be. It is pretty cool! But even these "nominally less cool" pieces are pretty interesting.
I'm pretty excited about all of it, because some of it is really cool stuff. Perhaps not cool compared to modern touch based apps, but cool from the point of view of things I think are important. And that some of you probably think too....
There are even some potential lessons for those who use and/or contribute to and/or manage the Unicode CLDR project, from their older brother who was doing this before they were even an idea -- lessons that in some cases we learned the hard way. Perhaps we can spare them some extra pain.
But Metro and WinRT are largely topics for the future, as the final shape of both of them and of their best practices are defined and it makes sense to start putting my spin and take on things.
So stay tuned. And if you will be at the IUC say hi and let me know this blog was one of the reasons you came! :-)
Yesterday, someone was having problems with Locale Builder.
And with the locales themselves, in some cases.
The problem goes something like this:
Short date format #1, that has period delimiters, works just fine -- see the Preview?
And short date format #2, which has dashes as the delimiter, also looks good. See the Preview?
But if you look at the two added formats at the bottom of the list, things get more confusing:
What the hell is going on here, exactly?
Well, it comes down to the fact that (since Vista RTM) derived LOCALE_SDATE happens to be U+002e (FULL STOP).
Now it just so happens that / (aka U+002f aka SOLIDUS) is a special character to NLS.
It is the placeholder in which to insert the LOCALE_SDATE value.
How can you get the SOLIDUS to not be misunderstood as the placeholder?
Ah, that part is easy -- just create a format that specifies the character as a literal, thus:
In other words, adding
when you want
and so on....
Now remember that with LOCALE_SDATE being derived, it is always going to be safer to specify the format with literals....
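Here is a rough Python sketch of the behavior described above, for those who like to see it run. It is a toy model and not the NLS implementation: the `render` function and its `sdate` parameter are invented for illustration, it only handles numeric day, month and year tokens (no day or month names), and it assumes literals are written between single quotes, as they are in NLS date format pictures. A bare SOLIDUS is swapped for the separator value, while a quoted one is passed through untouched.

```python
from datetime import date

def render(picture, when, sdate="."):
    """Toy date-picture renderer: bare '/' becomes sdate, quoted text is literal."""
    out, i = [], 0
    while i < len(picture):
        ch = picture[i]
        if ch == "'":
            # Quoted literal: copy everything up to the closing quote verbatim.
            close = picture.find("'", i + 1)
            if close < 0:
                close = len(picture)
            out.append(picture[i + 1:close])
            i = close + 1
        elif ch == "/":
            # Unquoted SOLIDUS is the placeholder for the date separator.
            out.append(sdate)
            i += 1
        elif ch in "dMy":
            j = i
            while j < len(picture) and picture[j] == ch:
                j += 1
            n = j - i
            if ch == "d":
                out.append(f"{when.day:02d}" if n >= 2 else str(when.day))
            elif ch == "M":
                out.append(f"{when.month:02d}" if n >= 2 else str(when.month))
            else:
                out.append(f"{when.year:04d}" if n >= 4 else f"{when.year % 100:02d}")
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)

d = date(2011, 10, 5)
print(render("dd/MM/yyyy", d))      # 05.10.2011  (placeholder replaced)
print(render("dd'/'MM'/'yyyy", d))  # 05/10/2011  (literal preserved)
print(render("dd-MM-yyyy", d))      # 05-10-2011  (dash is not special)
```

The same quoting trick works for any separator you want to pin down, which is why spelling formats out with literals is the safer habit when the separator value is derived.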