Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
So over in the Suggestion Box, Steve commented:
For fun purposes: The "Apple logo" PUA Codepoint 0xF8FF mentioned in http://unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT maps to a Klingon empire "Glyph" in the Code2000 font ;-))
Well, this wasn't really a secret Easter Egg for Apple....
Is it perhaps more of a way of helping Michael Everson's ISO/IEC JTC1/SC2/WG2 N1643 (Proposal to encode Klingon in Plane 1 of ISO/IEC 10646-2) come alive despite the eventual rejection of the script in Unicode?
You can see all of the characters from the proposal if you look at the font in Character Map:
You can see lots of other stuff too, if you look. Seem familiar, any of it?
The encoding here is not a Klingon Easter Egg of some sort; it and the other characters here are all from the ConScript Unicode Registry....
I guess there are some places even Apple and the Klingon Empire can't reach.
But let's not take any chances.
Ensign, put blogs.msdn.com on Red Alert, raise shields, and call Shawn Steele to the bridge!
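For anyone who wants to poke at this, here is a minimal sketch of why two vendors can both "own" U+F8FF: it is a Private Use code point, so Unicode assigns it no meaning at all. The Klingon range below is an assumption on my part (the ConScript Unicode Registry allocation as I remember it, U+F8D0 to U+F8FF), not something taken from the font itself.

```python
import unicodedata

# The code point from Steve's comment (Apple logo in Apple's mapping table).
APPLE_LOGO = "\uf8ff"

# Assumption: the ConScript Unicode Registry block for Klingon is U+F8D0..U+F8FF.
CSUR_KLINGON = range(0xF8D0, 0xF8FF + 1)


def is_private_use(ch):
    """True when the character's General_Category is Co (Private Use)."""
    return unicodedata.category(ch) == "Co"


def standard_name(ch):
    """Return the Unicode character name, or None when the standard defines none."""
    return unicodedata.name(ch, None)


print(is_private_use(APPLE_LOGO))        # private use, so any font may draw anything
print(ord(APPLE_LOGO) in CSUR_KLINGON)   # falls inside the assumed Klingon range
print(standard_name(APPLE_LOGO))         # no name in the standard
```

Since the standard says nothing about the code point, whatever glyph shows up is purely a matter of which font happens to be selected.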
Previous parts in this series:
Now I'll admit it has been a long time coming, but it is now here....
As mentioned by Lisa Shieh in IDN now supported by Google AdWords:
Now those last two links are pretty important ones.
The first one says:
How Much Text Can I have In My Ads?
Ads can show, including spaces, 25 characters for the title, 70 characters for the ad text, and 35 characters for a display URL (or approximately 17 for languages that use non-ASCII(multi-byte) characters).
On Google, text ads are displayed on four lines: a title, two lines of ad text (each with 35 characters), and a URL line. However, the format may differ on Google partner sites.
Some Eastern European and Asian countries also support longer text ads containing up to 30 characters in the title and 76 characters in the ad text.
I. Ad Text
If your ad text contains any wide characters, such as certain capital letters and punctuation marks, fewer characters may fit on the line. The system will notify you if you exceed a character limit. Also, some of Google's syndication partners may not display non-standard characters if you include them in your ad.
If you create text ads using non-Latin characters, please be aware that the character limit may vary. Ads in languages with non-latin (double-byte) characters, such as Chinese, Japanese, and Korean, can contain the following number of characters, including spaces: 12 characters in the title, 17 characters in each line of ad text, and 17 characters in the display URL. Countries that support longer text ads have higher double-byte character limits.
II. Display URL
Google can only display up to 35 characters of your display URL, due to limited space. If your display URL is longer than 35 characters, it will appear shortened when your ad is displayed. WAP mobile ads can show up to 20 characters in a display URL, so any longer domain will be truncated to fit within those limits. For non-ASCII (multi-byte) languages such as Japanese or Korean, the width of these characters can vary, so the display URL might be shortened if it’s longer than 17 characters.
If your display URL is longer than 35 characters (or 20, for WAP mobile ads), you may consider using a shortened version of your URL, such as your homepage. Please be sure that your display URL accurately represents your destination URL, the page within your site to which users are taken via your ad. The display URL should have the same domain (such as example.com) as your landing page.
Also, please note that your display URL must be an actual web address, appearing in the form of a valid URL. It must include the extension (such as .com, .net, or .org,). It does not need to include the prefix (such as http:// or www).
Since your ad space is limited, try to create compelling and targeted ad text that is highly relevant to the products or services you're promoting. You can optimize your ad text to create the most effective ads.
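The "35 characters, or approximately 17" rule above reads like a display-column budget in which wide characters cost two columns each. Here is a rough sketch of that idea; it is only my guess at the arithmetic, not Google's actual implementation, and the function names are made up for illustration.

```python
import unicodedata


def display_columns(text):
    """Count display columns, charging two for East Asian Wide/Fullwidth characters."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def fits(text, limit=35):
    """Hypothetical check: does the text fit in a display URL line of `limit` columns?"""
    return display_columns(text) <= limit


print(display_columns("example.com"))
print(display_columns("日本語"))
print(fits("日" * 17), fits("日" * 18))
```

Under this model seventeen wide characters use 34 columns and the eighteenth pushes past 35, which would explain where "approximately 17" comes from.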
And the second one? It says:
You can use non-ASCII (multi-byte) characters (such as those used in Japanese, Korean, and Chinese) in your URLs, but note that some of these characters need nearly twice the display space as single-byte characters. So, the exact number of characters you can use in a destination URL might be less than the character limit shown in the preview counter. To mitigate against URL spoofing, non-ASCII characters will be displayed only when the user’s interface language matches the characters in the visible URL. In all other cases, the URL will render as ASCII punycode. For example, if your Google interface language isn't in a language that uses Cyrillic characters (e.g. Russian), these characters won't render (e.g.http://пример.испытание will display as http://xn--e1a...).
Aha, very informative! And there is now a little insight into why it took so long for us to see this....
Google, like Microsoft, is a big place, and it takes time to get every team interested in something new, no matter how important you might think it is....
Given how long URLs were limited to just LDH (letters A-Z, digits 0-9, hyphen), it's easy to see how any given technology might have such limitations in its own DNA, and how un-eager they would be to make changes that could lead to service degradation or customer confusion.
Overall, I think it's good that AdWords has taken these steps.
Though I will feel better when the very natural annoyance with provincial assumptions like "each person knows only one language" also penetrates the AdWords folks and they take the next step -- like finding a more intelligible way to show IDNs that are in a different script.
The current solution is a great rudimentary first step, but it can't be the last one. Showing Punycode so readily is never the best answer, so hopefully that is a temporary plan (this one, moderated by UI language, has some obvious flaws in it)....
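For reference, the Punycode form from the quoted AdWords text is easy to reproduce. This is a small sketch using Python's built-in "idna" codec, which implements the older IDNA 2003 rules; the helper names are mine, and this is obviously not whatever AdWords runs internally.

```python
def to_punycode(host):
    """Convert a Unicode host name to its ASCII (xn--) form with the built-in codec."""
    return host.encode("idna").decode("ascii")


def to_unicode(host):
    """Convert an ASCII (xn--) host name back to its Unicode form."""
    return host.encode("ascii").decode("idna")


ascii_form = to_punycode("пример.испытание")
print(ascii_form)
print(to_unicode(ascii_form))
```

The round trip loses nothing, which is the whole point: the ASCII form is what goes over the wire, so whether to show it to a human is purely a UI decision.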
Over in the Suggestion Box, Van asked:
New possibility for the Every Character has a Story series. Fantasai wrote in to the Unicode list asking about Gc of various characters, but one group really popped out to me, which is Music Sharp in Misc. Symbols, and two of the white triangles in Geometric Shapes are all Gc=Sm, while Music Flat and the corresponding black triangles are Gc=So. Just a thought.
Ah, the Unicode List.
Let me quote Fantasai's full message:
So I've been doing some very close reading of the Po and So categories, and there are a few things that aren't making sense to me. I was wondering why is:
- per cent, per mille, per ten thousand classified as Other Punctuation (Po) not as Other Symbol (So)?
http://www.fileformat.info/info/unicode/char/0025/
http://www.fileformat.info/info/unicode/char/2030/
http://www.fileformat.info/info/unicode/char/2031/
http://www.fileformat.info/info/unicode/char/066a/
http://www.fileformat.info/info/unicode/char/0609/
http://www.fileformat.info/info/unicode/char/060a/
http://www.fileformat.info/info/unicode/char/fe6a/
http://www.fileformat.info/info/unicode/char/ff05/
- number sign, ampersand, and commercial at classified as Other Punctuation (Po), not Other Symbol (So)?
http://www.fileformat.info/info/unicode/char/0023/
http://www.fileformat.info/info/unicode/char/0026/
http://www.fileformat.info/info/unicode/char/0040/
http://www.fileformat.info/info/unicode/char/fe5f/
http://www.fileformat.info/info/unicode/char/fe60/
http://www.fileformat.info/info/unicode/char/fe6b/
http://www.fileformat.info/info/unicode/char/ff03/
http://www.fileformat.info/info/unicode/char/ff06/
http://www.fileformat.info/info/unicode/char/ff20/
These characters each symbolize a concept, and are not used as punctuation (except maybe in URLs, but that shouldn't count). So why are they punctuation and not symbols?
- music sharp classified as Mathematical Symbol (Sm) not Other Symbol (So), while music flat is So, not Sm?
http://www.fileformat.info/info/unicode/char/266f/
- certain white triangles are classified as Mathematical Symbol (Sm) while their corresponding black triangles are classified as Other Symbol (So)?
http://www.fileformat.info/info/unicode/char/25b7/
http://www.fileformat.info/info/unicode/char/25b8/
http://www.fileformat.info/info/unicode/char/25c1/
http://www.fileformat.info/info/unicode/char/25c2/
These just seem really inconsistent...
~fantasai
Well, let me start from the top.
All of the items that are Po are correct, by the way I have learned them -- they are punctuation, they are not symbols.
Perhaps the way I learned these things is flawed, and I won't defend my education on this point.
But I will point out that obviously some other people might have learned the same way....
As for the various musical symbols, such twiddling to try to build up consistencies can be interesting, of course. But to consider it useful and productive, the difference between the two has to be significant in some way for the processes that make use of them.
You know, like the kind of differences I mentioned in Why are there MODIFIER LETTERS that are not in the Letter, Modifier category?, looking at Sk vs. Lm differences.
Such an argument would likely have to be made here, too....
But here it looks like some effort was put into giving So to the symbols that people thought might have other uses, and Sm to the math-only characters.
None of it looks inconsistent in any real sense. It just looks like an effort was made to categorize that perhaps not everyone agrees with.
Van's question I kind of put into that same category....
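If you would rather check the categories than take anyone's word for them, here is a quick sketch that looks up the General_Category of a handful of the characters from Fantasai's mail. The sample list is my own selection, and the values come from whatever version of the Unicode Character Database ships with your Python.

```python
import unicodedata

# A few of the characters from the mail: percent, per mille, number sign,
# ampersand, commercial at, music flat/sharp, and the four triangles.
SAMPLES = ["%", "\u2030", "#", "&", "@",
           "\u266d", "\u266f",
           "\u25b7", "\u25b8", "\u25c1", "\u25c2"]


def categories(chars):
    """Map 'U+XXXX NAME' labels to each character's General_Category."""
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch)}": unicodedata.category(ch)
        for ch in chars
    }


for label, gc in categories(SAMPLES).items():
    print(f"{gc}  {label}")
```

The output lines up with the mail: the sharp and the white triangles report Sm, while the flat and the black triangles report So.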
A few years ago, I wrote The incredible missing Language Bar.
It was about the complex story of the Language Bar and the times that it would stop appearing -- when it went missing.
And then a few days ago, jmdesp commented to that blog:
I'd love a Windows 7 update of this story. Especially for the case of a language bar that is missing for all applications, is still missing after a reboot (so the minimize theory doesn't seem to be working), but *does* get visible for elevation privilege requests (and only for them)
Well, new version makes for new rules -- and the old rules may no longer apply.
Now as near as I can tell, they had kind of a "no way to win" scenario down in Language Bar Land for Windows 7.
I mean, there was a slowly building groundswell of people who didn't like the shall we say "harder to support" features that the Language Bar provided.
People hated how their preferences were forgotten when they crossed the "1 to many" border for keyboards, in either direction. They hated how the UI was underfoot when they asked it not to be. And they hated how it simply disappeared seemingly without warning.
So what the owning team apparently did was they changed the rules....
For the people who never change any of the Language Bar settings, they get the old behavior.
On the other hand, for those who want it remembered that they enabled the Language Bar and asked for it on the Desktop, from this dialog:
The UI will always remember that you wanted to be on the desktop.
Even if you got rid of the extra keyboard(s) and then added it/them back later.
And for those who want it to be hidden?
Their Language Bar will always stay hidden, no matter how many input languages they add.
It was now smart enough to remember when you told it to GET LOST!
Until/unless they change this UI again, that is.
Okay, so from now on, the "missing Language Bar" can be deterministically returned to visibility, at will.
This is definitely an improvement, by the way!
Now this leaves one last problem, of course....
The one that jmdesp pointed out, about the privilege elevation requests.
The problem there is that everything that happens there happens in Session 0.
And though after painful bugs and time they have improved the communication between the session from which the elevation request came and Session 0 in regard to the available keyboards, several of the other "per user" settings such as the Language Bar visibility setting really couldn't be brought over as easily. So they basically didn't bring it over....
This can still lead to confusion occasionally, though a lot less often than the previous way.
In my opinion, all of this adds up to a net improvement. More deterministic, easier to fix problems when you get them.
But usage patterns vary. A LOT. So good for me may or may not be good for you, right?
So, is it good for you? How do you feel about all this?
The other day Karl Williamson asked a great question about Unicode stuff:
Subject: Why are the shorter cjk names second in PropertyAliases.txt?
The comments say that the first field is the short name, and the 2nd field is the long name. But in the case of the cjk properties, the short name is longer than the long name. Why?
And Ken Whistler was around to give a classic answer to this "drive on a parkway, park on a driveway" question for Unicode geeks.
So, in the fine tradition of:
I thought maybe I'd immortalize it here for all of those like me who delight in these sorts of tales.
The response was:
a) To make Karl ask questions? ;-)
b) Because there always has to be an exception to any rule (including this rule)?
c) Because the UTC said to do so?
Well, the short answer is "c". See Consensus 120-C23 in the minutes from UTC #120, L2/09-225.
But wait, you will then ask why the PropertyAliases.txt for the CJK properties aren't actually like what was agreed in Consensus 120-C23? For that you need the long answer.
To bring the normative CJK properties into the PropertyAliases.txt context, where property identifiers appropriate for regex and such are defined, the UTC originally decided that it would make sense to just prepend "cj" to the front of the existing Unihan tags from the Unihan Database, so that they would be more self-identifying as "CJK" properties to the uninitiated, and then make the resulting labels (cjkIRG_GSource, cjkRSUnicode, etc.) be both the long and the short identifier in PropertyValues.txt. Then, the Unihan tag itself (without the "cj" prefix) would be added as a (third) alias, because people might well be using the exact Unihan tag for matching, and it wouldn't make sense to disable that. And the UTC wanted the labels with the "cj" on them in the short (abbreviated) field, because everybody agreed that there was no utility to attempting to shorten the Unihan tags further; they would just turn into unmnemonic gobblydegook.
Then when the consensus was actually acted upon, and PropertyAliases.txt with the changes started to see review, it dawned on everybody that, rather than have the official Unihan tag be a third alias, it should be the "long" name in the PropertyAliases.txt. You only needed two values: "cjkIRG_GSource" and the original tag value "kIRG_GSource". So why have two identical labels and also have to add the original tag as a third alias?
But then the conflict of two contradictory purposes kicked in. The first field is supposed to be "short(er)". But the second field is supposed to be "official(er)". See UAX #44, Section 5.8.1:
The long symbolic name alias is self-descriptive, and is treated as the official name of a Unicode character property. For clarity it is used whenever possible when referring to that property in this annex and elsewhere in the Unicode Standard.
Because the Unihan tags in the Unihan Database have longstanding status, predating by at least a decade the decision to tack "cj" onto the front of them for PropertyAliases.txt, and because the Unihan tags are used everywhere in the Unihan Database and its documentation, it became clear that those had to be the official name of those properties.
In this case, the UTC tossed up the dice, and they came down as "official(er)" trumps "short(er)" for the CJK properties.
So there is the long answer. Now I suppose people will want a short(er) version of that answer added to Section 5.8.1 of UAX #44, so this strange aberration will be seen for the Solomonic judgement it actually was. ;-)
Personally, I would have chosen "e", which is "all the above" after the response from Ken is made "d".
Or maybe, to avoid the next time and the time after that for the same question to be asked, I might change the original text to come up with descriptions that are true "except for CJK". Because "Unicode except for CJK" is like saying "quarters liveable except for no Oxygen"....
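To see the oddity Karl spotted for yourself, here is a small sketch that parses lines in the "short ; long" layout and flags entries where the "short" name is the longer one. The sample text is an illustrative excerpt I typed in that layout (using the two CJK names from Ken's mail plus two ordinary properties), not the real PropertyAliases.txt, and the function names are mine.

```python
# Illustrative excerpt only; the real file has many more entries and comments.
SAMPLE = """\
# short name ; long name
cjkIRG_GSource ; kIRG_GSource
cjkRSUnicode   ; kRSUnicode
gc             ; General_Category
sc             ; Script
"""


def parse_aliases(text):
    """Return (short, long) pairs, ignoring comments, blank lines, and extra aliases."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 2:
            entries.append((fields[0], fields[1]))
    return entries


def short_longer_than_long(entries):
    """Pick out the entries where the 'short' field has more characters."""
    return [(short, long) for short, long in entries if len(short) > len(long)]


for short, long in short_longer_than_long(parse_aliases(SAMPLE)):
    print(f"{short} is longer than {long}")
```

Pointing the same two functions at the contents of the real file should surface exactly the "except for CJK" cases.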
The problem with Unicode is that it is quite complicated.
Well, that in and of itself isn't a problem; the problem is that as a standard it is driven by many competing forces.
And one of those forces made the XKCD world the other day!
The strip is easy enough:
And then past this problem, there is another one....
The problem here is simply stated:
Characters are chaotic.
In fact, they are chaos itself, at times.
We try to categorize and bucketize and prioritize everything, but inevitably we find that the tightly interwoven standard has yet another issue that the neatly defined bucketizations and standardizations missed.
And software, depending on those bucketizations and standardizations, has to pick up the updates to fix the problems that come with treating so much of the information in the Unicode Character Database as "Ultimate Truth" when really it is just, to borrow from Unicode's ISO 10646 brethren, "Good enough for government work."
And so, there (and here) we are.
Actually, now that I think about it, there probably are 14 different reasons that Unicode is complicated. Reasons like: Michael Everson, Mark Davis, Ken Whistler, and so on. Every character may have a story, but it is also true that every one of those characters has a bunch of stories, too....
In this blog's title, cuffs and collars are metaphors for fonts and keyboards. It's not a very effective metaphor or even a terribly funny joke, so if the meaning doesn't come to you, let it go. You aren't missing anything....
Yesterday, in response to my blog We got the keyboards now! (Windows 7 edition), regular reader Santhosh Pillai asked:
How to get atomic chillus in Malayalam keyboard layout?
The short answer is the easiest, so I'll start there:
You don't.
However, I doubt this answer will be a very satisfying one to Santhosh, or to anyone really.
So perhaps I should explain.
It started in the beginning of 1998. And continued, on and off, until the middle of 1999.
That is when Microsoft added Indic support to Windows, and in particular added keyboards for Hindi and Tamil.
Now over the next few years, in XP and XP SP2 and Vista, Microsoft would periodically add new keyboards for new languages -- Bengali, Telugu, Malayalam, Oriya, and so on.
Each of these different language keyboards supporting different scripts and different languages had two simple things in common:
Now Tamil added a Sha and a Zero and some symbols. Bengali resolved issues around the Khanda Ta. Malayalam added atomic Chillu characters. Etc., etc. For this to happen, Unicode was being periodically changed/updated for Indic support.
And although our fonts and shaping engines were updated to stay conformant to Unicode, the keyboards were never touched.
This may beg the question of how the font/shaping engine changes were tested -- I guess they just had documents already created and they were not testing fresh inserts?
Essentially this means that the last decade of improvements and changes for Indic have been well supported on the font side and ignored on the input side.
I think someone should work to correct this oversight at some point....
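To make the chillu issue concrete, here is a sketch comparing the atomic CHILLU N with the older way of asking for the same shape, which as I understand it was NA + VIRAMA + ZERO WIDTH JOINER. The choice of that one letter and the helper names are mine. A keyboard layout that only knows the older sequence can never produce the atomic character.

```python
import unicodedata

ATOMIC_CHILLU_N = "\u0d7b"              # the single atomic code point
LEGACY_CHILLU_N = "\u0d28\u0d4d\u200d"  # NA + VIRAMA + ZWJ (assumed older spelling)


def describe(text):
    """List each code point in the string with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def same_after_nfc(a, b):
    """Check whether two strings become identical under NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(describe(ATOMIC_CHILLU_N))
print(describe(LEGACY_CHILLU_N))
print(same_after_nfc(ATOMIC_CHILLU_N, LEGACY_CHILLU_N))
```

Normalization does not unify the two spellings, so text typed on an un-updated layout stays distinct from text that uses the atomic characters, even when the font draws both the same way.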
Remember To get Korean wrong you have to SHIFT things a bit... from early last month?
In it I talked about a couple of longstanding bugs in the Windows "GoGlobal" Keyboard Layouts site's layout for Korean.
Now in a comment, longtime friend and regular reader Mihai said:
That web page (with keyboard layouts) used to be handy, sometimes.
But now it is so outdated that I am not sure about usefulness.
For instance there is only Romanian instance (while there are 3 of the them now: Standard, Legacy, and Programmers).
And that instance is ugly (3, if not 4 different fonts) and wrong (showing the cedilla forms for s and t (U+015F and U+0163)). Unless that is the "Legacy" layout, in which case is not wrong, just outdated :-)
I did respond to him.
Well, kind of.
The thing I didn't say was that there was already a project that was nearly complete to build those layouts!
Anyway, after a brief delay to get it all done and tested and files uploaded, the keyboard layouts are updated!
Of course the Korean was fixed, too.
And also all the other keyboard layouts included in Windows 7, ready for people to look at!
I suppose I could have skipped the earlier blog on the Korean bug, but the Korean layout wasn't fixed via the tool so it seemed like a shame to bury info about the bug. We were fixing the site anyway!
Now as a bonus I think a bunch of other problems (like one Mihai mentioned about different fonts used in the same keyboard for Romanian) also got fixed; I'm sure Mihai will let me know if there is still a problem....
And if anyone notices any bugs or problems they should feel free to let me know.
We got the keyboards now! On the Windows "GoGlobal" Keyboard Layouts site!
To close this Blog, I'll tell you two stories that are both entirely true but that if I were slightly smarter I probably wouldn't tell....
But through it all, I have to be me!
story #1: If you go back many years, the site's keyboards were originally built by a tool that K.D. Chang of the Typography team put together. However, K.D. left under somewhat mysterious circumstances that did involve his entire share of useful tools on the Sparrow server being [in retrospect not so] mysteriously deleted, a fact not realized until long after it was too late to get the share contents from a backup.
Oh well, I guess you could say something may not have been on good terms. Just a guess, no one ever told me anything on the record....
Getting it back is a good thing!
story #2: while this work was not done due to a single comment in a blog leading to my boss's boss's boss approving a bunch of funds, I was easily able to convince my boss's boss's boss to approve the project, right in the tail end of getting approval in a big room for a totally different project for an internal website.
I'd love to take credit for it, but the only impressive thing I did was point out a need at the right time, shortly after a colleague and I made the need to do some other work really obvious. If it is all about timing then we were spot on. Zuzka rocks!
But obviously I shouldn't get too carried away here -- it's not like I was given a long-term staff or an annual budget line item or anything.
Either way, it was really cool to get it approved in April but even cooler to see it done on the public site in July, just a few months after the tool creating them was approved.
But in any case, we're all set now and up to date. And we can easily make updates and fixes and additions any time it would be appropriate.
I'm loving this. :-)
If you work at Microsoft and are a part of the Windows team, you have an interesting ability.
You can build versions of the product that don't even exist!
Like remember when the IA64 client was no longer being shipped? Well, you could still build it.
Hell, even after they stopped building the server you could still build one for a while (it takes time to unravel some of the work there).
You could build both client and server versions of Server 2003, long before the XP x64 version was ever a thing. And you could even build an x86 version, that would never see the light of day....
Also, more relevant to my neck of the woods, you can build language versions that don't exist!
But, to give the example relevant to today's point, you can build a server version localized into Hebrew or Arabic!
Now in many cases, this doesn't do much -- the resources don't exist, so you can't magically get all the work done to support a language on your own.
I mean, you can't magically fill in grid entries from The Locales of Windows 7, divvied up further, just because you wanted to.
But there is one special case that is an exception to this.
The pseudo mirrored build.
First described to everyone as a locale in Shawn's Pseudo Locales in Windows Vista Beta 2, it later became a great tool to help find all kinds of different bugs in mirrored builds long before the formal localization process starts.
Basically, even though we do not ship any mirrored languages as server SKUs, you can create a pseudo-mirrored server!
Now some server specific components that are never translated into Hebrew or Arabic will still come out English, though many will be "pseudo localized" unless specifically marked to not have that happen.
This particular build has very little use, mind you.
Just about every bug you see can't ever happen in a real world scenario.
So you can put in bugs, sure.
And if a developer agrees that this is a bug and fixes it, then that's great.
However, in the (much more likely) case where the developer resolves it as cannot affect customers, the truth is that the developer is probably right.
This ends up happening, perhaps more often than one might expect. It's just that sometimes not building is more of a pain than letting it build. And even if they don't build it in the build labs, you can always build it yourself.
And as a [largely] untested build
Anyway, one of the interesting parts about working for Microsoft is that you can create things that don't exist!
Kind of cool, if you ask me....
I am an expert on some things.
Not all things mind you; just some things.
Further, I am expert on the way most things interact with the things I am expert on.
Not an expert on most of those things; just on the way they interact with the thing I am an expert on.
Up until just recently, I had been able to pretty much steer clear of a particular breed of reader -- the nitpicker.
This fairly insane need to make a point, even at the expense of missing the point.
Then on Wednesday, that all changed.
In It's ultimately your call, but your PowerShell cmdlets really don't need to SUCK this much, I decried a terrible situation in PowerShell cmdlets.
A situation where, to quote myself from another, tangentially related email thread about a similar bug affecting the ISE itself:
“The promise of the Graphical PowerShell is moving beyond the CMD boundaries, and bugs like this betray that promise”
But, as circumstances unfolded, my contrasting of the "ISE-based PowerShell" versus the "CMD-based PowerShell" became what virtually every comment was about.
The fact that many people have the same (apparently incorrect) assumption about the legacy PowerShell being CMD-based is irrelevant.
The fact that the property sheets for the two of them look amazingly similar while the ISE's looks more like a regular app shortcut does not:
And the fact that the main app propsheet for the first two (right down to the "Defaults" and "Properties" entries) also look the same while the third did not have either option or look in any way similar did not:
The fact that the "Properties" and "Defaults" options for the first two that did not exist on the third at all was also not relevant:
And finally, and most relevant to the point of Unicode and complex script support, the fact that the first two behaved identically (and were only limitedly useful) while the third did not was also, once again, irrelevant.
The only thing that was important was that I was saying that these two different things that acted just like each other, had property sheets that were the same, had right click menus that were the same and shortcuts used in consoles, or anything else, had nothing to do with CMD.EXE but instead was a CSRSS.exe/conhost.exe thing.
Um, okay.
As rjcox pointed out:
Of course a s/cmd.exe/Win32 console subsystem/ would still make sense and be more, pedantically, accurate.
In other words, there was a single mistake here to do with labeling that would make the entire post accurate. A single pedantic misapprehension on my part that has nothing important to do with the point in question.
Now clearly if I were billing myself as an expert on the console subsystem this would have been bad.
But I'm not, I'm just someone who points out bugs to experts in it to do with Unicode (ref: Ā was unexpected at this time).
The same way I'm not an expert in conhost.exe but have helped fix problems and bugs in its TSF input component (contsf) compiled right into it.
The same way I'm not an expert in security but know more about the localizability of account names than some experts in that.
Or that I'm not an expert in a number of things, though in explaining how they work with Unicode or with internationalization or localizability or keyboards or whatever, I am.
Funny how it was not until I got to the PowerShell experts (who were sick of dealing with this CMD.EXE mistake long before they got to me) that I became such a target for pedantic nitpickery!
And not from team members for PowerShell, either. Or any of the other internal folks I've been dealing with. Perhaps since when I am talking to them I am usually helping them with an actual bug, they see no purpose or reason in correcting me on my terminology....
Now it would be my hope that most people will suss all of this out correctly, and recognize that not only my lack of expertise on the legacy console subsystem's pieces but my lack of desire to obtain expertise on the legacy console subsystem will simply keep me from finding it important to make such fine distinctions of things that behave identically and neither of which should be used when one can help it.
All console work should happen in the PowerShell ISE, not just the cmdlet work -- since everything behaves better there.
A Unicode console app behaves perfectly in the ISE.
This fact is true whether you are hosted by CSRSS, conhost.exe, CMD.EXE, or Console Fred, or anything else. This whole collection of components that I called "CMD.EXE based" is bad whether or not what I called it was the exact right name.
So, will I update the blog at some point? Maybe. Given that the software behind the Blog can send trackbacks every time, I'm not especially eager to. The comment spooge is there, it just doesn't seem that important....
If anyone has trouble with this still, then they, like commenter Blake, should likely be excited about one less Blog in their RSS reader. This Blog is clearly unsuitable.
You probably shouldn't be here -- like Blake....
The title of this blog certainly suggests one theoretical etymology for the word Tweel, though it is, by being largely a product of my own random thoughts without any supporting research, unlikely to be true.
Tweels do have a source, though, Michelin!
You can read about them on Wikipedia, here:
The Tweel (a portmanteau of tire and wheel) is an experimental tire design developed by the French tire company Michelin. The tire uses no air, and therefore cannot burst or become flat. Instead, the Tweel's hub connects to flexible polyurethane spokes which are used to support an outer rim and assume the shock-absorbing role of a traditional tire's pneumatic properties.
They even have some public domain art:
I had a colleague at work ask me if I was going to get tweels put on my iBot, especially given that the iBot was originally introduced with tweels as its wheels.
He even pointed me to a video from Michelin from a few years back that mentions the iBot and shows one riding on Tweels:
This is, I will admit, pretty cool.
After his inquiry, I did ask one of the few remaining folks manning the technical support desk for my iBot at Independence Technology, and after looking into it a bit he said that although they were initially used, there were complaints about noise from the tire and also the heat that it would generate -- both present even at the lower speeds of an iBot relative to a car.
It is unknown if Michelin is still working on this project, but even if they were it is unlikely they would have anything before January 2013 when Independence Technology shuts down, and even if they did it is highly unlikely Independence Technology would offer a new option for tire replacement from the one they have now.
Those who would dare to dream of a tweet about me, the non-tween riding on two Tweels, will likely have to keep dreaming. I do have plans to make some changes to the iBot after the warranty expires (I'll write about those more fully after said expiration), but buying tweels from Michelin is not currently one of those changes....
Prior blogs in this series:
Now we start with IDN.
Internationalized Domain Names.
This series has been covering several aspects and issues of them.
Next we talk about EAI.
Email Address Internationalization.
Now there is a definite relationship between the two.
I mean, every email domain name is a domain name.
And every email address is in the format {account name} @ {domain name}.
However, we are talking about two different sets of standards, with two entirely different sets of purposes.
The guiding principle for IDN has been the paranoid fear of the Internet going down due to an attempt to send non-ASCII domains for lookup -- all of the effort to go through NamePrep and Punycode is around providing a canonically stable encoding of Unicode that is representable in ASCII, since the Internet has been using an "LDH" (letters A-Z/digits 0-9/hyphen) form of ASCII.
No one wants the Internet to go down, and a little paranoia can be healthy, so this was a compromise everyone could live with.
EAI, on the other hand, suffers from some negative points -- like the fact that there is a lot of spam.
And there are a lot of similar people with similar mail clients and servers out there.
And a lot less consistency between the folks sending spam and the folks using email productively.
So EAI did a lot of work to keep international email addresses in UTF-8 -- and not in Punycode.
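To make the difference concrete, here is a minimal sketch of the same domain in both wire forms. It uses Python purely for illustration, and it leans on Python's built-in "idna" codec, which implements the older IDNA 2003 flavor of NamePrep plus Punycode; the function names are made up for this example.

```python
# The LDH repertoire that classic DNS lookups expect.
LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")

def idn_wire_form(domain: str) -> bytes:
    """IDN route: NamePrep + Punycode, applied label by label.

    Python's built-in "idna" codec follows IDNA 2003; registries today may
    apply newer rules, so treat this as an approximation.
    """
    return domain.encode("idna")

def eai_wire_form(domain: str) -> bytes:
    """EAI route: the text stays Unicode and simply travels as UTF-8."""
    return domain.encode("utf-8")

domain = "яцѕѕіа.рф"
ascii_form = idn_wire_form(domain)
utf8_form = eai_wire_form(domain)

print(ascii_form)  # two "xn--" labels made only of LDH characters
print(utf8_form)   # raw UTF-8 bytes, none of them ASCII letters
```

The first form is safe to hand to anything that only understands LDH; the second assumes every hop along the way is prepared for UTF-8.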
Now there are many approved top level domains (TLDs) for IDN, such as:
Bangladesh (bd): বাংলা
China (cn): 中國 (traditional); 中国 (simplified)
Egypt (eg): مصر
Hong Kong (hk): 香港 (same in simplified and traditional)
India (in): भारत, بھارت, భారత్, ભારત, ਭਾਰਤ, இந்தியா, ভারত
Palestinian Territory (ps): فلسطين
Qatar (qa): قطر
Russian Federation (ru): рф
Saudi Arabia (sa): السعودية
Sri Lanka (lk): ලංකා (Sinhalese); இலங்கை (Tamil)
Taiwan (tw): 台湾 (simplified); 台灣 (traditional)
Thailand (th): ไทย
Tunisia (tn): تونس
United Arab Emirates (ae): امارات
And there are many people out there who have registered domains with the appropriate registrars so that they can have web sites that use those domains.
However, when you talk to the customers and governments that worked so hard to get TLDs for use in IDN, very few are making the same push for EAI support of email addresses that use those domains (e.g. шеъмаѕтея@яцѕѕіа.рф as an email address is considered a lot less important to support at the moment than the http://яцѕѕіа.рф website).
Part of this may be that one standard has been stable for some time and has both IANA approval and registration authorities. And several browsers that have supported it for years at this point.
The other is not even a final standard yet officially, and there is a genuine dearth of established clients -- coupled with a very conservative sense of wanting the support to widely exist before so much undeliverable mail is inflicted on people.
To be honest it makes me wish they'd gone with a NamePrep/Punycode solution for the domain name piece of the email address, since of these two very different things, one is conceptually a genuine subset of the other.
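For what it is worth, the hybrid I am wishing for is easy to sketch. This is not what EAI specifies; it is a hypothetical helper (the name is invented here) that leaves the account name alone and runs only the domain piece through Python's IDNA 2003 "idna" codec.

```python
def punycode_domain_only(address: str) -> str:
    """Hypothetical hybrid: keep the account name as-is, Punycode the domain.

    Splits on the last "@" so that the domain piece is unambiguous.
    """
    local_part, at_sign, domain = address.rpartition("@")
    if not at_sign or not local_part or not domain:
        raise ValueError("expected {account name}@{domain name}")
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local_part}@{ascii_domain}"

hybrid = punycode_domain_only("шеъмаѕтея@яцѕѕіа.рф")
print(hybrid)  # account name unchanged, domain turned into "xn--" labels
```

The domain half of the result is exactly what a browser would already be looking up for the web site, which is the whole point of treating one problem as a subset of the other.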
I doubt this situation will last forever, though. When I look at the 22 email addresses in the Email Addresses tab in Exchange that cover one mailbox across so many different profiles, I have to think that adding one more should be less intrusive and dangerous, eventually....
I remember years ago, when the groundswell of opinion that wanted a replacement for CMD.EXE found itself with budget.
And headcount.
And executive sponsorship.
And a code name.
You may have heard of it.
Monad.
Now everyone involved had their own reasons for wanting this, but for me and several others like me, there was one basic feature we wanted.
We wanted the ability to support Unicode and complex scripts within Unicode (okay, maybe that's two basic features).
Now fairly early on (though I remember in talking with several old timers that we collectively realized it earlier) it became clear that only the Unicode part was possible in CMD.EXE; the complex script part was never gonna happen there.
Thus was born the Graphical PowerShell.
Now this became my PowerShell, because it supported everything that one could ever want in terms of the display and reading of Unicode text, with the only limitations coming out of scripts built targeting older, non-Unicode type support that had dominated the console for so long.
Unfortunately, this was not the PowerShell of most people.
The vast majority of PowerShell work apparently happens in the CMD.EXE-hosted version of PowerShell.
And the vast majority of PowerShell cmdlets target that least common denominator of non-Unicode, non-complexity that the CMD.EXE-hosted PowerShell has the easiest time supporting.
Have you (the reader of this blog) ever heard of Unicode support in the console, which has been there almost since the birth of CMD.EXE?
These cmdlet authors had not heard of this.
Have you (the reader of this blog) ever heard of writing light-up features that do better when running where better can be done (e.g. in the redirected console, or in the Graphical PowerShell)?
These cmdlet authors had not heard of this either.
Have you (the reader of this blog) ever heard of using terrible features like the console fallback language any time the console was deemed to be unable to support a given language?
Unfortunately, a lot of cmdlet authors who had been napping while those earlier questions came out were bright eyed and bushy tailed when this one became available.
Crap.
And this is a ledge I have been trying to convince cmdlet authors to stop jumping off for some time. Although there has been some success, it isn't nearly as much as I would like there to be.
Now I could blame this on a small team if I wanted to.
But since PowerShell was widely embraced by components needed by administrators all throughout many groups in Microsoft and by many people outside of Microsoft, the only small group of people I could ever hope to reasonably blame is the early group of people tasked with writing up the best practices for PowerShell cmdlets.
I lacked the authority to see the guilty "punished" (i.e. laid off or reviews impacted) for the decades of man-years' worth of cmdlets, both inside and outside of Microsoft, that were written incorrectly due to the piss-poor best practices produced. Though the people, through failure of due diligence, deserved a bit of feeling the pain they inflicted....
The plan currently there in most cmdlets -- to take the fallback language any time either the CurrentCulture or the CurrentUICulture is one the console is deemed unable to support -- has the benefit of keeping them from printing out question marks to a CMD.EXE console using a bitmap font, but with the downside of it being the wrong language and the wrong cultural preferences. That is a bad thing to do in ordinary circumstances, but terrible when the platform is able to support so much more (which is more often than anything people are being told in marketing materials about how much they can do in the CMD.EXE based technology would suggest). Rather than just showing question marks or square boxes there to push people to move on, they'd rather dumb down everywhere.
That is, quite simply, bad engineering.
But if you are a PowerShell cmdlet author, either inside or outside of Microsoft, you have a chance to break that pattern.
Let's inject some good engineering -- and some simple "light-up" work to make the better locations produce better results!
The code, most of which comes from Cunningly conquering communicated console caveats. Comprende, mon Capitán?, over a year ago, with one small change to handle the complex script case, is code that I wish could end up either in some central module to be used by all PowerShell cmdlet authors, or in as many cmdlets as humanly possible, to allow the Graphical PowerShell that you may know of as PowerShell ISE to get the job done for the many, many languages that are supported.
Or if nothing else you could make sure that your own cmdlets don't suck. Because they don't have to....
The one small change?
A small addition, an interesting utility function for the sake of the CMD.EXE based world, which still hosts the CMD.EXE-based PowerShell.
If you look at the contrived function that is Main:
    public static void Main()
    {
        if (IsConsoleRedirected())
        {
            Console.WriteLine("You are running in a redirected console.\r\nWrite Unicode via WriteFile and be happy!");
        }
        else if (IsPowerShellIse())
        {
            Console.WriteLine("You are running in PowerShell ISE and can support complex scripts.");
        }
        else if (IsConsoleFontTrueType())
        {
            Console.WriteLine("No PowerShell ISE, but a TrueType font is selected;\r\nyou can at least display some Unicode in CMD.");
        }
        else
        {
            Console.WriteLine("No PowerShell ISE, no TrueType font; you are limited to one code page.");
        }
    }
Basically there is no built-in IsTooComplexForCmd() method in the CultureInfo class to further figure out what works in the "it's TrueType" case beyond the "some Unicode works" comment, but we can make our own -- using the same technology currently being used to dumb down everyone.
The solution can be easily derived from the following algorithm:
And that's it.
I'll leave writing the most elegant version of that last function based on these seven steps as an exercise for the reader. Anyone want to give it a shot?
Braver developers might just use this code to warn people when they should be using the Graphical PowerShell entirely -- they could simply choose to NEVER fall back, and always support languages fully, but in deference to CMD.EXE they could warn users when the results may not be as good as they could be.
And this bit of code can help you single-handedly create PowerShell cmdlets that don't suck for all users!
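That "never fall back, just warn" policy might look roughly like the sketch below. It is written in Python for brevity rather than as cmdlet code, it takes the three host checks from Main as plain booleans, and it assumes the caller already knows whether the language needs complex script support. Every name in it is hypothetical.

```python
from enum import Enum
from typing import Optional

class Host(Enum):
    REDIRECTED = "redirected console"
    ISE = "PowerShell ISE"
    CMD_TRUETYPE = "CMD.EXE based console, TrueType font"
    CMD_BITMAP = "CMD.EXE based console, bitmap font"

def classify_host(redirected: bool, in_ise: bool, truetype_font: bool) -> Host:
    """Mirror the order of the checks in Main: redirection, then ISE, then font."""
    if redirected:
        return Host.REDIRECTED
    if in_ise:
        return Host.ISE
    return Host.CMD_TRUETYPE if truetype_font else Host.CMD_BITMAP

def display_warning(host: Host, needs_complex_script: bool) -> Optional[str]:
    """Return a warning to show, or None. The language itself is never changed."""
    if host in (Host.REDIRECTED, Host.ISE):
        return None  # full Unicode is available here, nothing to say
    if host is Host.CMD_BITMAP:
        return "Output may show question marks; try PowerShell ISE."
    if needs_complex_script:
        return "This language needs complex script support; try PowerShell ISE."
    return None  # TrueType font and a simple script: good enough
```

The point of the design is that the output language and cultural preferences stay whatever the user asked for; only the advice changes with the host.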
No, this is not a blog about Microsoft's newest review model. For that, you can paste in the description of the previous three systems, each of which was best able to allow for the best possible evaluation of all employees!
If you have seven constants, LOCALE_SDAYNAME1 through LOCALE_SDAYNAME7,
that you can pass to GetLocaleInfoW/GetLocaleInfoEx, it would to some people be entirely reasonable to think that LOCALE_SDAYNAME1 will return the first day of the week for a given locale.
While others would expect that each value has an absolute identity attached to a particular day of the week.
If one was implementing an internationalization library, there are pros and cons to each method, and when combined with LOCALE_IFIRSTDAYOFWEEK, each of these two methods is entirely derivable from the other.
Now which one a developer might prefer is most often directly related to the way that helps them solve their particular issue, and of course in the naive sense there is a 50% chance they'll end up being unhappy with how it is implemented in NLS.
All those years ago, the decision was made to check behind door #2.
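Door #2 being the absolute identity: LOCALE_SDAYNAME1 is always Monday, and LOCALE_IFIRSTDAYOFWEEK is a 0-6 index into that same Monday-based list. Deriving the "first day of the week" ordering is then just a rotation, as in this rough sketch. It is Python with a hard-coded English list standing in for the seven strings you would actually fetch via GetLocaleInfoEx, and the function names are invented for the example.

```python
# Stand-in for the strings behind LOCALE_SDAYNAME1..LOCALE_SDAYNAME7,
# which always run Monday through Sunday regardless of locale.
ABSOLUTE_DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

def week_in_locale_order(day_names, first_day_of_week):
    """Rotate the Monday-based list so the locale's first day comes first.

    first_day_of_week uses the LOCALE_IFIRSTDAYOFWEEK convention:
    0 means Monday (SDAYNAME1) and 6 means Sunday (SDAYNAME7).
    """
    if not 0 <= first_day_of_week <= 6:
        raise ValueError("first_day_of_week must be in the range 0..6")
    return day_names[first_day_of_week:] + day_names[:first_day_of_week]

def absolute_from_locale_order(ordered_names, first_day_of_week):
    """Inverse rotation: recover the Monday-based list from the locale order."""
    shift = (7 - first_day_of_week) % 7
    return ordered_names[shift:] + ordered_names[:shift]

sunday_first = week_in_locale_order(ABSOLUTE_DAY_NAMES, 6)
print(sunday_first[0])
```

Either representation can be rebuilt from the other plus that one index, which is why the choice between them is about convenience rather than capability.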
Looking across all of the locales in Windows, here is how LOCALE_IFIRSTDAYOFWEEK is distributed:
Of course some of these are in all likelihood wrong, as blogs like Worked like a dog trying to make Tuesday Wednesday? Maybe he just needed Windows. And Bollywood tend to indicate. But overall, the general trend of the way they are distributed is clear.
But as for whether NLS should have been designed in a way that makes LOCALE_SDAYNAME* more dynamic, it is really too late to even call it a bug. After more than 15 years and more than 15 versions of Windows (95, 95 OSR2, 98, 98 SE, Me, NT 3.1, NT 3.5, NT 3.51, NT4, 2000, XP, 2003, Vista, 2008, 7, 2008 R2), the design is what it is.
And this issue is the design....
The message that came from Josh was an interesting one:
I’m investigating unexpectedly high memory usage in a feature I wrote, and I tracked it down to the fact that getting the DisplayName property of a CultureInfo object (InvariantCulture in this case) seems to trigger allocation of literally thousands of strings, which are not GC’d as far as I can tell. Is this a known issue? I can provide some example callstacks (at the point the allocations occurred) if it helps.
Thanks!
People have a tendency to look at the way memory gets used in the .NET Framework, so this particular allocation comes up from time to time.
Basically, it is for filling in locale data in the hash nodes that the .NET BCL uses to store its list of locales and the data in it.
The list is fairly static per machine (custom locales can be installed, but that is not a very common occurrence), and it is data that is unlikely to change often but which may well be requested again if it was ever requested once. If it never is then it will page out eventually, but freeing it isn't necessarily a reasonable thing to do....
Now I'm not going to unilaterally defend the design here, since I'm sure any competent person reading this blog can question some of the actual allocations here, in some cases.
But in the end, I know that the people who have a full time job keeping that memory size down are the ones in the best position to fight those battles, so even if I am startled I shouldn't be too surprised or shocked. Therefore I am going to set that argument aside and assume that the people who did it had their reasons and that they are reasonable.
Perhaps, as can be the case in this Blog, it might make more sense to attack a different notion.
You may remember when I wrote Having neither army nor navy, Invariant is apparently just a dialect.
Now perhaps requesting the invariant culture's display name -- Invariant Language (Invariant Country) -- may seem important.
But if the locale is not being used for anything else then it probably isn't worth it.
And that is from me, speaking as the person who took Microsoft down the road of filling in all this data -- for the locale/culture that doesn't even exist, in this world where everything varies....