Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
So thinking about the design of MUI, some interesting thoughts came up in a conversation the other day.
Let's take NOTEPAD.EXE for a second.
If we move to the WINDOWS directory and hit ALT+ENTER we get its Properties:
Let's move over to that Details tab:
Hmmm. I thought that we were all language-neutral now. Why is the file's version resource claiming that the file has a language of English (United States), anyway?
Let's look over in the language-specific directory and see what the NOTEPAD.EXE.MUI file looks like there:
Wow, no VERSION properties at all!
Wait, maybe that is an issue with the filetype. Let's copy the file and remove the .MUI extension:
and try again:
Ah, there we go. So it has a VERSION resource and the language is tagged.
Of course this proves nothing -- the original file makes the same claim, even though all of its resources are gone (as we learned in Random irreverent thoughts about the Ultimate Fallback).
So let's look at some of the other language files, like Arabic:
or maybe Hebrew:
And maybe we could mix things up a bit.
Like looking at the Arabic file under a Hebrew user interface language:
or the Hebrew file under the Arabic user interface language:
Okay, so the language files are being marked.
The only weird thing left is that the language neutral file, the one that even Mark Russinovich discovered the hard way is language neutral (ref: Random irreverent thoughts about the Ultimate Fallback), is marked as having a language.
Well, it turns out that this comes from the language splitting functionality in the Resource Compiler (RC.EXE and RCDLL.DLL).
Now RC.EXE has several flags related to creating .MUI files:
/fm mresname -- RC creates one language-neutral .RES file and one language-dependent (MUI) .RES file using script-file. This option must be used together with the /fo resname option. RC names the language-neutral .RES file resname.res and names the language-dependent (MUI) .RES file mresname.res.
/fo resname -- RC creates a .RES file named resname using script-file. If the /fm mresname option is also set, RC creates one language-neutral .RES file and one language-dependent (MUI) .RES file.
/g1 -- If /g1 is set, RC generates a MUI file if the only localizable resource being included in the MUI file is a version resource. If /g1 is not set, RC will not generate a MUI file if the only localizable resource being included in the MUI file is a version resource.
/j -- Localizable resource types RC places into the language-dependent (MUI) .RES file. If the /q option is also set, this option is ignored, and the information in the RC Configuration file takes precedence.
/k -- Overlapping resource types that RC places into both the language-neutral .RES and the language-dependent (MUI) .RES files. The resource types that are specified by the /k option must be a subset of those that are specified by the /j option. For example, -J2 -J3 -K3 specifies that RC places resource type 3 in both the language-neutral and language-dependent (MUI) files. If the /q option is also set, this option is ignored, and the information in the RC Configuration file takes precedence.
/q -- An RC configuration file that follows the RC Configuration File format. The RC Configuration File format enables components to self-describe resource information such as resource versioning, MUI file path, resource types and items. This file specifies which resources go into the language-neutral .RES file and which resources go into the language-dependent (MUI) .RES file. This option, and the information provided in the RC Configuration file, override the command line options /j and /k.
Notice that in none of that talk of splitting resources does it ever make any claims about changing the language of the "neutral" .RES file it creates as part of the splitting.
Not that it wouldn't make sense for it to do that (since it took the time to cause a de facto change to the language of the resource by removing all of the language-specific information), but that work item would have some interesting consequences, which I can talk about more some other time, perhaps....
This post brought to you by ຝ and ພ (U+0e9d and U+0e9e, a.k.a. LAO LETTER FO TAM and LAO LETTER PHO TAM)
Now this series has described a lot of the challenges related to font installation and removal, along the way hinting at the ways the people and processes can try to nibble away at these problems.
I have also been teasing at ideas on fuller solutions to the problems.
Before doing that, it is important to take a step back, and ask ourselves (now that we are so many steps into this quest), what is the reason behind the asking of the question?
Now this is not just to be trite or anything.
There are many distinct scenarios, and since the problems vary between them, it is only reasonable to doubt that the solutions would somehow all be identical.
So we'll look at each of the broad categories separately.
There will be some overlap, but the issues of each are of particular interest as are the suggestions at solutions.
FIRST, we have the scenario where the font ships with Windows or Office, and Windows or Office is its primary (and for all intents and purposes only) distribution mechanism. In this case, the removal scenarios are probably less important since removal is not as common, but the update scenario where files are replaced is a huge problem to be figured out and dealt with. Similarly, mix-ups in filenames are pretty unlikely, and font face name overlap is, if not unheard of, then at least less common. People generally stay away from the names of these known and much-used "platform" fonts because in general people are not fools, or at least not the sort of fool who stands up with their font to watch it mowed over by system file protection or installation repair. The fonts are expected to be widely used, and usually are.
SECOND, we have the scenario where the name of the font file or the font face name just begs to be stomped on because the name is such an obviously intuitive and good choice for a particular scenario. After all, why not name the awesome Hindi font file name HINDI.TTF or the fantabulous Lao font face name Laotian, anyway? And yes we now know from the previous blogs in the series why this is a bad thing -- because the replacement and stomping over each other is such an obvious issue that it is really a shame that no one ever took up the cause to have the OS itself try and find a solution here. Yet almost all solutions to date try to work inside the box here, and the fonts that go down this road do, for the most part, get what they ask for. The fonts may see wide usage, or may not -- the designers probably would like wide usage, and you never know....
THIRD, we have the scenario of a font installed and primarily used by a particular application or suite. Other usage might occur if users like the way the font looks, but that is incidental, and perhaps not even desired by the company that got the font on the machine -- their primary need is having the font there so they can use it for some purpose. In their way would be people who remove fonts themselves through ill-advised experimenting with dragging files around, and of course good old fashioned SECOND scenario problems with fonts that share file or face names. Wide usage really is not what they are looking for.
Now beyond these really broad categories, there are four specific times we care about here in the life of the font on a machine:
We will want to look at each of these four periods of time across each scenario, when applicable.
Upcoming posts in the series will try to go through all of these scenarios and time periods, and give the best possible information on how best for the font to meet the needs and the intent of the creators of the font, and sometimes of the users too, if that is the intent....
This blog brought to you by Ֆ (U+0556, aka ARMENIAN CAPITAL LETTER FEH)
Nothing technical in this post, sorry!
"It had absolutely nothing to do with me, and I had absolutely nothing to do with it."
I had to tell many people the above line throughout the day.
It had to do with the headlines all over the Internet:
It happened at the Archstone Redmond Campus apartments, which is where I live.
The shots were fired a little after nine, when I think I was already scooting over to Microsoft.
There were sirens while I was heading down 156th at the time, but it was a fire engine and an ambulance so I don't think they were related.
The five articles I mention above were all links that people sent me (via email and over IM) throughout the course of the day, people who were concerned and wanted to make sure I was not somehow involved with the incident of the estranged husband with the .357 in his waistband who shot his wife with a 9-mm while she was heading to work (at Microsoft) from the apartment she was staying at.
I wasn't.
I don't know who she was, or who he was, or who the person she was staying with is. From the picture in the King5 article, it looks like it was just outside the office at Building Z and I am in Building E -- though like I said I wasn't even home when it happened, even if it had been right outside.
All day, people kept asking me about it -- eight in all, with the five articles.
One even asked if I was the person the woman who was killed had been staying with.
It's just that you've had friends stay over before when they needed a place to live, and your facebook Relationship Status says "It's complicated." This guy was out to lunch enough to shoot his wife. So I kind of put two and two together?
2 and 2? They got 22, this time. My life isn't *that* complicated, by any means.
One of the articles even stated the friend was female. Which I'm not. Of course each person didn't see each story -- that was just people like me.
It made the day kind of surreal, to be honest.
The most recent one was a comment in Facebook after I mentioned it in my status:
Sadly, I have to admit you were the first person who popped in my head when I heard the news story. Josh & I used to live there, too.
That is when I decided to write this actually.
Back to the various news pieces... a neighbor's quoted statements I found quite frankly perplexing, things like:
If it was a random act of violence I would be concerned... But this is an isolated incident. It could happen anywhere.
I'm hoping this was just a misquote. I certainly hit a parse error on that one.
The police PIO was not much more reassuring, from the PNWLocaleNews.com coverage:
"It all happened in a matter of seconds," said Bove, who added that the man had a .357 magnum in the waistband of his pants during the shooting.
This makes it better, the other gun? Well I guess it means there might be two fewer guns out there.
Who was well-served by this coverage? I know I wasn't, and none of my friends who contacted me really were. No one was. So is this what the press is reduced to now? The people's right to know things that people really didn't have any actual need to know?
And I wondered about this guy I don't know who shot himself and the woman he shot and the friend of hers he didn't, and I realized that no matter how weird my day was, it had nothing on theirs. There is something really horrifying about the whole incident, and the series of tiny articles, and the people who would email me.
What about the people who emailed the friend? All of the people who emailed me just wanted to make sure everything was okay, but what if I was the one in the situation and a friend had just been shot?
I wasn't angry at the people who contacted me, but most of them were worded weirdly enough that I probably would have been, if I were involved.
No one really trains folks on how to do those "just wanted to make sure everything was okay" calls. And as far as I know there is no Miss Manners column about it, either.
I actually watched the news tonight on several channels, which is kind of a departure for me (I am mostly a Stewart/Colbert man for news these days), curious about what the coverage would be. I suppose I should be grateful that when it was mentioned it at least came before the weather and local stock info and especially before the Cheeto a woman found that was shaped a bit like Christ on the cross (the woman dubbed it Cheesus, of course -- didn't Heinlein have one of those in I Will Fear No Evil?) -- Cheesus was a MYQ2 exclusive.
The news reports were pretty much a rehash of the earlier stories, and all they added was a little of the art of the police moving around on the property. Which is why I am a Stewart/Colbert person now -- they at least add something to their coverage (though the downside is that they don't cover this kind of story at all since it really isn't funny).
Wrapping all this up finally, I will say a prayer tonight, for both of the two people involved who didn't shoot at anyone, and wish them whatever support they can get. If you want to take a moment to send some positive thoughts out into the universe on this then I doubt the time would be wasted.
Because the odd messages are the least of their problems, and there is really no way to make this better....
Richard asks (via the Contact link):
Hi,
I have a (physical) US keyboard. I often write emails to and about mainland Europe based people, and like to spell their names correctly.
Typically I would remember some key sequences so that I could spell Jürgen's name correctly by typing J,ALT-0252,rgen etc.
I found recently that by switching to a US-international keyboard in windows I can instead use RIGHTALT-u and RIGHTALT-y etc. for easier (more obvious at least) access to those accented characters.
However, it has the side effect of causing me to be unable to write "echo", as the first quote followed by e causes ë to be written instead and i end up typing ""<bksp>e instead, which is a pain.
So - is there a better way (rather than using the character map application) for a US keyboard owner to type in accented characters but not be burdened with headaches around " and ' ?
(I never really use the right alt (alt-gr I guess) in daily use, so it was convenient to use it for accented characters).
While I'm here, some supplemental points:
1) your site appears to crash firefox3 pretty hard, for me at least
2) I use tweakui to set "focus follows mouse". That's a pain when using the language bar because it switches languages back when you re-focus by moving the mouse.
Apologies if this is not the place to ask, but your blog seemed to have lots of relevant and useful information.
thanks,
Richard
In my book, the best/easiest way to handle this is:
And then everything ends up better. :-)
Keep the keyboard layout around for the future, too. If it is what you are used to typing with, you'll probably want to use it again....
This blog brought to you by ½ (U+00bd, aka VULGAR FRACTION ONE HALF)
There are many things that are beyond my scope of knowledge or power.
This blog is an attempt to clear out some of these items that ended up on a TO DO list since they were sent to me via the Contact link but I really can't do much with them.
If anyone has advice on the issues raised here, the Comments section is your scratchpad!
The first one from someone in Israel:
Dear Mr. Kaplan...
I am currently working on the production of a Chinese-Hebrew-Chinese dictionary. Could you please refer me to a DTP software that can (1) generate a Chinese radical index, (2) generate Stoke count index, (3) generate a Pinyin Index, (4) support Hebrew fonts and (5) generate a Hebrew Index?
Many thanks in advance for your help,
Yours Sincerely,
Ran
I really have no idea of a good software package to do this for a dictionary in any language, actually.
Sorry!
In another message, Elaine asked:
when opening my attached file it was encoding and could'nt read it can you please tell me what to do or going about how to decoding the attached file
I really can't help here since (a) the file in question wasn't attached, and (b) there is no way to attach files to Contact link email.
Usually when I have files in a totally unknown encoding I push back on the source of the file -- what if it is some kind of script with ill intent?
Just kidding.
What I usually do is try to open it in Word, which does a fairly good job of trying to figure out what the text is so it can be recovered....
Then there was a post from Tim:
Hi, I'm just getting started with Text Services Framework and was wondering if you could recommend any good resources. I read an older post which mentioned using .NET 2.0 with TSF. I'm particularly interested in leveraging C# for the TSF - but I'm not even sure if that is really feasible.
Hope all is well.
Respectfully,
Tim
For creating actual Text Input Profiles, I have spoken at length about the lack of good documentation here, and I have even explained how you cannot ever do this in managed code, as long as managed code only allows one version to be loaded per process.
For just using an existing one, it really depends on what one is trying to do, but there are not oodles of good resources there either.
Unfortunately.
Back to harder to understand communications, this one from Patrick had me scratching my head a bit:
Michael,
Picked up your linkage through your blog. Google suggested you have some zen related to correcting Outlook 2007 and Japanese. any help on this would be appreciated.
No zen here, we're fresh out.
Anyway, you kind of get the idea. There are 27 other pieces of mail that are kind of along the same lines, interleaved with a little over 300 that aren't. I'm going to archive them all now, and as I have kind of explained the SIAO Dead Letter Office a bit, I won't feel guilty about never really responding.....
All of the proposed characters in the rejected proposal for Klingon are sponsoring this post, especially U+xxE0 KLINGON LETTER QH and U+xxE4 KLINGON LETTER TLH
I was asked the other day for some help understanding why the letters used in C# to determine different kinds of literals were not the same as the letters used for format specifiers.
Hmmmmm, interesting. First let's look at what they are.
For format specifiers (from the Standard Numeric Format Strings topic):
Now the important note here is that we are looking at hints for how to deal with numbers that are contained within strings -- like the character to use in a ToString() call.
The type of the number being converted to a string is not needed here -- it is implicit in the type it is already in! After all, if I call Decimal.ToString() then I know the type is Decimal.
Also, if you look at the topic you will see that each of these types has a specific meaning. One could argue that the definition of the "Decimal" format specifier might lead to confusion given the Decimal type, but that is a separate issue. :-)
In the end, this is a hint to a way to format a string at runtime.
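As a rough illustration of that point (the specifier is only a runtime formatting hint, and the type comes from the value itself), here is a minimal sketch in Python rather than C#. Python's format specification mini-language is not the same thing as the .NET standard numeric format strings, so the specifiers below are Python's own and serve only as an analogy; the `render` helper is a hypothetical name.

```python
from decimal import Decimal


def render(value, spec):
    """Format a value with a format-spec string; the type is implicit in the value."""
    return format(value, spec)


# The same kind of hint applied to values of different types.
print(render(255, "d"))                  # decimal digits
print(render(255, "x"))                  # hexadecimal digits
print(render(1234.5, ".2f"))             # a float, fixed-point
print(render(Decimal("1234.5"), ".2f"))  # a Decimal, same specifier
```

Nothing in the specifier says what the type is; the caller already knows, just as with Decimal.ToString().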
Now the literal specifier is for a different purpose.
Its intention is to give a hint on how to treat a literal number so that how it is represented in the compiled code is what would be expected. The number will always have a specific representation in the code, and the suffixes can be used to override that default. The suffixes are similar to the format specifiers in general, but quite different in their specific values, e.g.:
and so on.
Something very different is being done with these suffixes, which is why the two groups are slightly different anyway -- they don't really overlap, and the meanings change like in the Decimal example I mentioned above.
A little confusing, perhaps. But when one really thinks about the whole situation, it all kind of makes sense....
Now when you add on the concept of localization (you knew I'd be doing that eventually!), neither of these is localized, but obviously the descriptions will be.
Thus if you look at Chaînes de format numériques standard, the French equivalent of the English Standard Numeric Format Strings topic, it is obvious that the format specifier stays the same even as the name changes....
This blog brought to you by m (U+006d, aka LATIN SMALL LETTER M)
Allan asks via the Contact link:
Hi, Michael;
Thanks for a great blog; it's the most insightfull (and honest) look at the whole IME business. I remember running into a Mr Kaplan at one of the Microsoft Game Meltdown events; is that part of your past, before respectability and keyboards?
Anyways; I'm trying to port our IME code for a fullscreen D3D8 application (yes, I know.. we're almost pre-historic, but sadly we make casual games, and to that audience, DX8 is cutting edge). We're having a pain of a time getting IME to work on Vista using the IMM framework, but since functions like ImmGetCandidateListW are broken (returns with no response).
Are we going to need to rewrite everything to use TSF (in all it's undocumented glory?), or are there ways to beat this beast into submission.
Thanks,
Allan
At the time when I wrote most of my previous comments I was unaware of it, but I have since found out that the original Input Method Manager (IMM) based IME samples won't actually compile with the header files from the Vista SDK or the Vista version of the Platform SDK, and none of the IMEs that ship with Windows rely on the IMM API.
Now for a few features the IMM API provides interesting shortcuts into getting information, and there is a huge compatibility layer that is used to support applications that use the IMM API to still work properly without change, but that layer does not have 100% coverage of the functionality across Vista and later versions of Windows, so I think in the end people will really be pushed to use the Text Services Framework to get their work done.
From a timing perspective that change will most likely happen as soon as an application runs into a blocking bug in that compatibility layer for the IMM API, as soon as Vista is targeted.
There is a huge investment here, so I suspect that reported bugs have a decent chance of getting fixed, especially when they block functionality of existing applications, but this is hard to rely on and will ultimately have the biggest effect on applications in China, Japan, and Korea any time they are blocked because of this issue.
Regular readers might recall the ImmGetCandidateList on Vista issue being mentioned here previously, in Sprechen Sie IME?.
Interestingly, the function does not fail for all IMEs, only some of them.
Though the ones it fails for probably aren't all that reassured by this fact.
Further, I was told by a few people that in some cases it is only the "ImmGetCandidateList(hIMC, dwIndex, NULL, 0)" code that is intended to get the required length that fails -- if one passes in fully sized buffers one can actually succeed, and supposedly just passing a non-NULL buffer with a 0 length can sometimes help, too.
People it fails for are probably not that much reassured by this, either.
And finally, I have been told that sometimes doing the lower level work oneself via ImmLockClientImc/ImmLockIMC/ImmLockIMCC/etc. directly can also help here, though I suspect that these are cases where the bigger buffer sizes would also do the trick, so I am skeptical.
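The buffer-size workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_candidate_list`, `FALLBACK_SIZES` and `fake_broken_ime` are hypothetical names, the fallback sizes are arbitrary guesses, and the fake function merely stands in for an IME whose size query fails. The sketch takes any callable shaped like ImmGetCandidateListW (handle, index, buffer, buffer length, returning a byte count) so that it can run without Windows.

```python
import ctypes

# Buffer sizes to try when the size query fails; the values are arbitrary guesses.
FALLBACK_SIZES = (1024, 4096, 16384)


def fetch_candidate_list(get_list, himc, index):
    """Return the raw candidate-list bytes, or None if every attempt fails.

    get_list is any callable shaped like ImmGetCandidateListW:
    get_list(himc, index, buffer_or_None, buffer_length) -> byte count.
    """
    # Step 1: the usual pattern, asking for the required size with a NULL buffer.
    needed = get_list(himc, index, None, 0)
    sizes = (needed,) if needed else FALLBACK_SIZES

    # Step 2: if the size query returned 0, retry with pre-sized buffers.
    for size in sizes:
        buf = ctypes.create_string_buffer(size)
        copied = get_list(himc, index, buf, size)
        if copied:
            return buf.raw[:copied]
    return None


def fake_broken_ime(himc, index, buf, length):
    """Stand-in that fails the size query but fills a sufficiently large buffer."""
    payload = b"\x01\x02\x03\x04"
    if buf is None or length < len(payload):
        return 0
    ctypes.memmove(buf, payload, len(payload))
    return len(payload)


print(fetch_candidate_list(fake_broken_ime, himc=0, index=0))
```

On Windows one would presumably pass a ctypes binding of the real function in place of the fake; that part is untested here, and whether the retry helps will depend on the IME in question.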
In any case, the specific TSF interfaces involving the candidate list are:
And it is the last two that are most interesting for clients who would have wanted the results of ImmGetCandidateList back in the day.
There is more information on what to do over in TSF Aware in the Text Services: Candidates blog -- though centered on the applications, it is at least talking about how to get the info a bit.
Of course the samples on this page are still missing, though those samples are definitely centering around providers rather than clients so they wouldn't help here....
This blog brought to you by の (U+306e, aka HIRAGANA LETTER NO)
Reader Barney asks:
Hi,
I've been reading your blog for a while and using all the good bits, but I have a problem I can't find the answer to.
I have created an on screen keyboard, and the user can enter some text that will be typed by the keyboard.
When parsing and typing the string to the target application (essentially a VB SendKeys replacement) I'm using VkKeyScan to determine how to send the key. However, as you note in some earlier posts, this does not return dead keys.
How do I determine which dead key(s) I need to type to print a character (for example á on a Portuguese (Portugal) layout)?
Many thanks,
Barney
He is right, VkKeyScan/VkKeyScanEx aren't going to do the trick.
But there is a lot of other stuff missing from those functions, too -- they are best just ignored for any genuine need.
The only real, supported answer for code running in user mode is the solution that just asks the keyboard for all of the information and stores it for you to use as you need it....
This is code I have actually written twice in my life:
Now the first one is a bit more challenging to get to (I suppose it could be eventually puzzled out via Reflector?), but it is the code that supports the "test surface" in MSKLC that lets you test out the layout as you are developing it.
The latter's source, however, is right there in the series. It is written in C# but it is mainly Win32 code written in C# for easy conversion to other languages.
By storing the information on what the keystrokes do, you can simulate their effects later on....
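To make the "store it, then simulate it" idea concrete, here is a minimal sketch of the lookup side only. The tables are hypothetical, hand-written sample data with made-up key names (not real Portuguese layout data); a real tool would fill them in by asking the keyboard layout what every keystroke does, as described above. `build_reverse_map` is likewise an illustrative name.

```python
# Hypothetical sample data: (key name, shifted?) -> character produced.
DIRECT = {
    ("A", False): "a",
    ("A", True): "A",
    ("E", False): "e",
}

# Hypothetical dead key table: dead keystroke -> {base character -> composed character}.
DEAD = {
    ("ACUTE", False): {"a": "\u00e1", "e": "\u00e9", "A": "\u00c1"},
}


def build_reverse_map(direct, dead):
    """Map each producible character to the list of keystrokes that types it."""
    reverse = {ch: [stroke] for stroke, ch in direct.items()}
    base_strokes = dict(reverse)  # snapshot of the single-keystroke characters
    for dead_stroke, combos in dead.items():
        for base, composed in combos.items():
            # Only add a dead key sequence when no direct keystroke exists.
            if base in base_strokes and composed not in reverse:
                reverse[composed] = [dead_stroke] + base_strokes[base]
    return reverse


REVERSE = build_reverse_map(DIRECT, DEAD)
print(REVERSE["\u00e1"])  # dead key first, then the base letter
```

With the reverse map built once per layout, the SendKeys-style replacement only has to look up each character and replay the stored keystrokes in order.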
This blog brought to you by ಘ (U+0c98, aka KANNADA LETTER GHA)
This is not a blog advising terrorists on how to circumvent the efforts of TSA inspectors!
Developer Sean mentioned:
Not sure who to address this to, but we just noticed that the wide string conversion functions don’t handle the whitespace Unicode markers (0xfeff).
The function in which he first noticed the behavior was wcstoul, a function which clearly describes its whitespace behavior:
expects nptr to point to a string of the following form:
[whitespace] [{+ | –}] [0 [{ x | X }]] [digits]
A whitespace may consist of space and tab characters, which are ignored...
Okay, so it is easy to see why he was expecting it to be ignored, but that now leads us to wonder how wcstoul is deciding what "whitespace" is -- are they doing a simple check for tab and space?
The great thing about the C Runtime is that the source is right there so anyone can take a look. Let's do that now. The function can be found in VC\crt\src\wcstol.c, and the relevant bit of the function is:
while ( _iswspace_l(c, _loc_update.GetLocaleT()) ) c = *p++; /* skip whitespace */
Ok, so the function skips the initial whitespace, like it claims to. But U+feff, the famous ZERO WIDTH NO-BREAK SPACE, obviously fails this particular test.
It turns out that iswspace and its cousins like _iswspace_l are using the character property information that comes out of the NLS GetStringTypeW function, which I have talked about before.
So where does GetStringTypeW decide what is a C1_SPACE or C1_BLANK?
This is something I mentioned last year in The difference between C1_SPACE-ing out and drawing a C1_BLANK, and clearly from that list you can see that although some space characters are covered there, ZERO WIDTH NO-BREAK SPACE is not -- because Unicode calls it a formatting character (general category Cf), not a space -- and NLS goes along with that.
It turns out that the code in question was grabbing its source string from a file that started with a BOM -- which kind of points to the best way to resolve the problem: strip the BOM out since it is a part of the file "envelope" and not a part of its content....
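A minimal sketch of that resolution, in Python rather than against the C Runtime: it checks how the Unicode data classifies U+feff and strips a leading BOM before parsing. The helper names `strip_bom` and `parse_unsigned` are hypothetical, and Python's own whitespace rules stand in for the NLS data here, so this shows the approach rather than wcstoul itself.

```python
import unicodedata

BOM = "\ufeff"


def strip_bom(text):
    """Drop one leading U+FEFF; it belongs to the file envelope, not the content."""
    return text[1:] if text.startswith(BOM) else text


def parse_unsigned(text):
    """Parse a base-10 integer after removing a leading BOM, if any."""
    return int(strip_bom(text))


print(unicodedata.category(BOM))     # general category of U+FEFF
print(BOM.isspace())                 # whether it counts as whitespace
print(parse_unsigned("\ufeff  42"))  # real whitespace after the BOM is still skipped
```

When the text comes straight from a UTF-8 file, opening it with the "utf-8-sig" encoding removes the BOM at the point where the envelope is unwrapped, which is arguably the cleanest place to do it.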
This blog brought to you by U+feff (aka ZERO WIDTH NO-BREAK SPACE)
Regular readers might remember my blog entitled Did he say shaping? It's not in the script!, where I showed some of the consequences of the Latin script being in that strange place of sometimes being considered complex and other times not, particularly when a font such as Segoe Script is used.
You know, a font where the complex shaping can be rather all-encompassing and can affect every letter.
It might be fun to put together a tool to assess the CPC (complexity per character) of a font, and use it to compare various fonts. Just a random thought....
Anyway, the other day I had someone send me mail via the Contact link about the Segoe Script font:
I was wondering if you had ever noticed the OpenType contextual scripting error in Segoe Script?
If you type in [lowercase] č as an initial or medial character, it shows up as ć, which causes issues for words that start with č. And not to be out done, ç is rendered as č. I was wondering about your opinion on such a bug, that can cause an interesting headache for a good portion of Europeans, and N&S Americans?
Interesting, very interesting.
(I say someone since he or she did not leave a name!)
Now of course I never throw away sample applications, so running the string through the Edit Control Sample from Did he say shaping? It's not in the script! shows that this nameless person was quite right.
We'll take the following string:
č ç čç çč
basically various combinations of U+010d (LATIN SMALL LETTER C WITH CARON) and U+00e7 (LATIN SMALL LETTER C WITH CEDILLA) to see when the appearances change as suggested:
č ---> ć (U+0107, a.k.a. LATIN SMALL LETTER C WITH ACUTE)
ç ---> č (U+010d, a.k.a LATIN SMALL LETTER C WITH CARON)
and run it through our application that shows both the complex and non-complex views of the situation:
Yep, I'd say that the reported bug indeed can be seen in all of those cases that Uniscribe does its "complexification" thing....
I'll forward this on to the folks who manage this font so they do whatever they need to do.
This can indeed be difficult for situations when you need those characters in your language, though of course one only needs to take the lesson from the average person's handwriting to realize that if the worst problem you have is minor things like this, you probably don't have too much to worry about. So this is kind of a big deal that isn't too big of a deal.
I certainly would not start planning on how to try to force the "non-complex" path just to be sure you get the right characters or anything like that. :-)
If you don't know what four Unicode characters would sponsor this blog, you might have been paying too much attention!
Prior posts in the series:
Okay, now at long last the data provided by Thakara nearly two years ago, in response to Creation of transliterating input methods, is here. You can download a Sinhalese transliterating input method using the data provided, either here for the text file or here for the zipped version.
The file is actually pretty small, but the ZIP keeps people from forgetting to right-click on the link to download the file....
I was going to post instructions again but I really cannot do better than I did in 12 (The knights who say நீ, redux, #2) so I will just point people there for the instructions.
Once they are followed the Sinhalese Input Method 0.5 will be right there for people to try out and use.
Enjoy, and if Sinhalese is a language you know fluently, then please feel free to give feedback to help make it better....
This blog brought to you by ෝ (U+0ddd, aka SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA)
Way back in December of 2007, aaron asked in the Suggestion Box:
Your recent In SQL Server, A-Z [...] might not mean the same thing: It got me thinking, a whole post dedicated to the problems of mixing regular expressions and i18n would be very interesting. Some questions i've always woried about but never tested:
'\b' word boundaries, do they incorrectly show up when surrogate pairs or combining characters are involved?
'\b' word boundaries, are there / should there be characters that form word boundaries only sometimes. It's plausible in some interpretations that "hy-phen" has only two word boundaries, at the begining and end, but in reality is has 4, as '-' is not a '\w' character. But do other unicode characters have some sort of weird identity.
If i have an accented character as two code points (combining), does / should '.' (or '?' in Win32 regex) match the character and the accent, or just the base character?
how wide is the definition of '\w' word character? Does it / should it ever change based on the current user locale/language?
Most importantly how likely is your average regular expression going to be i18n unsafe? what are the common pitfalls to avoid?
Note: for 'should / does', i'm asking all of (a) what do you (Michael Kaplan) think it _should_ do, and (b) what do some common implementations do (for instance, the .Net System.Text.RegularExpressions.Regex class, or the new TR1 regex in Visual Studio 2008, or Win32 with FindFirstFile and friends)(Oh, and your blog is awesome!)#aaron
Hopefully the long delay before I got to responding did not change his opinion of the blog. :-)
I'll start off with a quote from Jamie Zawinski:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
This is to help set expectations realistically. :-)
I'll start by saying that most modern regular expression implementations do have certain features that are particularly good for internationalization, such as Unicode storage semantics. Such features are pretty much essential in most cases. Some of them build in Unicode property type information and most of those keep up with recent versions of Unicode.
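To make the "Unicode property" point concrete, here is a minimal sketch in Python (picked purely for illustration; the post is about regex engines in general). It assumes only the standard `re` module, where `\w` is Unicode-aware for `str` patterns unless the `re.ASCII` flag is set. The helper name `is_single_word` is made up for this example.

```python
import re

def is_single_word(text, ascii_only=False):
    """Return True if the whole string consists of \\w characters."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w+", text, flags=flags) is not None

czech = "\u010de\u0161tina"  # "čeština", with U+010D and U+0161

print(is_single_word(czech))                   # Unicode-aware \w
print(is_single_word(czech, ascii_only=True))  # ASCII-only \w
```

The first call reports a match and the second does not, which is the difference between an engine that consults Unicode property data and one that only knows about A-Z.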
With that said, most of the ones that I have worked with are otherwise very primitive, not properly handling Unicode normalization/canonical equivalence, known in .NET as the "text element" semantic. This leads to some problems with Aaron's first and third points above with sequences of characters that should be treated as equivalent to some other character or sequence.
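A small sketch of that limitation, again in Python and using only the standard `re` and `unicodedata` modules. It is an illustration of one engine's behavior, not a statement about every implementation, and the helper name `matches_single_dot` is hypothetical.

```python
import re
import unicodedata

def matches_single_dot(text):
    """Return True if the entire string is matched by a single '.'."""
    return re.fullmatch(r".", text) is not None

precomposed = "\u00e9"   # é as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(matches_single_dot(precomposed))
print(matches_single_dot(decomposed))
# Normalizing to NFC first is one workaround when a precomposed form exists.
print(matches_single_dot(unicodedata.normalize("NFC", decomposed)))
```

The two strings are canonically equivalent, yet '.' treats the decomposed one as two separate things. Normalizing up front only helps for sequences that actually have a precomposed form, which is why it is a workaround rather than a fix.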
Most also have no notion of exceptional cases such as the one in Aaron's second point above -- to support these things you have to build up complex expressions to try to handle the exceptional cases. If one is lucky they are included in samples, but usually only the very simplest ones tend to be built in.
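Aaron's "hy-phen" case can be checked directly. The sketch below (Python, standard `re` module only) lists where `\b` matches, and then shows the kind of hand-built expression one ends up writing to treat an internal hyphen as part of the word. The helper name and the `HYPHENATED_WORD` pattern are assumptions of mine for illustration, not anything a particular engine ships with.

```python
import re

def word_boundaries(text):
    """Return the offsets at which \\b matches in text."""
    return [m.start() for m in re.finditer(r"\b", text)]

print(word_boundaries("hy-phen"))

# Hypothetical workaround: allow hyphens between runs of word characters.
HYPHENATED_WORD = re.compile(r"\w+(?:-\w+)*")
print(HYPHENATED_WORD.findall("a hy-phen word"))
```

The first call reports four boundaries (offsets 0, 2, 3 and 7), and the second pattern keeps "hy-phen" together as a single token. Every additional exceptional case means another expression like that one.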
And none that I have ever worked with properly handle locale specific differences, the thing that I have referred to here in the blog as "sort elements" -- what users in a particular language think of as a single character, the kind of thing that Aaron hints at in the fourth point above.
For more on sort elements and text elements, blogs of mine like Sort element vs. text element are a good place to start.
The theory of all of that good property support often runs into the kinds of problems described in No Regex in the Unicode room! (and no sex in the champagne room, either!) and 4400 (*not* 'The 4400') and 'The 44' (*not* 'The 4400'), where what the engine does manages to fall short of what one might expect from an implementer of information coming out of the Unicode Character Database....
And I'm not going to pick on Microsoft's implementation, which is probably about average here. Most suffer from the complicated nature of the data in the UCD when their comparatively simplistic implementation tries to use the data.
Which then leads to the last question, the one about how common the "i18n unsafe" expression problem might be expected to come up. On the whole, I expect it is way more common than people realize, as the nature of the more complicated cases requires built-in expressions much more complicated than the definitions that are usually present....
An ideal implementation plan for such an engine is covered in Unicode Technical Standard #18: Unicode Regular Expressions, whose own summary states: "This document describes guidelines for how to adapt regular expression engines to use Unicode." Many fall short of that ideal, though (the only reason I don't say all is that I have not tested every engine out there, but all the ones I have used and/or dabbled with and/or tested have issues).
Now going back to that original series of blogs about SQL Server, it is clear that the problems I point out in that series and in posts like Wild[card] thing, You make my CHAR sing and With SQL Server (and SQL itself) comes the illogic of 'trailing spaces' (and the myth of fixed width) have more than anything else to do with SQL Server choosing to draw that line between appropriate behavior and simple definitional consistency in a better place than regular expressions tend to. Which leads to inconsistencies in the documentation and limitations/flaws in the syntax (which was not made to handle things this complex either).
I must admit that I find myself more comfortable with where SQL Server sits here, rather than where regular expressions do. :-)
This blog brought to you by ঐ (U+0990, aka BENGALI LETTER AI)
The question was deceptively simple:
Is String.SubString complex script safe? Can we use substring on a localized string safely?
Now the shape of the question itself hints at the concern -- by asking about complex scripts, the question about String.Substring is being framed in terms of combining characters, with the question being whether String.Substring is smart enough to know not to chop off dependent/combining characters.
Well, the obvious answer is easy -- it isn't.
String.Substring is based on UTF-16 code units and as long as things fall within those boundaries, it can/will split them up any way it is asked to, without warning.
Once again, there is an easy answer, the one I talk about in posts like:
The StringInfo class has the methods and properties to properly respect the character boundaries the question is talking about.
Note of course that this won't do anything with compressions (contractions) used in sorting, but we'll let that one lie for now.
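To show the difference between chopping on raw code points and chopping on something closer to text elements, here is a rough sketch in Python. It is not StringInfo and does not reproduce its algorithm: the functions `text_elements` and `safe_prefix` are hypothetical helpers that simply attach any character in a Unicode "Mark" category to the character before it. Note also that Python strings are indexed by code point rather than UTF-16 code unit, so the surrogate pair problem does not show up here, but the combining character problem does.

```python
import unicodedata

def text_elements(text):
    """Group each base character with the combining marks that follow it.

    Rough approximation only; full grapheme cluster rules cover more cases.
    """
    elements = []
    for ch in text:
        if elements and unicodedata.category(ch).startswith("M"):
            elements[-1] += ch   # attach the mark to the previous element
        else:
            elements.append(ch)  # start a new element
    return elements

def safe_prefix(text, count):
    """Return the first `count` text elements of text."""
    return "".join(text_elements(text)[:count])

dhu = "\u0c27\u0c41"  # TELUGU LETTER DHA + TELUGU VOWEL SIGN U

naive = dhu[:1]             # plain slicing keeps only the DHA
safe = safe_prefix(dhu, 1)  # element-aware slicing keeps the whole DHU

print(len(naive), len(safe))
```

The naive slice silently drops the vowel sign, while the element-aware one keeps both code points together.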
Let's think more closely about the question for a moment:
Can we use substring on a localized string safely?
If we take the word localization as the much more careful and enlightened version of translation, where ideally all of the relevant issues such as language, regional variation, market expectations, and so on are all considered, can any automated process that chops on character boundaries be considered "safe" for the purposes of localization?
For example if I truncate
You must then watch her assessment of the project
at 27 characters to meet some arbitrary buffer requirement, using StringInfo-style safety guarantees so as not to break the user's character boundaries, you will get:
You must then watch her ass
and then you'll be really sorry that the English version isn't localized so that a localizer could take one look and realize some developer was once again being clever rather than being smart!
Now do we feel better if we know not to truncate an Extension B ideograph by splitting a surrogate pair, or if we know not to convert ధు (TELUGU DHU) into ధ (TELUGU DHA)? Maybe.
But it is just as possible to make the same kind of mistake as the assess example in other languages.
Which just goes to show that every developer has the power to make an ass out of themselves if they don't consider their options carefully. :-)
This blog brought to you by ధు (U+0c27 U+0c41, aka TELUGU LETTER DHU, aka TELUGU LETTER DHA + TELUGU VOWEL SIGN U)
There is a distinct lack of inspiration around the SiaO halls at the moment.
The last track on Aimee Mann's Lost in Space entitled It's Not kind of captures the feeling.
Mainly the second verse, but I'll put the whole song up and just emphasize the second verse:
I keep going round and round on the same old circuitA wire travels underground to a vacant lotWhere something I can't see interrupts the currentAnd shrinks the picture down to a tiny dot.And from behind the screen it can look so perfectBut it's not.So here I'm sitting in my car at the same old stop light.I keep waiting for a change but I don't know what.So red turns into green turning into yellow,But I'm just frozen here in the same old spot.And all I have to do is to press the pedal.But I'm not.No I'm not.People are tricky you can't afford to showAnything risky anything they don't know.The moment you try... Well, kiss it goodbye.So baby kiss me like a drug, like a respiratorAnd let me fall into the dream of the astronautWhere I get lost in space that goes on foreverAnd you make all the rest just an afterthoughtAnd I believe it's you could make it betterThough it's notNo it's notNo it's not...
It is my second favorite song on the album (right after Invisible Ink), but what is grabbing me at the moment is that feeling of waiting.
I have like dozens of blogs already written on various topics, yet something holds me back.
I have no idea what it is.
But I've done that very thing at a stoplight on occasion, and I've certainly done it in the scooter at the Walk/Don't Walk sign sometimes, too.
For the record, I'm not doing it at work, or at the other random events going on (exhibitions, an Insider summit with associated off-hours events, random meals turned social intercourse, etc.).
But in the blog, something holds me back.
Hmmmm.
You know, I look at the lyrics and am reminded of an incident in a bar several years prior, after a show.
Someone there was insistent that the line in the last verse was "and you may call the rest..." despite my calm assurances that the actual words were "and you make all the rest..." . When I went out to my car, brought in the CD with the lyrics in it, and collected the round I won after he admitted I was right, a woman asked what the difference was.
I was probably a bit drunk at the time, but that did give me pause. I wanted to get the thought out clearly; I wonder how much that moment sobered me up, at least conceptually?
"Well, there is usually a big difference between calling something an afterthought and making it one. Especially in the context of a song from the point of view of someone who can't seem to get things started, while other things appear to have no trouble starting around them -- where other people are getting things done, making things happen."
She didn't literally have a light bulb come on over her head, but nobody paying attention would have failed to see the light of the bulb there as Heather nodded.
Yes, I found out her name. It was Heather. She introduced herself right after the above described little incident. The guy who lost the bet didn't introduce himself, but that was his own fault for trying to argue lyrics with me....
Anyway, I did my job that night. Putting out the whole "importance of the words" message I tend to preach about from time to time.
This is also an interesting linguistic phenomenon, whose name I can't immediately recall.
I expected that the whole "not actually being a linguist" thing might come back to haunt me at some point. No sense putting off a trip to the dentist!
But the difference in meaning is an important one, really....
I remember talking with Liz about this one once. Her comment was along the lines of "I may call you lyrically OCD, but you make all the claims real." And she made the two instances sound identical, though I pointed out that her example left one of the cases completely unambiguous (to which she responded that I managed to prove the OCD point right there).
I plead nolo contendere. :-)
Back to the topic at hand -- I think I have to just make myself post some of these blogs that are done.
It is true that there are no cars behind me waiting to go through the light that I'm blocking. But I feel like the Blog goes ever on and on (in true Tolkienian fashion), and I feel like I am stopping up the works a bit.
Sorry about that.
This blog brought to you by ⎉ (U+2389, aka CIRCLED HORIZONTAL BAR WITH NOTCH)
So after yesterday's blog (Behind facebook status like: "...somewhere between 'Addictive' by Faithless and 'Addicted' by Juliana Hatfield."), I had several people point out that there seemed to be something missing.
After all, I mentioned that each of the quotes would have:
...[at least] three meanings: the mundane one that anyone can see even if they know nothing of either/both songs or even either/both artists; the deeper one involving some aspect of part of the underlying meanings/themes of the two songs; the deepest one involving the exact occurrence or memory or story or experience that I either know of or heard of that inspired me to add it. But even if you only have one of these it can still be fun...
Yes, that is right -- in my example I did not explain the third point.
I kind of thought the reason behind the omission would be self-explanatory.
Sometimes those deep personal things are really personal; many times they weren't even about me.
So sharing can feel like a violation of someone's privacy (perhaps mine, perhaps not) and on the whole it seems easiest to just not risk it.
I thought that was implied, but not everyone picked up on this point. Sorry about that!
I share a lot here, more than I ought, according to some.
But not everything is shared -- because in addition to all the above, sometimes the best little secrets are kept....
If it is something I feel comfortable blogging about, you'll likely see it.
This blog brought to you by ݉ (U+0749, aka SYRIAC MUSIC)