Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
In internationalization contexts, one often hears about the notion of dangerous characters.
This is not (as it may sound) about the criminal element, but rather about specific Unicode characters that can cause problems if their consequences are not taken into account.
Here is (for example) what one of the resources provides (I got this from Gwyneth Marshall over in Office):
German
1252
ß (U+00DF)
French
æ (ALT+0230), Æ (ALT+0198)
œ (ALT+0156), Œ (ALT+0140)
ç (ALT+0231), Ç (ALT+0199)
î (ALT+0238), Î (ALT+0206)
The æ chars are not very common so I included a fourth char î.
Spanish
Italian
Any 3 of à(À), è(È), é(É), ì(Ì), ò(Ò), ù(Ù)
Some features useful in English are useless for Italian. Capitalizing the letter “i” to “I” is an example of a feature that must be cut from the Italian version of Word.
Swedish
å (ALT+0229)
ä (ALT+0228)
ö (ALT+0246)
Brazilian
Dutch
Any 3 of accented characters (vowels)
With the Belgian Dutch AZERTY keyboard layout some characters (e.g. some of those entered with the AltGr key) are sometimes impossible to make, especially for accelerator keys.
Sub would like some testing of support of this keyboard layout.
Danish
æ (ALT+0230)
ø (ALT+0248)
Sometimes a problem with æ getting incorrectly separated into a and e
Norwegian
Finnish
Portuguese
Any 3 of á, é, í, ó, ú, à, ê, ô, ã, ç
None of these extended characters tend to cause major problems, but they should not be used as hot keys.
Czech / Slovak
1250
Š (ALT+0138), š (ALT+0154)
Ť (ALT+0141), ť (ALT+0157)
Ž (ALT+0142), ž (ALT+0158)
Characters within the range 0128 to 0159 are often problematic because developers can assume that this range is non-alphanumeric.
Polish
Ś (ALT+0140), ś (ALT+0156)
Ź (ALT+0143), ź (ALT+0159)
These characters are mentioned based on the same reasoning as for Czech.
Hungarian
ő (ALT+0245), Ő (ALT+0213)
ű (ALT+0251), Ű (ALT+0219)
These CE characters are specific to Hungarian.
Slovenian
Č (ALT+0200), č (ALT+0232)
Š (ALT+0138), š (ALT+0154)
Russian
1251
я (ALT+0255)
Ч (ALT+0215), ч (ALT+0247)
Ё (ALT+0168), ё (ALT+0184)
р (ALT+0240)
я: because it is the last letter in the codepage.
Ч and ч: because CP 1252 has the multiplication and division signs in these places.
Ё and ё: because these letters are outside the main range of Russian letters.
Greek
1253
Σ (ALT+0211) – capital letter sigma
σ (ALT+0243) – small letter sigma
ς (ALT+0242) – small letter final sigma
Any of Greek accented characters
Both small sigma characters capitalize to the same capital letter.
Final sigma only appears at the end of a word.
Turkish
1254
ı (ALT+0253), I (ALT+0073)
i (ALT+0105), İ (ALT+0221)
ğ (ALT+0240), Ğ (ALT+0208)
ş (ALT+0254), Ş (ALT+0222)
Most of these characters are the only ones that are not in CP 1252 but are in 1254.
Possible problems with I include setup files starting with I, any registry entries that contain an uppercased I, and auto upper/lower casing in apps.
Japanese
932
0x5c Characters - ソ十申暴構能
0x5f Characters (DBCS) - 雲契活神点農
0x7b Characters - ボ施倍府本宮
0x7d Characters - マ笠急党図迎
0x7e Characters - ミ円救降冬梅
0x5b Characters - ゼ夕票充端納
0x5d Characters - ゾ従転脳評競
0xe5 Characters - 怜蒟栁ょ溷瑯
瑞 (U+745e)
ソ (U+30BD)
0x5c characters are the most problematic ones
Lead and trail byte are identical.
Full-width Katakana or DBCS
Chinese (Simplified)
936
Chinese (Traditional)
950
Korean
949
Thai
874
Two characters combine to form one character. Two issues to watch for:
- Display. The circle should be on top of the character, not off to the right.
- Caret movement. One click should jump over the entire cluster, not two clicks.
Vietnamese
1258
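Several of the notes in the table come down to the same byte value meaning different things in different code pages. Here is a minimal sketch that checks three of them using Python's built-in codecs. The helper names (decode_byte, trail_byte_is_backslash, naive_split) and the sample path are made up for illustration; none of this comes from the original resource.

```python
# Sketch: how the same byte values read under different Windows code pages.
# Only Python's built-in codecs are used (cp1250, cp1251, cp1252, cp932, latin-1).

def decode_byte(value: int, codepage: str) -> str:
    """Decode a single byte value under the given code page."""
    return bytes([value]).decode(codepage)

def trail_byte_is_backslash(ch: str) -> bool:
    """True if the last CP 932 byte of ch is 0x5C, the ASCII backslash."""
    return ch.encode("cp932")[-1] == 0x5C

def naive_split(path: str) -> list:
    """Hypothetical buggy helper: splits the CP 932 bytes on every 0x5C byte
    without checking whether that byte is the trail byte of a character."""
    return path.encode("cp932").split(b"\\")

# Czech/Slovak and Polish: 0128-0159 holds real letters in CP 1250, while
# ISO 8859-1 (latin-1) has only control characters in that range.
for value in (138, 154, 141, 157, 142, 158, 140, 156, 143, 159):
    letter = decode_byte(value, "cp1250")
    control = decode_byte(value, "latin-1")
    print(f"{value:04d} cp1250={letter!r} isalpha={letter.isalpha()} "
          f"latin-1 isalpha={control.isalpha()}")

# Russian: 0215 and 0247 are letters in CP 1251 but math signs in CP 1252.
for value in (215, 247):
    print(f"{value:04d} cp1251={decode_byte(value, 'cp1251')!r} "
          f"cp1252={decode_byte(value, 'cp1252')!r}")

# Japanese: every character in the 0x5c row ends in a backslash byte.
for ch in "ソ十申暴構能":
    print(ch, ch.encode("cp932").hex(), trail_byte_is_backslash(ch))

# A byte-level split sees more separators than the path really has.
sample = "C:\\ソフト\\readme.txt"  # made-up path
print(len(sample.split("\\")), "parts as text,",
      len(naive_split(sample)), "parts as raw bytes")
```

Code that treats 0128 to 0159 as non-alphanumeric, or that scans DBCS text one byte at a time looking for a backslash, will misbehave on exactly the characters listed above.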
The entry that I found most interesting for the purposes of today's post is the one on Greek:
Now this is something I have talked about before, in the following posts:
I ended up having a few conversations with people about what specific circumstances would make these characters dangerous, especially in light of the information in the above posts.
The answer I got was fascinating, and it is something I have run across many times in code in the past....
For some reason, many developers prefer to handle case-insensitive comparisons using the same "change case, then do a binary comparison" methodology that is used in OrdinalIgnoreCase and NTFS-style comparisons. But they often roll their own code here that lowercases rather than uppercases (or they use the CRT functions that lowercase rather than uppercase).
So what happens if one is trying to validate file paths and one converts to lowercase and then does a binary comparison? One gains an extra character (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA) that uppercasing would have folded together with U+03c3, and any file path validation one does will not match the actual file system.
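To make the mismatch concrete, here is a rough sketch of the two styles of comparison. It is only an approximation: Python's str.upper() stands in for the Windows uppercasing table (the two are not identical), per-character lowercasing stands in for a CRT-style table lookup, and the function and file names are invented for the example.

```python
# Sketch: "uppercase then binary compare" versus "lowercase then binary compare".
# Assumption: str.upper() approximates the uppercasing a file system would do.

SIGMA = "\u03c3"        # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

def upper_then_binary(a: str, b: str) -> bool:
    """Uppercase both strings, then compare code point by code point."""
    return a.upper() == b.upper()

def simple_lower(s: str) -> str:
    """Lowercase one character at a time, with no context rules.
    (Python's whole-string lower() applies a final sigma rule, so it is
    deliberately not used here.)"""
    return "".join(ch.lower() for ch in s)

def lower_then_binary(a: str, b: str) -> bool:
    """Roll-your-own variant: lowercase both strings, then compare."""
    return simple_lower(a) == simple_lower(b)

on_disk = "οδο" + FINAL_SIGMA + ".txt"  # name as stored, ends in final sigma
requested = "οδο" + SIGMA + ".txt"      # same name typed with a regular sigma

# Uppercasing folds both sigmas to U+03A3, so these are the same file.
print("upper:", upper_then_binary(on_disk, requested))

# Lowercasing keeps U+03C2 and U+03C3 apart, so validation disagrees.
print("lower:", lower_then_binary(on_disk, requested))
```

The first comparison treats the two names as equal and the second does not, so a validator built on the second style can wave through (or reject) a path that the uppercasing side resolves differently.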
Now I do agree with the decision to case the Greek script as happens now in Microsoft products for the reasons discussed in The last word on the FINAL SIGMA. But it is hard to get away from the fact that many developers run into problems here because they are either doing the wrong thing (in which case they are to blame) or because the CRT is doing the wrong thing (in which case one can blame the forces in the universe that conspire to do something in international standards that is not done by Microsoft).
I think I'll take a much wider view and perhaps blame the original decision in so much of Windows to support case insensitivity by uppercasing. Why on earth didn't they lowercase here? It would have made everything much easier, and then U+03c2 (GREEK SMALL LETTER FINAL SIGMA) wouldn't have to be a dangerous character....
Probably too late to do anything about it (though it is tempting to try to change Windows to lowercase for its case-insensitive binary comparisons and see what breaks!). We'll just have to live with the dangerous nature of this character.
Or maybe encode a GREEK CAPITAL LETTER FINAL SIGMA in Unicode; the fact that no such character exists hasn't stopped us in the past; why let it stop us now? :-)
This post brought to you by ς (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)
Michael, I was in Turkey and could not log in to my Google account from the hotel PC. Later I found the reason was that my username had the letter 'i' in it. On the Turkish keyboard the letter i [which is generally between u & o] represents a different 'i' than the one we use. But I found the letter i in a different place.