Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
The question went something like this:
I'm trying to display GB18030 text (say unicode 0x3400 character) using DrawTextA and WideCharToMultiByte. I am using the code page for GB18030 which is 54936. Why doesn't this work? Originally, I thought it had to do with font linking. Thanks for the help.
You can see what is going on here -- the general assumption that the non-Unicode Win32 API will handle any/every code page that isn't Unicode. Which we know isn't true from the many times UTF-8 support in "A" functions has been discussed (if you look at Raymond's recent post that points out so many of the times I have talked about it, the subject has come up way too many times!).
Once I pointed out UTF-8 and GB18030 at the same time (in UTF-8 and GB18030 are both 'NT' code pages, they just aren't 'ANSI' code pages).
Now GDI is fundamentally a Unicode thing internally and was even back in Windows 95, mainly because most of the plumbing is Unicode anyway.
The issue of which code page is used is not a simple answer like CP_ACP, as I pointed out in What code page does MSLU convert with?. The MS Layer for Unicode was designed to map to what the OS does in so many cases, including the GDI ones that were kind of based on the charset of a device context.
But all of the underlying code pages that the charset values map to are ACPs, and GB18030 cannot be an ACP, for much the same reasons that UTF-8 cannot.
Obviously, the quick answer is to use DrawTextW with the original Unicode text that isn't converted at all, rather than converting it and not being able to display all of the data that DrawTextA won't recognize....
This post brought to you by 㐀 (U+3400, the first CJK ideograph in CJK Extension A)
It is something I found out about right after I saw the post on Shawn's blog entitled Security patch MS07-040 for .Net 2.0 breaks some culture names for .Net 2.0 on Windows XP/2003/2000.
The issue is the one I first blathered about in Important changes in NLS that span Windows and the .NET Framework.
I have to admit I do not like when this sort of thing happens.
The MS07-040 Security Bulletin text does not mention it, and neither does KB 931212 that "documents the currently known issues that customers may experience when they install this security update."
Update 3:51pm: Note that KB 939949 does talk about this change, and hopefully they will update the other KB to link to it soon and make this blog post look horribly out of date!
Somehow this just got shoved in, and if Shawn hadn't posted about it then no one would know until applications that used to work started breaking.
How is this not an issue to at least mention?
Now please don't misunderstand me; I do think it is an important issue, and given that some platforms (like Vista) had the fix and some did not, some apps were already having problems.
But no one wants "stealth" fixes, especially ones with potential backcompat breaks in them.
In my opinion, the update should have been a separate optional one, not one that you can only opt out of if you fail to pick up multiple security bug fixes. Or even if it was listed as required, it still should have been kept separate so people could opt out....
This post brought to you by ⌢ (U+2322, a.k.a. FROWN)
Back in this post and this other one, I have been kind of hinting around at the functionality known as FONT ASSOCIATION.
Now in the post you are reading, I am going to explain what it is.
First we'll go with a dictionary definition. I'll go with the American Heritage definition since the book is close by:
Association (n, ə-sō'sē-ā'shən) -- An organized body of people who have an interest, activity, or purpose in common; a society.
One thing you notice about many such associations is a sense of exclusivity -- by joining the association you get privileges that others do not. But the members themselves are quite independent and have their own purposes and goals.
Well, font association fits into this kind of definition quite well. :-)
Basically, to join the association, you must
If you meet these guidelines, then GDI will have a general preference for the preferred font of the main script of the particular East Asian locale (e.g. SimSun for Simplified Chinese, Gulim for Korean and so on).
And this association will happen even if one does not use a font in the GDI font link chain, which one cannot always rely on otherwise. If it is not turned on then people who are used to it will complain about the bug they hit, and sometimes if the characters are not in the font given to them by their "association" then they will get NULL glyphs rather than characters.
And once again the LOGFONT lfCharSet member becomes the only way one can break out of the association for a little while (unless your font already supports the script, of course). Basically one is better off if one sets the lfCharSet to something that you are sure does not contain the characters, the only good escape here.
Note also from that second post that Gulim has font linking behavior when Korean is the default system locale. But like all good associations, the Fellowship of the Font Linkers are not their partners, and the font link chain that exists for a given font is not respected if the font is chosen through association....
That is font association -- a legacy technology that leads to a specific popular behavior that customers in East Asia may be relying on now -- so we are kind of stuck with it. Even if it can often be lousy....
I wish we didn't, and that I could say "I don't associate with those kinds of fonts anymore."
This post brought to you by 덯 (U+b36f, a.k.a. HANGUL SYLLABLE TIKEUT EO HIEUH)
Last month when I posted Guilt by [font ]association (aka The consequences of picking the wrong font #3), I said I'd try to define font association better.
I'm not going to do that just yet.
Though I will talk about another bug that might be related to it, one that might help discern what is going on a bit. Or maybe it will confuse things further....
The report came in just the other day. Some specific text:
國 (U+570b): In Gulim, MS UI Gothic, PMingLiU, SimSun
国 (U+56fd): In MS UI Gothic, PMingLiU, SimSun
龱 (U+9fb1): In PMingLiU, SimSun
⺋ (U+2e8b, CJK RADICAL SEAL): In SimSun
Մ (U+0544, ARMENIAN CAPITAL LETTER MEN): Not in Tahoma, Segoe UI
This text, when stuck in Notepad using a specific font (Century), was reportedly showing some strange behavior. Not when there was an en-US default system locale:
but when the default system locale was ko-KR (Korean):
The kicker, though? If you changed the script to Greek:
then both English:
and Korean:
look fine.
So much for the LOGFONT lfCharSet member not being relevant to Unicode text rendering, huh? :-)
Though the original people were assuming it was to do with font linking, that would not be the case since Century is not in any GDI font link chain. It can't be due to Uniscribe font fallback since the text is not (for the most part) complex. It is not even font substitution, another lingering legacy thing I have discussed before, since the font is not on the list.
I believe it was Arthur C. Clarke who said that Any sufficiently advanced technology is indistinguishable from magic.
Though by the time you go through all of the thesaurus words for "ways to change what the font is" that make even smart people like Andrew West uncomfortable (as he mentioned here), it may not seem so much like magic as maelstrom, if you know what I mean.
Somehow, I think the title of this post covers the situation better here. :-)
This latest behavior has been attributed to font association, the feature I mentioned in this post. Though that is still so ill-defined that one may as well give in to the confusion and just say it works by magic.
I will see what I can do to try to provide a better definition than that -- one that will unconfuse things....
This post brought to you by উ (U+0989, a.k.a. BENGALI LETTER U)
Woo hoo! They have just announced the Vancouver Development Center, as Jenna posted on JobsBlog.
This rocks, for all the reasons Jenna has indicated. There have been far too many blockades to excellent candidates from around the world with all of the pain in getting H1B Visa stuff handled, and this is a great alternative for groups in Redmond to consider.
You can see the presspass announcement here.
Very cool!
This post brought to you by ᑕ (U+1455, a.k.a. CANADIAN SYLLABICS TA)
The other day when I wrote We've got a style of glyphs, yes we do; we've got a style of glyphs, how 'bout you?, regular reader Mihai commented:
...Character Map is not consistent.
Select "Angsana New" (or Arial, or "Lucida Sans" or whatever) and using "Character Set: Unicode" you will notice that only the glyphs that exist in the font are shown.
So I guess the expectancy is that Character Map does not do any font fallback/linking/substitution.
This is an excellent point; Character Map is really a tool that is built for the display of the fonts, not for the display of other font technologies.
Though as I think about it, wouldn't a Character Map Plus tool that did all of the fallback/linking/substitution and showed you what you could expect to actually GET if you asked for a given font be a really cool idea?
Or maybe it could be an additional checkbox that would expand the view to this much wider one.
I imagine that I would use that one more often than I use the one that is there, to tell the truth.
Anyone else think this would be worth thinking about? :-)
This post brought to you by ઢ (U+0aa2, a.k.a. GUJARATI LETTER DDHA)
Bindesh's question was:
Hi I am trying to print out the value from a reg key and have the following code:
HKEY hkey;
LONG returnStatus;
DWORD dwRegType = REG_SZ;
DWORD dwRegSize = 255;
char cRegVal[255];
char regKeyPath[] = "";

returnStatus = RegOpenKeyEx(HKEY_LOCAL_MACHINE, TEXT("SOFTWARE\\Microsoft\\Netmon3"), 0, KEY_READ, &hkey);
if (returnStatus == ERROR_SUCCESS) {
    returnStatus = RegQueryValueEx(hkey, TEXT("InstallDir"), NULL, &dwRegType, (LPBYTE)&cRegVal, &dwRegSize);
    if (returnStatus == ERROR_SUCCESS) {
        cout << cRegVal;
    } else {
        cout << "Error " << returnStatus << endl;
    }
} else {
    cout << "did not get the key" << endl;
}
The issue is that the output from cRegVal(highlighted one) just brings the first character (in this case, just the alphabet ‘c’) from the array and when I see the value stored in the array(putting a breakpoint) I find that after each character there is a null kept that basically marks the character as a complete string and hence I am unable to print the entire value.
Luckily, Doron Holan (a fellow Technical Lead, at least for the moment!) pointed out the problems here quickly:
Are you compiling your application as Unicode? You are mixing character size independent macros (TEXT) with hard coded types (char) instead of size independent types (TCHAR). Furthermore, cout is ANSI, to output in unicode you need to use wcout.
d
Interestingly, I answered the same question that was forwarded to me earlier (though not seen by Bindesh until after he had asked elsewhere) as follows:
The NULL values are likely due to it being Unicode text, and for that you need to use wcout, not cout.
Though looking at the code you will also want to fix mixed use of TEXT macros and hardcoded data types (if nothing else use WCHAR/wchar_t, but consider getting rid of TEXT() and replacing with L"" type strings, etc.).
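The symptom Bindesh describes can be reproduced without touching the registry at all. What follows is only a portable sketch under stated assumptions: the byte array is a made-up stand-in for what a Unicode build would place in the buffer, and the names kRegBytes, NarrowLength and DecodeUtf16Le are hypothetical helpers, not anything from the Win32 API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the buffer a Unicode registry read fills in:
// the text "C:\" as UTF-16LE bytes, followed by the 16-bit terminator.
static const unsigned char kRegBytes[] = {
    'C', 0x00, ':', 0x00, '\\', 0x00, 0x00, 0x00
};

// Length as seen by narrow-string code (for example, cout given a char*):
// counting stops at the first zero byte, which is the high byte of 'C'.
std::size_t NarrowLength(const unsigned char* bytes) {
    return std::strlen(reinterpret_cast<const char*>(bytes));
}

// Reassemble the same bytes into 16-bit code units, stopping at the
// 16-bit terminator or the end of the buffer, whichever comes first.
std::u16string DecodeUtf16Le(const unsigned char* bytes, std::size_t cb) {
    std::u16string text;
    for (std::size_t i = 0; i + 1 < cb; i += 2) {
        char16_t unit = static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8));
        if (unit == 0) {
            break;
        }
        text.push_back(unit);
    }
    return text;
}
```

Read as narrow text, the buffer is one character long; read as 16-bit units, the whole string is there. That is why both the buffer type and the output stream have to change together.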
Now the way that Doron presented the answer was very interesting, and it is the type of thing I want to get better at myself.
The code clearly had two problems -- one of which was causing the reported problem, and the other of which led to the problem and might lead to the same problem happening again (in fact, had the code only been switched from cout to wcout, I think it would have also failed with the new error of a datatype mismatch).
But by starting with the actual solution to the immediate problem rather than starting with the underlying problem, I might have unintentionally caused the other issues to not be considered due to the general situation in many cases of developers "coding 'til it works" rather than "coding 'til it's right".
Now in this case it wouldn't matter since both problems had to be fixed, at least enough for the wcout call to succeed. But sometimes that won't be the case, and a person could get by until the next bug that came up.
The benefit is in pointing out the architectural issues first (the char variable that led to the cout call) rather than treating it like a side point. It makes it more likely that a person will address all of the issues at the same time....
Now I am probably reading way too much into this short example and attributing way too much to Doron, though he is actually a good person to pay attention to, anyway.
So no harm done. But it did remind me of this particular tendency I have in question answering which ironically enough can be appreciated by the asker even if it would be more helpful to end with the answer rather than begin with it. :-)
This post brought to you by d (U+0064, a.k.a. LATIN SMALL LETTER D)
Just the other day, developer George was asking me about that Reversing sort keys post I wrote way back when.
His main interest was not in the functionality, but in the name of the function! :-)
He couldn't decide whether the name of the function would be UnLCMapString or LCUnMapString....
An interesting problem is positional morphology, no?
Well, actually, no!
The LCMapString function has a simple job -- it does locale-sensitive mapping of a string. And in almost all cases, those mappings have reverse mappings, which means that A(B(A(<string>))) == A(<string>), where B is the mapping that reverses A.
You can see this with
And the only mapping that has no reversing operation is LCMAP_SORTKEY.
Thus, if the functionality to reverse sort keys ever did exist, it would almost certainly be another flag for LCMapString to use, not a whole new function (which neatly sidesteps the positional morphology question by rejecting the premise that either name makes sense!). :-)
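To make the A(B(A(<string>))) == A(<string>) identity concrete, here is a small sketch that uses plain ASCII case mapping as a stand-in for a pair of mappings that reverse one another. It does not call LCMapString, it is not locale-sensitive, and MapLower, MapUpper and RoundTrips are hypothetical names chosen for illustration only.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// ASCII-only stand-in for mapping A (lowercase).
std::string MapLower(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// ASCII-only stand-in for mapping B (uppercase), which reverses A.
std::string MapUpper(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}

// Checks the identity A(B(A(s))) == A(s) for one input string.
bool RoundTrips(const std::string& s) {
    return MapLower(MapUpper(MapLower(s))) == MapLower(s);
}
```

A sort key has no such partner mapping to pair it with, which is the sense in which it is the odd one out.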
But let's pretend that is not the case, and that one did have to choose between two different function names for reversing operations. And you were looking at either UnLCMapString or LCUnMapString.
What would the best name have been?
Now keep in mind this is not the opinion of a trained linguist, but I'll give it a shot....
Given that the name LCMapString really exists for the purpose of locale-sensitive mapping of a string, if one were positing a function name for the reverse operation (locale-sensitive unmapping of a string), it seems like the best name choice would in fact be LCUnMapString.
Of course this implies that the process for creating the names of Win32 API functions is either consistent or intuitive, a shaky assumption at best. But for a function that probably shouldn't ever exist even if the functionality did (which it doesn't), there is no harm in being idealistic about the process, right? :-)
This post brought to you by ᴙ (U+1d19, a.k.a. LATIN LETTER SMALL CAPITAL REVERSED R)
Obviously a follow-on to TTC indexes, the hard way..., this post provides the code that Sergey Malkin put together to work with .TTC files, and more importantly with the individual fonts thereof.
And I think I'll save some of his helpful byte-swapping functions, too -- for future forays in this area like that thing I did with the names.... :-)
Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <windows.h>

// Font files store multi-byte values in big-endian order.
USHORT ReadUshort(BYTE* p) {
    return (USHORT)(((USHORT)p[0] << 8) + (USHORT)p[1]);
}

DWORD ReadDword(BYTE* p) {
    return ((DWORD)p[0] << 24) + ((DWORD)p[1] << 16) + ((DWORD)p[2] << 8) + (DWORD)p[3];
}

// Table tags are passed to GetFontData with the first tag byte in the low-order byte.
DWORD ReadTag(BYTE* p) {
    return ((DWORD)p[3] << 24) + ((DWORD)p[2] << 16) + ((DWORD)p[1] << 8) + (DWORD)p[0];
}

void WriteDword(BYTE* p, DWORD dw) {
    p[0] = (BYTE)((dw >> 24) & 0xFF);
    p[1] = (BYTE)((dw >> 16) & 0xFF);
    p[2] = (BYTE)((dw >> 8) & 0xFF);
    p[3] = (BYTE)(dw & 0xFF);
}

DWORD RoundUpToDword(DWORD val) {
    return (val + 3) & ~3u;
}

#define TTC_FILE 0x66637474  // 'ttcf'

const DWORD SizeOfFixedHeader = 12;
const DWORD OffsetOfTableCount = 4;
const DWORD SizeOfTableEntry = 16;
const DWORD OffsetOfTableTag = 0;
const DWORD OffsetOfTableChecksum = 4;
const DWORD OffsetOfTableOffset = 8;
const DWORD OffsetOfTableLength = 12;

HRESULT ExtractFontDataFromTTC(
    HDC hdc,
    __out DWORD* pcbFontDataSize,
    __deref_out_bcount(*pcbFontDataSize) void** ppvFontData) {

    *ppvFontData = NULL;
    *pcbFontDataSize = 0;

    // Check if font is really in ttc.
    // GetFontData reports failure only through GDI_ERROR, so return E_FAIL.
    if (GetFontData(hdc, TTC_FILE, 0, NULL, 0) == GDI_ERROR) {
        return E_FAIL;
    }

    // 1. Read number of tables in the font (ushort value at offset 4)
    USHORT nTables;
    BYTE UshortBuf[2];
    if (GetFontData(hdc, 0, OffsetOfTableCount, UshortBuf, 2) == GDI_ERROR) {
        return E_FAIL;
    }
    nTables = ReadUshort(UshortBuf);

    // 2. Calculate memory needed for the whole font header and read it into buffer
    DWORD cbHeaderSize = SizeOfFixedHeader + nTables * SizeOfTableEntry;
    BYTE* pbFontHeader = (BYTE*)malloc(cbHeaderSize);
    if (!pbFontHeader) {
        return E_OUTOFMEMORY;
    }
    if (GetFontData(hdc, 0, 0, pbFontHeader, cbHeaderSize) == GDI_ERROR) {
        free(pbFontHeader);
        return E_FAIL;
    }

    // 3. Go through tables and calculate total font size.
    //    Don't forget that tables should be padded to 4-byte
    //    boundaries, so length should be rounded up to dword.
    DWORD cbFontSize = cbHeaderSize;
    for (int i = 0; i < nTables; i++) {
        DWORD cbTableLength = ReadDword(pbFontHeader + SizeOfFixedHeader + i * SizeOfTableEntry + OffsetOfTableLength);
        if (i < nTables - 1) {
            cbFontSize += RoundUpToDword(cbTableLength);
        } else {
            cbFontSize += cbTableLength;
        }
    }

    // 4. Copying header into target buffer. Offsets are incorrect,
    //    we will patch them with correct values while copying data.
    BYTE* pbFontData = (BYTE*)malloc(cbFontSize);
    if (!pbFontData) {
        free(pbFontHeader);
        return E_OUTOFMEMORY;
    }
    memcpy(pbFontData, pbFontHeader, cbHeaderSize);

    // 5. Get table data from GDI, write it into known place
    //    inside target buffer and fix offset value.
    DWORD dwRunningOffset = cbHeaderSize;
    for (int i = 0; i < nTables; i++) {
        BYTE* pEntryData = pbFontHeader + SizeOfFixedHeader + i * SizeOfTableEntry;
        DWORD dwTableTag = ReadTag(pEntryData + OffsetOfTableTag);
        DWORD cbTableLength = ReadDword(pEntryData + OffsetOfTableLength);

        // Write new offset for this table.
        WriteDword(pbFontData + SizeOfFixedHeader + i * SizeOfTableEntry + OffsetOfTableOffset, dwRunningOffset);

        // Get font data from GDI and place it into target buffer
        if (GetFontData(hdc, dwTableTag, 0, pbFontData + dwRunningOffset, cbTableLength) == GDI_ERROR) {
            free(pbFontHeader);
            free(pbFontData);  // the original leaked this buffer on failure
            return E_FAIL;
        }
        dwRunningOffset += cbTableLength;

        // Pad tables (except last) with zeros
        if (i < nTables - 1) {
            while (dwRunningOffset % 4 != 0) {
                pbFontData[dwRunningOffset] = 0;
                ++dwRunningOffset;
            }
        }
    }

    free(pbFontHeader);
    *ppvFontData = pbFontData;
    *pcbFontDataSize = cbFontSize;
    return S_OK;
}

int main() {
    HDC hdc = CreateCompatibleDC(0);
    LOGFONTW lf = { 12, 0, 0, 0, FW_NORMAL, false, false, false, ANSI_CHARSET, OUT_DEFAULT_PRECIS, CLIP_DEFAULT_PRECIS, DEFAULT_QUALITY, DEFAULT_PITCH | FF_DONTCARE };
    wcscpy_s(lf.lfFaceName, sizeof(lf.lfFaceName) / sizeof(WCHAR), L"MS Gothic");
    HFONT font = CreateFontIndirectW(&lf);
    HFONT oldfont = (HFONT)SelectObject(hdc, font);

    void* pvFontData;
    DWORD dwFontDataSize;
    HRESULT hr = ExtractFontDataFromTTC(hdc, &dwFontDataSize, &pvFontData);

    // Restore the DC and release GDI objects whether or not extraction worked.
    SelectObject(hdc, oldfont);
    DeleteObject(font);
    DeleteDC(hdc);
    if (FAILED(hr)) {
        return -1;
    }

    printf("Font extracted: %lu bytes in size", dwFontDataSize);
    FILE* file = fopen("font.ttf", "wb");
    if (file) {
        fwrite(pvFontData, sizeof(BYTE), dwFontDataSize, file);
        fclose(file);
    }
    free(pvFontData);
    return 0;
}
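The size calculation in steps 1 through 3 can be tried out without GDI or a real font. The sketch below is an assumption-laden illustration, not part of Sergey's code: it works on an in-memory header, reuses the same layout constants (12-byte fixed header, 16-byte table entries, length at offset 12), and MakeTestHeader is a hypothetical helper that fabricates a header containing only a table count and table lengths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const std::uint32_t kFixedHeader = 12;   // fixed part of the font header
const std::uint32_t kCountOffset = 4;    // big-endian table count lives here
const std::uint32_t kTableEntry = 16;    // size of one table directory entry
const std::uint32_t kLengthOffset = 12;  // table length within an entry

std::uint16_t ReadU16BE(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

std::uint32_t ReadU32BE(const std::uint8_t* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

std::uint32_t RoundUpToDword(std::uint32_t v) {
    return (v + 3) & ~static_cast<std::uint32_t>(3);
}

// Header size plus every table length, padding all tables except the last.
std::uint32_t StandaloneFontSize(const std::vector<std::uint8_t>& header) {
    std::uint16_t nTables = ReadU16BE(header.data() + kCountOffset);
    std::uint32_t total = kFixedHeader + nTables * kTableEntry;
    for (std::uint16_t i = 0; i < nTables; ++i) {
        std::uint32_t len = ReadU32BE(
            header.data() + kFixedHeader + i * kTableEntry + kLengthOffset);
        total += (i + 1 < nTables) ? RoundUpToDword(len) : len;
    }
    return total;
}

// Hypothetical helper: builds a header holding only the count and lengths.
std::vector<std::uint8_t> MakeTestHeader(const std::vector<std::uint32_t>& lengths) {
    std::vector<std::uint8_t> h(kFixedHeader + kTableEntry * lengths.size(), 0);
    h[kCountOffset] = static_cast<std::uint8_t>(lengths.size() >> 8);
    h[kCountOffset + 1] = static_cast<std::uint8_t>(lengths.size() & 0xFF);
    for (std::size_t i = 0; i < lengths.size(); ++i) {
        std::uint8_t* p = h.data() + kFixedHeader + kTableEntry * i + kLengthOffset;
        p[0] = static_cast<std::uint8_t>(lengths[i] >> 24);
        p[1] = static_cast<std::uint8_t>(lengths[i] >> 16);
        p[2] = static_cast<std::uint8_t>(lengths[i] >> 8);
        p[3] = static_cast<std::uint8_t>(lengths[i]);
    }
    return h;
}
```

With two tables of 5 and 10 bytes, the header takes 12 + 2 * 16 = 44 bytes, the first table is padded from 5 to 8, and the last is left at 10, for 62 bytes in total.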
This post brought to you by ख़ (U+0959, a.k.a. DEVANAGARI LETTER KHHA)
Kristen brought up a very good question yesterday that one of her testers asked her:
This happened on WinXP and Vista
Set locale to Chinese (Taiwan)
On XP - install the needed CHT pkg for surrogate support
create a new Groove notepad record
enter Chinese characters and then surrogate pairs D841DD8C or D842DF63 on WinXP
save the record
print the record
Actual results: surrogate pair records are 'fainter' and thinner than the regular Chinese text
Expected result: would assume the characters should print the same.
Actually, there is a bit of a misunderstanding about what is expected here....
To describe what I mean, I will start with the CJK ideographs in the Basic Multilingual Plane (BMP)
You may remember the GDI font link chain I described back in Font substitution and linking #3. I did not give the whole list, but I showed the order of the CJK fonts themselves, and how it varied with different default system locale settings:
Japanese will default to using MS UI Gothic (fallback to PMingLiU, then SimSun, then Gulim)
Korean will default to using Gulim (fallback to PMingLiU, then MS UI Gothic, then SimSun)
Simplified Chinese will default to using SimSun (fallback to PMingLiU, then MS UI Gothic, then Batang)
Traditional Chinese will default to using PMingLiU (fallback to SimSun, then MS Mincho, then Batang)
Now let's take the situation where the default system locale is Japanese, and the first two fonts are MS UI Gothic and PMingLiU. Looking at the two fonts in Character Map shows that the first font clearly has a different set of ideographs in that first block, if for no other reason than that the two sets of ideographs in the visible block are not identical and end on two different ideographs:
Now looking at the two fonts, they clearly show different styles for the ideographs, which means that any time you have a Japanese system locale, use MS UI Gothic, and have text that is not contained in that font, the text will look like it has different styles.
And this is a problem that also exists in differences between the different fonts that contain CJK ideographs from Extension B as opposed to the BMP, where differences in styles between fonts can exist just as easily. And you can end up with those different styles in mixed text.
Note that in the Extension B case, it is not GDI font linking that does the work; it is Uniscribe font fallback. But the end result is the same any time the two fonts do not have identical styles....
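For reference, the two surrogate pairs in the tester's report can be turned into code points with the standard UTF-16 arithmetic. This is a small portable sketch; CombineSurrogates and the two range checks are hypothetical helper names, and returning 0 for an invalid pair is just a convention chosen for the example.

```cpp
#include <cassert>
#include <cstdint>

bool IsHighSurrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool IsLowSurrogate(std::uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a UTF-16 surrogate pair into a supplementary code point.
// Returns 0 if the two code units do not form a valid pair.
std::uint32_t CombineSurrogates(std::uint16_t high, std::uint16_t low) {
    if (!IsHighSurrogate(high) || !IsLowSurrogate(low)) {
        return 0;
    }
    return 0x10000u +
           ((static_cast<std::uint32_t>(high) - 0xD800u) << 10) +
           (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

D841 DD8C works out to U+2058C and D842 DF63 to U+20B63, both of which sit in CJK Extension B rather than in the BMP.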
Now note that this will not always be the case, and sometimes the fonts may be intentionally designed to have like styles. But when they are not, it isn't always going to be unexpected! :-)
This post brought to you by 𠖌 (U+2058c, the CJK Extension B ideograph represented by U+d841 U+dd8c in UTF-16)
(Does anyone not think that the word "Sansless" is freaking hilarious, like The La Trattoria in Mickey Blue Eyes?!?)
Ok, font joke of the day, which was constructed by yours truly (Michael Kaplan) and Carolyn Parsons:
Q - Why did Times New Roman go out with Courier New?
A - He wanted to be more Open and was dating against Type.
Q - Why did Times New Roman eventually break up with Courier New?
A - She was way too Fixed in her ways, and really had no sense of Proportion.
{Insert groan here}
Unsurprisingly, the Fall 2007 Fox Sitcom (Dating Against Type, the endearing comedy of a bunch of young fonts learning about life and love and smooth clean lines and parents suggesting they practice "sans" sex) was cancelled when studio execs heard rumors about this kind of humor being the cornerstone of the series.
Maybe if the pitch had been clips like this they would have gone for it:
Or the other version:
On the other hand, this is Fox we're talking about....
I wonder whether either of these two videos can legitimately be considered font rage? :-)
This post brought to you by ꃕ (U+a0d5, a.k.a. YI SYLLABLE FOX)
Ben asked me the other day via email:
This isn’t something I’m blocked on, but if you’re curious (I am!) –
I’m wondering about the expected behavior of CompareInfo.IndexOf. I found that when searching for a Kannada string (“ಕನ್ನಡ”) I match versus a longer version that ends with what appears to be a non-spacing mark: “ಕನ್ನಡಿ” (hex dump below). I can work around that by checking for trailing non-spacing marks at the end of the match.
However, I also experimented with searching for ‘e’ (0x65) in the text ‘e’ + combining acute { 0x65, 0x301 }. In this case, IndexOf returns -1. In both cases, I have a trailing NonSpacingMark, but in only one do I get a match. Any idea what gives?

string text = new string(new char [] { (char)0xc95, (char)0xca8, (char)0xccd, (char)0xca8, (char)0xca1, (char)0xcbf });
string pattern = new string(new char [] { (char)0xc95, (char)0xca8, (char)0xccd, (char)0xca8, (char)0xca1 });
CompareInfo compareInfo = CompareInfo.GetCompareInfo("kn-IN");
int index = compareInfo.IndexOf(text, pattern, 0); // returns 0

text = new string(new char [] { (char)0x65, (char)0x301 });
pattern = new string(new char [] { (char)0x65 });
compareInfo = CompareInfo.GetCompareInfo("en-us");
index = compareInfo.IndexOf(text, pattern, 0); // returns -1

Ben
Well, the second case Ben describes is by design and is similar to the issue I mentioned here and here.
Though this case is more convincing/compelling, since it really is a diacritic on a letter, etc.
In the first case, however, the additional character U+0cbf (a.k.a. ಿ, KANNADA VOWEL SIGN I) is technically general category == Mn (Mark, Nonspacing), but it takes more than that to impact collation -- in this case because the letter has primary weight.
Anyone want to guess what that reason might be? :-)
(I'll give people a chance to respond for this question, and I'll give some answers tomorrow or the next day)
This post brought to you by ಿ (U+0cbf, a.k.a. KANNADA VOWEL SIGN I)
People who have been reading here for a while know that I am pretty down on the way Windows supports calendars. Posts like Calendars on Win32 -- Not all there yet and Calendars on Win32 -- just there for show.... make it clear that the limitations in the way calendars are implemented simply bother me any time I think about them.
And as posts like Calendars.NET -- new platform, new issues show, I don't think the situation really improved all that much under .NET.
The underlying architecture of calendars on both platforms is just really not inspiring.
However, calendars themselves fascinate me.
In fact, were it not for the fact that there are a few other things that capture more of my fascination (e.g. collation and linguistics), I'd actually be trying to push vision documents and architecture proposals for how I see calendars working in the future. It is a very interesting area that deserves the time and attention, generally.
Maybe one day I'll talk about some of my thoughts in this space, too. Just to mix it up a bit!
In any case, when Shelby Eaton (our SDET who owns calendars) set up a two hour meeting for people in my old group to talk with Nachum Dershowitz (co-author of Calendrical Calculations), I knew I had to be there.
(And yes, I had him sign my copy of his book!)
That book, in both of its editions, not only was of tremendous help to me in learning about calendars, but has actually helped lots of other people, as my copies kept getting borrowed over the course of the last few years....
Anyway, the whole conversation was really fascinating, and I suspect it may even be ongoing as people work to determine what happens next with calendars, and how to make for a better international experience in the future for calendars in both managed and unmanaged code.
This post brought to you by 𐂵 (U+100b5, a.k.a. LINEAR B IDEOGRAM B173 MONTH)
It is a known fact that some people hate Comic Sans MS (why else have a website like http://www.bancomicsans.com/ if everyone loved it?).
Though as Mark Liberman pointed out yesterday in Language Log in his post entitled Font Rage, some people are choosing to be pretty extreme.
I happen to like the font -- it is my main font in email in Outlook, and just like last year I still pine for the day that they make Comic Sans Fixed a reality:
This despite the fact that holding my breath waiting would likely prove fatal....
I'll dig up another instance of font rage in a few hours.
This post brought to you by "ආ" (U+0d86, a.k.a. SINHALA LETTER AAYANNA)
(This could potentially be a very exciting new series -- I am optimistic enough about this idea that I suffixed the title with a #1!)
One of the things I like about this blog is that people often point me at things that I might never have known about otherwise....
Like earlier today, when over in the Suggestion Box, Frederic Delhoume asked:
Hi,
I have been unable to find any information related to locale aware memory formatting.
I use StrFormatByteSize but how does it get its formatting information?
I have not found any Win32 API related to this topic.
Thanks,
Frederic.
This function is a fascinating bit of code, in both good and bad ways.
Here is how it works:
So you end up with a string that is partially formatted based on the default user locale and partially formatted based on the default user UI language.
In my opinion the same setting could (perhaps should?) be used throughout.
But even more importantly, a function should take the locale/locales to use and then perhaps a simpler version could pre-fill the information for the Shell usage.
With the current code, any other settings are inaccessible, which does close down a whole bunch of scenarios.
This could be an interesting feature to consider adding to a GetNumberFormat/GetNumberFormatEx extension or new function that could take the info about units to use. I see lots of interesting complications that may not make it practical, but I think given the extensive work behind StrFormatByteSize, there is clearly a scenario worth thinking about....
This post brought to you by ∑ (U+2211, a.k.a. N-ARY SUMMATION)