Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
Well, that's not right. Maybe he would say "I want my space." No, that's not right either.
He would say "I want myspace". That's it!
He found it, he actually has a myspace site, right here.
When someone first pointed it out to me, I was sure it was not him doing the site, himself. Fans do tribute sites all the time like this.
But Spencer Lewis, a man with an inside track on things related to Michael Penn, confirmed it.
I think that is incredibly cool. How many times do famous people actually do their own sites like that?
Even looking at an official site can be a task that is hard to get to. I remember when I was down in L.A. and I told Aimee Mann after a Largo show that I had done the international links on her site (her old Bachelor No. 2 site) she looked at me blankly for a second and then said very warmly "Great, thank you!". I am pretty sure she had not seen the links before....
Separate note -- now that the old site I worked on is defunct, maybe I should post the content (some localized biographies, nothing huge) here. It was an interesting localization project to talk about, anyway. :-)
But feel free to check out Michael Penn's myspace.com site. If you have a myspace account you can even claim to be one of Michael Penn's friends!
Recently someone asked a question via the Contact link:
Hi, I was wondering if you know of a way to get the stroke count of a chinese character? This function seem to be in Word (chinese index), and also in .NET (string sort according to stroke and bopomofo)
But I can't seem to find any API functions exposed to retrieve the stroke count nor bopomofo?
Of course they could have asked in the Suggestion Box and suggested it as a topic, but I suspect they were not considering that at the time. But it is a question I have been itching to answer anyway, so I'll take this as a bit of serendipity. :-)
The quick answer:
You can't.
The fast paced answer that is a bit less of a sprint:
Although the collation tables for the various CJK languages are based on source data that has the various pronunciations and stroke counts, the source data is gone after the weights are assigned.
The long answer:
I'll tell you a secret -- that original source data is not just gone from the product, it is actually gone from this plane of existence, as far as I have been able to gather. No one seems to know where it is anymore, and the person who did a lot of the work is also gone. Access to the source code would only get you those weights. And the collation APIs work quite well at using that data in every single way except getting the original source back....
Maybe it will show up somewhere; people find stuff all the time. And obviously it can be reconstructed by working from the same original standards -- it is based on a known entity.
There is one exception to this: Korean, where Hanja sorting is based on the Hangul pronunciation. In the case of Korean the pronunciation is therefore fairly intrinsic and easy to get to, since the Hangul is right there. The source is built in and fairly easy to reconstruct using the information I mention in my explanation of why NORM_IGNORENONSPACE makes Korean text sort in apparently random order. In this one case, the data is right there.
The same is not true of (for example) Bopomofo, since the actual Bopomofo script is not sorted in with the Han ideographs; doing so would not meet user expectations. Kind of a shame in a way, since it would be cool to have the Bopomofo interlaced as one might do in a Bopomofo dictionary or address book. But it's not what the users expect, which does really rule our plans in such cases. And it is not true of Pinyin, which is of course Latin letters (if they started sorting in the middle of Han text the natural order of the universe would seem fragmented!). It is kind of true for stroke-based counts since obviously one can just count the strokes and work backwards, but that is hardly a function in the API....
Now I actually have been doing a step better than reconstructing that data -- I have been working to try to expand it, for future versions. And of course to make sure it is kept, since it is the secular equivalent of the 'Collation Holy Scripture'. There are not currently plans to expose the source data directly, though it may be something worthy to think about at some point. I'll consider this request as another data point in that decision, when/if something is decided. :-)
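For anyone who wants to try the "reconstruct it from a known entity" route in the meantime, here is a minimal sketch. It assumes the Unicode Unihan database as the stand-in source (its kTotalStrokes field), which is NOT the original data behind the Windows collation tables and may disagree with it for some characters. The function name and the three embedded sample lines are purely illustrative; a real run would read the actual Unihan text file instead.

```python
import io

# Three sample lines in the Unihan "U+XXXX<TAB>field<TAB>value" layout.
# Illustrative only: a real Unihan file has tens of thousands of entries.
SAMPLE_UNIHAN = (
    "# sample data, not the Windows collation source\n"
    "U+4E00\tkTotalStrokes\t1\n"
    "U+4E01\tkTotalStrokes\t2\n"
    "U+4E39\tkTotalStrokes\t4\n"
)

def load_stroke_counts(lines):
    """Build a {character: stroke count} map from Unihan-style lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split("\t", 2)
        if field != "kTotalStrokes":
            continue
        # Some entries list several counts separated by spaces; keep the first.
        counts[chr(int(code_point[2:], 16))] = int(value.split()[0])
    return counts

strokes = load_stroke_counts(io.StringIO(SAMPLE_UNIHAN))
print(strokes["丹"])  # 4
```

A lookup table like this gets you stroke counts, but it says nothing about how any particular collation actually weights the character.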
This post brought to you by "ㄅ" (U+3105, a.k.a. BOPOMOFO LETTER B)
Back in the beginning of May, I posted about Community Server issues that seem to affect this blog.
Looks like they have been busy fixing some bugs!
Bugs #2 and #3 no longer seem to happen -- there appears to be full access via both the month links and the category links, all the way back to my original posts as far back as November of last year.
Bug #1 (searches unable to look past page #1) still happens, though it is slightly different now:
Search http://blogs.msdn.com/michkap/ for a word that is there a lot, like collation. It is now up to 57 hits (but after all, the name of the freaking blog is "Sorting It All Out"; what did you expect?).
Page 1 has this link:
http://blogs.msdn.com/michkap/search.aspx?q=collation&p=1
There are four link pages at the bottom. The links for 2, 3, 4 look like this:
http://blogs.msdn.com/blogsearch.aspx?App=michkap&q=collation&p=1&PageIndex=2
http://blogs.msdn.com/blogsearch.aspx?App=michkap&q=collation&p=1&PageIndex=3
http://blogs.msdn.com/blogsearch.aspx?App=michkap&q=collation&p=1&PageIndex=4
all of which still point content-wise to page 1. If you instead use
http://blogs.msdn.com/michkap/search.aspx?q=collation&p=2
http://blogs.msdn.com/michkap/search.aspx?q=collation&p=3
http://blogs.msdn.com/michkap/search.aspx?q=collation&p=4
then everything works. The PageIndex stuff looks like a no-op.
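If you would rather generate the working links than edit them by hand, something like the following sketch would do it. It only relies on the URL shape quoted above (the `q` and `p` parameters); the function name is made up for illustration, and nothing here is an official Community Server API.

```python
from urllib.parse import urlencode

# Base address of the per-blog search page, as quoted in the post.
BASE = "http://blogs.msdn.com/michkap/search.aspx"

def search_page_url(query, page):
    """Return the search URL that selects a results page via the p parameter."""
    return f"{BASE}?{urlencode({'q': query, 'p': page})}"

for page in range(1, 5):
    print(search_page_url("collation", page))
```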
On the brighter side, the bug with the comment counting that I talked about in Comments work here (though not so you'd really notice or anything) seems to be getting better. Old posts with bad counts did not have their counts spontaneously updated, but new anonymous comments seem to be counted. If it is true, the counts will slowly start to get more accurate.
Someone has been busy fixing stuff. Awesome work, mostly, and that is more than good enough for me....
This is not one of those fun posts where I get to talk about exciting new features. Instead I am going to answer some questions about CMD, the console (kŏn'sōl'), and now I am going to try and console (kən-sōl') the people with questions, since they will probably not care for the answers. :-(
Several hundred posts ago (back in the end of 2004), Per Bergland asked (in the Suggestion Box):
This may have been asked (and answered) before, but I find it such a shame that cmd.exe can't execute a bat/cmd file in unicode (UTF16). Since Notepad doesn't do "OEM", I find myself using the DOS EDIT text editor to fix up national characters such as our Swedish å,ä and ö.
Hey, cmd can *read* unicode and even *write* using the "/U" switch, so why can't it read & execute a file containing Unicode?
You wouldn't happen to know this, would you?
I can't answer this one with authority since I don't own CMD.EXE. In fact, I am not even sure who does these days. But I do know that it is not easy to get major feature work done in this area, in that codebase. The whole point of the Monad project (read more about it in posts here) is to get away from all of the backcompat issues that keep people from wanting to touch the code to make changes. The last time I checked it out, the plan was to support Unicode files, though.
For the legacy case, I have been in the habit of using Word and choosing the code page when saving a file as plain text, as a way to get the files in the right format. I have also tried to lobby the owners of Notepad to consider adding another "Save As..." option for the OEM code page, but I have not gotten much traction on that (or on my other request for that list, the UTF-8 without BOM choice). Though if I had to guess which was more likely to be seen in the future, I would guess that they would be quicker to add features to Notepad than to the console....
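The same "save as plain text in the OEM code page" trick can be scripted. The sketch below is just one way to do it, and it makes assumptions that are not from the question: that the OEM code page is 850 and the ANSI code page is 1252 (typical for Western European systems, but check yours), and a hypothetical helper name.

```python
def to_oem_bytes(text, oem_codepage="cp850"):
    """Encode batch-file text in an OEM code page, using CRLF line endings."""
    normalized = text.replace("\r\n", "\n").replace("\n", "\r\n")
    return normalized.encode(oem_codepage)

script = "@echo off\necho å, ä and ö\n"
oem = to_oem_bytes(script)

# The same letter has different byte values in the two code pages,
# which is why a file saved as "ANSI" shows the wrong characters in CMD.
print("å".encode("cp850"))   # b'\x86'
print("å".encode("cp1252"))  # b'\xe5'
```

Writing `oem` out with `open(path, "wb")` then gives a file that CMD reads with the intended characters, as long as the console really is using the code page you encoded for.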
Then, moving on into January, KJK::Hyperion asked (also in the Suggestion Box):
Console windows support Unicode, but they necessarily have a number of limitations, having to support the OEM charset and being limited to monospace fonts (which, I've seen, rules out composed characters and some special spacing characters). How is this handled internally? especially, how is Japanese handled, with its mixture of half-width and full-width characters? and how are valid fonts chosen?
See above for some answers. If you want your own font choice, you can pick any monospace or essentially monospace font and then set one or both of the following registry values:
KEY == HKEY_CURRENT_USER\Console, ValueName == FaceName, Value == <whatever font you like>
KEY == HKEY_CURRENT_USER\Console, ValueName == FontFamily, Value == <50 for decorative, 40 for Script, 30 for Modern, 20 for Swiss, or 10 for Roman>.
Now when I say essentially monospace above, the reason for that is that none of the CJK fonts are true monospaced fonts. They all (even the bitmap fonts) have the halfwidth characters taking up half as much space as the fullwidth ones, though.
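To make the two registry values above a bit more concrete, here is a small sketch that builds the text of a .reg file for them. It assumes that the family numbers listed are the hexadecimal GDI font-family constants (0x10 through 0x50), and the function name and the sample face name are only illustrations. Nothing here touches the registry; it just produces text you could inspect before importing.

```python
# Font family values from the list above, read as hexadecimal (assumption).
FONT_FAMILIES = {
    "roman": 0x10,
    "swiss": 0x20,
    "modern": 0x30,
    "script": 0x40,
    "decorative": 0x50,
}

def console_font_reg(face_name, family="modern"):
    """Return .reg file text that sets FaceName and FontFamily for the console."""
    # Backslashes and quotes must be escaped inside .reg string values.
    escaped = face_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        "[HKEY_CURRENT_USER\\Console]\r\n"
        f'"FaceName"="{escaped}"\r\n'
        f'"FontFamily"=dword:{FONT_FAMILIES[family]:08x}\r\n'
    )

print(console_font_reg("Lucida Console"))
```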
Most recently, Denis Bider asked (first of Larry Osterman, then of me directly):
In our company, we observed the following apparent inconsistency in cmd.exe.
If you execute cmd /?, you get this help text:
/A - Causes the output of internal commands to a pipe or file to be ANSI
/U - Causes the output of internal commands to a pipe or file to be Unicode
But the fact is, the output of cmd /A is not actually ANSI. It is in the OEM code page. For example, if I try cmd /A echo csz > file.txt, and then try to open file.txt in Notepad (which uses ANSI), I get garbage.
Lots of other command line utilities (like those in Cygwin) actually use ANSI. So this is a problem - characters get corrupted across pipe boundaries; files get interpreted in incompatible ways.
From a user's perspective, it seems somewhat logical to expect that if the /A flag description says it will produce ANSI, it should produce ANSI; not OEM.
What do you think? Is this intentional or is it a problem?
Well, since it is behavior that has been around for several versions, I would hesitate to call it a bug. I will run it up the flagpole here, but I assume the "fix" will be to just fix up the text in that help. Which is really all that they could do, since changing the behavior of the flag would break who knows how many scripts (well, if we made the change, we would know -- from all of the people complaining about the behavior change!).
In the meantime though, I can recommend chcp.com, a nice little utility that will either display the active OEM code page in the console (if run with no parameters), or allow you to change that code page. You can look at some documentation on it here. Note that when you run this utility, it reports back the code page as the "Active code page". Not a 100% solution, but as good as the console will really allow.
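The garbage Denis describes is easy to reproduce away from the console. This is a minimal sketch that assumes OEM code page 850 and ANSI code page 1252 (substitute whatever chcp reports on your machine); both helper names are hypothetical.

```python
def as_seen_by_ansi_reader(text, oem="cp850", ansi="cp1252"):
    """Show what OEM-encoded output looks like when it is read as ANSI."""
    return text.encode(oem).decode(ansi, errors="replace")

def recover(garbled, oem="cp850", ansi="cp1252"):
    """Undo the misreading by re-encoding as ANSI and decoding as OEM."""
    return garbled.encode(ansi).decode(oem)

garbled = as_seen_by_ansi_reader("åäö")
print(garbled)           # †„”
print(recover(garbled))  # åäö
```

The round trip only works when every byte happens to be defined in both code pages, which is one more reason to know which code page a file was really written in.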
Did I mention that you may want to take a look at Monad? :-)
This post brought to you by "〷" (U+3037, a.k.a. IDEOGRAPHIC TELEGRAPH LINE FEED SEPARATOR SYMBOL)
Cathy Wissink has threatened to have the slogan in the title of this post made into a bumper sticker. Of course she is usually inspired to think about this after seeing yet another example of heavy metal umlauting (I can't believe this topic has a Wikipedia page!), a phenomenon that Arnold Zwicky mentioned over on the Language Log yesterday....
If she ever gets around to having them made, I'll buy one, take a picture of it, and post it here. :-)
Hugh McLeod has a cartoon-based blog named gapingvoid.com that is incredibly funny. The cartoons are mostly on the backs of business cards, like the hilarious one below (which I would love to see on a T-shirt; if you would too, click on the link and leave a comment to that effect!):
Ok, it is time for one of my periodic delusional episodes (you know, those delusions of linguistic aptitude I have from time to time).
(this post pre-recorded, a little blog experimentation!)
Now there is disambiguation, a word which may have already existed (it did according to a colleague who is in fact a linguist). But it was spontaneously reinvented by a program manager presenting at a developer conference, trying to describe the process by which identifiers in VBA are resolved. And it had been a party that went late into the night on the evening prior, and he was tired. Maybe even a little hungover. He was explaining that if the name has not been bound to anything yet, that what it is meant to refer to is ambiguous. This set the stage for his next words -- that VBA had to look through the references to disambiguate the name.
He thus introduced the word into the cosmic consciousness of VBA developers.
But I was going to talk about identifiers.
When I had to write the managed version of IsNLSDefinedString for Whidbey, what to call it was an interesting question. I suspect that largely on the basis of no one else really caring what it was called, no one objected when I dubbed it IsSortable. I actually had more than one person ask me later if sortable is a real word (to which I of course responded that English is a productive language, yada, yada, yada). Following the longstanding precedent of the Server 2003 IsNLSDefinedString function, unpaired surrogates and private use characters (in addition to weightless strings) will cause both of them to return FALSE. And while several people have asked why (since both do have some kind of sorting weight), people have stuck to their guns on this one -- there is no useful cross-machine usage for either unpaired surrogates or for private use area characters in identifiers like machine names.
It may seem somewhat pretentious, it may even be a little pretentious. It is just not a good idea to use them, and maybe by having a method that calls itself IsSortable, people can be influenced about the idea of using these things in machine names and identifiers in programming languages and such. The former might be possible (Active Directory uses us for collation after all), but the latter is of course a pipedream, since programming languages that allow atrocities like this will not even blink before allowing these "unsortable" characters in identifiers.
But is there something wrong with using IsSortable here? It is not like the naysayers who questioned its validity as a word had a better name they could suggest. And the method is referring to strings being used in collation operations, which do prefer meaningful strings anyway. Maybe IsDefined would have been better, but people seemed reluctant to have too many new concepts added. If people were to ask what the consequences of being undefined were, the answer would be that you could not sort them effectively -- so we'd have to explain what it meant to be sortable, anyway. So the current plan has fewer concepts to explain. :-)
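To give a feel for the kind of check being described, here is a rough Python approximation. It is emphatically not the NLS or Whidbey implementation: it only rejects the two cases named above (surrogates and private use characters) by looking at Unicode general categories, and it knows nothing about sort weights. The function name is invented for the sketch.

```python
import unicodedata

# General categories rejected by this sketch:
#   Cs = surrogate (in a Python str, any surrogate is an unpaired one)
#   Co = private use
REJECTED_CATEGORIES = {"Cs", "Co"}

def is_sortable_sketch(text):
    """Return False if text contains a lone surrogate or a private use character."""
    return all(unicodedata.category(ch) not in REJECTED_CATEGORIES for ch in text)

print(is_sortable_sketch("machine01"))   # True
print(is_sortable_sketch("bad\ue000"))   # False (private use)
print(is_sortable_sketch("bad\ud800"))   # False (lone surrogate)
```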
Now yesterday, I was talking about the TextRenderer class. If one has the job of rendering, I suppose one is a renderer. And if one is rendering text, one is a text renderer. And English is a productive language, yada, yada, yada. But is it a word one would usually use in this context? It seems like it is more common to put the word Render in a method. And obviously when one looks at the methods on the class, it has two actions -- measuring and drawing. It is kind of a stretch to say they both fall under the category of the act of rendering. So what makes this usage seem okay? Or maybe nothing does and people just shake their heads.
Which gets me back to what you could (loosely) call the subject of this post.
The grammar of identifiers is a sparse one, meant to be the consistent application of a limited number of concepts. If I were creating a property on System.String for this, it would just be String.Sortable or String.Defined. But any time they have to be methods (like when they take parameters), they have an Is prefix, like all the char.Is* methods. Maybe calling a class TextRenderer feels weird to some because classes are supposed to be "pure" nouns, and not just the noun forms of verbs. Or maybe it just feels weird since the scope of what the class represents is not fully covered by the name. All as if we are trying to create an actual language that one could use to communicate the concept of the program.
Of course to non-programmers it may appear that programmers are talking like people with developmental disabilities, and some linguist may even balk at the idea of calling a programming language an actual language, but in truth one can communicate some very complex ideas. And every time I write a program in an object oriented language, I am extending the language. Well, maybe I am not unless I am adding something to the BCL, which in my case actually can happen.
But that raises another interesting idea -- when I create methods and properties in a new program, am I creating a dialect? One that you only speak if the appropriate terms are in scope for you? And if that is true, what does it mean to be the author of a much-used library -- are you the programming equivalent of the Académie française?
Of course, as Raymond Chen pointed out, sample code often tries harder to be multilingual, for obvious reasons. :-)
Are we creating language here? Or are my delusions of linguistic aptitude confusing me?
I wonder if anyone has ever studied this before, on the linguistic side.
This post brought to you by "Ӓ" (U+04d2, CYRILLIC CAPITAL LETTER A WITH DIAERESIS)
A few days ago, Mark Liberman was talking about Spanish in Charlotte in 1965 and the way that the pronunciation of Oribe (which to most Spanish speaking people would be oh-REE-beh) became Or-bay mainly to match the phonemes of the people in North Carolina in the 1960's.
Interestingly enough, I was around a similar situation in the distant past.
I was once married to a lady of French Canadian heritage whose maiden name (Gagnon) was of course pronounced ga-yown in French, but for ease of living with neighbors in Hartford, Connecticut the family simply changed to using gag-nun for their interactions with the English-speaking people around them. In perhaps my first interaction with something linguistic, I was fascinated by this and had no problem using the 'real' pronunciation. Of course she did not like the name anyway and had no problem shedding it in the matrimonial process for 'Kaplan'.
Though she did go back to her old name after the marriage was over (she discovered that there are worse things than being a Gagnon, I suppose!).
People actually manage to sometimes screw up the pronunciation of Kaplan. Luckily that is just an occasional problem.
Meanwhile I have sometimes considered changing my name back to the original family name ('Kaplan' is just an Ellis Island name) but I am not sure how people would respond to me as Michael Dragutsky. And then I'd have to worry about whether to fight to get people to pronounce it correctly (dry-goot-ski) or just give up and let people butcher it at will....
First I posted about Whidbey's TextRenderer.
Then I posted about end-user-defined characters.
Two concepts that have nothing in common, right?
Well, actually I am going to juxtapose these two concepts for a moment. :-)
Michael Warning (one of the cool Windows Client developers) had read that TextRenderer post and pointed out another really good reason for the existence of this new class. It boils down to interaction with EUDC!
One of the very interesting usage patterns of customers who use EUDC fonts is the need to dynamically update the font from time to time. Unfortunately, GDI+ locks the font file for the life of the application, which makes dynamic updates impossible.
(As I pointed out yesterday, one of the biggest differences between GDI and GDI+ here is the perf issues with GDI+'s attempts to be stateless. I do find the problem with EUDC to be kind of ironic since it involves GDI+ keeping the state of its font cache.)
Anyway, the TextRenderer does not have this problem. It allows for dynamic updates of EUDC fonts. Yet another reason that Whidbey's international support is cool in ways I never even knew about! :-)
This post brought to you by "傐" (U+5090, yet another CJK ideograph)
EUDC stands for End User Defined Characters. The Platform SDK defines them simply:
End user defined characters (EUDC) are customized characters that users install for viewing and printing documents. They enable users to form names and other words using characters that are not available in standard screen and printer fonts. These characters are available only in Asian-language versions of the system.
I think I will quote a bit more from the Platform SDK; there is not much so I may as well use more of it....
An end-user-defined character is always associated with a double-byte character set (DBCS) and a TrueType font. Applications identify the specified character by using the character's assigned DBCS character value, and the system uses this value to locate shape and style information in a corresponding TrueType font. The shape and style information specifies how the character is drawn on the screen or printed page.
The DBCS character values that can be assigned depend on the specified character set. Each set has at least one range of reserved values for use as end-user-defined characters. The system or applications explicitly define these ranges by setting appropriate values under the EUDCCodeRange registry key. Each character set is identified by a unique code-page number.
To create an end-user-defined character, the user chooses a character value that is within the specified range and adds the shape and style information to the TrueType font in the entry that corresponds to that character value. Users create the shape and style information using an EUDC editor or by purchasing end-user-defined font packages from font vendors. Any DBCS TrueType font can contain end-user-defined characters. The font is called a separate EUDC font if it contains only end-user-defined characters. The font is an integrated EUDC font if it contains standard characters as well as end-user-defined characters.
Separate EUDC fonts are said to be either font-aware or font-unaware. A font-unaware font is designed to be a general purpose font that can be used with fonts of different font styles and of different implementations, such as GDI raster, WIFE, device, and TrueType fonts. A font-aware font is designed for use with a specific TrueType font.
The system default EUDC font is a font-unaware font that the system automatically associates with all DBCS fonts except those TrueType fonts that have explicitly associated font-aware fonts. Applications set the system default EUDC font by setting the value of the SystemDefaultEUDCFont name under the EUDC registry key. Similarly, applications can associate font-aware fonts with corresponding TrueType fonts by specifying a font name and associated font file under the EUDC key. Separate EUDC fonts cannot be associated with integrated EUDC fonts.
The system hides the system default EUDC and font-aware fonts. This means applications cannot enumerate or otherwise examine these fonts using GDI functions. Applications, such as EUDC editors and Control Panel, must use the registry entries to add, modify, and delete EUDC fonts.
End-user-defined characters can also be used in Unicode-enabled applications. The reserved ranges for each character set are mapped to corresponding values in the Unicode private use area (values 0xE000 and higher).
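Since the quoted text says these characters end up in the Unicode private use area, one practical consequence is that an application can at least spot them in a string. The sketch below assumes the Basic Multilingual Plane private use range (U+E000 through U+F8FF); the SDK text only says "0xE000 and higher", so the upper bound and the function name are my additions.

```python
# BMP private use area bounds (upper bound is an assumption, see above).
PUA_START, PUA_END = 0xE000, 0xF8FF

def find_eudc_candidates(text):
    """Return (index, code point label) for each private use character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if PUA_START <= ord(ch) <= PUA_END
    ]

print(find_eudc_candidates("丹\ue000"))  # [(1, 'U+E000')]
```

Whether such a character displays as anything useful still depends entirely on which EUDC font is installed on the machine doing the rendering.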
Of course not a lot is said about the actual usage here, beyond the fact that East Asian versions of Windows allow you to create these custom characters.
But they are hugely important to many users in East Asia who need to define a custom look to an ideograph (or support a particular ideograph at all in some cases -- there are indeed characters that are not yet added to Unicode). The ability to add Han characters via the EUDC Editor is a pretty important feature for those who need it. I'll talk more about end-user-defined characters soon....
This post brought to you by "丹" (U+4e39, a CJK Unified Ideograph)
Over the weekend, TheMuuj mentioned in a comment:
As far as I know, there are new classes in Whidbey for drawing text with GDI (as a result of GDI+'s questionable screen rendering in some cases). Are these based on DrawText?
This is not exactly the reason. There are two basic problems that come into play with GDI+:
The object is the System.Windows.Forms.TextRenderer class, which has two methods that can be used to render text using GDI/Uniscribe rather than GDI+:
This class and these methods (which have several different overloads) allow WinForms to support new languages as the OS support is added. For example, the ELK support in Windows XP SP2 added font and rendering support to the operating system for Bengali and Malayalam, but versions 1.0 and 1.1 of the .NET Framework would not render these scripts properly, even when the right font was being used. However, version 2.0 (Whidbey) will be able to properly support these scripts whenever the OS can support them....
I have not personally experienced the performance issues but have been told by people who have that the support can also be very useful there. I am more of a "language support" guy myself, though. :-)
This post brought to you by "আ" (U+0986, BENGALI LETTER AA) (A letter that was happy to see proper rendering support of Bengali and Assamese conjuncts in managed code using XP SP2 and Whidbey!)
Recently overheard in the microsoft.public.dotnet.internationalization newsgroup:
Hi,
I'm developing a web application, in that I have to display the short date according to the customized date format of client machine's culture.
For example Culture = "fr-CA" and default Short date format = "yyyy-MM-dd", when you customize the shortdate format to "dd.MM.yy". Then the short date should be displayed in the format "dd.MM.yy" to the client.
I'm able to display the date in the custom short date format while executing through the VS2005 IDE. But after deploying the same web application, the date is displayed only in the default short date format, though the short date format is customized.
A commonly asked question, though unfortunately the answer is not really one that will make the person asking happy -- those settings (basically user overrides) are not available on the server. They are retrieved through a call to the Win32 NLS GetLocaleInfo function (for the short date you would pass the LOCALE_SSHORTDATE flag).
The only piece of information that the server gets related to locale is the HTTP_ACCEPT_LANGUAGE (which, as I mentioned in the article GEOID -- The LCIDs maligned little brother...., is currently based on the user locale). Anyway, right now it is based on the setting that you can set your CurrentCulture to, but there is no way to get the overrides....
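To show just how little the server has to work with, here is a rough Python sketch that parses an Accept-Language header value into an ordered list of language tags. The function name is hypothetical and the parsing is simplified (it only looks at the q parameter); the point is that the result is a list of tags like fr-CA and nothing more, with no trace of a customized short date format.

```python
from typing import List


def parse_accept_language(header: str) -> List[str]:
    """Return language tags from an Accept-Language value, most preferred first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        quality = 1.0  # a tag without an explicit q-value counts as quality 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # simplification: drop tags with a malformed q-value
        if quality > 0:
            # Sort by descending quality, then by original position in the header.
            ranked.append((-quality, position, tag))
    ranked.sort()
    return [tag for _, _, tag in ranked]
```

The first tag in the returned list is the natural candidate for a culture name, but any per-user override of that culture's formats stays on the client.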
Of course if you have a client side component you could send the information to the server but that has all kinds of dependencies on what is running on the client side, which ASP.NET tries to avoid when it can.
Another option is to have UI in the application that allows your users to set their preferred format if they would like to do so. Most users probably would not bother (most users do not change the setting in the Regional and Language Options UI, either) but those who would like to change a setting would probably not mind doing so.
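A minimal sketch of that second option, in Python rather than ASP.NET so it stays self-contained: the application stores an optional per-user pattern and falls back to a culture default when the user has not chosen one. The table, the fallback pattern, and the function names are all hypothetical; only the fr-CA default of "yyyy-MM-dd" and the custom "dd.MM.yy" pattern come from the question above, and the tiny formatter understands just the yyyy, yy, MM and dd tokens.

```python
from datetime import date
from typing import Dict, Optional

# Hypothetical per-culture defaults; only the fr-CA entry comes from the question.
DEFAULT_SHORT_DATE: Dict[str, str] = {"fr-CA": "yyyy-MM-dd"}
FALLBACK_PATTERN = "yyyy-MM-dd"  # assumed fallback for cultures not in the table


def format_short_date(value: date, pattern: str) -> str:
    """Format a date using only the yyyy, yy, MM and dd tokens."""
    replacements = [
        ("yyyy", f"{value.year:04d}"),  # must be replaced before "yy"
        ("yy", f"{value.year % 100:02d}"),
        ("MM", f"{value.month:02d}"),
        ("dd", f"{value.day:02d}"),
    ]
    result = pattern
    for token, text in replacements:
        result = result.replace(token, text)
    return result


def short_date_for_user(value: date, culture: str,
                        user_pattern: Optional[str] = None) -> str:
    """Prefer the pattern the user saved in the application, else the culture default."""
    pattern = user_pattern or DEFAULT_SHORT_DATE.get(culture, FALLBACK_PATTERN)
    return format_short_date(value, pattern)
```

With no saved preference, short_date_for_user(date(2005, 3, 7), "fr-CA") gives "2005-03-07"; once the user saves "dd.MM.yy" in the application, the same call with that pattern gives "07.03.05".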
If anyone complains, you can blame either the IETF for not having an HTTP header with the info, or the makers of all the big browsers for not agreeing on some sort of standard for passing the data. :-)
On a side note, in that post about GEOIDs I meant what I said about the fact that many settings probably should be configured to use the GEOID where they are using an LCID today. That PM (Matt Ayers) is right. I tend to razz Matt a lot because we just have that kind of a relationship, but I do like to give him credit when I think he's right. We really ought to be more region based when it makes sense to be so....
This post brought to you by "♃" (U+2643, a.k.a. JUPITER)
The Win32 DrawText function and its more full-featured cousin DrawTextEx have been around for a long time. They both have a simple stated set of purposes:
...draws formatted text in the specified rectangle. It formats the text according to the specified method (expanding tabs, justifying characters, breaking lines, and so forth).
Now I have talked about word breaking in the past, and obviously they are related (where else would you break lines but on valid word breaks?). But the DrawText/DrawTextEx functions are from an earlier time -- a time before complex scripts, or good integration of Unicode character properties, or the real existence of mature Unicode character properties.
But let's take a look at its offerings, via the various flags you can specify that affect the word break behavior:
DT_WORDBREAK - Breaks words. Lines are automatically broken between words if a word extends past the edge of the rectangle specified by the lprc parameter. A carriage return-line feed sequence also breaks the line.
DT_NOFULLWIDTHCHARBREAK - Prevents a line break at a DBCS (double-wide character string), so that the line-breaking rule is equivalent to SBCS strings. For example, this can be used in Korean windows, for more readability of icon labels. This value has no effect unless DT_WORDBREAK is specified.
Huh?
What the first one is trying to say is that by default, the text will just keep going and then when the border is reached it will start the new line and possibly break right in the middle. But if you pass the DT_WORDBREAK flag, then you are saying to make the breaks at the boundaries of words in the text. Which is pretty much what people expect (and what controls like EDIT already do themselves).
The second flag was added after many user complaints about the Windows 95/NT 4.0 behavior that treats the position after each CJK ideograph as a potential word break opportunity. This new flag says to treat CJK the same way everything else is treated -- look for the spaces as the word break opportunities.
Of course you may expect more than just U+0020 to be handled when I say space. But most of the ones you would expect on such a list would not be there.
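The difference between the two behaviors can be simulated with a toy line wrapper. This Python sketch is emphatically not the GDI algorithm: it measures width in characters rather than pixels, treats only U+0020 as a space, recognizes only the U+4E00..U+9FFF block as ideographs, and cuts at the width when no break opportunity fits. The break_after_cjk parameter is a hypothetical stand-in for leaving DT_NOFULLWIDTHCHARBREAK off (True) or turning it on (False).

```python
from typing import List


def is_cjk_ideograph(ch: str) -> bool:
    """Rough check: CJK Unified Ideographs block only (U+4E00..U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def break_opportunities(text: str, break_after_cjk: bool) -> List[int]:
    """Return each index i where a line may be broken just before text[i]."""
    points = []
    for i in range(1, len(text)):
        if text[i - 1] == " ":
            points.append(i)
        elif break_after_cjk and is_cjk_ideograph(text[i - 1]):
            points.append(i)
    return points


def wrap(text: str, width: int, break_after_cjk: bool = True) -> List[str]:
    """Greedy wrap: cut at the last opportunity that fits in width characters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    points = break_opportunities(text, break_after_cjk)
    lines = []
    start = 0
    while len(text) - start > width:
        fitting = [p for p in points if start < p <= start + width]
        # Simplification: with no opportunity in range, cut at the width itself.
        cut = fitting[-1] if fitting else start + width
        lines.append(text[start:cut].rstrip(" "))
        start = cut
    lines.append(text[start:])
    return lines
```

Wrapping "丹丹丹丹 丹丹丹丹" to six characters with break_after_cjk=True splits the second group in the middle, while break_after_cjk=False breaks only at the space and keeps each group of four together.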
Interestingly, all of the following are also looked at as word breaking opportunities in East Asian text:
Obviously the functionality in DrawText and DrawTextEx is not quite up to Uniscribe standards when it comes to complex scripts. But you know how I feel about NLS API behavior changes? Well, this is core GDI behavior, and both they and MS Typography have to worry about even the most minute changes in behavior of their functions once something has shipped. Because you never know who is relying on it. A small change in word break behavior could make the page count of a document double or worse, so even sensible changes can only be made via new flags (or in the case of complex scripts via new functions).
This post brought to you by " " (U+0020, a.k.a. SPACE)
Remember that article I wrote for the March 2005 MSDN Magazine entitled Make the .NET World a Friendlier Place with the Many Faces of the CultureInfo Class? Well, as the title of this post hints, it has been translated into Japanese! :-)
(This post is also an experiment to see if Japanese text will work in Community Server, both in the title and in trackbacks!)
See the article now at 多機能な CultureInfo クラスを利用して .NET の世界を身近なものにする.
(via Stephen Toub)
Remember that article I wrote for the March 2005 MSDN Magazine entitled Make the .NET World a Friendlier Place with the Many Faces of the CultureInfo Class? Well, as the title of this post hints, it has been translated into Russian! :-)
(This post is also an experiment to see if Russian text will work in Community Server, both in the title and in trackbacks!)
See it now at Многоликий класс CultureInfo — .NET-приложения станут дружелюбнее к пользователю. Check out the localized code identifiers and comments -- easy to do with C# where you can save your source files as UTF-8 and have comments and variable names in any language....