The Cantonese IME (not for input of characters from Canton, Ohio)

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


Last month I was talking about how Feature ideas don't always turn out to be good ones. And I mentioned how I'd probably talk about other cases in the future.

What can I say besides welcome to the future? :-)

In Vista, from the time when it was just Longhorn, there has been enhanced collation support for all of the CJK locales. The stroke count sorts and Mandarin pronunciation (both Pinyin and Bopomofo) sorts all covered more characters, the Korean Hangul pronunciation sort was enhanced too, and the Japanese locale got a new alternate sort to cover everything in JIS X 0213. Basically a lot of work was done.

But there was one area that was not covered that was really bothering me -- there was no support for a Cantonese sort of any kind.

"But isn't Cantonese," you might ask, "a spoken dialect, not a written one?"

The Wikipedia article Written Cantonese gives a good answer to this question in its introduction:

Written Cantonese refers to the written language used to write colloquial standard Cantonese using Chinese characters.

Cantonese is usually referred to as a spoken variant, and not as a written variant. Spoken vernacular Cantonese is different from standard written Chinese, which is essentially formal Standard Mandarin in written form. Written Chinese spoken word for word in Cantonese sounds overly formal and distant. As a result, the necessity of having a written script which matched the spoken language increased over time. This resulted in the formation of additional Chinese characters to complement the existing characters. Many of these represent phonological sounds not present in Mandarin. A good source for well documented written Cantonese words can be found in the scripts for Cantonese drama and Cantonese opera.

With the advent of the computer and standardization of character sets specifically for Cantonese, many printed materials in predominantly Cantonese spoken areas of the world are written to cater to their population with these written Cantonese characters. As a result, mainstream media such as newspapers and magazines have become progressively less conservative and more colloquial in their dissemination of ideas. Generally speaking, some of the older generation of Cantonese speakers regard this trend as a step "backwards" and away from tradition. This tension between the "old" and "new" is a reflection of a transition that is taking place in the Cantonese speaking population.

And if you look at the major population centers with people who use Cantonese, there are clear efforts to support this development among many of the native speakers (and writers) of Cantonese.

There are some cultural issues that even I was faced with when doing research here that I will discuss further in a follow-up post....

Of course, one of the big problems has been that there are multiple romanizations used to represent the pronunciations, and unfortunately they are often used in the same lists (like phone books in Macau and elsewhere that allow people to simply enter the pronunciation). How can you hope to sort the phone book consistently if the people providing the pronunciations have different ideas of how even identical pronunciations are to be represented?
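To make the phone book problem concrete, here is a minimal Python sketch. The surnames and both sets of spellings are illustrative assumptions (a Jyutping-style and a Yale-style spelling with tone numbers), not data from any real directory. It only shows that the same names land in different places depending on which scheme the person used.

```python
# Hypothetical phone book data: surname -> romanization supplied by the subscriber.
# Both spelling sets are illustrative assumptions, not taken from a real directory.
jyutping_style = {"張": "zoeng1", "陳": "can4", "葉": "jip6"}
yale_style = {"張": "jeung1", "陳": "chan4", "葉": "yip6"}

def sorted_surnames(readings):
    """Return the surnames ordered by the romanized reading supplied for each."""
    ordered = sorted(readings.items(), key=lambda item: item[1])
    return [name for name, _reading in ordered]

order_jyutping = sorted_surnames(jyutping_style)
order_yale = sorted_surnames(yale_style)

# A mixed list: two subscribers with the same surname used different schemes.
mixed = [("張", "zoeng1"), ("張", "jeung1"), ("陳", "can4"), ("葉", "jip6")]
mixed_sorted = sorted(mixed, key=lambda entry: entry[1])

print(order_jyutping)
print(order_yale)
print(mixed_sorted)
```

In the mixed list the two entries for the same surname end up separated by a different surname, which is exactly the inconsistency described above.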

But lots of work has been done to try to help with this issue, for example the Jyutping system produced by the Linguistic Society of Hong Kong (LSHK). And many people have been trying to use it -- for example the government of the Hong Kong SAR's Chinese Language Interface Advisory Committee (CLIAC) has produced the Cantonese Pronunciation List of the Characters for Computers, a huge set of data providing Cantonese "Pinyin-esque" style pronunciations for much of the Hong Kong Supplemental Character Set (HKSCS).

When I first saw that we would have a list of over 30,000 ideographs and their pronunciations, I was excited -- perhaps this data could be used to provide a Cantonese sort for the people in Hong Kong and elsewhere who wanted it?

But unfortunately, while there is much that is good about Jyutping, it has one liability at present, one that it shares with Yale and other romanization systems: there are several romanization systems, and there is not yet one that is ubiquitous.

Another problem is that for the 30,764 unique ideographs given pronunciations in the CLIAC-provided doc, there are fewer than 2,000 unique pronunciations (fewer than 700 if you do not include the tone values).
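Counts like these are easy to reproduce for any pronunciation list. The following is a rough Python sketch, assuming the data has been reduced to a simple character-to-reading mapping with a trailing tone digit. The ten-row sample is a hypothetical stand-in, not the CLIAC data, and the function names are made up for illustration.

```python
import re

# Tiny hypothetical sample in "character -> reading" form (not the CLIAC data).
sample = {
    "詩": "si1", "思": "si1", "史": "si2", "試": "si3", "時": "si4",
    "市": "si5", "事": "si6", "是": "si6", "分": "fan1", "粉": "fan2",
}

def strip_tone(reading):
    """Drop the trailing tone digit from a syllable such as 'si6'."""
    return re.sub(r"[1-9]$", "", reading)

def pronunciation_stats(readings):
    """Return (characters, unique readings with tone, unique readings without tone)."""
    with_tone = set(readings.values())
    without_tone = {strip_tone(reading) for reading in with_tone}
    return len(readings), len(with_tone), len(without_tone)

stats = pronunciation_stats(sample)
print(stats)
```

Even in this toy sample ten characters collapse to two toneless syllables, which hints at how many ties a pronunciation-based sort has to deal with.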

And yet another problem is in the decision about tones -- some number the tones in Cantonese at nine, while others claim that three of these are unimportant distinctions and that there are only six to worry about. So the problem is not just the different romanization systems (which vary enough that place names like Canton and Guangzhou come from the same word); even people who agree on the romanization may differ in their opinion of the tones (with some believing that tones 7, 8, and 9 actually fold into 1, 3, and 6 respectively).
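The folding itself is mechanical, which a few lines of Python can show. This is only a sketch of the 7, 8, 9 to 1, 3, 6 mapping mentioned above, assuming readings are written with a single trailing tone digit. The function name and the sample readings are hypothetical.

```python
# The folding described in the text: tones 7, 8, 9 map onto 1, 3, 6.
TONE_FOLD = {"7": "1", "8": "3", "9": "6"}

def fold_tone(reading):
    """Rewrite a nine-tone reading such as 'sik7' as its six-tone form 'sik1'."""
    if reading and reading[-1] in TONE_FOLD:
        return reading[:-1] + TONE_FOLD[reading[-1]]
    return reading  # tones 1-6 (or no tone digit) pass through unchanged

folded = [fold_tone(r) for r in ["sik7", "sek8", "sik9", "si4"]]
print(folded)
```

Two lists that disagree only on this point can be compared after folding, although that does nothing about the larger disagreement over which romanization to use.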

And the final problem: there is not yet a clear and established standard on how to break ties -- once you decide which Han have the same pronunciation, how do you decide which one comes first?
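To see why the tie-break matters, here is a minimal Python sketch of a pronunciation sort key. The code point tie-break is purely an assumption chosen because it is easy to compute; it is not an established convention, and stroke count or radical order would be equally defensible. The readings and the function name are illustrative.

```python
# Hypothetical readings; two characters share the reading "si6".
readings = {"事": "si6", "是": "si6", "詩": "si1", "分": "fan1"}

def cantonese_sort_key(char):
    """Order by syllable, then tone, then code point as a placeholder tie-break."""
    reading = readings[char]
    syllable, tone = reading[:-1], int(reading[-1])
    # ord(char) is an arbitrary final key (an assumption), used only so that
    # characters with identical readings still get a deterministic order.
    return (syllable, tone, ord(char))

ordered = sorted(readings, key=cantonese_sort_key)
print(ordered)
```

Swap the last element of the key for a stroke count and the two "si6" characters may trade places, which is the whole problem in miniature.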

There was just not enough of a consensus yet to try to push ahead in Windows with providing such a sort, because Microsoft has no interest in dictating language policy; we just want to identify it so that we can represent things the way customers would like them.

But this now brings us to input methods.

Like I said way back in December of 2004, IMEs have it easy. In this case that is true (if for no other reason) because if you identify a rich new source of pronunciations you can simply add them to the IME if you like them. Or you can provide different IMEs using the different systems, too (assuming you have enough data!).
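A rough Python sketch of why an IME has it easier: it only needs a lookup from what was typed to a list of candidates, so extra pronunciations are just extra rows. This is not the TableTextService file format or its behavior; the row layout, the function names, and the choice to also index toneless input are all assumptions for illustration.

```python
from collections import defaultdict

def build_table(rows):
    """Index (reading, character) rows by full reading and by toneless syllable."""
    table = defaultdict(list)
    for reading, char in rows:
        table[reading].append(char)
        toneless = reading.rstrip("123456789")
        if toneless != reading:
            table[toneless].append(char)
    return table

def candidates(table, typed):
    """Return the candidate characters for the typed reading, in row order."""
    return table.get(typed, [])

# Hypothetical rows; a second source is merged by simply appending its rows.
rows = [("si1", "詩"), ("si6", "事"), ("si6", "是")]
more_rows = [("fan1", "分")]
table = build_table(rows + more_rows)

print(candidates(table, "si6"))
print(candidates(table, "si"))
```

Note that no ordering decision is forced here: the candidate list can stay in source order, and the user picks from it.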

Anyway, enough of the backstory, right? Let's get to the IME, like I said I would!

The steps are the same as they were with the Unicode IME. Just grab the file from here (871 KB) or you can grab the zipped version here (144 KB).

1) Copy the text file to \Program Files\Windows NT\TableTextService on your Vista machine (if the "Program Files" on your machine is another language, use that directory, do not create a new one!).

2) Open an elevated command prompt and navigate to that directory.

3) Run the following from that command prompt:

rundll32 TableTextService.dll RegisterProfile TableTextServiceCantonese.txt

4) Say OK to the dialog that comes up verifying you want to install it.

You can now add the Chinese Hong Kong Cantonese IME to the Chinese (Hong Kong S.A.R.) locale by going through the following steps that are illustrated here.

Now, like the Unicode IME, this is a sample, and further, this is a work in progress. There are lots of things I would like to do to tweak settings here, such as how (or whether) the list should be sorted.

(And if I find other huge caches of Cantonese pronunciations in other romanizations I might even see whether they could be productively combined.)

And like I said, in an upcoming post I will talk about many of the cultural issues I ran across while doing the research here -- they are fascinating!

 

This post brought to you by 䕫 (U+2f9b2, an Extension B ideograph in HKSCS with a Jyutping pronunciation of kwai4)

Blog - Comment List
  • "This worked for me:

    Right-click on Command Prompt and choose to run as Administrator.

    On the command prompt, first type:

    cd "\Program Files\Windows NT\TableTextService"

    then you can run that rundll32 command to register the IME.

    If using Vista x64, repeat the same but with "Program Files (x86)" instead of "Program Files"."  

    But the words didn't come out the way I really wanted... can someone upload an improved version of the word selection?

  • 廣東=Kwangtung; 廣州=Canton--Before the introduction of Hanyu Pinyin.

    And now,

    廣東=Guangdong; 廣州=Guangzhou.

    Guangzhou and Canton mean the same: 廣州

    And I guess you can find Canton in most English dictionaries but not Guangzhou.

  • I do not have TableTextService.dll on my computer. What can I do?

    Also, just making sure, can the Yale input method be used?

  • Are you running Vista, Server 2008, Windows 7, or Server 2008 R2?
