Suzanne has been riffing on me in relation to Vietnamese and then she shifted over to talk about Google and other languages, so I thought I would riff off of her a bit. :-)

By the way Suzanne -- I did not find your terminology to be inaccurate; it was just different. I was explaining why I was confused!

I am going to take her string bãi biển from that first article and run it through various search engines in various forms, looking for patterns. I am not using Suzanne's test (using Google Image to look for pictures of beaches) since not all of the search engines have a comparable service. It is pretty easy to tell from the text excerpts whether the search has found Vietnamese sites or not....

First, the engines:

  1. google.com
  2. search.msn.com
  3. start.com
  4. live.com
  5. altavista.com
  6. ask.com
  7. excite.com
  8. yahoo.com

Second, the strings to test:

  1. Normalization Form C: bãi biển (0062 00e3      0069 0020 0062 0069 1ec3           006e)
  2. Intermediate form   : bãi biển (0062 0061 0303 0069 0020 0062 0069 00ea 0309      006e)
  3.  " "  " w/o accents : bai biên (0062 0061      0069 0020 0062 0069 00ea           006e)
  4. Normalization Form D: bãi biển (0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e)
  5.  " "  " w/o accents : bai bien (0062 0061      0069 0020 0062 0069 0065           006e)

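Now if you look at these five strings being tested, #1, #2, and #4 are all canonically equivalent and thus should give identical results in search engines that conform to Unicode and its principles of canonical equivalence. Strings #3 and #5 are not equivalent to the others (or to each other), so they can legitimately return different results.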
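For anyone who wants to check the equivalence claim themselves, here is a minimal sketch in Python using the standard `unicodedata` module. The helper names (`from_hex`, `canonically_equivalent`) are my own for illustration; the code points are exactly the ones listed above.

```python
import unicodedata

def from_hex(codes: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(c, 16)) for c in codes.split())

# The five test strings, keyed by their number in the list above.
STRINGS = {
    1: from_hex("0062 00e3 0069 0020 0062 0069 1ec3 006e"),
    2: from_hex("0062 0061 0303 0069 0020 0062 0069 00ea 0309 006e"),
    3: from_hex("0062 0061 0069 0020 0062 0069 00ea 006e"),
    4: from_hex("0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e"),
    5: from_hex("0062 0061 0069 0020 0062 0069 0065 006e"),
}

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Normalizing each string to Form C shows which ones collapse together.
nfc = {n: unicodedata.normalize("NFC", s) for n, s in STRINGS.items()}

for n in sorted(STRINGS):
    same_as_1 = canonically_equivalent(STRINGS[1], STRINGS[n])
    print(f"#{n}: equivalent to #1? {same_as_1}")
```

A search engine that normalized both its index and its queries to the same form would, in principle, see #1, #2, and #4 as one and the same string.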

I will put the strings in double quotes for all search engines.

And here are the comparison results:

Engine      #1          #2          #3          #4          #5
            bãi biển    bãi biển    bai biên    bãi biển    bai bien

google      76,200      1,830       2,360       1,390       1,380
msn         14,702      190         678         0           678
start       14,703      191         679         0           679
live        14,702      190         678         0           678
altavista   59,700      59,700      509         59,700      1,040
ask         98          23          1           631         354
excite      42/18       3/3         55/0        3/0         64/64
yahoo       57,600      57,600      493         57,600      1,030

Conclusions:

  • Excite (using the default settings) cannot tell Unicode from UNICEF, and the only reason it had 42 hits from Form C was that it munged that second word into "bi" and linked to inappropriate sites that I would never really be interested in looking at, even if they were in Vietnamese. This is (by the way) why the link is not live. :-)
  • Excite (using advanced search, which defaults to all languages) fared a little better, with Suzanne's post high up on the list for #2 and #5.
  • Lycos (which I did not put in the table) returned results that were practically identical to ask.com.
  • That one link on ask.com and lycos.com for #3 is to Suzanne's post, which I think is pretty funny. :-)
  • Both altavista and yahoo appear to be using Unicode normalization, returning identical results for all canonically equivalent forms.
  • Google appears to be stripping combining characters out of either its search strings, its indexes, or both.
  • Any claim that Google is normalizing appears to be crap -- at least insofar as one considers Unicode normalization. They are doing their own thing rather than the standard. But then they are only Associate members of Unicode, so I guess they aren't in all the way just yet....
  • Microsoft (msn, start, and live) has a whole bunch of work to do, and I am having trouble fathoming what precisely they are doing.
  • Not enough of the Search community is taking the importance of canonical equivalence seriously, to the detriment of many language communities, including Vietnamese.
  • Until that changes, a better keyboard solution for Vietnamese in particular suddenly seems more and more compelling.
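To illustrate the guess about Google, here is a small Python sketch of what "stripping combining characters" without normalizing first would do to the test strings. This is only a hypothesis that fits the rough pattern in the table (#4 landing near #5, and #1 standing apart); it is not a description of what Google actually does, and `strip_marks` is a name I made up for the example.

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Drop every combining mark (general category M*) and keep the rest.

    Precomposed letters such as U+00E3 are left untouched, because
    nothing is decomposed before the marks are removed.
    """
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("M"))

# The five test strings, written with escapes so the forms stay distinct.
S1 = "b\u00e3i bi\u1ec3n"               # Normalization Form C
S2 = "ba\u0303i bi\u00ea\u0309n"        # intermediate form
S3 = "bai bi\u00ean"                    # intermediate form, marks removed
S4 = "ba\u0303i bie\u0302\u0309n"       # Normalization Form D
S5 = "bai bien"                         # Form D, marks removed

print(strip_marks(S1) == S1)  # nothing to strip in the precomposed form
print(strip_marks(S2) == S3)  # intermediate form collapses to #3
print(strip_marks(S4) == S5)  # Form D collapses to #5
```

Under this assumption the three canonically equivalent strings end up as three different search keys, which is exactly the kind of behavior that normalizing first (to either form) would avoid.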

 

This post brought to you by "ổ" (U+1ed5, a.k.a. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE)