Tuesday, November 15, 2005 12:01 AM
Michael S. Kaplan
SIAO to Search engines -- would you please normalize, already?
Suzanne has been riffing on me in relation to Vietnamese and then she shifted over to talk about Google and other languages, so I thought I would riff off of her a bit. :-)
By the way Suzanne -- I did not find your terminology to be inaccurate; it was just different. I was explaining why I was confused!
I am going to take her string bãi biển from that first article and run it through various search engines in various forms, trying to look for some patterns. I am not using Suzanne's test (using Google Image to look for pictures of beaches), since not all of the search engines have a comparable service. It is pretty easy to tell from the text excerpts whether the search has found Vietnamese sites or not....
First, the engines:
- google.com
- search.msn.com
- start.com
- live.com
- altavista.com
- ask.com
- excite.com
- yahoo.com
Second, the strings to test:
- Normalization Form C: bãi biển (0062 00e3 0069 0020 0062 0069 1ec3 006e)
- Intermediate form : bãi biển (0062 0061 0303 0069 0020 0062 0069 00ea 0309 006e)
- " " " w/o accents : bai biên (0062 0061 0069 0020 0062 0069 00ea 006e)
- Normalization Form D: bãi biển (0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e)
- " " " w/o accents : bai bien (0062 0061 0069 0020 0062 0069 0065 006e)
Now if you look at these five strings being tested, #1, #2, and #4 are all canonically equivalent, and thus should give identical results in search engines that conform to Unicode and its principles of canonical equivalence. Strings #3 and #5 are different text (accents have been removed), so there is no such requirement for them.
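For anyone who wants to check the equivalence claim rather than take my word for it, here is a minimal Python sketch. It assumes nothing beyond the standard `unicodedata` module; the `from_hex` helper and the variable names are made up for this illustration.

```python
import unicodedata

def from_hex(codes: str) -> str:
    """Build a string from space-separated hex code points (hypothetical helper)."""
    return "".join(chr(int(c, 16)) for c in codes.split())

# The five test strings, using the code points listed above.
strings = {
    1: from_hex("0062 00e3 0069 0020 0062 0069 1ec3 006e"),                 # Form C
    2: from_hex("0062 0061 0303 0069 0020 0062 0069 00ea 0309 006e"),       # intermediate
    3: from_hex("0062 0061 0069 0020 0062 0069 00ea 006e"),                 # #2 w/o accents
    4: from_hex("0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e"),  # Form D
    5: from_hex("0062 0061 0069 0020 0062 0069 0065 006e"),                 # #4 w/o accents
}

# Normalize every string both ways.
nfc = {n: unicodedata.normalize("NFC", s) for n, s in strings.items()}
nfd = {n: unicodedata.normalize("NFD", s) for n, s in strings.items()}

# Canonically equivalent strings collapse to one form after normalization.
print(nfc[1] == nfc[2] == nfc[4])   # #1, #2, #4 share one NFC form
print(nfd[1] == nfd[2] == nfd[4])   # ...and one NFD form
print(nfc[3] == nfc[1], nfc[5] == nfc[1])  # #3 and #5 remain distinct text
```

An engine that normalized both its index and its queries to either form would, by construction, treat #1, #2, and #4 as the same query.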
I will put the strings in double quotes for all search engines.
And here are the comparison results:
| Engine | #1 bãi biển | #2 bãi biển | #3 bai biên | #4 bãi biển | #5 bai bien |
|---|---|---|---|---|---|
| google | 76,200 | 1,830 | 2,360 | 1,390 | 1,380 |
| msn | 14,702 | 190 | 678 | 0 | 678 |
| start | 14,703 | 191 | 679 | 0 | 679 |
| live | 14,702 | 190 | 678 | 0 | 678 |
| altavista | 59,700 | 59,700 | 509 | 59,700 | 1,040 |
| ask | 98 | 23 | 1 | 631 | 354 |
| excite | 42/18 | 3/3 | 55/0 | 3/0 | 64/64 |
| yahoo | 57,600 | 57,600 | 493 | 57,600 | 1,030 |
Conclusions:
- Excite (using the default settings) does not understand Unicode from UNICEF, and the only reason it had 42 hits from Form C was that it munged that second word into "bi" and linked to inappropriate sites that I would never really be interested in looking at, even if they were in Vietnamese. This is (by the way) why the link is not live. :-)
- Excite (using advanced search, which defaults to all languages) fared a little better, with Suzanne's post high up on the list for #2 and #5.
- Lycos (which I did not put in the table) returned results that were practically identical to ask.com.
- That one link on ask.com and lycos.com for #3 is to Suzanne's post, which I think is pretty funny. :-)
- Both altavista and yahoo appear to be using Unicode normalization, returning identical results for all canonically equivalent forms.
- Google appears to be stripping combining characters out of either its search strings, its indexes, or both.
- Any claim that Google is normalizing appears to be crap -- at least insofar as one considers Unicode normalization. They are doing their own thing rather than the standard. But then they are only Associate members of Unicode, so I guess they aren't in all the way just yet....
- Microsoft (msn, start, and live) has a whole bunch of work to do, and I am having trouble fathoming what precisely they are doing.
- Not enough of the Search community is taking the importance of canonical equivalence seriously, to the detriment of many language communities, including Vietnamese.
- Until that changes, a better keyboard solution for Vietnamese in particular suddenly seems more and more compelling.
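To make the "stripping combining characters" guess about Google a bit more concrete, here is a rough Python sketch of what such a step would do to the test strings, next to what normalizing first would do. This is only my hypothesis written out, not anything Google has documented; `strip_marks` and `search_key` are hypothetical names.

```python
import unicodedata

# Three canonically equivalent spellings of the test string.
form_c = "b\u00e3i bi\u1ec3n"                # #1: fully precomposed
intermediate = "ba\u0303i bi\u00ea\u0309n"   # #2: partly decomposed
form_d = "ba\u0303i bie\u0302\u0309n"        # #4: fully decomposed

def strip_marks(s: str) -> str:
    """Drop combining characters as-is, WITHOUT normalizing first."""
    return "".join(ch for ch in s if unicodedata.combining(ch) == 0)

def search_key(s: str) -> str:
    """What a conforming engine could do instead: normalize to NFC."""
    return unicodedata.normalize("NFC", s)

# Stripping without normalizing sends equivalent strings to different places:
print(strip_marks(form_c) == form_c)                  # #1 is left untouched
print(strip_marks(intermediate) == "bai bi\u00ean")   # #2 turns into #3
print(strip_marks(form_d) == "bai bien")              # #4 turns into #5

# Normalizing first gives all three the same key:
print(search_key(form_c) == search_key(intermediate) == search_key(form_d))
```

That would at least be consistent with Google's nearly identical counts for #4 and #5, though the gap between #2 and #3 suggests something more is going on.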
This post brought to you by "ổ" (U+1ed5, a.k.a. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE)