Word Breaking Japanese Is Hard

Recently a Japanese user wondered why searching for "あいう" or "あいうえ" would not find a document containing the string "あいうえおかきくけこさしすせそ", even though searching for "あい" or "あいうえお" would find it. Recall that by default search in Vista looks for words beginning with what you typed, so it was reasonable to expect that all four strings would find the document.

The reason is that word breaking in Japanese (and various other languages, especially East Asian languages) is difficult and highly context sensitive. In Western languages we usually have whitespace or punctuation between words, and apart from some quirks and exceptions, breaking a paragraph into words is straightforward and unambiguous. In Japanese, on the other hand, words are frequently not separated by spaces, so word breaking becomes a guessing game. A good lexicon helps, but unless you have a way of verifying that the resulting word-broken sentence makes sense semantically (which is what a Japanese-speaking human would do), there are usually a number of syntactically valid ways of word breaking a sentence, and you have to resort to heuristics to pick the one that is likely to be right. Considering this, our Japanese word breaker is doing a really nice job, but it means that the same characters can be broken differently when you change the characters around them.
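To make the ambiguity concrete, here is a minimal sketch of lexicon-only word breaking. It is not the Vista word breaker: the function name `segmentations` and the toy lexicon are made up for illustration, and an unspaced English string stands in for Japanese text. It simply lists every split in which each piece is a known word.

```python
def segmentations(text, lexicon):
    """Return every way to split text into words that all appear in lexicon."""
    if not text:
        return [[]]  # one way to break the empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and break the remainder in every possible way.
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon; a real lexicon is far larger.
LEXICON = {"god", "is", "now", "here", "nowhere"}

for words in segmentations("godisnowhere", LEXICON):
    print(" - ".join(words))
# prints:
# god - is - now - here
# god - is - nowhere
```

Both results are valid as far as the lexicon is concerned. Choosing between them takes heuristics or an understanding of the surrounding text, which is exactly where context sensitivity comes in.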

So let us look at how various prefixes of "あいうえおかきくけこさしすせそ" are broken:

Text  |  Word breaking
あいうえおかきくけこさしすせそ  |  あいうえお - かき - くけ - こさ - し - す - せ - そ
あい  |  あい
あいう  |  あ - いう
あいうえ  |  あい - うえ
あいうえお  |  あいうえお
あいうえおか  |  あいうえお - か
あいうえおかき  |  あいうえお - かき
あいうえおかきく  |  あいうえお - か - きく
あいうえおかきくけ  |  あいうえお - かき - くけ
あいうえおかきくけこ  |  あいうえお - かき - くけ - こ

You see that as we take longer and longer prefixes, the word breaking approaches that of the full string. Now recall that when you type a query we will word break it and then see if we can find that sequence of words. Or, if you have “Find partial matches” on (which is the default in Vista's search), we will look for a sequence of words where each word begins with one you typed, in order. So in English, if your query was "my foot", we'd find a document containing "mystic footwear", as that is a word beginning with "my" followed by a word beginning with "foot".
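The partial-match rule can be sketched in a few lines. This is only an illustration, not the actual search implementation: the function name `matches` is hypothetical, both the query and the document are assumed to be already word broken, and "followed by" is assumed to mean the words are adjacent.

```python
def matches(query_words, doc_words):
    """True if doc_words contains adjacent words that begin with
    the query words, in the same order (assumed matching rule)."""
    n = len(query_words)
    if n == 0:
        return False
    # Slide a window of n words over the document.
    for start in range(len(doc_words) - n + 1):
        window = doc_words[start:start + n]
        if all(d.startswith(q) for q, d in zip(query_words, window)):
            return True
    return False


print(matches(["my", "foot"], ["mystic", "footwear"]))  # True
print(matches(["my", "foot"], ["footwear", "mystic"]))  # False: wrong order
```

Note that the result depends entirely on how both sides were broken into words, which is harmless for English but matters a great deal for Japanese.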

Now look at the table above. For example, if your query was "あい", then we will look for documents with words that begin with "あい" and we will find the one with "あいうえおかきくけこさしすせそ" as it has a word "あいうえお" in it.

And if your query was "あいうえおか" we will look for a document that has in it a word beginning with "あいうえお" followed by a word beginning with "か". Then too we will find the document with "あいうえおかきくけこさしすせそ" as it has the word "あいうえお" followed by the word "かき" in it.

But if your query was "あいう", we will look for a document with a word beginning with "あ" followed by a word beginning with "いう". This time we will not find the document containing "あいうえおかきくけこさしすせそ": although it contains the word "あいうえお", which begins with "あ", that word is followed by "かき", which doesn't begin with "いう".
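The three cases above, plus the other two queries from the original question, can be replayed with a small sketch. The word breakings are copied from the list of prefixes earlier in this post. The function name `partial_match` is hypothetical, and the matching rule (adjacent words, each beginning with the corresponding query word) is an assumption about how the search behaves, not its actual code.

```python
# Word breakings taken from this post; nothing here calls a real word breaker.
DOC_WORDS = ["あいうえお", "かき", "くけ", "こさ", "し", "す", "せ", "そ"]
QUERY_BREAKS = {
    "あい": ["あい"],
    "あいう": ["あ", "いう"],
    "あいうえ": ["あい", "うえ"],
    "あいうえお": ["あいうえお"],
    "あいうえおか": ["あいうえお", "か"],
}


def partial_match(query_words, doc_words):
    """Assumed rule: some run of adjacent document words begins,
    word for word, with the query words."""
    n = len(query_words)
    return n > 0 and any(
        all(d.startswith(q) for q, d in zip(query_words, doc_words[i:i + n]))
        for i in range(len(doc_words) - n + 1)
    )


results = {q: partial_match(words, DOC_WORDS) for q, words in QUERY_BREAKS.items()}
for query, found in results.items():
    print(query, "found" if found else "not found")
# prints:
# あい found
# あいう not found
# あいうえ not found
# あいうえお found
# あいうえおか found
```

The two misses are exactly the queries whose own word breaking disagrees with the way the same characters were broken inside the document.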

Even though the behaviour is better for real text ("あいうえおかきくけこさしすせそ" is simply the first fifteen Hiragana characters), we obviously wish this didn't happen, and we are looking into ways of improving the behaviour in a future version. But it may please you to know that it is a negative side effect of a positive way of tackling a hard problem: word breaking of Japanese and other East Asian languages.