What is a word? It’s basically a question we linguists have to answer when we develop spell-checkers, grammar checkers, when we do automatic dictionary look-up, when we try to interpret (and expand) queries for a search engine, etc… I recently wrote a paper to show that doing word-breaking and tokenization is not as easy as some people assume. A few colleagues convinced me I should post this paper here… so here it is…
Here are the full references:
Fontenelle, Thierry (2005): “Identifying tokens: Is word-breaking so easy?” In Hiligsmann, Ph. / Janssens, G. / Vromans, J. (eds.), Woord voor woord. Zin voor zin. Liber Amicorum voor Siegfried Theissen, Ghent: Koninklijke Academie voor Nederlandse Taal- en Letterkunde, pp.109-115.
Microsoft Speech & Natural Language Group
Computational linguistics is an interdisciplinary field concerned with the processing of language by computers, as is pointed out in the recent and superb Oxford Handbook of Computational Linguistics (Mitkov 2003). Applications of natural language processing (NLP) systems range from machine translation to information retrieval, information extraction, question answering, text summarization, term extraction, automatic indexing, computer-assisted language learning, or grammar and spell-checking. Many of these applications rest upon the availability of large lexical resources (monolingual and/or bilingual electronic dictionaries) comprising all the lexical knowledge that is necessary to parse a text, i.e. to describe its syntactic structure. However, before attempting to do any kind of part-of-speech tagging, semantic tagging or machine translation, it is necessary to find the boundaries of the objects which will be compared, counted, translated, looked up in a dictionary, etc. This process is known as tokenization (Grefenstette 1996, Grefenstette & Tapanainen 1994). Mikheev (2003:202) points out that “tokenization is usually considered as a relatively easy and uninteresting part of text processing for languages like English and other segmented languages”. This paper addresses some of the issues the linguist is confronted with when developing real-life applications whose robustness goes beyond university laboratory prototypes. The aim is not to be exhaustive in the description of the problems, but, rather, to show how early decisions may influence the design of an NLP system and constrain (or open up) the range of linguistic phenomena it can deal with.
Since tokenization is concerned with identifying the boundaries of words (this segmentation into word tokens is also frequently referred to as word-breaking), one of the crucial questions which the linguist has to answer first is related to the list of so-called breaking characters. In other terms, what are the characters which act as separators between tokens? Words are usually defined as sequences of characters between blank spaces (a.k.a. white spaces). Punctuation marks usually also act as breaking characters, which means that a comma, a colon, a semi-colon, a question mark or an exclamation mark should be stripped off the preceding string of characters and is normally not part of the preceding “word”.(1)
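The baseline definition above can be sketched in a few lines of Python. This is only a minimal illustration: the function name naive_tokenize and the short list of breaking punctuation marks are assumptions made for the example, not a description of any production word-breaker.

```python
import re

# Illustrative (assumed) list of punctuation marks treated as breaking characters.
BREAKING_PUNCTUATION = ",:;?!"

# A token is either a run of characters that are neither white space nor
# breaking punctuation, or a single breaking punctuation mark.
TOKEN_RE = re.compile(r"[^\s{0}]+|[{0}]".format(re.escape(BREAKING_PUNCTUATION)))


def naive_tokenize(text):
    """Split on white space and emit breaking punctuation as separate tokens."""
    return TOKEN_RE.findall(text)


print(naive_tokenize("Is word-breaking so easy? Yes, some people assume so!"))
```

Hyphens and apostrophes are deliberately absent from the assumed list, so a string such as word-breaking is emitted as a single token; whether that is the right decision is precisely what has to be examined case by case.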
Mikheev (2003:210) briefly mentions the apostrophe, which often means that a word has been contracted. But for a given language such as English, different tokenization rules must be developed, depending upon the contexts in which the apostrophe can be found. In a verb contraction such as we’re, the apostrophe signals a deleted word and the break should occur just before the apostrophe, which then belongs to the second token (we + ’re). If the apostrophe occurs in a negation, however, the breaking is a bit more complex and takes place one character before the apostrophe (don’t → do + n’t). It is clear that there are other complex cases which should be considered when implementing a robust word-breaker (the apostrophe in rock ’n roll is a case in point). Deciding whether an apostrophe is a breaking character or not can have a huge impact upon the performance of the system since this performance usually (though not exclusively) depends upon the size of the accompanying lexicon, i.e. the number of entries it includes. In the case of the Anglo-Saxon genitive in English (possessive ’s as in John’s car), a decision needs to be taken as to whether the apostrophe is kept attached to the preceding noun during the word-breaking process. If it is, the size of the lexicon may increase dramatically, since virtually any noun can be turned into a possessive.
Apostrophes in French are even more complex. Words starting with a vowel frequently trigger a process called elision, whereby some determiners, conjunctions or pronouns ending with a vowel are attached to the next word if it starts with a vowel and the ending vowel of the determiner/pronoun/conjunction is deleted and replaced with an apostrophe (le + avion → l’avion; me + avait → m’avait; que + ici → qu’ici).(2)
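The English rules just described can be illustrated with the following sketch. The function name split_english_apostrophe and the short list of clitics are assumptions made for the example, and the decision to detach the possessive ’s is only one of the two options discussed above (the one that keeps the lexicon small).

```python
import re

APOSTROPHES = "'’"  # straight and typographic apostrophes

# Negation: the break falls one character before the apostrophe (do + n’t).
NEGATION_RE = re.compile(r"^(\w+)(n[{0}]t)$".format(APOSTROPHES), re.IGNORECASE)

# Other clitics (assumed, partial list): the break falls just before the
# apostrophe, which then belongs to the second token (we + ’re, John + ’s).
CLITIC_RE = re.compile(
    r"^(\w+)([{0}](?:re|ve|ll|s|d|m))$".format(APOSTROPHES), re.IGNORECASE
)


def split_english_apostrophe(word):
    """Return one or two tokens for a white-space-delimited English word."""
    for pattern in (NEGATION_RE, CLITIC_RE):
        match = pattern.match(word)
        if match:
            return [match.group(1), match.group(2)]
    return [word]  # no rule applies: the apostrophe is not a breaking character


for word in ["we’re", "don’t", "John’s", "’n", "O’Brien"]:
    print(word, split_english_apostrophe(word))
```

Strings that match neither pattern, such as the ’n of rock ’n roll, are left untouched here; a robust word-breaker would need further rules for them.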
I would like to examine the implications of decisions concerning the status of the apostrophe in French and to weigh the pros and the cons of such decisions, which often take a binary form. The linguist who writes the specifications (the scope of work) of a tokenizer usually will have to take into account factors such as the frequency or rarity of the linguistic phenomenon the system is supposed to be able to deal with. Potential applications and scenarios also play an important part in this process, as well as considerations of the performance of the system.
Let us start from the assumption that the apostrophe in French is a breaking character. This basic assumption may rest upon the mere frequency of elided forms in any corpus of French texts (l’école, qu’il, s’est, m’a, n’est, c’est, d’un, j’aime, t’offre…). Using a concordancer to examine the context of use of apostrophes reveals that the vast majority of occurrences are elided forms indeed. There are exceptions, however. Any introduction to French linguistics will mention the word aujourd'hui, where the apostrophe is clearly not a breaking character (imagine a dictionary look-up program which would try looking up tokens such as aujourd’ or hui if the tokenization process had not been done properly).
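For readers who have never used a concordancer, the kind of inspection described here can be approximated with a very small script. The function name kwic, the sample sentence and the context width are assumptions made for the illustration; real concordancers work on large corpora and offer sorting and filtering facilities.

```python
import re


def kwic(text, keyword_re, width=5):
    """Return (left context, keyword, right context) triples.

    A word is a keyword when it matches keyword_re; width is the number
    of context words kept on each side.
    """
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        if re.search(keyword_re, word):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines


# Hypothetical sample sentence; every word containing an apostrophe is a keyword.
sample = "Il dit qu’il viendra aujourd’hui à l’école avec O’Brien"
for left, word, right in kwic(sample, r"['’]", width=2):
    print("{:>20}  {:<14}  {}".format(left, word, right))
```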
At first glance, the list of exceptions seems to be very limited. As always when one starts having a look at real data with KWIC(3)-line concordancers, one quickly realizes that the list of exceptions should be expanded to include items such as prud’homme (together with its inflected and related forms prud’hommes, prud’hommal, prud’hommales…). Exceptions also include presqu’île (and presqu’ile, if one applies the 1990 spelling reform), as well as verbs such as entr’égorger. In an information retrieval scenario, we do not want to break main-d’oeuvre or hors-d’oeuvre into multiple tokens (main + d’ + oeuvre) and linguists would certainly agree that we have here one word containing both a hyphen and an apostrophe.
A different kind of exception ought to be taken into account. Any analysis of KWIC lines containing apostrophes in a large corpus will reveal that Irish-sounding names should be dealt with appropriately. Proper names such as O’Brien, O’Hara, O’Connors, O’Neill, etc. abound and it is clear that having to list them explicitly as exceptions in a tokenizer lexicon may not be the most practical solution. The implementation of a context-dependent rule relying upon the concept of regular expressions may be more sensible: the apostrophe is not a breaking character in sequences made up of a preceding capital O and a following capital letter. What is important here is that the apostrophe should be recognized as belonging to the full last name, which needs to be analyzed as a single token. The analysis is pretty much the same for L’Oréal, except that this cannot lead to a generalization where L’ + any string of characters would be considered as one token, of course, since we wish to go on considering the apostrophe as a breaking character in all other cases (L’Eglise, L’Ecole Saint-François…).
Examination of KWIC lines containing apostrophes also reveals that some types of corpora illustrate instances of an interesting phenomenon related to the widespread use of instant messages. Many people exchange such messages and chat with each other, using non-Azerty keyboards, which make the coding of diacritics (grave and acute accents, umlauts…) rather difficult. To remedy this situation, a number of people replace accents with an apostrophe following or preceding the letter which should have been accented(4). The following examples are cases in point:
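The first hypothesis (the apostrophe breaks by default, with a list of exceptions and a context-dependent rule for Irish names) can be sketched as follows. The exception list is deliberately incomplete, and the names break_on_apostrophe, EXCEPTIONS and IRISH_NAME_RE are assumptions made for the example.

```python
import re

APOS = "'’"  # straight and typographic apostrophes

# Assumed, deliberately incomplete list of lexical exceptions (lower-cased).
EXCEPTIONS = {
    "aujourd’hui", "prud’homme", "prud’hommes", "presqu’île",
    "main-d’oeuvre", "hors-d’oeuvre", "l’oréal",
}

# Context-dependent rule: capital O + apostrophe + capital letter (O’Brien).
# [A-Z] only covers unaccented capitals, which is enough for this sketch.
IRISH_NAME_RE = re.compile(r"^O[{0}][A-Z]\w*$".format(APOS))

# Default behaviour: break after every apostrophe, keeping the apostrophe
# attached to the preceding string.
DEFAULT_SPLIT_RE = re.compile(r"[^{0}]*[{0}]|[^{0}]+".format(APOS))


def break_on_apostrophe(word):
    """Tokenize one white-space-delimited word; the apostrophe breaks by default."""
    normalized = word.replace("'", "’").lower()
    if normalized in EXCEPTIONS or IRISH_NAME_RE.match(word):
        return [word]
    return DEFAULT_SPLIT_RE.findall(word)


for word in ["l’école", "aujourd’hui", "O’Neill", "L’Oréal", "L’Eglise", "qualit’e"]:
    print(word, break_on_apostrophe(word))
```

The last example shows the weakness of this hypothesis: a string in which the apostrophe stands for an accent is emitted as two tokens.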
Colloques des math´ ematiciens du Qu´ ebec n o date lieu 1 1974-01-26 Universit´ e du Qu´ebec ` a Trois-Rivi`eres
Institut d’informatique de gestion Universit´e de Berne
Gestion de la qualit´e des donn´ees dans les syst`emes ERP
It goes without saying that this phenomenon is virtually nonexistent in corpora of edited texts (coming from encyclopedias or newspaper articles). Newsgroups and chat forums are rife with such “spelling mistakes”, however, and the word-breaker should be able to deal with these, robustness being an essential characteristic of real-life systems. It is clear that, if one considers the apostrophe as a breaking character by default, the risk is great that strings like Ren’e or qualit’e will be analyzed as two tokens, which will make it very difficult to offer correct suggestions if the word-breaker feeds a spell-checker, for instance. As is shown above, considering that the apostrophe is a breaking character forces the linguist to provide a rather long list of exceptions and to implement a mechanism to deal with the open-ended list of Irish names. While this is feasible, the treatment of spelling mistakes involving apostrophes instead of accented characters is not guaranteed and may force the linguist to explore other avenues.
Let us therefore examine the converse hypothesis, viz. that apostrophes are, by default, an inherent part of the string of characters that surround them. This obviously obviates the need to compile long lists of Irish names in O’. It also obviates the need to list exceptions like prud’homme, aujourd’hui or presqu’île. One of the biggest advantages of this approach is that it also emits the correct token when the apostrophe is used instead of an accent, which means that spell-checking such mistakes is likely to be much easier. Similarly, informal expressions where one vowel is dropped can be accounted for (ma p’tite dame, M’sieur!). Yet, the phenomenon of elision needs to be dealt with and, this time, should be seen as the exception to the rule. On the face of it, accounting for elisions may look like a daunting task indeed. It is not as difficult as one might expect, however. Despite the very high frequencies of occurrences, elisions occur in a very limited number of patterns. The apostrophe should indeed be considered as a breaking character and attached to the preceding string in the following patterns: l’, d’, n’, c’, j’, m’, t’, s’, qu’, puisqu’, lorsqu’, presqu’, jusqu’, ç’.
The formalization of exceptions should be done with care. The words above should be interpreted as elided words if they are preceded by a white space, since we don’t want to break aujourd’hui into 2 tokens because the apostrophe is preceded by “d”. Along the same lines, we still want to keep main-d’oeuvre or hors-d’oeuvre together as one single token, which points to the need to specify more complex patterns (the elided d’ is not a breaking character when occurring immediately after a hyphen).
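The converse hypothesis can be sketched in the same way: the apostrophe is part of the token by default, and only the closed list of elided forms is split off. The names ELISION_RE and tokenize_word are assumptions made for the example. The sketch also adds one assumption that goes beyond the list above: the elided form must be followed by a vowel, h or y, which is what keeps an informal form like M’sieur in one piece.

```python
import re

APOS = "'’"  # straight and typographic apostrophes

# The closed list of elided forms given above (without the apostrophe).
ELIDED = ["puisqu", "lorsqu", "presqu", "jusqu", "qu",
          "l", "d", "n", "c", "j", "m", "t", "s", "ç"]

# The elided form must start the token (i.e. follow white space) and, as an
# added assumption, be followed by a vowel, h or y.
ELISION_RE = re.compile(
    r"^((?:{0})[{1}])(?=[aeiouyhàâäéèêëîïôöùûüœæ])(.+)$".format("|".join(ELIDED), APOS),
    re.IGNORECASE,
)


def tokenize_word(token):
    """Split off a token-initial elided form; otherwise keep the token whole."""
    match = ELISION_RE.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


def tokenize(text):
    """White-space tokenization followed by elision splitting (no punctuation handling)."""
    tokens = []
    for chunk in text.split():
        tokens.extend(tokenize_word(chunk))
    return tokens


print(tokenize("Il m’avait dit qu’ici on parle d’aujourd’hui"))
print(tokenize("O’Brien main-d’oeuvre qualit’e p’tite M’sieur"))
```

Because the pattern is anchored at the beginning of the token, the d’ of aujourd’hui and of main-d’oeuvre is never considered. Under this sketch, however, presqu’île and L’Oréal are split like any other elision, so keeping them whole would presumably still require a lexical exception.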
Word breaking has often been described as a rather uninteresting task. Yet, as I have tried to show here, the pros and the cons of the various alternatives should be weighed very carefully when designing an NLP system since the tokens that are emitted by the word-breaker are going to be used as the building blocks of any subsequent automatic processing. Any attempt to figure out the structure of a text rests upon the system’s ability to find the boundaries of the linguistic objects that are going to be consumed by the other modules of the NLP system (lemmatizers, syntactic parsers…). Whether one wants to translate words in context, look them up in a dictionary or simply count them, one first needs to come up with very clear sets of guidelines which must be turned into implementable algorithms. Decisions concerning the breaking or non-breaking status of some characters need to be made after taking into account the distribution of these characters and a careful examination of their linguistic environment. It is also clear that the status of characters such as apostrophes or hyphens varies as a function of the language. Apostrophes in French are very different from apostrophes in English, as we have seen, and Dutch is undoubtedly also very different, as is evidenced by the fact that apostrophes in that language are frequently found inside elided determiners and pronouns, as in ’t boek, ’k heb, ’n wagen, m’n vrouw, z’n huis, in which cases the apostrophes are not breaking characters. Apostrophes are also found in genitive constructions, but this phenomenon is much more constrained than in English, since it is limited to words ending with long vowels (Papa’s boeken). They are also found as breaking characters in some plurals (paraplu’s).
There are even a few multi-word entries like ’s maandags, ’s avonds, ’s morgens, or place names like ’s Gravenhage; the apostrophe here has a special status since capitalization at the beginning of a sentence does not apply to the word which includes it and the second word is in fact the one that has to be capitalized (’t Is 8 uur. ’s Maandags…).
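The Dutch capitalization rule can be stated in a few lines. This is a sketch under the assumption that a sentence-initial word beginning with an apostrophe is always one of these elided forms; the function name capitalize_dutch_sentence is hypothetical.

```python
# Straight, typographic and (frequently mistyped) opening-quote apostrophes.
LEADING_APOSTROPHES = ("'", "’", "‘")


def capitalize_dutch_sentence(sentence):
    """Capitalize the first word, or the second one if the first starts with an apostrophe."""
    words = sentence.split()
    if not words:
        return sentence
    # Skip a sentence-initial elided form such as ’t or ’s.
    skip_first = words[0].startswith(LEADING_APOSTROPHES) and len(words) > 1
    index = 1 if skip_first else 0
    words[index] = words[index][:1].upper() + words[index][1:]
    return " ".join(words)


print(capitalize_dutch_sentence("’t is 8 uur."))
print(capitalize_dutch_sentence("’s maandags werk ik."))
print(capitalize_dutch_sentence("het boek is nieuw."))
```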
I decided to focus here on the apostrophe in French because it poses a number of problems that do not exist in other languages and because the decisions that have to be made can have far-reaching consequences on the size of the accompanying lexicon which supports the NLP system. Failing to take into account the breaking nature of the apostrophe in cases involving elided clitic pronouns (l’aime, s’aime, t’informe, j’informe…) could lead to increasing the number of inflected forms in a “full-form lexicon” by as many as a million forms. This could obviously have a detrimental effect upon the performance of the system.
The apostrophe is clearly not the only problematic character (in French or in any other language). Hyphens can also cause many headaches. It is clear that, in French, the hyphen in tire-bouchon or in casse-croûte has a very different status than the hyphen in clitics like dit-il or ce livre-là, or even in productive compounds such as les relations patrons-ouvriers or le match Belgique-Allemagne. Solving such issues is crucial if one wishes to access Belgique or Allemagne in an information retrieval perspective. Here again, the performance and quality of the NLP system will depend largely upon the quality of the underlying tokenizer which feeds it and, hence, upon the soundness of the linguistic decisions made upstream.
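One possible way of encoding these three kinds of hyphen is sketched below. It is not a description of an existing system: the toy lexicon, the partial clitic list and the function name split_on_hyphen are all assumptions made for the illustration.

```python
import re

# Assumed toy lexicon of lexicalized compounds that stay in one piece.
COMPOUND_LEXICON = {"tire-bouchon", "casse-croûte"}

# Assumed, partial list of hyphenated clitics, split off together with their hyphen.
CLITIC_RE = re.compile(
    r"^(.+?)(-(?:il|elle|on|ils|elles|je|tu|nous|vous|là|ci))$", re.IGNORECASE
)


def split_on_hyphen(word):
    """Decide the status of the hyphen: lexicon first, then clitics, then default."""
    if word.lower() in COMPOUND_LEXICON:
        return [word]  # lexicalized compound: the hyphen is not a breaking character
    match = CLITIC_RE.match(word)
    if match:
        return [match.group(1), match.group(2)]  # dit + -il, livre + -là
    return word.split("-")  # productive compound: each part becomes accessible


for word in ["tire-bouchon", "dit-il", "livre-là", "Belgique-Allemagne"]:
    print(word, split_on_hyphen(word))
```

With such a rule, Belgique and Allemagne become individually accessible for information retrieval, whereas tire-bouchon is looked up as a whole. The quality of the result obviously depends on the coverage of the lexicon.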
1. Note that even this simple characterization of punctuation marks as standard breaking characters needs some qualification when one considers the recent use of exclamation marks in some trademarks and company names (Yahoo! is a case in point; the product Microsoft Plus! For Windows is another one). This has implications for word-breaking, but also for sentence separation (i.e. the identification of sentence boundaries) since the sentence does not end after this exclamation mark and the following word does not need to be capitalized.
2. Again, the notion of vowel is an oversimplification since some consonants sometimes behave like vowels: an aspirated h behaves like a consonant in le hibou (l’hibou is unacceptable), while a mute h behaves like a vowel and triggers elision in l’horloge or l’homme. The letter y displays a similar behavior (le yéti, but l’ylang-ylang, le yen vs. l’Yonne). The “vowel-like” status of these letters is totally idiosyncratic and is a property that must be captured in the lexicon for individual lexical entries.
3. KWIC = Key-Word in Context (the basic material used by lexicographers, showing how a given keyword can be used; each concordance usually shows the keyword centered in the middle of the screen, with, say, 5 words to the left and 5 words to the right to show in what kind of contexts it can appear).
4. The conversion of PDF or PostScript files to HTML with some tools also creates similar problems, as is shown by the relatively high frequency of strings like ´e on the Web.
Grefenstette, Gregory (1996): “Approximate Linguistics”, in Proceedings of the 4th Conference on Computational Lexicography and Text Research – COMPLEX’96, Budapest, Hungary, September 1996.
Grefenstette, Gregory & Tapanainen, Pasi (1994): “What is a word, what is a sentence? Problems of tokenization”, in Proceedings of the 3rd Conference on Computational Lexicography and Text Research – COMPLEX’94, Budapest, Hungary, 7-10 July 1994, pp.79-87.
Mikheev, Andrei (2003): “Text Segmentation”, in Mitkov (ed.) The Oxford Handbook of Computational Linguistics, OUP, pp.201-218.
Mitkov, Ruslan (2003): The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.