In this post, I would like to share the reasons why I love the French word chef-d’œuvre (= masterpiece). My interest in this word has nothing to do with its meaning. As a program manager working with computational linguists, I find it fascinating because it epitomizes the many difficult decisions one has to make when building natural language processing systems such as word-breakers, tokenizers, and spell-checkers.

One of the most vexing problems in NLP is to decide whether hyphens and apostrophes are breaking characters or not. The identification of word boundaries (tokenization) is indeed essential, as I have argued elsewhere in an attempt to show that word-breaking is not easy. Hyphens frequently separate distinct tokens, as in le match France-Italie (nobody would argue that France-Italie is one word and nobody would expect to find the whole string in a dictionary). In chef-d’œuvre, however, the hyphen is part of the word and everyone will expect the whole string to be granted entry status in a dictionary. It should be considered as one token that has little to do with the word chef (which typically refers to a person), unless one considers the etymology of the word. This means that, in a search scenario, a user would not consider a document containing the words chef and oeuvre used separately as relevant if that user typed the keyword chef-d’œuvre.
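To make the hyphen decision concrete, here is a minimal Python sketch. It assumes a tiny, hypothetical lexicon of hyphenated entries (the LEXICON set and the split_on_hyphens function are illustrative names of mine, not part of any real word-breaker) and treats a hyphen as a breaking character only when the whole string is not a known word.

```python
# Hypothetical mini-lexicon of hyphenated forms that count as single words.
LEXICON = {"chef-d’œuvre", "hors-d’œuvre", "main-d’œuvre"}

def split_on_hyphens(text):
    """Split each whitespace-separated chunk on hyphens,
    unless the whole chunk is listed in the lexicon."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in LEXICON:
            tokens.append(chunk)  # the hyphen is part of the word
        else:
            tokens.extend(part for part in chunk.split("-") if part)
    return tokens

print(split_on_hyphens("le match France-Italie"))  # ['le', 'match', 'France', 'Italie']
print(split_on_hyphens("un chef-d’œuvre"))         # ['un', 'chef-d’œuvre']
```

A real system would need a far larger lexicon and would probably also look at context, but the sketch shows why a single rule for hyphens cannot work.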

The apostrophe in chef-d’œuvre is also interesting. Apostrophes are frequently used in French elided forms when a pronoun or a determiner is followed by a word that starts with a vowel (consider l’école [the school], je l’aime [I love her], j’arrive [I’m coming]). In such cases, it is normal to consider that the string is made up of two distinct tokens (l’école -> l’ + école). The apostrophe in chef-d’œuvre has a different status: it is an integral part of the word, very much like in other French words such as aujourd’hui, prud’homme, and a few others.
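The same distinction can be sketched for apostrophes. The snippet below is only an illustration under my own assumptions: the list of elided forms is not exhaustive, and APOSTROPHE_WORDS is a hypothetical exception list, not the contents of a real dictionary.

```python
import re

# Hypothetical list of words in which the apostrophe is word-internal.
APOSTROPHE_WORDS = {"aujourd’hui", "prud’homme", "chef-d’œuvre"}

# A few common elided forms (illustrative, not exhaustive).
ELISION = re.compile(r"^(l|j|d|m|t|s|n|c|qu)’(.+)$", re.IGNORECASE)

def split_elision(word):
    """Return [clitic, rest] for an elided form, or [word] otherwise."""
    word = word.replace("'", "’")  # accept the typewriter apostrophe as well
    if word.lower() in APOSTROPHE_WORDS:
        return [word]              # the apostrophe belongs to the word
    match = ELISION.match(word)
    if match:
        return [match.group(1) + "’", match.group(2)]
    return [word]

print(split_elision("l’école"))       # ['l’', 'école']
print(split_elision("aujourd’hui"))   # ['aujourd’hui']
print(split_elision("chef-d’œuvre"))  # ['chef-d’œuvre']
```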

The word chef-d’œuvre is also interesting because it includes a special character, œ, known as a “ligature” (two or more letters joined together). Many other words in French include ligatures such as œ or æ (œuf [egg], sœur [sister], cœur [heart], cæcum …) and many other languages use characters which are not traditionally found in English (the German ß or the Spanish ñ are cases in point). This reminds us that many NLP projects started with applications developed for English initially and subsequently required specific changes to take into account the non-ASCII characters found in many other languages. Until very recently, the OpenOffice.org French spell-checker used to flag forms with ligatures like sœur or œuf as incorrect and accepted only the incorrect spellings with two distinct characters (soeur, oeuf…). With the advent of Unicode, such problems are probably less frequent today, but it is clear that any multilingual project needs to consider idiosyncrasies such as the use of diacritics and other special characters some languages love so much…
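For search-style matching, one possible approach is to fold ligatures before comparing strings, so that a query typed as oeuvre still finds œuvre. The sketch below is my own illustration (fold_ligatures and same_word are hypothetical helper names); a spell-checker would of course do the opposite and prefer the ligature form. Note that œ has no Unicode decomposition, so standard normalization does not do this folding for us.

```python
import unicodedata

# Explicit mapping: Unicode normalization leaves œ and æ untouched.
LIGATURES = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"})

def fold_ligatures(text):
    """Replace French ligatures with their two-letter spellings."""
    return text.translate(LIGATURES)

def same_word(a, b):
    """Compare two strings while ignoring case and ligatures."""
    return fold_ligatures(a).casefold() == fold_ligatures(b).casefold()

print(unicodedata.normalize("NFKD", "œuf") == "œuf")  # True: nothing was decomposed
print(fold_ligatures("chef-d’œuvre"))                 # chef-d’oeuvre
print(same_word("sœur", "soeur"))                     # True
```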

From a morphological point of view, the word chef-d’œuvre is also atypical. While regular nouns typically take a final “s” in the plural in French (singular maison [house] -> plural maisons), the form *chef-d’œuvres is incorrect and should be flagged by a spell-checker, even if œuvres is correct on its own in other contexts. Rather, the plural is formed by adding an “s” at the end of the subtoken chef: des chefs-d’œuvre.
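A deliberately naive sketch of this plural rule follows. The pluralize function and the INVARIABLE set are hypothetical names, and the rule only covers regular nouns and compounds of the form X-d’Y; it ignores the many other irregular French plurals.

```python
# Compounds whose plural is identical to the singular (des hors-d’œuvre).
INVARIABLE = {"hors-d’œuvre"}

def pluralize(noun):
    """Naive French plural: add "s" to the head of X-d’Y compounds,
    otherwise add "s" at the end of the word."""
    if noun in INVARIABLE:
        return noun
    head, sep, rest = noun.partition("-d’")
    if sep:
        return head + "s" + sep + rest  # chef-d’œuvre -> chefs-d’œuvre
    return noun + "s"                   # maison -> maisons

print(pluralize("maison"))        # maisons
print(pluralize("chef-d’œuvre"))  # chefs-d’œuvre
```

A spell-checker built on such a rule would generate chefs-d’œuvre and never *chef-d’œuvres, which is exactly the behavior described above.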

Finally, if you consider how the word is pronounced, it is clear that chef-d’œuvre poses a number of challenges: why is it that the “f” is pronounced in chef [ʃɛf], but not in chef-d’œuvre [ʃɛdœvʁ]? An interesting problem for my colleagues who create text-to-speech systems…

Chef-d’œuvre is certainly not the only complex word which encapsulates so many difficulties for those of us who create NLP applications. I could probably also have chosen le hors-d’œuvre [appetizer], which begins with an aspirated ‘h’ and does not admit elision, unlike homme [man] -> l’homme. Main-d’œuvre [manpower] would probably have been a nice candidate, too. In fact, there is no dearth of thorny issues linguists need to deal with. Well, languages are difficult, aren’t they? That’s probably what makes my job here so challenging and so interesting…

 

-- Thierry Fontenelle (Program Manager)