
Lets Play Two?

For opening day this year, my beloved Chicago Cubs unveiled a statue of the immortal Ernie Banks.  On its pedestal was engraved the catchphrase that made famous his enthusiasm for the game of baseball: “Let’s Play Two”.  Except that the sculptor forgot one little detail, as reported by the Chicago Tribune: the apostrophe!  He should have checked it out in Microsoft Word first. The contextual spellchecker would have saved him a lot of grief.

 

 

-- James Lyle, Tester

Posted by nlgblog | 0 Comments

The Tatar Language Interface Pack for Office 2007 is now available

A new language now has its own LIP (Language Interface Pack) for Microsoft Office 2007: Tatar. This LIP has been available for a few days and enables Office users to work in a localized environment. Besides the user interface, which is localized into Tatar, the pack includes a spell-checker on which our group had the opportunity to collaborate. It is the first time proofing tools are available for that language. The Tatar LIP can be freely downloaded from the following URL:

http://www.microsoft.com/downloads/details.aspx?FamilyID=91426c33-ea45-482d-af08-cd8ea8cbfd53&displaylang=tt

Tatar (татарча), a Turkic language with around 7 million native speakers, uses the Cyrillic script and is spoken in Eastern Europe and in Central Asia. It is one of the two official languages of the Republic of Tatarstan, which is part of the Russian Federation.

It may be useful to point out that LIPs are created in the framework of the Microsoft Local Language Program. You will find more information about that program, including a list of languages that have LIPs for Windows and for Office, at the address below. Most Office LIPs include a spell-checker.

http://www.microsoft.com/unlimitedpotential/locallanguageprogram/default.mspx

I once wrote, when the Romansh LIP was launched, that I like this idea of preserving minority and regional languages and defending them with language technologies. I added that it gives a very special and pleasant flavor to our job… I stand by what I wrote at the time…

Thierry Fontenelle -- Program Manager

 

The Grammar Checker and Rain Man

Thinking about a couple of recent Language Log posts about grammar checkers, passive sentences, and grammatical Cupertinos, it occurred to me that grammar checkers are a bit like Rain Man.  For the benefit of younger readers (no one alive in 1988 could have missed the hype surrounding this movie), Dustin Hoffman plays Raymond (“Rain Man”), an autistic man who is also a mathematical savant.  Tom Cruise is his hotshot younger brother who finds himself with the difficult task of caring for Raymond, whose inability to relate to the real world leads them into bizarre situations.  Tom takes them to Las Vegas, where Raymond’s uncanny mathematical aptitude makes them big winners until casino security accuses them of cheating.

 

What does this have to do with grammar checkers?  A computer grammar checker is like a grammatical Rain Man: a savant with an encyclopedic knowledge of grammar rules, but no common-sense understanding of the real world, which sometimes leads to strange behavior.  Take the famous example “Time flies like an arrow”.  This sentence is notable because it has a lot of possible but quite strange interpretations, in addition to its straightforward meaning of “time moves fast the way an arrow moves fast”.  It could mean, for example, something like: whenever you encounter a fly, time it in the same way you would time an arrow (using a stopwatch, perhaps).  We humans naturally disregard such interpretations (they may never even occur to us).  But our grammatical Rain Man sees every such possibility at the same time, and has to decide among them.  And it doesn’t have our common sense to help it.  Instead, it has to figure out the intended meaning of the sentence by purely computational means, basically by reasoning only about things that can be counted.  Such as, say, the number of times it has seen “time” used as an imperative verb vs. the number of times it’s seen it used as a noun subject, and so on.
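To make the "things that can be counted" idea concrete, here is a minimal sketch in Python. It is not how the Office grammar checker works; it only illustrates choosing the reading that was seen most often. The counts, the tag names and the helper functions are all invented for illustration.

```python
from collections import Counter

# Hypothetical counts of how often each word was seen with each part of
# speech in some tagged corpus. The numbers are invented for illustration.
TAG_COUNTS = {
    "time": Counter({"noun": 950, "verb": 50}),
    "flies": Counter({"verb": 600, "noun": 400}),
    "like": Counter({"preposition": 700, "verb": 300}),
}


def most_frequent_tag(word):
    """Return the tag seen most often for `word`, or None if it is unknown."""
    counts = TAG_COUNTS.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def tag_sentence(words):
    """Tag each word independently with its most frequent reading."""
    return [(word, most_frequent_tag(word)) for word in words]
```

With these made-up counts, `tag_sentence(["Time", "flies", "like"])` picks the noun reading of "Time" simply because it was counted more often, with no appeal to common sense. A real system would also have to weigh the surrounding words, which this sketch deliberately ignores.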

 

Most of the time this all works as advertised, but it can occasionally lead to some odd decisions. Once in a while the grammar checker will make a suggestion that just looks weird, so if you blindly accept all its recommendations, you end up with those Cupertinos.  Letting Rain Man call the shots may sound like an easy way to get rich, but like Tom at the casino, if you’re not careful you can end up in trouble.

So should you take advice on grammar from a Rain Man?  Well, sure, if you need it—after all, it knows all the rules, and does a pretty good job most of the time.  But take it only as advice, and look over its shoulder while it counts the cards.  Somebody has to be watching the dealer’s eyes.  That job is better left to a human.

-- James Lyle, Tester

Posted by nlgblog | 2 Comments

Spellchecking ain't easy

A customer wrote to us recently with the observation that our English spellchecker doesn’t recognize the word ain’t, a fact which struck this customer as a tad, well, old-fashioned.  Pedantic, perhaps.  The words “uptight” and “shortsighted” might have been used.  Yikes!  I’ve been accused of a lot of things, but…

 

First, let’s admit that this is not a mistake.  Yes, we deliberately excluded ain’t.  You can tell, because we made sure to get just the right set of words to suggest in its place (isn’t, am not, aren’t) despite these words being pretty far away from ain’t in edit distance.  You could say the same about gonna and wanna, which are also excluded, but for which we suggest going to and want to, respectively.
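For readers who have not met the term, edit distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The following is a small sketch of the standard Levenshtein computation in Python; it is offered only to make the notion concrete and says nothing about how the Office speller actually ranks its suggestions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b`.

    Counts the minimum number of single-character insertions, deletions
    and substitutions needed to turn `a` into `b`.
    """
    # prev[j] holds the distance between the first i-1 characters of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]
```

Calling `edit_distance("gonna", "going to")` or `edit_distance("ain't", "am not")` shows how many edits separate a flagged word from a suggestion, which is the kind of gap a purely distance-based suggestion list would struggle to bridge.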

 

This is one of those tough calls we encounter when building a spellchecker.  As a linguist, I have no inherent objection to the word “ain’t” on any moral, intellectual, or even aesthetic grounds.  It’s a part of my own spoken idiolect, and I tend to use it unselfconsciously in informal contexts.  Ain’t no reason not to, usually.  But clearly, it is still universally regarded as nonstandard, and people naturally want to avoid using it in formal writing.  That goes for me, too--I doubt I’d want to use it when writing to the boss, even when the boss is a pretty informal kind of guy (Hi Bill!).  So from that point of view, flagging this word as an error is a good thing for customers, a lot of whom are using MS Office at work.

 

On the other hand, people also have lots of reasons to want to write ain’t--to be deliberately jocular, say, or to sound folksy, or just because it’s natural.  Or because they’re a Jane Austen character.  And in those cases, they’d prefer to spell it right--there’s only one correct way to spell ain’t, after all, and your fingers can stumble on it just as easily as any other word.  By not recognizing ain’t, we sure ain’t helping folks in those situations.  And that goes for gonna and wanna, too.

 

So what do you think?  Should there be a red squiggle under ain’t, or not?

 

-- James Lyle, Tester

Posted by nlgblog | 11 Comments

Microsoft Office 2008 for Mac now supports the French spelling reform

Microsoft Office 2008 for Mac was released this week. Users who are interested in the French language will notice a change in the French spell-checker: it now takes into account the French spelling reform, which is recommended by official bodies such as the Académie Française, the Conseil Supérieur de la langue française, the French Ministry of Education, the Groupe québécois pour la modernisation de la norme du français and the Réseau pour la nouvelle orthographe du français.

The official texts make it abundantly clear that both the traditional (‘old’) spelling and the ‘new’ (recommended) spelling are valid. The Mac Office 2008 speller accepts both forms by default (traditional and new spellings). Users of the Office 2007 French speller are already familiar with the three options which enable them to change this default mode. As of this week, Mac Office users can also decide whether they want their speller to:

(a) consider the old and new forms as valid (which is the default option)

(b) apply only the traditional (‘old’) spelling (i.e. ‘new’ forms will be squiggled)

(c) apply the ‘new’ (rectified) spelling only (i.e. the ‘old’ forms will be squiggled)
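The three options can be pictured with a short Python sketch. This is only an illustration of the behaviour described above, not the speller's implementation: the `ReformMode` names, the `is_flagged` helper and the three sample word pairs are assumptions chosen for the example.

```python
from enum import Enum


class ReformMode(Enum):
    """Hypothetical names for the three speller options."""
    BOTH = "both"                     # option (a), the default
    TRADITIONAL_ONLY = "traditional"  # option (b)
    NEW_ONLY = "new"                  # option (c)


# A few illustrative (traditional form, rectified form) pairs.
PAIRS = [("événement", "évènement"), ("oignon", "ognon"), ("coût", "cout")]
TRADITIONAL_FORMS = {traditional for traditional, _ in PAIRS}
NEW_FORMS = {new for _, new in PAIRS}


def is_flagged(word, mode=ReformMode.BOTH):
    """Return True if `word` would be squiggled under the given option.

    Only the reform variants listed in PAIRS are considered; ordinary
    misspellings are outside the scope of this sketch.
    """
    if mode is ReformMode.TRADITIONAL_ONLY:
        return word in NEW_FORMS
    if mode is ReformMode.NEW_ONLY:
        return word in TRADITIONAL_FORMS
    return False  # option (a): both forms are accepted
```

For instance, `is_flagged("ognon", ReformMode.TRADITIONAL_ONLY)` is true, while `is_flagged("ognon")` is false because the default option accepts both forms.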

If you are interested in a very brief description of the kinds of changes you might notice with this new speller, you can check out this post, which was written when we released the new speller for Office users a while ago.

 

 -- Thierry Fontenelle (Program Manager) 

 

Posted by nlgblog | 3 Comments

Building corpora with the Live Search API

I just read Building and Exploring Web Corpora, which includes the Proceedings of the 3rd Web as Corpus Workshop (WAC3-2007) held at the University of Louvain-la-Neuve in September 2007. A number of papers describe how computational linguists have been using Microsoft’s Live Search Application Programming Interface (API) to build and clean corpora to be used in natural language processing.

 

One of the papers (Leturia et al.) describes the CorpEus tool, which uses the Live Search (LS) API and which the authors designed to create web corpora for the Basque language.

 

Another very interesting paper, by William Fletcher, explains why that API was found to meet linguists’ requirements for generating concordances for linguistic research. Let me quote Fletcher here:

 

· Of the Search Engines which provide free APIs to developers, Live Search is the most generous by far: it allows 10,000 queries per application id (AppID) per IP address per day; [TF: Fletcher mentions 10,000 queries per day while Leturia et al. indicate that the API allows 25,000 queries per day. The latter figure is the correct one, in fact, which makes it even more generous]

· LS provides high-quality search results, with relatively few pages from link farms or “scraper sites”, which repeat content from or link to other pages merely for advertising revenue;

· It also supports search by location, i.e. by country or even latitude and longitude;

· Live Search is more responsive to changes on the Web: there is faster turnover in the top hits returned for a given query than with Google or Yahoo!, and documents in the cache tend to be “fresher”, i.e. updated more frequently;

· The LS cache provides quick, reliable access to the original texts. In documents retrieved from the cache, LS generally detects the character set encoding accurately and converts it to UTF-8, thereby eliminating a potential source of variability and errors;

· LS also converts Adobe Acrobat PDF documents to HTML which closely reflects the formatting of the original;

· The Live Search API provides direct links to the cache, and the site responds rapidly and at a high transfer rate, permitting very efficient data collection without delays, redirections or dead links.

 

Here is the full reference for that paper, in case you want to read the whole story:

 

William H. Fletcher: Implementing a BNC-Compare-able Web Corpus, in Fairon, C., Naets, H., Kilgarriff, A., de Schryver, G.-M. (eds): Building and Exploring Web Corpora – Proceedings of the 3rd Web as Corpus Workshop, Incorporating Cleaneval (WAC3-2007, September 2007), UCL, Presses Universitaires de Louvain, 2007.

 

It’s cool to see that some members of the computational linguistics community find this search engine useful, and not just because it allows 25,000 queries a day. I’m sure more can be done to meet linguists’ needs, but it’s definitely encouraging to read that feedback. If you are interested in this Live Search API, you can check out its “Terms of Use” here: http://dev.live.com/livesearch/.

 

 

-- Thierry Fontenelle (Program Manager)

 

Posted by nlgblog | 1 Comments

Untied Nations or United Nations?

During my vacation in December 2007, I had a chance to visit a friend of mine who works for the United Nations in Bangkok.

On a Friday evening right before Secretary-General Ban Ki-moon's visit to the UN Bangkok office, I chatted with his colleagues in the UN building over beer and wine.

Many of them said they have a big problem with the Office 2003 English spellchecker because it doesn’t correct the most common spelling error in their organization: “the Untied Nations”.

They were fascinated by the idea of upgrading to Office 2007, which can correctly identify the error and suggest “United” thanks to its Contextual Spellchecker.

 

-- Kazami Uchida (International Product Engineer)

Posted by nlgblog | 1 Comments

MSR blog on the Microsoft Research Machine Translation system

Our colleagues from the Microsoft Research (MSR) group have started blogging about the statistical machine translation (MSR-MT) system they are developing. We announced the Windows Live Translator when it was launched in September. Check out their blog if you want to know all the details about this system. For instance, you will discover how to install and use the Windows Live Toolbar Translator Button, which is now available in (and for) 12 languages.

 

-- Thierry Fontenelle (Program Manager)

Posted by nlgblog | 2 Comments

New Blog on Enterprise Search

There's a new blog out there from the team that's working on Enterprise Search for MOSS (Microsoft Office SharePoint Server). They've got tips and tricks for administrators and will be posting info on features. It's a team that we work closely with, delivering query spelling suggestions and tokenization with morphological analysis. Some of the most recent news there is around a new release of Search. Check it out. http://blogs.msdn.com/enterprisesearch/ 

-- Jay Waltmunson (Program Manager)

 

Posted by nlgblog | 1 Comments

The French spelling reform in the Canadian press

 

For readers who are interested in the French spelling reform: two articles published in Canadian newspapers in Montreal a few days ago discuss the spread of the reform and its slow but increasing adoption by teachers and the press in Canada, Belgium, Switzerland and France. Both articles quote Chantal Contant from the Groupe québécois pour la modernisation de la norme du français and list the reference dictionaries and computerized tools that take the new spelling into account. In both cases, the Microsoft Office speller is listed as a tool which covers 100% of the new forms.

 

Chantal Contant points out that certain dictionaries, such as the Hachette, the Littré and the Bescherelle, have adopted the rectifications in full, as have the correction software packages Antidote, Myriade, ProLexis and Cordial and the Word spell-checker.

 

(L'Actualité, Vol. 32, No. 16, October 15, 2007, p. 70: Débat - Le français frisote)

As for reference works, Le Petit Robert and Le Petit Larousse are more reluctant than the Hachette dictionary, the Nouveau Littré, the Antidote, ProLexis and Word checkers, or the Bescherelle and Grevisse grammars, which incorporate 100% of the changes.

(Le Devoir, LES ACTUALITÉS, Tuesday, October 2, 2007, p. A4: Rectifications de l'orthographe - Les graphies font peu à peu leur chemin)

 

I blogged a few months ago about the three options offered to the users of the Office 2007 speller. You can also find a brief description of the differences between the traditional spelling and the “new” spelling (which is now also recommended by the French Ministry of Education in its official curriculum) here.

 

Thierry Fontenelle – Program Manager

Posted by nlgblog | 4 Comments

Contextual spelling: US English only?

Laurie asked us via the Email/Contact link:

I was always under the impression that the Contextual Spell Checker only works if your language is set to English (US) rather than English (UK). However, I have recently seen the blue squiggly lines appear for English (UK).

Can you confirm whether this has come about as a result of a specific Office update?

The contextual speller works for all varieties of English (UK, US, Australian, Canadian). This has been the case since the launch of Office 2007 and there has not been any specific update for that version of Office. If you write something like this:

(a)    When inserting an Excel chart into a Word document, the chart looses its color when the focus is set to the document.

(b)   When inserting an Excel chart into a Word document, the chart looses its colour when the focus is set to the document.

You will see the blue squiggles under looses whether you are in US English or in UK English mode. If you set (b) to UK English to make sure colour is not red-squiggled, looses will nevertheless be flagged as a contextual mistake and the contextual speller will suggest loses.
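The split between the two kinds of squiggle can be pictured with a toy Python sketch. It assumes, purely for illustration, that red squiggles come from a locale-specific word list while blue squiggles come from a context rule shared by all varieties of English. The lexicons, the single rule and the `check` function are hypothetical and far simpler than the real contextual speller.

```python
# Hypothetical per-locale word lists: only the colo(u)r spelling differs.
LEXICON = {
    "en-US": {"the", "chart", "looses", "loses", "its", "color"},
    "en-GB": {"the", "chart", "looses", "loses", "its", "colour"},
}

# Hypothetical context rule shared by every locale:
# (word, following word) -> suggested replacement.
CONTEXT_RULES = {("looses", "its"): "loses"}


def check(words, locale):
    """Return (word, squiggle colour, suggestion) for each flagged word."""
    issues = []
    for index, word in enumerate(words):
        lowered = word.lower()
        if lowered not in LEXICON[locale]:
            # Unknown in this locale: ordinary (red) spelling error.
            issues.append((word, "red", None))
            continue
        following = words[index + 1].lower() if index + 1 < len(words) else None
        suggestion = CONTEXT_RULES.get((lowered, following))
        if suggestion:
            # Valid word, wrong in this context: contextual (blue) error.
            issues.append((word, "blue", suggestion))
    return issues
```

In this sketch, "the chart looses its colour" checked as `en-GB` yields only the blue flag on looses, while the same words checked as `en-US` also produce a red flag on colour.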

Thanks for giving us the opportunity to dispel that rumo(u)r, Laurie.

--  Thierry Fontenelle (Program Manager)

Posted by nlgblog | 3 Comments

When Languages Die

James was talking about endangered languages the other day. I have just finished reading David Harrison’s new book, “When Languages Die – The Extinction of the World’s Languages and the Erosion of Human Knowledge”, which I discovered via Michael Kaplan’s blog. It’s a fascinating account of language disappearance, which takes place because thousands of languages are gradually “crowded out” by bigger languages. Six years ago, there were an estimated 6,900 distinct languages and Harrison points out that by the end of our 21st century, only about half of these languages may still be spoken because their speakers will have abandoned them to turn to more dominant, more prestigious or more widely known languages. Harrison brilliantly demonstrates what language death or language extinction means for us. He focuses on the vast body of knowledge that will soon be lost and explores various knowledge systems (moon phases, folk taxonomies, knowledge encoded in traditional calendars, topographic naming systems…) to show how cultural knowledge is packaged in languages and cannot be transferred when people stop using their language. I found the discussion about number systems enlightening and captivating. He points out that counting systems provide a window into human cognition and that a lot is lost when the speakers of a language decide to move to the decimal counting system. His demonstration is simply superb. Harrison argues that it is urgent to document languages and to do whatever we can to preserve them and to encourage their speakers to go on using them.

Everyone must play their part there. As a software company, we have a number of initiatives to help linguistic communities (see, for instance, the Microsoft Local Language Program which provides Language Interface Packs (LIPs) in a wide range of languages, or the community glossaries of IT terms which are built by local volunteers with the aim of helping local groups promote and preserve their languages – I also talked recently, in French, about a new Breton speller for Office 2007 which was created by a Breton-speaking volunteer who devotes a lot of time and energy to the preservation of his language). We have talked a lot on this blog about proofing tools, and building word lists for spellers and other types of tools such as thesauri or word-breakers is certainly something that needs to be done if one wishes to help communities access technology in their languages. To some extent, I feel that Harrison and a group like ours (and several other groups in the company, of course) share a common passion for languages and a common goal: “what scientists can do is to capture an accurate record in the form of recordings and analyses”, he writes. Our technology can certainly help and I hope we will be able to offer even more in the future to help communities preserve their languages. At the same time, Harrison points out that no one but speakers themselves can preserve languages, since there is no such thing as a living human language without speakers (p.10). My sincere hope is that we’ll manage to create the synergies that are necessary to preserve language diversity and perhaps to prevent some languages from dying. Meanwhile, I definitely encourage you to read David Harrison’s book. You won’t regret it.

Thierry Fontenelle – Program Manager

 

Posted by nlgblog | 12 Comments

Fellow linguist blogger in Windows International

Kieran is a fellow linguist on the Windows International team, working closely with the team delivering Windows Desktop Search. She's got some great insight into language and technology on her "Loneliness of the Long Distance Linguist" blog. Check her out here: http://blogs.msdn.com/kierans/. Linguists, we are everywhere! :)

 

-- Jay Waltmunson (Program Manager)

Posted by nlgblog | 1 Comments

Japanese Word Breaking in Windows Desktop Search

Jonas Barklund is a veteran developer working on Windows Desktop Search. He’s got a great post with details on how Windows Search works in Japanese, using our Natural Language Group word breaker. Check it out: http://blogs.msdn.com/jonasbar/archive/2007/09/21/word-breaking-japanese-is-hard.aspx.

-- Jay Waltmunson (Program Manager)

Posted by nlgblog | 1 Comments

Smiley is 25

This week marked the 25th birthday of the smiley :-).  But are emoticons really only 25?  Language Log has some history.

-- James Lyle (Test Lead)

Posted by nlgblog | 1 Comments