Statistical Machine Translation - Guest Blog (Updated with additional paper)

Will Lewis is a program manager on the Microsoft Translator team, working on language quality and data acquisition. Today's guest blog is a high-level explanation of how the engine works:

As many of you know, under the hood Microsoft Translator is powered by a Statistical Machine Translation (SMT) engine. Statistical systems differ from rule-based ones in that the “rules” mapping words and phrases from one language to another are learned by the system rather than hand-coded. Training an SMT engine requires amassing a large amount of parallel training data, ideally of good quality and from heterogeneous sources, and training the engine on that data. (By parallel, we mean data in which the content in one language is the same as the content in the other.) The engine learns the correspondences between words and phrases in one language and those in another, and these correspondences are reinforced by repeated occurrences of the same words and phrases throughout the input. For instance, in training the English-German system, if the engine sees the phrase “All rights reserved” on the English side and also notices “Alle Rechte vorbehalten” on the German side, it may align these two phrases and assign some probability to this alignment. Repeated occurrences of the source and target phrases in the training data will only reinforce this alignment.
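To make the idea of "assigning some probability to an alignment" a little more concrete, here is a minimal sketch in Python. It is not how the Microsoft Translator engine is implemented. It assumes the phrase pairs have already been aligned (a real system has to discover the alignments itself from sentence pairs), and the toy corpus and the function name `estimate_phrase_table` are made up for illustration. The estimate used is a plain relative frequency: how often a target phrase was seen with a source phrase, divided by how often the source phrase was seen at all.

```python
from collections import Counter, defaultdict

# Toy parallel data (invented for illustration): each entry pairs an English
# phrase with the German phrase it was aligned to in one training example.
parallel_corpus = [
    ("All rights reserved", "Alle Rechte vorbehalten"),
    ("All rights reserved", "Alle Rechte vorbehalten"),
    ("All rights reserved", "Alle Rechte vorbehalten"),
    ("All rights reserved", "Sämtliche Rechte vorbehalten"),
    ("Terms of use", "Nutzungsbedingungen"),
]


def estimate_phrase_table(pairs):
    """Relative-frequency estimate of P(target | source) from aligned pairs."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        table[src][tgt] = count / source_counts[src]
    return table


phrase_table = estimate_phrase_table(parallel_corpus)

# Print the candidate translations for one source phrase, most probable first.
candidates = phrase_table["All rights reserved"]
for tgt, prob in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{tgt}: {prob:.2f}")
# Alle Rechte vorbehalten: 0.75
# Sämtliche Rechte vorbehalten: 0.25
```

Every additional occurrence of the pair “All rights reserved” / “Alle Rechte vorbehalten” in the data pushes its share of the probability mass higher, which is the reinforcement effect described above.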

Generally, having parallel data for a language pair means we can train engines in both directions (i.e., both the English-German and the German-English systems can be trained on the same input sentences). Some of you asked why we released the English-Spanish system before we released Spanish-English. There were really two reasons. First, English-Spanish was the first general domain language pair we released. Releasing one language pair allowed us to test the infrastructure before we started releasing more. Second, the technology for Spanish-English was slightly different from that used for English-Spanish, and we needed some additional time to make the necessary infrastructural changes to accommodate it. In the future, we plan to release new translation systems in pairs (with a couple of exceptions). I can’t reveal what languages we have planned next, but do expect some new ones soon!
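The point that one set of parallel sentences can serve both directions can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the actual training pipeline: the sentence pairs are invented, and `reverse_direction` and `relative_frequencies` are hypothetical helper names. Swapping source and target gives training data for the opposite direction, and the learned probabilities differ because they are normalized over a different side.

```python
from collections import Counter

# Invented English-German pairs, standing in for a real parallel corpus.
sentence_pairs = [
    ("All rights reserved", "Alle Rechte vorbehalten"),
    ("Thank you", "Danke"),
    ("Thank you", "Vielen Dank"),
]


def reverse_direction(pairs):
    """Swap source and target so the same data can train the other direction."""
    return [(tgt, src) for src, tgt in pairs]


def relative_frequencies(pairs):
    """Estimate P(target | source) for each (source, target) pair by counting."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    return {
        (src, tgt): count / source_counts[src]
        for (src, tgt), count in pair_counts.items()
    }


en_de = relative_frequencies(sentence_pairs)
de_en = relative_frequencies(reverse_direction(sentence_pairs))

print(en_de[("Thank you", "Danke")])  # 0.5
print(de_en[("Danke", "Thank you")])  # 1.0
```

In this toy data, English “Thank you” has two German renderings, so each gets half the probability, while German “Danke” was only ever seen with “Thank you”. The same sentences produce two different models depending on which side is treated as the source.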

For those of you interested in technical discussions regarding our engines and how they work, please refer to some of the papers by the researchers who developed them.  Three recent papers of note are:

Chris Quirk, Arul Menezes. Do we need phrases? Challenging the conventional wisdom in Statistical Machine Translation. Proceedings of HLT-NAACL 2006, New York, New York, USA, May 2006.

Chris Quirk, Arul Menezes. Dependency Treelet Translation: The convergence of statistical and example-based machine translation? Machine Translation, pp. 43-65, March 2006. (Attached file)

Chris Quirk, Arul Menezes. Using Dependency Order Templates to Improve Generality in Translation. Association for Computational Linguistics, July 2007.

Comments

# a-foton » Statistical Machine Translation - Guest Blog

Wednesday, August 27, 2008 12:58 AM by someone

# re: Statistical Machine Translation - Guest Blog

Hey Machine Translation Team at MS, I've a small suggestion. Google's Translation now has a "Detect language" feature that automatically detects the foreign language which is very useful. Can you add such a feature to Windows Live Translator?

Wednesday, August 27, 2008 8:32 PM by Chris Wendt

# re: Statistical Machine Translation - Guest Blog

Hello someone, thanks for the suggestion. We'll plan that for one of our next updates.

Tuesday, September 02, 2008 1:34 PM by slawek

# re: Statistical Machine Translation - Guest Blog

Hey, can I expect that the Polish language will be available in the near future?

Friday, September 05, 2008 1:08 AM by Lane

# re: Statistical Machine Translation - Guest Blog (Updated with additional paper)

Hi Slawek,  We are always looking to add more languages to improve our engine, but we do not have a specific timeline for individual languages.

Saturday, September 06, 2008 5:13 PM by Larry Stevens

# re: Statistical Machine Translation - Guest Blog (Updated with additional paper)

Is there a way for programmers to access the translation directly from code, in C# or other .NET programming languages? Thanks

Monday, September 08, 2008 8:37 PM by Larry Stevens

# re: Statistical Machine Translation - Guest Blog (Updated with additional paper)

I am hoping that the machine translation is available as a web service that would allow inputting text in one language and getting a translation into another language. I am hoping that this would be available by making a call from a .NET programming language such as C#. My company is an ISV Microsoft Partner that develops applications for retail and manufacturing companies. Please let me know if this is available.

Saturday, September 13, 2008 9:02 PM by Larry Stevens

# re: Statistical Machine Translation - Guest Blog (Updated with additional paper)

Please answer my request. Can we, as a Microsoft Certified Partner (ISV), access the Microsoft Translation service from our program?

Thanks

Sunday, September 14, 2008 6:06 PM by CorrecteurOrthographiqueOffice

# Windows Live Translator uses only the Microsoft Research machine translation system

My colleagues at Microsoft Research announced it a few days ago: all language pairs

Tuesday, November 25, 2008 1:40 AM by Microsoft Research Machine Translation (MSR-MT) Team Blog

# New language pair on MicrosoftTranslator.com

The Translator team is excited to announce the availability of the English to Russian language pair on

Wednesday, March 18, 2009 7:19 PM by Microsoft Research Machine Translation (MSR-MT) Team Blog

# Announcing the Microsoft Translator web page widget

The Microsoft Translator team is very proud to announce the technology preview of an innovative offering

Friday, March 20, 2009 12:28 PM by Microsoft Office Project Support Blog

# Announcing the Microsoft Translator web page widget

This is a repost from the Microsoft Research Machine Translation (MSR-MT) Team Blog by permission, and

Tuesday, March 31, 2009 12:54 PM by The Old New Thing

# 2009 Q1 link clearance: Microsoft blogger edition

From elsewhere in the collective.

Monday, June 15, 2009 7:06 AM by Quality Directory

# re: Statistical Machine Translation - Guest Blog (Updated with additional paper)

This is a helpful tip on how machine translation works. I'm writing a project on language translation techniques and reading this article has given me much insight.

Saturday, June 27, 2009 5:43 AM by Computer products

# re: Statistical Machine Translation - Guest Blog (Updated with additional paper)

It's good to understand how the machine translation works. But an average person doesn't need to understand this to use the tool.
