Chris Wendt

News and info about Machine Translation at Microsoft

  • Optimizing a statistical MT system using shared data

    A key feature of statistical machine translation (SMT) systems is that they can be trained on different bodies of text, which customizes and optimizes them for a particular style and vocabulary. In MT land we call documents of related style and terminology a "domain". The most valuable training data for an SMT system is what we call "parallel data": the same document in two languages. The format that makes parallel data immediately useful is a translation memory (TM) file format, which segments the documents into individual sentences or paragraphs that are perfectly aligned to each other across the two languages.
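
    To make the idea of aligned segments concrete, here is a minimal sketch in Python that reads segment pairs from a TMX document, one common TM exchange format. The sample content, the language codes, and the function name read_segment_pairs are made up for illustration and are not taken from any actual training pipeline.

```python
import xml.etree.ElementTree as ET

# TMX 1.4 marks the language with xml:lang; some older files use a plain "lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-segment translation memory, for illustration only.
SAMPLE_TMX = """
<tmx version="1.4">
  <header srclang="en-US" datatype="plaintext" segtype="sentence"
          creationtool="example" creationtoolversion="1.0"
          adminlang="en-US" o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Click the Start button.</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Klicken Sie auf Start.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-US"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Speichern Sie die Datei.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_segment_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) text pairs for every translation unit
    that contains a segment in both requested languages."""
    root = ET.fromstring(tmx_text)
    src_key, tgt_key = src_lang.lower(), tgt_lang.lower()
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_key in segments and tgt_key in segments:
            pairs.append((segments[src_key], segments[tgt_key]))
    return pairs


if __name__ == "__main__":
    for source, target in read_segment_pairs(SAMPLE_TMX, "en-US", "de-DE"):
        print(source, "|||", target)
```

    Each pair returned this way is one aligned training example; units that lack one of the two languages are simply skipped.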

    In conventional SMT wisdom the best results are achieved by training the system only on data strictly within the domain, enhanced with some generic material that teaches the SMT system about the language in general. Unfortunately most owners of training data do not have enough parallel data to train a system with their own data alone. Microsoft has enough parallel data of its own to do so in many languages, but even Microsoft doesn't have sufficient data for languages into which only a few Microsoft products are localized, or into which products are localized only partially. A general guideline is that building an MT engine becomes interesting above 10 million words. More is required for morphologically rich languages - languages that have a large number of variants or inflections of the same word stem.

    The TAUS Data Association (TDA, http://www.tausdata.org) has been collecting parallel data for the purpose of sharing it among the members of the association, in order to allow each member to have a large pool of in-domain data and to leverage this data in various ways, building MT systems for themselves among other uses. The domains that TDA covers most widely today are the IT domain, with computer and software-related material, and government data, mostly parliamentary proceedings. TDA shares the data in the form of TMs, the form most easily consumed by any user of parallel data, including MT system training modules.

    Microsoft is a member of TDA, and we have done experiments to show the utility of shared TM data. The results in detail are published here: http://research.microsoft.com/apps/pubs/default.aspx?id=102244.

    In summary, the experiments show:

    • Sharing data among data owners is necessary for building an MT system for the domain.
    • A level of diversity within the parallel data improves the results; we do not need to define the domain narrowly.
    • Even a large data owner like Microsoft can achieve quality gains by using the shared data, but smaller data owners benefit significantly more.
    • Best results are achieved when training an additional model from target language material only, including the targeted subdomain, for instance a specific company, and letting the system calculate the weight of that model.
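
    To illustrate what "letting the system calculate the weight" of an extra target-language model can mean, here is a deliberately simplified sketch in Python. It is a stand-in, not the method used in the experiments: it linearly interpolates two toy unigram models and estimates the weight of the subdomain model with EM on held-out text, whereas real SMT systems typically tune such weights as part of a larger model. All data and function names are hypothetical.

```python
from collections import Counter


def train_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


def estimate_weight(p_general, p_domain, heldout, iterations=50):
    """EM estimate of lam in p(w) = lam * p_domain(w) + (1 - lam) * p_general(w)."""
    lam = 0.5
    for _ in range(iterations):
        # E-step: probability that each held-out token was produced by the domain model.
        posteriors = [
            lam * p_domain[w] / (lam * p_domain[w] + (1 - lam) * p_general[w])
            for w in heldout
        ]
        # M-step: the new weight is the average of those probabilities.
        lam = sum(posteriors) / len(posteriors)
    return lam


# Toy target-language material (assumed data, for illustration only).
general = "the cat sat on the mat and the dog ran in the park".split()
domain = "click the start button and then click the save button".split()
heldout = "click the start button and the cat sat on the mat".split()

vocab = set(general) | set(domain) | set(heldout)
p_general = train_unigram(general, vocab)
p_domain = train_unigram(domain, vocab)

lam = estimate_weight(p_general, p_domain, heldout)
print(f"estimated weight of the subdomain model: {lam:.3f}")
```

    Because the held-out text mixes subdomain and general wording, the estimated weight settles strictly between 0 and 1 instead of handing everything to one model; with real data, the held-out text would come from the targeted subdomain.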
  • Notes from the MT Summit in Ottawa

    Last week I attended the MT Summit in Ottawa (http://summitxii.amtaweb.org/), mostly in the commercial user track. Compared to the previous two AMTA conferences I attended, it was striking to see the attention on MT-aided translation and translator productivity in all three tracks at the conference (research, commercial users, and government users). What was also striking is that nobody has solid data on post-editing productivity in statistically significant numbers.

    Memorable papers related to translator productivity were:

    • Elina Lagoudaki's research on translation editing environments and the features translators like and use.
    • Ana Guerberof's data on productivity and quality in MT post-editing (also published in Multilingual magazine).
    • MT use at Adobe by Ray Flournoy.
    • Analysis of MT post-editing patterns by Declan Groves of Traslan and Dag Schmidtke of Microsoft, using Microsoft's own data.
    • Usage of a statistical MT system for English-French translations in Canada, by Michel Simard and Pierre Isabelle of the Canadian NRC.

    Here is my conference paper, co-authored with Will Lewis, not related to post-editing productivity.

  • Microsoft Knowledge Base

    My team is responsible for the MT engine that creates the foreign language translations of knowledge base (KB) articles on http://support.microsoft.com. Each KB article has a feedback field at the bottom that allows the reader to rate the article. I pay close attention to this feedback to determine the usefulness of the MT engine for a task like understanding a technical document.

    Feel free to let me know right here of any significant issues or recurring problems you see with the KB translation.

  • Citation in the French press

    I am cited in last weekend's issue of the French newspaper “Libération”: http://www.liberation.fr/transversales/weekend/234277.FR.php under the title "Babelweb". Pretty positive about MT in general.

    Thierry Fontenelle blogged about the article in his post http://blogs.msdn.com/correcteurorthographiqueoffice/.


© 2009 Microsoft Corporation. All rights reserved.