An important feature of statistical machine translation (SMT) systems is the ability to train the system on different bodies of text, thereby customizing and optimizing it for a certain style and vocabulary. In MT land we call a set of documents with related style and terminology a "domain". The most valuable training data for an SMT system is what we call "parallel data": the same document in two languages. The format that makes parallel data immediately useful is a translation memory (TM) file format, which segments the documents into individual sentences or paragraphs that are perfectly aligned with each other across the two languages.
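
To make the idea of aligned segments concrete, here is a minimal sketch of reading sentence pairs out of a TM. It assumes the TM is stored as TMX, a common XML-based exchange format; this post does not name a specific format, and the sample content and the `read_pairs` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-segment TM, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1.0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Click Save.</seg></tuv>
      <tuv xml:lang="de"><seg>Klicken Sie auf Speichern.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Open the file menu.</seg></tuv>
      <tuv xml:lang="de"><seg>Öffnen Sie das Dateimenü.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src, tgt):
    """Return (source, target) segment pairs for the two requested languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):  # one translation unit = one aligned segment
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang] = "".join(seg.itertext()).strip()
        # Keep only units that carry both languages.
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


pairs = read_pairs(SAMPLE_TMX, "en", "de")
```

Each entry in `pairs` is one aligned segment, which is the unit an SMT training module would consume.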

In conventional SMT wisdom, the best results are achieved by training the system only on data strictly within the domain, enhanced with some generic material that teaches the SMT system about the language in general. Unfortunately, most owners of training data do not have enough parallel data to train a system with their own data alone. Microsoft has enough parallel data of its own to do so in many languages, but even Microsoft doesn't have sufficient data in the languages into which only a few Microsoft products are localized, or into which products are localized only partially. A general guideline is that building an MT engine becomes interesting above 10 million words. More is required for morphologically rich languages, that is, languages that have a large number of variants or inflections of the same word stem.
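
As a rough way to check a data set against that guideline, the sketch below counts whitespace-separated words in a list of segments. The function names are hypothetical, and counting by whitespace is a simplifying assumption that would not work for languages written without spaces.

```python
# Guideline from the text: about 10 million words.
MIN_WORDS = 10_000_000


def count_words(segments):
    """Count whitespace-separated words across all segments."""
    return sum(len(segment.split()) for segment in segments)


def enough_for_engine(segments, threshold=MIN_WORDS):
    """True if the data reaches the word-count guideline."""
    return count_words(segments) >= threshold


sample = ["Click Save.", "Open the file menu."]
total = count_words(sample)
```

For a morphologically rich language you would pass a higher `threshold`; the post gives no specific number for that case.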

The TAUS Data Association (TDA, http://www.tausdata.org) has been collecting parallel data for the purpose of sharing it among the members of the association, so that each member has a large pool of in-domain data and can leverage this data in various ways, including building MT systems for themselves. The domains that TDA covers most widely today are the IT domain, with computer and software-related material, and government data, mostly parliamentary proceedings. TDA shares the data in the form of TMs, which are easily consumed by any user of parallel data, including MT system training modules.

Microsoft is a member of TDA, and we have done experiments to show the utility of shared TM data. The detailed results are published here: http://research.microsoft.com/apps/pubs/default.aspx?id=102244.

In summary, the experiments show:

  • Sharing data among data owners is necessary for building an MT system for the domain.
  • A level of diversity within the parallel data improves the results; we do not need to define the domain narrowly.
  • Even a large data owner like Microsoft can achieve quality gains by using the shared data, but smaller data owners benefit significantly more.
  • The best results are achieved by training an additional model on target-language material only, including material from the targeted subdomain (for instance, a specific company), and letting the system calculate the weight of that model.
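
To illustrate what "letting the system calculate the weight" of an additional target-language model can mean, here is a toy sketch. It is not the method used in the experiments: it assumes two add-one smoothed unigram models combined by linear interpolation, with the weight chosen by grid search to maximize the likelihood of held-out in-domain text. A production SMT system would typically tune feature weights differently. All data and function names are made up.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def log_likelihood(sentences, general, domain, weight):
    """Log-likelihood of the text under the interpolated model."""
    ll = 0.0
    for s in sentences:
        for w in s.split():
            ll += math.log((1 - weight) * general[w] + weight * domain[w])
    return ll


def best_weight(heldout, general, domain, steps=20):
    """Grid-search the domain model weight in [0, 1] on held-out text."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates,
               key=lambda lam: log_likelihood(heldout, general, domain, lam))


# Toy target-language data (hypothetical).
general_text = ["the committee approved the report",
                "the members voted on the proposal"]
domain_text = ["click the save button", "open the file menu"]
heldout = ["click the file menu"]

vocab = {w for s in general_text + domain_text + heldout for w in s.split()}
general = train_unigram(general_text, vocab)
domain = train_unigram(domain_text, vocab)
weight = best_weight(heldout, general, domain)
```

Because the held-out text resembles the subdomain material, the search favors the domain model; with held-out text closer to the general material, the weight would shift the other way.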