Okay so no, when we say transliteration, we don’t mean translation. But they do have some things in common.

 

Understanding the difference between language and writing system – which naturally everyone does now, having read the previous blog post – is essential to understanding the difference between translation and transliteration.

 

As discussed earlier, a script, or a writing system, refers to the way in which a language is written down. To moderately belabor the point:

 

 

String | Language | Script
cat | English | Latin
Katze | German | Latin
кошка | Russian | Cyrillic
قطة | Arabic | Arabic
고양이 | Korean | Hangul
猫 | Chinese | Simplified Chinese
貓 | Chinese | Traditional Chinese

 

There is no 1:1 correspondence between the languages of the world and their writing systems. In some cases, a particular writing system is used to write a single language (e.g. Thai, Malayalam), but in other cases, a writing system is used to write multiple languages (e.g. Latin, Cyrillic). In a few cases the labels themselves cause confusion; for instance, the Arabic writing system is used to represent the Arabic language, along with several others (e.g. Urdu, Dari, Persian).
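To make the string/language/script distinction concrete, here is a minimal Python sketch that guesses the script of a string from its characters alone. It leans on the fact that Unicode character names start with a script-like word (for example "CYRILLIC SMALL LETTER KA"); that is a rough heuristic, not a real script-detection algorithm, and the function name guess_script is made up for this illustration.

```python
import unicodedata

def guess_script(text: str) -> str:
    """Return a rough script label based on the first alphabetic character."""
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with a script-like word, e.g.
            # "LATIN SMALL LETTER C" or "HANGUL SYLLABLE GO".
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"

# The language labels are supplied by hand: nothing in the characters
# themselves says whether "Katze" is German or English.
samples = [("cat", "English"), ("Katze", "German"), ("кошка", "Russian"),
           ("قطة", "Arabic"), ("고양이", "Korean")]

for string, language in samples:
    print(string, language, guess_script(string))
```

Notice that the script falls out of the characters, while the language has to be supplied separately: cat and Katze both come back as LATIN. A check this crude also cannot tell Simplified from Traditional Chinese, since both are reported simply as CJK.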

 

When people talk about translation, they typically mean the mapping of semantic content from one language into another. Each string in the first column above represents a translation of the English word cat into some other language. This is famously difficult to do well – not just for computers, but sometimes even for bilingual humans. It’s all well and good when you limit yourself to translating the word cat, but it gets awfully hard when you move to anything with syntactic complexity, and unbelievably hard when you move to anything of significant length. Anyone who has ever studied a foreign language probably has a pretty good idea of how hard this is. Even fluently bilingual people tend to have a hard time doing this to their own satisfaction.

 

Transliteration is another kind of mapping, only instead of mapping from one language to another, transliteration maps from one writing system to another. If you think of a writing system as a purely notational form for designating the spoken language, it’s easy to see where this kind of mapping might come in handy. Transposing strings from an unfamiliar writing system to a familiar one lets the reader know how to pronounce – and in some cases therefore semantically interpret (but more on that below) – the content. Japanese children, for instance, encounter unfamiliar Kanji all the time as they’re learning to read. Oftentimes they know the word that the Kanji represents in the spoken language, but they have no way to connect the word they’re used to hearing with the Kanji that they see in front of them. In fact, Japanese has a phonetic notational system – Yomigana – expressly designed to help Japanese speakers (adults as well as children) read unfamiliar Kanji terms, and there is a standard transliteration of Japanese into Latin script that is frequently used by non-native speakers of Japanese to help them pronounce new Japanese words that they encounter.

 

In many cases, transliteration really refers to some kind of phonetic transcription, helpful either for non-native speaker scenarios as above or in cases where a particular language may be standardly written in multiple writing systems (e.g. Serbian, which may be written in Latin or Cyrillic scripts). However, in one of the most common use cases for transliteration – the mapping between Traditional and Simplified Chinese – sound doesn’t enter into it, as both writing systems are logographic rather than alphabetic or syllabic. This mapping counts as transliteration rather than translation because it is a transposition between two writing systems that represent the same underlying spoken language, rather than a transposition between two distinct spoken languages.
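The Serbian case is a handy one to sketch, because the Cyrillic and Latin alphabets used for Serbian line up letter for letter (with three Cyrillic letters becoming Latin digraphs). The Python below is an illustration only: the names CYR_TO_LAT and transliterate are made up here, and nothing about it is meant to reflect how any particular transliteration service is implemented.

```python
# Serbian Cyrillic -> Latin letter correspondences (lowercase forms).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

def transliterate(text: str) -> str:
    """Map Serbian Cyrillic letters to Latin, leaving other characters alone."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            latin = CYR_TO_LAT[lower]
            # Uppercase input: capitalize only the first letter of the
            # result, so that a digraph comes out as "Lj" rather than "LJ".
            out.append(latin.capitalize() if ch != lower else latin)
        else:
            out.append(ch)
    return "".join(out)

print(transliterate("мачка"))    # Serbian for "cat"
print(transliterate("Београд"))
```

The mapping changes only the notation: мачка and its Latin-script form are the same Serbian word, which is exactly what separates this from translating it into cat. Scripts that don't correspond this neatly need context-sensitive rules rather than a plain lookup table.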

 

ELS in Windows 7 exposes programmatic support for several transliteration modules that you can use to help shape the user experience for your international customers, including:

 

·         Traditional Chinese <> Simplified Chinese: Use this to reduce localization costs for your application, or to extend the reach of your content to new customers

·         Cyrillic > Latin normalization, with a focus on Serbian and Russian: This can also be used to reduce localization costs, or in some cases to enable second language learning scenarios

·         Indic > Latin normalization (Devanagari, Bengali, Malayalam): These are phonetic transcriptions that are primarily useful in second language learning or communication scenarios