In my last post, I spoke about the problem of character encoding. The ASCII character set works fine in North America (excluding Quebec, Mexico and Miami) but as soon as you leave the continent, you start running into all sorts of weird characters that ASCII doesn't know about. So what do we do? Well, the answer is obvious - you invent another character set that does cover these languages.
We'll start off by going to western Europe. Western Europeans (and Americans) think they rule the world so they came up with a character set to cover the common characters that they use in their various alphabets - Spanish, German, Italian, English, and so forth. These include the following: À, Ä, à, ä, ç, è, é, ü, ß, etc.
One such character encoding is ISO-8859-1, less formally referred to as Latin-1. It was originally developed by the ISO, and was later jointly maintained by the ISO and the IEC. It consists of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East Asian languages. Dutch, Estonian, French and Finnish have near-complete coverage.
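To make the 191 figure a little more concrete, here is a minimal Python sketch. It assumes the usual layout of ISO-8859-1, where printable characters sit in the byte ranges 0x20-0x7E and 0xA0-0xFF and the remaining values are left for control codes. The variable names are mine, purely for illustration.

```python
# Printable (graphic) byte values in ISO-8859-1, assuming the usual layout:
# 0x20-0x7E (the ASCII printables) and 0xA0-0xFF (the Latin-1 additions).
graphic_bytes = [b for b in range(256) if 0x20 <= b <= 0x7E or 0xA0 <= b <= 0xFF]
print(len(graphic_bytes))  # 191

# Every character in the set occupies exactly one byte.
text = "Àäçéüß"
encoded = text.encode("iso-8859-1")
print(len(text), len(encoded))  # 6 6
print(encoded.hex())            # c0e4e7e9fcdf

# Characters outside the set cannot be encoded at all.
try:
    "€".encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False
print(euro_fits)  # False
```

The last step shows the flip side of a small character set: anything it does not list, such as the euro sign, simply has no byte value.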
Another common character set of the Latin alphabet is Windows-1252 (also known as WinLatin1), used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages.
The type of character encoding in an email is specified either in the message headers or in the MIME headers. For example, in a recent email I got in my Gmail account, it is specified in the MIME headers:
--000e0cd29d9ca6119d04646c8428
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
My email client sees that the character set is ISO-8859-1, so it interprets each byte value in the message body as the character that ISO-8859-1 assigns to that number (more on this in a future post). Similarly, in a recent spam message:
------=SPLITOR00A_001_340918203D
Content-Type: text/html; charset="windows-1252"
Content-Transfer-Encoding: 7bit
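If you would rather not read the raw headers by eye, something like the following Python sketch can pull the charset out of each MIME part using the standard library's email package. The message below is a made-up stand-in (the boundary string and body text are my own assumptions, not the emails quoted above), and this is only one way to do it.

```python
from email import message_from_string

# A made-up two-part message; the boundary and bodies are illustrative only.
raw = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/alternative; boundary="example-boundary"\n'
    '\n'
    '--example-boundary\n'
    'Content-Type: text/plain; charset=ISO-8859-1\n'
    'Content-Transfer-Encoding: 7bit\n'
    '\n'
    'Hello\n'
    '--example-boundary\n'
    'Content-Type: text/html; charset="windows-1252"\n'
    'Content-Transfer-Encoding: 7bit\n'
    '\n'
    '<p>Hello</p>\n'
    '--example-boundary--\n'
)

msg = message_from_string(raw)

# Walk every part and collect (content type, declared charset).
# get_content_charset() returns the charset parameter in lower case.
charsets = [
    (part.get_content_type(), part.get_content_charset())
    for part in msg.walk()
    if not part.is_multipart()
]
print(charsets)
# [('text/plain', 'iso-8859-1'), ('text/html', 'windows-1252')]

# The declared charset tells the client how to turn bytes into characters.
sample = b"caf\xe9"
decoded = [sample.decode(cs) for _, cs in charsets]
print(decoded)  # ['café', 'café']
```

Here the byte 0xE9 happens to mean the same thing in both character sets, which is why both parts decode to the same word.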
Another place that you can look to see what character set the message is encoded in is in the message headers:
Content-Type: text/html; charset="us-ascii"
From here, we see that the charset is ASCII. Your email client will use this to interpret the characters in the message. Other common character encodings include the following:
ISO 8859
MS-Windows character sets
Russian
Japanese
Chinese
Korean
One thing I commonly do is open up a message and look at its character encoding to get a hint about what language it is written in. Windows-1254 is Turkish. KOI8-R is the most common Russian encoding, followed by Windows-1251. GB 2312 is the most common for Chinese. The most common encoding for Japanese is ISO-2022-JP.
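That habit of mapping a charset to a likely language can be sketched as a simple lookup. The table below contains only the encodings mentioned in this post. Both the table name and the function name are hypothetical, and the result is a hint rather than real language detection.

```python
# Hypothetical lookup table built only from the encodings named in this post.
CHARSET_LANGUAGE_HINTS = {
    "windows-1254": "Turkish",
    "koi8-r": "Russian",
    "windows-1251": "Russian",
    "gb2312": "Chinese",
    "iso-2022-jp": "Japanese",
}


def guess_language(charset):
    """Return a language hint for a charset label, or None if it is not in the table."""
    if charset is None:
        return None
    # Normalize: drop surrounding quotes and spaces, ignore case.
    key = charset.strip().strip('"').lower().replace(" ", "")
    return CHARSET_LANGUAGE_HINTS.get(key)


print(guess_language("KOI8-R"))          # Russian
print(guess_language('"windows-1254"'))  # Turkish
print(guess_language("GB 2312"))         # Chinese
print(guess_language("us-ascii"))        # None
```

A charset like us-ascii or ISO-8859-1 returns nothing here on purpose, since those are shared by too many languages to say much.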
But this is not the end of the story for encoding. There is far more. In my next post, we will take a look at multi-byte character encoding.