UTF What?

Years ago life was pretty simple with regard to data input. Most computer programs were limited to ASCII characters and a set of character glyphs mapped into the code points between 0x80 and 0xFF (high or extended ASCII) depending on the language. The set of characters was limited to 256 code points (0x00 through 0xFF) primarily due to the CPU architecture. Multiple languages were made available via ANSI code pages. Modifying the glyphs in the upper 128 code points between 0x80 and 0xFF worked pretty well except for East Asian language versions. So, someone came up with the brilliant idea of encoding a character glyph with 2 bytes instead of just one. This double-byte encoding worked quite well except that many developers were unaware that a lead byte could be a value such as 0xE5 and a trail byte could be a reserved character such as 0x5C (backslash). So, an unknowledgeable developer who stepped incrementally through a string byte by byte would often encounter all sorts of defects in their code. Fortunately, most of us no longer have to deal with ANSI-based character streams on a daily basis. Today most operating system platforms, the Internet, and many of our applications implement Unicode for data input, manipulation, data interchange, and data storage.
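To see the trail byte problem concretely, here is a minimal Python sketch (my own illustration, not code from any particular product) that encodes a path containing the ideograph 表 in Shift-JIS, a common double-byte code page. That character's trail byte happens to be 0x5C, so a naive byte-by-byte scan "finds" one more backslash than the text actually contains:

```python
# Minimal sketch: why byte-by-byte scanning breaks on double-byte encodings.
text = "C:\\temp\\表.txt"      # '表' (U+8868) encodes in Shift-JIS as 0x95 0x5C
raw = text.encode("shift_jis")

# A naive scan treats every 0x5C byte as a backslash path separator...
naive_hits = [i for i, b in enumerate(raw) if b == 0x5C]

print(raw.hex(" "))
print("bytes that look like a backslash:", len(naive_hits))
print("actual backslashes in the text:  ", text.count("\\"))
```

A parser that splits the path on those raw 0x5C bytes would chop the file name in the middle of a character, which is exactly the kind of defect the byte-by-byte developer shipped.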

Unicode was designed to solve a lot of the problems with data interchange between computers, especially between computer systems using different language version platforms. For example, on an English version of Windows 95 there was virtually no way to view a file containing double-byte encoded Chinese ideographic characters in Notepad. But on Windows XP or Vista not only can we view the correct character glyphs, we can also enter Chinese characters by simply installing the appropriate keyboard drivers and fonts. No special language version or language pack necessary! So, if we created a Unicode document using Russian characters, those same character glyphs would appear no matter what language version operating system or application we used, as long as the OS and application were 100% Unicode compliant.

However, Unicode of course has its own unique problems. Unicode was originally based on UCS-2, the Universal Multiple-Octet Coded Character Set defined by ISO/IEC 10646. Essentially, UCS-2 provided an encoding scheme in which each character glyph is encoded with 16 bits (or 32 bits for UCS-4). A pure 16-bit or 32-bit encoding format didn't really appeal to a lot of people due to various problems that would arise in string parsing, and most data around the world up to that point (with the exception of East Asian language files) was encoded with 8-bit characters. So, some really creative folks came up with ingenious ways to encode characters that more or less captured the essence of UCS (i.e., one code point == one character) using UCS transformation formats (UTF).
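To see what those transformation formats actually do to the bytes, here is a short Python sketch (again just my illustration; the character choices are arbitrary) that encodes a few characters under UTF-8, UTF-16 in both byte orders, and UTF-32:

```python
# Sketch: the same code points as bytes under several UCS transformation formats.
for ch in ("A", "é", "涃"):   # U+0041, U+00E9, U+6D83
    print(f"U+{ord(ch):04X} ({ch})")
    for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be"):
        print(f"  {enc:9}  {ch.encode(enc).hex(' ')}")
```

Notice that ASCII stays a single byte in UTF-8, while UTF-16 always spends at least two bytes per character and the byte order matters, which is one reason UTF-8 became so common for files and data interchange.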

Another problem with UCS-2 and a pure 16-bit encoding was the limit of 65,536 character code points. It wasn't very long before most people realized this set of code points was not adequate for our data needs. But, instead of adopting a UCS-4 encoding scheme, the Unicode Consortium redefined a range of character code points in the private use area as surrogates. A surrogate pair combines two 16-bit code units to reference a single character code point in one of the supplementary planes beyond the Basic Multilingual Plane.
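The arithmetic behind surrogates is fairly simple, and worth seeing once. The following Python sketch (my illustration; the Deseret character is an arbitrary supplementary-plane example) splits a code point above U+FFFF into its high and low surrogate and checks the result against the built-in UTF-16 codec:

```python
# Sketch of the surrogate-pair arithmetic for a supplementary-plane code point.
cp = 0x10400                      # U+10400, DESERET CAPITAL LETTER LONG I
offset = cp - 0x10000             # 20 bits of payload above the BMP
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low  = 0xDC00 + (offset & 0x3FF)  # low (trail) surrogate
print(f"U+{cp:05X} -> {high:04X} {low:04X}")

# The built-in UTF-16 codec produces the same pair of 16-bit code units.
print(chr(cp).encode("utf-16-be").hex(" "))
```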

A while back I designed a tool called Str2Val to help developers and testers troubleshoot problematic strings. For example, let's assume the following string ṙュϑӈɅ䩲Ẩլ。ḩ»モNJĬջḰǝĦ涃ᾬよㇳლȝỄ caused an error in a text box control that accepted a string of Unicode characters. A professional tester would isolate the problematic character or combination of characters causing the error and reference the exact character code point(s) by encoding format in the defect report. I recently upgraded the Str2Val tool to show the same string in various encoding formats such as UTF-16 (big and little endian), UTF-8, UTF-7, and decimal. Not only is this a good tool for troubleshooting problematic strings, it is also a useful training tool for explaining the differences between the common UCS Transformation Formats, or UTF encoding methods.
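Str2Val itself isn't shown here, but the core idea is easy to sketch in a few lines of Python (a rough approximation only, and it leaves out UTF-7): walk the string character by character and report the code point, the decimal value, and the bytes under a couple of encodings.

```python
# Rough sketch of the idea behind a string-to-values dump (not the Str2Val tool).
def dump(s):
    for ch in s:
        cp = ord(ch)
        print(f"U+{cp:04X}  dec {cp:>6}  "
              f"utf-8 {ch.encode('utf-8').hex(' '):<11}  "
              f"utf-16-be {ch.encode('utf-16-be').hex(' ')}  "
              f"utf-16-le {ch.encode('utf-16-le').hex(' ')}")

dump("ṙュϑӈɅ")   # the first few characters of the sample string above
```

With a dump like that in the defect report, the developer can see immediately which code point tripped the control instead of guessing from a screenshot.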

Why is this important as a tester? Well, if you think you represent your customers, yet the only characters you use in your testing are the ones labeled on the keyboard currently staring you in the face, then you are only dealing with a small fraction of the data used by customers around the world (assuming that your software is used outside the country where it is developed, and most English language versions of software are used around the world if they are available on the open market). If you don't know how the characters are encoded or which types of problems can arise from the various encoding methods, then do you really know how to devise good tests, or are you just guessing? Do you know how to design robust tests with stochastic test data, or are you stuck with stale static data strings in flat files that you simply use over and over again? When a defect occurs in a string of characters (and string data is quite common in testing), can you troubleshoot the cause or isolate the code point, or do you simply say "Yea! I found another bug!" and throw it back at the developer to figure out?