<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>I. M. Testy : Test Tools</title><link>http://blogs.msdn.com/imtesty/archive/tags/Test+Tools/default.aspx</link><description>Tags: Test Tools</description><dc:language>en</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Testing is Sampling</title><link>http://blogs.msdn.com/imtesty/archive/2009/07/16/testing-is-sampling.aspx</link><pubDate>Thu, 16 Jul 2009 09:11:21 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9835160</guid><dc:creator>I.M.Testy</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/imtesty/comments/9835160.aspx</comments><wfw:commentRss>http://blogs.msdn.com/imtesty/commentrss.aspx?PostID=9835160</wfw:commentRss><wfw:comment>http://blogs.msdn.com/imtesty/rsscomments.aspx?PostID=9835160</wfw:comment><description>&lt;p&gt;It seems it is about this time of year that I need to detach a bit from the world to reflect back on the past year and reevaluate my personal and professional goals moving forward. Perhaps I am just getting older or perhaps just a bit wiser (that is synonymous with 'sapient' for the C-D crowd), but I find it refreshing to break away this time of year to tend to my gardens, work on my boat, read some novels, and contemplate life's joys. Now, the major work projects are (almost) finished on my boat, the garden is planted and we are harvesting the early produce, and I reset both personal and professional development objectives for the next year and beyond. 
So, let me get back to sharing some of my ideas about testing.&lt;/p&gt;  &lt;p&gt;Many of you who read this blog also know of my website &lt;a href="http://www.TestingMentor.com"&gt;Testing Mentor&lt;/a&gt; where I post a few job aids and random test data generation tools I've created. I am a big proponent of random test data using an approach I refer to as &lt;em&gt;&lt;strong&gt;probabilistic stochastic test data&lt;/strong&gt;&lt;/em&gt;.&amp;#160; In May I was in Dusseldorf, Germany at the Software &amp;amp; Systems Quality Conference to present a talk on my approach. I especially enjoy these &lt;a href="http://www.sqs-conferences.com/index.htm" target="_blank"&gt;SQS conferences&lt;/a&gt; (now igniteQ) because the attendees are a mix of industry experts and academia, and I was looking for feedback on my approach. I call my approach probabilistic stochastic test generation because the process is a bit more complex than simple random data generation. As with random data generation, we cannot absolutely predict a &lt;em&gt;probabilistic&lt;/em&gt; system, but we can control the feasibility of specified behaviors. And the adjective &lt;em&gt;stochastic&lt;/em&gt; simply means &amp;quot;pertaining to a process involving a randomly determined sequence of observations each of which is considered as a sample of one element from a probability distribution.&amp;quot; In a nutshell, my approach involves segregating the population into equivalence partitions, then randomly selecting elements from specified parameterized equivalence partitions (which is how we know the probability of specific behaviors), and finally mutating the data until it satisfies the defined fitness criteria. 
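&lt;/p&gt;  &lt;p&gt;As a rough illustration of those three steps (partition, select, mutate), here is a minimal Python sketch. It is only an illustration, not the code behind the Testing Mentor tools: the partition names, weights, code point ranges, and the fitness rule are all assumptions invented for this example.&lt;/p&gt;
&lt;pre&gt;
```python
import random
import string

# Hypothetical equivalence partitions for a text input.
# Each entry maps a name to (selection weight, members).
PARTITIONS = {
    "ascii_letters": (0.4, string.ascii_letters),
    "digits": (0.1, string.digits),
    "latin_extended": (0.3, [chr(cp) for cp in range(0x00C0, 0x0250)]),
    "cjk_ideographs": (0.2, [chr(cp) for cp in range(0x4E00, 0x5000)]),
}


def pick_char(rng):
    """Pick a partition by weight, then one member uniformly at random."""
    names = list(PARTITIONS)
    weights = [PARTITIONS[name][0] for name in names]
    name = rng.choices(names, weights=weights, k=1)[0]
    return rng.choice(PARTITIONS[name][1])


def is_fit(text, required):
    """Assumed fitness rule: some character comes from the required partition."""
    return any(ch in PARTITIONS[required][1] for ch in text)


def mutate(text, required, rng):
    """Overwrite one random position with a member of the required partition."""
    index = rng.randrange(len(text))
    replacement = rng.choice(PARTITIONS[required][1])
    return text[:index] + replacement + text[index + 1:]


def generate(length, required, seed):
    rng = random.Random(seed)  # a logged seed makes a failing string reproducible
    text = "".join(pick_char(rng) for _ in range(length))
    while not is_fit(text, required):
        text = mutate(text, required, rng)
    return text
```
&lt;/pre&gt;
&lt;p&gt;Calling generate(12, "cjk_ideographs", 42) returns a 12-character string that contains at least one character from the required partition, and the same seed always reproduces the same string.&lt;/p&gt;  &lt;p&gt;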
By combining equivalence partitioning and basic evolutionary computation (EC) concepts it is possible to generate large amounts of random test data that is representative of a virtually infinite population of possible data.&lt;/p&gt;  &lt;p&gt;One of the questions that came up during the presentation was how many random samples are required for confidence in any given test case; in other words, how do we determine the number of tests using randomly generated test data? This is not an easy question to answer because the sample size of any given population depends on several factors such as:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;variability of data &lt;/li&gt;    &lt;li&gt;precision of measurement &lt;/li&gt;    &lt;li&gt;population size &lt;/li&gt;    &lt;li&gt;risk factors &lt;/li&gt;    &lt;li&gt;allowable sampling error &lt;/li&gt;    &lt;li&gt;purpose of experiment or test &lt;/li&gt;    &lt;li&gt;probability of selecting &amp;quot;bad&amp;quot; or uninteresting data &lt;/li&gt; &lt;/ul&gt;  &lt;h4&gt;&lt;strong&gt;Using sampling for equivalence class partition testing&lt;/strong&gt;&lt;/h4&gt;  &lt;p&gt;But, the question also brought to mind a parallel discussion regarding how we go about selecting elements from equivalence class partition subsets. I am adamantly opposed to hard-coding test data in a test case (automated or manual), but a colleague challenged me and said that since any element in an equivalence partition is representative of all elements in that partition, why can't we simply choose a few values from that equivalence subset? I realize this approach is used all the time by many testers, which is perhaps why we sometimes miss problems. But, hard-coding some small subset of values from a relatively large population of possible values is rarely a good idea, and is generally not the most effective approach for robust test design. 
One problem with hard-coding a variable is that the hard-coded value becomes static, and we know that static test data loses its effectiveness over time in subsequent tests using the same exact test data. Also, hard-coding specific values in a range of values means that we have a 0% probability of including any other value in that range. Another problem with hard-coded values stems from the selection criteria used to choose the values from a set of possible values. Typically we select values from a set based on historical failure indicators, customer data, and our own biased judgment or intuition of ‘interesting’ values. &lt;/p&gt;  &lt;p&gt;However, the problem is that any equivalence class partition is a hypothesis that all elements are equal. Of course, the only way to validate or affirm that hypothesis is to test the entire population of the given equivalence class partition. Customer-like values, values based on failure indicators, and especially values we select based on our intuition are biased samples of the population, and may only represent a small portion of the entire population. Also, the number of values selected from any given equivalence partition set is usually fewer than the number required for some reasonable level of statistical confidence. So, while we definitely want to include values representative of our customers, values derived from historical failure indicators, and even our own intuition, we should also apply scientific sampling methods and include unbiased, randomly sampled values or elements from our set of values or population to help reduce uncertainty and increase confidence.&lt;/p&gt;  &lt;p&gt;For example, let's say that we are testing font size in Microsoft Word. Most font sizes range from 1pt through 1638pt and include half-sized fonts as well within that range. That is a population size of 3273 possible values. 
If we suspected that any value in the population had an equal probability of causing an error the standard deviation would be 50%. In this example, we would need a sample size of 343 statistically unbiased randomly selected values from the population to assert a 95% confidence level with a sampling error or precision of ±5%. Even in this situation, the number of values may appear to be quite large if the tests are manually executed, which is perhaps one reason why extremely small subsets of hard-coded values fail to find problems that are exposed by other values within that equivalence partition (all too often after the software is released). Fortunately, statistical sampling is much easier and less costly with automated test cases and probabilistic random test data generation.&lt;/p&gt;  &lt;h4&gt;&lt;strong&gt;Testing is Sampling&lt;/strong&gt;&lt;/h4&gt;  &lt;p&gt;Statistical sampling is commonly used for experimentation in natural sciences as well as studies in social sciences (where I first learned it while studying sociology and anthropology). And, if we really stop to think about it, any testing effort is simply a sample of tests from the virtually infinite population of possible tests. Of course, there is always the probability that sampling misses or overlooks something interesting. But, this is true of any approach to testing and is explained by B. Beizer's Pesticide Paradox. 
The question we must ask ourselves is: will statistical sampling of values in equivalence partitions or other test data help improve our confidence when used in conjunction with customer representative data, historical data, and data we intuit based on experience and knowledge?&amp;#160; Will scientifically quantified empirical evidence help increase the confidence of the decision makers?&lt;/p&gt;  &lt;p&gt;In my opinion anything that helps improve confidence and provides empirical evidence is valuable, and statistical sampling is a tool we should understand and put into our professional testing toolbox. There are several well established formulas for calculating sample size that can help us establish a baseline for a desired confidence level. But, rather than belabor you with formulas, I decided to whip together a Statistical Sample Size Calculator that I posted to &lt;a href="http://ssscalculator.codeplex.com/" target="_blank"&gt;CodePlex&lt;/a&gt; and also on my &lt;a href="http://www.TestingMentor.com" target="_blank"&gt;Testing Mentor&lt;/a&gt; site to help testers determine the minimum number of samples of statistically unbiased randomly generated test data from a given equivalence partition to use in a test case to help establish a statistically reliable level of confidence. 
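&lt;/p&gt;  &lt;p&gt;For readers who do want to see a formula, here is a short Python sketch. The calculator's exact formula is not shown in this post, so treat the choice as an assumption: it uses the widely taught Cochran sample size formula with a finite population correction, and the function name and parameter names are made up for the example.&lt;/p&gt;
&lt;pre&gt;
```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Cochran's sample size with a finite population correction (unrounded)."""
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    # Shrink the estimate for a finite population.
    return n0 / (1 + (n0 - 1) / population)


if __name__ == "__main__":
    n = sample_size(3273)
    print(round(n, 1), math.floor(n), math.ceil(n))
```
&lt;/pre&gt;
&lt;p&gt;For the font size population of 3273 values this evaluates to roughly 343.9, which is in line with the 343 quoted above; rounding up to 344 is the more conservative choice.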
&lt;/p&gt;  &lt;p&gt;&lt;em&gt;&lt;strong&gt;Cockamamie chaos causes confusion; controlled chaos cultivates confidence!&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9835160" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/imtesty/archive/tags/The+Professional+Tester/default.aspx">The Professional Tester</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Testing/default.aspx">Testing</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Test+Tools/default.aspx">Test Tools</category></item><item><title>Troubleshooting Test Data with String Decoder</title><link>http://blogs.msdn.com/imtesty/archive/2009/02/25/troubleshooting-test-data-with-string-decoder.aspx</link><pubDate>Wed, 25 Feb 2009 13:12:51 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9443867</guid><dc:creator>I.M.Testy</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/imtesty/comments/9443867.aspx</comments><wfw:commentRss>http://blogs.msdn.com/imtesty/commentrss.aspx?PostID=9443867</wfw:commentRss><wfw:comment>http://blogs.msdn.com/imtesty/rsscomments.aspx?PostID=9443867</wfw:comment><description>&lt;p&gt;I value static test data that is derived from historical failure indicators, or representative of typical end-users. But, of course a problem with static test data is that it only provides a limited set of all possible data, and becomes stale or provides little new information after multiple iterations of the test. So, I am a proponent of using random data in well-designed tests. Of course, recklessly generating random data is just plain dumb and potentially results in numerous false positives. 
But, when the data set is well defined and decomposed into equivalence class subsets then it is possible to generate random data that is representative of all possible data elements; probabilistic stochastic test data!&lt;/p&gt;  &lt;p&gt;Last week I released an update to the test tool &lt;a href="http://www.testingmentor.com/tools/tools_pages/babel.htm" target="_blank"&gt;Babel&lt;/a&gt; for generating random strings of Unicode characters. Babel is a useful tool for comprehensive positive or negative testing of a textbox and other edit controls, and API parameters that take string arguments. Using probabilistic stochastic test data significantly increases the breadth of data coverage during a test cycle which increases the probability of exposing anomalies in string parsing and other string manipulation algorithms. But, when using characters from across the Unicode spectrum anomalies are usually caused by a specific character code point (or code points for surrogate pair characters), or combinations of characters. &lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/blogfiles/imtesty/WindowsLiveWriter/DecodingTestStrings_E10C/image_2.png"&gt;&lt;img title="image" style="border-right: 0px; border-top: 0px; display: inline; margin-left: 0px; border-left: 0px; margin-right: 0px; border-bottom: 0px" height="469" alt="image" src="http://blogs.msdn.com/blogfiles/imtesty/WindowsLiveWriter/DecodingTestStrings_E10C/image_thumb.png" width="288" align="left" border="0" /&gt;&lt;/a&gt;Of course, telling a developer that a string composed of the characters ꁲᱚRבּ䍳㄁܁쭤࿦ኳ causes an unexpected error would most likely be met with that classic deer in headlights look followed by some muttering such as &amp;quot;That's not a real string&amp;quot; and &amp;quot;nobody would ever enter such a string.&amp;quot; Often times developers are likely to shun random strings as test data, and managers might claim it is not representative of 'real' customer scenarios. 
So, the professional tester knows that instead of simply arguing in favor of random string testing we must troubleshoot the string to identify the specific character code point or code point combination causing the error. While a 'real' customer is unlikely to enter a string of random characters from multiple language scripts, the problem is likely caused by a single character (and sometimes the combination of character code points), and there is some probability of a customer somewhere in the world entering that problematic character! So, as professionals we must find that specific problematic character.&lt;/p&gt;  &lt;p&gt;To help professional testers decode each character in a string to its code point value I recently completed a new tool called &lt;a href="http://www.testingmentor.com/tools/tools_pages/str2val.htm" target="_blank"&gt;String Decoder&lt;/a&gt;. This test tool is an updated version of my old Str2Val tool (which had some serious problems when converting strings with surrogate pair characters). &lt;a href="http://www.testingmentor.com/tools/tools_pages/str2val.htm" target="_blank"&gt;String Decoder&lt;/a&gt; will decode Unicode characters (including surrogate pairs) to their hexadecimal UTF-16 (Big or Little Endian), UTF-8, UTF-7 encoding values, or an integer value (UTF-32).&lt;/p&gt;  &lt;p&gt;For example the characters in the string んޏ᠘㎝Xᔲ뉞ဵ have UTF-16 Big Endian encoding values displayed in the Results list in the image.&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt; Once the specific character code point or combination is identified, the tester can now tell the developer exactly what Unicode character or integer value is causing the anomaly. 
For example, it is much better to state a Unicode value of U+13BD is causing unexpected functionality as compared to trying to explain how to input the Cherokee letter MU or saying &amp;quot;just enter this character&amp;#160; Ꮍ.&amp;quot;&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;&lt;/p&gt;  &lt;p&gt;&lt;/p&gt;  &lt;p&gt;&lt;/p&gt;  &lt;p&gt;String Decoder can also be used to compare different Unicode transformation format encodings, or convert between Unicode hex values and 32-bit integer values of characters.&lt;/p&gt;  &lt;p&gt;Let me know what you think!&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9443867" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/imtesty/archive/tags/The+Professional+Tester/default.aspx">The Professional Tester</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Testing/default.aspx">Testing</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Test+Tools/default.aspx">Test Tools</category></item><item><title>Random string generation…Update!</title><link>http://blogs.msdn.com/imtesty/archive/2009/02/17/random-string-generation-update.aspx</link><pubDate>Tue, 17 Feb 2009 09:58:49 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9427186</guid><dc:creator>I.M.Testy</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/imtesty/comments/9427186.aspx</comments><wfw:commentRss>http://blogs.msdn.com/imtesty/commentrss.aspx?PostID=9427186</wfw:commentRss><wfw:comment>http://blogs.msdn.com/imtesty/rsscomments.aspx?PostID=9427186</wfw:comment><description>&lt;p&gt;One of the biggest challenges in input testing is the sheer amount of potential characters and the virtually infinite number of permutations of those characters in different character positions in a string. 
Even if we know about the myriad of language scripts used throughout the world, manually generating characters from multiple language groups would be excruciatingly inefficient. &lt;/p&gt;  &lt;p&gt;Since any modern application should support Unicode characters, we can assert the strings “abcdefg” and “ڄƥ藖꼩昨” are equivalent for most input testing requiring a Unicode string. So, random string test data generation is useful for easily increasing the breadth of test data tested, and also for testing the robustness of the application's ability to process complex data streams. &lt;/p&gt;  &lt;p&gt;&lt;a href="http://www.testingmentor.com/tools/tools_pages/babel.htm" target="_blank"&gt;Babel 2.0&lt;/a&gt; is a free test tool, and one of the few random string generators that can generate a string of characters from across the entire Unicode spectrum. Since its initial release in 2006 it has been widely popular. So, I am happy to announce that an updated &lt;a href="http://www.testingmentor.com/tools/tools_pages/babel.htm" target="_blank"&gt;Babel 2.0&lt;/a&gt; is released! I know this constitutes a shameless plug…but, sometimes it helps to plug tools we’ve made that can benefit other testers or developers.&lt;/p&gt;  &lt;p&gt;Unlike many string generators that only produce a string of random ASCII characters, Babel can produce a string of random Unicode characters defined in the Unicode 5.1 specification, including surrogate pair characters (which often expose problems in various text boxes…hint, hint). 
Additional updates to Babel 2.0 include:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;Updated to the Unicode 5.1 spec (including new script groups and character code points)&lt;/li&gt;    &lt;li&gt;Ability to include/exclude combining character code points &lt;/li&gt;    &lt;li&gt;Ability to include/exclude reserved NetBIOS characters&lt;/li&gt;    &lt;li&gt;Custom range allows character generation from 0x01 through 0xFFFF.&lt;/li&gt;    &lt;li&gt;Ability to generate strings with a max length of 100,000 characters&lt;/li&gt;    &lt;li&gt;Improved distribution of characters from the selected language script groups&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;The following illustration provides a basic flow diagram of how Babel generates random strings. Essentially, one script group is randomly selected from all selected script group nodes, and all code points assigned to that script group are put into a collection. Next, one character is randomly selected from that collection and is appended to a string. This process continues until the string length equals a specified number of characters.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://blogs.msdn.com/blogfiles/imtesty/WindowsLiveWriter/RandomstringgenerationUpdate_14310/Babel_4.jpg"&gt;&lt;img title="Babel" style="border-right: 0px; border-top: 0px; display: block; float: none; margin-left: auto; border-left: 0px; margin-right: auto; border-bottom: 0px" height="246" alt="Babel" src="http://blogs.msdn.com/blogfiles/imtesty/WindowsLiveWriter/RandomstringgenerationUpdate_14310/Babel_thumb_1.jpg" width="635" border="0" /&gt;&lt;/a&gt; &lt;/p&gt;  &lt;p&gt;Better distribution of character selection across multiple script groups occurs by preventing the same script group from being selected before at least ½ of the other specified groups are selected. 
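&lt;/p&gt;  &lt;p&gt;The flow described above, including the rule that a script group sits out until at least half of the other groups have been picked, can be sketched in Python as follows. This is not Babel's source code: the group names, the code point ranges, and the function name random_string are hypothetical and exist only for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
import random
from collections import deque

# Hypothetical script groups: name mapped to an inclusive code point range.
# A real generator would take these tables from the Unicode data files.
SCRIPT_GROUPS = {
    "basic_latin": (0x0041, 0x005A),
    "cyrillic": (0x0410, 0x044F),
    "cherokee": (0x13A0, 0x13F4),
    "hiragana": (0x3041, 0x3096),
}


def random_string(length, groups, seed):
    rng = random.Random(seed)  # seeded so the same string can be regenerated
    # A picked group sits out for this many picks, i.e. until at least
    # half of the other groups have been chosen (0 when only one group).
    cooldown = len(groups) // 2
    recent = deque(maxlen=cooldown)
    chars = []
    for _ in range(length):
        # Randomly select one script group that is not sitting out.
        candidates = [name for name in groups if name not in recent]
        group = rng.choice(candidates)
        # Randomly select one code point assigned to that group.
        first, last = SCRIPT_GROUPS[group]
        chars.append(chr(rng.randint(first, last)))
        recent.append(group)
    return "".join(chars)
```
&lt;/pre&gt;
&lt;p&gt;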
This means that as long as more than one script group node is selected, the group that was just chosen is removed from the random selection process until at least half of the other script groups are chosen. This provides a more even distribution as compared to simple random generation.&lt;/p&gt;  &lt;p&gt;The download also includes the Babel.DLL (and the dependent UnicodeData.DLL) for test automation. The older methods are deprecated and no longer supported. The new methods have been simplified and now include:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;public static string Polyglot (int, int, bool, bool, bool, bool, bool)     &lt;br /&gt;Returns a string of random Unicode characters in all Unicode script groups based on a specified seed value.&lt;/p&gt; &lt;/blockquote&gt;  &lt;blockquote&gt;   &lt;p&gt;public static string Polyglot (int, bool, bool, bool, bool, bool, out int)     &lt;br /&gt;Generates a random seed value and returns a string of random Unicode characters in all Unicode script groups, and returns the seed value through the out parameter.&lt;/p&gt; &lt;/blockquote&gt;  &lt;blockquote&gt;   &lt;p&gt;public static string Polyglot (int, int, bool, bool, bool, bool, bool, char, char)     &lt;br /&gt;Returns a string of random Unicode characters in all Unicode script groups based on a specified seed value.&lt;/p&gt; &lt;/blockquote&gt;  &lt;blockquote&gt;   &lt;p&gt;public static string Polyglot (int, bool, bool, bool, bool, bool, char, char, out int)     &lt;br /&gt;Generates a random seed value and returns a string of random Unicode characters in all Unicode script groups, and returns the seed value through the out parameter.&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;Get the new release of &lt;a href="http://www.testingmentor.com/tools/tools_pages/babel.htm" target="_blank"&gt;Babel 2.0&lt;/a&gt; !&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=9427186" width="1" height="1"&gt;</description><category 
domain="http://blogs.msdn.com/imtesty/archive/tags/Test+Automation/default.aspx">Test Automation</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/The+Professional+Tester/default.aspx">The Professional Tester</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Testing/default.aspx">Testing</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Test+Tools/default.aspx">Test Tools</category></item><item><title>UTF What?</title><link>http://blogs.msdn.com/imtesty/archive/2008/01/14/utf-what.aspx</link><pubDate>Mon, 14 Jan 2008 07:49:45 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:7104477</guid><dc:creator>I.M.Testy</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/imtesty/comments/7104477.aspx</comments><wfw:commentRss>http://blogs.msdn.com/imtesty/commentrss.aspx?PostID=7104477</wfw:commentRss><wfw:comment>http://blogs.msdn.com/imtesty/rsscomments.aspx?PostID=7104477</wfw:comment><description>&lt;p&gt;Years ago life was pretty simple with regard to data input. Most computer programs were limited to &lt;a href="http://www.unicode.org/charts/PDF/U0000.pdf" target="_blank"&gt;ASCII characters&lt;/a&gt; and a set of character glyphs mapped into the code points between 0x80 and 0xFF (high or extended ASCII) depending on the language. The set of characters was limited to 256 code points (0x00 through 0xFF) primarily due to the CPU architecture. Multiple languages were made available via &lt;a href="http://www.microsoft.com/globaldev/reference/iso.mspx" target="_blank"&gt;ANSI code pages&lt;/a&gt;. Modifying the glyphs in the upper 128 character code points between 0x80 and 0xFF worked pretty well except for East Asian language versions. So, someone came up with the brilliant idea of encoding a character glyph with 2 bytes instead of just one. 
This double byte encoding worked quite well except that many developers were unaware that a lead byte could be a value such as 0xE5 and a trail byte could be a reserved character such as 0x5C (backslash). So, an unknowledgeable developer who stepped incrementally through a string byte by byte would often encounter all sorts of defects in their code. Fortunately today, most of us no longer have to deal with ANSI based character streams on a daily basis. Today most operating system platforms, the Internet, and many of our applications implement Unicode for data input, manipulation, data interchange, and data storage.&lt;/p&gt; &lt;p&gt;Unicode was designed to solve a lot of the problems with data interchange between computers, especially between computer systems using different language version platforms. For example, using a Windows 95 operating system there was virtually no way to view a file containing double byte encoded Chinese ideographic characters using Notepad on an English version of Windows 95. But, on Windows XP or Vista not only can we view the correct character glyph we can also enter Chinese characters by simply installing the appropriate keyboard drivers and fonts. No special language version or language pack necessary! So, if we created a Unicode document using Russian characters those same character glyphs would appear no matter what language version operating system or application we used as long as the OS and application were 100% Unicode compliant.&lt;/p&gt; &lt;p&gt;However, Unicode of course has its own unique problems. Unicode was originally based on the UCS-2 &lt;a href="http://dret.net/glossary/ucs" target="_blank"&gt;Universal Multiple Octet Coded Character Set&lt;/a&gt; defined by &lt;a href="http://www.iso.org/iso/iso_catalogue/catalogue_ics/catalogue_detail_ics.htm?csnumber=39921" target="_blank"&gt;ISO/IEC 10646&lt;/a&gt;. 
Essentially, UCS-2 provided an encoding schema in which each character glyph is encoded with 16-bits (or 32-bits for UCS-4). A pure 16-bit or 32-bit encoding format didn't really appeal to a lot of people due to various problems that would arise in string parsing. Most data around the world up to that point (with the exception of East Asian language files) were encoded with 8-bit characters. So, some really creative folks came up with ingenious ways to encode characters that more or less captured the essence of UCS (i.e., one code point == one character) using &lt;a href="http://czyborra.com/utf/" target="_blank"&gt;UCS transformation formats&lt;/a&gt; (UTF).&lt;/p&gt; &lt;p&gt;Another problem with UCS-2 and a pure 16-bit encoding was the limitation of 65,536 character code points. It wasn't very long before most people realized this set of code points was not adequate for our data needs. But, instead of adopting a UCS-4 encoding schema the Unicode Consortium redefined a range of character code points in the private use area as surrogates. These &lt;a href="http://www.unicode.org/versions/Unicode4.0.0/ch15.pdf" target="_blank"&gt;surrogate pairs&lt;/a&gt; would reference 16-bit character code points in different UCS-4 planes.&lt;/p&gt; &lt;p&gt;A while back I designed a tool called &lt;a href="http://www.testingmentor.com/tool_info/Str2val.html" target="_blank"&gt;Str2Val&lt;/a&gt; to help developers and testers troubleshoot problematic strings. For example, let's assume the following string ṙｭϑӈɅ䩲Ẩլ｡ḩ»ﾓǊĬջḰǝĦ涃ᾬよㇳლȝỄ caused an error in a text box control that accepted a string of Unicode characters. A professional tester would isolate the problematic character or combination of characters causing the error and reference the exact character code point(s) by encoding format in the defect report. 
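&lt;/p&gt;  &lt;p&gt;To show what such a per-character breakdown can look like, here is a minimal Python sketch. It is not the Str2Val source; the decode function and its field names are invented for illustration, and it relies only on Python's built-in string encoders.&lt;/p&gt;
&lt;pre&gt;
```python
def decode(text):
    """Return one row per character with its code point and encodings."""
    rows = []
    for ch in text:  # Python iterates by code point, not by 16-bit unit
        rows.append({
            "char": ch,
            "code_point": "U+%04X" % ord(ch),
            # Hex digits grouped per 16-bit unit, so surrogate pairs stand out.
            "utf16_be": ch.encode("utf-16-be").hex(" ", 2).upper(),
            "utf8": ch.encode("utf-8").hex(" ").upper(),
            "utf32": ord(ch),  # the integer value of the code point
        })
    return rows


if __name__ == "__main__":
    # Cherokee letter MU, then a character outside the 16-bit range.
    for row in decode("\u13bd\U0001d11e"):
        print(row["code_point"], row["utf16_be"], row["utf8"], row["utf32"])
```
&lt;/pre&gt;
&lt;p&gt;For U+13BD this reports 13BD in UTF-16 Big Endian and E1 8E BD in UTF-8, while U+1D11E, which lies outside the 16-bit range, appears as the surrogate pair D834 DD1E.&lt;/p&gt;  &lt;p&gt;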
I recently upgraded the &lt;a href="http://www.testingmentor.com/tool_info/Str2val.html" target="_blank"&gt;Str2Val&lt;/a&gt; tool to show the same string by various encoding formats such as UTF-16 (big and little endian), UTF-8, UTF-7, and decimal. Not only is this a good tool for troubleshooting problematic strings, it is also a useful training tool to explain the differences in the various common UCS Transformation Formats or UTF encoding methods. &lt;/p&gt; &lt;p&gt;Why is this important as a tester? Well, if you think you represent your customers yet the only characters you use in your testing are the ones labeled on the keyboard that is currently staring you in the face then you are only dealing with a small fraction of the data used by customers around the world (assuming that your software is used outside the country where it is developed, and most English language versions of software are used around the world if they are available on the open market.) If you don't know how the characters are encoded or which types of problems can arise from the various encoding methods then do you really know how to devise good tests, or are you just guessing? Do you know how to design robust tests with stochastic test data, or are you stuck with stale static data strings in flat files that you simply use over and over again? When a defect occurs in a string of characters (since string data is quite common in testing) can you troubleshoot the cause or isolate the code point, or do you simply just say "yea!...I found another bug!" 
and throw it back at the developer to figure out?&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=7104477" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/imtesty/archive/tags/The+Professional+Tester/default.aspx">The Professional Tester</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Testing/default.aspx">Testing</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Test+Tools/default.aspx">Test Tools</category></item><item><title>Babel - A 'new' random Unicode string generator test tool</title><link>http://blogs.msdn.com/imtesty/archive/2007/09/20/babel-a-new-random-unicode-string-generator-test-tool.aspx</link><pubDate>Fri, 21 Sep 2007 00:40:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:5019975</guid><dc:creator>I.M.Testy</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/imtesty/comments/5019975.aspx</comments><wfw:commentRss>http://blogs.msdn.com/imtesty/commentrss.aspx?PostID=5019975</wfw:commentRss><wfw:comment>http://blogs.msdn.com/imtesty/rsscomments.aspx?PostID=5019975</wfw:comment><description>For some time I have wanted to add surrogate pair character support to a tool I developed called GString, and this week I managed to find some time to do that work and more! 
As I developed the methods for surrogate pair support I rewrote (refactored in...(&lt;a href="http://blogs.msdn.com/imtesty/archive/2007/09/20/babel-a-new-random-unicode-string-generator-test-tool.aspx"&gt;read more&lt;/a&gt;)&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=5019975" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/imtesty/archive/tags/Test+Automation/default.aspx">Test Automation</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Testing/default.aspx">Testing</category><category domain="http://blogs.msdn.com/imtesty/archive/tags/Test+Tools/default.aspx">Test Tools</category></item></channel></rss>