<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Funny, It Worked Last Time : I18N</title><link>http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx</link><description>Tags: I18N</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Encodings in Strings are Evil Things (Part 8)</title><link>http://blogs.msdn.com/ryanmy/archive/2005/01/17/354864.aspx</link><pubDate>Tue, 18 Jan 2005 03:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:354864</guid><dc:creator>ryanmy</dc:creator><slash:comments>8</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/354864.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=354864</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;As more Unicode encodings are being finished, I find myself wanting to actually start using rmstring in real situations.&amp;nbsp; However, most of my "real situations" involve legacy encodings.&amp;nbsp; So, I need to start cracking on transcoding.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The first concern is allowing adapters for arbitrary transcodings.&amp;nbsp; A tricky problem that's related to transcoding is collation (aka sorting) -- most people aren't aware that sorting strings is often a locale-dependent issue.&amp;nbsp; This is a localization problem.&amp;nbsp; Just to make sure that terminology is clear, &lt;strong&gt;internationalization&lt;/strong&gt; (often abbreviated to &lt;strong&gt;i18n&lt;/strong&gt;) is the act of coding a program such that it is entirely independent of location and language; the most classic example of i18n is moving all string literals into a binary resource 
within an EXE, so that the strings may be changed without modifying the program's logic.&amp;nbsp;&amp;nbsp;This is almost always paired&amp;nbsp;with &lt;strong&gt;localization&lt;/strong&gt;&amp;nbsp;(sometimes abbreviated to &lt;strong&gt;l10n&lt;/strong&gt;), which is the act of tailoring an already-internationalized program for a specific language/locale.&amp;nbsp; Internationalization may be done by any programmer; localization requires translators.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In the case of sorting,&amp;nbsp;a binary sort is often not enough.&amp;nbsp; Context is everything!&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Where do accented characters sort -- the same as their base characters, or after?&amp;nbsp; &lt;em&gt;(In Swedish, &amp;Aring; and &amp;Auml; sort after Z.)&lt;/em&gt;&lt;/li&gt; &lt;li&gt;What are you sorting for?&amp;nbsp; &lt;em&gt;(German has a special sorting order for names in phone books!)&lt;/em&gt;&lt;/li&gt; &lt;li&gt;What about digraphs and ligatures such as ch or fi?&amp;nbsp; &lt;em&gt;(Spanish speakers, for example, will traditionally sort character sequences starting with "ch" between "c" and "d", even though they recognize "ch" as two separate characters.)&lt;/em&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;For this reason,&amp;nbsp;developers using rmstring on Win32 platforms will almost certainly want to use a sorting predicate based on Win32's &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/resources/strings/stringreference/stringfunctions/comparestring.asp"&gt;CompareString&lt;/a&gt; or &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_5s2v.asp"&gt;LCMapString&lt;/a&gt; APIs.&amp;nbsp; For example:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;rmstring&amp;lt;ucs4, bytevector&amp;gt; getfirst( std::list&amp;lt;rmstring&amp;lt;utf8, bytevector&amp;gt; &amp;gt;&amp;nbsp;&amp;amp; lines ) {&lt;br 
/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; lines.sort( win32_collator( LOCALE_USER_DEFAULT ) );&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; return (*lines.begin()).transcode&amp;lt;ucs4, bytevector&amp;gt;();&lt;br /&gt;}&lt;/font&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;This example is a bit contrived -- a real example would template the container and output encoding, and&amp;nbsp;make the LCID a&amp;nbsp;parameter with a default argument&amp;nbsp;-- but you get the point.&amp;nbsp; &lt;font face="Courier New"&gt;win32_collator&lt;/font&gt;, in this case, is a custom comparison predicate (passed to the list's own &lt;font face="Courier New"&gt;sort()&lt;/font&gt; member, since &lt;font face="Courier New"&gt;std::sort&lt;/font&gt; from &lt;font face="Courier New"&gt;&amp;lt;algorithm&amp;gt;&lt;/font&gt; requires random-access iterators) that converts both strings to UTF-16 and then invokes &lt;strong&gt;CompareStringW&lt;/strong&gt; on them, throwing a &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; exception if there's a codepoint above 0x10FFFF that UTF-16 can't contain.&amp;nbsp; Of course, this will hardly be my primary solution!&amp;nbsp; More on that later.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Anyways, similar issues arise for transcoding.&amp;nbsp; (Not to mention the fact that &lt;font face="Courier New"&gt;win32_collator&lt;/font&gt; is, in fact, dependent on the ability to transcode, since the Win32 Unicode APIs expect UTF-16 strings.)&amp;nbsp; So, we must include pluggable transcoders, which means changing our prototypes from Part 7 to include one more template argument, the transcoding tool:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class Engine, class SrcEnc, class SrcStore, class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;void transcode( const rmstring&amp;lt;SrcEnc, SrcStore&amp;gt; &amp;amp; src, rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; &amp;amp; tgt, Engine e = Engine()&amp;nbsp;);&lt;br /&gt;&lt;br /&gt;template &amp;lt;class Engine, class TgtEnc, class TgtStore&amp;gt;&lt;br 
/&gt;rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; rmstring&amp;lt;SrcEnc, SrcStore&amp;gt;::transcode( Engine e = Engine(), TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;These functions now delegate transcoding to the Engine object, whatever that may be.&amp;nbsp; In the Win32 vein, we could use &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp"&gt;MultiByteToWideChar&lt;/a&gt; and &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp"&gt;WideCharToMultiByte&lt;/a&gt;&amp;nbsp;-- but that's too easy, not to mention very difficult to wrap.&amp;nbsp; I'd really like to do something that's solely C++ and entirely based in the &lt;a href="http://www.unicode.org/ucd/"&gt;Unicode Character Database&lt;/a&gt;'s mappings directory.&amp;nbsp; There are a few dilemmas to be solved for that.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of &lt;strong&gt;compatibility characters&lt;/strong&gt;.&amp;nbsp; Compatibility characters are &lt;strong&gt;canonically equivalent&lt;/strong&gt; to a sequence of one or more other Unicode characters; they are usually placed so that you have a single codepoint that's equivalent to a character in some older standard.&amp;nbsp; For example, ISO 8859-2 defines &lt;strong&gt;0xA5&lt;/strong&gt; to be a capital letter L with a caron accent (&amp;#317;).&amp;nbsp; The "simple" equivalent of this in Unicode is a capital letter L (&lt;strong&gt;U+004C&lt;/strong&gt;) followed by a combining caron (&lt;strong&gt;U+030C&lt;/strong&gt;); however, Unicode also defines a single pre-combined character, &lt;strong&gt;U+013D&lt;/strong&gt;, that is directly equivalent to those two.&amp;nbsp; Therefore, almost all legacy encodings can have a simple 1:1 
function to go up to Unicode, in the form of a lookup table.&amp;nbsp; (Unfortunately, not all legacy encodings have a complete set of compatibility characters, so a LUT will not work for everything.)&amp;nbsp; Going back from Unicode to legacy is more problematic, however: we now have two Unicode equivalents (one precomposed, one decomposed) for a given legacy character.&amp;nbsp; The most direct solution, it seems, is to generate a finite automaton.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I've been&amp;nbsp;working on the DFA for the last few days.&amp;nbsp; My main concern has been memory efficiency, and I can now get a complete set of typical round-trip encoding data to fit in under 8K per encoding, which fits nicely in cache.&amp;nbsp; Obviously, certain ones will be smaller, and certain ones will be larger (in particular KOI8 and other encodings with very large symbol sets).&amp;nbsp; The DFA solution is very clean though; the legacy-to-Unicode DFA takes in bytes and outputs 32-bit unsigned ints containing codepoints which are then re-encoded, and the Unicode-to-legacy DFA takes in codepoints and outputs bytes.&amp;nbsp; Legacy-to-legacy transcodes use UCS-4 as an intermediary.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;At this point, I'm working on a program that reads in a file from &lt;a href="http://www.unicode.org/Public/MAPPINGS/"&gt;MAPPINGS&lt;/a&gt; and UnicodeData.txt from the &lt;a href="http://www.unicode.org/ucd/"&gt;Unicode Character Database&lt;/a&gt; and outputs the DFA in C++ format.&amp;nbsp; I'll post more when that's finished.&amp;nbsp; (I'm writing this entry pre-emptively, as this work-week looks like an absolute killer.)&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=354864" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings 
are Evil Things (Part 7)</title><link>http://blogs.msdn.com/ryanmy/archive/2005/01/10/350325.aspx</link><pubDate>Tue, 11 Jan 2005 03:11:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:350325</guid><dc:creator>ryanmy</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/350325.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=350325</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Eugh.&amp;nbsp; Due to a three-part punch of piling-up work, time with family over the holidays, and being thoroughly sick, I haven't had much time to work on &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; -- which means, of course, that this hasn't updated.&amp;nbsp; I haven't given up on it though!&amp;nbsp; (I'm not dead!&amp;nbsp; I don't want to go on the cart...)&amp;nbsp; If anything, my desire to finish&amp;nbsp;it has increased, since I've been working on a set of internal utilities which parse text files to take instructions, and one keeps on thinking, "This would be so much easier if I just finished &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;..."&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, on to business.&amp;nbsp; First off, the all-important &lt;font face="Courier New"&gt;fixed_width_encoding&lt;/font&gt; class is done.&amp;nbsp; This critical class is the foundation of all encodings with a fixed number of bits per code point; it's templated on an intrinsic type that the implementor knows is 1/2/4 bytes.&amp;nbsp; The hardest part of an encoding, I've found, is writing the iterators; there are a huge number of methods that one must implement in order to make a 14882-compliant iterator.&amp;nbsp; The internals are mostly simple pointer arithmetic; just a lot to be tested.&amp;nbsp; (Yes, I have to write a test harness for this, if I want it to be approved for on-campus use :P)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;One 
annoyance that I've found is pointer type conversions; imagine that you've allocated a byte array for &lt;font face="Courier New"&gt;recv()&lt;/font&gt;ing something in from a TCP socket.&amp;nbsp; If we know that said content is UCS-4, the natural urge is to cast it to an &lt;font face="Courier New"&gt;unsigned long *&lt;/font&gt; to iterate over... except that you can't.&amp;nbsp; Or, at least, you shouldn't.&amp;nbsp; If that byte array isn't suitably aligned for 32-bit accesses, the code will either run slowly (on x86 and AMD64) or &lt;strong&gt;crash&lt;/strong&gt; (on IA-64, unless &lt;font face="Courier New"&gt;&lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/seterrormode.asp"&gt;SetErrorMode()&lt;/a&gt;&lt;/font&gt; is called to force OS alignment fixups, in which case it will run extremely slowly).&amp;nbsp; Of course, people do this all the time; you just can't guarantee that doing so is safe within the confines of strictly conformant code.&amp;nbsp; There is also no way for strictly conformant code to check if a given pointer is aligned, since&amp;nbsp;there is no operator to retrieve a type's alignment requirements.&amp;nbsp; The best you can do is assume that no type will have an alignment requirement greater than its size, and &lt;font face="Courier New"&gt;assert(0 == reinterpret_cast&amp;lt;size_t&amp;gt;(ptr) % sizeof(type))&lt;/font&gt;, which is thoroughly disgusting AND assumes certain things about the host's&amp;nbsp;virtual memory system&amp;nbsp;that may not be true.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Thus, I've opted for the simplest solution: a huge comment in the code that says &lt;em&gt;"These functions assume that the backing store's data() pointer is suitably aligned for Stride-sized accesses and that size() is a multiple of Stride's size.&amp;nbsp; Violating either of these assumptions will result in your program's untimely death."&lt;/em&gt;&amp;nbsp; Sometime later, I might come up with 
a helper function &lt;font face="Courier New"&gt;alignment_assert&amp;lt;T&amp;gt;(ptr)&lt;/font&gt; that takes advantage of compiler-specific extensions such as MSVC's &lt;font face="Courier New"&gt;__alignof&lt;/font&gt; if available.&amp;nbsp; Note that this also could potentially result in a Unicode stream that does not make much sense, such as combining characters that don't properly match base characters.&amp;nbsp; The Unicode standard notes that such a stream is not ill-formed, although it is not quite renderer-friendly; so, I'll support it.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I've also had occasion to rethink my plans for &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;.&amp;nbsp; Initially, I planned to use &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; in a way similar to the Boost &lt;font face="Courier New"&gt;lexical_cast&lt;/font&gt; pseudo-operator.&amp;nbsp; However, it disturbed me that doing so would mean that every call to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; would create a temporary in which to store the result, which would then make its way to final storage either by &lt;font face="Courier New"&gt;operator=&lt;/font&gt; or copy constructor.&amp;nbsp; I ended up realizing that a good 70% of the calls to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; would be writing the encode into a string that already existed.&amp;nbsp; So, instead, we now have the &lt;font face="Courier New"&gt;transcode&lt;/font&gt; function, which comes in both non-member and member flavors:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class SrcEnc, class SrcStore, class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;void transcode( const rmstring&amp;lt;SrcEnc, SrcStore&amp;gt; &amp;amp; src, rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; &amp;amp; tgt );&lt;br /&gt;&lt;br /&gt;template &amp;lt;class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; 
rmstring&amp;lt;SrcEnc, SrcStore&amp;gt;::transcode( TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;With the above, the originally envisioned &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; is now just syntactic sugar for a call to the source string's member &lt;font face="Courier New"&gt;transcode()&lt;/font&gt; function.&amp;nbsp; It also means that the code to do transcodes is now centralized within &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;.&amp;nbsp; Handy!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Oh, and since someone asked: I'm currently developing and testing this on&amp;nbsp;Visual C++&amp;nbsp;.NET 2003 and &lt;a href="http://www.nuwen.net/gcc.html#mingw"&gt;Stephan Lavavej's distribution of MinGW&lt;/a&gt;; I'll likely run it against Comeau as well to make sure it's kosher before I release the source to the public.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My goals for the next article are to have a few non-Unicode encodings done, so I can start testing out transcoding and flesh out the different encoding mechanisms.&amp;nbsp; My main dilemma is designing the symbol tables; I noted in Part 4 that I wanted to have the ability to pass different resolving engines to the transcoder such as a perfect lossless transcription, visual parity, error codes, etc.&amp;nbsp; Visual parity will be the hardest to do; in fact, I will likely not do it right away.&amp;nbsp; (Namely, because the Unicode tables do not contain such parity information.)&amp;nbsp; Another concern has been memory consumption of tables for encodings; I'll be tackling that shortly.&lt;/p&gt; &lt;p&gt;(Since this was mostly a "what happened while I was gone" article, no point summary.)&lt;/p&gt; &lt;p&gt;(Update 2pm: &lt;A href="http://blogs.msdn.com/michkap/"&gt;Michael Kaplan&lt;/a&gt; nudged me a bit that I had broken my previous insistence on "code point" versus "character" terminology 
-- that's what I get for stepping away from the blog for two weeks!&amp;nbsp; Terminology corrected; anyone who doesn't know the difference between code points and characters needs to go back and read this blog from the beginning, or at least Part 5.)&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=350325" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 6)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/11/04/252439.aspx</link><pubDate>Thu, 04 Nov 2004 18:19:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:252439</guid><dc:creator>ryanmy</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/252439.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=252439</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;First, I apologize for not updating recently -- at work, my dev machine's power supply died, and took my hard drive with it.&amp;nbsp; Luckily, I had everything backed up; however,&amp;nbsp;I had to copy everything over to, and work on,&amp;nbsp;a single-monitor Longhorn dogfood box with no major apps installed.&amp;nbsp; This&amp;nbsp;went on for&amp;nbsp;a week and a half while I waited for Dell to&amp;nbsp;slog through&amp;nbsp;the warranty process for new parts and have them installed by a Dell-authorized tech (in order to keep the warranty going)&amp;nbsp;and this put me behind schedule for several deadlines.&amp;nbsp; So, now that my dev machine has a new PSU and HDD I've been frantically working to get caught up on things, and this has left little time for the blog.&amp;nbsp; In about two weeks these deadlines will be behind me, and I can start posting with regularity 
again.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Also, at this point I'm primarily implementing previously discussed ideas, so this series of posts will temporarily serve two purposes: a discussion of issues, and a journal of coding concerns about implementing this in C++.&amp;nbsp; This post covers one of the C++ concerns: how do you define &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; for a string that's in a variable-width encoding such as UTF-8?&amp;nbsp; One of the basic assumptions in &lt;font face="Courier New"&gt;std::string&lt;/font&gt; that I intend to honor is that &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; returns a reference to the actual data, not a copy.&amp;nbsp; For fixed-width encodings such as ASCII, UCS2, or UCS4, this is not a problem; I simply return a reference to an &lt;font face="Courier New"&gt;unsigned char&lt;/font&gt;/&lt;font face="Courier New"&gt;short&lt;/font&gt;/&lt;font face="Courier New"&gt;long&lt;/font&gt;.&amp;nbsp; However, for variable-width encodings, I need to return a range of bytes, and presumably a size as well.&amp;nbsp; I could do this with covariant returns and unions, but this is horribly ugly -- and I'd need a lot of different returns, since UTF-8 as originally specified can use up to six bytes for a single code point (RFC 3629 caps it at four).&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My solution is to return a proxy object, &lt;font face="Courier New"&gt;MultiByteChar&lt;/font&gt;.&amp;nbsp; When I initially decided on this, one of my coworkers pointed out that I would run into the same problem as &lt;font face="Courier New"&gt;vector&amp;lt;bool&amp;gt;&lt;/font&gt;.&amp;nbsp; The Vector Wrapper Problem, as&amp;nbsp;some refer to it,&amp;nbsp;deserves a bit of discussion.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The C++ standard requires that all implementations of the STL container &lt;font face="Courier New"&gt;std::vector&amp;lt;T&amp;gt;&lt;/font&gt; include a specialization &lt;font face="Courier 
New"&gt;vector&amp;lt;bool&amp;gt;&lt;/font&gt; that stores the bits in packed form.&amp;nbsp; (Contrast&amp;nbsp;with an array of bools -- bools can be stored in memory as if they were any of several integral types, depending on situation and the intelligence of the compiler).&amp;nbsp; In this case, if &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; returned a plain bool, you could not write expressions such as &lt;font face="Courier New"&gt;a[3] = true;&lt;/font&gt; -- there's no bool back there!&amp;nbsp; You need to return a proxy object containing a pointer/reference to the source container, with &lt;font face="Courier New"&gt;operator=&lt;/font&gt; overloaded, in order to support assignment in this manner.&amp;nbsp; However, this breaks with the definition of &lt;font face="Courier New"&gt;std::vector&amp;lt;T&amp;gt;&lt;/font&gt; -- the standard simultaneously claims that any &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; on a &lt;font face="Courier New"&gt;vector&lt;/font&gt; must return some type that is convertible to &lt;font face="Courier New"&gt;T &amp;amp;&lt;/font&gt;.&amp;nbsp; This bit of doublespeak results in the inability to reliably write certain types of wrappers around&amp;nbsp;vector that can accept bool.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My belief is that this was an oversight of the standardization committee.&amp;nbsp; They took the first step towards solving this by defining &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; (and the iterator's dereference operators) as returning a member typedef, &lt;font face="Courier New"&gt;reference&lt;/font&gt;; however, they stopped short of the goal, by saying that &lt;font face="Courier New"&gt;reference&lt;/font&gt; had to be defined from the allocator for the vector.&amp;nbsp; A better solution would be to define a set of semantics and overloaded operators that suitably encapsulated the intent, purpose, and behavior of references, and to define this as a 
&lt;em&gt;Reference&lt;/em&gt; typeclass.&amp;nbsp; They could then simply require that &lt;font face="Courier New"&gt;reference&lt;/font&gt; be some type meeting the &lt;em&gt;Reference(T)&lt;/em&gt; requirements, and all would be well.&amp;nbsp; This is what I intend to do.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The only remaining question is how to handle assignment; at first I planned to make it read-only, but later decided&amp;nbsp;to maintain a reference to the host string and call &lt;font face="Courier New"&gt;replace()&lt;/font&gt; on the&amp;nbsp;encoding/store in response to an &lt;font face="Courier New"&gt;operator=&lt;/font&gt;.&amp;nbsp; This means that a &lt;font face="Courier New"&gt;MultiByteChar&lt;/font&gt; must be templated on the source string in order to be typesafe.&amp;nbsp; This brings up the question of the string's lifetime and the ref's lifetime being separate; however, traditional C++ says that operations such as destruction may invalidate iterators/references/etc. 
anyways.&amp;nbsp; In this case, I think it's reasonable to behave the same way.&amp;nbsp; (This also means it's okay to use a member reference variable; in almost every case, pointers&amp;nbsp;are preferable, since references cannot be assigned to, only copy-constructed.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;As far as implementation goes, I've completed the &lt;font face="Courier New"&gt;unmanaged_ptr&lt;/font&gt; and &lt;font face="Courier New"&gt;vector_of_bytes&lt;/font&gt; backing stores, and am currently working on the &lt;font face="Courier New"&gt;fixed_width_encoding&lt;/font&gt; parent class that all fixed-width encodings such as UCS2 and ASCII derive from.&amp;nbsp; Next post, I will likely talk about the interactions of encoding and backing store classes, and how I've divided responsibilities between them.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;To finish this post off, though, a quick oddity about the use of &lt;font face="Courier New"&gt;widen()&lt;/font&gt; in iostreams.&amp;nbsp; &lt;font face="Courier New"&gt;widen()&lt;/font&gt; is defined on streams as handling certain platform-specific character conversions, such as converting &lt;font face="Courier New"&gt;'\n'&lt;/font&gt; to the appropriate end-of-line sequence on your platform (LF for Unix and Mac OS X, CRLF for Windows, CR for Classic Mac OS).&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout &amp;lt;&amp;lt; '\n';&lt;/font&gt; outputs &lt;font face="Courier New"&gt;cout.widen('\n')&lt;/font&gt;, as you'd expect.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout &amp;lt;&amp;lt; "\n";&lt;/font&gt; iterates through all characters in the string (as reported&amp;nbsp;by &lt;font face="Courier New"&gt;char_traits&amp;lt;char&amp;gt;::length()&lt;/font&gt;) and outputs the result of &lt;font face="Courier New"&gt;cout.widen()&lt;/font&gt; on each one, as you'd expect.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout 
&amp;lt;&amp;lt; string("\n");&lt;/font&gt; does NOT widen characters.&amp;nbsp; It directly asks for cout's &lt;font face="Courier New"&gt;streambuf&lt;/font&gt;, and &lt;font face="Courier New"&gt;xsputn()&lt;/font&gt;'s the entire contents of &lt;font face="Courier New"&gt;data()&lt;/font&gt; into it.&amp;nbsp; Do not pass locale, do not collect i18n.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm still thinking over how I want to define my behavior for &lt;font face="Courier New"&gt;operator&amp;lt;&amp;lt;&lt;/font&gt;.&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=252439" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 5)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/10/25/247677.aspx</link><pubDate>Tue, 26 Oct 2004 01:46:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:247677</guid><dc:creator>ryanmy</dc:creator><slash:comments>6</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/247677.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=247677</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/22/246539.aspx"&gt;In our last episode&lt;/a&gt;, we briefly discussed possible behaviors for &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;, and we discussed how the STL's &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; class was structured -- namely, we noted that it had several core functions that were overloaded many times for various types of input.&amp;nbsp; We also noted that we could avoid many of the implementation headaches that result,&amp;nbsp;because of our decision to generalize our backing 
store.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; One of my coworkers pointed out that Herb Sutter had already done an excellent dissection of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; in &lt;a href="http://www.gotw.ca/publications/xc++s.htm"&gt;Exceptional C++ Style&lt;/a&gt; -- and, indeed, the last four chapters of the book are spent analyzing its structure, breaking it down to the core functions, and then implementing many of the functions and overloads as non-member template functions.&amp;nbsp; However, he's not looking to improve &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;'s foundation -- he's merely explaining how reducing the number of methods in &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; makes the code much easier to maintain.&amp;nbsp; (For example, rather than writing an &lt;font face="Courier New"&gt;empty()&lt;/font&gt; member function, he writes a templated empty function that takes an STL&amp;nbsp;string or container, and returns true if its begin and end iterators are equal.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Furthermore, he specifically chooses some less-than-ideal but good-enough implementations as a result of making simplicity the primary goal.&amp;nbsp; For example, in his implementation of &lt;font face="Courier New"&gt;resize()&lt;/font&gt;, he implements the shrinking case by using a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; constructor to make a copy of the first N characters of the string, and then calls &lt;font face="Courier New"&gt;swap()&lt;/font&gt;, so he's incurring a memory allocation and deallocation there that is unnecessary.&amp;nbsp; While Sutter's treatment is good, we have a slightly more ambitious goal in mind (making a better class to replace &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, rather than merely improving upon the existing implementation through decomposition), so we're not duplicating effort.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; That said, I agree with his approach of decomposing functions with many overloads such as insert and replace, especially considering that our choice to generalize backing stores eliminates most of my objections to his techniques.&amp;nbsp; So, I've decided to make a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; class after all, in a sense.&amp;nbsp; The &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; class will have a single member function for each major piece of functionality, such as insertion or replacement or concatenation.&amp;nbsp; We'll then make an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; wrapper class that provides overloads that make it roughly equivalent to std::string.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Now, on to a concern I alluded to in the last entry: distinguishing code points and characters.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Up until now, I've specifically used the word "code point" to refer to a single symbol in the Unicode/UCS tables, even though Unicode refers to them as characters.&amp;nbsp; I chose to do this because of the existence of "combining characters", which are symbols such as accents, enclosing boxes/circles, formatting marks for subscript/superscript, and so on that attach to the preceding "base character".&amp;nbsp; Unicode contains unaccented base characters, combining characters, and "precomposed characters" that use a single codepoint to represent a pre-accented base character.&amp;nbsp; Precomposed characters are always considered canonically equivalent to some combination of a base character and one or more combining characters.&amp;nbsp; (See &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/18/244284.aspx"&gt;Part 1&lt;/a&gt; for an example of this.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Unicode&amp;nbsp;defines a set of &lt;a href="http://www.unicode.org/unicode/reports/tr15/"&gt;normalization 
forms&lt;/a&gt; that are used to standardize whether to favor combining characters or precomposed characters.&amp;nbsp; However, regardless of whether precomposed characters are favored or not, there are some character sequences which do not have precomposed equivalents and &lt;strong&gt;must &lt;/strong&gt;be represented using&amp;nbsp;combining characters.&amp;nbsp; To make things even nastier, there are some combining characters, most notably double diacritics, that can span multiple base characters.&amp;nbsp; (And I haven't even gotten into Arabic and Hebrew scripts that can change the direction of rendering in mid-string!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Of course, our problem here is that most programmers don't think about accents as being distinct elements to iterate through!&amp;nbsp; When you hit the right arrow in Microsoft Word to skip over an &amp;Agrave;, you don't go first to an A and then to the A's accent -- you move past the whole "character."&amp;nbsp; (Unicode refers to this rough definition of&amp;nbsp;character as a "grapheme cluster," FYI.)&amp;nbsp; If it weren't for double diacritics, we could shrug and say "Well, a character is a base codepoint plus zero or more combining codepoints."&amp;nbsp; But it may not be that easy.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; After taking a walk to think it over, I ended up deciding to err on the side of the Unicode standard -- we'll treat double diacritics as a glyph problem.&amp;nbsp; Namely, a double diacritic is attached to the preceding base codepoint only, and the fact that it extends over the following base codepoint as well is a glyphing concern.&amp;nbsp; (This is also because most of the double diacritics can also be represented as a pair of "combining half marks," where half of the glyph is applied to each character as two separate combining characters, and the glyphing engine is expected to recognize this and render it as a single glyph.)&amp;nbsp; So, 
we can say that a grapheme cluster is a base character, plus zero or more combining code points, plus any uses of the &lt;em&gt;Combining Grapheme Joiner&lt;/em&gt; codepoint.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; So, do we want &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; to take integer index arguments, iterators, etc.&amp;nbsp;as referring to code points, or to grapheme clusters?&amp;nbsp; For the sake of programmer familiarity, we're going to default to clusters, but we'll allow code points.&amp;nbsp; We will have a single iterator class that takes a bool at construction describing whether &lt;font face="Courier New"&gt;advance()&lt;/font&gt; and related methods should advance by codepoint or by cluster.&amp;nbsp; Our begin, end, and other such iterator methods will be templated, with a template argument that defaults to clusters; thus, you can ask for a code point iterator by calling &lt;font face="Courier New"&gt;str.begin&amp;lt;codepoints&amp;gt;()&lt;/font&gt;.&amp;nbsp; This is a bit messy, but workable.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Earlier, we listed the methods that seemed worthwhile to carry over.&amp;nbsp; However, many of them can be implemented in terms of the others.&amp;nbsp; Tomorrow, we'll actually write a complete header for &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; and start implementing it.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; That, and I think it's about time I go buy a hardcover copy of the Unicode standard, as I have way too many PDFs on my desktop right now.&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=247677" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 
4)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/10/22/246539.aspx</link><pubDate>Fri, 22 Oct 2004 23:42:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:246539</guid><dc:creator>ryanmy</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/246539.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=246539</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/20/245417.aspx"&gt;In our last episode&lt;/a&gt;, we established that we wouldn't be able to make a true &lt;font face="Courier New"&gt;std::string&lt;/font&gt; replacement and still handle variable-width encodings.&amp;nbsp; So, we started with the beginning lines of an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; class.&amp;nbsp; However, this doesn't mean we are going to dispense with &lt;font face="Courier New"&gt;std::string&lt;/font&gt; entirely!&amp;nbsp; But first, a quick answer about my choice of names and an explanation about exceptions.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;A friend of mine asked me yesterday, "Don't you intend to make a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; and then have a typedef'd &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; that hardwires a specific specialization, like ASCII?"&amp;nbsp; I'm considering this -- but if I hardwire anything, it will &lt;em&gt;not &lt;/em&gt;be the encoding type.&amp;nbsp; Trying to abstract away the encoding as hidden information is exactly the thinking that got us into this mess with &lt;font face="Courier New"&gt;std::string&lt;/font&gt;!&amp;nbsp; However, what we use for the backing store might be worth standardizing.&amp;nbsp; After all, using a &lt;font face="Courier New"&gt;vector&amp;lt;byte&amp;gt;&lt;/font&gt; to contain our bitstream is a very flexible choice; it's just not the best-performing one.&amp;nbsp; 
Whenever possible, we should make a library easy to use on the surface, and expose the guts of it to be changed once someone already has the program running and is trying to improve on it (by, for example, using string literals as backing stores and only copying them to heap memory when needed).&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In a dream world, we would typedef a partial specialization.&amp;nbsp; However, we get bitten by one of the most annoying mis-features in C++ -- &lt;a href="http://www.gotw.ca/gotw/079.htm"&gt;you can't template a typedef&lt;/a&gt;.&amp;nbsp; Even the STL is crippled by this, and has to work around it using its &lt;font face="Courier New"&gt;::rebind&lt;/font&gt; member.&amp;nbsp; So, the best we could do is allow someone to &lt;font face="Courier New"&gt;#define rmstring(enc) basic_rmstring&amp;lt;enc, vector_of_bytes&amp;gt;&lt;/font&gt;, and declare a string as &lt;font face="Courier New"&gt;rmstring(iso8859_1) str;&lt;/font&gt;.&amp;nbsp;&amp;nbsp;It'd work, but it makes me cringe.&amp;nbsp; Alternatively, we could use a rebind approach like the STL:&amp;nbsp;&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class Enc&amp;gt; struct rmstring {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font color="#000080"&gt;&lt;font face="Courier New"&gt;typedef&amp;nbsp;basic_rmstring&amp;lt;Enc, vector_of_bytes&amp;gt; type;&lt;br /&gt;};&lt;br /&gt;&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier New"&gt;rmstring&amp;lt;iso8859_1&amp;gt;::type str;&lt;/font&gt;&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Really, both of them are pretty damned ugly; the preprocessor approach is prettier,&amp;nbsp;IMHO, but is also considerably more dangerous.&amp;nbsp; So, I'm going to leave it as&amp;nbsp;&lt;font face="Courier New"&gt;rmstring&lt;/font&gt; with two template parameters for the purposes of this&amp;nbsp;blog.&amp;nbsp;&amp;nbsp;Eventually I'll probably opt for the &lt;font 
face="Courier New"&gt;#define&lt;/font&gt; for my own&amp;nbsp;version of the library, but you can choose whichever is more appealing to you (conciseness versus type safety), or choose neither.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The second thing I wanted to address from yesterday concerns those two exceptions, &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; and &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt;, that I listed next to the &lt;font face="Courier New"&gt;encoding_cast()&lt;/font&gt; function.&amp;nbsp; What on earth are they for?&amp;nbsp; First off, imagine that you're trying to convert a string from UCS-4 to UCS-2.&amp;nbsp; As I mentioned in &lt;a href="http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx"&gt;Part 2&lt;/a&gt;, UCS-2 is a non-universal encoding, and there are some code points that it cannot represent.&amp;nbsp; What happens if our UCS-4 string contains one of those code points?&amp;nbsp; In this case, we will throw the &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; exception.&amp;nbsp; We will also throw it in the case of converting to legacy character sets that simply do not have a code point defined for a symbol.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;There's something to keep in mind, though.&amp;nbsp; The popularity of JPEG proves that a lossless transform is not always necessary.&amp;nbsp; Imagine that we have the Latin letter &lt;strong&gt;&amp;AElig;&lt;/strong&gt; -- is it acceptable to convert this to two characters, &lt;strong&gt;AE&lt;/strong&gt;?&amp;nbsp; The proper answer is neither yes nor no; it's "sometimes."&amp;nbsp;&amp;nbsp;Remember, all this time, our definitions of string have been derived from a definition of symbols&amp;nbsp;that a human interprets -- and this means that whether or not a&amp;nbsp;'close enough'&amp;nbsp;translation is acceptable depends on who's looking at the string.&amp;nbsp; Imagine that a blind person is using a screenreader (a 
program that uses a computerized voice to read text as it appears on the screen).&amp;nbsp; In that case, there's a vast difference between &lt;strong&gt;&amp;AElig;&lt;/strong&gt; and &lt;strong&gt;AE.&lt;/strong&gt;&amp;nbsp; For a person with normal sight reading a webpage, however, the two might be interchangeable.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The computer scientist in me says that I should only allow lossless transforms -- the engineer in me knows better, though, and there's a way to satisfy both.&amp;nbsp; Therefore, we are going to add a third template argument to yesterday's definition of&amp;nbsp;&lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;, and give it a default.&amp;nbsp; This argument will be called the "symbol clash resolver," and it has a well-known method that is invoked whenever a missing symbol problem occurs.&amp;nbsp; The default one, &lt;font face="Courier New"&gt;lossless_resolver&lt;/font&gt;, will throw &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; in all cases.&amp;nbsp; A user can define alternatives, though.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Two possible alternatives immediately occur to me -- one called &lt;font face="Courier New"&gt;visual_parity_resolver&lt;/font&gt; that does replacements like the above, and another called &lt;font face="Courier New"&gt;error_symbol_resolver&lt;/font&gt; that acts like RS232's error character, inserting a compile-time constant instead (such as a box symbol, or an "&amp;lt;ERROR&amp;gt;" string, or whatever suits the user) whenever a symbol cannot be translated.&amp;nbsp; But those can all wait for later -- only &lt;font face="Courier New"&gt;lossless_resolver&lt;/font&gt; needs to be immediately defined, and its definition is trivial, since it just throws :)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The other exception, &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt;, arises when we try to 
decode a buffer that has an error in it.&amp;nbsp; In the case of UTF-8, there are sequences that decode to illegal or nonsensical numbers, and if we&amp;nbsp;are asked to decode these sequences, we should let the user know.&amp;nbsp; Imagine a scenario where you are writing an Internet&amp;nbsp;server daemon, and expect to receive a UTF-8 encoded string as the first transmission after a client successfully connects.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In this scenario, we &lt;font face="Courier New"&gt;recv()&lt;/font&gt; the data from the client into a buffer, and then construct an &lt;font face="Courier New"&gt;rmstring&amp;lt;utf8, &lt;/font&gt;&lt;font face="Courier New"&gt;unmanaged_pointer&amp;gt;&lt;/font&gt; to read it.&amp;nbsp; If there was an error in network transmission, or a malicious client was testing our ability to handle bad data, we should communicate this to the programmer as an error.&amp;nbsp; Thus, if an encoding can detect illegal input (very few encodings can!) 
it may throw a &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt; exception&amp;nbsp;if you invoke&amp;nbsp;any operations that hit that input,&amp;nbsp;or if you attempt to trans-code it.&amp;nbsp; We will also probably want to make a compile-time flag visible on the encoding class that determines whether or not it can have malformed data.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, with those two issues resolved, let's get down to our dirty business!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I said earlier that we had to pick one of two mutually exclusive goals: be a&amp;nbsp;perfect drop-in replacement for &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, or support variable-width encodings such as UTF-8.&amp;nbsp; Since I think &lt;font face="Courier New"&gt;std::string&lt;/font&gt; is poorly designed &lt;strong&gt;&lt;em&gt;and&lt;/em&gt;&lt;/strong&gt; I demonstrated that not being string-compatible is only a loss for stringstream compatibility, I'm favoring the latter.&amp;nbsp; (Just hating &lt;font face="Courier New"&gt;std::string&lt;/font&gt; alone would not be sufficient reason -- in that case I'd just be suffering from&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Not_Invented_Here"&gt;NIH syndrome&lt;/a&gt;.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;However, this doesn't mean that I can just go roll my own string class in the way that best suits my urges.&amp;nbsp; Many programmers have devoted considerable time and energy to learning &lt;font face="Courier New"&gt;std::string&lt;/font&gt;'s ins and outs, myself included -- so, I should exploit that knowledge by providing similar functions with similar arguments, as long as it doesn't compromise my design's principles.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Looking at &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;'s definition in the C++ Standard is an exercise in mental stamina.&amp;nbsp; It defines six constructors (one of which requires 
some very &lt;a href="http://www.mpi-sb.mpg.de/~kettner/courses/lib_design_03/notes/meta.html"&gt;special trickery with templating and the SFINAE principle&lt;/a&gt; to implement, as we'll see later) and over 100 methods, plus a host of non-member operators such as &lt;font face="Courier New"&gt;&amp;lt;&amp;lt;&lt;/font&gt; and &lt;font face="Courier New"&gt;+&lt;/font&gt;.&amp;nbsp; However, looking at the expected behavior for each function, most of them are overloads that call a base function.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In other words, a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; has one or two core definitions at most for each core method (such as &lt;font face="Courier New"&gt;append()&lt;/font&gt;, &lt;font face="Courier New"&gt;replace()&lt;/font&gt;, &lt;font face="Courier New"&gt;insert()&lt;/font&gt;, etc.), which take &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;s as their input.&amp;nbsp; Every other overload is defined as equivalent to calling that root function, with a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; constructor meant to convert some other form of string (char pointer, run of chars, pair of iterators, etc.) 
to a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; that the "core implementation" can grok.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Of course, real implementations aren't all written like that, because it'd mean frivolously making a copy of the input data in &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; form for each trivial overload.&amp;nbsp; Instead, a typical implementation of &lt;font face="Courier New"&gt;std::string&lt;/font&gt; has an optimized version for each&amp;nbsp;variant, making maintenance a nightmare.&amp;nbsp; But we don't have that problem -- because, instead of requiring an STL allocator, we can accept an arbitrary backing store!&amp;nbsp; So, suppose we have a working implementation of append:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt; class Encoding, class BackingStore &amp;gt; class rmstring {&lt;br /&gt;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// &lt;strong&gt;Appends &lt;em&gt;n&lt;/em&gt;&amp;nbsp;codepoints of &lt;em&gt;str&lt;/em&gt;, starting at &lt;em&gt;pos&lt;/em&gt;, to the&amp;nbsp;string.&lt;/strong&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;//&amp;nbsp;* Will throw an out_of_range exception if &lt;em&gt;pos&lt;/em&gt; &amp;gt;= &lt;em&gt;str&lt;/em&gt;.length()&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;//&amp;nbsp;* If &lt;em&gt;pos&lt;/em&gt; is in range, but&amp;nbsp;&lt;em&gt;pos&lt;/em&gt; +&amp;nbsp;&lt;em&gt;n&lt;/em&gt;&amp;nbsp;&amp;gt; &lt;em&gt;str&lt;/em&gt;.length(), &lt;em&gt;n&lt;/em&gt; is&amp;nbsp;truncated so that &lt;em&gt;pos&lt;/em&gt; + &lt;em&gt;n&lt;/em&gt; = &lt;em&gt;str&lt;/em&gt;.length().&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// *&amp;nbsp;Will throw a length_error exception if the resulting string would be larger than&amp;nbsp;&lt;em&gt;BackingStore&lt;/em&gt;'s max_size().&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt; class OtherBS &amp;gt; rmstring &amp;amp; append( rmstring&amp;lt;Encoding, 
OtherBS&amp;gt;&amp;nbsp;const &amp;amp; str, size_type pos, size_type n ) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font face="Courier New"&gt;&lt;font color="#000080"&gt;&lt;em&gt;/* implementation */&lt;br /&gt;&lt;/em&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;...&lt;br /&gt;};&lt;/font&gt;&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(Note that I've defined the above in terms of code points, not symbols.&amp;nbsp; There can be multiple codepoints representing a single symbol.&amp;nbsp; I'll discuss this problem, and the related problem of Unicode normalization forms, in a later post -- mainly because I'm still working on a solution.&amp;nbsp; :-P This is a learning exercise for me too!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Because &lt;font face="Courier New"&gt;OtherBS&lt;/font&gt; is arbitrary, we can directly implement the other overloads of &lt;font face="Courier New"&gt;append()&lt;/font&gt; as calls to &lt;font face="Courier New"&gt;append()&lt;/font&gt; with an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; constructor, without worrying about needlessly duplicating information.&amp;nbsp; If we want to use a &lt;font face="Courier New"&gt;char *&lt;/font&gt; from an ANSI C function, we can just use an &lt;font face="Courier New"&gt;unmanaged_pointer&lt;/font&gt; backing store.&amp;nbsp; If we want to use n repetitions of some character c, we can just use a &lt;font face="Courier New"&gt;run_of_chars&amp;lt;n, c&amp;gt;&lt;/font&gt; backing store.&amp;nbsp; We pass the &lt;em&gt;exact same information&lt;/em&gt; as if we were doing it the old way, but abstracted inside a templated class, so there's no overhead except at compile time.&amp;nbsp; Beautiful!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, what should we implement from &lt;font face="Courier New"&gt;std::string&lt;/font&gt;?&amp;nbsp; Here are the core functions from &lt;font face="Courier 
New"&gt;basic_string&lt;/font&gt; that seem worth carrying over:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;strong&gt;Size functions&lt;/strong&gt;: &lt;font face="Courier New"&gt;size()&lt;/font&gt; and &lt;font face="Courier New"&gt;length()&lt;/font&gt;, &lt;font face="Courier New"&gt;max_size()&lt;/font&gt;, &lt;font face="Courier New"&gt;capacity()&lt;/font&gt;, &lt;font face="Courier New"&gt;reserve()&lt;/font&gt;, &lt;font face="Courier New"&gt;resize()&lt;/font&gt;, &lt;font face="Courier New"&gt;empty()&lt;/font&gt;, &lt;font face="Courier New"&gt;clear()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Iterators&lt;/strong&gt;: &lt;font face="Courier New"&gt;begin()&lt;/font&gt;, &lt;font face="Courier New"&gt;end()&lt;/font&gt;, &lt;font face="Courier New"&gt;rbegin()&lt;/font&gt;, &lt;font face="Courier New"&gt;rend()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Accessors&lt;/strong&gt;: &lt;font face="Courier New"&gt;operator[]&lt;/font&gt;, &lt;font face="Courier New"&gt;at()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Replacers&lt;/strong&gt;: &lt;font face="Courier New"&gt;assign()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator=&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Appenders&lt;/strong&gt;: &lt;font face="Courier New"&gt;push_back()&lt;/font&gt;, &lt;font face="Courier New"&gt;push_front()&lt;/font&gt;, &lt;font face="Courier New"&gt;append()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator+=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator+&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Modifiers&lt;/strong&gt;: &lt;font face="Courier New"&gt;insert()&lt;/font&gt;, &lt;font face="Courier New"&gt;erase()&lt;/font&gt;, &lt;font face="Courier New"&gt;replace()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Searchers&lt;/strong&gt; (evil): &lt;font face="Courier New"&gt;find()&lt;/font&gt;, &lt;font face="Courier New"&gt;rfind()&lt;/font&gt;, 
&lt;font face="Courier New"&gt;find_first_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_last_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_first_not_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_last_not_of()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Utilities&lt;/strong&gt;: &lt;font face="Courier New"&gt;substr()&lt;/font&gt;, &lt;font face="Courier New"&gt;copy()&lt;/font&gt;, &lt;font face="Courier New"&gt;swap()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Comparators&lt;/strong&gt; (also evil): &lt;font face="Courier New"&gt;compare()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator==&lt;/font&gt;, &lt;font face="Courier New"&gt;operator!=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;lt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;lt;=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;=&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Streams:&lt;/strong&gt; &lt;font face="Courier New"&gt;operator&amp;lt;&amp;lt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;&amp;gt;&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Backwards compatibility:&lt;/strong&gt; &lt;font face="Courier New"&gt;c_str()&lt;/font&gt;, &lt;font face="Courier New"&gt;data()&lt;br /&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;That's a lot of stuff to implement!&amp;nbsp; But not only does it gain us good-will by allowing programmers to code much like they did with &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, it also means that we can make a &lt;font face="Courier New"&gt;typedef rmstring&amp;lt;&lt;em&gt;RMS_COMPILER_SPECIFIC_ENCODING&lt;/em&gt;, vector_of_bytes&amp;gt;&amp;nbsp;rstring&lt;/font&gt;, and be pretty damned close to &lt;font face="Courier New"&gt;std::string&lt;/font&gt;-equivalent.&amp;nbsp; (The compiler-specific encoding can 
be set in a header file, or specified on the command line -- I'll likely set it to &lt;font face="Courier New"&gt;iso8859_1&lt;/font&gt; for string and &lt;font face="Courier New"&gt;ucs2&lt;/font&gt; for wstring in a header.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;But before I get to that, I'll have a nastier problem to tackle, and that's combining characters.&amp;nbsp; Not only do we have codepoints that can take up variable amounts of space (thanks to encoding), but we also have symbols that can take up variable numbers of codepoints!&amp;nbsp; (See Part 1 and search for "diaeresis" if you're not sure why this is.)&amp;nbsp; Unicode, luckily, comes to the rescue again with a standard that determines when and how a symbol should or should not be broken down into combining characters.&amp;nbsp;&amp;nbsp;These are called&amp;nbsp;normalization forms, and we'll tackle those on Monday.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Next episode: Normalization forms and chain of command (which does not involve rmstring covering its ass if things go FUBAR).&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Takeaways from Part 4:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;We're specifically designing &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; to&amp;nbsp;force the programmer into awareness of encodings -- we don't want&amp;nbsp;to hide that with a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; being typedefed.&amp;nbsp; (We couldn't anyways, because we can't template typedefs.)&amp;nbsp; So, for now, we'll leave it as-is.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Not only are encodings not all equal, trans-coding schemes aren't all equal either!&amp;nbsp; Be aware of this, and think about how you want to handle errors!&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Even if we think&amp;nbsp;&lt;font face="Courier New"&gt;std::string&lt;/font&gt; is evil, we can still gain good will from our potential users by making ourselves as close to &lt;font face="Courier 
New"&gt;std::string&lt;/font&gt; as possible.&amp;nbsp; This, unfortunately, means lots of work.&amp;nbsp; But not as much as if we were actually implementing &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, due to our luck in choosing to template our backing store.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;However, all our methods need to be defined in terms of symbols, not code points (and certainly not bytes of encoded data!).&amp;nbsp; This makes our life difficult again.&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=246539" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 3)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/10/20/245417.aspx</link><pubDate>Thu, 21 Oct 2004 00:08:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:245417</guid><dc:creator>ryanmy</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/245417.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=245417</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&lt;em&gt;&amp;nbsp;(Before I start: I've gotten a few suggestions about readability, since my two entries thus far have been quite long.&amp;nbsp; So, entries will now contain a summary at the end with major facts/conclusions, and I'll go back and add them for the first two posts.&amp;nbsp; I'll also try to pace my paragraphs more regularly.&amp;nbsp; Thanks for the advice!)&lt;/em&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx"&gt;Yesterday&lt;/a&gt;, we took the definition of string as an ordered sequence of Unicode code points, and explored various schemes for&amp;nbsp;encoding 
and decoding code point indices on a binary computer.&amp;nbsp; At the end, we had a new definition for string -- a stream of bits, and some type of information identifying the encoding scheme used to interpret the bits as a stream of Unicode code points.&amp;nbsp; Today, since I'm a coder, we'll be starting a C++ implementation of a string library based on this definition.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Before we do that, though, there's one more nasty digression into standards-land that I'd like to take.&amp;nbsp; This is a fairly general definition of what a string is, and you don't really write libraries unless you intend for them to be general-purpose enough to be reused.&amp;nbsp;&amp;nbsp;So,&amp;nbsp;it might be a worthwhile goal to make our new string library compatible with the &lt;font face="Courier New"&gt;string&lt;/font&gt; class in the C++ Standard Template Library, so that anyone could gain its benefits simply by using a different &lt;font face="Courier New"&gt;#include&lt;/font&gt;.&amp;nbsp; Unfortunately, there are some restrictions in the C++ Standard (which I would highly suggest purchasing if you code in C++ for a living -- it's &lt;a href="http://webstore.ansi.org/ansidocstore/product.asp?sku=INCITS/ISO/IEC+14882-2003"&gt;$18 in PDF form direct from ISO&lt;/a&gt;) that prevent us from doing so -- namely, that many parts of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; are hard-wired to require a constant-size encoding and will not work with encodings such as UTF-8.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The C++ Standard starts by defining &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; as templated on three classes -- a character type (&lt;font face="Courier New"&gt;charT&lt;/font&gt;), a specialization of &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; for that type, and an allocator for that type.&amp;nbsp; (Nothing SAYS we have to implement&amp;nbsp;it with exactly those template 
parameters, but we're screwed anyways, as you'll see.)&amp;nbsp; It then defines two member typedefs for that specialization: &lt;font face="Courier New"&gt;traits_type&lt;/font&gt;, which typedefs to the templated traits specialization, and &lt;font face="Courier New"&gt;value_type&lt;/font&gt;, which&amp;nbsp;typedefs to&amp;nbsp;&lt;font face="Courier New"&gt;traits_type::value_type&lt;/font&gt;... which, by definition, is also required to be &lt;font face="Courier New"&gt;charT&lt;/font&gt;.&amp;nbsp; The definition of &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; requires that &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; be specialized only on &lt;a href="http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.7"&gt;PODs&lt;/a&gt; (which are always constant-size), and its definitions are all written to assume uniformly-sized characters.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;As if the traits problem weren't enough, a conformant &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; implementation requires that &lt;font face="Courier New"&gt;s[i]&lt;/font&gt; return the same value as &lt;font face="Courier New"&gt;s.data()[i]&lt;/font&gt;, and data() is required to return a &lt;font face="Courier New"&gt;const charT *&lt;/font&gt;.&amp;nbsp; So, even if we could get around the&amp;nbsp;traits problem, variable-length encodings still screw us because &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; and a pointer offset will no longer agree.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, we will have to abandon hopes of being a drop-in replacement for &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;.&amp;nbsp; But, really, this isn't too bad -- there are only three other libraries in the STL that require the use of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;!&amp;nbsp; The first is in &lt;font face="Courier New"&gt;locale&lt;/font&gt;, and hardly anyone uses C++'s built-in locales 
anyways, favoring OS functionality.&amp;nbsp; The second is the &lt;font face="Courier New"&gt;bitset&lt;/font&gt; container, which hardly anyone uses either.&amp;nbsp; The third is&amp;nbsp;its use as&amp;nbsp;a backing store for &lt;font face="Courier New"&gt;stringstreams&lt;/font&gt; and as the &lt;font face="Courier New"&gt;stringbuf&lt;/font&gt;&amp;nbsp;wrapper that is the foundation of &lt;font face="Courier New"&gt;iostream&lt;/font&gt;, and this is a bigger loss.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The loss of direct compatibility with&amp;nbsp;stringbuf is a big pain.&amp;nbsp; However, when you're getting to I/O, you need to have already converted your string to the encoding your user is expecting -- we shouldn't expect a prompt expecting ASCII to be able to deal with a stream of UCS-2 characters!&amp;nbsp; So, it's perfectly okay if stringbuf&amp;nbsp;is left&amp;nbsp;alone, as long as we find a way to&amp;nbsp;convert strings between different encodings.&amp;nbsp; So, stringstreams are the only real loss, and we can make our own stringstream, if need be.&amp;nbsp; (Thanks to templates, we may be able to avoid having to re-invent the wheel, which is always good.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm going to start with policy-based design, which &lt;a href="http://www.moderncppdesign.com/"&gt;Alexandrescu&lt;/a&gt; introduced a few years ago in Modern C++ Design.&amp;nbsp; (Actually, the STL beat him to the punch by using allocators as a template argument for most of its &lt;font color="#000000"&gt;containers, but he popularized its use for general customization.)&amp;nbsp; In fact, he already demonstrated policy-based design in a &lt;/font&gt;&lt;a href="http://www.cuj.com/"&gt;&lt;font color="#000000"&gt;CUJ&lt;/font&gt;&lt;/a&gt;&lt;font color="#000000"&gt; article a year or two ago by making a basic_string replacement that allowed customizing copy-on-write semantics -- but I'm a bit more ambitious 
:)&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My first stab at the class will be based directly off our most recent definition of string -- an encoding, and an ordered sequence of bits:&lt;/font&gt;&lt;/p&gt; &lt;p dir="ltr" style="MARGIN-RIGHT: 0px"&gt;&lt;font face="Courier New" color="#000080"&gt;namespace rmlibs {&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;namespace encodings {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;/* ... utf8, iso8859_1, big5, mac_roman, etc. go here ... */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;namespace backing_stores {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;/* ... string_literal, vector_of_uchars, etc. go here ... */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt;class Encoding, class&amp;nbsp;Bits&amp;gt; class rmstring {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;public:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;typedef Encoding encoding_type;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;private:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Bits _data;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;};&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Not much, but it's a start&lt;/font&gt;!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;At this point, I want to reference something I said earlier about I/O -- when you're doing I/O, whether that's taking a string in or sending a string out, your stream of bits needs to have the same encoding as the device you're talking with, or Bad Things happen.&amp;nbsp; We need some way to denote, inside code, that an 
encoding change needs to take place.&amp;nbsp; (Guessing ahead, this will probably be the&amp;nbsp;most tedious&amp;nbsp;part of development -- creating UCS-to-encoding and encoding-to-UCS transitions for each encoding and character set we support.)&amp;nbsp; I'm going to take a nod from the excellent &lt;a href="http://www.boost.org/"&gt;Boost&lt;/a&gt; library here, and make an analogue to their &lt;font face="Courier New"&gt;lexical_cast&lt;/font&gt; class.&lt;/p&gt; &lt;p dir="ltr" style="MARGIN-RIGHT: 0px"&gt;&lt;font face="Courier New" color="#000080"&gt;namespace rmlibs {&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// these are the major exceptions...&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font face="Courier New" color="#000080"&gt;class&amp;nbsp;missing_symbol;&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier New" color="#000080"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;class malformed_data;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// ... 
that are thrown by:&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier New"&gt;&lt;br /&gt;&lt;font color="#000080"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt;typename Target, typename Source&amp;gt; Target encoding_cast(Source str);&lt;br /&gt;};&lt;/font&gt;&lt;/font&gt; &lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In the near future I'll probably alter this to take only &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;s as input and output and template on encoding types in/out, since right now it accepts any pair of types -- but this is only a prototype.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The goal for doing this is to minimize conversions.&amp;nbsp; Some of my coworkers who have been kind enough to proofread have remarked, &lt;em&gt;"I'd just throw&amp;nbsp;up my hands and convert everything internally to UCS-4 and use a basic_string&amp;lt;unsigned long&amp;gt;; after all, memory is cheap."&lt;/em&gt;&amp;nbsp; In a way, they're right -- doing this would mean I'd only have to write encoding_cast() for each encoding, and not even need the&amp;nbsp;new&amp;nbsp;string&amp;nbsp;class.&amp;nbsp; But, I'm a performance guy, a bit twiddler&amp;nbsp;at heart.&amp;nbsp; I don't want to do a conversion unless I need to, or if the performance gains from a fixed-width format like UCS-4 outweigh the performance loss of having to trans-code everything.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(It's rather like image formats -- TGA is lossless and can hold damn near anything, but that doesn't mean we always convert everything to TGA first before working with it, and then convert back when we're done.&amp;nbsp; Not everything has to be "worked on," and not all work is equally difficult.&amp;nbsp; This is especially true if we're using a compile-time string literal as a backing store, since it won't be modifiable unless you make a copy!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The general plan is to use &lt;font face="Courier 
New"&gt;rmstring&lt;/font&gt; as a &lt;a href="http://hillside.net/patterns/DPBook/DPBook.html"&gt;Facade pattern&lt;/a&gt; for the Encoding class we're templated on.&amp;nbsp; Most of &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;'s methods will actually call the Encoding class and pass in state and a pointer to our Bits object as needed; the Encoding class will handle all the work of character traversal.&amp;nbsp; Since many of the encodings we're planning to&amp;nbsp;deal with are fixed-width (UCS-2, UCS-4, and most old systems like ISO 8859 and ASCII), I'll likely create a FixedWidthEncoding base class that does most of the work of locating offsets and insertion/deletion, and inherit most of the Encodings from it.&amp;nbsp; This means, the main thing that will be unique for each Encoding will be the translation tables used for converting the symbol sets for non-Unicode systems to Unicode code points, since most of the older encodings are simple fixed-width affairs and just have non-standard symbol sets.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Tomorrow, we'll start fleshing out &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;'s body with constructors and methods, and explain what those two exceptions&amp;nbsp;next to&amp;nbsp;&lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; are for.&amp;nbsp; We'll also take a brief look at screen-readers and web browsers, and make a change to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; to handle "looks-close-enough" trans-codes.&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Today's facts/conclusions:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;The definitions of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; and &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; in the C++ Standard prevent use of variable-width encodings;&amp;nbsp;therefore, we cannot make&amp;nbsp;a perfect drop-in replacement for the STL string class.&amp;nbsp; However, that's okay -- the only STL object we'll have to 
duplicate functionality for is stringstream.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;We can't expect I/O with external devices/programs to conform to whatever encoding we want -- they're expecting a specific encoding, and we need to present our data in that format -- or die a horrible, painful death.&amp;nbsp; So, the ability to trans-code is absolutely necessary.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Trans-coding can be expensive, but can have some gains, especially if going to UCS-4 for speed in manipulation or going to UTF-8 for compatibility with legacy C APIs.&amp;nbsp; Do it when necessary or justified, but avoid it if it's not absolutely necessary.&amp;nbsp; The coder should be allowed to pick an encoding and work with strings in that encoding as easily as possible.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=245417" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 2)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx</link><pubDate>Wed, 20 Oct 2004 01:38:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:244865</guid><dc:creator>ryanmy</dc:creator><slash:comments>5</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/244865.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=244865</wfw:commentRss><description>&lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;At the end of the &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/18/244284.aspx"&gt;last post&lt;/a&gt;, we reduced the abstract concept of "string" down to an "ordered sequence of Unicode code points."&amp;nbsp; (We did so by choosing to actively ignore glyph information, but we'll be coming back to it 
later.)&amp;nbsp; Unicode code points are simply numbers; of course, numbers have to be reduced to binary to be stored in a computer.&amp;nbsp; And someone who is reading a string from a file, or from memory, needs to use the exact same encoding scheme, or we're off in la-la land.&amp;nbsp; And not all encodings are equal.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;First off, the simplest route.&amp;nbsp; There are 2&lt;sup&gt;31&lt;/sup&gt; possible Unicode code points, and an&amp;nbsp;x86&amp;nbsp;register is 32 bits wide, so let's just add a zero and encode everything as a 32-bit unsigned binary!&amp;nbsp; The ISO-10646 standard calls this &lt;strong&gt;UCS-4&lt;/strong&gt;.&amp;nbsp; Only one catch -- it doesn't specify endianness.&amp;nbsp; Of course, this poses a problem if you want to trade text files between PCs and Macs.&amp;nbsp; So, UCS-4 actually is three different encodings -- &lt;strong&gt;UCS-4LE&lt;/strong&gt; (little endian), &lt;strong&gt;UCS-4BE&lt;/strong&gt; (big endian), and just plain &lt;strong&gt;UCS-4&lt;/strong&gt;, which means that no endianness is specified and you should assume that it's the host's byte order unless told otherwise.&amp;nbsp; (There are ways to tell otherwise -- but I'll mention them later.)&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Now, the ISO-10646 guys recognized that the majority of the written languages used on the Internet today can be expressed using a tiny subset of the 2&lt;sup&gt;31&lt;/sup&gt; symbols, and it seems a waste to use four bytes for every character if the high bytes are 0 most of the time.&amp;nbsp; So, ISO-10646 also defines &lt;strong&gt;UCS-2&lt;/strong&gt;, which uses a 16-bit unsigned binary, but can only represent the lower 2&lt;sup&gt;16&lt;/sup&gt; code points.&amp;nbsp; (The lower 2&lt;sup&gt;16&lt;/sup&gt; codepoints&amp;nbsp;are thus referred to as the Basic Multilingual Plane, or BMP.&amp;nbsp; 
This includes Latin, Greek, Cyrillic, Devanagari, hiragana, katakana, and Cherokee scripts, as well as many mathematical symbols and a small set of basic &lt;/font&gt;&lt;a href="http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?Han"&gt;&lt;font color="#000000"&gt;Han ideographs&lt;/font&gt;&lt;/a&gt;&lt;font color="#000000"&gt;.)&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;This is the first encoding we'll encounter that is non-universal -- there are some strings that are expressible using Unicode characters which UCS-2 cannot be used to encode.&amp;nbsp; Sadly, UCS-2 was adopted by early versions of the Unicode specification, and so UCS-2 is what most people think of when they hear "Unicode".&amp;nbsp; We can't blame them, though -- it took until &lt;strong&gt;2001&lt;/strong&gt; for ISO to use up all 2&lt;sup&gt;16&lt;/sup&gt; code points in the BMP, and by then they were adding Han ideographs in bulk.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Next, we start diving into encodings that were invented for reverse compatibility with older standards.&amp;nbsp; As we said, early versions of Unicode specified UCS-2 as a standard, back when nothing existed in the UCS tables beyond the BMP.&amp;nbsp; When it became obvious that eventually people would need to use codepoints beyond 2&lt;sup&gt;16&lt;/sup&gt;, a hybrid encoding called UTF-16 was created.&amp;nbsp; The Unicode Consortium reserved a high range of codepoints (D800 to DFFF) to be used as "surrogate characters," so that up to 1024&lt;sup&gt;2&lt;/sup&gt; characters above the BMP border could be represented as two consecutive surrogate characters, without breaking existing UCS-2 content.&amp;nbsp; This adds a brand new level of complexity to string handling, because now a single codepoint could be either 2 or 4 bytes.&amp;nbsp; This&amp;nbsp;makes even simple tasks such as iterating over the string with a for-loop 
difficult.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Later on, &lt;strong&gt;UTF-32&lt;/strong&gt; was introduced.&amp;nbsp; UTF-32 is effectively identical to UCS-4 -- its sole difference is in the specification.&amp;nbsp; UTF-32 claims that it should not be used to represent characters above 0x10FFFF.&amp;nbsp; (Nothing is stopping it, though -- it's still just an unsigned long int.)&amp;nbsp; I mention it mostly for completeness, and so you'll recognize the name.&amp;nbsp; And don't forget that all of these encodings have endianness to worry about, so we've really covered 12 encodings for Unicode so far: UCS-4(BE/LE/host), UCS-2(BE/LE/host), UTF-16(BE/LE/host), and UTF-32(BE/LE/host).&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(Windows-specific digression here: WCHAR is typedef'd inside winnt.h to wchar_t, whose size is determined by the compiler you're using.&amp;nbsp; On Visual C++ .NET 2003, wchar_t is currently an 'unsigned short' and uses UCS-2LE; on gcc, unless specified otherwise it's an 'int'.&amp;nbsp; The encoding for gcc varies by version and by compiler setting, though, and gcc 3.3 in particular &lt;/font&gt;&lt;/em&gt;&lt;a href="http://lists.suse.com/archive/m17n/2004-Aug/0039.html"&gt;&lt;em&gt;&lt;font color="#000000"&gt;is horribly buggy and can corrupt your string literals&lt;/font&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;&lt;font color="#000000"&gt;.)&lt;/font&gt;&lt;/em&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Everything's fine and dandy thus far, except for one catch -- we can't send strings in these encodings to old webservers that use C functions like strcmp(), strlen(), strcpy(), etc. -- or any other function that relies on the presence of a null byte to denote where the string ends.&amp;nbsp; Why?&amp;nbsp; Because, for any string that uses only the Latin alphabet (i.e. 
one that you could write in plain old ASCII), every character in any of the above encodings will contain at least one 00 byte.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Because of this, there's one more standard Unicode encoding, and that's the notorious &lt;strong&gt;UTF-8&lt;/strong&gt;.&amp;nbsp; UTF-8 can be thought of as&amp;nbsp;a relative of&amp;nbsp;Huffman encoding -- it guarantees that all codepoints less than or equal to 0x7F are encoded as single unsigned bytes (i.e. direct 7-bit ASCII correspondence), and that all codepoints greater than 0x7F are encoded as a multi-byte sequence.&amp;nbsp; All bytes in a multi-byte sequence have their MSB set, and the first byte of such a sequence indicates, by its number of leading 1 bits, how many bytes the sequence contains.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, in exchange for being a messy variable-length format that's hard to work with,&amp;nbsp;UTF-8 can encode the entire set of Unicode codepoints &lt;strong&gt;and&lt;/strong&gt; guarantees that any UTF-8 string will be correctly handled by a&amp;nbsp;function expecting a null-terminated string.&amp;nbsp; Also, since UTF-8 is specifically meant to be handled a byte at a time, it avoids the entire messy problem of endianness.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(Historical note: UTF stands for UCS Transformation Format.&amp;nbsp; The infamous &lt;/font&gt;&lt;/em&gt;&lt;a href="http://en.wikipedia.org/wiki/Ken_Thompson"&gt;&lt;em&gt;&lt;font color="#000000"&gt;Ken Thompson&lt;/font&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;&lt;font color="#000000"&gt; created UTF-8 in 1992 on a napkin in a New Jersey diner, for use in Plan9, and reported their success with it to the 1993 USENIX conference.&amp;nbsp; Unicode and ISO both formally standardized it in 2001, although Unicode adds the extra clause that it should not be used to express codepoints above 0x10FFFF, just 
like UTF-32.)&lt;/font&gt;&lt;/em&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Now, I mentioned earlier that the other formats had to choose endianness: explicitly specify, or shrug and assume that it's the same as the host.&amp;nbsp; There is a common solution to this -- and that's to use a marker to determine the endianness.&amp;nbsp; This marker is known as the &lt;strong&gt;Byte Order Mark&lt;/strong&gt;, or BOM for short, and is Unicode code point 0xFEFF ("ZERO-WIDTH NO-BREAK SPACE" -- a null symbol, effectively).&amp;nbsp; If you encounter the character 0xFFFE while decoding, you know that the file you're reading was written on a machine of opposite endianness, and you should flip bytes.&amp;nbsp; (Unicode code point 0xFFFE has been specifically designated as an invalid character for this purpose.)&amp;nbsp; Keep in mind that you may encounter multiple BOMs in a string and may have to switch back and forth!&amp;nbsp; (This could happen if, for example, you used UNIX cat to concatenate two text files, and one was UCS-2BE and one was UCS-2LE.)&amp;nbsp; UTF-8, being specifically designed to be parsed on a byte-by-byte basis, does not need a BOM.&amp;nbsp; &lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;There's a few other standard Unicode encodings, but the big 13 above are the only ones that you see regularly.&amp;nbsp; I'll mention the other ones briefly, mainly because they show up in some old internet protocols:&lt;/font&gt;&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;font color="#000000"&gt;&lt;strong&gt;UTF-7&lt;/strong&gt; was an early attempt to translate Unicode points to 7-bit-ASCII text for use in MIME-encoded emails.&amp;nbsp; It specified that all 7-bit characters should be transmitted as single bytes, like UTF-8.&amp;nbsp; However, rather than use the eighth bit to denote a multibyte character, it overloaded the + sign as a sentinel.&amp;nbsp; "+-" denoted that a normal plus 
should appear; for any other following character, the following three bytes were the UCS-2 encoding, re-encoded in Base64.&amp;nbsp; It could not transmit anything outside the BMP.&amp;nbsp; UTF-7 is used, slightly modified, in parts of the IMAP mail protocol; for POP3 and SMTP, however, it has mostly been bypassed in favor of UTF-8.&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;font color="#000000"&gt;&lt;strong&gt;SCSU&lt;/strong&gt; (Standard Compression Scheme for Unicode) was an early attempt at a variable-length encoding like UTF-8 proposed by Reuters News, that added light compression as well.&amp;nbsp; However, small compression schemes like this are painfully inefficient compared to larger schemes like LZW or BWT, and they make text very difficult to handle internally.&amp;nbsp; SCSU is not used in any major protocol or file format that I know of today.&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;font color="#000000"&gt;&lt;strong&gt;Punycode&lt;/strong&gt; (RFC 3492) is similar to UTF-7 and uses the string "xn--" as a sentinel.&amp;nbsp; Punycode is only used in one situation -- the IDNA (Internationalizing Domain Names in Applications) protocol used to handle use of Unicode domain names in DNS.&amp;nbsp; An IDNA-capable web browser will capture a string from the address bar, translate it to ASCII text using the Punycode system, and send the converted string as a standard gethostbyname() DNS request, and the DNS server translates it back to Unicode upon receipt before doing the lookup.&amp;nbsp; If you're making a better bind, or fixing Firefox, this will be of interest to you; I do not expect to encounter files or other strings encoded in this system.&lt;br /&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, that's the set of encodings that directly encode Unicode code points.&amp;nbsp; There's only one catch -- there's also encodings out there that don't directly map 
to&amp;nbsp;Unicode codepoints!&amp;nbsp; In this case, we have to do a two-part mapping to get to Unicode -- first, decoding to a symbol number in the source that matches that encoding's symbol set, and then converting that to a Unicode codepoint!&amp;nbsp; &lt;strong&gt;Yuck.&lt;/strong&gt;&amp;nbsp; And we're going to encounter a lot of these too, because these have names we recognize like ASCII Code Page 437 and ISO 8859-1 and Windows DBCS and GB and Big5 -- all those legacy formats, some of which are also variable-length like UTF-8.&amp;nbsp; We've got our work cut out for us!&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Of course, I never mentioned what it is we actually were working on...&amp;nbsp; I'm out to create a string class for C++ that doesn't suck.&amp;nbsp; Now, there's three ways that C++'s &lt;font face="Courier New"&gt;std::string&lt;/font&gt; sucks.&amp;nbsp; The first sin is that you have to use a backing store -- you can't&amp;nbsp;tell it to use a string literal like &lt;font face="Courier New"&gt;L"Sch&amp;ouml;ne Gr&amp;uuml;&amp;szlig;e"&lt;/font&gt; as a source, since &lt;font face="Courier New"&gt;allocator&amp;lt;char&amp;gt;&lt;/font&gt; requires that the target be modifiable.&amp;nbsp; All contents have to be copied, because contents are always mutable.&amp;nbsp; The second sin is that&amp;nbsp;it assumes that the compiler and author know what they're doing when they manipulate its contents.&amp;nbsp; To C++, a &lt;font face="Courier New"&gt;basic_string&amp;lt;T&amp;gt;&lt;/font&gt; is really just a pretty interface on a &lt;font face="Courier New"&gt;vector&amp;lt;T&amp;gt;&lt;/font&gt;.&amp;nbsp; The third sin may vary to some people; for me, it exists in the forms of some stupid promises that 14882 (the ISO C++ standard) wasn't willing to make, most notably that the &lt;font face="Courier New"&gt;c_str()&lt;/font&gt; method is capable of invalidating references, pointers, and 
iterators.&amp;nbsp; This was mostly done to accommodate copy on write and other implementation details, but it makes writing conformant string-handling code infuriatingly difficult if you ever have to interface &lt;font face="Courier New"&gt;std::string&lt;/font&gt; with C functions that need C strings&amp;nbsp;(such as, say, the Win32 API!).&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm opting to only fix the first two sins&amp;nbsp;for now -- backing store handling, and encoding awareness.&amp;nbsp; The third sin, you can handle according to your level of offendedness.&amp;nbsp; &lt;/font&gt;&lt;font color="#000000"&gt;Tomorrow: Policy-based design using templates, minimizing conversions, and why 14882's &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; makes it impossible to make a strictly conformant&lt;font face="Courier New"&gt; std::string&lt;/font&gt; that supports variable-length encodings.&lt;/font&gt;&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Today's facts/conclusions:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;We have to store the code points in a string somehow.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;A lot of pain comes from wanting to retain reverse compatibility with old character sets and old encodings.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Large fixed-width formats like UCS-2 and UCS-4 make string manipulation very easy since they allow random access to individual code points, but are not compatible with old C functions that expect null-terminated strings.&amp;nbsp; However, keep an eye out for endianness.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Variable-width formats like UTF-8 are compatible with null-termination functions, but have to be parsed sequentially.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=244865" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category></item><item><title>Encodings In 
Strings Are Evil Things (Part 1)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/10/18/244284.aspx</link><pubDate>Tue, 19 Oct 2004 02:54:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:244284</guid><dc:creator>ryanmy</dc:creator><slash:comments>4</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/244284.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=244284</wfw:commentRss><description>&lt;p&gt;&lt;i&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; What is a string?&lt;/i&gt;&amp;nbsp;&amp;nbsp; &lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; About six months ago at the Game Developers Conference in San Jose, I sat in on a talk about performance tuning in Xbox games.&amp;nbsp; The presenter had a slide that read:&amp;nbsp; "Programmers love strings.&amp;nbsp; &lt;b&gt;Love hurts.&lt;/b&gt;"&amp;nbsp; This was shown while he described a game which was using a string identifier for every object in the game world and hashing on them, and was incurring a huge performance hit from thousands of strcmp()s each frame.&amp;nbsp; I nodded -- but my mind was thinking, "The same would be true if they had used GUIDs, or any other large identifier.&amp;nbsp; After all, strcmp is just a bounded memcmp."&amp;nbsp; So, what actually IS a string?&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I think it's safe to say that a string is &lt;i&gt;something&lt;/i&gt; that a human interprets and derives meaning from.&amp;nbsp; In this case, that something is almost always an ordered sequence of symbols (note: the symbols may not be co-linear!) 
that conveys meaning.&amp;nbsp; Now, let's assume from here on that a string is an ordered sequence of 2D glyphs.&amp;nbsp; A glyph is three pieces of data: a symbol, the dimensions to render that symbol at, and the location where it should be rendered.&amp;nbsp; This is still describing a very abstract, human-centric thing.&amp;nbsp; To express this in the programming world, we have to identify these glyphs somehow.&amp;nbsp; A vector drawing or bitmap approximation of a glyph would suffice.&amp;nbsp; But we don't want to require that people deal with these just to print "Hello World" to the screen.&amp;nbsp; So, let's put the glyphs somewhere in the system, and assign indices to them.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; And thus, we have &lt;a href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819"&gt;ISO 10646&lt;/a&gt;, known as the Universal Character Set or UCS.&amp;nbsp; UCS is a simple mapping of decimal indices (called code points) and formal names, to symbols.&amp;nbsp; For example, in the UCS, code point 0x41 is "Latin capital letter A" and corresponds to, of course, the letter A.&amp;nbsp; The goal of UCS is to be a superset of all character sets.&amp;nbsp; So, given a set of characters such as 7-bit ASCII, or ISO 8859-1, or EBCDIC, we can find some mapping (preferably 1:1, but we're not always so lucky) to UCS.&amp;nbsp; So, our definition of glyph now converts to a tuple containing a UCS code point, a size, and a distance from the last render point.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; We now find ourselves asking a few more questions about glyphs.&amp;nbsp; Size is fairly easy to measure -- just a box that bounds the symbol.&amp;nbsp; However, distance is difficult, because good typesetting requires that the distance between characters be measured from any number of points inside that box.&amp;nbsp; For simple Roman alphabets, we might want to measure from the baseline; accents might have 
to go relative to baseline + ascent; some characters may have an advance width that is greater than their bounding box; and this doesn't even begin to address script-based languages like Arabic!&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; On top of this, UCS allows two ways to represent accented characters.&amp;nbsp; Most accented characters have a dedicated UCS code point; however, an accented character can also be represented as the code point for the un-accented character, followed by code points for one or more accents as stand-alone symbols.&amp;nbsp; UCS calls symbols which are meant to be applied to the previous character "combining characters," and refers to symbols containing preaccented letters as "precomposed characters."&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; For example, the symbol &lt;b&gt;&amp;Auml;&lt;/b&gt; can be represented by either the precomposed UCS code point 0xC4 ("Latin capital letter A with diaeresis") or by the code point 0x41 ("Latin capital letter A") immediately followed by code point 0x308 ("combining diaeresis").&amp;nbsp; And don't forget that there needs to be size and direction between the diaeresis and the letter, and that there can be more than one combining character following a single symbol, including some symbols which can vary their positioning depending on their combination with other combiners!&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; The UCS took the easy way (or, as some would argue, the &lt;b&gt;sanest&lt;/b&gt; way) out of dealing with all the positioning problems of glyphs -- it simply refused to acknowledge their existence.&amp;nbsp; The UCS is simply a symbol table that includes combining characters, nothing more.&amp;nbsp;&amp;nbsp; The UCS also doesn't deal with any of the properties that we assign to specific symbols; for example, it doesn't recognize case.&amp;nbsp; It cannot say that &lt;b&gt;&amp;Auml;&lt;/b&gt; and &lt;b&gt;A&lt;/b&gt; are upper-case and 
&lt;b&gt;a&lt;/b&gt; is lower-case, or that &lt;b&gt;&amp;Auml;&lt;/b&gt; and &lt;b&gt;A&lt;/b&gt; have the same root letter and differ only by accent, or that &lt;b&gt;a&lt;/b&gt; is the same root letter as those two -- they're simply different symbols with no relation.&amp;nbsp; As a result, the UCS isn't very well known, despite the fact that it has existed for over a decade.&amp;nbsp; This is where &lt;a href="http://www.unicode.org/"&gt;Unicode&lt;/a&gt; comes in.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Unicode originally started out in the late 1980s as an ad-hoc standard agreed on by a group of companies making multi-lingual software products.&amp;nbsp; Initially, Unicode was developed separately from UCS; however, starting in 1991 Unicode merged its code table with UCS, and &lt;a href="http://www.unicode.org/standard/versions/"&gt;all versions of Unicode&lt;/a&gt; from 1.1 (June 1993) forward match the UCS.&amp;nbsp; Unicode does not define glyph data, or the vectors that are used to render a symbol.&amp;nbsp; However, it does provide lots of normative semantic information that UCS code points lack.&amp;nbsp; For example, a Unicode code point not only contains the UCS symbol, but also data such as the symbol's case (upper/lower/title), category (letter, mark/accent, digit, punctuation, separator, etc.), and numeric interpretations of digit symbols (e.g. 
the symbol 4 represents four things).&amp;nbsp; Alongside this, we have the Unicode Technical Standards, which define culturally appropriate comparison, sorting, and searching algorithms, character boundaries in complex scripts, how to handle newlines (CR/LF/CRLF/NEL), and other such handy information.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Let us assume, for now, that given an ordered sequence of Unicode code points, the OS can convert them to glyphs and render them in a way that's appropriate.&amp;nbsp; Of course, this is a huge and almost entirely false assumption -- and I'll be coming back to it later.&amp;nbsp; But it's also a very convenient assumption, because it allows us to reduce the definition of a string down to something that's easy to tackle: &lt;i&gt;a finite ordered sequence of Unicode code points&lt;/i&gt;.&amp;nbsp; Of course, in order to store a number on a computer, it has to be converted to binary.&amp;nbsp; However, not all binary representations are the same, and not everyone thinks it's worth using 31 bits of information for every character.&amp;nbsp; Tomorrow's episode: encoding systems, and the major character sets that love them.&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Today's facts/conclusions:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Strings should be thought of as human-centric, rather than tied to a video card's interpretation of regularly-sized bits.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Strings are composed of glyphs.&amp;nbsp; A glyph consists of a symbol, plus typesetting information.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;There's already a standard table called ISO 10646, or UCS, that maps code points (numbers) to symbols.&amp;nbsp; Unicode adds semantics like case, comparison rules, and sorting algorithms to UCS.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Typesetting information is really tricky to store portably.&amp;nbsp; UCS and Unicode ignore its existence.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;If the OS can be relied on to handle 
glyphing, we can store a string as an ordered sequence of Unicode code points.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Oh, and since this is the first post here that's visible to the public -- I'm Ryan Myers, a geek-of-all-trades currently on the Windows Client Performance team.&amp;nbsp; I intend to use this blog as an ongoing set of essays about various facets of programming I've encountered.&amp;nbsp; (I use essays as the textual equivalent of sitting in front of a whiteboard reasoning things out, rather than a polished report of what I wish I had done the first time.&amp;nbsp; So, conclusions may change from post to post, and I welcome all comments and counterpoints.)&amp;nbsp; So, pardon the mess and enjoy the show.&lt;br /&gt;&lt;/p&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category></item></channel></rss>