<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Funny, It Worked Last Time : C++</title><link>http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx</link><description>Tags: C++</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Encodings in Strings are Evil Things (Part 8)</title><link>http://blogs.msdn.com/ryanmy/archive/2005/01/17/354864.aspx</link><pubDate>Tue, 18 Jan 2005 03:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:354864</guid><dc:creator>ryanmy</dc:creator><slash:comments>8</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/354864.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=354864</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;As more Unicode encodings are being finished, I find myself wanting to actually start using rmstring in real situations.&amp;nbsp; However, most of my "real situations" involve legacy encodings.&amp;nbsp; So, I need to start cracking on transcoding.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The first concern is allowing adapters for arbitrary transcodings.&amp;nbsp; A tricky problem that's related to transcoding is collation (aka sorting) -- most people aren't aware that sorting strings is often a locale-dependent issue.&amp;nbsp; This is a localization problem.&amp;nbsp; Just to make sure that terminology is clear, &lt;strong&gt;internationalization&lt;/strong&gt; (often abbreviated to &lt;strong&gt;i18n&lt;/strong&gt;) is the act of coding a program such that it is entirely independent of location and language; the most classic example of i18n is moving all string literals into a binary resource 
within an EXE, so that the strings may be changed without modifying the program's logic.&amp;nbsp;&amp;nbsp;This is almost always paired&amp;nbsp;with &lt;strong&gt;localization&lt;/strong&gt;&amp;nbsp;(sometimes abbreviated to &lt;strong&gt;l10n&lt;/strong&gt;), which is the act of tailoring an already-internationalized program for a specific language/locale.&amp;nbsp; Internationalization may be done by any programmer; localization requires translators.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In the case of sorting,&amp;nbsp;a binary sort is often not enough.&amp;nbsp; Context is everything!&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Where do accented characters sort -- the same as their base characters, or after?&amp;nbsp; &lt;em&gt;(For French speakers, accented As come after Z.)&lt;/em&gt;&lt;/li&gt; &lt;li&gt;What are you sorting for?&amp;nbsp; &lt;em&gt;(German has a special sorting order for names in phone books!)&lt;/em&gt;&lt;/li&gt; &lt;li&gt;What about digraphs or ligatures such as ch or fi?&amp;nbsp; &lt;em&gt;(Spanish speakers, for example, will sort character sequences starting in "ch" between "c" and "d", even though they recognize "ch" as two separate characters.)&lt;/em&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;For this reason,&amp;nbsp;developers using rmstring on Win32 platforms will almost certainly want to use a sorting predicate based on Win32's &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/resources/strings/stringreference/stringfunctions/comparestring.asp"&gt;CompareString&lt;/a&gt; or &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_5s2v.asp"&gt;LCMapString&lt;/a&gt; APIs.&amp;nbsp; For example:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;rmstring&amp;lt;ucs4, bytevector&amp;gt; getfirst( std::vector&amp;lt;rmstring&amp;lt;utf8, bytevector&amp;gt; &amp;gt;&amp;nbsp;&amp;amp; lines ) {&lt;br 
/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; std::sort( lines.begin(), lines.end(), win32_collator( LOCALE_USER_DEFAULT ) );&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; return (*lines.begin()).transcode&amp;lt;ucs4, bytevector&amp;gt;();&lt;br /&gt;}&lt;/font&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;This example is a bit contrived -- a real example would template the container and output encoding, and&amp;nbsp;make the LCID a&amp;nbsp;parameter with a default argument&amp;nbsp;-- but you get the point.&amp;nbsp; &lt;font face="Courier New"&gt;win32_collator&lt;/font&gt;, in this case, is a custom predicate for &lt;font face="Courier New"&gt;std::sort&lt;/font&gt; (see &lt;font face="Courier New"&gt;&amp;lt;algorithm&amp;gt;&lt;/font&gt;) that converts both strings to UTF-16 and then invokes &lt;strong&gt;CompareStringW&lt;/strong&gt; on them, throwing a &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; exception if there's a codepoint above 0x10FFFF that UTF-16 can't contain.&amp;nbsp; Of course, this will hardly be my primary solution!&amp;nbsp; More on that later.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Anyways, similar issues arise for transcoding.&amp;nbsp; (Not to mention the fact that &lt;font face="Courier New"&gt;win32_collator&lt;/font&gt; is, in fact, dependent on the ability to transcode, since the Win32 Unicode APIs expect UTF-16 strings.)&amp;nbsp; So, we must include pluggable transcoders.&amp;nbsp; So, we change our prototypes from Part 7 to include one more template argument, the transcoding tool:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class Engine, class SrcEnc, class SrcStore, class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;void transcode( const rmstring&amp;lt;SrcEnc, SrcStore&amp;gt; &amp;amp; src, rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; &amp;amp; tgt, Engine e = Engine()&amp;nbsp;);&lt;br /&gt;&lt;br /&gt;template &amp;lt;class Engine, class TgtEnc, class TgtStore&amp;gt;&lt;br 
/&gt;rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; rmstring&amp;lt;SrcEnc, SrcStore&amp;gt;::transcode( Engine e = Engine(), TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;These functions now delegate transcoding to the Engine object, whatever that may be.&amp;nbsp; In the Win32 vein, we could use &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp"&gt;MultiByteToWideChar&lt;/a&gt; and &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp"&gt;WideCharToMultiByte&lt;/a&gt;&amp;nbsp;-- but that's too easy, not to mention very difficult to wrap.&amp;nbsp; I'd really like to do something that's solely C++ and entirely based in the &lt;a href="http://www.unicode.org/ucd/"&gt;Unicode Character Database&lt;/a&gt;'s mappings directory.&amp;nbsp; There are a few dilemmas to be solved for that.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of &lt;strong&gt;compatibility characters&lt;/strong&gt;.&amp;nbsp; Compatibility characters are &lt;strong&gt;canonically equivalent&lt;/strong&gt; to a sequence of one or more other Unicode characters; they are usually placed so that you have a single codepoint that's equivalent to a character in some older standard.&amp;nbsp; For example, ISO8859-2 defines &lt;strong&gt;0xA5&lt;/strong&gt; to be equivalent to a capital letter L with a caron accent (&amp;#317;).&amp;nbsp; The "simple" equivalent of this in Unicode is a capital letter L (&lt;strong&gt;U+004C&lt;/strong&gt;) followed by a combining caron (&lt;strong&gt;U+030C&lt;/strong&gt;); however, Unicode also defines a single pre-combined character, &lt;strong&gt;U+013D&lt;/strong&gt;, that is directly equivalent to those two.&amp;nbsp; Almost all legacy encodings can therefore have a simple 1:1 
function to go up to Unicode, in the form of a lookup table.&amp;nbsp; (Unfortunately, not all legacy encodings have a complete set of compatibility characters, so a LUT will not work for everything.)&amp;nbsp; Going back from Unicode to legacy is more problematic, however: we now have two equivalents to a given legacy character.&amp;nbsp; The most direct solution, it seems, is to generate a finite automaton.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I've been&amp;nbsp;working on the DFA for the last few days.&amp;nbsp; My main concern has been memory efficiency, and I can now get a complete set of typical round-trip encoding data to fit in under 8K per encoding, which fits nicely in cache.&amp;nbsp; Obviously, certain ones will be smaller, and certain ones will be larger (in particular KOI8 and other encodings with very large symbol sets).&amp;nbsp; The DFA solution is very clean though; the legacy-to-Unicode DFA takes in bytes and outputs 32-bit unsigned ints containing codepoints which are then re-encoded, and the Unicode-to-legacy DFA takes in codepoints and outputs bytes.&amp;nbsp; Legacy-to-legacy transcodes use UCS-4 as an intermediary.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm now working on a program that reads in a file from &lt;a href="http://www.unicode.org/Public/MAPPINGS/"&gt;MAPPINGS&lt;/a&gt; and UnicodeData.txt from the &lt;a href="http://www.unicode.org/ucd/"&gt;Unicode Character Database&lt;/a&gt; and outputs the DFA in C++ format.&amp;nbsp; I'll post more when that's finished.&amp;nbsp; (I'm writing this entry pre-emptively, as this work-week looks like an absolute killer.)&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=354864" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings 
are Evil Things (Part 7)</title><link>http://blogs.msdn.com/ryanmy/archive/2005/01/10/350325.aspx</link><pubDate>Tue, 11 Jan 2005 03:11:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:350325</guid><dc:creator>ryanmy</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/350325.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=350325</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Eugh.&amp;nbsp; Due to a three-part punch of piling-up work, time with family over the holidays, and being thoroughly sick, I haven't had much time to work on &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; -- which means, of course, that this hasn't updated.&amp;nbsp; I haven't given up on it though!&amp;nbsp; (I'm not dead!&amp;nbsp; I don't want to go on the cart...)&amp;nbsp; If anything, my desire to finish&amp;nbsp;it has increased, since I've been working on a set of internal utilities which parse text files to take instructions, and one keeps on thinking, "This would be so much easier if I just finished &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;..."&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, on to business.&amp;nbsp; First off, the all-important &lt;font face="Courier New"&gt;fixed_width_encoding&lt;/font&gt; class is done.&amp;nbsp; This critical class is the foundation of all encodings with a fixed number of bits per code point; it's templated on an intrinsic type that the implementor knows is 1/2/4 bytes.&amp;nbsp; The hardest part of an encoding, I've found, is writing the iterators; there are a huge number of methods that one must implement in order to make a 14882-compliant iterator.&amp;nbsp; The internals are mostly simple pointer arithmetic; just a lot to be tested.&amp;nbsp; (Yes, I have to write a test harness for this, if I want it to be approved for on-campus use :P)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;One 
annoyance that I've found is pointer type conversions; imagine that you've allocated a byte array for &lt;font face="Courier New"&gt;recv()&lt;/font&gt;ing something in from a TCP socket.&amp;nbsp; If we know that said content is UCS-4, the natural urge is to cast it to an &lt;font face="Courier New"&gt;unsigned long *&lt;/font&gt; to iterate over... except that you can't.&amp;nbsp; Or, at least, you shouldn't.&amp;nbsp; If that byte array isn't suitably aligned for 32-bit accesses, code will either run slowly (on x86 and AMD64) or &lt;strong&gt;crash&lt;/strong&gt; (on IA-64, unless &lt;font face="Courier New"&gt;&lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/seterrormode.asp"&gt;SetErrorMode()&lt;/a&gt;&lt;/font&gt; is called to force OS alignment fixups, in which case it will run extremely slowly).&amp;nbsp; Of course, people do this all the time; you just can't guarantee that doing so is safe within the confines of strictly conformant code.&amp;nbsp; There is also no way for strictly conformant code to check if a given pointer is aligned, since&amp;nbsp;there is no operator to retrieve a type's alignment requirements.&amp;nbsp; The best you can do is assume that no type will have an alignment requirement greater than its size, and &lt;font face="Courier New"&gt;assert(0 == reinterpret_cast&amp;lt;size_t&amp;gt;(ptr) % sizeof(type))&lt;/font&gt;, which is thoroughly disgusting AND assumes certain things about the host's&amp;nbsp;virtual memory system&amp;nbsp;that may not be true.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Thus, I've opted for the simplest solution: a huge comment in the code that says &lt;em&gt;"These functions assume that the backing store's data() pointer is suitably aligned for Stride-sized accesses and that size() is a multiple of Stride's size.&amp;nbsp; Violating either of these assumptions will result in your program's untimely death."&lt;/em&gt;&amp;nbsp; Sometime later, I might come up with 
a helper function &lt;font face="Courier New"&gt;alignment_assert&amp;lt;T&amp;gt;(ptr)&lt;/font&gt; that takes advantage of compiler-specific extensions such as MSVC's &lt;font face="Courier New"&gt;__alignof&lt;/font&gt; if available.&amp;nbsp; Note that this also could potentially result in a Unicode stream that does not make much sense, such as combining characters that don't properly match base characters.&amp;nbsp; The Unicode standard notes that such a stream is not ill-formed, although it is not quite renderer-friendly; so, I'll support it.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I've also had occasion to rethink my plans for &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;.&amp;nbsp; Initially, I planned to use &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; in a way similar to the Boost &lt;font face="Courier New"&gt;lexical_cast&lt;/font&gt; pseudo-operator.&amp;nbsp; However, it disturbed me that doing so would mean that every call to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; would create a temporary in which to store the result, which would then make its way to final storage either by &lt;font face="Courier New"&gt;operator=&lt;/font&gt; or copy constructor.&amp;nbsp; I ended up realizing that a good 70% of the calls to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; would be writing the encode into a string that already existed.&amp;nbsp; So, instead, we now have the &lt;font face="Courier New"&gt;transcode&lt;/font&gt; function, which comes in both non-member and member flavors:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class SrcEnc, class SrcStore, class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;void transcode( const rmstring&amp;lt;SrcEnc, SrcStore&amp;gt; &amp;amp; src, rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; &amp;amp; tgt );&lt;br /&gt;&lt;br /&gt;template &amp;lt;class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; 
rmstring&amp;lt;SrcEnc, SrcStore&amp;gt;::transcode( TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;With the above, the originally envisioned &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; is now just syntactic sugar for a call to the source string's member &lt;font face="Courier New"&gt;transcode()&lt;/font&gt; function.&amp;nbsp; It also means that the code to do transcodes is now centralized within &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;.&amp;nbsp; Handy!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Oh, and since someone asked: I'm currently developing and testing this on&amp;nbsp;Visual C++&amp;nbsp;.NET 2003 and &lt;a href="http://www.nuwen.net/gcc.html#mingw"&gt;Stephan Lavavej's distribution of MinGW&lt;/a&gt;; I'll likely run it against Comeau as well to make sure it's kosher before I release the source to the public.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My goals for the next article are to have a few non-Unicode encodings done, so I can start testing out transcoding and flesh out the different encoding mechanisms.&amp;nbsp; My main dilemma is designing the symbol tables; I noted in Part 4 that I wanted to have the ability to pass different resolving engines to the transcoder such as a perfect lossless transcription, visual parity, error codes, etc.&amp;nbsp; Visual parity will be the hardest to do; in fact, I will likely not do it right away.&amp;nbsp; (Namely, because the Unicode tables do not contain such parity information.)&amp;nbsp; Another concern has been memory consumption of tables for encodings; I'll be tackling that shortly.&lt;/p&gt; &lt;p&gt;(Since this was mostly a "what happened while I was gone" article, no point summary.)&lt;/p&gt; &lt;p&gt;(Update 2pm: &lt;A href="http://blogs.msdn.com/michkap/"&gt;Michael Kaplan&lt;/a&gt; nudged me a bit that I had broken my previous insistence on "code point" versus "character" terminology 
-- that's what I get for stepping away from the blog for two weeks!&amp;nbsp; Terminology corrected; anyone who doesn't know the difference between code points and characters needs to go back and read this blog from the beginning, or at least Part 5.)&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=350325" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 6)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/11/04/252439.aspx</link><pubDate>Thu, 04 Nov 2004 18:19:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:252439</guid><dc:creator>ryanmy</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/252439.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=252439</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;First, I apologize for not updating recently -- at work, my dev machine's power supply died, and took my hard drive with it.&amp;nbsp; Luckily, I had everything backed up; however,&amp;nbsp;I had to copy everything over to, and work on,&amp;nbsp;a single-monitor Longhorn dogfood box with no major apps installed.&amp;nbsp; This&amp;nbsp;went on for&amp;nbsp;a week and a half while I waited for Dell to&amp;nbsp;slog through&amp;nbsp;the warranty process for new parts and have them installed by a Dell-authorized tech (in order to keep the warranty going)&amp;nbsp;and this put me behind schedule for several deadlines.&amp;nbsp; So, now that my dev machine has a new PSU and HDD I've been frantically working to get caught up on things, and this has left little time for the blog.&amp;nbsp; In about two weeks these deadlines will be behind me, and I can start posting with regularity 
again.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Also, at this point I'm now primarily doing implementation of previously discussed ideas, so this series of posts will temporarily serve two purposes: discussion of issues, and journal of coding concerns about implementing this in C++.&amp;nbsp; This post covers one of the C++ concerns: how do you define &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; for a string that's in a variable-width encoding such as UTF-8?&amp;nbsp; One of the basic assumptions in &lt;font face="Courier New"&gt;std::string&lt;/font&gt; that I intend to honor is that &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; returns a reference to the actual data, not a copy.&amp;nbsp; For fixed-width encodings such as ASCII, UCS2, or UCS4, this is not a problem; I simply return an &lt;font face="Courier New"&gt;unsigned char&lt;/font&gt;/&lt;font face="Courier New"&gt;short&lt;/font&gt;/&lt;font face="Courier New"&gt;long&lt;/font&gt;.&amp;nbsp; However, for variable-width encodings, I need to return a range of bytes, and presumably a size as well.&amp;nbsp; I could do this with covariant returns and unions, but this is horribly ugly -- and I'd need a lot of different returns, since UTF-8 alone can have up to six bytes in a single code point.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My solution is to return a proxy object, &lt;font face="Courier New"&gt;MultiByteChar&lt;/font&gt;.&amp;nbsp; When I initially decided on this, one of my coworkers pointed out that I would run into the same problem as &lt;font face="Courier New"&gt;vector&amp;lt;bool&amp;gt;&lt;/font&gt;.&amp;nbsp; The Vector Wrapper Problem, as&amp;nbsp;some refer to it,&amp;nbsp;deserves a bit of discussion.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The C++ standard defines that all implementations of the STL container &lt;font face="Courier New"&gt;std::vector&amp;lt;T&amp;gt;&lt;/font&gt; should include a specialization &lt;font face="Courier 
New"&gt;vector&amp;lt;bool&amp;gt;&lt;/font&gt; that stores the bits in packed form.&amp;nbsp; (Contrast&amp;nbsp;with an array of bools -- bools can be stored in memory as if they were any of several integral types, depending on situation and the intelligence of the compiler).&amp;nbsp; In this case, if &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; returns a bool, you cannot write expressions such as &lt;font face="Courier New"&gt;a[3] = true;&lt;/font&gt; -- there's no bool back there!&amp;nbsp; You need to return a proxy object containing a pointer/reference to the source container, with &lt;font face="Courier New"&gt;operator=&lt;/font&gt; overloaded, in order to support assignment in this manner.&amp;nbsp; However, this breaks with the definition of &lt;font face="Courier New"&gt;std::vector&amp;lt;T&amp;gt;&lt;/font&gt; -- the standard simultaneously claims that any &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; on a &lt;font face="Courier New"&gt;vector&lt;/font&gt; must return some type that is convertible to &lt;font face="Courier New"&gt;T &amp;amp;&lt;/font&gt;.&amp;nbsp; This bit of doublespeak results in the inability to reliably write certain types of wrappers around&amp;nbsp;vector that can accept bool.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My belief is that this was an oversight of the standardization committee.&amp;nbsp; They took the first step towards solving this by defining &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; (and the iterator's dereference operators) as returning a member typedef, &lt;font face="Courier New"&gt;ref_type&lt;/font&gt;; however, they stopped short of a goal, by saying that &lt;font face="Courier New"&gt;ref_type&lt;/font&gt; had to be defined from the allocator for the vector.&amp;nbsp; A better solution would be to define a set of semantics and overloaded operators that suitably encapsulated the intent, purpose, and behavior of references, and defining this as a 
&lt;em&gt;Reference&lt;/em&gt; typeclass.&amp;nbsp; They could then simply require that &lt;font face="Courier New"&gt;ref_type&lt;/font&gt; be some type meeting the &lt;em&gt;Reference(T)&lt;/em&gt; requirements, and all would be well.&amp;nbsp; This is what I intend to do.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The only remaining question is how to handle assignment; at first I planned to make it read-only, but later decided&amp;nbsp;to maintain a reference to the host string and call &lt;font face="Courier New"&gt;replace()&lt;/font&gt; on the&amp;nbsp;encoding/store in response to an &lt;font face="Courier New"&gt;operator=&lt;/font&gt;.&amp;nbsp; This means that a &lt;font face="Courier New"&gt;MultiByteChar&lt;/font&gt; must be templated on the source string in order to be typesafe.&amp;nbsp; This brings up the question of the string's lifetime and the ref's lifetime being separate; however, traditional C++ says that operations such as destruction may invalidate iterators/references/etc. 
anyways.&amp;nbsp; In this case, I think it's reasonable to be the same.&amp;nbsp; (This also means it's okay to use a member reference variable; in almost every case, pointers&amp;nbsp;are preferable, since references cannot be assigned to, only copy-constructed.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;As far as implementation goes, I've completed the &lt;font face="Courier New"&gt;unmanaged_ptr&lt;/font&gt; and &lt;font face="Courier New"&gt;vector_of_bytes&lt;/font&gt; backing stores, and am currently working on the &lt;font face="Courier New"&gt;fixed_width_encoding&lt;/font&gt; parent class that all fixed width encodings such as UCS2 and ASCII derive from.&amp;nbsp; Next post, I will likely talk about the interactions of encoding and backing store classes, and how I've divided responsibilities between them.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;To finish this post off, though, a quick oddity about the use of &lt;font face="Courier New"&gt;widen()&lt;/font&gt; in iostreams.&amp;nbsp; &lt;font face="Courier New"&gt;widen()&lt;/font&gt; is defined on streams as handling certain platform-specific character conversions, such as converting &lt;font face="Courier New"&gt;'\n'&lt;/font&gt; to the appropriate end-of-line character on your platform (LF for Unix and Mac OS X, CRLF for Windows, CR for Classic Mac OS).&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout &amp;lt;&amp;lt; '\n';&lt;/font&gt; outputs &lt;font face="Courier New"&gt;cout.widen('\n')&lt;/font&gt;, as you'd expect.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout &amp;lt;&amp;lt; "\n";&lt;/font&gt; iterates through all characters in the string (as reported&amp;nbsp;by &lt;font face="Courier New"&gt;traits&amp;lt;char&amp;gt;::length()&lt;/font&gt;) and outputs the result of &lt;font face="Courier New"&gt;cout.widen()&lt;/font&gt; on each one, as you'd expect.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout 
&amp;lt;&amp;lt; string("\n");&lt;/font&gt; does NOT widen characters.&amp;nbsp; It directly asks for cout's &lt;font face="Courier New"&gt;streambuf&lt;/font&gt;, and &lt;font face="Courier New"&gt;xsputn()&lt;/font&gt;'s the entire contents of &lt;font face="Courier New"&gt;data()&lt;/font&gt; into it.&amp;nbsp; Do not pass locale, do not collect i18n.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm still thinking over how I want to define my behavior for &lt;font face="Courier New"&gt;operator&amp;lt;&amp;lt;&lt;/font&gt;.&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=252439" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 5)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/10/25/247677.aspx</link><pubDate>Tue, 26 Oct 2004 01:46:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:247677</guid><dc:creator>ryanmy</dc:creator><slash:comments>6</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/247677.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=247677</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/22/246539.aspx"&gt;In our last episode&lt;/a&gt;, we briefly discussed possible behaviors for &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;, and we discussed how the STL's &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; class was structured -- namely, we noted that it had several core functions that were overloaded many times for various types of input.&amp;nbsp; We also noted that we could avoid many of the implementation headaches that result,&amp;nbsp;because of our decision to generalize our backing 
store.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; One of my coworkers pointed out that Herb Sutter had already done an excellent dissection of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; in &lt;a href="http://www.gotw.ca/publications/xc++s.htm"&gt;Exceptional C++ Style&lt;/a&gt; -- and, indeed, the last four chapters of the book are spent analyzing its structure, breaking it down to the core functions, and then implementing many of the functions and overloads as non-member template functions.&amp;nbsp; However, he's not looking to improve &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;'s foundation -- he's merely explaining how reducing the number of methods in &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; makes the code much easier to maintain.&amp;nbsp; (For example, rather than writing an &lt;font face="Courier New"&gt;empty()&lt;/font&gt; member function, he writes a templated empty function that takes an STL&amp;nbsp;string or container, and returns true if the string's begin and end iterators are equal.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Furthermore, he specifically chooses some less-than-ideal but good-enough implementations as a result of making simplicity the primary goal.&amp;nbsp; For example, in his implementation of &lt;font face="Courier New"&gt;resize()&lt;/font&gt;, he implements the shrinking case by using a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; constructor to make a copy of the first N characters of the string, and then calls &lt;font face="Courier New"&gt;swap()&lt;/font&gt;, so he's incurring a memory allocation and deallocation there that is unnecessary.&amp;nbsp; While Sutter's treatment is good, we have a slightly more ambitious goal in mind (making a better class to replace &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, rather than merely improving upon the existing implementation through decomposition), so we're not duplicating effort.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; That said, I agree with his approach of decomposing functions with many overloads such as insert and replace, especially considering that our choice to generalize backing stores eliminates most of my objections to his techniques.&amp;nbsp; So, I've decided to make a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; class after all, in a sense.&amp;nbsp; The &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; class will have a single member function for each major piece of functionality, such as insertion or replacement or concatenation.&amp;nbsp; We'll then make an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; wrapper class that provides overloads in a way to make it roughly equivalent to std::string.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Now, on to a concern I alluded to in the last entry: distinguishing code points and characters.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Up until now, I've specifically used the word "code point" to refer to a single symbol in the Unicode/UCS tables, even though Unicode refers to them as characters.&amp;nbsp; I chose to do this because of the existence of "combining characters", which are symbols associated with the previous "base character" such as accents, enclosing boxes/circles, formatting marks for subscript/superscript, and so on.&amp;nbsp; Unicode contains unaccented base characters, combining characters, and "precomposed characters" that use a single codepoint to represent a pre-accented base character.&amp;nbsp; These are considered always canonically equivalent to some combination of a base character and one or more combining characters.&amp;nbsp; (See &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/18/244284.aspx"&gt;Part 1&lt;/a&gt; for an example of this.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Unicode&amp;nbsp;defines a set of &lt;a href="http://www.unicode.org/unicode/reports/tr15/"&gt;normalization 
forms&lt;/a&gt; that are used to standardize whether to favor combining characters or precomposed characters.&amp;nbsp; However, regardless of whether precomposed characters are favored or not, there are some character sequences which do not have precomposed equivalents and &lt;strong&gt;must &lt;/strong&gt;be represented using&amp;nbsp;combining characters.&amp;nbsp; To make things even nastier, there are some combining characters, most notably double diacritics, that can span multiple base characters.&amp;nbsp; (And I haven't even gotten into Arabic and Hebrew scripts that can change the direction of rendering in mid-string!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Of course, our problem here is that most programmers don't think about accents as being distinct elements to iterate through!&amp;nbsp; When you hit the right arrow in Microsoft Word to skip over an &amp;Agrave;, you don't go first to an A and then to the A's accent -- you move past the whole "character."&amp;nbsp; (Unicode refers to this rough definition of&amp;nbsp;character as a "grapheme cluster," FYI.)&amp;nbsp; If it weren't for double diacritics, we could shrug and say "Well, a character is a base codepoint plus zero or more combining codepoints."&amp;nbsp; But it may not be that easy.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; After taking a walk to think it over, I ended up deciding to err on the side of the Unicode standard -- we'll treat double diacritics as a glyph problem.&amp;nbsp; Namely, a double diacritic is attached to the preceding base codepoint only, and the fact that it extends over the following base codepoint as well is a glyphing concern.&amp;nbsp; (This is also due to the fact that most of the double diacritics can also be represented as a pair of "combining half marks," where half of the glyph is applied to each character as a separate combining character, and the glyphing engine is expected to recognize this and render it as a single glyph.)&amp;nbsp; So, 
we can say that a grapheme cluster is a base character, plus zero or more combining code points, plus any uses of the &lt;em&gt;Combining Grapheme Joiner&lt;/em&gt; codepoint.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; So, do we want &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; to take integer index arguments, iterators, etc.&amp;nbsp;as referring to code points, or to grapheme clusters?&amp;nbsp; For the sake of programmer familiarity, we're going to default to clusters, but we'll allow code points.&amp;nbsp; We will have a single iterator class that takes a bool at construction describing whether &lt;font face="Courier New"&gt;advance()&lt;/font&gt; and related methods should advance by codepoint or by cluster.&amp;nbsp; Our begin, end, and other such iterator methods will be templated, with the template argument defaulting to clusters; thus, you can ask for a codepoint iterator by calling &lt;font face="Courier New"&gt;str.begin&amp;lt;codepoints&amp;gt;()&lt;/font&gt;.&amp;nbsp; This is a bit messy, but workable.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Before, we listed the methods that seemed worthwhile to carry over.&amp;nbsp; However, many of them can be implemented as versions of the others.&amp;nbsp; Tomorrow, we'll actually write a complete header for &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; and start implementing it.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; That, and I think it's about time I go buy a hardcover copy of the Unicode standard, as I have way too many PDFs on my desktop right now.&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=247677" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 
4)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/10/22/246539.aspx</link><pubDate>Fri, 22 Oct 2004 23:42:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:246539</guid><dc:creator>ryanmy</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/246539.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=246539</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/20/245417.aspx"&gt;In our last episode&lt;/a&gt;, we established that we wouldn't be able to make a true &lt;font face="Courier New"&gt;std::string&lt;/font&gt; replacement and still handle variable-width encodings.&amp;nbsp; So, we started with the beginning lines of an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; class.&amp;nbsp; However, this doesn't mean we are going to dispense with &lt;font face="Courier New"&gt;std::string&lt;/font&gt; entirely!&amp;nbsp; But first, a quick answer about my choice of names and an explanation about exceptions.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;A friend of mine asked me yesterday, "Don't you intend to make a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; and then have a typedef'd &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; that hardwires a specific specialization, like ASCII?"&amp;nbsp; I'm considering this -- but if I hardwire anything, it will &lt;em&gt;not &lt;/em&gt;be the encoding type.&amp;nbsp; Trying to abstract away the encoding as hidden information is exactly the thinking that got us into this mess with &lt;font face="Courier New"&gt;std::string&lt;/font&gt;!&amp;nbsp; However, what we use for the backing store might be worth standardizing.&amp;nbsp; After all, using a &lt;font face="Courier New"&gt;vector&amp;lt;byte&amp;gt;&lt;/font&gt; to contain our bitstream is a very flexible choice; it's just not the best-performing one.&amp;nbsp; 
Whenever possible, we should make a library easy to use on the surface, and expose the guts of it to be changed once someone already has the program running and is trying to improve on it (by, for example, using string literals as backing stores and only copying them to heap memory when needed).&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In a dream world, we would typedef a partial specialization.&amp;nbsp; However, we get bitten by one of the most annoying mis-features in C++ -- &lt;a href="http://www.gotw.ca/gotw/079.htm"&gt;you can't template a typedef&lt;/a&gt;.&amp;nbsp; Even the STL is crippled by this, and has to work around it using its &lt;font face="Courier New"&gt;::rebind&lt;/font&gt; member.&amp;nbsp; So, the best we could do is allow someone to &lt;font face="Courier New"&gt;#define rmstring(enc) basic_rmstring&amp;lt;enc, vector_of_bytes&amp;gt;&lt;/font&gt;, and declare a string as &lt;font face="Courier New"&gt;rmstring(iso8859_1) str;&lt;/font&gt;.&amp;nbsp;&amp;nbsp;It'd work, but it makes me cringe.&amp;nbsp; Alternatively, we could use a rebind approach like the STL:&amp;nbsp;&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class Enc&amp;gt; struct rmstring {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font color="#000080"&gt;&lt;font face="Courier New"&gt;typedef&amp;nbsp;basic_rmstring&amp;lt;Enc, vector_of_bytes&amp;gt; type;&lt;br /&gt;};&lt;br /&gt;&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier New"&gt;rmstring&amp;lt;iso8859_1&amp;gt;::type str;&lt;/font&gt;&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Really, both of them are pretty damned ugly; the preprocessor approach is prettier,&amp;nbsp;IMHO, but is also considerably more dangerous.&amp;nbsp; So, I'm going to leave it as&amp;nbsp;&lt;font face="Courier New"&gt;rmstring&lt;/font&gt; with two template parameters for the purposes of this&amp;nbsp;blog.&amp;nbsp;&amp;nbsp;Eventually I'll probably opt for the &lt;font 
face="Courier New"&gt;#define&lt;/font&gt; for my own&amp;nbsp;version of the library, but you can choose whichever is more appealing to you (conciseness versus type safety), or choose neither.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The second thing I wanted to address from yesterday was the pair of exceptions, &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; and &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt;, that I listed next to the &lt;font face="Courier New"&gt;encoding_cast()&lt;/font&gt; function.&amp;nbsp; What on earth are they for?&amp;nbsp; First off, imagine that you're trying to convert a string from UCS-4 to UCS-2.&amp;nbsp; As I mentioned in &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx"&gt;Part 2&lt;/a&gt;, UCS-2 is a non-universal encoding, and there are some code points that it cannot represent.&amp;nbsp; What happens if our UCS-4 string contains one of those code points?&amp;nbsp; In this case, we will throw the &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; exception.&amp;nbsp; We will also throw it in the case of converting to legacy character sets that simply do not have a code point defined for a symbol.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;There's something to keep in mind, though.&amp;nbsp; The popularity of JPEG proves that a lossless transform is not always necessary.&amp;nbsp; Imagine that we have the Latin ligature &lt;strong&gt;&amp;AElig;&lt;/strong&gt; -- is it acceptable to convert this to two characters, &lt;strong&gt;AE&lt;/strong&gt;?&amp;nbsp; The proper answer is neither yes nor no; it's "sometimes."&amp;nbsp;&amp;nbsp;Remember, all this time, our definitions of string have been derived from a definition of symbols&amp;nbsp;that a human interprets -- and this means that whether or not a&amp;nbsp;'close enough'&amp;nbsp;translation is acceptable depends on who's looking at the string.&amp;nbsp; Imagine that a blind person is using a screen reader (a 
program that uses a computerized voice to read text as it appears on the screen).&amp;nbsp; In that case, there's a vast difference between &lt;strong&gt;&amp;AElig;&lt;/strong&gt; and &lt;strong&gt;AE.&lt;/strong&gt;&amp;nbsp; For a person with normal sight reading a webpage, however, the two might be interchangeable.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The computer scientist in me says that I should only allow lossless transforms -- the engineer in me knows better, though, and there's a way to satisfy both.&amp;nbsp; Therefore, we are going to add a third template argument to yesterday's definition of&amp;nbsp;&lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;, and give it a default.&amp;nbsp; This argument will be called the "symbol clash resolver" and provides a well-known method that is invoked whenever a missing symbol problem occurs.&amp;nbsp; The default one, &lt;font face="Courier New"&gt;lossless_resolver&lt;/font&gt;, will throw &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; in all cases.&amp;nbsp; A user can define alternatives, though.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Two possible alternatives immediately occur to me -- one called &lt;font face="Courier New"&gt;visual_parity_resolver&lt;/font&gt; that does replacements like the above, and another called &lt;font face="Courier New"&gt;error_symbol_resolver&lt;/font&gt; that acts like RS232's error character, inserting a compile-time constant instead (such as a box symbol, or an "&amp;lt;ERROR&amp;gt;" string, or whatever suits the user) whenever a symbol cannot be translated.&amp;nbsp; But those can all wait for later -- only &lt;font face="Courier New"&gt;lossless_resolver&lt;/font&gt; needs to be immediately defined, and its definition is trivial, since it just throws :)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The other exception, &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt;, arises when we try to 
decode a buffer that has an error in it.&amp;nbsp; In the case of UTF-8, there are sequences that decode to illegal or nonsensical numbers, and if we&amp;nbsp;are asked to decode these sequences, we should let the user know.&amp;nbsp; Imagine a scenario where you are writing an Internet&amp;nbsp;server daemon, and expect to receive a UTF-8 encoded string as the first transmission after a client successfully connects.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In this scenario, we &lt;font face="Courier New"&gt;recv()&lt;/font&gt; the data from the client into a buffer, and then construct an &lt;font face="Courier New"&gt;rmstring&amp;lt;utf8, &lt;/font&gt;&lt;font face="Courier New"&gt;unmanaged_pointer&amp;gt;&lt;/font&gt; to read it.&amp;nbsp; If there was an error in network transmission, or a malicious client was testing our ability to handle bad data, we should communicate this to the programmer as an error.&amp;nbsp; Thus, if an encoding can detect illegal input (very few encodings can!) 
it may throw a &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt; exception&amp;nbsp;if you invoke&amp;nbsp;any operations that hit that input,&amp;nbsp;or if you attempt to trans-code it.&amp;nbsp; We will also probably want to make a compile-time flag visible on the encoding class that determines whether or not it can have malformed data.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, with those two issues resolved, let's get down to our dirty business!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I said earlier that we had to pick one of two mutually exclusive goals: be a&amp;nbsp;perfect drop-in replacement for &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, or support variable-width encodings such as UTF-8.&amp;nbsp; Since I think &lt;font face="Courier New"&gt;std::string&lt;/font&gt; is poorly designed &lt;strong&gt;&lt;em&gt;and&lt;/em&gt;&lt;/strong&gt; I demonstrated that not being string-compatible is only a loss for stringstream compatibility, I'm favoring the latter.&amp;nbsp; (Just hating &lt;font face="Courier New"&gt;std::string&lt;/font&gt; alone would not be sufficient reason -- in that case I'd just be suffering from&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Not_Invented_Here"&gt;NIH syndrome&lt;/a&gt;.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;However, this doesn't mean that I can just go roll my own string class in the way that best suits my urges.&amp;nbsp; Many programmers have devoted considerable time and energy to learning &lt;font face="Courier New"&gt;std::string&lt;/font&gt;'s ins and outs, myself included -- so, I should exploit that knowledge by providing similar functions with similar arguments, as long as it doesn't compromise my design's principles.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Looking at &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;'s definition in the C++ Standard is an exercise in mental stamina.&amp;nbsp; It defines six constructors (one of which requires 
some very &lt;a href="http://www.mpi-sb.mpg.de/~kettner/courses/lib_design_03/notes/meta.html"&gt;special trickery with templating and the SFINAE principle&lt;/a&gt; to implement, as we'll see later) and over 100 methods, plus a host of non-member operators such as &lt;font face="Courier New"&gt;&amp;lt;&amp;lt;&lt;/font&gt; and &lt;font face="Courier New"&gt;+&lt;/font&gt;.&amp;nbsp; However, looking at the expected behavior for each function, most of them are overloads that call a base function.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In other words, a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; has one or two core definitions at most for each core method (such as &lt;font face="Courier New"&gt;append()&lt;/font&gt;, &lt;font face="Courier New"&gt;replace()&lt;/font&gt;, &lt;font face="Courier New"&gt;insert()&lt;/font&gt;, etc.), which take &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;s as their input.&amp;nbsp; Every other overload is defined as equivalent to calling that root function, with a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; constructor meant to convert some other form of string (char pointer, run of chars, pair of iterators, etc.) 
to a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; that the "core implementation" can grok.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Of course, real implementations don't all work like that, because it'd mean frivolously making a copy of the input data in &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; form for each trivial overload.&amp;nbsp; Instead, a typical implementation of &lt;font face="Courier New"&gt;std::string&lt;/font&gt; has an optimized version for each&amp;nbsp;variant, making maintenance a nightmare.&amp;nbsp; But we don't have that problem -- because, instead of requiring an STL allocator, we can accept an arbitrary backing store!&amp;nbsp; So, suppose we have a working implementation of append:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt; class Encoding, class BackingStore &amp;gt; class rmstring {&lt;br /&gt;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// &lt;strong&gt;Appends &lt;em&gt;n&lt;/em&gt;&amp;nbsp;codepoints of &lt;em&gt;str&lt;/em&gt;, starting at &lt;em&gt;pos&lt;/em&gt;, to the&amp;nbsp;string.&lt;/strong&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;//&amp;nbsp;* Will throw an out_of_range exception if &lt;em&gt;pos&lt;/em&gt; &amp;gt;= &lt;em&gt;str&lt;/em&gt;.length()&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;//&amp;nbsp;* If &lt;em&gt;pos&lt;/em&gt; is in range, but&amp;nbsp;&lt;em&gt;pos&lt;/em&gt; +&amp;nbsp;&lt;em&gt;n&lt;/em&gt;&amp;nbsp;&amp;gt; &lt;em&gt;str&lt;/em&gt;.length(), &lt;em&gt;n&lt;/em&gt; is&amp;nbsp;truncated so that &lt;em&gt;pos&lt;/em&gt; + &lt;em&gt;n&lt;/em&gt; = &lt;em&gt;str&lt;/em&gt;.length().&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;//&amp;nbsp;* Will throw a length_error exception if the resulting string would be larger than&amp;nbsp;&lt;em&gt;BackingStore&lt;/em&gt;'s max_size().&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt; class OtherBS &amp;gt; rmstring &amp;amp; append( rmstring&amp;lt;Encoding, 
OtherBS&amp;gt;&amp;nbsp;const &amp;amp; str, size_type pos, size_type n ) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font face="Courier New"&gt;&lt;font color="#000080"&gt;&lt;em&gt;/* implementation */&lt;br /&gt;&lt;/em&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;...&lt;br /&gt;};&lt;/font&gt;&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(Note that I've defined the above in terms of code points, not symbols.&amp;nbsp; There can be multiple codepoints representing a single symbol.&amp;nbsp; I'll discuss this problem, and the related problem of Unicode normalization forms, in a later post -- mainly because I'm still working on a solution.&amp;nbsp; :-P This is a learning exercise for me too!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Because &lt;font face="Courier New"&gt;OtherBS&lt;/font&gt; is arbitrary, we can directly implement the other overloads of &lt;font face="Courier New"&gt;append()&lt;/font&gt; as calls to &lt;font face="Courier New"&gt;append()&lt;/font&gt; with an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; constructor, without worrying about needlessly duplicating information.&amp;nbsp; If we want to use a &lt;font face="Courier New"&gt;char *&lt;/font&gt; from an ANSI C function, we can just use an &lt;font face="Courier New"&gt;unmanaged_pointer&lt;/font&gt; backing store.&amp;nbsp; If we want to use n repetitions of some character c, we can just use a &lt;font face="Courier New"&gt;run_of_chars&amp;lt;n, c&amp;gt;&lt;/font&gt; backing store.&amp;nbsp; We pass the &lt;em&gt;exact same information&lt;/em&gt; as if we were doing it the old way, but abstracted inside a templated class, so there's no overhead except at compile time.&amp;nbsp; Beautiful!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, what should we implement from &lt;font face="Courier New"&gt;std::string&lt;/font&gt;?&amp;nbsp; Here are the core functions from &lt;font face="Courier 
New"&gt;basic_string&lt;/font&gt; that seem worth carrying over:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;strong&gt;Size functions&lt;/strong&gt;: &lt;font face="Courier New"&gt;size()&lt;/font&gt; and &lt;font face="Courier New"&gt;length()&lt;/font&gt;, &lt;font face="Courier New"&gt;max_size()&lt;/font&gt;, &lt;font face="Courier New"&gt;capacity()&lt;/font&gt;, &lt;font face="Courier New"&gt;reserve()&lt;/font&gt;, &lt;font face="Courier New"&gt;resize()&lt;/font&gt;, &lt;font face="Courier New"&gt;empty()&lt;/font&gt;, &lt;font face="Courier New"&gt;clear()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Iterators&lt;/strong&gt;: &lt;font face="Courier New"&gt;begin()&lt;/font&gt;, &lt;font face="Courier New"&gt;end()&lt;/font&gt;, &lt;font face="Courier New"&gt;rbegin()&lt;/font&gt;, &lt;font face="Courier New"&gt;rend()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Accessors&lt;/strong&gt;: &lt;font face="Courier New"&gt;operator[]&lt;/font&gt;, &lt;font face="Courier New"&gt;at()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Replacers&lt;/strong&gt;: &lt;font face="Courier New"&gt;assign()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator=&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Appenders&lt;/strong&gt;: &lt;font face="Courier New"&gt;push_back()&lt;/font&gt;, &lt;font face="Courier New"&gt;push_front()&lt;/font&gt;, &lt;font face="Courier New"&gt;append()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator+=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator+&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Modifiers&lt;/strong&gt;: &lt;font face="Courier New"&gt;insert()&lt;/font&gt;, &lt;font face="Courier New"&gt;erase()&lt;/font&gt;, &lt;font face="Courier New"&gt;replace()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Searchers&lt;/strong&gt; (evil): &lt;font face="Courier New"&gt;find()&lt;/font&gt;, &lt;font face="Courier New"&gt;rfind()&lt;/font&gt;, 
&lt;font face="Courier New"&gt;find_first_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_last_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_first_not_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_last_not_of()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Utilities&lt;/strong&gt;: &lt;font face="Courier New"&gt;substr()&lt;/font&gt;, &lt;font face="Courier New"&gt;copy()&lt;/font&gt;, &lt;font face="Courier New"&gt;swap()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Comparators&lt;/strong&gt; (also evil): &lt;font face="Courier New"&gt;compare()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator==&lt;/font&gt;, &lt;font face="Courier New"&gt;operator!=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;lt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;lt;=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;=&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Streams:&lt;/strong&gt; &lt;font face="Courier New"&gt;operator&amp;lt;&amp;lt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;&amp;gt;&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Backwards compatibility:&lt;/strong&gt; &lt;font face="Courier New"&gt;c_str()&lt;/font&gt;, &lt;font face="Courier New"&gt;data()&lt;br /&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;That's a lot of stuff to implement!&amp;nbsp; But not only does it gain us good-will by allowing programmers to code much like they did with &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, it also means that we can make a &lt;font face="Courier New"&gt;typedef rmstring&amp;lt;&lt;em&gt;RMS_COMPILER_SPECIFIC_ENCODING&lt;/em&gt;, vector_of_bytes&amp;gt;&amp;nbsp;rstring&lt;/font&gt;, and be pretty damned close to &lt;font face="Courier New"&gt;std::string&lt;/font&gt;-equivalent.&amp;nbsp; (The compiler-specific encoding can 
be set in a header file, or specified on the command line -- I'll likely set it to &lt;font face="Courier New"&gt;iso8859_1&lt;/font&gt; for string and &lt;font face="Courier New"&gt;ucs2&lt;/font&gt; for wstring in a header.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;But before I get to that, I'll have a nastier problem to tackle, and that's combining characters.&amp;nbsp; Not only do we have codepoints that can take up variable amounts of space (thanks to encoding), but we also have symbols that can take up variable numbers of codepoints!&amp;nbsp; (See Part 1 and search for "diaeresis" if you're not sure why this is.)&amp;nbsp; Unicode, luckily, comes to the rescue again with a standard that determines when and how a character symbol should or should not be broken down into combining characters.&amp;nbsp;&amp;nbsp;These are called&amp;nbsp;normalization forms, and we'll tackle those on Monday.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Next episode: Normalization forms and chain of command (which does not involve rmstring covering its ass if things go FUBAR).&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Takeaways from Part 4:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;We're specifically designing &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; to&amp;nbsp;force the programmer into awareness of encodings -- we don't want&amp;nbsp;to hide that with a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; being typedefed.&amp;nbsp; (We couldn't anyways, because we can't template typedefs.)&amp;nbsp; So, for now, we'll leave it as-is.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Not only are encodings not all equal, transcoding schemes are not all equal either!&amp;nbsp; Be aware of this, and think about how you want to handle errors!&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Even if we think&amp;nbsp;&lt;font face="Courier New"&gt;std::string&lt;/font&gt; is evil, we can still gain good will from our potential users by making ourselves as close to &lt;font face="Courier 
New"&gt;std::string&lt;/font&gt; as possible.&amp;nbsp; This, unfortunately, means lots of work.&amp;nbsp; But not as much as if we were actually implementing &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, due to our luck in choosing to template our backing store.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;However, all our methods need to be defined in terms of symbols, not code points (and certainly not bytes of encoded data!).&amp;nbsp; This makes our life difficult again.&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=246539" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item><item><title>Encodings in Strings are Evil Things (Part 3)</title><link>http://blogs.msdn.com/ryanmy/archive/2004/10/20/245417.aspx</link><pubDate>Thu, 21 Oct 2004 00:08:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:245417</guid><dc:creator>ryanmy</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/ryanmy/comments/245417.aspx</comments><wfw:commentRss>http://blogs.msdn.com/ryanmy/commentrss.aspx?PostID=245417</wfw:commentRss><description>&lt;p&gt;&amp;nbsp;&amp;nbsp;&lt;em&gt;&amp;nbsp;(Before I start: I've gotten a few suggestions about readability, since my two entries thus far have been quite long.&amp;nbsp; So, entries will now contain a summary at the end with major facts/conclusions, and I'll go back and add them for the first two posts.&amp;nbsp; I'll also try to pace my paragraphs more regularly.&amp;nbsp; Thanks for the advice!)&lt;/em&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx"&gt;Yesterday&lt;/a&gt;, we took the definition of string as an ordered sequence of Unicode code points, and explored various schemes for&amp;nbsp;encoding 
and decoding code point indices on a binary computer.&amp;nbsp; At the end, we had a new definition for string -- a stream of bits, and some type of information identifying the encoding scheme used to interpret the bits as a stream of Unicode code points.&amp;nbsp; Today, since I'm a coder, we'll be starting a C++ implementation of a string library based on this definition.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Before we do that, though, there's one more nasty digression into standards-land that I'd like to take.&amp;nbsp; This is a fairly general definition of what a string is, and you don't really write libraries unless you intend for them to be general-purpose enough to be reused.&amp;nbsp;&amp;nbsp;So,&amp;nbsp;it might be a worthwhile goal to make our new string library compatible with the &lt;font face="Courier New"&gt;string&lt;/font&gt; class in the C++ Standard Template Library, so that anyone could gain its benefits simply by using a different &lt;font face="Courier New"&gt;#include&lt;/font&gt;.&amp;nbsp; Unfortunately, there are some restrictions in the C++ Standard (which I would highly suggest purchasing if you code in C++ for a living -- it's &lt;a href="http://webstore.ansi.org/ansidocstore/product.asp?sku=INCITS/ISO/IEC+14882-2003"&gt;$18 in PDF form direct from ISO&lt;/a&gt;) that prevent us from doing so -- namely, that many parts of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; are hard-wired to require a constant-size encoding and will not work with encodings such as UTF-8.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The C++ Standard starts by defining &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; as templated on three classes -- a character type (&lt;font face="Courier New"&gt;charT&lt;/font&gt;), a specialization of &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; for that type, and an allocator for that type.&amp;nbsp; (Nothing SAYS we have to implement&amp;nbsp;it with exactly those template 
parameters, but we're screwed anyways, as you'll see.)&amp;nbsp; It then defines two member typedefs for that specialization: &lt;font face="Courier New"&gt;traits_type&lt;/font&gt;, which typedefs to the templated traits specialization, and &lt;font face="Courier New"&gt;value_type&lt;/font&gt;, which&amp;nbsp;typedefs to&amp;nbsp;&lt;font face="Courier New"&gt;traits_type::char_type&lt;/font&gt;... which, by definition, is also required to be &lt;font face="Courier New"&gt;charT&lt;/font&gt;.&amp;nbsp; The definition of &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; requires that &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; be specialized only on &lt;a href="http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.7"&gt;PODs&lt;/a&gt; (which are always constant-size), and its definitions are all written to assume uniformly-sized characters.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;As if the traits problem weren't enough, a conformant &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; implementation must ensure that &lt;font face="Courier New"&gt;s[i]&lt;/font&gt; returns the same value as &lt;font face="Courier New"&gt;s.data()[i]&lt;/font&gt;, and data() is required to return a &lt;font face="Courier New"&gt;const charT *&lt;/font&gt;.&amp;nbsp; So, even if we could get around the&amp;nbsp;traits problem, variable-length encodings still screw us because &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; and a pointer offset will no longer agree.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, we will have to abandon hopes of being a drop-in replacement for &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;.&amp;nbsp; But, really, this isn't too bad -- there are only three other libraries in the STL that require the use of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;!&amp;nbsp; The first is in &lt;font face="Courier New"&gt;locale&lt;/font&gt;, and hardly anyone uses C++'s built-in locales 
anyway, favoring OS functionality.&amp;nbsp; The second is the &lt;font face="Courier New"&gt;bitset&lt;/font&gt; container, which hardly anyone uses either.&amp;nbsp; The third is&amp;nbsp;its use as&amp;nbsp;a backing store for &lt;font face="Courier New"&gt;stringstreams&lt;/font&gt; and as the &lt;font face="Courier New"&gt;stringbuf&lt;/font&gt;&amp;nbsp;wrapper that is the foundation of &lt;font face="Courier New"&gt;iostream&lt;/font&gt;, and this is a bigger loss.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The loss of direct compatibility with&amp;nbsp;stringbuf is a big pain.&amp;nbsp; However, by the time you get to I/O, you need to have already converted your string to the encoding your user is expecting -- we can't expect a prompt that wants ASCII to cope with a stream of UCS-2 characters!&amp;nbsp; So, it's perfectly okay if stringbuf&amp;nbsp;is left&amp;nbsp;alone, as long as we find a way to&amp;nbsp;convert strings between different encodings.&amp;nbsp; That leaves stringstreams as the only real loss, and we can make our own stringstream, if need be.&amp;nbsp; (Thanks to templates, we may be able to avoid having to re-invent the wheel, which is always good.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm going to start with policy-based design, which &lt;a href="http://www.moderncppdesign.com/"&gt;Alexandrescu&lt;/a&gt; introduced a few years ago in Modern C++ Design.&amp;nbsp; (Actually, the STL beat him to the punch by using allocators as a template argument for most of its &lt;font color="#000000"&gt;containers, but he popularized the technique for general customization.)&amp;nbsp; In fact, he already demonstrated policy-based design in a &lt;/font&gt;&lt;a href="http://www.cuj.com/"&gt;&lt;font color="#000000"&gt;CUJ&lt;/font&gt;&lt;/a&gt;&lt;font color="#000000"&gt; article a year or two ago by making a basic_string replacement that allowed customizing copy-on-write semantics -- but I'm a bit more ambitious 
:)&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My first stab at the class will be based directly off our most recent definition of string -- an encoding, and an ordered sequence of bits:&lt;/font&gt;&lt;/p&gt; &lt;p dir="ltr" style="MARGIN-RIGHT: 0px"&gt;&lt;font face="Courier New" color="#000080"&gt;namespace rmlibs {&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;namespace encodings {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;/* ... utf8, iso8859_1, big5, mac_roman, etc. go here ... */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;namespace backing_stores {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;/* ... string_literal, vector_of_uchars, etc. go here ... */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt;class Encoding, class&amp;nbsp;Bits&amp;gt; class rmstring {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;public:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;typedef Encoding encoding_type;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;private:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Bits _data;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;};&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Not much, but it's a start&lt;/font&gt;!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;At this point, I want to reference something I said earlier about I/O -- when you're doing I/O, whether that's taking a string in or sending a string out, your stream of bits needs to have the same encoding as the device you're talking with, or Bad Things happen.&amp;nbsp; We need some way to denote, inside code, that an 
encoding change needs to take place.&amp;nbsp; (Guessing ahead, this will probably be the&amp;nbsp;most tedious&amp;nbsp;part of development -- creating UCS-to-encoding and encoding-to-UCS conversions for each encoding and character set we support.)&amp;nbsp; I'm going to take a cue from the excellent &lt;a href="http://www.boost.org/"&gt;Boost&lt;/a&gt; library here, and make an analogue to their &lt;font face="Courier New"&gt;lexical_cast&lt;/font&gt; function template.&lt;/p&gt; &lt;p dir="ltr" style="MARGIN-RIGHT: 0px"&gt;&lt;font face="Courier New" color="#000080"&gt;namespace rmlibs {&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// these are the major exceptions...&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font face="Courier New" color="#000080"&gt;class&amp;nbsp;missing_symbol;&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier New" color="#000080"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;class malformed_data;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// ... 
that are thrown by:&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier New"&gt;&lt;br /&gt;&lt;font color="#000080"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt;typename Target, typename Source&amp;gt; Target encoding_cast(Source str);&lt;br /&gt;};&lt;/font&gt;&lt;/font&gt; &lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In the near future I'll probably alter this to take only &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;s as input and output and template on encoding types in/out, since right now it accepts any pair of types -- but this is only a prototype.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The goal of doing this is to minimize conversions.&amp;nbsp; Some of my coworkers who have been kind enough to proofread have remarked, &lt;em&gt;"I'd just throw&amp;nbsp;up my hands and convert everything internally to UCS-4 and use a basic_string&amp;lt;unsigned long&amp;gt;; after all, memory is cheap."&lt;/em&gt;&amp;nbsp; In a way, they're right -- doing this would mean I'd only have to write encoding_cast() for each encoding, and wouldn't even need the&amp;nbsp;new&amp;nbsp;string&amp;nbsp;class.&amp;nbsp; But, I'm a performance guy, a bit twiddler&amp;nbsp;at heart.&amp;nbsp; I don't want to do a conversion unless I need to, or unless the performance gains from a fixed-width format like UCS-4 outweigh the performance loss of having to trans-code everything.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(It's rather like image formats -- TGA is lossless and can hold damn near anything, but that doesn't mean we always convert everything to TGA first before working with it, and then convert back when we're done.&amp;nbsp; Not everything has to be "worked on," and not all work is equally difficult.&amp;nbsp; This is especially true if we're using a compile-time string literal as a backing store, since it won't be modifiable unless you make a copy!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The general plan is to use &lt;font face="Courier 
New"&gt;rmstring&lt;/font&gt; as a &lt;a href="http://hillside.net/patterns/DPBook/DPBook.html"&gt;Facade&lt;/a&gt; for the Encoding class we're templated on.&amp;nbsp; Most of &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;'s methods will actually call the Encoding class and pass in state and a pointer to our Bits object as needed; the Encoding class will handle all the work of character traversal.&amp;nbsp; Since many of the encodings we're planning to&amp;nbsp;deal with are fixed-width (UCS-2, UCS-4, and most old systems like ISO 8859 and ASCII), I'll likely create a FixedWidthEncoding base class that does most of the work of locating offsets and insertion/deletion, and inherit most of the Encodings from it.&amp;nbsp; This means the main thing unique to each Encoding will be the translation tables used for converting the symbol sets of non-Unicode systems to Unicode code points, since most of the older encodings are simple fixed-width affairs that just have non-standard symbol sets.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Tomorrow, we'll start fleshing out &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;'s body with constructors and methods, and explain what those two exceptions&amp;nbsp;next to&amp;nbsp;&lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; are for.&amp;nbsp; We'll also take a brief look at screen-readers and web browsers, and make a change to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; to handle "looks-close-enough" trans-codes.&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Today's facts/conclusions:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;The definitions of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; and &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; in the C++ Standard prevent use of variable-width encodings;&amp;nbsp;therefore, we cannot make&amp;nbsp;a perfect drop-in replacement for the STL string class.&amp;nbsp; However, that's okay -- the only STL object we'll have to 
duplicate functionality for is stringstream.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;We can't expect I/O with external devices/programs to conform to whatever encoding we want -- they're expecting a specific encoding, and we need to present our data in that format -- or die a horrible, painful death.&amp;nbsp; So, the ability to trans-code is absolutely necessary.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Trans-coding can be expensive, but can have some gains, especially if going to UCS-4 for speed in manipulation or going to UTF-8 for compatibility with legacy C APIs.&amp;nbsp; Do it when necessary or justified, but avoid it if it's not absolutely necessary.&amp;nbsp; The coder should be allowed to pick an encoding and work with strings in that encoding as easily as possible.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;</description><category domain="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx">I18N</category><category domain="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx">C++</category></item></channel></rss>