<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/atom.xsl" media="screen"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-US"><title type="html">Funny, It Worked Last Time</title><subtitle type="html">... and other odd mutterings of a performance junkie</subtitle><id>http://blogs.msdn.com/ryanmy/atom.xml</id><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/default.aspx" /><link rel="self" type="application/atom+xml" href="http://blogs.msdn.com/ryanmy/atom.xml" /><generator uri="http://communityserver.org" version="2.1.61025.2">Community Server</generator><updated>2004-10-18T19:54:00Z</updated><entry><title>Event Tracing for Windows (ETW) -- Part 2</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2005/06/09/427520.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2005/06/09/427520.aspx</id><published>2005-06-10T00:42:00Z</published><updated>2005-06-10T00:42:00Z</updated><content type="html">&amp;nbsp;&amp;nbsp;&amp;nbsp; So, there were two major groups of comments on the last post, and I'll try to address them.&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; The first was a question about &lt;b&gt;managed support&lt;/b&gt;
for ETW.&amp;nbsp; I talked to the ETW team, and the current state is that
there is no official managed interface for ETW.&amp;nbsp; Being a standard
Win32 API, it is possible to &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpguide/html/cpconconsumingunmanageddllfunctions.asp"&gt;PInvoke&lt;/a&gt;
the functions involved, and several internal teams have written their
own managed wrappers around ETW.&amp;nbsp; This isn't expected to change up
through Whidbey (Visual Studio 2005); for Orcas (the version of VS
after that!) an official managed interface is on the table.&amp;nbsp; The
second is one of &lt;b&gt;backwards compatibility&lt;/b&gt;
-- ETW is only available in Win2K and later OSes.&amp;nbsp; Users will
expect that software works similarly on all OSes; thus, if you want to
support 9x-era OSes, you have to write your own logging code
anyways.&amp;nbsp; So, if you are already putting in old-style logging, why
use ETW?&amp;nbsp; I'll try to answer that with this entry.&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; One of the gains of ETW is that it's fast; you can
spit out thousands of events per second while using relatively little
CPU, far faster than you can fprintf() a string to disk.&amp;nbsp; The
biggest gain, though, is combining multiple sources -- including
information outside your own process.&amp;nbsp; And the most notable
external source is the kernel.&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; The XP and Codename Longhorn kernels are &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/perfmon/base/nt_kernel_logger_constants.asp"&gt;extremely extensive providers&lt;/a&gt;.&amp;nbsp; The kernel provider can be enabled to log any or all of the following event classes, and we publish decoding information for each of them:&lt;br&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;i&gt;Hardware configuration events&lt;/i&gt; -- notes on the system's CPUs, hard drives, NICs, video card, and ACPI power states&lt;/li&gt;
  &lt;li&gt;&lt;i&gt;Disk-level I/O&lt;/i&gt; -- every I/O on the system, including IRP flags, operation time in microseconds, number of bytes, and target disk&lt;/li&gt;
  &lt;li&gt;&lt;i&gt;File-level I/O&lt;/i&gt; -- every access to every file on the system (including information to tie it to the disk I/Os above)&lt;/li&gt;
  &lt;li&gt;&lt;i&gt;Image layouts&lt;/i&gt; -- filenames, locations in memory, and PIDs for every image in the system&lt;/li&gt;
  &lt;li&gt;&lt;i&gt;Page faults&lt;/i&gt; -- pointers
to instructions and pages whenever a fault occurs (including COWs,
demand-zero faults, hard page faults, transition faults, and guard
pages)&lt;/li&gt;
  &lt;li&gt;&lt;i&gt;Network I/O&lt;/i&gt; -- all TCP and UDP actions, including connects/accepts, transmits (and retransmits!), receives, etc.&lt;/li&gt;
  &lt;li&gt;&lt;i&gt;Registry I/O&lt;/i&gt; -- all Registry key/value creations/deletions/changes, registry flushes, etc.&lt;/li&gt;
  &lt;li&gt;&lt;i&gt;Process and thread info&lt;/i&gt; -- all creations/deletions of processes and threads&lt;/li&gt;
&lt;/ul&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; And Codename Longhorn adds even more events -- most notably,
the ability to trace extremely fine-grained high-frequency events such
as individual context switches, interrupts (ISRs and DPCs), etc.&amp;nbsp;&amp;nbsp;  &lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; (FYI, since all of the above information is
exceedingly detailed, you can only enable the kernel provider if you
have Administrator privileges, are part of the Performance Log Users
group, or are a service running as LocalSystem, LocalService, or
NetworkService.)&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; Thus, ETW can be used as an effective debugging
tool.&amp;nbsp; By allowing ETW to pull from, sort, and combine events from
multiple providers, you can get a powerful log of everything the system
was doing, probably the most accurate log available (save for running
the entire OS in a debugger).&amp;nbsp; It's an incredible tool for
noticing "hey, things act strangely when X, Y, and Z, but not W, are
happening" at a system level, as well as a code level, and it takes far
less time than getting symbols and attaching a debugger/profiler to the
system.&amp;nbsp; And it's all available to devs -- and even to users,
given a generic ETW tool such as tracelog in Server 2003!&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; Even if you don't specifically use the kernel as an
information source, ETW's ability to combine providers is useful for
mixing and matching information from multiple DLLs, EXEs, etc. in a
system.&amp;nbsp; ETW events are timestamped by the kernel to extremely
high resolution (RDTSC on stable machines, converted to microsecond
intervals; MM timers on others) and are automatically sorted at process
time, so you don't have to write or parse plaintext date/time formats.&lt;br&gt;
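&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; To make the "sorted at process time" idea concrete, here is a toy sketch in Python of merging per-provider event streams by timestamp.&amp;nbsp; This is not the ETW API: the tuple layout, the provider names, and the merge_streams() helper are all made up for illustration, and with real ETW the merge is done for you.&lt;br&gt;
&lt;pre&gt;
```python
import heapq

# Hypothetical (timestamp_in_microseconds, provider, message) tuples.
# Each provider's own stream is already in timestamp order.
kernel_events = [
    (100, "kernel", "disk read start"),
    (450, "kernel", "disk read end"),
]
app_events = [
    (90, "player.exe", "open file"),
    (500, "player.exe", "first frame decoded"),
]
codec_events = [
    (300, "mpeg2.dll", "decoder initialized"),
]


def merge_streams(*streams):
    """Merge already-sorted event streams into one list ordered by timestamp."""
    return list(heapq.merge(*streams))


merged = merge_streams(kernel_events, app_events, codec_events)
for timestamp, provider, message in merged:
    print(timestamp, provider, message)
```
&lt;/pre&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; Because every stream shares one clock, the merge needs no date/time parsing at all; that shared clock is the part a homebrew logger usually lacks.&lt;br&gt;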
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; The goal I'm driving at is that when you have more
than one DLL or EXE providing information, individually implementing
logging for each component means that you usually need a third app to
read in the logs from each component and combine them into a single log
with coherent event ordering, and this can be difficult -- especially
if you have to tie it to some event.&amp;nbsp; ETW allows you to automate
all that, and it's exceedingly efficient at it as well.&amp;nbsp; Even if
you are only personally maintaining one component, ETW can log very
quickly, and it can be shipped in retail builds -- and if you publish
the structures of some or all of the events you provide, you can give
valuable information to your consumers and to future devs without ever
needing to work with them.&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; Next entry, I'll start discussing how providers are
written, starting with thread structures and common ways of publishing
event structs.&lt;br&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=427520" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="Performance" scheme="http://blogs.msdn.com/ryanmy/archive/tags/Performance/default.aspx" /></entry><entry><title>Event Tracing for Windows (ETW)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2005/05/27/422772.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2005/05/27/422772.aspx</id><published>2005-05-28T05:13:00Z</published><updated>2005-05-28T05:13:00Z</updated><content type="html">&amp;nbsp;&amp;nbsp;&amp;nbsp; A lot of work in performance tuning is
organizational.&amp;nbsp; There's only so much work one can do with a
profiler and a single module.&amp;nbsp; A good example is the Registry --
we can attach profilers to the Registry access routines and optimize
them until they run as smooth as silk, but performance will still be
impacted if you do thousands of Registry accesses per second.&amp;nbsp; For
many problems, the cause is systemic: several components in a chain of
command that are individually well-tuned, but didn't expect to call
each other in a huge chain.&amp;nbsp; A good example of that is DirectShow
-- no matter how skillfully crafted an individual filter is, if the
mean path in a DirectShow graph is ten filters deep (with memory
management between each one for passing buffers of audio or video
around), latency is going to be high.&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; More often than not, the best solution is simply
logging.&amp;nbsp; Log when filters are instantiated or connected, log when
Registry accesses are made, etc.&amp;nbsp; You want to mark high-level
concepts, and try to get a picture for what's going on with the system
as a whole.&amp;nbsp; This works fine if you only have one application that
has to log... but more often than not, these systemic problems have
hundreds of files involved, most of which aren't coded by you!&amp;nbsp; If
every programmer performs their logging in a different way, it can be a
nightmare to combine all those logs together, mixing different types of
timestamps and different methods of delivery, and get a single ordered
log of what happened over time.&amp;nbsp; Of course, that's exactly what we
need... and that's where Event Tracing for Windows, or ETW for short,
comes in.&lt;br&gt;
&lt;br&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ETW is, at its core, a unified system for one-way
packetized I/O managed by the Windows kernel, built for logging.&amp;nbsp;
Every use of ETW has three participants in it -- the controller, the
provider, and the consumer:&lt;br&gt;
&lt;ul&gt;
  &lt;li&gt;
A &lt;b&gt;provider&lt;/b&gt; is a module (DLL/EXE) doing something worth logging.&amp;nbsp;
Most of the time, it runs without logging; it can, however, be
"enabled" by a controller, at which point it receives a handle from the
kernel and starts logging "events" to that handle.&amp;nbsp; An event is
an arbitrary struct (binary block) of data, the only condition being
that it start with a 48-byte header.&amp;nbsp; This header contains a
timestamp and identifying information.&lt;br&gt;
    &lt;br&gt;
  &lt;/li&gt;
  &lt;li&gt;
A &lt;b&gt;controller&lt;/b&gt; controls the actual act of logging.&amp;nbsp; The controller
can ask the kernel to start a logging session, creating a handle and
specifying that the kernel should take any events delivered to that
handle and save them to a file.&amp;nbsp; (That file is usually on a hard
drive, although we occasionally save them to RAM drives to ensure
minimal interference.)&amp;nbsp; The controller can also enable and disable
logging by providers, passing them a handle to log to.&lt;br&gt;
    &lt;br&gt;
  &lt;/li&gt;
  &lt;li&gt;
A &lt;b&gt;consumer&lt;/b&gt; reads events out of a file created by a logging session and
parses them.&amp;nbsp; (It is also technically possible to have a consumer
directly attach to a logging session's handle and retrieve events in
real-time, but this is rare.)&lt;/li&gt;
&lt;/ul&gt;
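&amp;nbsp;&amp;nbsp;&amp;nbsp; The three roles above can be modeled in a few lines of Python.&amp;nbsp; This is only a toy to show who talks to whom; the class and method names are invented for this sketch and are not ETW functions, the "session" is a plain list rather than a kernel object, and the caller supplies the timestamp here whereas real ETW stamps the event header itself.&lt;br&gt;
&lt;pre&gt;
```python
class Session:
    """Stands in for a kernel logging session: it just collects events."""

    def __init__(self):
        self.events = []


class Provider:
    """A module that only logs once a controller has enabled it."""

    def __init__(self, name):
        self.name = name
        self.session = None  # no session handle until enabled

    def log(self, timestamp, payload):
        if self.session is None:
            return  # disabled: logging is a cheap no-op
        self.session.events.append((timestamp, self.name, payload))


class Controller:
    """Starts sessions and turns providers on and off."""

    def start_session(self):
        return Session()

    def enable(self, provider, session):
        provider.session = session

    def disable(self, provider):
        provider.session = None


def consume(session):
    """A consumer reads the session's events back in timestamp order."""
    return sorted(session.events)


controller = Controller()
provider = Provider("mydll")
provider.log(1, "dropped, nobody is listening")

session = controller.start_session()
controller.enable(provider, session)
provider.log(2, "filter connected")
provider.log(3, "registry read")
controller.disable(provider)
provider.log(4, "dropped again")

print(consume(session))
```
&lt;/pre&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; Note that the provider never decides whether logging happens or where the data goes; it only checks whether it has been handed a session.&lt;br&gt;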



&amp;nbsp;&amp;nbsp;&amp;nbsp; So, why use this system over your own homebrew system?&lt;br&gt;
&lt;ul&gt;
  &lt;li&gt;
    &lt;b&gt;Uniformity&lt;/b&gt;.&amp;nbsp; If you're debugging systemic problems involving
multiple components, and all the involved components use ETW, you can
have them all deliver their information to a single log file with
uniform, steady timestamps, and write a single application that parses
them all.&lt;br&gt;
    &lt;br&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;b&gt;Speed&lt;/b&gt;.&amp;nbsp; ETW is extremely fast for providers to use, since all the
I/O is handled by the kernel instead of by your module.&amp;nbsp; It
typically takes &lt;b&gt;only 1500-2000 cycles&lt;/b&gt;, depending on settings, to
deliver an event and return to your code.&amp;nbsp; One can easily deliver
thousands of events per second even on ancient machines.&amp;nbsp; We've
achieved &lt;b&gt;20,000 events per second while only using 5% CPU load
on a P3 500MHz&lt;/b&gt;!&amp;nbsp; &lt;i&gt;(Yes, we have machines that old in our
perf testing labs -- not everyone who uses Longhorn will be using a
modern machine!)&lt;/i&gt;&lt;br&gt;
    &lt;br&gt;
  &lt;/li&gt;&lt;li&gt;
    &lt;b&gt;Consistency&lt;/b&gt;.&amp;nbsp; With fprintf() or other homebrew systems, logging
tends to be very slow and intrusive, and is thus usually compiled
in only for debug builds.&amp;nbsp; With ETW, logging is extremely fast; furthermore, since
logging is turned on by a controller and is usually off by default, you
can actually leave the ETW events in final shipping code!&amp;nbsp; If
problems are found in the field, send the tester an app that starts a
trace and turns on the provider, then read it later.&amp;nbsp; Many, many
components in Longhorn will ship as ETW providers.&lt;br&gt;
    &lt;br&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;b&gt;
Reliability&lt;/b&gt;.&amp;nbsp; ETW isn't a new thing -- it's actually been in the
OS and actively used since Win2K, and has been constantly refined since
then.&amp;nbsp; Furthermore, ETW is available in both user-mode apps and
kernel components.&amp;nbsp; (The latter access it through an
IRP_MJ_SYSTEM_CONTROL IRP.)&amp;nbsp; This leads to...&lt;br&gt;
    &lt;br&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;b&gt;
OS cooperation&lt;/b&gt;.&amp;nbsp; The Windows kernel can provide many highly useful
events via ETW for diagnosing performance problems.&amp;nbsp; Find out when
and where disk I/Os, registry accesses, hard faults, and other
performance problems happen!&amp;nbsp; More on this later...&lt;/li&gt;
&lt;/ul&gt;
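&amp;nbsp;&amp;nbsp;&amp;nbsp; As a back-of-the-envelope check on the Speed numbers above, the snippet below works out how many cycles per event a 5% CPU budget allows at 20,000 events per second on a 500MHz machine.&amp;nbsp; It is plain arithmetic on the figures already quoted, not a measurement, and the variable names are only for this sketch.&lt;br&gt;
&lt;pre&gt;
```python
clock_hz = 500_000_000        # P3 at 500MHz
events_per_second = 20_000
cpu_percent = 5

# Cycles per second available inside a 5% CPU budget.
budget_cycles = clock_hz * cpu_percent // 100

# Average cycles that budget leaves for each logged event.
cycles_per_event = budget_cycles // events_per_second

print(budget_cycles, cycles_per_event)
```
&lt;/pre&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; That works out to roughly 1250 cycles per event, which is in the same neighborhood as the 1500-2000 cycles quoted for delivering a single event.&lt;br&gt;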





&amp;nbsp;&amp;nbsp;&amp;nbsp; I'll start discussing the actual APIs in the next
entry -- those whose curiosity has been piqued can jump into the &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/perfmon/base/event_tracing.asp"&gt;MSDN
documentation&lt;/a&gt;, which is not very good IMO but better than
nothing.&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=422772" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="Performance" scheme="http://blogs.msdn.com/ryanmy/archive/tags/Performance/default.aspx" /></entry><entry><title>Misinformation and the Prefetch Flag</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2005/05/25/421882.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2005/05/25/421882.aspx</id><published>2005-05-26T02:27:00Z</published><updated>2005-05-26T02:27:00Z</updated><content type="html">&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Hello!&amp;nbsp; I haven't updated this blog in a
while; work and other events have conspired to keep me from
writing.&amp;nbsp; Also, blogs.msdn.com moved internally from .Text to
Telligent Community Server, and my CSS markup was an unfortunate
casualty of the move, so I'm working on redesigning the blog's visual
appearance.&amp;nbsp; More entries will be coming eventually.&amp;nbsp; :)&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In the meantime, I want to defuse a long-standing controversy -- the /prefetch flag.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;With modern computing, the absolute worst thing
you can ever do for performance is to touch the hard drive -- or
any non-memory storage for that matter.&amp;nbsp; The fastest hard drives
on earth are still horridly slow compared to a PC's main memory; even
with solid state drives, in order to access the drive,&amp;nbsp;one has to
jump into system code and drivers, and this will push your own
program's code out of the CPU's L2 cache.&amp;nbsp; (This is called a
locality loss.)&amp;nbsp; There are two typical reasons one has to touch the
disk --&amp;nbsp;the first is when the application requests it explicitly
(Word asks the OS to load blog.doc into memory), and the other is a
"hard fault" -- when the application tries to use memory that has been
paged out to disk via "virtual memory" and needs to be paged back in.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Now, imagine that a DVD player program always
starts playback by loading a DLL to decode MPEG-2 video.&amp;nbsp; Wouldn't
it be nice if we could attempt to pre-load the MPEG-2 DLL whenever we
loaded the DVD player's EXE?&amp;nbsp; That way, when it tries to run code
in that DLL, it doesn't have to hard fault and go to disk for
it!&amp;nbsp;&amp;nbsp; This&amp;nbsp;is what a prefetcher does: it tracks what
code pages are used by an application, and&amp;nbsp;the next time that
application loads, it loads those pages in advance as soon as it's got
some idle time.&amp;nbsp; A prefetcher was added to Windows in XP, and is
vastly improved in Windows Longhorn.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;XP systems have a Prefetch directory underneath
the windows root directory,&amp;nbsp;full of .pf files -- these are lists
of pages to load.&amp;nbsp; The file names are generated from hashing the
EXE to load -- whenever you load the EXE, we hash, see if there's a
matching &lt;strong&gt;(exename)&lt;/strong&gt;-(&lt;strong&gt;hash).pf&lt;/strong&gt;
file in the prefetch directory, and if so we load those pages.&amp;nbsp;
(If it doesn't exist, we track what pages it loads, create that file,
and pick a handful of them to save to it.)&amp;nbsp; So, first off, &lt;em&gt;it is a&amp;nbsp;bad idea to periodically clean out that folder&lt;/em&gt;
as some tech sites suggest.&amp;nbsp; For one thing, XP will just re-create
that data anyways; secondly, it trims the files anyways if there's ever
more than 128 of them so that it doesn't needlessly consume space.&amp;nbsp; So not only is deleting the directory
totally unnecessary, but you're also putting a temporary dent in your
PC's performance.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Secondly, one can specify a &lt;strong&gt;/prefetch:#&lt;/strong&gt;
flag when launching an app.&amp;nbsp; Many&amp;nbsp;people have noticed that
auto-generated shortcuts to Windows Media Player do this, and the
number varies depending on what it does.&amp;nbsp; For example, the
shortcut used by the shell when you double-click a WMV file to play it
has one prefetch number; the auto-run shortcuts to play or rip music
that appear when you insert a music CD have other numbers.&amp;nbsp; Some
sites have guessed that this switch turns on prefetching, and suggest
that you add that to every executable you care about -- this has
appeared on &lt;a href="http://www.iamnotageek.com/a/67-p1.php"&gt;so&lt;/a&gt; &lt;a href="http://www.pcmech.com/show/optimize/677/7"&gt;many&lt;/a&gt;, &lt;a href="http://www.winguides.com/forums/showflat.php?Cat=&amp;amp;Board=brdNewTweaks&amp;amp;Number=90351&amp;amp;page=6&amp;amp;view=collapsed&amp;amp;sb=5&amp;amp;part=1"&gt;many&lt;/a&gt;, &lt;a href="http://www.softwaretipsandtricks.com/windowsxp/articles/416/1/Load-Applications-Faster"&gt;&lt;em&gt;many&lt;/em&gt;&lt;/a&gt; &lt;a href="http://www.tweakguides.com/Firefox_12.html"&gt;sites&lt;/a&gt; that it has become urban legend.&amp;nbsp; &lt;a href="http://www.edbott.com/weblog/archives/000621.html"&gt;Other sites&lt;/a&gt;
write this off as garbage and guess that it's a switch specific to
Media Player, guessing from references to prefetching in the Windows
driver subsystem.&amp;nbsp; &lt;em&gt;Both guesses are incorrect.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The &lt;strong&gt;/prefetch:#&lt;/strong&gt; flag is looked at
by the OS when we create the process -- however, it has one (and only
one) purpose.&amp;nbsp; We add the passed number to the hash.&amp;nbsp;
Why?&amp;nbsp; WMP is a multipurpose application and may do many different
things.&amp;nbsp; The DLLs and code that it touches will be very different
when&amp;nbsp;playing a WMV than when playing a DVD, or when ripping a CD,
or when listening to a Shoutcast stream, or any of the other things
that WMP can do.&amp;nbsp; If we only had one hash for WMP, then the
prefetch would only be correct for one such use.&amp;nbsp; Having incorrect
prefetch data would not be a fatal error -- it'd just load pages into
memory that'd never get used, and then get swapped back out to disk as
soon as possible.&amp;nbsp; Still, it's counterproductive.&amp;nbsp; By
specifying a &lt;strong&gt;/prefetch:#&lt;/strong&gt; flag with a different number
for each "mode" that WMP can do, each mode gets its own separate hash
file, and thus we properly prefetch.&amp;nbsp; (This behavior isn't specific to WMP -- it does the same for any app.)&lt;br&gt;
&lt;/p&gt;
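&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Here is a toy Python model of that naming scheme, just to show why a different number lands in a different file.&amp;nbsp; The prefetch_file_name() helper is hypothetical, zlib.crc32 is only a stand-in for whatever hash XP really uses, and hashing the upper-cased path is an assumption made for this sketch; the only part taken from the description above is "hash, then add the number".&lt;/p&gt;
&lt;pre&gt;
```python
import zlib


def prefetch_file_name(exe_path, prefetch_number=0):
    """Toy model: hash the EXE path, add the /prefetch:# number, name the file.

    zlib.crc32 is a stand-in; the real XP hash function is not described here.
    """
    path_hash = zlib.crc32(exe_path.upper().encode("utf-8"))
    combined = (path_hash + prefetch_number) % 0x100000000  # keep 32 bits
    exe_name = exe_path.split("\\")[-1].upper()
    return "{}-{:08X}.pf".format(exe_name, combined)


wmp = "C:\\Program Files\\Windows Media Player\\wmplayer.exe"
print(prefetch_file_name(wmp))       # no flag
print(prefetch_file_name(wmp, 1))    # one "mode"
print(prefetch_file_name(wmp, 2))    # another "mode"
```
&lt;/pre&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The three calls print three different file names for the same EXE, so each mode gets its own list of pages.&amp;nbsp; Nothing else about the prefetcher's behavior changes.&lt;/p&gt;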
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;This flag is looked at when we create the first thread in the process, but it is &lt;em&gt;not&lt;/em&gt;
removed by CreateProcess from the command line, so any app that chokes
on unrecognized command line parameters will not work with it.&amp;nbsp;
This is why so many people notice that Kazaa and other apps crash or
otherwise refuse to start when it's added.&amp;nbsp; Of course, WMP knows
that it may be there, and just silently ignores its existence.&lt;/p&gt;
&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I suspect that the "add /prefetch:1 to make rocket
go now" urban legend will never die, though.&amp;nbsp; I know that at least
one major company ships products with it in their shortcuts, without
ever asking us... just for good measure, I guess.&amp;nbsp; :-P&amp;nbsp; All
it does is change your hash number -- the OS is doing exactly the same
thing it did before, and just saving the prefetch pages to a different
file.&lt;br&gt;
&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(ATTENTION: This is merely an informative
article; this information is completely unsupported, and the
functionality may change or disappear entirely in future versions of
Windows or service packs.&amp;nbsp; Furthermore, it is merely a hint for
the XP prefetcher, and it may choose to ignore it if it wishes.)&lt;/em&gt;&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=421882" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="Performance" scheme="http://blogs.msdn.com/ryanmy/archive/tags/Performance/default.aspx" /></entry><entry><title>Encodings in Strings are Evil Things (Part 8)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2005/01/17/354864.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2005/01/17/354864.aspx</id><published>2005-01-18T03:01:00Z</published><updated>2005-01-18T03:01:00Z</updated><content type="html">&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;As more Unicode encodings are being finished, I find myself wanting to actually start using rmstring in real situations.&amp;nbsp; However, most of my "real situations" involve legacy encodings.&amp;nbsp; So, I need to start cracking on transcoding.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The first concern is allowing adapters for arbitrary transcodings.&amp;nbsp; A tricky problem that's related to transcoding is collation (aka sorting) -- most people aren't aware that sorting strings is often a locale-dependent issue.&amp;nbsp; This is a localization problem.&amp;nbsp; Just to make sure that terminology is clear, &lt;strong&gt;internationalization&lt;/strong&gt; (often abbreviated to &lt;strong&gt;i18n&lt;/strong&gt;) is the act of coding a program such that it is entirely independent of location and language; the most classic example of i18n is moving all string literals into a binary resource within an EXE, so that the strings may be changed without modifying the program's logic.&amp;nbsp;&amp;nbsp;This is almost always paired&amp;nbsp;with &lt;strong&gt;localization&lt;/strong&gt;&amp;nbsp;(sometimes abbreviated to &lt;strong&gt;l10n&lt;/strong&gt;), which is the act of tailoring an 
already-internationalized program for a specific language/locale.&amp;nbsp; Internationalization may be done by any programmer; localization requires translators.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In the case of sorting,&amp;nbsp;a binary sort is often not enough.&amp;nbsp; Context is everything!&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Where do accented characters sort -- the same as their base characters, or after?&amp;nbsp; &lt;em&gt;(For French speakers, accented As come after Z.)&lt;/em&gt;&lt;/li&gt; &lt;li&gt;What are you sorting for?&amp;nbsp; &lt;em&gt;(German has a special sorting order for names in phone books!)&lt;/em&gt;&lt;/li&gt; &lt;li&gt;What about ligatures such as ch or fi?&amp;nbsp; &lt;em&gt;(Spanish speakers, for example, will sort character sequences starting in "ch" between "c" and "d", even though they recognize "ch" as two separate characters.)&lt;/em&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;For this reason,&amp;nbsp;developers using rmstring on Win32 platforms will almost certainly want to use a sorting predicate based on Win32's &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/resources/strings/stringreference/stringfunctions/comparestring.asp"&gt;CompareString&lt;/a&gt; or &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_5s2v.asp"&gt;LCMapString&lt;/a&gt; APIs.&amp;nbsp; For example:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;rmstring&amp;lt;ucs4, bytevector&amp;gt; getfirst( std::list&amp;lt;rmstring&amp;lt;utf8, bytevector&amp;gt; &amp;gt;&amp;nbsp;&amp;amp; lines ) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; std::sort( lines.begin(), lines.end(), win32_collator( LOCALE_USER_DEFAULT ) );&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; return (*lines.begin()).transcode&amp;lt;ucs4, bytevector&amp;gt;();&lt;br /&gt;}&lt;/font&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;This example 
is a bit contrived -- a real example would template the container and output encoding, and&amp;nbsp;make the LCID a&amp;nbsp;parameter with a default argument&amp;nbsp;-- but you get the point.&amp;nbsp; &lt;font face="Courier New"&gt;win32_collator&lt;/font&gt;, in this case, is a custom predicate for &lt;font face="Courier New"&gt;std::sort&lt;/font&gt; (see &lt;font face="Courier New"&gt;&amp;lt;algorithm&amp;gt;&lt;/font&gt;) that converts both strings to UTF-16 and then invokes &lt;strong&gt;CompareStringW&lt;/strong&gt; on them, throwing a &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; exception if there's a codepoint above 0x10FFFF that UTF-16 can't contain.&amp;nbsp; Of course, this will hardly be my primary solution!&amp;nbsp; More on that later.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Anyways, similar issues arise for transcoding.&amp;nbsp; (Not to mention the fact that &lt;font face="Courier New"&gt;win32_collator&lt;/font&gt; is, in fact, dependent on the ability to transcode, since the Win32 Unicode APIs expect UTF-16 strings.)&amp;nbsp; So, we must include pluggable transcoders.&amp;nbsp; So, we change our prototypes from Part 7 to include one more template argument, the transcoding tool:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class Engine, class SrcEnc, class SrcStore, class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;void transcode( const rmstring&amp;lt;SrcEnc, SrcStore&amp;gt; &amp;amp; src, rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; &amp;amp; tgt, Engine e = Engine()&amp;nbsp;);&lt;br /&gt;&lt;br /&gt;template &amp;lt;class Engine, class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; rmstring&amp;lt;SrcEnc, SrcStore&amp;gt;::transcode( Engine e = Engine(), TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;These functions now put off transcoding to the Engine object, whatever that 
may be.&amp;nbsp; In the Win32 vein, we could use &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp"&gt;MultiByteToWideChar&lt;/a&gt; and &lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp"&gt;WideCharToMultiByte&lt;/a&gt;&amp;nbsp;-- but that's too easy, not to mention very difficult to wrap.&amp;nbsp; I'd really like to do something that's solely C++ and entirely based in the &lt;a href="http://www.unicode.org/ucd/"&gt;Unicode Character Database&lt;/a&gt;'s mappings directory.&amp;nbsp; There are a few dilemmas to be solved for that.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of &lt;strong&gt;compatibility characters&lt;/strong&gt;.&amp;nbsp; Compatibility characters are &lt;strong&gt;canonically equivalent&lt;/strong&gt; to a sequence of one or more other Unicode characters; they are usually placed so that you have a single codepoint that's equivalent to a character in some older standard.&amp;nbsp; For example, ISO8859-2 defines &lt;strong&gt;0xA5&lt;/strong&gt; to be equivalent to a capital letter L with a caron accent (&amp;#x13D;).&amp;nbsp; The "simple" equivalent of this in Unicode is a capital letter L (&lt;strong&gt;U+004C&lt;/strong&gt;) followed by a combining caron (&lt;strong&gt;U+030C&lt;/strong&gt;); however, Unicode also defines a single pre-combined character, &lt;strong&gt;U+013D&lt;/strong&gt;, that is directly equivalent to those two.&amp;nbsp; Therefore, almost all legacy encodings can have a simple 1:1 function to go up to Unicode, in the form of a lookup table.&amp;nbsp; (Unfortunately, not all legacy encodings have a complete set of compatibility characters, so a LUT will not work for everything.)&amp;nbsp; Going back from Unicode to legacy is more problematic, however: we now have two equivalents to a given 
legacy character.&amp;nbsp; The most direct solution, it seems, is to generate a finite automaton (a DFA).&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I've been&amp;nbsp;working on the DFA for the last few days.&amp;nbsp; My main concern has been memory efficiency, and I can now get a complete set of typical round-trip encoding data to fit in under 8K per encoding, which fits nicely in cache.&amp;nbsp; Obviously, certain ones will be smaller, and certain ones will be larger (in particular KOI8 and other encodings with very large symbol sets).&amp;nbsp; The DFA solution is very clean though; the legacy-to-Unicode DFA takes in bytes and outputs 32-bit unsigned ints containing codepoints which are then re-encoded, and the Unicode-to-legacy DFA takes in codepoints and outputs bytes.&amp;nbsp; Legacy-to-legacy transcodes use UCS-4 as an intermediary.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;At this point, I'm now working on a program that reads in a file from &lt;a href="http://www.unicode.org/Public/MAPPINGS/"&gt;MAPPINGS&lt;/a&gt; and UnicodeData.txt from the &lt;a href="http://www.unicode.org/ucd/"&gt;Unicode Character Database&lt;/a&gt; and outputs the DFA in C++ format.&amp;nbsp; I'll post more when that's finished.&amp;nbsp; (I'm writing this entry pre-emptively, as this work-week looks like an absolute killer.)&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=354864" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="I18N" scheme="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx" /><category term="C++" scheme="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx" /></entry><entry><title>Encodings in Strings are Evil Things (Part 7)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2005/01/10/350325.aspx" 
/><id>http://blogs.msdn.com/ryanmy/archive/2005/01/10/350325.aspx</id><published>2005-01-11T03:11:00Z</published><updated>2005-01-11T03:11:00Z</updated><content type="html">&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Eugh.&amp;nbsp; Due to a three-part punch of piling-up work, time with family over the holidays, and being thoroughly sick, I haven't had much time to work on &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; -- which means, of course, that this hasn't updated.&amp;nbsp; I haven't given up on it though!&amp;nbsp; (I'm not dead!&amp;nbsp; I don't want to go on the cart...)&amp;nbsp; If anything, my desire to finish&amp;nbsp;it has increased, since I've been working on a set of internal utilities which parse text files to take instructions, and one keeps on thinking, "This would be so much easier if I just finished &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;..."&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, on to business.&amp;nbsp; First off, the all-important &lt;font face="Courier New"&gt;fixed_width_encoding&lt;/font&gt; class is done.&amp;nbsp; This critical class is the foundation of all encodings with a fixed number of bits per code point; it's templated on an intrinsic type that the implementor knows is 1/2/4 bytes.&amp;nbsp; The hardest part of an encoding, I've found, is writing the iterators; there are a huge number of methods that one must implement in order to make a 14882-compliant iterator.&amp;nbsp; The internals are mostly simple pointer arithmetic; just a lot to be tested.&amp;nbsp; (Yes, I have to write a test harness for this, if I want it to be approved for on-campus use :P)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;One annoyance that I've found is pointer type conversions; imagine that you've allocated a byte array for &lt;font face="Courier New"&gt;recv()&lt;/font&gt;ing something in from a TCP socket.&amp;nbsp; If we know that said content is UCS-4, the natural urge is to cast it to an &lt;font face="Courier 
New"&gt;unsigned long *&lt;/font&gt; to iterate over... except that you can't.&amp;nbsp; Or, at least, you shouldn't.&amp;nbsp; If that byte array isn't suitably aligned for 32-bit accesses, code will either run slowly (on x86 and AMD64) or &lt;strong&gt;crash&lt;/strong&gt; (on IA-64, unless &lt;font face="Courier New"&gt;&lt;a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/seterrormode.asp"&gt;SetErrorMode()&lt;/a&gt;&lt;/font&gt; is called to force OS alignment fixups, in which case it will run extremely slowly).&amp;nbsp; Of course, people do this all the time; you just can't guarantee that doing so is safe within the confines of strictly conformant code.&amp;nbsp; There is also no way for strictly conformant code to check if a given pointer is aligned, since&amp;nbsp;there is no operator to retrieve a type's alignment requirements.&amp;nbsp; The best you can do is assume that no type will have an alignment requirement greater than its size, and &lt;font face="Courier New"&gt;assert(0 == reinterpret_cast&amp;lt;size_t&amp;gt;(ptr) % sizeof(type))&lt;/font&gt;, which is thoroughly disgusting AND assumes certain things about the host's&amp;nbsp;virtual memory system&amp;nbsp;that may not be true.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Thus, I've opted for the simplest solution: a huge comment in the code that says &lt;em&gt;"These functions assume that the backing store's data() pointer is suitably aligned for Stride-sized accesses and that size() is a multiple of Stride's size.&amp;nbsp; Violating either of these assumptions will result in your program's untimely death."&lt;/em&gt;&amp;nbsp; Sometime later, I might come up with a helper function &lt;font face="Courier New"&gt;alignment_assert&amp;lt;T&amp;gt;(ptr)&lt;/font&gt; that takes advantage of compiler-specific extensions such as MSVC's &lt;font face="Courier New"&gt;__alignof&lt;/font&gt; if available.&amp;nbsp; Note that this also could potentially result in a 
Unicode stream that does not make much sense, such as combining characters that don't properly match base characters.&amp;nbsp; The Unicode standard notes that such a stream is not ill-formed, although it is not quite renderer-friendly; so, I'll support it.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I've also had occasion to rethink my plans for &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;.&amp;nbsp; Initially, I planned to use &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; in a way similar to the Boost &lt;font face="Courier New"&gt;lexical_cast&lt;/font&gt; pseudo-operator.&amp;nbsp; However, it disturbed me that doing so would mean that every call to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; would create a temporary in which to store the result, which would then make its way to final storage either by &lt;font face="Courier New"&gt;operator=&lt;/font&gt; or copy constructor.&amp;nbsp; I ended up realizing that a good 70% of the calls to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; would be writing the encode into a string that already existed.&amp;nbsp; So, instead, we now have the &lt;font face="Courier New"&gt;transcode&lt;/font&gt; function, which comes in both non-member and member flavors:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class SrcEnc, class SrcStore, class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;void transcode( const rmstring&amp;lt;SrcEnc, SrcStore&amp;gt; &amp;amp; src, rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; &amp;amp; tgt );&lt;br /&gt;&lt;br /&gt;template &amp;lt;class TgtEnc, class TgtStore&amp;gt;&lt;br /&gt;rmstring&amp;lt;TgtEnc, TgtStore&amp;gt; rmstring&amp;lt;SrcEnc, SrcStore&amp;gt;::transcode( TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;With the above, the originally envisioned &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; is now just syntactic 
sugar for a call to the source string's member &lt;font face="Courier New"&gt;transcode()&lt;/font&gt; function.&amp;nbsp; It also means that the code to do transcodes is now centralized within &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;.&amp;nbsp; Handy!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Oh, and since someone asked: I'm currently developing and testing this on&amp;nbsp;Visual C++&amp;nbsp;.NET 2003 and &lt;a href="http://www.nuwen.net/gcc.html#mingw"&gt;Stephan Lavavej's distribution of MinGW&lt;/a&gt;; I'll likely run it against Comeau as well to make sure it's kosher before I release the source to the public.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My goals for the next article are to have a few non-Unicode encodings done, so I can start testing out transcoding and flesh out the different encoding mechanisms.&amp;nbsp; My main dilemma is designing the symbol tables; I noted in Part 4 that I wanted to have the ability to pass different resolving engines to the transcoder such as a perfect lossless transcription, visual parity, error codes, etc.&amp;nbsp; Visual parity will be the hardest to do; in fact, I will likely not do it right away.&amp;nbsp; (Namely, because the Unicode tables do not contain such parity information.)&amp;nbsp; Another concern has been memory consumption of tables for encodings; I'll be tackling that shortly.&lt;/p&gt; &lt;p&gt;(Since this was mostly a "what happened while I was gone" article, no point summary.)&lt;/p&gt; &lt;p&gt;(Update 2pm: &lt;A href="http://blogs.msdn.com/michkap/"&gt;Michael Kaplan&lt;/a&gt; nudged me a bit that I had broken my previous insistence on "code point" versus "character" terminology -- that's what I get for stepping away from the blog for two weeks!&amp;nbsp; Terminology corrected; anyone who doesn't know the difference between code points and characters needs to go back and read this blog from the beginning, or at least Part 5.)&lt;/p&gt;&lt;img 
src="http://blogs.msdn.com/aggbug.aspx?PostID=350325" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="I18N" scheme="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx" /><category term="C++" scheme="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx" /></entry><entry><title>Encodings in Strings are Evil Things (Part 6)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2004/11/04/252439.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2004/11/04/252439.aspx</id><published>2004-11-04T18:19:00Z</published><updated>2004-11-04T18:19:00Z</updated><content type="html">&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;First, I apologize for not updating recently -- at work, my dev machine's power supply died, and took my hard drive with it.&amp;nbsp; Luckily, I had everything backed up; however,&amp;nbsp;I had to copy everything over to, and work on,&amp;nbsp;a single-monitor Longhorn dogfood box with no major apps installed.&amp;nbsp; This&amp;nbsp;went on for&amp;nbsp;a week and a half while I waited for Dell to&amp;nbsp;slog through&amp;nbsp;the warranty process for new parts and have them installed by a Dell-authorized tech (in order to keep the warranty going)&amp;nbsp;and this put me behind schedule for several deadlines.&amp;nbsp; So, now that my dev machine has a new PSU and HDD I've been frantically working to get caught up on things, and this has left little time for the blog.&amp;nbsp; In about two weeks these deadlines will be behind me, and I can start posting with regularity again.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Also, at this point I'm now primarily doing implementation of previously discussed ideas, so this series of posts will temporarily serve two purposes: discussion of issues, and journal of coding concerns about implementing this in C++.&amp;nbsp; And this post concerns one of the C++ concerns: 
how do you define &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; for a string that's in a variable-width encoding such as UTF-8?&amp;nbsp; One of the basic assumptions in &lt;font face="Courier New"&gt;std::string&lt;/font&gt; that I intend to honor is that &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; returns a reference to the actual data, not a copy.&amp;nbsp; For fixed-width encodings such as ASCII, UCS2, or UCS4, this is not a problem; I simply return a reference to an &lt;font face="Courier New"&gt;unsigned char&lt;/font&gt;/&lt;font face="Courier New"&gt;short&lt;/font&gt;/&lt;font face="Courier New"&gt;long&lt;/font&gt;.&amp;nbsp; However, for variable-width encodings, I need to return a range of bytes, and presumably a size as well.&amp;nbsp; I could do this with covariant returns and unions, but this is horribly ugly -- and I'd need a lot of different returns, since UTF-8 alone can use up to six bytes for a single code point under its original definition (RFC 3629 caps it at four).&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My solution is to return a proxy object, &lt;font face="Courier New"&gt;MultiByteChar&lt;/font&gt;.&amp;nbsp; When I initially decided on this, one of my coworkers pointed out that I would run into the same problem as &lt;font face="Courier New"&gt;vector&amp;lt;bool&amp;gt;&lt;/font&gt;.&amp;nbsp; The Vector Wrapper Problem, as&amp;nbsp;some refer to it,&amp;nbsp;deserves a bit of discussion.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The C++ standard requires that all implementations of the STL container &lt;font face="Courier New"&gt;std::vector&amp;lt;T&amp;gt;&lt;/font&gt; include a specialization &lt;font face="Courier New"&gt;vector&amp;lt;bool&amp;gt;&lt;/font&gt; that stores the bits in packed form.&amp;nbsp; (Contrast&amp;nbsp;with an array of bools -- bools can be stored in memory as if they were any of several integral types, depending on the situation and the intelligence of the compiler.)&amp;nbsp; In this case, if &lt;font face="Courier 
New"&gt;operator[]&lt;/font&gt; returns a bool, you cannot write expressions such as &lt;font face="Courier New"&gt;a[3] = true;&lt;/font&gt; -- there's no bool back there!&amp;nbsp; You need to return a proxy object containing a pointer/reference to the source container, with &lt;font face="Courier New"&gt;operator=&lt;/font&gt; overloaded, in order to support assignment in this manner.&amp;nbsp; However, this breaks with the definition of &lt;font face="Courier New"&gt;std::vector&amp;lt;T&amp;gt;&lt;/font&gt; -- the standard simultaneously claims that any &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; on a &lt;font face="Courier New"&gt;vector&lt;/font&gt; must return some type that is convertible to &lt;font face="Courier New"&gt;T &amp;amp;&lt;/font&gt;.&amp;nbsp; This bit of doublespeak results in the inability to reliably write certain types of wrappers around&amp;nbsp;vector that can accept bool.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My belief is that this was an oversight of the standardization committee.&amp;nbsp; They took the first step towards solving this by defining &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; (and the iterator's dereference operators) as returning a member typedef, &lt;font face="Courier New"&gt;reference&lt;/font&gt;; however, they stopped short of the goal by saying that &lt;font face="Courier New"&gt;reference&lt;/font&gt; had to be defined from the allocator for the vector.&amp;nbsp; A better solution would be to define a set of semantics and overloaded operators that suitably encapsulate the intent, purpose, and behavior of references, and to define this as a &lt;em&gt;Reference&lt;/em&gt; typeclass.&amp;nbsp; They could then simply require that &lt;font face="Courier New"&gt;reference&lt;/font&gt; be some type meeting the &lt;em&gt;Reference(T)&lt;/em&gt; requirements, and all would be well.&amp;nbsp; This is what I intend to do.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The only remaining 
question is how to handle assignment; at first I planned to make it read-only, but later decided&amp;nbsp;to maintain a reference to the host string and call &lt;font face="Courier New"&gt;replace()&lt;/font&gt; on the&amp;nbsp;encoding/store in response to an &lt;font face="Courier New"&gt;operator=&lt;/font&gt;.&amp;nbsp; This means that a &lt;font face="Courier New"&gt;MultiByteChar&lt;/font&gt; must be templated on the source string in order to be typesafe.&amp;nbsp; This brings up the question of the string's lifetime and the ref's lifetime being separate; however, traditional C++ says that operations such as destruction may invalidate iterators/references/etc. anyways.&amp;nbsp; In this case, I think it's reasonable for the same rule to apply.&amp;nbsp; (This also means it's okay to use a member reference variable; in almost every case, pointers&amp;nbsp;are preferable, since references cannot be assigned to, only copy-constructed.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;As far as implementation goes, I've completed the &lt;font face="Courier New"&gt;unmanaged_ptr&lt;/font&gt; and &lt;font face="Courier New"&gt;vector_of_bytes&lt;/font&gt; backing stores, and am currently working on the &lt;font face="Courier New"&gt;fixed_width_encoding&lt;/font&gt; parent class that all fixed width encodings such as UCS2 and ASCII derive from.&amp;nbsp; Next post, I will likely talk about the interactions of encoding and backing store classes, and how I've divided responsibilities between them.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;To finish this post off, though, a quick oddity about the use of &lt;font face="Courier New"&gt;widen()&lt;/font&gt; in iostreams.&amp;nbsp; &lt;font face="Courier New"&gt;widen()&lt;/font&gt; is defined on streams as handling certain platform-specific character conversions, such as converting &lt;font face="Courier New"&gt;'\n'&lt;/font&gt; to the appropriate end-of-line sequence on your platform (LF for Unix and Mac OS X, CRLF for 
Windows, CR for Classic Mac OS).&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout &amp;lt;&amp;lt; '\n';&lt;/font&gt; outputs &lt;font face="Courier New"&gt;cout.widen('\n')&lt;/font&gt;, as you'd expect.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout &amp;lt;&amp;lt; "\n";&lt;/font&gt; iterates through all characters in the string (as reported&amp;nbsp;by &lt;font face="Courier New"&gt;char_traits&amp;lt;char&amp;gt;::length()&lt;/font&gt;) and outputs the result of &lt;font face="Courier New"&gt;cout.widen()&lt;/font&gt; on each one, as you'd expect.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;&lt;font face="Courier New"&gt;cout &amp;lt;&amp;lt; string("\n");&lt;/font&gt; does NOT widen characters.&amp;nbsp; It directly asks for cout's &lt;font face="Courier New"&gt;streambuf&lt;/font&gt;, and &lt;font face="Courier New"&gt;xsputn()&lt;/font&gt;'s the entire contents of &lt;font face="Courier New"&gt;data()&lt;/font&gt; into it.&amp;nbsp; Do not pass locale, do not collect i18n.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm still thinking over how I want to define my behavior for &lt;font face="Courier New"&gt;operator&amp;lt;&amp;lt;&lt;/font&gt;.&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=252439" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="I18N" scheme="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx" /><category term="C++" scheme="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx" /></entry><entry><title>Encodings in Strings are Evil Things (Part 5)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2004/10/25/247677.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2004/10/25/247677.aspx</id><published>2004-10-26T01:46:00Z</published><updated>2004-10-26T01:46:00Z</updated><content 
type="html">&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/22/246539.aspx"&gt;In our last episode&lt;/a&gt;, we briefly discussed possible behaviors for &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;, and we discussed how the STL's &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; class was structured -- namely, we noted that it had several core functions that were overloaded many times for various types of input.&amp;nbsp; We also noted that we could avoid many of the implementation headaches that result,&amp;nbsp;because of our decision to generalize our backing store.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; One of my coworkers pointed out that Herb Sutter had already done an excellent dissection of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; in &lt;a href="http://www.gotw.ca/publications/xc++s.htm"&gt;Exceptional C++ Style&lt;/a&gt; -- and, indeed, the last four chapters of the book are spent analyzing its structure, breaking it down to the core functions, and then implementing many of the functions and overloads as non-member template functions.&amp;nbsp; However, he's not looking to improve &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;'s foundation -- he's merely explaining how reducing the number of methods in &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; makes the code much easier to maintain.&amp;nbsp; (For example, rather than writing an &lt;font face="Courier New"&gt;empty()&lt;/font&gt; member function, he writes a templated empty function that takes a STL&amp;nbsp;string or container, and returns true if the string's begin and end iterators are equal.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Furthermore, he specifically chooses some less-than-ideal but good-enough implementations as a result of making simplicity the primary goal.&amp;nbsp; For example, in his implementation of &lt;font face="Courier New"&gt;resize()&lt;/font&gt;, he 
implements the shrinking case by using a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; constructor to make a copy of the first N characters of the string, and then calls &lt;font face="Courier New"&gt;swap()&lt;/font&gt;, so he's incurring an unnecessary memory allocation and deallocation there.&amp;nbsp; While Sutter's treatment is good, we have a slightly more ambitious goal in mind (making a better class to replace &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, rather than merely improving upon the existing implementation through decomposition), so we're not duplicating effort.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; That said, I agree with his approach of decomposing functions with many overloads such as insert and replace, especially considering that our choice to generalize backing stores eliminates most of my objections to his techniques.&amp;nbsp; So, I've decided to make a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; class after all, in a sense.&amp;nbsp; The &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; class will have a single member function for each major piece of functionality, such as insertion or replacement or concatenation.&amp;nbsp; We'll then make an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; wrapper class that provides overloads in a way that makes it roughly equivalent to std::string.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Now, on to a concern I alluded to in the last entry: distinguishing code points from characters.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Up until now, I've specifically used the term "code point" to refer to a single symbol in the Unicode/UCS tables, even though Unicode refers to them as characters.&amp;nbsp; I chose to do this because of the existence of "combining characters", which are symbols associated with the previous "base character" such as accents, enclosing boxes/circles, formatting marks for subscript/superscript, and so 
on.&amp;nbsp; Unicode contains unaccented base characters, combining characters, and "precomposed characters" that use a single codepoint to represent a pre-accented base character.&amp;nbsp; Precomposed characters are always considered canonically equivalent to some combination of a base character and one or more combining characters.&amp;nbsp; (See &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/18/244284.aspx"&gt;Part 1&lt;/a&gt; for an example of this.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Unicode&amp;nbsp;defines a set of &lt;a href="http://www.unicode.org/unicode/reports/tr15/"&gt;normalization forms&lt;/a&gt; that are used to standardize whether to favor combining characters or precomposed characters.&amp;nbsp; However, regardless of whether precomposed characters are favored or not, there are some character sequences which do not have precomposed equivalents and &lt;strong&gt;must &lt;/strong&gt;be represented using&amp;nbsp;combining characters.&amp;nbsp; To make things even nastier, there are some combining characters, most notably double diacritics, that can span multiple base characters.&amp;nbsp; (And I haven't even gotten into Arabic and Hebrew scripts that can change the direction of rendering in mid-string!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Of course, our problem here is that most programmers don't think about accents as being distinct elements to iterate through!&amp;nbsp; When you hit the right arrow in Microsoft Word to skip over an &amp;Agrave;, you don't go first to an A and then to the A's accent -- you move past the whole "character."&amp;nbsp; (Unicode refers to this rough definition of&amp;nbsp;character as a "grapheme cluster," FYI.)&amp;nbsp; If it weren't for double diacritics, we could shrug and say "Well, a character is a base codepoint plus zero or more combining codepoints."&amp;nbsp; But it may not be that easy.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; After taking a walk to think it over, I ended up 
deciding to err on the side of the Unicode standard -- we'll treat double diacritics as a glyph problem.&amp;nbsp; That is, a double diacritic is attached to the preceding base codepoint only, and the fact that it extends over the following base codepoint as well is a glyphing concern.&amp;nbsp; (This is also due to the fact that most of the double diacritics can also be represented as a pair of "combining half marks" where half of the glyph is applied to each character as two separate combining characters, and the glyphing engine is expected to recognize this and render it as a single glyph.)&amp;nbsp; So, we can say that a grapheme cluster is a base character, plus zero or more combining code points, plus any uses of the &lt;em&gt;Combining Grapheme Joiner&lt;/em&gt; codepoint.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; So, do we want &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; to take integer index arguments, iterators, etc.&amp;nbsp;as referring to code points, or to grapheme clusters?&amp;nbsp; For the sake of programmer familiarity, we're going to default to clusters, but we'll allow code points.&amp;nbsp; We will have a single iterator class that takes a bool at construction describing whether &lt;font face="Courier New"&gt;advance()&lt;/font&gt; and related methods should advance by codepoint or by cluster.&amp;nbsp; Our begin, end, and other such iterator methods will be templated with a template argument that defaults to clusters; thus, you can ask for a code point iterator by calling &lt;font face="Courier New"&gt;str.begin&amp;lt;codepoints&amp;gt;()&lt;/font&gt;.&amp;nbsp; This is a bit messy, but workable.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Before, we listed the methods that seemed worthwhile to carry over.&amp;nbsp; However, many of them can be implemented as versions of the others.&amp;nbsp; Tomorrow, we'll actually write a complete header for &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; and start 
implementing it.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; That, and I think it's about time I go buy a hardcover copy of the Unicode standard, as I have way too many PDFs on my desktop right now.&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=247677" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="I18N" scheme="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx" /><category term="C++" scheme="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx" /></entry><entry><title>Encodings in Strings are Evil Things (Part 4)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2004/10/22/246539.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2004/10/22/246539.aspx</id><published>2004-10-22T23:42:00Z</published><updated>2004-10-22T23:42:00Z</updated><content type="html">&lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/20/245417.aspx"&gt;In our last episode&lt;/a&gt;, we established that we wouldn't be able to make a true &lt;font face="Courier New"&gt;std::string&lt;/font&gt; replacement and still handle variable-width encodings.&amp;nbsp; So, we started with the beginning lines of an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; class.&amp;nbsp; However, this doesn't mean we are going to dispense with &lt;font face="Courier New"&gt;std::string&lt;/font&gt; entirely!&amp;nbsp; But first, a quick answer about my choice of names and an explanation about exceptions.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;A friend of mine asked me yesterday, "Don't you intend to make a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; and then have a typedef'd &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; that hardwires a specific specialization, like ASCII?"&amp;nbsp; I'm considering this -- but if I hardwire anything, it will &lt;em&gt;not 
&lt;/em&gt;be the encoding type.&amp;nbsp; Trying to abstract away the encoding as hidden information is exactly the thinking that got us into this mess with &lt;font face="Courier New"&gt;std::string&lt;/font&gt;!&amp;nbsp; However, what we use for the backing store might be worth standardizing.&amp;nbsp; After all, using a &lt;font face="Courier New"&gt;vector&amp;lt;byte&amp;gt;&lt;/font&gt; to contain our bitstream is a very flexible choice; it's just not the best-performing one.&amp;nbsp; Whenever possible, we should make a library easy to use on the surface, and expose the guts of it to be changed once someone already has the program running and is trying to improve on it (by, for example, using string literals as backing stores and only copying them to heap memory when needed.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In a dream world, we would typedef a partial specialization.&amp;nbsp; However, we get bit by one of the most annoying mis-features in C++ -- &lt;a href="http://www.gotw.ca/gotw/079.htm"&gt;you can't template a typedef&lt;/a&gt;.&amp;nbsp; Even the STL is crippled by this, and has to work around it using its &lt;font face="Courier New"&gt;::rebind&lt;/font&gt; member.&amp;nbsp; So, the best we could do is allow someone to &lt;font face="Courier New"&gt;#define rmstring(enc) basic_rmstring&amp;lt;enc, vector_of_bytes&amp;gt;&lt;/font&gt;, and declare a string as &lt;font face="Courier New"&gt;rmstring(iso8859_1) str;&lt;/font&gt;..&amp;nbsp;&amp;nbsp;It'd work, but it makes me cringe.&amp;nbsp; Alternately, we could use a rebind approach like the STL:&amp;nbsp;&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt;class Enc&amp;gt; struct rmstring {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font color="#000080"&gt;&lt;font face="Courier New"&gt;typedef&amp;nbsp;basic_rmstring&amp;lt;Enc, vector_of_bytes&amp;gt; type;&lt;br /&gt;};&lt;br /&gt;&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier 
New"&gt;rmstring&amp;lt;iso8859_1&amp;gt;::type str;&lt;/font&gt;&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Really, both of them are pretty damned ugly; the preprocessor approach is prettier,&amp;nbsp;IMHO, but is also considerably more dangerous.&amp;nbsp; So, I'm going to leave it as&amp;nbsp;&lt;font face="Courier New"&gt;rmstring&lt;/font&gt; with two template parameters for the purposes of this&amp;nbsp;blog.&amp;nbsp;&amp;nbsp;Eventually I'll probably opt for the &lt;font face="Courier New"&gt;#define&lt;/font&gt; for my own&amp;nbsp;version of the library, but you can choose whichever is more appealing to you (conciseness versus typesafety), or choose neither.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The second thing I wanted to address from yesterday was the pair of exceptions, &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; and &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt;, that I listed next to the &lt;font face="Courier New"&gt;encoding_cast()&lt;/font&gt; function.&amp;nbsp; What on earth are they for?&amp;nbsp; First off, imagine that you're trying to convert a string from UCS-4 to UCS-2.&amp;nbsp; As I mentioned in &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx"&gt;Part 2&lt;/a&gt;, UCS-2 is a non-universal encoding, and there are some code points that it cannot represent.&amp;nbsp; What happens if our UCS-4 string contains one of those code points?&amp;nbsp; In this case, we will throw the &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; exception.&amp;nbsp; We will also throw it in the case of converting to legacy character sets that simply do not have a code point defined for a symbol.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;There's something to keep in mind, though.&amp;nbsp; The popularity of JPEG proves that a lossless transform is not always necessary.&amp;nbsp; Imagine that we have the Latin letter &lt;strong&gt;&amp;AElig;&lt;/strong&gt; -- is it 
acceptable to convert this to two characters, &lt;strong&gt;AE&lt;/strong&gt;?&amp;nbsp; The proper answer is neither yes nor no; it's "sometimes."&amp;nbsp;&amp;nbsp;Remember, all this time, our definitions of string have been derived from a definition of symbols&amp;nbsp;that a human interprets -- and this means that whether or not a&amp;nbsp;'close enough'&amp;nbsp;translation is acceptable depends on who's looking at the string.&amp;nbsp; Imagine that a blind person is using a screenreader (a program that uses a computerized voice to read text as it appears on the screen).&amp;nbsp; In that case, there's a vast difference between &lt;strong&gt;&amp;AElig;&lt;/strong&gt; and &lt;strong&gt;AE&lt;/strong&gt;.&amp;nbsp; For a person with normal sight reading a webpage, however, the two might be interchangeable.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The computer scientist in me says that I should only allow lossless transforms -- the engineer in me knows better, though, and there's a way to satisfy both.&amp;nbsp; Therefore, we are going to add a third template argument to yesterday's definition of&amp;nbsp;&lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt;, and give it a default value.&amp;nbsp; This argument will be called the "symbol clash resolver" and has a well-known method that is invoked whenever a missing symbol problem occurs.&amp;nbsp; The default one, &lt;font face="Courier New"&gt;lossless_resolver&lt;/font&gt;, will throw &lt;font face="Courier New"&gt;missing_symbol&lt;/font&gt; in all cases.&amp;nbsp; A user can define alternatives, though.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Two possible alternatives immediately occur to me -- one called &lt;font face="Courier New"&gt;visual_parity_resolver&lt;/font&gt; that does replacements like the above, and another called &lt;font face="Courier New"&gt;error_symbol_resolver&lt;/font&gt; that acts like RS232's error character, inserting a compile-time 
constant instead (such as a box symbol, or an "&amp;lt;ERROR&amp;gt;" string, or whatever suits the user) whenever a symbol cannot be translated.&amp;nbsp; But those can all wait for later -- only &lt;font face="Courier New"&gt;lossless_resolver&lt;/font&gt; needs to be immediately defined, and its definition is trivial, since it just throws :)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The other exception, &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt;, is thrown if we try to decode a buffer that has an error in it.&amp;nbsp; In the case of UTF-8, there are sequences that decode to illegal or nonsensical numbers, and if we&amp;nbsp;are asked to decode these sequences, we should let the user know.&amp;nbsp; Imagine a scenario where you are writing an Internet&amp;nbsp;server daemon, and expect to receive a UTF-8 encoded string as the first transmission following a client successfully connecting.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In this scenario, we &lt;font face="Courier New"&gt;recv()&lt;/font&gt; the data from the client into a buffer, and then construct an &lt;font face="Courier New"&gt;rmstring&amp;lt;utf8, &lt;/font&gt;&lt;font face="Courier New"&gt;unmanaged_pointer&amp;gt;&lt;/font&gt; to read it.&amp;nbsp; If there was an error in network transmission, or a malicious client was testing our ability to handle bad data, we should communicate this to the programmer as an error.&amp;nbsp; Thus, if an encoding can detect illegal input (very few encodings can!) 
it may throw a &lt;font face="Courier New"&gt;malformed_data&lt;/font&gt; exception&amp;nbsp;if you invoke&amp;nbsp;any operations that hit that input,&amp;nbsp;or if you attempt to trans-code it.&amp;nbsp; We will also probably want to make a compile-time flag visible on the encoding class that determines whether or not it can have malformed data.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, with those two issues resolved, let's get down to our dirty business!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I said earlier that we had to pick one of two mutually exclusive goals: be a&amp;nbsp;perfect drop-in replacement for &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, or support variable-width encodings such as UTF-8.&amp;nbsp; Since I think &lt;font face="Courier New"&gt;std::string&lt;/font&gt; is poorly designed &lt;strong&gt;&lt;em&gt;and&lt;/em&gt;&lt;/strong&gt; I demonstrated that not being string-compatible is only a loss for stringstream compatibility, I'm favoring the latter.&amp;nbsp; (Just hating &lt;font face="Courier New"&gt;std::string&lt;/font&gt; alone would not be sufficient reason -- in that case I'd just be suffering from&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Not_Invented_Here"&gt;NIH syndrome&lt;/a&gt;.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;However, this doesn't mean that I can just go roll my own string class in the way that best suits my urges.&amp;nbsp; Many programmers have devoted considerable time and energy to learning &lt;font face="Courier New"&gt;std::string&lt;/font&gt;'s ins and outs, myself included -- so, I should exploit that knowledge by providing similar functions with similar arguments, as long as it doesn't compromise my design's principles.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Looking at &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;'s definition in the C++ Standard is an exercise in mental stamina.&amp;nbsp; It defines six constructors (one of which requires 
some very &lt;a href="http://www.mpi-sb.mpg.de/~kettner/courses/lib_design_03/notes/meta.html"&gt;special trickery with templating and the SFINAE principle&lt;/a&gt; to implement, as we'll see later) and over 100 methods, plus a host of non-member operators such as &lt;font face="Courier New"&gt;&amp;lt;&amp;lt;&lt;/font&gt; and &lt;font face="Courier New"&gt;+&lt;/font&gt;.&amp;nbsp; However, looking at the expected behavior for each function, most of them are overloads that call a base function.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In other words, a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; has one or two core definitions at most for each core method (such as &lt;font face="Courier New"&gt;append()&lt;/font&gt;, &lt;font face="Courier New"&gt;replace()&lt;/font&gt;, &lt;font face="Courier New"&gt;insert()&lt;/font&gt;, etc.), which take &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;s as their input.&amp;nbsp; Every other overload is defined as equivalent to calling that root function, with a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; constructor meant to convert some other form of string (char pointer, run of chars, pair of iterators, etc.) 
to a &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; that the "core implementation" can grok.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Of course, real implementations don't all work like that, because it'd mean frivolously making a copy of the input data in &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; form for each trivial overload.&amp;nbsp; Instead, a typical implementation of &lt;font face="Courier New"&gt;std::string&lt;/font&gt; has an optimized version for each&amp;nbsp;variant, making maintenance a nightmare.&amp;nbsp; But we don't have that problem -- because, instead of requiring an STL allocator, we can accept an arbitrary backing store!&amp;nbsp; So, suppose we have a working implementation of append:&lt;/p&gt; &lt;p&gt;&lt;font face="Courier New" color="#000080"&gt;template &amp;lt; class Encoding, class BackingStore &amp;gt; class rmstring {&lt;br /&gt;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// &lt;strong&gt;Appends &lt;em&gt;n&lt;/em&gt;&amp;nbsp;codepoints of &lt;em&gt;str&lt;/em&gt;, starting at &lt;em&gt;pos&lt;/em&gt;, to the&amp;nbsp;string.&lt;/strong&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;//&amp;nbsp;* Will throw an out_of_range exception if &lt;em&gt;pos&lt;/em&gt; &amp;gt;= &lt;em&gt;str&lt;/em&gt;.length()&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;//&amp;nbsp;* If &lt;em&gt;pos&lt;/em&gt; is in range, but&amp;nbsp;&lt;em&gt;pos&lt;/em&gt; +&amp;nbsp;&lt;em&gt;n&lt;/em&gt;&amp;nbsp;&amp;gt; &lt;em&gt;str&lt;/em&gt;.length(), &lt;em&gt;n&lt;/em&gt; is&amp;nbsp;truncated so that &lt;em&gt;pos&lt;/em&gt; + &lt;em&gt;n&lt;/em&gt; = &lt;em&gt;str&lt;/em&gt;.length().&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;//&amp;nbsp;* Will throw a length_error exception if the resulting string would be larger than&amp;nbsp;&lt;em&gt;BackingStore&lt;/em&gt;'s max_size().&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt; class OtherBS &amp;gt; rmstring &amp;amp; append( rmstring&amp;lt;Encoding, 
OtherBS&amp;gt;&amp;nbsp;const &amp;amp; str, size_type pos, size_type n ) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font face="Courier New"&gt;&lt;font color="#000080"&gt;&lt;em&gt;/* implementation */&lt;br /&gt;&lt;/em&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;...&lt;br /&gt;};&lt;/font&gt;&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(Note that I've defined the above in terms of code points, not symbols.&amp;nbsp; There can be multiple codepoints representing a single symbol.&amp;nbsp; I'll discuss this problem, and the related problem of Unicode normalization forms, in a later post -- mainly because I'm still working on a solution.&amp;nbsp; :-P This is a learning exercise for me too!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Because &lt;font face="Courier New"&gt;OtherBS&lt;/font&gt; is arbitrary, we can directly implement the other overloads of &lt;font face="Courier New"&gt;append()&lt;/font&gt; as calls to &lt;font face="Courier New"&gt;append()&lt;/font&gt; with an &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; constructor, without worrying about needlessly duplicating information.&amp;nbsp; If we want to use a &lt;font face="Courier New"&gt;char *&lt;/font&gt; from an ANSI C function, we can just use an &lt;font face="Courier New"&gt;unmanaged_pointer&lt;/font&gt; backing store.&amp;nbsp; If we want to use n repetitions of some character c, we can just use a &lt;font face="Courier New"&gt;run_of_chars&amp;lt;n, c&amp;gt;&lt;/font&gt; backing store.&amp;nbsp; We pass the &lt;em&gt;exact same information&lt;/em&gt; as if we were doing it the old way, but abstracted inside a templated class, so there's no overhead except at compile time.&amp;nbsp; Beautiful!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, what should we implement from &lt;font face="Courier New"&gt;std::string&lt;/font&gt;?&amp;nbsp; Here are the core functions from &lt;font face="Courier 
New"&gt;basic_string&lt;/font&gt; that seem worth carrying over:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;strong&gt;Size functions&lt;/strong&gt;: &lt;font face="Courier New"&gt;size()&lt;/font&gt; and &lt;font face="Courier New"&gt;length()&lt;/font&gt;, &lt;font face="Courier New"&gt;max_size()&lt;/font&gt;, &lt;font face="Courier New"&gt;capacity()&lt;/font&gt;, &lt;font face="Courier New"&gt;reserve()&lt;/font&gt;, &lt;font face="Courier New"&gt;resize()&lt;/font&gt;, &lt;font face="Courier New"&gt;empty()&lt;/font&gt;, &lt;font face="Courier New"&gt;clear()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Iterators&lt;/strong&gt;: &lt;font face="Courier New"&gt;begin()&lt;/font&gt;, &lt;font face="Courier New"&gt;end()&lt;/font&gt;, &lt;font face="Courier New"&gt;rbegin()&lt;/font&gt;, &lt;font face="Courier New"&gt;rend()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Accessors&lt;/strong&gt;: &lt;font face="Courier New"&gt;operator[]&lt;/font&gt;, &lt;font face="Courier New"&gt;at()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Replacers&lt;/strong&gt;: &lt;font face="Courier New"&gt;assign()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator=&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Appenders&lt;/strong&gt;: &lt;font face="Courier New"&gt;push_back()&lt;/font&gt;, &lt;font face="Courier New"&gt;push_front()&lt;/font&gt;, &lt;font face="Courier New"&gt;append()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator+=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator+&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Modifiers&lt;/strong&gt;: &lt;font face="Courier New"&gt;insert()&lt;/font&gt;, &lt;font face="Courier New"&gt;erase()&lt;/font&gt;, &lt;font face="Courier New"&gt;replace()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Searchers&lt;/strong&gt; (evil): &lt;font face="Courier New"&gt;find()&lt;/font&gt;, &lt;font face="Courier New"&gt;rfind()&lt;/font&gt;, 
&lt;font face="Courier New"&gt;find_first_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_last_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_first_not_of()&lt;/font&gt;, &lt;font face="Courier New"&gt;find_last_not_of()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Utilities&lt;/strong&gt;: &lt;font face="Courier New"&gt;substr()&lt;/font&gt;, &lt;font face="Courier New"&gt;copy()&lt;/font&gt;, &lt;font face="Courier New"&gt;swap()&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Comparators&lt;/strong&gt; (also evil): &lt;font face="Courier New"&gt;compare()&lt;/font&gt;, &lt;font face="Courier New"&gt;operator==&lt;/font&gt;, &lt;font face="Courier New"&gt;operator!=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;lt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;lt;=&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;=&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Streams:&lt;/strong&gt; &lt;font face="Courier New"&gt;operator&amp;lt;&amp;lt;&lt;/font&gt;, &lt;font face="Courier New"&gt;operator&amp;gt;&amp;gt;&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;strong&gt;Backwards compatibility:&lt;/strong&gt; &lt;font face="Courier New"&gt;c_str()&lt;/font&gt;, &lt;font face="Courier New"&gt;data()&lt;br /&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;That's a lot of stuff to implement!&amp;nbsp; But not only does it gain us good-will by allowing programmers to code much like they did with &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, it also means that we can make a &lt;font face="Courier New"&gt;typedef rmstring&amp;lt;&lt;em&gt;RMS_COMPILER_SPECIFIC_ENCODING&lt;/em&gt;, vector_of_bytes&amp;gt;&amp;nbsp;rstring&lt;/font&gt;, and be pretty damned close to &lt;font face="Courier New"&gt;std::string&lt;/font&gt;-equivalent.&amp;nbsp; (The compiler-specific encoding can 
be set in a header file, or specified on the command line -- I'll likely set it to &lt;font face="Courier New"&gt;iso8859_1&lt;/font&gt; for string and &lt;font face="Courier New"&gt;ucs2&lt;/font&gt; for wstring in a header.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;But before I get to that, I'll have a nastier problem to tackle, and that's combining characters.&amp;nbsp; Not only do we have codepoints that can take up variable amounts of space (thanks to encoding), but we also have symbols that can take up variable numbers of codepoints!&amp;nbsp; (See Part 1 and search for "diaeresis" if you're not sure why this is.)&amp;nbsp; Unicode, luckily, comes to the rescue again with a standard that determines when and how a character symbol should or should not be broken down into combining characters.&amp;nbsp;&amp;nbsp;These are called&amp;nbsp;normalization forms, and we'll tackle those on Monday.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Next episode: Normalization forms and chain of command (which does not involve rmstring covering its ass if things go FUBAR).&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Takeaways from Part 4:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;We're specifically designing &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; to&amp;nbsp;force the programmer into awareness of encodings -- we don't want&amp;nbsp;to hide that with a &lt;font face="Courier New"&gt;basic_rmstring&lt;/font&gt; being typedefed.&amp;nbsp; (We couldn't anyways, because we can't template typedefs.)&amp;nbsp; So, for now, we'll leave it as-is.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Not only are encodings not all equal, trans-coding schemes aren't all equal either!&amp;nbsp; Be aware of this, and think about how you want to handle errors!&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Even if we think&amp;nbsp;&lt;font face="Courier New"&gt;std::string&lt;/font&gt; is evil, we can still gain good will from our potential users by making ourselves as close to &lt;font face="Courier 
New"&gt;std::string&lt;/font&gt; as possible.&amp;nbsp; This, unfortunately, means lots of work.&amp;nbsp; But not as much as if we were actually implementing &lt;font face="Courier New"&gt;std::string&lt;/font&gt;, due to our luck in choosing to template our backing store.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;However, all our methods need to be defined in terms of symbols, not code points (and certainly not bytes of encoded data!).&amp;nbsp; This makes our life difficult again.&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=246539" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="I18N" scheme="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx" /><category term="C++" scheme="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx" /></entry><entry><title>Encodings in Strings are Evil Things (Part 3)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2004/10/20/245417.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2004/10/20/245417.aspx</id><published>2004-10-21T00:08:00Z</published><updated>2004-10-21T00:08:00Z</updated><content type="html">&lt;p&gt;&amp;nbsp;&amp;nbsp;&lt;em&gt;&amp;nbsp;(Before I start: I've gotten a few suggestions about readability, since my two entries thus far have been quite long.&amp;nbsp; So, entries will now contain a summary at the end with major facts/conclusions, and I'll go back and add them for the first two posts.&amp;nbsp; I'll also try to pace my paragraphs more regularly.&amp;nbsp; Thanks for the advice!)&lt;/em&gt;&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx"&gt;Yesterday&lt;/a&gt;, we took the definition of string as an ordered sequence of Unicode code points, and explored various schemes for&amp;nbsp;encoding and decoding code point indices on a binary computer.&amp;nbsp; At the 
end, we had a new definition for string -- a stream of bits, and some type of information identifying the encoding scheme used to interpret the bits as a stream of Unicode code points.&amp;nbsp; Today, since I'm a coder, we'll be starting a C++ implementation of a string library based on this definition.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Before we do that, though, there's one more nasty digression into standards-land that I'd like to take.&amp;nbsp; This is a fairly general definition of what a string is, and you don't really write libraries unless you intend for them to be general-purpose enough to be reused.&amp;nbsp;&amp;nbsp;So,&amp;nbsp;it might be a worthwhile goal to make our new string library compatible with the &lt;font face="Courier New"&gt;string&lt;/font&gt; class in the C++ Standard Template Library, so that anyone could gain its benefits simply by using a different &lt;font face="Courier New"&gt;#include&lt;/font&gt;.&amp;nbsp; Unfortunately, there are some restrictions in the C++ Standard (which I would highly suggest purchasing if you code in C++ for a living -- it's &lt;a href="http://webstore.ansi.org/ansidocstore/product.asp?sku=INCITS/ISO/IEC+14882-2003"&gt;$18 in PDF form direct from ISO&lt;/a&gt;) that prevent us from doing so -- namely, that many parts of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; are hard-wired to require a constant-size encoding and will not work with encodings such as UTF-8.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The C++ Standard starts by defining &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; as templated on three classes -- a character type (&lt;font face="Courier New"&gt;charT&lt;/font&gt;), a specialization of &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; for that type, and an allocator for that type.&amp;nbsp; (Nothing SAYS we have to implement&amp;nbsp;it with exactly those template parameters, but we're screwed anyways, as you'll see.)&amp;nbsp; It then 
defines two static typedefs for that specialization: &lt;font face="Courier New"&gt;traits_type&lt;/font&gt;, which typedefs to the templated traits specialization, and &lt;font face="Courier New"&gt;value_type&lt;/font&gt;, which&amp;nbsp;typedefs to&amp;nbsp;&lt;font face="Courier New"&gt;traits_type::value_type&lt;/font&gt;... which, by definition, is also required to be &lt;font face="Courier New"&gt;charT&lt;/font&gt;.&amp;nbsp; The definition of &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; requires that &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; be specialized only on &lt;a href="http://www.parashift.com/c++-faq-lite/intrinsic-types.html#faq-26.7"&gt;PODs&lt;/a&gt; (which are always constant-size), and its definitions all are written to assume uniformly-sized characters.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;If the traits problem weren't enough, on top of that, a conformant &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; implementation requires that &lt;font face="Courier New"&gt;s[i]&lt;/font&gt; return the same value as &lt;font face="Courier New"&gt;s.data()[i]&lt;/font&gt;, and &lt;font face="Courier New"&gt;data()&lt;/font&gt; is required to return a &lt;font face="Courier New"&gt;const charT *&lt;/font&gt;.&amp;nbsp; So, even if we could get around the&amp;nbsp;traits problem, variable-length encodings still screw us because &lt;font face="Courier New"&gt;operator[]&lt;/font&gt; and a pointer offset will no longer agree.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, we will have to abandon hopes of being a drop-in replacement for &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;.&amp;nbsp; But, really, this isn't too bad -- there are only three other libraries in the STL that require the use of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt;!&amp;nbsp; The first is in &lt;font face="Courier New"&gt;locale&lt;/font&gt;, and hardly anyone uses C++'s built-in locales anyways, favoring OS functionality.&amp;nbsp; The second is the &lt;font 
face="Courier New"&gt;bitset&lt;/font&gt; container, which hardly anyone uses either.&amp;nbsp; The third is&amp;nbsp;its use as&amp;nbsp;a backing store for &lt;font face="Courier New"&gt;stringstreams&lt;/font&gt; and as the &lt;font face="Courier New"&gt;stringbuf&lt;/font&gt;&amp;nbsp;wrapper that is the foundation of &lt;font face="Courier New"&gt;iostream&lt;/font&gt;, and this is a bigger loss.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The loss of direct compatibility with&amp;nbsp;stringbuf is a big pain.&amp;nbsp; However, when you're getting to I/O, you need to have already converted your string to the encoding your user is expecting -- we shouldn't expect a prompt expecting ASCII to be able to deal with a stream of UCS-2 characters!&amp;nbsp; So, it's perfectly okay if stringbuf&amp;nbsp;is left&amp;nbsp;alone, as long as we find a way to&amp;nbsp;convert strings between different encodings.&amp;nbsp; So, stringstreams are the only real loss, and we can make our own stringstream, if need be.&amp;nbsp; (Thanks to templates, we may be able to avoid having to re-invent the wheel, which is always good.)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm going to start with policy-based design, which &lt;a href="http://www.moderncppdesign.com/"&gt;Alexandrescu&lt;/a&gt; introduced a few years ago in Modern C++ Design.&amp;nbsp; (Actually, the STL beat him to the punch by using allocators as a template argument for most of its &lt;font color="#000000"&gt;containers, but he popularized its use for general customization.)&amp;nbsp; In fact, he already demonstrated policy-based design in a &lt;/font&gt;&lt;a href="http://www.cuj.com/"&gt;&lt;font color="#000000"&gt;CUJ&lt;/font&gt;&lt;/a&gt;&lt;font color="#000000"&gt; article a year or two ago by making a basic_string replacement that allowed customizing copy-on-write semantics -- but I'm a bit more ambitious :)&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;My 
first stab at the class will be based directly off our most recent definition of string -- an encoding, and an ordered sequence of bits:&lt;/font&gt;&lt;/p&gt; &lt;p dir="ltr" style="MARGIN-RIGHT: 0px"&gt;&lt;font face="Courier New" color="#000080"&gt;namespace rmlibs {&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;namespace encodings {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;/* ... utf8, iso8859_1, big5, mac_roman, etc. go here ... */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;namespace backing_stores {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;/* ... string_literal, vector_of_uchars, etc. go here ... */&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt;class Encoding, class&amp;nbsp;Bits&amp;gt; class rmstring {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;public:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;typedef Encoding encoding_type;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;private:&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;Bits _data;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;};&lt;br /&gt;&lt;br /&gt;};&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Not much, but it's a start&lt;/font&gt;!&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;At this point, I want to reference something I said earlier about I/O -- when you're doing I/O, whether that's taking a string in or sending a string out, your stream of bits needs to have the same encoding as the device you're talking with, or Bad Things happen.&amp;nbsp; We need some way to denote, inside code, that an encoding change needs to take place.&amp;nbsp; (Guessing ahead, this will probably be 
the&amp;nbsp;most tedious&amp;nbsp;part of development -- creating UCS-to-encoding and encoding-to-UCS transitions for each encoding and character set we support.)&amp;nbsp; I'm going to take a nod from the excellent &lt;a href="http://www.boost.org/"&gt;Boost&lt;/a&gt; library here, and make an analogue to their &lt;font face="Courier New"&gt;lexical_cast&lt;/font&gt; class.&lt;/p&gt; &lt;p dir="ltr" style="MARGIN-RIGHT: 0px"&gt;&lt;font face="Courier New" color="#000080"&gt;namespace rmlibs {&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// these are the major exceptions...&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/font&gt;&lt;font face="Courier New" color="#000080"&gt;class&amp;nbsp;missing_symbol;&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier New" color="#000080"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;class malformed_data;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// ... that are thrown by:&lt;br /&gt;&lt;/font&gt;&lt;font face="Courier New"&gt;&lt;br /&gt;&lt;font color="#000080"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;template &amp;lt;typename Target, typename Source&amp;gt; Target encoding_cast(Source str);&lt;br /&gt;};&lt;/font&gt;&lt;/font&gt; &lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;In the near future I'll probably alter this to take only &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;s as input and output and template on encoding types in/out, since right now it accepts any pair of types -- but this is only a prototype.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The goal for doing this is to minimize conversions.&amp;nbsp; Some of my coworkers who have been kind enough to proofread have remarked, &lt;em&gt;"I'd just throw&amp;nbsp;up my hands and convert everything internally to UCS-4 and use a basic_string&amp;lt;unsigned long&amp;gt;; after all, memory is cheap."&lt;/em&gt;&amp;nbsp; In a way, they're right -- doing this would mean I'd only have to write encoding_cast() for each encoding, and not even need 
the&amp;nbsp;new&amp;nbsp;string&amp;nbsp;class.&amp;nbsp; But, I'm a performance guy, a bit twiddler&amp;nbsp;at heart.&amp;nbsp; I don't want to do a conversion unless I need to, or unless the performance gains from a fixed-width format like UCS-4 outweigh the performance loss of having to trans-code everything.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(It's rather like image formats -- TGA is lossless and can hold damn near anything, but that doesn't mean we always convert everything to TGA first before working with it, and then convert back when we're done.&amp;nbsp; Not everything has to be "worked on," and not all work is equally difficult.&amp;nbsp; This is especially true if we're using a compile-time string literal as a backing store, since it won't be modifiable unless you make a copy!)&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;The general plan is to use &lt;font face="Courier New"&gt;rmstring&lt;/font&gt; as a &lt;a href="http://hillside.net/patterns/DPBook/DPBook.html"&gt;Facade pattern&lt;/a&gt; for the Encoding class we're templated on.&amp;nbsp; Most of &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;'s methods will actually call the Encoding class and pass in state and a pointer to our Bits object as needed; the Encoding class will handle all the work of character traversal.&amp;nbsp; Since many of the encodings we're planning to&amp;nbsp;deal with are fixed-width (UCS-2, UCS-4, and most old systems like ISO 8859 and ASCII), I'll likely create a FixedWidthEncoding base class that does most of the work of locating offsets and insertion/deletion, and inherit most of the Encodings from it.&amp;nbsp; This means that the main thing that will be unique for each Encoding will be the translation tables used for converting the symbol sets for non-Unicode systems to Unicode code points, since most of the older encodings are simple fixed-width affairs and just have non-standard symbol sets.&lt;/p&gt; &lt;p&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Tomorrow, 
we'll start fleshing out &lt;font face="Courier New"&gt;rmstring&lt;/font&gt;'s body with constructors and methods, and explain what those two exceptions&amp;nbsp;next to&amp;nbsp;&lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; are for.&amp;nbsp; We'll also take a brief look at screen-readers and web browsers, and make a change to &lt;font face="Courier New"&gt;encoding_cast&lt;/font&gt; to handle "looks-close-enough" trans-codes.&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Today's facts/conclusions:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;The definitions of &lt;font face="Courier New"&gt;basic_string&lt;/font&gt; and &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; in the C++ Standard prevent use of variable-width encodings;&amp;nbsp;therefore, we cannot make&amp;nbsp;a perfect drop-in replacement for the STL string class.&amp;nbsp; However, that's okay -- the only STL object we'll have to duplicate functionality for is stringstream.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;We can't expect I/O with external devices/programs to conform to whatever encoding we want -- they're expecting a specific encoding, and we need to present our data in that format -- or die a horrible, painful death.&amp;nbsp; So, the ability to trans-code is absolutely necessary.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Trans-coding can be expensive, but can have some gains, especially if going to UCS-4 for speed in manipulation or going to UTF-8 for compatibility with legacy C APIs.&amp;nbsp; Do it when necessary or justified, but avoid it if it's not absolutely necessary.&amp;nbsp; The coder should be allowed to pick an encoding and work with strings in that encoding as easily as possible.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=245417" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="I18N" scheme="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx" 
/><category term="C++" scheme="http://blogs.msdn.com/ryanmy/archive/tags/C_2B002B00_/default.aspx" /></entry><entry><title>Encodings in Strings are Evil Things (Part 2)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2004/10/19/244865.aspx</id><published>2004-10-20T01:38:00Z</published><updated>2004-10-20T01:38:00Z</updated><content type="html">&lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;At the end of the &lt;A href="http://blogs.msdn.com/ryanmy/archive/2004/10/18/244284.aspx"&gt;last post&lt;/a&gt;, we reduced the abstract concept of "string" down to an "ordered sequence of Unicode code points."&amp;nbsp; (We did so by choosing to actively ignore glyph information, but we'll be coming back to it later.)&amp;nbsp; Unicode code points are simply numbers; of course, numbers have to be reduced to binary to be stored in a computer.&amp;nbsp; And someone who is reading a string from a file, or from memory, needs to use the exact same encoding scheme, or we're off in la-la land.&amp;nbsp; And not all encodings are equal.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;First off, the simplest route.&amp;nbsp; There are 2&lt;sup&gt;31&lt;/sup&gt; possible Unicode code points, and an&amp;nbsp;x86&amp;nbsp;register is 32 bits wide, so let's just add a zero and encode everything as a 32-bit unsigned binary!&amp;nbsp; The ISO-10646 standard calls this &lt;strong&gt;UCS-4&lt;/strong&gt;.&amp;nbsp; Only one catch -- it doesn't specify endianness.&amp;nbsp; Of course, this poses a problem if you want to trade text files between PCs and Macs.&amp;nbsp; So, UCS-4 actually is three different encodings -- &lt;strong&gt;UCS-4LE&lt;/strong&gt; (little endian), &lt;strong&gt;UCS-4BE&lt;/strong&gt; (big endian), and just plain &lt;strong&gt;UCS-4&lt;/strong&gt;, which means that no endian is specified and you should 
assume that it's the host's endianness unless told otherwise.&amp;nbsp; (There are ways to tell otherwise -- but I'll mention them later.)&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Now, the ISO-10646 guys recognized that the majority of the written languages used on the Internet today can be expressed using a tiny subset of the 2&lt;sup&gt;31&lt;/sup&gt; symbols, and it seems a waste to use four bytes for every character if the high bytes are 0 most of the time.&amp;nbsp; So, ISO-10646 also defines &lt;strong&gt;UCS-2&lt;/strong&gt;, which uses a 16-bit unsigned binary, but can only represent the lower 2&lt;sup&gt;16&lt;/sup&gt; code points.&amp;nbsp; (The lower 2&lt;sup&gt;16&lt;/sup&gt; codepoints&amp;nbsp;are thus referred to as the Basic Multilingual Plane, or BMP.&amp;nbsp; This includes Latin, Greek, Cyrillic, Devanagari, hiragana, katakana, and Cherokee scripts, as well as many mathematical symbols and a small set of basic &lt;/font&gt;&lt;a href="http://foldoc.doc.ic.ac.uk/foldoc/foldoc.cgi?Han"&gt;&lt;font color="#000000"&gt;Han ideographs&lt;/font&gt;&lt;/a&gt;&lt;font color="#000000"&gt;.)&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;This is the first encoding we'll encounter that is non-universal -- there are some strings that are expressible using Unicode characters which UCS-2 cannot be used to encode.&amp;nbsp; Sadly, UCS-2 was adopted by early versions of the Unicode specification, and so UCS-2 is what most people think of when they hear "Unicode".&amp;nbsp; We can't blame them, though -- it took until &lt;strong&gt;2001&lt;/strong&gt; for ISO to assign any code points outside the BMP, and by then they were adding Han ideographs in bulk.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Next, we start diving into encodings that were invented for backward compatibility with older standards.&amp;nbsp; 
As we said, early versions of Unicode specified UCS-2 as a standard, back when nothing existed in the UCS tables beyond the BMP.&amp;nbsp; When it became obvious that eventually people would need to use codepoints beyond 2&lt;sup&gt;16&lt;/sup&gt;, a hybrid encoding called UTF-16 was created.&amp;nbsp; The Unicode Consortium reserved a high range of codepoints (D800 to DFFF) to be used as "surrogate characters," so that up to 1024&lt;sup&gt;2&lt;/sup&gt; characters above the BMP border could be represented as two consecutive surrogate characters, without breaking existing UCS-2 content.&amp;nbsp; This adds a brand new level of complexity to string handling, because now a single codepoint could be either 2 or 4 bytes.&amp;nbsp; This&amp;nbsp;makes even simple tasks such as iterating over the string with a for-loop difficult.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Later on, &lt;strong&gt;UTF-32&lt;/strong&gt; was introduced.&amp;nbsp; UTF-32 is effectively identical to UCS-4 -- its sole difference is in the specification.&amp;nbsp; The UTF-32 specification says that it should not be used to represent characters above 0x10FFFF.&amp;nbsp; (Nothing is stopping it, though -- it's still just an unsigned long int.)&amp;nbsp; I mention it mostly for completeness, and so you'll recognize the name.&amp;nbsp; And don't forget that all of these encodings have endianness to worry about, so we've really covered 12 encodings for Unicode so far: UCS-4(BE/LE/host), UCS-2(BE/LE/host), UTF-16(BE/LE/host), and UTF-32(BE/LE/host).&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(Windows-specific digression here: WCHAR is typedef'd inside winnt.h to wchar_t, whose size is determined by the compiler you're using.&amp;nbsp; On Visual C++ .NET 2003, wchar_t is currently an 'unsigned short' and uses UCS-2LE; on gcc, unless specified otherwise it's an 'int'.&amp;nbsp; The encoding for gcc varies by version and 
by compiler setting, though, and gcc 3.3 in particular &lt;/font&gt;&lt;/em&gt;&lt;a href="http://lists.suse.com/archive/m17n/2004-Aug/0039.html"&gt;&lt;em&gt;&lt;font color="#000000"&gt;is horribly buggy and can corrupt your string literals&lt;/font&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;&lt;font color="#000000"&gt;.)&lt;/font&gt;&lt;/em&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Everything's fine and dandy thus far, except for one catch -- we can't send strings in these encodings to old webservers that use C functions like strcmp(), strlen(), strcpy(), etc. -- or any other function that relies on the presence of a null byte to denote where the string ends.&amp;nbsp; Why?&amp;nbsp; Because, for any string that uses only the Latin alphabet (i.e. one that you could write in plain old ASCII), every character in any of the above encodings will contain at least one 00 byte, so those functions will think the string ends within its very first character.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Because of this, there's one more standard Unicode encoding, and that's the notorious &lt;strong&gt;UTF-8&lt;/strong&gt;.&amp;nbsp; UTF-8 can be thought of as&amp;nbsp;a relative of&amp;nbsp;Huffman encoding -- it guarantees that all codepoints less than or equal to 0x7F are encoded as single unsigned bytes (i.e. 
direct 7-bit ASCII correspondence), and that all codepoints greater than 0x7F are encoded as a multi-byte sequence.&amp;nbsp; All bytes in a multi-byte sequence have their MSB set, and the first byte of such a sequence encodes the number of bytes that follow.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, in exchange for being a messy variable-length format that's hard to work with,&amp;nbsp;UTF-8 can encode the entire set of Unicode codepoints &lt;strong&gt;and&lt;/strong&gt; guarantees that any UTF-8 string will be correctly handled by a&amp;nbsp;function expecting a null-terminated string.&amp;nbsp; Also, since UTF-8 is specifically meant to be handled a byte at a time, it avoids the entire messy problem of endianness.&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;em&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;(Historical note: UTF stands for UCS Transformation Format.&amp;nbsp; The infamous &lt;/font&gt;&lt;/em&gt;&lt;a href="http://en.wikipedia.org/wiki/Ken_Thompson"&gt;&lt;em&gt;&lt;font color="#000000"&gt;Ken Thompson&lt;/font&gt;&lt;/em&gt;&lt;/a&gt;&lt;em&gt;&lt;font color="#000000"&gt; created UTF-8 in 1992 on a napkin in a New Jersey diner, for use in Plan 9, and reported their success with it to the 1993 USENIX conference.&amp;nbsp; Unicode and ISO both formally standardized it in 2001, although Unicode adds the extra clause that it should not be used to express codepoints above 0x10FFFF, just like UTF-32.)&lt;/font&gt;&lt;/em&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Now, I mentioned earlier that the other formats had to choose endianness: explicitly specify, or shrug and assume that it's the same as the host.&amp;nbsp; There is a common solution to this -- and that's to use a marker to determine the endianness.&amp;nbsp; This marker is known as the &lt;strong&gt;Byte Order Mark&lt;/strong&gt;, or BOM for short, and is Unicode code point 0xFEFF 
("ZERO-WIDTH NO-BREAK SPACE" -- a null symbol, effectively).&amp;nbsp; If you encounter the character 0xFFFE while decoding, you know that the file you're reading was written on a machine of opposite endianness, and you should flip bytes.&amp;nbsp; (Unicode code point 0xFFFE has been specifically designated as an invalid character for this purpose.)&amp;nbsp; Keep in mind that you may encounter multiple BOMs in a string and may have to switch back and forth!&amp;nbsp; (This could happen if, for example, you used UNIX cat to concatenate two text files, and one was UCS-2BE and one was UCS-2LE.)&amp;nbsp; UTF-8, being specifically designed to be parsed on a byte-by-byte basis, does not need a BOM.&amp;nbsp; &lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;There are a few other standard Unicode encodings, but the big 13 above are the only ones that you see regularly.&amp;nbsp; I'll mention the other ones briefly, mainly because they show up in some old internet protocols:&lt;/font&gt;&lt;/p&gt; &lt;ul&gt; &lt;li&gt;&lt;font color="#000000"&gt;&lt;strong&gt;UTF-7&lt;/strong&gt; was an early attempt to translate Unicode points to 7-bit-ASCII text for use in MIME-encoded emails.&amp;nbsp; It specified that all 7-bit characters should be transmitted as single bytes, like UTF-8.&amp;nbsp; However, rather than use the eighth bit to denote a multibyte character, it overloaded the + sign as a sentinel.&amp;nbsp; "+-" denoted that a normal plus should appear; for any other following character, the following three bytes were the UCS-2 encoding, re-encoded in Base64.&amp;nbsp; It could not transmit anything outside the BMP.&amp;nbsp; UTF-7 is used, slightly modified, in parts of the IMAP mail protocol; for POP3 and SMTP, however, it has mostly been bypassed in favor of UTF-8.&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;font color="#000000"&gt;&lt;strong&gt;SCSU&lt;/strong&gt; (Standard Compression Scheme for Unicode) was an early 
attempt at a variable-length encoding like UTF-8 proposed by Reuters News, that added light compression as well.&amp;nbsp; However, small compression schemes like this are painfully inefficient compared to larger schemes like LZW or BWT, and they make the text very difficult to handle internally.&amp;nbsp; SCSU is not used in any major protocol or file format that I know of today.&lt;br /&gt;&lt;br /&gt;&lt;/font&gt; &lt;li&gt;&lt;font color="#000000"&gt;&lt;strong&gt;Punycode&lt;/strong&gt; (RFC 3492) is similar to UTF-7 and uses the string "xn--" as a sentinel.&amp;nbsp; Punycode is only used in one situation -- the IDNA (Internationalizing Domain Names in Applications) protocol used to handle use of Unicode domain names in DNS.&amp;nbsp; An IDNA-capable web browser will capture a string from the address bar, translate it to ASCII text using the Punycode system, and send the converted string as a standard gethostbyname() DNS request; the DNS server simply looks up that ASCII form upon receipt.&amp;nbsp; If you're making a better bind, or fixing Firefox, this will be of interest to you; I do not expect to encounter files or other strings encoded in this system.&lt;br /&gt;&lt;/font&gt;&lt;/li&gt;&lt;/ul&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;So, that's the set of encodings that directly encode Unicode code points.&amp;nbsp; There's only one catch -- there are also encodings out there that don't directly map to&amp;nbsp;Unicode codepoints!&amp;nbsp; In this case, we have to do a two-part mapping to get to Unicode -- first, decoding to a symbol number in the source that matches that encoding's symbol set, and then converting that to a Unicode codepoint!&amp;nbsp; &lt;strong&gt;Yuck.&lt;/strong&gt;&amp;nbsp; And we're going to encounter a lot of these too, because these have names we recognize like ASCII Code Page 437 and ISO 8859-1 and Windows DBCS and GB and Big5 -- all those legacy formats, some of which are also 
variable-length like UTF-8.&amp;nbsp; We've got our work cut out for us!&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Of course, I never mentioned what it is we actually were working on...&amp;nbsp; I'm out to create a string class for C++ that doesn't suck.&amp;nbsp; Now, there are three ways that C++'s &lt;font face="Courier New"&gt;std::string&lt;/font&gt; sucks.&amp;nbsp; The first sin is that you have to use a backing store -- you can't&amp;nbsp;tell it to use a string literal like &lt;font face="Courier New"&gt;L"Sch&amp;ouml;ne Gr&amp;uuml;&amp;szlig;e"&lt;/font&gt; as a source, since &lt;font face="Courier New"&gt;allocator&amp;lt;char&amp;gt;&lt;/font&gt; requires that the target be modifiable.&amp;nbsp; All contents have to be copied, because contents are always mutable.&amp;nbsp; The second sin is that&amp;nbsp;it assumes that the compiler and author know what they're doing when they manipulate its contents.&amp;nbsp; To C++, a &lt;font face="Courier New"&gt;basic_string&amp;lt;T&amp;gt;&lt;/font&gt; is really just a pretty interface on a &lt;font face="Courier New"&gt;vector&amp;lt;T&amp;gt;&lt;/font&gt;.&amp;nbsp; The third sin may vary for some people; for me, it exists in the forms of some stupid promises that 14882 (the ISO C++ standard) wasn't willing to make, most notably that the &lt;font face="Courier New"&gt;c_str()&lt;/font&gt; method is capable of invalidating references, pointers, and iterators.&amp;nbsp; This was mostly done to accommodate copy on write and other implementation details, but it makes writing conformant string-handling code infuriatingly difficult if you ever have to interface &lt;font face="Courier New"&gt;std::string&lt;/font&gt; with C functions that need C strings&amp;nbsp;(such as, say, the Win32 API!).&lt;/font&gt;&lt;/p&gt; &lt;p&gt;&lt;font color="#000000"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;I'm opting to only fix the first two sins&amp;nbsp;for now -- backing store handling, and 
encoding awareness.&amp;nbsp; The third sin, you can handle according to your level of offendedness.&amp;nbsp; &lt;/font&gt;&lt;font color="#000000"&gt;Tomorrow: Policy-based design using templates, minimizing conversions, and why 14882's &lt;font face="Courier New"&gt;char_traits&lt;/font&gt; makes it impossible to make a strictly conformant&lt;font face="Courier New"&gt; std::string&lt;/font&gt; that supports variable-length encodings.&lt;/font&gt;&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Today's facts/conclusions:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;We have to store the code points in a string somehow.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;A lot of pain comes from wanting to retain backward compatibility with old character sets and old encodings.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Large fixed-width formats like UCS-2 and UCS-4 make string manipulation very easy since they allow random access to individual code points, but are not compatible with old C functions that expect null-terminated strings.&amp;nbsp; Also, keep an eye out for endianness.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Variable-width formats like UTF-8 are compatible with null-termination functions, but have to be parsed sequentially.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=244865" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="I18N" scheme="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx" /></entry><entry><title>Encodings In Strings Are Evil Things (Part 1)</title><link rel="alternate" type="text/html" href="http://blogs.msdn.com/ryanmy/archive/2004/10/18/244284.aspx" /><id>http://blogs.msdn.com/ryanmy/archive/2004/10/18/244284.aspx</id><published>2004-10-19T02:54:00Z</published><updated>2004-10-19T02:54:00Z</updated><content type="html">&lt;p&gt;&lt;i&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; What is a string?&lt;/i&gt;&amp;nbsp;&amp;nbsp; &lt;br /&gt;&lt;br 
/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; About six months ago at the Game Developers Conference in San Jose, I sat in on a talk about performance tuning in Xbox games.&amp;nbsp; The presenter had a slide that read:&amp;nbsp; "Programmers love strings.&amp;nbsp; &lt;b&gt;Love hurts.&lt;/b&gt;"&amp;nbsp; This was shown while he described a game which was using a string identifier for every object in the game world and hashing on them, and was incurring a huge performance hit from thousands of strcmp()s each frame.&amp;nbsp; I nodded -- but my mind was thinking, "The same would be true if they had used GUIDs, or any other large identifier.&amp;nbsp; After all, strcmp is just a bounded memcmp."&amp;nbsp; So, what actually IS a string?&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I think it's safe to say that a string is &lt;i&gt;something&lt;/i&gt; that a human interprets and derives meaning from.&amp;nbsp; In this case, that something is almost always an ordered sequence of symbols (note: the symbols may not be co-linear!) 
that conveys meaning.&amp;nbsp; Now, let's assume from here on that a string is an ordered sequence of 2D glyphs.&amp;nbsp; A glyph is three pieces of data: a symbol, the dimensions to render that symbol at, and the location where it should be rendered.&amp;nbsp; This is still describing a very abstract, human-centric thing.&amp;nbsp; To express this in the programming world, we have to identify these glyphs somehow.&amp;nbsp; A vector drawing or bitmap approximation of a glyph would suffice.&amp;nbsp; But we don't want to require that people deal with these just to print "Hello World" to the screen.&amp;nbsp; So, let's put the glyphs somewhere in the system, and assign indices to them.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; And thus, we have &lt;a href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=29819"&gt;ISO 10646&lt;/a&gt;, known as the Universal Character Set or UCS.&amp;nbsp; UCS is a simple mapping of numeric indices (called code points) and formal names, to symbols.&amp;nbsp; For example, in the UCS, code point 0x41 is "Latin capital letter A" and corresponds to, of course, the letter A.&amp;nbsp; The goal of UCS is to be a superset of all character sets.&amp;nbsp; So, given a set of characters such as 7-bit ASCII, or ISO 8859-1, or EBCDIC, we can find some mapping (preferably 1:1, but we're not always so lucky) to UCS.&amp;nbsp; So, our definition of glyph now converts to a tuple containing a UCS code point, a size, and a distance from the last render point.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; We now find ourselves asking a few more questions about glyphs.&amp;nbsp; Size is fairly easy to measure -- just a box that bounds the symbol.&amp;nbsp; However, distance is difficult, because good typesetting requires that the distance between characters be measured from any number of points inside that box.&amp;nbsp; For simple Roman alphabets, we might want to measure from the baseline; accents might have 
to go relative to baseline + ascent; some characters may have an advance width that is greater than their bounding box; and this doesn't even begin to address script-based languages like Arabic!&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; On top of this, UCS allows two ways to represent accented characters.&amp;nbsp; Most accented characters have a dedicated UCS code point; however, an accented character can also be represented as the code point for the un-accented character, followed by code points for one or more accents as stand-alone symbols.&amp;nbsp; UCS calls symbols which are meant to be applied to the previous character "combining characters," and refers to symbols containing preaccented letters as "precomposed characters."&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; For example, the symbol &lt;b&gt;&amp;Auml;&lt;/b&gt; can be represented by either the precomposed UCS code point 0xC4 ("Latin capital letter A with diaeresis") or by the code point 0x41 ("Latin capital letter A") immediately followed by code point 0x308 ("combining diaeresis").&amp;nbsp; And don't forget that there needs to be size and direction between the diaeresis and the letter, and that there can be more than one combining character following a single symbol, including some symbols which can vary their positioning depending on their combination with other combiners!&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; The UCS took the easy way (or, as some would argue, the &lt;b&gt;sanest&lt;/b&gt; way) out of dealing with all the positioning problems of glyphs -- it simply refused to acknowledge their existence.&amp;nbsp; The UCS is simply a symbol table that includes combining characters, nothing more.&amp;nbsp;&amp;nbsp; The UCS also doesn't deal with any of the properties that we assign to specific symbols; for example, it doesn't recognize case.&amp;nbsp; It cannot say that &lt;b&gt;&amp;Auml;&lt;/b&gt; and &lt;b&gt;A&lt;/b&gt; are upper-case and 
&lt;b&gt;a&lt;/b&gt; is lower-case, or that &lt;b&gt;&amp;Auml;&lt;/b&gt; and &lt;b&gt;A&lt;/b&gt; have the same root letter and differ only by accent, or that &lt;b&gt;a&lt;/b&gt; is the same root letter as those two -- they're simply different symbols with no relation.&amp;nbsp; As a result, the UCS isn't very well known, despite the fact that it has existed for over a decade.&amp;nbsp; This is where &lt;a href="http://www.unicode.org/"&gt;Unicode&lt;/a&gt; comes in.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Unicode originally started out in the late 1980s as an ad-hoc standard agreed on by a group of companies making multi-lingual software products.&amp;nbsp; Initially, Unicode was developed separately from UCS; however, starting in 1991 Unicode merged its code table with UCS, and &lt;a href="http://www.unicode.org/standard/versions/"&gt;all versions of Unicode&lt;/a&gt; from 1.1 (June 1993) forward match the UCS.&amp;nbsp; Unicode does not define glyph data, or the vectors that are used to render a symbol.&amp;nbsp; However, it does provide lots of normative semantic information that UCS code points lack.&amp;nbsp; For example, a Unicode code point not only contains the UCS symbol, but also data such as the symbol's case (upper/lower/title), category (letter, mark/accent, digit, punctuation, separator, etc.), and numeric interpretations of digit symbols (i.e. 
the symbol 4 represents four things).&amp;nbsp; Alongside this, we have the Unicode Technical Standards, which define culturally appropriate comparison, sorting, and searching algorithms, character boundaries in script languages, how to handle newlines (CR/LF/CRLF/NEL), and other such handy information.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Let us assume, for now, that given an ordered sequence of Unicode code points, the OS can convert them to glyphs and render them in a way that's appropriate.&amp;nbsp; Of course, this is a huge and almost entirely false assumption -- and I'll be coming back to it later.&amp;nbsp; But it's also a very convenient assumption, because it allows us to reduce the definition of a string down to something that's easy to tackle: &lt;i&gt;a finite ordered sequence of Unicode code points&lt;/i&gt;.&amp;nbsp; Of course, in order to store a number on a computer, it has to be converted to binary.&amp;nbsp; However, not all binary representations are the same, and not everyone thinks it's worth using 31 bits of information for every character.&amp;nbsp; Tomorrow's episode: encoding systems, and the major character sets that love them.&lt;/p&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;Today's facts/conclusions:&lt;/p&gt; &lt;ul&gt; &lt;li&gt;Strings should be thought of as human-centric, rather than tied to a video card's interpretation of regularly-sized bits.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Strings are composed of glyphs.&amp;nbsp; A glyph consists of a symbol, plus typesetting information.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;There's already a standard table called ISO 10646, or UCS, that maps code points (numbers) to symbols.&amp;nbsp; Unicode adds semantics like case, comparison rules, and sorting algorithms to UCS.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;Typesetting information is really tricky to store portably.&amp;nbsp; UCS and Unicode ignore its existence.&lt;br /&gt;&lt;br /&gt; &lt;li&gt;If the OS can be relied on to handle 
glyphing, we can store a string as an ordered sequence of Unicode code points.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; &lt;hr /&gt; &lt;p&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Oh, and since this is the first post here that's visible to the public -- I'm Ryan Myers, a geek-of-all-trades currently on the Windows Client Performance team.&amp;nbsp; I intend to use this blog as an ongoing set of essays about various facets of programming I've encountered.&amp;nbsp; (I use essays as the textual equivalent of sitting in front of a whiteboard reasoning things out, rather than a polished report of what I wish I had done the first time.&amp;nbsp; So, conclusions may change from post to post, and I welcome all comments and counterpoints.)&amp;nbsp; So, pardon the mess and enjoy the show.&lt;br /&gt;&lt;/p&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=244284" width="1" height="1"&gt;</content><author><name>ryanmy</name><uri>http://blogs.msdn.com/members/ryanmy.aspx</uri></author><category term="I18N" scheme="http://blogs.msdn.com/ryanmy/archive/tags/I18N/default.aspx" /></entry></feed>