Welcome to MSDN Blogs Sign in | Join | Help
    So, there were two major groups of comments on the last post, and I'll try to address them.

    The first was a question about managed support for ETW.  I talked to the ETW team, and the current state is that there is no official managed interface for ETW.  Being a standard Win32 API, it is posisble to PInvoke the functions involved, and several internal teams have written their own managed wrappers around ETW.  This isn't expected to change up through Whidbey (Visual Studio 2005); for Orcas (the version of VS after that!) an official managed interface is on the table.  The second is one of backwards compatibility -- ETW is only available in Win2K and later OSes.  Users will expect that software works similarly on all OSes; thus, if you want to support 9x-era OSes, you have to write your own logging code anyways.  So, if you are already putting in old-style logging, why use ETW?  I'll try to answer that with this entry.

    One of the gains of ETW is that it's fast; you can spit out thousands of events per second while using relatively little CPU, far faster than you can fprintf() a string to disk.  The biggest gain, though, is combining multiple sources -- including information outside your own process.  And the most notable external source is the kernel.

    The XP and Codename Longhorn kernels are extremely extensive providers, and can be enabled to log any or all of these to a log, and we publish decoding information for:
  • Hardware configuration events -- notes on the system's CPUs, hard drives, NICs, video card, and ACPI power states
  • Disk-level I/O -- every I/O on the system, including IRP flags, operation time in microseconds, number of bytes, and target disk
  • File-level I/O -- every access to every file on the system (including information to tie it to the disk I/Os above)
  • Image layouts -- filenames, locations in memory, and PIDs for every image in the system
  • Page faults -- pointers to instructions and pages whenever a fault occurs (including COWs, demand-zero faults, hard page faults, transition faults, and guard pages)
  • Network I/O -- all TCP and UDP actions, including connects/accepts, transmits (and retransmits!), recieves, etc.
  • Registry I/O -- all Registry key/value  creation/deletions/changes, registry flushes, etc.
  • Process and thread info -- all creations/deletions of processes and threads
    And Codename Longhorn adds even more events -- most notably, the ability to trace extremely fine-grained high-frequency events such as individual context switches, interrupts (ISRs and DPCs), etc.  

    (FYI, since all of the above information is exceedingly detailed, you can only enable the kernel provider if you have Administrator privileges, are part of the Performance Log Users group, or a service running as LocalSystem, LocalService, or NetworkService.)

    Thus, ETW can be used as an effective debugging tool.  By allowing ETW to pull from, sort, and combine events from multiple providers, you can get a powerful log of everything the system was doing, probably the most accurate log available (save for running the entire OS in a debugger).  It's an incredible tool for noticing "hey, things act strangely when X, Y, and Z, but not W, are happening" at a system level, as well as a code level, and it takes far less time than getting symbols and attaching a debugger/profiler to the system.  And it's all available to devs -- and even to users, given a generic ETW tool such as tracelog in Server 2003!

    Even if you don't specifically use the kernel as an information source, ETW's ability to combine providers is useful for mixing and matching information from multiple DLLs, EXEs, etc. in a system.  ETW events are timestamped by the kernel to extremely high resolution (RDTSC on stable machines, converted to microsecond intervals; MM timers on others) and are automatically sorted at process time, so you don't have to write or parse plaintext date/time formats.

    The goal I'm driving at is that when you have more than one DLL or EXE providing information, individually implementing logging for each component means that you usually need a third app to read in the logs from each component and combine them into a single log with coherent event ordering, and this can be difficult -- especially if you have to tie it to some event.  ETW allows you to automate all that, and it's exceedingly efficient at it as well.  Even if you are only personally maintaining one component, ETW can log very quickly, and it can be shipped in retail builds -- and if you publish the structures of some or all of the events you provide, you can give valuable information to your consumers and to future devs without ever needing to work with them.

    Next entry, I'll start discussing how providers are written, starting with thread structures and common ways of publishing event structs.
3 Comments
Filed under:
    A lot of work in performance tuning is organizational.  There's only so much work one can do with a profiler and a single module.  A good example is the Registry -- we can attach profilers to the Registry access routines and optimize them until they run as smooth as silk, but performance will still be impacted if you do thousands of Registry accesses per second.  For many problems, the cause is systemic: several components in a chain of command that are individually well-tuned, but didn't expect to call each other in a huge chain.  A good example of that is DirectShow -- no matter how skillfully crafted an individual filter is, if the mean path in a DirectShow graph is ten filters deep (with memory management between each one for passing buffers of audio or video around), latency is going to be high.

    More often than not, the best solution is simply logging.  Log when filters are instantiated or connected, log when Registry accesses are made, etc..  You want to mark high-level concepts, and try to get a picture for what's going on with the system as a whole.  This works fine if you only have one application that has to log... but more often that not, these systemic problems have hundreds of files involved, most of which aren't coded by you!  If every programmer performs their logging in a different way, it can be a nightmare to combine all those logs together, mixing different types of timestamps and different methods of delivery, and get a single ordered log of what happened over time.  Of course, that's exactly what we need... and that's where Event Tracing for Windows, or ETW for short, comes in.

    ETW is, at its core, a unified system for one-way packetized I/O managed by the Windows kernel, built for logging.  Every use of ETW has three participants in it -- the controller, the provider, and the consumer:
  • A provider is an module (DLL/EXE) doing something worth logging.  Most of the time, it runs without logging; it can, however, be "enabled" by a controller, at which point it recieves a handle from the kernel and starts logging "events" to that handle.  An event is an arbitrary struct (binary block) of data, the only condition being that it start with a 48-byte header.  This header contains a timestamp and identifying information.

  • A controller controls the actual act of logging.  The controller can ask the kernel to start a logging session, creating a handle and specifying that the kernel should take any events delivered to that handle and save them to a file.  (That file is usually on a hard drive, although we occasionally save them to RAM drives to ensure minimal interference.)  The controller can also enable and disable logging by providers, passing them a handle to log to.

  • A consumer reads events out of a file created by a logging session and parses them.  (It is also technically possible to have a consumer directly attach to a logging session's handle and retrive events in real-time, but this is rare.)
    So, why use this system over your own homebrew system?
  • Uniformity.  If you're debugging systemic problems involving multiple components, and all the involved components use ETW, you can have them all deliver their information to a single log file with uniform, steady timestamps, and write a single application that parses them all.

  • Speed.  ETW is extremely fast for providers to use, since all the I/O is handled by the kernel instead of by your module.  It typically takes only 1500-2000 cycles, depending on settings, to deliver an event and return to your code.  One can easily deliver thousands of events per second even on ancient machines.  We've achieved 20,000 events per second while only using 5% CPU load on a P3 500MHz(Yes, we have machines that old in our perf testing labs -- not everyone who uses Longhorn will be using a modern machine!)

  • Consistency.  With fprintf() or other homebrew systems, logging tends to be very slow and intrusive and is thus usually compiled in.  With ETW, logging is extremely fast; furthermore, since logging is turned on by a controller and is usually off by default, you can actually leave the ETW events in final shipping code!  If problems are found in the field, send the tester an app that starts a trace and turns on the provider, then read it later.  Many, many components in Longhorn will ship as ETW providers.

  • Reliability.  ETW isn't a new thing -- it's actually been in the OS and actively used since Win2K, and has been constantly refined since then.  Furthermore, ETW is available in both user-mode apps and kernel components.  (The latter access it through a MJ_SYSTEM_CONTROL IRP.)  This leads to...

  • OS cooperation.  The Windows kernel can provide many highly useful events via ETW for diagnosing performance problems.  Find out when and where disk I/Os, registry accesses, hard faults, and other performance problems happen!  More on this later...
    I'll start discussing the actual APIs in the next entry -- those whose curiosity has been piqued can jump into the MSDN documentation, which is not very good IMO but better than nothing.
9 Comments
Filed under:

   Hello!  I haven't updated this blog in a while; work and other events have conspired to keep me from writing.  Also, blogs.msdn.com moved internally from .Text to Telligent Community Server, and my CSS markup was an unfortunate casualty of the move, so I'm working on redesigning the blog's visual appearance.  More entries will be coming eventually.  :)

   In the meantime, I want to defuse a long-standing controversy -- the /prefetch flag.

 

   With modern computing, the absolute worst thing you can ever do for performance is having to touch the hard drive -- or any non-memory storage for that matter.  The fastest hard drives on earth are still horridly slow compared to a PC's main memory; even with solid state drives, in order to access the drive, one has to jump into system code and drivers, and this will push your own program's code out of the CPU's L2 cache.  (This is called a locality loss.)  There's two typical reasons one has to touch the disk -- the first is when the application requests it explicitly (Word asks the OS to load blog.doc into memory), and the other is a "hard fault" -- when the application tries to use memory that has been paged out to disk via "virtual memory" and needs to be paged back in.

   Now, imagine that a DVD player program always starts playback by loading a DLL to decode MPEG-2 video.  Wouldn't it be nice if we could attempt to pre-load the MPEG-2 DLL whenever we loaded the DVD player's EXE?  That way, when it tries to run code on that DLL, one doesn't have to hard fault and go to disk for it!   This is what a prefetcher does: it tracks what code pages are used by an application, and the next time that application loads, it loads those pages in advance as soon as it's got some idle time.  A prefetcher was added to Windows in XP, and is vastly improved in Windows Longhorn.

   XP systems have a Prefetch directory underneath the windows root directory, full of .pf files -- these are lists of pages to load.  The file names are generated from hashing the EXE to load -- whenever you load the EXE, we hash, see if there's a matching (exename)-(hash).pf file in the prefetch directory, and if so we load those pages.  (If it doesn't exist, we track what pages it loads, create that file, and pick a handful of them to save to it.)  So, first off, it is a bad idea to periodically clean out that folder as some tech sites suggest.  For one thing, XP will just re-create that data anyways; secondly, it trims the files anyways if there's ever more than 128 of them so that it doesn't needlessly consume space.  So not only is deleting the directory totally unnecessary, but you're also putting a temporary dent in your PC's performance.

   Secondly, one can specify a /prefetch:# flag when launching an app.  Many people have noticed that auto-generated shortcuts to Windows Media Player do this, and the number varies depending on what it does.  For example, the shortcut used by the shell when you double-click a WMV file to play it has one prefetch number; the auto-run shortcut to play or rip music that appears when you insert a music CD have other numbers.  Some sites have guessed that this switch turns on prefetching, and suggest that you add that to every executable you care about -- this has appeared on so many, many, many sites to be urban legend.  Other sites write this off as garbage and guess that it's a switch specific to Media Player, guessing from references to prefetching in the Windows driver subsystem.  Both guesses are incorrect.

   The /prefetch:# flag is looked at by the OS when we create the process -- however, it has one (and only one) purpose.  We add the passed number to the hash.  Why?  WMP is a multipurpose application and may do many different things.  The DLLs and code that it touches will be very different when playing a WMV than when playing a DVD, or when ripping a CD, or when listening to a Shoutcast stream, or any of the other things that WMP can do.  If we only had one hash for WMP, then the prefetch would only be correct for one such use.  Having incorrect prefetch data would not be a fatal error -- it'd just load pages into memory that'd never get used, and then get swapped back out to disk as soon as possible.  Still, it's counterproductive.  By specifying a /prefetch:# flag with a different number for each "mode" that WMP can do, each mode gets its own separate hash file, and thus we properly prefetch.  (This behavior isn't specific to WMP -- it does the same for any app.)

   This flag is looked at when we create the first thread in the process, but it is not removed by CreateProcess from the command line, so any app that chokes on unrecognized command line parameters will not work with it.  This is why so many people notice that Kazaa and other apps crash or otherwise refuse to start when it's added.  Of course, WMP knows that it may be there, and just silently ignores its existence.

   I suspect that the "add /prefetch:1 to make rocket go now" urban legend will never die, though.  I know that at least one major company ships products with it in their shortcuts, without ever asking us... just for good measure, I guess.  :-P  All it does is change your hash number -- the OS is doing exactly the same thing it did before, and just saving the prefetch pages to a different file.

   (ATTENTION: This is merely an informative article; this information is completely unsupported, and the functionality may change or disappear entirely in future versions of Windows or service packs.  Furthermore, it is merely a hint for the XP prefetcher, and it may choose to ignore it if it wishes.)

16 Comments
Filed under:

   As more Unicode encodings are being finished, I find myself wanting to actually start using rmstring in real situations.  However, most of my "real situations" involve legacy encodings.  So, I need to start cracking on transcoding.

   The first concern is allowing adapters for arbitrary transcodings.  A tricky problem that's related to transcoding is collation (aka sorting) -- most people aren't aware that sorting strings is often a locale-dependent issue.  This is a localization problem.  Just to make sure that terminology is clear, internationalization (often abbreviated to i18n) is the act of coding a program such that it is entirely independent of location and language; the most classic example of i18n is moving all string literals into a binary resource within an EXE, so that the strings may be changed without modifing the program's logic.  This is almost always paired with localization (sometimes abbreviated to l10n), which is the act of tailoring an already-internationalized program for a specific language/locale.  Internationalization may be done by any programmer; localization requires translators.

   In the case of sorting, a binary sort is often not enough.  Context is everything!

  • Where do accented characters sort -- the same as their base characters, or after?  (For French speakers, accented As come after Z.)
  • What are you sorting for?  (German has a special sorting order for names in phone books!)
  • What about ligatures such as ch or fi?  (Spanish speakers, for example, will sort character sequences starting in "ch" between "c" and "d", even though they recognize "ch" as two separate characters.)

   For this reason, developers using rmstring on Win32 platforms will almost certainly want to use a sorting predicate based on Win32's CompareString or LCMapString APIs.  For example:

rmstring<ucs4, bytevector> getfirst( std::list<rmstring<utf8, bytevector> > & lines ) {
    std::sort( lines.begin(), lines.end(), win32_collator( LOCALE_USER_DEFAULT ) );
    return (*lines.begin()).transcode<ucs4, bytevector>();
}


   This example is a bit contrived -- a real example would template the container and output encoding, and make the LCID a parameter with a default argument -- but you get the point.  win32_collator, in this case, is a custom predicate for std::sort (see <algorithm>) that converts both strings to UTF-16 and then invokes CompareStringW on them, throwing a missing_symbol exception if there's a codepoint above 0x10FFFF that UTF-16 can't contain.  Of course, this will hardly be my primary solution!  More on that later.

   Anyways, similar issues arise for transcoding.  (Not to mention the fact that win32_collator is, in fact, dependent on the ability to transcode, since the Win32 Unicode APIs expect UTF-16 strings.)  So, we must include pluggable transcoders.  So, we change our prototypes from Part 7 to include one more template argument, the transcoding tool:

template <class Engine, class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode( const rmstring<SrcEnc, SrcStore> & src, rmstring<TgtEnc, TgtStore> & tgt, Engine e = Engine() );

template <class Engine, class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> rmstring<SrcEnc, SrcStore>::transcode( Engine e = Engine(), TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );

   These functions now put off transcoding to the Engine object, whatever that may be.  In the Win32 vein, we could use MultiByteToWideChar and WideCharToMultiByte -- but that's too easy, not to mention very difficult to wrap.  I'd really like to do something that's solely C++ and entirely based in the Unicode Character Database's mappings directory.  There's a few dilemmas to be solved for that.

   Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of compatibility characters.  Compatibility characters are canonically equivalent to a sequence of one or more other Unicode characters; they are usually placed so that you have a single codepoint that's equivalent to a character in some older standard.  For example, ISO8859-2 defines 0x5A to be equivalent to a capital letter L with a caron accent (&Lcaron).  The "simple" equivalent of this in Unicode is a capital letter L (U+004C) followed by a combining caron (U+030C); however, Unicode also defines a single pre-combined character, U+013D, that is directly equivalent to those two.  Therefore, almost all legacy encodings thus can have a simple 1:1 function to go up to Unicode, in the form of a lookup table.  (Unfortunately, not all legacy encodings have a complete set of compatibility characters, so a LUT will not work for everything.)  Going back from Unicode to legacy is more problematic, however: we now have two equivalents to a given legacy character.  The most direct solution, it seems, is to generate a finite automata.

   I've been working on the DFA for the last few days.  My main concern has been memory efficiency, and I can now get a complete set of typical round-trip encoding data to fit in at under 8K per encoding, which fits nicely in cache.  Obviously, certain ones will be smaller, and certain ones will be larger (in particular KOI8 and other encodings with very large symbol sets).  The DFA solution is very clean though; the legacy-to-Unicode DFA takes in bytes and outputs 32-bit unsigned ints containing codepoints which are then re-encoded, and the Unicode-to-legacy DFA takes in codepoints and outputs bytes.  Legacy-to-legacy transcodes use UCS-4 as an intermediary.

   At this point, I'm now working on a program that reads in a file from MAPPINGS and UnicodeData.txt from the Unicode Character Database and outputs the DFA in C++ format.  I'll post more when that's finished.  (I'm writing this entry pre-emptively, as this work-week looks like an absolute killer.)

7 Comments
Filed under: ,

   Eugh.  Due to a three-part punch of piling-up work, time with family over the holidays, and being thoroughly sick, I haven't had much time to work on rmstring -- which means, of course, that this hasn't updated.  I haven't given up on it though!  (I'm not dead!  I don't want to go on the cart...)  If anything, my desire to finish it has increased, since I've been working on a set of internal utilities which parse text files to take instructions, and one keeps on thinking, "This would be so much easier if I just finished rmstring..."

   So, on to business.  First off, the all-important fixed_width_encoding class is done.  This critical class is the foundation of all encodings with a fixed number of bits per code point; it's templated on an intrinsic type that the implementor knows is 1/2/4 bytes.  The hardest part of an encoding, I've found, is writing the iterators; there are a huge number of methods that one must implement in order to make a 14882-compliant iterator.  The internals are mostly simple pointer arithmetic; just a lot to be tested.  (Yes, I have to write a test harness for this, if I want it to be approved for on-campus use :P)

   One annoyance that I've found is pointer type conversions; imagine that you've allocated a byte array for recv()ing something in from a TCP socket.  If we know that said content is UCS-4, the natural urge is to cast it to an unsigned long * to iterate over... except that you can't.  Or, at least, you shouldn't.  If that byte array isn't suitably aligned for 32-bit accesses, code will either run slowly (on x86 and AMD64) or crash (on IA-64, unless SetErrorMode() is called to force OS alignment fixups, in which case it will run extremely slowly).  Of course, people do this all the time; you just can't guarantee that doing so is safe within the confines of strictly conformant code.  There is also no way for strictly conformant code to check if a given pointer is aligned, since there is no operator to retrieve a type's alignment requirements.  The best you can do is assume that no type will have an alignment requirement greater than its size, and assert(0 == reinterpret_cast<size_t>(ptr) % sizeof(type)), which is throughly disgusting AND assumes certain things about the host's virtual memory system that may not be true.

   Thus, I've opted for the simplest solution: a huge comment in the code that says "These functions assume that the backing store's data() pointer is suitably aligned for Stride-sized accesses and that size() is a multiple of Stride's size.  Violating either of these assumptions will result in your program's untimely death."  Sometime later, I might come up with a helper function alignment_assert<T>(ptr) that takes advantage of compiler-specific extensions such as MSVC's __alignof if available.  Note that this also could potentially result in a Unicode stream that does not make much sense, such as combining characters that don't properly match base characters.  The Unicode standard notes that such a stream is not ill-formed, although it is not quite renderer-friendly; so, I'll support it.

   I've also had occasion to rethink my plans for encoding_cast.  Initially, I planned to use encoding_cast in a way similar to the Boost lexical_cast pseudo-operator.  However, it disturbed me that doing so would mean that every call to encoding_cast would create a temporary in which to store the result, which would then make its way to final storage either by operator= or copy constructor.  I ended up realizing that a good 70% of the calls to encoding_cast would be writing the encode into a string that already existed.  So, instead, we now have the transcode function, which comes in both non-member and member flavors:

template <class SrcEnc, class SrcStore, class TgtEnc, class TgtStore>
void transcode( const rmstring<SrcEnc, SrcStore> & src, rmstring<TgtEnc, TgtStore> & tgt );

template <class TgtEnc, class TgtStore>
rmstring<TgtEnc, TgtStore> rmstring<SrcEnc, SrcStore>::transcode( TgtEnc newenc = TgtEnc(), TgtStore newstore = TgtStore() );

   With the above, the originally envisioned encoding_cast is now just syntactic sugar for a call to the source string's member transcode() function.  It also means that the code to do transcodes is now centralized within rmstring.  Handy!

   Oh, and since someone asked: I'm currently developing and testing this on Visual C++ .NET 2003 and Stephan Lavavej's distribution of MinGW; I'll likely run it against Comeau as well to make sure it's kosher before I release the source to the public.

   My goals for the next article are to have a few non-Unicode encodings done, so I can start testing out transcoding and flesh out the different encoding mechanisms.  My main dilemma is designing the symbol tables; I noted in Part 4 that I wanted to have the ability to pass different resolving engines to the transcoder such as a perfect lossless transcription, visual parity, error codes, etc.  Visual parity will be the hardest to do; in fact, I will likely not do it right away.  (Namely, because the Unicode tables do not contain such parity information.)  Another concern has been memory consumption of tables for encodings; I'll be tackling that shortly.

(Since this was mostly a "what happened while I was gone" article, no point summary.)

(Update 2pm: Michael Kaplan nudged me a bit that I had broken my previous insistence on "code point" versus "character" terminology -- that's what I get for stepping away from the blog for two weeks!  Terminology corrected; anyone who doesn't know the difference between code points and characters needs to go back and read this blog from the beginning, or at least Part 5.)

1 Comments
Filed under: ,

   First, I apologize for not updating recently -- at work, my dev machine's power supply died, and took my hard drive with it.  Luckily, I had everything backed up; however, I had to copy everything over to, and work on, a single-monitor Longhorn dogfood box with no major apps installed.  This went on for a week and a half while I waited for Dell to slog through the warranty process for new parts and have them installed by a Dell-authorized tech (in order to keep the warranty going) and this put me behind schedule for several deadlines.  So, now that my dev machine has a new PSU and HDD I've been frantically working to get caught up on things, and this has left little time for the blog.  In about two weeks these deadlines will be behind me, and I can start posting with regularity again.

   Also, at this point I'm now primarily doing implementation of previously discussed ideas, so this series of posts will temporarily serve two purposes: discussion of issues, and journal of coding concerns about implementing this in C++.  And this post concerns one of the C++ concerns: how do you define operator[] for a string that's in a variable-width encoding such as UTF-8?  One of the basic assumptions in std::string that I intend to honor is that operator[] returns a reference to the actual data, not a copy.  For fixed-width encodings such as ASCII, UCS2, or UCS4, this is not a problem; I simply return a unsigned char/short/long.  However, for variable-width encodings, I need to return a range of bytes, and presumably a size as well.  I could do this with covariant returns and unions, but this is horribly ugly -- and I'd need a lot of different returns, since UTF-8 alone can have up to six bytes in a single code point.

   My solution is to return a proxy object, MultiByteChar.  When I initially decided on this, one of my coworkers pointed out that I would run into the same problem as vector<bool>.  The Vector Wrapper Problem, as some refer to it, deserves a bit of discussion.

   The C++ standard defines that all implementations of the STL container std::vector<T> should include a specialization vector<bool> that stores the bits in packed form.  (Contrast with an array of bools -- bools can be stored in memory as if they were any of several integral types, depending on situation and the intelligence of the compiler).  In this case, if operator[] returns a bool, you cannot write expressions such as a[3] = true; -- there's no bool back there!  You need to return a proxy object containing a pointer/reference to the source container, with operator= overloaded, in order to support assignment in this manner.  However, this breaks with the definition of std::vector<T> -- the standard simultaneously claims that any operator[] on a vector must return some type that is convertible to T &.  This bit of doublespeak results in the inability to reliably write certain types of wrappers around vector that can accept bool.

   My belief is that this was an oversight of the standardization committee.  They took the first step towards solving this by defining operator[] (and the iterator's dereference operators) as returning a member typedef, ref_type; however, they stopped short of a goal, by saying that ref_type had to be defined from the allocator for the vector.  A better solution would be to define a set of semantics and overloaded operators that suitably encapsulated the intent, purpose, and behavior of references, and defining this as a Reference typeclass.  They could then simply require that ref_type be some type meeting the Reference(T) requirements, and all would be well.  This is what I intend to do.

   The only remaining question is how to handle assignment; at first I planned to make it read-only, but later decided to maintain a reference to the host string and call replace() on the encoding/store in response to an operator=.  This means that a MultiByteChar must be templated on the source string in order to be typesafe.  This brings up the question of the string's lifetime and the ref's lifetime being separate; however, traditional C++ says that operations such as destruction may invalidate iterators/references/etc. anyways.  In this case, I think it's reasonable to be the same.  (This also means it's okay to use a member reference variable; in almost every case, pointers are preferable, since references cannot be assigned to, only copy-constructed.)

   As far as implementation goes, I've completed the unmanaged_ptr and vector_of_bytes backing stores, and am currently working on the fixed_width_encoding parent class that all fixed width encodings such as UCS2 and ASCII derive from.  Next post, I will likely talk about the interactions of encoding and backing store classes, and how I've divided responsibilities between them.

   To finish this post off, though, a quick oddity about the use of widen() in iostreams.  widen() is defined on streams as handling certain platform-specific character conversions, such as converting '\n' to the appropriate end-of-line character on your platform (CR for Unix and Mac OS X, CRLF for Windows, LF for Classic MacOS).

  • cout << '\n'; outputs cout.widen('\n'), as you'd expect.

  • cout << "\n"; iterates through all characters in the string (as reported by traits<char>::length()) and outputs the result of cout.widen() on each one, as you'd expect.

  • cout << string("\n"); does NOT widen characters.  It directly asks for cout's streambuf, and xsputn()'s the entire contents of data() into it.  Do not pass locale, do not collect i18n.

   I'm still thinking over how I want to define my behavior for operator<<.

0 Comments
Filed under: ,

    In our last episode, we briefly discussed possible behaviors for encoding_cast, and we discussed how the STL's basic_string class was structured -- namely, we noted that it had several core functions that were overloaded many times for various types of input.  We also noted that we could avoid many of the implementation headaches that result, because of our decision to generalize our backing store.

    One of my coworkers pointed out that Herb Sutter had already done an excellent dissection of basic_string in Exceptional C++ Style -- and, indeed, the last four chapters of the book are spent analyzing its structure, breaking it down to the core functions, and then implementing many of the functions and overloads as non-member template functions.  However, he's not looking to improve basic_string's foundation -- he's merely explaining how reducing the number of methods in basic_string makes the code much easier to maintain.  (For example, rather than writing an empty() member function, he writes a templated empty function that takes a STL string or container, and returns true if the string's begin and end iterators are equal.)

    Furthermore, he specifically chooses some less-than-ideal but good-enough implementations as a result of making simplicity the primary goal.  For example, in his implementation of resize(), he implements the shrinking case by using a basic_string constructor to make a copy of the first N characters of the string, and then calls swap(), so he's incurring a memory allocation and deallocation there that is unneccessary.  While Sutter's treatment is good, we have a slightly more ambitious goal in mind (making a better class to replace std::string, rather than merely improving upon the existing implementation through decomposition), so we're not duplicating effort.

    That said, I agree with his approach of decomposing functions with many overloads such as insert and replace, especially considering that our choice to generalize backing stores eliminates most of my objections to his techniques.  So, I've decided to make a basic_rmstring class after all, in a sense.  The basic_rmstring class will have a single member function for each major piece of functionality, such as insertion or replacement or concatenation.  We'll then make an rmstring wrapper class that provides overloads in a way to make it roughly equivalent to std::string.

    Now, on to a concern I alluded to in the last entry: distinguishing code points and characters.

    Up until now, I've specifically used the word "code point" to refer to a single symbol in the Unicode/UCS tables, even though Unicode refers to them as characters.  I chose to do this because of the existence of "combining characters", which are symbols associated with the previous "base character" such as accents, enclosing boxes/circles, formatting marks for subscript/superscript, and so on.  Unicode contains unaccented base characters, combining characters, and "precomposed characters" that use a single codepoint to represent a pre-accented base character.  These are considered always canonically equivalent to some combination of a base character and one or more composing characters.  (See Part 1 for an example of this.)

    Unicode defines a set of normalization forms that are used to standardize whether to favor combining characters or precomposed characters.  However, regardless of whether pre-composed characters are favored or not, there are some character sequences which do not have pre-composed equivalents and must be represented using combining characters.  To make things even nastier, there are some combining characters, most notably double diacritics, that can span multiple base characters.  (And I haven't even gotten into Arabic and Hebrew scripts that can change the direction of rendering in mid-string!)

    Of course, our problem here is that most programmers don't think about accents as being distinct elements to iterate through!  When you hit the right arrow in Microsoft Word to skip over an À, you don't go first to an A and then to the A's accent -- you move past the whole "character."  (Unicode refers to this rough definition of character as a "grapheme cluster," FYI.)  If it weren't for double diacritics, we could shrug and say "Well, a character is a base codepoint plus zero or more combining codepoints."  But it may not be that easy.

    After taking a walk to think it over, I ended up deciding to err on the side of the Unicode standard -- we'll treat double diacritics as a glyph problem.  Namely, a double diacritic is attached to the preceeding base codepoint only, and the fact that it extends over the following base codepoint as well is a glyphing concern.  (This is also due to the fact that most of the double diacritics can also be represented as a pair of "combining halfmark" where half of the glyph is applied to each character as two separate combining characters, and the glyphing engine is expected to recognize this and render it as a single glyph.)  So, we can say that a grapheme cluster is a base character, plus zero or more combining code points, plus any uses of the Combining Grapheme Joiner codepoint.

    So, do we want basic_rmstring to take integer index arguments, iterators, etc. as referring to code points, or to grapheme clusters?  For the sake of programmer familiarity, we're going to default to clusters, but we'll allow code points.  We will have a single iterator class that takes a bool in its construction describing whether advance() and related methods should advance by codepoint or by cluster.  Our begin, end, and other such iterator methods will be templated with a default template argument to clusters; thus, you can ask for a codepointer iterator by calling str.begin<codepoints>().  This is a bit messy, but workable.

    Before, we listed the methods that seemed worthwhile to carry over.  However, many of them can be implemented as versions of the others.  Tomorrow, we'll actually write a complete header for basic_rmstring and start implementing it.

    That, and I think it's about time I go buy a hardcover copy of the Unicode standard, as I have way too many PDFs on my desktop right now.

3 Comments
Filed under: ,

   In our last episode, we established that we wouldn't be able to make a true std::string replacement and still handle variable-width encodings.  So, we started with the beginning lines of an rmstring class.  However, this doesn't mean we are going to dispense with std::string entirely!  But first, a quick answer about my choice of names and an explanation about exceptions.

   A friend of mine asked me yesterday, "Don't you intend to make a basic_rmstring and then have a typedef'd rmstring that hardwires a specific specialization, like ASCII?"  I'm considering this -- but if I hardwire anything, it will not be the encoding type.  Trying to abstract away the encoding as hidden information is exactly the thinking that got us into this mess with std::string!  However, what we use for the backing store might be worth standardizing.  After all, using a vector<byte> to contain our bitstream is a very flexible choice; it's just not the best-performing one.  Whenever possible, we should make a library easy to use on the surface, and expose the guts of it to be changed once someone already has the program running and is trying to improve on it (by, for example, using string literals as backing stores and only copying them to heap memory when needed.)

   In a dream world, we would typedef a partial specialization.  However, we get bit by one of the most annoying mis-features in C++ -- you can't template a typedef.  Even the STL is crippled by this, and has to work around it using its ::rebind member.  So, the best we could do is allow someone to #define rmstring(enc) basic_rmstring<enc, vector_of_bytes>, and declare a string as rmstring(iso8859_1) str;..  It'd work, but it makes me cringe.  Alternately, we could use a rebind approach like the STL: 

template <class Enc> struct rmstring {
   
typedef basic_rmstring<Enc, vector_of_bytes> type;
};

rmstring<iso8859_1>::type str;

   Really, both of them are pretty damned ugly; the preprocessor approach is prettier, IMHO, but is also considerably more dangerous.  So, I'm going to leave it as rmstring with two template values for the purposes of this blog.  Eventually I'll probably opt for the #define for my own version of the library, but you can choose whichever is more appealing to you (conciseness versus typesafety), or choose neither.

   The second thing I wanted to answer from yesterday were those two exceptions, missing_symbol and malformed_data, that I listed next to the encoding_cast() function.  What on earth are they for?  First off, imagine that you're trying to convert a string from UCS-4 to UCS-2.  As I mentioned in Part 2, UCS-2 is a non-universal encoding, and there are some code points that it cannot represent.  What happens if our UCS-4 string contains one of those code points?  In this case, we will throw the missing_symbol exception.  We will also throw it in the case of converting to legacy character sets that simply do not have a code point defined for a symbol.

   There's something to keep in mind, though.  The popularity of JPEG proves that a lossless transform is not always necessary.  Imagine that we have the greek letter Æ -- is it acceptable to convert this to two characters, AE?  The proper answer is neither yes or no;it's "sometimes."  Remember, all this time, our definitions of string have been derived from a definition of symbols that a human interprets -- and this means that whether or not a 'close enough' translation is acceptable depends on who's looking at the string.  Imagine that a blind person is using a screenreader (a program that uses a computerized voice to read text as it appears on the screen).  In that case, there's a vast difference between Æ and AE.  However, for a person with normal sight reading a webpage, however, the two might be interchangable.

   The computer scientist in me says that I should only allow lossless transforms -- the engineer in me knows better, though, and there's a way to satisfy both.  Therefore, we are going to add a third template argument to yesterday's definition of encoding_cast, and allow it to have a default specialization.  This default specialization will be called the "symbol clash resolver" and has a well-known method invoked whenever a missing symbol problem occurs.  The default one, lossless_resolver, will throw missing_symbol in all cases.  A user can define alternatives, though.

   Two possible alternatives immediately occur to me -- one called visual_parity_resolver that does replacements like the above, and another called error_symbol_resolver that acts like RS232's error character, inserting a compile-time constant instead (such as a box symbol, or an "<ERROR>" string, or whatever suits the user) whenever a symbol cannot be translated.  But those can all wait for later -- only lossless_resolver needs to be immediately defined, and its definition is trivial, since it just throws :)

   The other exception, malformed_data, comes from if we try to decode a buffer that has an error in it.  In the case of UTF-8, there are sequences that decode to illegal or nonsensical numbers, and if we are asked to decode these sequences, we should let the user know.  Imagine a scenario where you are writing an Internet server daemon, and expect to recieve a UTF-8 encoded string as the first transmission following a client successfully connecting.

   In this scenario, we recv() the data from the server into a buffer, and then construct an rmstring<utf8, unmanaged_pointer> to read it.  If there was an error in network transmission, or a malicious client was testing our ability to handle bad data, we should communicate this to the programmer as an error.  Thus, if an encoding can detect illegal input (very few encodings can!) it may throw a malformed_data exception if you invoke any operations that hit that input, or if you attempt to trans-code it.  We will also probably want to make a compile-time flag visible on the encoding class that determines whether or not it can have malformed data.

   So, with those two issues resolved, let's get down to our dirty business!

   I said earlier that we had to pick one of two mutually exclusive goals: be a perfect drop-in replacement for std::string, or support variable-width encodings such as UTF-8.  Since I think std::string is poorly designed and I demonstrated that not being string-compatible is only a loss for stringstream compatibility, I'm favoring the latter.  (Just hating std::string alone would not be sufficient reason -- in that case I'd just be suffering from NIH syndrome.)

   However, this doesn't mean that I can just go roll my own string class in the way that best suits my urges.  Many programmers have devoted considerable time and energy to learning std::string's ins and outs, myself included -- so, I should exploit that knowledge by providing similar functions with similar arguments, as long as it doesn't compromise my design's principles.

   Looking at basic_string's definition in the C++ Standard is an exercise in mental stamina.  It defines six constructors (one of which requires some very special trickery with templating and the SFINAE principle to implement, as we'll see later) and over 100 methods, plus a host of non-member operators such as << and +.  However, looking at the expected behavior for each function, most of them are overloads that call a base function.

   In other words, a basic_string has one or two core definitions at most for each core method (such as append(), replace(), insert(), etc.), which take basic_strings as their input.  Every other overload is defined as equivalent to calling that root function, with a basic_string constructor meant to convert some other form of string (char pointer, run of chars, pair of iterators, etc.) to a basic_string that the "core implementation" can grok.

   Of course, they don't all implement them like that, because it'd mean frivolously making a copy of the input data in basic_string form for each trivial overload.  Instead, a typical implementation of std::string has an optimized version for each variant, making maintenance a nightmare.  But we don't have that problem -- because, instead of requiring an STL allocator, we can accept an arbitrary backing store!  So, suppose we have a working implementation of append:

template < class Encoding, class BackingStore > class rmstring {
...
   // Appends n codepoints of str, starting at pos, to the string.
   // * Will throw an out_of_range exception if pos >= str.length()
   // * If pos is in range, but posn > str.length(), n is truncated so that pos + n = str.length().
   // * Will throw an length_error exception if the resulting string would be larger than BackingStore's max_size().

   template < class OtherBS > rmstring & append( rmstring<Encoding, OtherBS> const & str, size_type pos, size_type n ) {
      
/* implementation */
   }
...
};

   (Note that I've defined the above in terms of code points, not symbols.  There can be multiple codepoints representing a single symbol.  I'll discuss this problem, and the related problem of Unicode normalization forms, in a later post -- namely because I'm still working on a solution.  :-P This is a learning exercise for me too!)

   Because OtherBS is arbitrary, we can directly implement the other overloads of append() as calls to append() with a rmstring constructor, without worrying about needlessly duplicating information.  If we want to use a char * from an ANSI C function, we can just use a unmanaged_pointer backing store.  If we want to use n repetitions of some character c, we can just use a run_of_chars<n, c> backing store.  We pass the exact same information as if we were doing it the old way, but abstracted inside a templated class, so there's no overhead except at compiletime.  Beautiful!

   So, what should we implement from std::string?  Here's the core functions from basic_string that seem worth carrying over:

  • Size functions: size() and length(), max_size(), capacity(), reserve(), resize(), empty(), clear()

  • Iterators: begin(), end(), rbegin(), rend()

  • Accessors: operator[], at()

  • Replacers: assign(), operator=

  • Appenders: push_back(), push_front(), append(), operator+=, operator+

  • Modifiers: insert(), erase(), replace()

  • Searchers (evil): find(), rfind(), find_first_of(), find_last_of(), find_first_not_of(), find_last_not_of()

  • Utilities: substr(), copy(), swap()

  • Comparators (also evil): compare(), operator==, operator!=, operator<, operator>, operator<=, operator>=

  • Streams: operator<<, operator>>

  • Backwards compatibility: c_str(), data()

   That's a lot of stuff to implement!  But not only does it gain us good-will by allowing programmers to code much like they did with std::string, it also means that we can make a typedef rmstring<RMS_COMPILER_SPECIFIC_ENCODING, vector_of_bytes> rstring, and be pretty damned close to std::string-equivalent.  (The compiler-specific encoding can be set in a header file, or specified on the command line -- I'll likely set it to iso8859_1 for string and ucs2 for wstring in a header.)

   But before I get to that, I'll have a nastier problem to tackle, and that's combining characters.  Not only do we have codepoints that can take up variable amounts of space (thanks to encoding), but we also have symbols that can take up variable amounts of codepoints!  (See Part 1 and search for "diaeresis" if you're not sure why this is.)  Unicode, luckily, comes to the rescue again with a standard that determines when and how a character symbol or should not be broken down into combining characters.  These are called normalization forms, and we'll tackle those on Monday.

   Next episode: Normalization forms and chain of command (which does not involve rmstring covering its ass if things go FUBAR).



Takeaways from Part 4:

  • We're specifically designing rmstring to force the programmer into awareness of encodings -- we don't want to hide that with a basic_rmstring being typedefed.  (We couldn't anyways, because we can't template typedefs.)  So, for now, we'll leave it as-is.

  • Not only are all encodings inequal, not all trans-coding schemes are equal either!  Be aware of this, and think about how you want to handle errors!

  • Even if we think std::string is evil, we can still gain good will from our potential users by making ourselves as close to std::string as possible.  This, unfortunately, means lots of work.  But not as much as if we were actually implementing std::string, due to our luck in choosing to template our backing store.

  • However, all our methods need to be defined in terms of symbols, not code points (and certainly not bytes of encoded data!).  This makes our life difficult again.
0 Comments
Filed under: ,

   (Before I start: I've gotten a few suggestions about readability, since my two entries thus far have been quite long.  So, entries will now contain a summary at the end with major facts/conclusions, and I'll go back and add them for the first two posts.  I'll also try to pace my paragraphs more regularly.  Thanks for the advice!)

   Yesterday, we took the definition of string as an ordered sequence of Unicode code points, and explored various schemes for encoding and decoding code point indices on a binary computer.  At the end, we had a new definition for string -- a stream of bits, and some type of information identifying the encoding scheme used to interpret the bits as a stream of Unicode code points.  Today, since I'm a coder, we'll be starting a C++ implementation of a string library based on this definition.

   Before we do that, though, there's one more nasty digression into standards-land that I'd like to take.  This is a fairly general definition of what a string is, and you don't really write libraries unless you intend for them to be general-purpose enough to be reused.  So, it might be a worthwhile goal to make our new string library compatible with the string class in the C++ Standard Template Library, so that anyone could gain its benefits simply by using a different #include.  Unfortunately, there's some restrictions that the C++ Standard (which I would highly suggest purchasing if you code in C++ for a living -- it's $18 in PDF form direct from ISO) which prevent us from doing so -- namely, that many parts of basic_string are hard-wired to require a constant-size encoding and will not work with encodings such as UTF-8.

   The C++ Standard starts by defining basic_string as templated on three classes -- a character type (charT), a specialization of char_traits for that type, and an allocator for that type.  (Nothing SAYS we have to implement it with exactly those template parameters, but we're screwed anyways, as you'll see.)  It then defines two static typedefs for that specialization: traits_type, which typedefs to the templated traits specialization, and value_type, which typedefs to traits_type::value_type... which, by definition, is also required to be charT.  The definition of char_traits requires that char_traits be specialized only on PODs (which are always constant-size), and its definitions all are written to assume uniformly-sized characters.

   If the traits problem wasn't enough, on top of that, a conformant basic_string implementation requires that s[i] return the same value as s.data()[i], and data is required to return a const charT *.  So, even if we could get around the traits problem, variable-length encodings still screw us because operator[] and a pointer offset will no longer agree.

   So, we will have to abandon hopes of being a drop-in replacement for basic_string.  But, really, this isn't too bad -- there's only three other libraries in the STL that require the use of basic_string!  The first is in locale, and hardly anyone uses C++'s built-in locales anyways, favoring OS functionality.  The second is the bitset container, which hardly anyone uses either.  The third is its use as a backing store for stringstreams and as the stringbuf wrapper that is the foundation of iostream, and this is a bigger loss.

   The loss of direct compatibility with stringbuf is a big pain.  However, when you're getting to I/O, you need to have already converted your string to the encoding your user is expecting -- we shouldn't expect a prompt expecting ASCII to be able to deal with a stream of UCS-2 characters!  So, it's perfectly okay if stringbuf is left alone, as long as we find a way to convert strings between different encodings.  So, stringstreams are the only real loss, and we can make our own stringstream, if need be.  (Thanks to templates, we may be able to avoid having to re-invent the wheel, which is always good.)

   I'm going to start with policy-based design, which Alexandrescu introduced a few years ago in Modern C++ Design.  (Actually, the STL beat him to the punch by using allocators as a template argument for most of its containers, but he popularized its use for general customization.)  In fact, he already demonstrated policy-based design in a CUJ article a year or two ago by making a basic_string replacement that allowed customizing copy-on-write semantics -- but I'm a bit more ambitious :)

   My first stab at the class will be based directly off our most recent definition of string -- an encoding, and an ordered sequence of bits:

namespace rmlibs {

   namespace encodings {
      /* ... utf8, iso8859_1, big5, mac_roman, etc. go here ... */
   };

   namespace backing_stores {
      /* ... string_literal, vector_of_uchars, etc. go here ... */
   };

   template <class Encoding, class Bits> class rmstring {
      public:
         typedef Encoding encoding_type;

      private:
         Bits _data;
   };

};

   Not much, but it's a start!

   At this point, I want to reference something I said earlier about I/O -- when you're doing I/O, whether that's taking a string in or sending a string out, your stream of bits needs to have the same encoding as the device you're talking with, or Bad Things happen.  We need some way to denote, inside code, that an encoding change needs to take place.  (Guessing ahead, this will probably be the most tedious part of development -- creating UCS-to-encoding and encoding-to-UCS transitions for each encoding and character set we support.)  I'm going to take a nod from the excellent Boost library here, and make an analogue to their lexical_cast class.

namespace rmlibs {

   // these are the major exceptions...

   
class missing_symbol;
   class malformed_data;

   // ... that are thrown by:

   template <typename Target, typename Source> Target encoding_cast(Source str);
};

   In the near future I'll probably alter this to take only rmstrings as input and output and template on encoding types in/out, since right now it accepts any pair of types -- but this is only a prototype.

   The goal for doing this is to minimize conversions.  Some of my coworkers who have been kind enough to proofread have remarked, "I'd just throw up my hands and convert everything internally to UCS-4 and use a basic_string<unsigned long>; after all, memory is cheap."  In a way, they're right -- doing this would mean I'd only have to write encoding_cast() for each encoding, and not even need the new string class.  But, I'm a performance guy, a bit twiddler at heart.  I don't want to do a conversion unless I need to, or if the performance gains from a fixed-width format like UCS-4 outweigh the performance loss of having to trans-code everything.

   (It's rather like image formats -- TGA is lossless and can hold damn near anything, but that doesn't mean we always convert everything to TGA first before working with it, and then convert back when we're done.  Not everything has to be "worked on," and not all work is equally difficult.  This is especially true if we're using a compile-time string literal as a backing store, since it won't be modifiable unless you make a copy!)

   The general plan is to use rmstring as a Facade pattern for the Encoding class we're templated on.  Most of rmstring's methods will actually call the Encoding class and pass in state and a pointer to our Bits object as needed; the Encoding class will handle all the work of character traversal.  Since many of the encodings we're planning to deal with are fixed-width (UCS-2, UCS-4, and most old systems like ISO 8859 and ASCII), I'll likely create a FixedWidthEncoding base class that does most of the work of locating offsets and insertion/deletion, and inherit most of the Encodings from it.  This means, the main thing that will be unique for each Encoding will be the translation tables used for converting the symbol sets for non-Unicode systems to Unicode code points, since most of the older encodings are simple fixed-width affairs and just have non-standard symbol sets.

   Tomorrow, we'll start fleshing out rmstring's body with constructors and methods, and explain what those two exceptions next to encoding_cast are for.  We'll also take a brief look at screen-readers and web browsers, and make a change to encoding_cast to handle "looks-close-enough" trans-codes.



Today's facts/conclusions:

  • The definitions of basic_string and char_traits in the C++ Standard prevent use of variable-width encodings; therefore, we cannot make a perfect drop-in replacement for the STL string class.  However, that's okay -- the only STL object we'll have to duplicate functionality for is stringstream.

  • We can't expect I/O with external devices/programs to conform to whatever encoding we want -- they're expecting a specific encoding, and we need to present our data in that format -- or die a horrible, painful death.  So, the ability to trans-code is absolutely necessary.

  • Trans-coding can be expensive, but can have some gains, especially if going to UCS-4 for speed in manipulation or going to UTF-8 for compatibility with legacy C APIs.  Do it when necessary or justified, but avoid it if it's not absolutely necessary.  The coder should be allowed to pick an encoding and work with strings in that encoding as easily as possible.
0 Comments
Filed under: ,

   At the end of the last post, we reduced the abstract concept of "string" down to an "ordered sequence of Unicode code points."  (We did so by choosing to actively ignore glyph information, but we'll be coming back to it later.)  Unicode code points are simply numbers; of course, numbers have to be reduced to binary to be stored in a computer.  And someone who is reading a string from a file, or from memory, needs to use the exact same encoding scheme, or we're off in la-la land.  And not all encodings are equal.

   First off, the simplest route.  There are 231 possible Unicode code points, and an x86 register is 32 bits wide, so let's just add a zero and encode everything as a 32-bit unsigned binary!  The ISO-10646 standard calls this UCS-4.  Only one catch -- it doesn't specify endianness.  Of course, this poses a problem if you want to trade text files between PCs and Macs.  So, UCS-4 actually is three different encodings -- UCS-4LE (little endian), UCS-4BE (big endian), and just plain UCS-4, which means that no endian is specified and you should assume that it's the host's encoding unless told otherwise.  (There are ways to tell otherwise -- but I'll mention them later.)

   Now, the ISO-10646 guys recognized that the majority of the written languages used on the Internet today can be expressed using a tiny subset of the 231 symbols, and it seems a waste to use four bytes for every character if the high bytes are 0 most of the time.  So, ISO-10646 also defines UCS-2, which uses a 16-bit unsigned binary, but can only represent the lower 216 code points.  (The lower 216 codepoints are thus referred to as the Basic Multilingual Plane, or BMP.  This includes Latin, Greek, Cyrillic, Devangari, hiragana, katakana, and Cherokee scripts, as well as many mathematical symbols and a small set of basic Han ideographs.)

   This is the first encoding we'll encounter that is non-universal -- there are some strings that are expressible using Unicode characters which UCS-2 cannot be used to encode.  Sadly, UCS-2 was adopted by early versions of the Unicode specification, and so UCS-2 is what most people think of when they hear "Unicode".  We can't blame them, though -- it took until 2001 for ISO to use up all 216 code points in the BMP, and by then they were adding Han ideographs in bulk.

   Next, we start diving into encodings that were invented for reverse compatibility with older standards.  As we said, early versions of Unicode specifiy UCS-2 as a standard, back when nothing existed in the UCS tables beyond the BMP.  When it became obvious that eventually people would need to use codepoints beyond 216, a hybrid encoding called UTF-16 was created.  The Unicode Consortium reserved a high range of codepoints (D800 to DFFF) to be used as "surrogate characters," so that up to 10242 characters above the BMP border could be represented as two consecutive surrogate characters, without breaking existing UCS-2 content.  This adds a brand new level of complexity to string handling, because now a single codepoint could be either 2 or 4 bytes.  This makes even simple tasks such as iterating over the string with a for-loop difficult.

   Later on, UTF-32 was introduced.  UTF-32 is effectively identical to UCS-4 -- its sole difference is in the specification.  UTF-32 claims that it should not be used to represent characters above 0x10FFFF.  (Nothing is stopping it, though -- it's still just a unsigned long int.)  I mention it mostly for completeness, and so you'll recognize the name.  And don't forget that all of these encodings have endianness to worry about, so we've really covered 12 encodings for Unicode so far: UCS-4(BE/LE/host), UCS-2(BE/LE/host), UTF-16(BE/LE/host), and UTF32(BE/LE/host).

   (Windows-specific digression here: WCHAR is typedef'd inside winnt.h to wchar_t, whose size is determined by the compiler you're using.  On Visual C++ .NET 2004, wchar_t is currently an 'unsigned short' and uses UCS-2LE; on gcc, unless specified otherwise it's an 'int'.  The encoding for gcc varies by version and by compiler setting, though, and gcc 3.3 in particular is horribly buggy and can corrupt your string literals.)

   Everything's fine and dandy thus far, except for one catch -- we can't send strings in these encodings to old webservers that use C functions like strcmp(), strlen(), strcpy(), etc. -- or any other function that relies on the presence of a null byte to denote where the string ends.  Why?  Because, for any string that uses only the Latin alphabet (i.e. one that you could write in plain old ASCII), the first byte in any of the above encodings will be 00.

   Because of this, there's one more standard Unicode encoding, and that's the notorious UTF-8.  UTF-8 can be thought of as a relative of Huffman encoding -- it guarantees that all codepoints less than or equal to 0x7F are encoded as single unsigned bytes (i.e. direct 7-bit ASCII correspondence), and that all codepoints greater than 0x7F are encoded as a multi-byte sequence.  All bytes in a multi-byte sequence have their MSB set, and the first byte of such a codepoint contains the number of bytes that follow.

   So, in exchange for being a messy variable-length format that's hard to work with, UTF-8 can encode the entire set of Unicode codepoints and guarantees that any UTF-8 string will be correctly handled by a function expecting a null-terminated string.  Also, since UTF-8 is specifically meant to be handled a byte at a time, it avoids the entire messy problem of endianness.

   (Historical note: UTF stands for UCS Transformation Format.  The infamous Ken Thompson created UTF-8 in 1992 on a napkin in a New Jersey diner, for use in Plan9, and reported their success with it to the 1993 USENIX conference.  Unicode and ISO both formally standardized it in 2001, although the Unicode adds the extra clause that it should not be used to express codepoints above 0x10FFFF, just like UTF-32.)

   Now, I mentioned earlier that the other formats had to choose endianness: explicitly specify, or shrug and assume that it's the same as the host.  There is a common solution to this -- and that's to use a marker to determine the endianness.  This marker is known as the Byte Order Mark, or BOM for short, and is Unicode code point 0xFEFF ("ZERO-WIDTH NO-BREAK SPACE" -- a null symbol, effectively).  If you encounter the character 0xFFFE while decoding, you know that the file you're reading was written on a machine of opposite endianness, and you should flip bytes.  (Unicode code point 0xFFFE has been specifically designated as an invalid character for this purpose.)  Keep in mind that you may encounter multiple BOMs in a string and may have to switch back and forth!  (This could happen if, for example, you used UNIX cat to concatenate two text files, and one was UCS-2BE and one was UCE-2LE.)  UTF-8, being specifically designed to be parsed on a byte-by-byte basis, does not need a BOM. 

   There's a few other standard Unicode encodings, but the big 13 above are the only ones that you see regularly.  I'll mention the other ones briefly, mainly because they show up in some old internet protocols:

  • UTF-7 was an early attempt to translate Unicode points to 7-bit-ASCII text for use in MIME-encoded emails.  It specified that all 7-bit characters should be transmitted as single bytes, like UTF-8.  However, rather than use the eighth bit to denote a multibyte character, it overloaded the + sign as a sentinel.  "+-" denoted that a normal plus should appear; for any other following character, the following three bytes were the UCS-2 encoding, re-encoded in Base64.  It could not transmit anything outside the BMP.  UTF-7 is used, slightly modified, in parts of the IMAP mail protocol; for POP3 and SMTP, however, it has mostly been bypassed in favor of UTF-8.

  • SCSU (Standard Compression Scheme for Unicode) was an early attempt at a variable-length encoding like UTF-8 proposed by Reuters News, that added light compression as well.  However, small compression schemes like this are painfully inefficient compared to larger schemes like LZW or BWT, and they makes it very difficult to handle internally.  SCSU is not used in any major protocol or file format that I know of today.

  • Punycode (RFC 3942) is similar to UTF-7 and uses the string "xn--" as a sentinel.  Punycode is only used in one situation -- the IDNA (Internationalizing Domain Names in Applications) protocol used to handle use of Unicode domain names in DNS.  An IDNA-capable web browser will capture a string from the address bar, translate it to ASCII text using the Punycode system, and send the converted string as a standard getaddrbyname() DNS request, and the DNS server translates it back to Unicode upon reciept before doing the lookup.  If you're making a better bind, or fixing Firefox, this will be of interest to you; I do not expect to encounter files or other strings encoded in this system.

   So, that's the set of encodings that directly encode Unicode code points.  There's only one catch -- there's also encodings out there that don't directly map to Unicode codepoints!  In this case, we have to do an two-part mapping to get to Unicode -- first, decoding to a symbol number in the source that matches that encoding's symbol set, and then converting that to a Unicode codepoint!  Yuck.  And we're going to encounter a lot of these too, because these have names we recognize like ASCII Code Page 437 and ISO 8859-1 and Windows DBCS and GB and Big5 -- all those legacy formats, some of which are also variable-length like UTF-8.  We've got our work cut out for us!

   Of course, I never mentioned what it is we actually were working on...  I'm out to create a string class for C++ that doesn't suck.  Now, there's three ways that C++'s std::string sucks.  The first sin is that you have to use a backing store -- you can't tell it to use a string literal like L"Schöne Grüße" as a source, since allocator<char> requires that the target be modifiable.  All contents have to be copied, because contents are always mutable.  The second sin is that it assumes that the compiler and author knows what they're doing when they manipulate its contents.  To C++, a basic_string<T> is really just a pretty interface on a vector<T>.  The third sin may vary to some people; for me, it exists in the forms of some stupid promises that 14882 (the ISO C++ standard) wasn't willing to make, most notably that the c_str() method is capable of invalidating references, pointers, and iterators.  This was mostly done to accomodate copy on write and other implementation details, but it makes writing conformant string-handling code infuriatingly difficult if you ever have to interface std::string with C functions that need C strings (such as, say, the Win32 API!).

   I'm opting to only fix the first two sins for now -- backing store handling, and encoding awareness.  The third sin, you can handle according to your level of offendedness.  Tomorrow: Policy-based design using templates, minimizing conversions, and why 14882's char_traits makes it impossible to make a strictly conformant std::string that supports variable-length encodings.



Today's facts/conclusions:

  • We have to store the code points in a string somehow.

  • A lot of pain comes from wanting to retain reverse compatibility with old character sets and old encodings.

  • Large fixed-width formats like UCS-2 and UCS-4 make string manipulation very easy since they allow random access to individual code points, but are not compatible with old C functions that expect null-terminated strings.  However, keep an eye out for endianness.

  • Variable-width formats like UTF-8 are compatible with null-termination functions, but have to be parsed sequentially.
4 Comments
Filed under:

    What is a string?  

    About six months ago at the Game Developers Conference in San Jose, I sat in on a talk about performance tuning in Xbox games.  The presenter had a slide that read:  "Programmers love strings.  Love hurts."  This was shown while he described a game which was using a string identifier for every object in the game world and hashing on them, and was incurring a huge performance hit from thousands of strcmp()s each frame.  I nodded -- but my mind was thinking, "The same would be true if they had used GUIDs, or any other large identifier.  After all, strcmp is just a bounded memcmp."  So, what actually IS a string?

    I think it's safe to say that a string is something that a human interprets and derives meaning from.  In this case, that something is almost always an ordered sequence of symbols (note: the symbols may not be co-linear!) that conveys meaning.  Now, let's assume from here on that a string is an ordered sequence of 2D glyphs.  A glyph is three pieces of data: a symbol, the dimensions to render that symbol at, and the location where it should be rendered.  This is still describing a very abstract, human-centric thing.  To express this in the programming world, we have to identify these glyphs somehow.  A vector drawing or bitmap approximation of a glyph would suffice.  But we don't want to require that people deal with these just to print "Hello World" to the screen.  So, let's put the glyphs somewhere in the system, and assign indices to them.

    And thus, we have ISO 10646, known as the Universal Character Set or UCS.  UCS is a simple mapping of decimal indices (called code points) and formal names, to symbols.  For example, in the UCS, code point 0x41 is "Latin capital letter A" and corresponds to, of course, the letter A.  The goal of UCS is to be a superset of all character sets.  So, given a set of characters such as 7-bit ASCII, or ISO 8859-1, or EBCDIC, we can find some mapping (preferably 1:1, but we're not always so lucky) to UCS.  So, our definition of glyph now converts to a tuple containing a UCS code point, a size, and a distance from the last render point.

    We now find ourselves asking a few more questions about glyphs.  Size is fairly easy to measure -- just a box that bounds the symbol.  However, distance is difficult, because good typesetting requires that the distance between characters be measured from any number of points inside that box.  For simple Roman alphabets, we might want to measure from the baseline; accents might have to go relative to baseline + ascent; some characters may have an advance width that is greater than their bounding box; and this doesn't even begin to address script-based languages like Arabic!

    On top of this, UCS allows two ways to represent accented characters.  Most accented characters have a dedicated UCS code point; however, an accented character can also be represented as the code point for the un-accented character, followed by code points for one or more accents as stand-alone symbols.  UCS calls symbols which are meant to be applied to the previous character "combining characters," and refers to symbols containing preaccented letters as "precomposed characters."

    For example, the symbol Ä can be represented by either the precomposed UCS code point 0xC4 ("Latin capital letter A with diaeresis") or by the code point 0x41 ("Latin capital letter A") immediately followed by code point 0x308 ("combining diaeresis").  And don't forget that there needs to be size and direction between the diaeresis and the letter, and that there can be more than one combining character following a single symbol, including some symbols which can vary their positioning depending on their combination with other combiners!

    The UCS took the easy way (or, as some would argue, the sanest way) out of dealing with all the positioning problems of glyphs -- it simply refused to acknowledge their existence.  The UCS is simply a symbol table that includes combining characters, nothing more.   The UCS also doesn't deal with any of the properties that we assign to specific symbols; for example, it doesn't recognize case.  It cannot say that Ä and A are upper-case and a is lower-case, or that Ä and A have the same root letter and differ only by accent, or that a is the same root letter as those two -- they're simply different symbols with no relation.  As a result, the UCS isn't very well known, despite the fact that it has existed for over a decade.  This is where Unicode comes in.

    Unicode originally started out in the late 1980s as an ad-hoc standard agreed on by a group of companies making multi-lingual software products.  Initially, Unicode was developed separately from UCS; however, starting in 1991 Unicode merged its code table with UCS, and all versions of Unicode from 1.1 (June 1992) forward match the UCS.  Unicode does not define glyph data, or the vectors that are used to render a symbol.  However, it does provide lots of normative semantic information that UCS code points lack.  For example, a Unicode code point not only contains the UCS symbol, but also data such as the symbol's case (upper/lower/title), category (letter, mark/accent, digit, punctuation, separator, etc.), and numeric interpretations of digit symbols (i.e. the symbol 4 represents four things).  Alongside this, we have the Unicode Technical Standards, which define culturally appropriate comparison, sorting, and searching algorithms, character boundaries in script languages, how to handle newlines (CR/LF/CRLF/NEL), and other such handy information.

    Let us assume, for now, that given an ordered sequence of Unicode code points, the OS can convert them to glyphs and render them in a way that's appropriate.  Of course, this is a huge and almost entirely false assumption -- and I'll be coming back to it later.  But it's also a very convenient assumption, because it allows us to reduce the definition of a string down to something that's easy to tackle: a finite ordered sequence of Unicode code points.  Of course, in order to store a decimal on a computer, it has to be converted to binary.  However, not all binary representations are the same, and not everyone thinks it's worth using 31 bits of information for every character.  Tomorrow's episode: encoding systems, and the major character sets that love them.



Today's facts/conclusions:

  • Strings should be thought of as human-centric, rather than tied to a video card's interpretation of regularly-sized bits.

  • Strings are composed of glyphs.  A glyph consists of a symbol, plus typesetting information.

  • There's already a standard table called ISO 10646, or UCS, that maps code points (numbers) to symbols.  Unicode adds semantics like case, comparison rules, and sorting algorithms to UCS.

  • Typesetting information is really tricky to store portably.  UCS and Unicode ignore its existence.

  • If the OS can be relied on to handle glyphing, we can store a string as an ordered sequence of Unicode code points.


    Oh, and since this is the first post here that's visible to the public -- I'm Ryan Myers, a geek-of-all-trades currently on the Windows Client Performance team.  I intend to use this blog as an ongoing set of essays about various facets of programming I've encountered.  (I use essays as the textual equivalent of sitting in front of a whiteboard reasoning things out, rather than a polished report of what I wish I had done the first time.  So, conclusions may change from post to post, and I welcome all comments and counterpoints.)  So, pardon the mess and enjoy the show.

3 Comments
Filed under:
 
Page view tracker