
Browse by Tags

All Tags » C++
Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of compatibility characters. Compatibility characters are canonically equivalent to a sequence of one or more other Unicode characters; they usually exist so that a single code point corresponds to a character in some older standard. For example, ISO 8859-2 defines 0xA5 to be a capital letter L with a caron accent (Ľ). The "simple" equival Read More...
8 Comments
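As a rough illustration of the mapping described in this excerpt, here is a minimal sketch. In ISO 8859-2, byte 0xA5 is Ľ, which Unicode encodes as the single code point U+013D and which decomposes canonically to L (U+004C) followed by a combining caron (U+030C). The function names latin2_to_unicode and decompose are hypothetical, and the tables are deliberately tiny; a real converter would cover the whole code page.

```cpp
#include <string>

// Hypothetical helper: converts one ISO 8859-2 byte to a Unicode code point.
// Only the ASCII range and the single byte 0xA5 are handled in this sketch;
// everything else maps to U+FFFD (REPLACEMENT CHARACTER).
char32_t latin2_to_unicode(unsigned char byte) {
    if (byte < 0x80) return byte;          // identical to ASCII
    if (byte == 0xA5) return U'\u013D';    // LATIN CAPITAL LETTER L WITH CARON
    return U'\uFFFD';
}

// Hypothetical helper: canonical decomposition for the one character above.
// Any other code point is returned unchanged.
std::u32string decompose(char32_t cp) {
    if (cp == U'\u013D') {
        return {U'L', U'\u030C'};          // L + COMBINING CARON
    }
    return std::u32string(1, cp);
}
```

The point of the sketch is that the legacy byte maps to one code point, while the decomposed form of the same character takes two.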
Imagine that you've allocated a byte array for recv()ing data from a TCP socket. If you know that said content is UCS-4, the natural urge is to cast it to an unsigned long * to iterate over... except that you can't. Or, at least, you shouldn't. If that byte array isn't suitably aligned for 32-bit accesses, code will either run slowly (on x86 and AMD64) or crash (on IA-64, unless SetErrorMode() is called to force OS alignment fixups, in which case it will run extremely slowly). Of cour Read More...
1 Comment
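One common way to sidestep the alignment problem described in this excerpt is to copy each code unit out of the byte array with std::memcpy instead of casting the pointer. The sketch below is not necessarily the approach the full post takes; read_ucs4 is a hypothetical name, and it assumes the bytes are already in the host's byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: reads 32-bit code units from a byte buffer that may
// start at any address. Assumes host byte order; trailing bytes that do not
// fill a whole 32-bit unit are ignored.
std::vector<std::uint32_t> read_ucs4(const unsigned char* bytes, std::size_t len) {
    std::vector<std::uint32_t> out;
    out.reserve(len / sizeof(std::uint32_t));
    for (std::size_t i = 0; i + sizeof(std::uint32_t) <= len; i += sizeof(std::uint32_t)) {
        std::uint32_t unit;
        // memcpy has no alignment requirement on its source, unlike
        // dereferencing a casted unsigned long* / uint32_t*.
        std::memcpy(&unit, bytes + i, sizeof unit);
        out.push_back(unit);
    }
    return out;
}
```

Compilers typically turn a fixed-size memcpy like this into a plain load where the target allows it, so the copy usually costs little.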
How do you define operator[] for a string that's in a variable-width encoding such as UTF-8? One of the basic assumptions in std::string that I intend to honor is that operator[] returns a reference to the actual data, not a copy. For fixed-width encodings such as ASCII, UCS-2, or UCS-4, this is not a problem; I simply return an unsigned char/short/long. However, for variable-width encodings, I need to return a range of bytes, and presumably a size as well. I could do this with covariant return Read More...
0 Comments
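To make the "range of bytes plus a size" idea concrete, here is a minimal sketch for UTF-8. The excerpt is cut off before the author's actual design, so everything here (CharRange, Utf8String, utf8_length) is a hypothetical illustration, and the lead-byte check does no validation of the continuation bytes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical view type: points into the string's own storage, no copy.
struct CharRange {
    const unsigned char* data;
    std::size_t size;
};

// Sequence length implied by a UTF-8 lead byte (no validation).
inline std::size_t utf8_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;  // stray continuation or invalid byte: treat as one unit
}

class Utf8String {
public:
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

    // Returns the bytes of the index-th code point. This walks from the
    // start, so it is O(n) rather than O(1) like std::string::operator[].
    CharRange operator[](std::size_t index) const {
        std::size_t pos = 0;
        for (std::size_t i = 0; pos < bytes_.size(); ++i) {
            std::size_t n = utf8_length(static_cast<unsigned char>(bytes_[pos]));
            if (n > bytes_.size() - pos) n = bytes_.size() - pos;  // truncated tail
            if (i == index) {
                return {reinterpret_cast<const unsigned char*>(bytes_.data()) + pos, n};
            }
            pos += n;
        }
        throw std::out_of_range("code point index out of range");
    }

private:
    std::string bytes_;
};
```

For example, indexing position 1 of the bytes "a", 0xC3, 0x80, "z" would yield a two-byte range starting at 0xC3.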
However, regardless of whether pre-composed characters are favored or not, there are some character sequences which do not have pre-composed equivalents and must be represented using combining characters. Of course, our problem here is that most programmers don't think about accents as being distinct elements to iterate through! When you hit the right arrow in Microsoft Word to skip over an À, you don't go first to an A and then to the A's accent -- you move past the whole "character." (Unico Read More...
6 Comments
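The cursor behaviour described in this excerpt can be sketched as "step over the base character, then over any combining marks that follow it." The code below is a simplification under stated assumptions: is_combining_mark only recognises the Combining Diacritical Marks block (U+0300 to U+036F), and next_boundary is a hypothetical name. Full user-perceived character segmentation follows more rules than this.

```cpp
#include <cstddef>
#include <string>

// Simplified test: only the Combining Diacritical Marks block is recognised.
inline bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: given a position at the start of a "character",
// returns the position just past it, including trailing combining marks.
std::size_t next_boundary(const std::u32string& text, std::size_t pos) {
    if (pos >= text.size()) return text.size();
    ++pos;  // step over the base code point
    while (pos < text.size() && is_combining_mark(text[pos])) {
        ++pos;  // absorb each combining mark into the same unit
    }
    return pos;
}
```

With the sequence A, U+0300, B, one right-arrow press from position 0 lands on position 2, past both the A and its grave accent.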
In our last episode, we established that we wouldn't be able to make a true std::string replacement and still handle variable-width encodings. So, we started with the beginning lines of an rmstring class. However, this doesn't mean we are going to dispense with std::string entirely! And, as it turns out, compatibility with it is both easier and harder than actually making a std::string, depending on what you're implementing and where... Read More...
2 Comments
Yesterday, we took the definition of string as an ordered sequence of Unicode code points, and explored various schemes for encoding and decoding code point indices on a binary computer. At the end, we had a new definition for string -- a stream of bits, and some type of information identifying the encoding scheme used to interpret the bits as a stream of Unicode code points. Today, since I'm a coder, we'll be starting a C++ implementation of a string library based on this definition. Read More...
1 Comment
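The definition in this excerpt, raw bits plus something that names the encoding, might look roughly like the following in C++. This is only a sketch of the idea, not the class from the full post: Encoding, EncodedString, code_unit_size, and code_unit_count are all hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tag naming how the bytes should be interpreted.
enum class Encoding { Ascii, Utf8, Ucs2, Ucs4 };

// A string as "a stream of bits plus an encoding identifier".
struct EncodedString {
    Encoding encoding;
    std::vector<std::uint8_t> bytes;
};

// Size in bytes of one code unit for each encoding in this sketch.
inline std::size_t code_unit_size(Encoding e) {
    switch (e) {
        case Encoding::Ascii: return 1;
        case Encoding::Utf8:  return 1;
        case Encoding::Ucs2:  return 2;
        case Encoding::Ucs4:  return 4;
    }
    return 0;  // unreachable for valid enum values
}

// Number of whole code units stored. For the fixed-width encodings this is
// also the number of code points; for UTF-8 it is only the byte count.
inline std::size_t code_unit_count(const EncodedString& s) {
    return s.bytes.size() / code_unit_size(s.encoding);
}
```

Keeping the tag next to the bytes means the same storage type can hold any encoding, at the cost of a run-time check on every operation that depends on it.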
 