Funny, It Worked Last Time

... and other odd mutterings of a performance junkie

January, 2005

  • Encodings in Strings are Evil Things (Part 8)

    Going from a legacy format to Unicode is fairly simple; in addition to combining characters, Unicode also provides an array of compatibility characters. Compatibility characters are canonically equivalent to a sequence of one or more other Unicode characters; they usually exist so that a single code point corresponds to a character in some older standard. For example, ISO 8859-2 assigns 0xA5 to a capital letter L with a caron accent (Ľ). The "simple" equival...
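To make the equivalence concrete, here is a minimal sketch in C. It assumes the standard ISO 8859-2 assignment of Ľ to byte 0xA5, which maps to the single code point U+013D and is canonically equivalent to U+004C followed by U+030C (combining caron). The type `legacy_entry`, the two-row table and both helper functions are hypothetical names made up for illustration; a real converter would carry a full mapping table or call a Unicode library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table row: one legacy byte, its single precomposed code
   point, and the equivalent base + combining sequence. */
typedef struct {
    unsigned char legacy;      /* byte value in ISO 8859-2 */
    uint32_t      precomposed; /* single Unicode code point */
    uint32_t      base;        /* base letter of the decomposed form */
    uint32_t      combining;   /* combining mark of the decomposed form */
} legacy_entry;

/* Two-row excerpt only; this is an illustration, not a complete table. */
static const legacy_entry iso8859_2_excerpt[] = {
    { 0xA5, 0x013D, 0x004C, 0x030C }, /* L with caron = L + combining caron */
    { 0xB5, 0x013E, 0x006C, 0x030C }, /* l with caron = l + combining caron */
};

/* Returns the table row for a legacy byte, or NULL if it is not in the excerpt. */
static const legacy_entry *find_legacy(unsigned char byte)
{
    size_t n = sizeof iso8859_2_excerpt / sizeof iso8859_2_excerpt[0];
    for (size_t i = 0; i < n; i++) {
        if (iso8859_2_excerpt[i].legacy == byte)
            return &iso8859_2_excerpt[i];
    }
    return NULL;
}

/* Writes the decomposed form into out and returns the number of code
   points written (0 if the byte is not in the excerpt). */
static size_t decompose_legacy(unsigned char byte, uint32_t out[2])
{
    const legacy_entry *e = find_legacy(byte);
    assert(out != NULL);
    if (e == NULL)
        return 0;
    out[0] = e->base;
    out[1] = e->combining;
    return 2;
}
```

Going from the legacy byte to the precomposed code point is a plain table lookup; the decomposed form is what you would compare against if the other side of the comparison was built from combining characters.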
  • Encodings in Strings are Evil Things (Part 7)

    Imagine that you've allocated a byte array to recv() some content from a TCP socket. If we know that said content is UCS-4, the natural urge is to cast the array to an unsigned long * and iterate over it... except that you can't. Or, at least, you shouldn't. If that byte array isn't suitably aligned for 32-bit accesses, the code will either run slowly (on x86 and AMD64) or crash (on IA-64, unless SetErrorMode() is called to force OS alignment fixups, in which case it will run extremely slowly). Of cour...
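One way to sidestep the alignment problem is to never form a misaligned pointer at all. The sketch below is an assumption about how that could look in C, not the approach the post itself goes on to describe: `ucs4_host_at` copies four bytes into a properly aligned local with memcpy(), and `ucs4be_at` assembles the value byte by byte, assuming the sender wrote big-endian UCS-4. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads the index-th UCS-4 code unit from buf, assuming the data is in
   host byte order. memcpy() makes no alignment demands on buf, so the
   read is valid even when buf sits at an odd address. */
static uint32_t ucs4_host_at(const unsigned char *buf, size_t index)
{
    uint32_t value;
    assert(buf != NULL);
    memcpy(&value, buf + index * sizeof value, sizeof value);
    return value;
}

/* Reads the index-th UCS-4 code unit from buf, assuming big-endian byte
   order on the wire. Only single-byte accesses are performed, so neither
   alignment nor host endianness matters. */
static uint32_t ucs4be_at(const unsigned char *buf, size_t index)
{
    const unsigned char *p;
    assert(buf != NULL);
    p = buf + index * 4;
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}
```

Either function can be called in a loop over the received bytes in place of dereferencing a cast pointer. Compilers commonly turn a fixed-size memcpy() like this into a plain load on targets that tolerate misaligned access, so the safe version need not cost much.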