May 2005

  • The Old New Thing

    When is a window visible yet not visible?

    • 6 Comments

    Today, a quick puzzler.

    Consider the following code fragment:

     ShowWindow(hwnd, SW_SHOWNORMAL);
     assert(IsWindowVisible(hwnd));
    

    We just showed the window, so certainly it is visible, right? Yet the assertion can fire (even in the absence of multi-threading). Why?

    Answer below - stop reading if you want to try to solve it yourself.

    Take a look at the IsWindowVisible function.

    If the specified window, its parent window, its parent's parent window, and so forth, have the WS_VISIBLE style, the return value is nonzero. Otherwise, the return value is zero.

    The WS_VISIBLE style indicates that this window is visible in its parent. But the parent itself might not be visible, in which case IsWindowVisible returns FALSE.

    [Raymond is currently on vacation; this message was pre-recorded.]

  • The Old New Thing

    Loading the dictionary, part 6: Taking advantage of our memory allocation pattern

    • 25 Comments

    After our latest round of optimization, the 100ms barrier teased us, just milliseconds away. Profiling the resulting program reveals that 60% of the CPU is spent in operator new. Is there anything we can do about that?

    Indeed, we can. Notice that the memory allocation pattern for the strings in our dictionary is quite special: Once a string is allocated into the dictionary, it is never modified or freed while the dictionary is in use. When the dictionary is freed, all the strings are deleted at once. This means that we can design an allocator tailored to this usage pattern.

    I don't know whether there is a standard name for this thing, so I'm just going to call it a StringPool. A string pool has the following characteristics:

    • Once you allocate a string, you can't modify or free it as long as the pool remains in existence.
    • If you destroy the string pool, all the strings in it are destroyed.

    We implement it by using the same type of fast allocator that the CLR uses: A single pointer.

           allocated            free
        +---------------+---------------+
        |###############|               |
        +---------------+---------------+
                        ^
                        p
    To allocate memory, we just increment p by the number of bytes we need. If we run out of memory, we just allocate a new block, point p to its start, and carve the memory out of the new block. Destroying the pool consists of freeing all the blocks.

    Note also that this memory arrangement has very good locality. Instead of scattering the strings all over the heap, they are collected into one location. Furthermore, they are stored in memory in exactly the order we're going to access them, which means no wasted page faults or cache lines. (Well, you don't know that's the order we're going to access them, but it's true. This is one of those "performance-guided designs" I mentioned a little while ago.)

    class StringPool
    {
    public:
     StringPool();
     ~StringPool();
     LPWSTR AllocString(const WCHAR* pszBegin, const WCHAR* pszEnd);
    
    private:
     union HEADER {
      struct {
       HEADER* m_phdrPrev;
       SIZE_T  m_cb;
      };
      WCHAR alignment;
     };
     enum { MIN_CBCHUNK = 32000,
            MAX_CHARALLOC = 1024*1024 };
    
    private:
     WCHAR*  m_pchNext;   // first available byte
     WCHAR*  m_pchLimit;  // one past last available byte
     HEADER* m_phdrCur;   // current block
     DWORD   m_dwGranularity;
    };
    

    Each block of memory we allocate begins with a StringPool::HEADER structure, which we use to maintain a linked list of blocks as well as providing enough information for us to free the block when we're done.

    Exercise: Why is HEADER a union containing a structure rather than just being a structure? What is the significance of the alignment member?

    inline DWORD RoundUp(DWORD cb, DWORD units)
    {
        return ((cb + units - 1) / units) * units;
    }
    
    StringPool::StringPool()
     : m_pchNext(NULL), m_pchLimit(NULL), m_phdrCur(NULL)
    {
     SYSTEM_INFO si;
     GetSystemInfo(&si);
     m_dwGranularity = RoundUp(sizeof(HEADER) + MIN_CBCHUNK,
                               si.dwAllocationGranularity);
    }
    

    At construction, we compute the size of our chunks. We base it on the system allocation granularity, choosing the next multiple of the system allocation granularity that is at least sizeof(HEADER) + MIN_CBCHUNK in size. Since a chunk is supposed to be a comfortably large block of memory, we need to enforce a minimum chunk size to avoid having an enormous number of tiny chunks if we happen to be running on a machine with a very fine allocation granularity.

    LPWSTR StringPool::AllocString(const WCHAR* pszBegin, const WCHAR* pszEnd)
    {
     size_t cch = pszEnd - pszBegin + 1;
     LPWSTR psz = m_pchNext;
     if (m_pchNext + cch <= m_pchLimit) {
      m_pchNext += cch;
      lstrcpynW(psz, pszBegin, cch);
      return psz;
     }
    
     DWORD cbAlloc;
     BYTE* pbNext;
     if (cch > MAX_CHARALLOC) goto OOM;
     cbAlloc = RoundUp(cch * sizeof(WCHAR) + sizeof(HEADER),
                       m_dwGranularity);
     pbNext = reinterpret_cast<BYTE*>(
                VirtualAlloc(NULL, cbAlloc, MEM_COMMIT, PAGE_READWRITE));
     if (!pbNext) {
    OOM:
      static std::bad_alloc OOM;
      throw(OOM);
     }
    
     m_pchLimit = reinterpret_cast<WCHAR*>(pbNext + cbAlloc);
     HEADER* phdrCur = reinterpret_cast<HEADER*>(pbNext);
     phdrCur->m_phdrPrev = m_phdrCur;
     phdrCur->m_cb = cbAlloc;
     m_phdrCur = phdrCur;
     m_pchNext = reinterpret_cast<WCHAR*>(phdrCur + 1);
    
     return AllocString(pszBegin, pszEnd);
    }
    

    To allocate a string, we first try to carve it out of the remainder of the current chunk. This nearly always succeeds.

    If the string doesn't fit in the chunk, we allocate a new chunk based on our allocation granularity. To avoid integer overflow in the computation of the desired chunk size, we check against a fixed "maximum allocation" and go straight to the out-of-memory handler if it's too big.

    Once we have a new chunk, we link it into our list of HEADERs and abandon the old chunk. (Yes, this wastes some memory, but in our usage pattern, it's not much, and trying to squeeze out those last few bytes isn't worth the added complexity.) Once that's done, we try to allocate again; this second time will certainly succeed since we made sure the new chunk was big enough. (And any decent compiler will detect this as a tail recursion and turn it into a "goto".)

    There is subtlety here. Notice that we do not update m_pchNext until after we're sure we either satisfied the allocation or allocated a new chunk. This ensures that our member variables are stable at the points where exceptions can be thrown. Writing exception-safe code is hard, and seeing the difference between code that is and isn't exception safe is often quite difficult.

    StringPool::~StringPool()
    {
     HEADER* phdr = m_phdrCur;
     while (phdr) {
      HEADER hdr = *phdr;
      VirtualFree(phdr, 0, MEM_RELEASE);
      phdr = hdr.m_phdrPrev;
     }
    }
    

    To destroy the string pool, we walk our list of chunks and free each one. Note the importance of copying the HEADER out of the chunk before we free it!

    Using this string pool requires only small changes to the rest of our program.

    struct DictionaryEntry
    {
     bool Parse(const WCHAR *begin, const WCHAR *end, StringPool& pool);
     // void Destruct() {
     //  delete[] m_pszTrad;
     //  delete[] m_pszSimp;
     //  delete[] m_pszPinyin;
     //  delete[] m_pszEnglish;
     // }
     LPWSTR m_pszTrad;
     LPWSTR m_pszSimp;
     LPWSTR m_pszPinyin;
     LPWSTR m_pszEnglish;
    };
    
    class Dictionary
    {
    public:
     Dictionary();
     // ~Dictionary();
     int Length() { return v.size(); }
    private:
     vector<DictionaryEntry> v;
     StringPool m_pool;
    };
    
    // Dictionary::~Dictionary()
    // {
    //  for (vector<DictionaryEntry>::iterator i = v.begin();
    //       i != v.end(); i++) {
    //   i->Destruct();
    //  }
    // }
    

    We no longer need to free the strings in the DictionaryEntry manually, so the Destruct method and the Dictionary destructor can go.

    bool DictionaryEntry::Parse(
       const WCHAR *begin, const WCHAR *end,
       StringPool& pool)
    {
     const WCHAR* pch = std::find(begin, end, L' ');
     if (pch >= end) return false;
     m_pszTrad = pool.AllocString(begin, pch);
     begin = std::find(pch, end, L'[') + 1;
     if (begin >= end) return false;
     pch = std::find(begin, end, L']');
     if (pch >= end) return false;
     m_pszPinyin = pool.AllocString(begin, pch);
     begin = std::find(pch, end, L'/') + 1;
     if (begin >= end) return false;
     for (pch = end; *--pch != L'/'; ) { }
     if (begin >= pch) return false;
     m_pszEnglish = pool.AllocString(begin, pch);
     return true;
    }
    
    Dictionary::Dictionary()
    {
     ...
        if (de.Parse(buf, buf + cchResult, m_pool)) {
     ...
    }
    

    And finally, we pass our string pool to DictionaryEntry::Parse so it knows where to get memory for its strings from.

    With these changes, the dictionary loads in 70ms (or 80ms if you include the time it takes to destroy the dictionary). This is 70% faster than the previous version, and over three times as fast if you include the destruction time.

    And now that we've reached our 100ms goal, it's a good time to stop. We've gotten the running time of dictionary loading down from an uncomfortable 2080ms to a peppier 70ms, a nearly 30-fold improvement, by repeatedly profiling and focusing on where the most time is being spent. (I have some more simple tricks that shave a few more milliseconds off the startup time. Perhaps I'll bring them into play if other changes to startup push us over the 100ms boundary. As things stand, the largest CPU consumers are MultiByteToWideChar and lstrcpynW, so that's where I would focus next.)

    That's the end of the first stage. The next stage will be displaying the dictionary in an owner-data listview, but you'll have to wait until next month.

  • The Old New Thing

    Loading the dictionary, part 5: Avoiding string copying

    • 28 Comments

    Looking at the profile for our program so far, 35% of the CPU time is spent copying strings around. Let's see if we can improve that.

    The best way to speed up copying strings is not to copy them in the first place. Using a wstring in our DictionaryEntry structure forces the vector class to copy the string data, when all we really need to copy is the pointer and size information. The actual characters can stay put. C++ has a copy constructor but not a "move" constructor.

    Let's use plain string pointers rather than wstring objects. The "copy constructor" for a string pointer is just to copy the pointer—exactly what we want here.

    struct DictionaryEntry
    {
     bool Parse(const WCHAR* begin, const WCHAR* end);
     void Destruct() {
      delete[] m_pszTrad;
      delete[] m_pszSimp;
      delete[] m_pszPinyin;
      delete[] m_pszEnglish;
     }
     LPWSTR m_pszTrad;
     LPWSTR m_pszSimp;
     LPWSTR m_pszPinyin;
     LPWSTR m_pszEnglish;
    };
    

    The DictionaryEntry is no longer a structure of wstrings but rather is just a structure of LPWSTRs, which copy much faster. The cost for this, however, is having to free all the strings manually in the dictionary destructor (which we will see below).

    Since we aren't using wstrings any more, we need to allocate the memory for the strings and copy them the old fashioned way.

    LPWSTR AllocString(const WCHAR* begin, const WCHAR* end)
    {
     int cch = end - begin + 1;
     LPWSTR psz = new WCHAR[cch];
     lstrcpynW(psz, begin, cch);
     return psz;
    }
    
    bool DictionaryEntry::Parse(
           const WCHAR* begin, const WCHAR* end)
    {
     const WCHAR* pch = std::find(begin, end, L' ');
     if (pch >= end) return false;
     m_pszTrad = AllocString(begin, pch);
     begin = std::find(pch, end, L'[') + 1;
     if (begin >= end) return false;
     pch = std::find(begin, end, L']');
     if (pch >= end) return false;
     m_pszPinyin = AllocString(begin, pch);
     begin = std::find(pch, end, L'/') + 1;
     if (begin >= end) return false;
     for (pch = end; *--pch != L'/'; ) { }
     if (begin >= pch) return false;
     m_pszEnglish = AllocString(begin, pch);
     return true;
    }
    

    There isn't a std::rfind function, so I coded up a backwards-search-for-slash loop inline. Exercise: Why don't I have to check that pch hasn't underflowed beyond the beginning of the string?

    class Dictionary
    {
    public:
     Dictionary();
     ~Dictionary();
     int Length() { return v.size(); }
     const DictionaryEntry& Item(int i) { return v[i]; }
    private:
     vector<DictionaryEntry> v;
    };
    
    Dictionary::Dictionary()
    {
     ...
       if (cchResult){
        // wstring line(buf, cchResult);
        DictionaryEntry de;
        if (de.Parse(buf, buf + cchResult)) {
         ...
    }
    
    Dictionary::~Dictionary()
    {
     for (vector<DictionaryEntry>::iterator i = v.begin();
          i != v.end(); i++) {
      i->Destruct();
     }
    }
    

    The last bits of the change are to get rid of the temporary wstring we parsed from, since we don't need it any more, and to free all the strings in the Dictionary's destructor.

    This new program clocks in at 120ms (or 290ms if you include the time it takes to destroy the dictionary). Again, we're twice as fast as the previous version, but still haven't quite crossed the 100ms barrier. And the time to destroy the dictionary is now starting to become a significant factor. But I have another trick up my sleeve.

    [Raymond is currently on vacation; this message was pre-recorded.]

  • The Old New Thing

    You can't escape those AOL CDs

    • 15 Comments

    One of my colleagues was unpacking one of those $30,000 quad-processor more-memory-than-you-know-what-to-do-with super-server computers. The kind that require their own electrical substation.

    And it came with an AOL CD.

    It's like buying a Lexus and finding a 35-cents-off coupon in the glove compartment.

    Apparently, one of the questions AOL tech support asks when people call in complaining that they can't get their AOL CD to work is, "Do you have a computer?" [Opening Panel Round, second question] because so many people who don't have computers stick the CD into their stereo or DVD player and can't get it to work.

    [Raymond is currently on vacation; this message was pre-recorded.]

  • The Old New Thing

    Loading the dictionary, part 4: Character conversion redux

    • 24 Comments

    Getting rid of getline was a big help, but 480ms is still not quite peppy enough. You need to respond to user actions within a tenth of a second for things to seem responsive.

    Profiling the latest endeavor reveals that 40% of our CPU time is spent in codecvt::in. Some debugging reveals that codecvt::in ultimately calls MultiByteToWideChar but uses it to convert only one or two characters at a time, even though we handed it a whole line.

    Let's get rid of codecvt::in and convert the characters ourselves, calling MultiByteToWideChar exactly once to convert the entire line at a single go.

    #define CP_BIG5 950
    
    Dictionary::Dictionary()
    {
     MappedTextFile mtf(TEXT("cedict.b5"));
     // typedef std::codecvt<wchar_t, char, mbstate_t> widecvt;
     // std::locale l(".950");
     // const widecvt& cvt = _USE(l, widecvt); // use_facet<widecvt>(l);
     const CHAR* pchBuf = mtf.Buffer();
     const CHAR* pchEnd = pchBuf + mtf.Length();
     while (pchBuf < pchEnd) {
      const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\n');
      if (*pchBuf != '#') {
       size_t cchBuf = pchEOL - pchBuf;
       wchar_t* buf = new wchar_t[cchBuf];
       DWORD cchResult = MultiByteToWideChar(CP_BIG5, 0,
                              pchBuf, cchBuf, buf, cchBuf);
       if (cchResult) {
        wstring line(buf, cchResult);
        DictionaryEntry de;
        if (de.Parse(line)) {
         v.push_back(de);
        }
       }
       delete[] buf;
      }
      pchBuf = pchEOL + 1;
     }
    }
    

    Instead of using the codecvt::in method to perform character conversion, we go straight to the MultiByteToWideChar function. Notice that we assume that the Big5 string will not generate more Unicode characters than its length in bytes. This happens to be a safe assumption based on our external knowledge of the Big5 encoding. (If the encoding were something else, the assumption may no longer be valid.)

    With this change, the dictionary load time has dropped to 240ms (or 300ms if you include the time it takes to destroy the dictionary). That's twice as fast as the previous version, but still not quite close enough to the 100ms goal. We still have some work ahead of us.

    [Raymond is currently on vacation; this message was pre-recorded.]

  • The Old New Thing

    Loading the dictionary, part 3: Breaking the text into lines

    • 28 Comments

    Even after moving the character conversion out of the getline function, profiling reveals that getline is still taking nearly 50% of our CPU. The fastest code is code that isn't there, so let's get rid of getline altogether. Oh wait, we still need to break the file into lines. But maybe we can break the file into lines faster than getline did.

    #include <algorithm>
    
    class MappedTextFile
    {
    public:
     MappedTextFile(LPCTSTR pszFile);
     ~MappedTextFile();
    
     const CHAR *Buffer() { return m_p; }
     DWORD Length() const { return m_cb; }
    
    private:
     PCHAR   m_p;
     DWORD   m_cb;
     HANDLE  m_hf;
     HANDLE  m_hfm;
    };
    
    MappedTextFile::MappedTextFile(LPCTSTR pszFile)
        : m_hfm(NULL), m_p(NULL), m_cb(0)
    {
     m_hf = CreateFile(pszFile, GENERIC_READ, FILE_SHARE_READ,
                       NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
     if (m_hf != INVALID_HANDLE_VALUE) {
      DWORD cb = GetFileSize(m_hf, NULL);
      m_hfm = CreateFileMapping(m_hf, NULL, PAGE_READONLY, 0, 0, NULL);
      if (m_hfm != NULL) {
       m_p = reinterpret_cast<PCHAR>
                     (MapViewOfFile(m_hfm, FILE_MAP_READ, 0, 0, cb));
       if (m_p) {
        m_cb = cb;
       }
      }
     }
    }
    
    MappedTextFile::~MappedTextFile()
    {
     if (m_p) UnmapViewOfFile(m_p);
     if (m_hfm) CloseHandle(m_hfm);
     if (m_hf != INVALID_HANDLE_VALUE) CloseHandle(m_hf);
    }
    

    This very simple class babysits a read-only memory-mapped file. (Yes, there is a bit of oddness with files greater than 4GB, but let's ignore that for now, since it's a distraction from our main point.)

    Now that the file is memory-mapped, we can just scan it directly.

    Dictionary::Dictionary()
    {
     MappedTextFile mtf(TEXT("cedict.b5"));
     typedef std::codecvt<wchar_t, char, mbstate_t> widecvt;
     std::locale l(".950");
     const widecvt& cvt = _USE(l, widecvt); // use_facet<widecvt>(l);
     const CHAR* pchBuf = mtf.Buffer();
     const CHAR* pchEnd = pchBuf + mtf.Length();
     while (pchBuf < pchEnd) {
      const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\n');
      if (*pchBuf != '#') {
       size_t cchBuf = pchEOL - pchBuf;
       wchar_t* buf = new wchar_t[cchBuf];
       mbstate_t state = 0;
     const char* nextsrc;
       wchar_t* nextto;
       if (cvt.in(state, pchBuf, pchEOL, nextsrc,
                       buf, buf + cchBuf, nextto) == widecvt::ok) {
        wstring line(buf, nextto - buf);
        DictionaryEntry de;
        if (de.Parse(line)) {
         v.push_back(de);
        }
       }
       delete[] buf;
      }
      pchBuf = pchEOL + 1;
     }
    }
    

    We simply scan the memory-mapped file for a '\n' character, which tells us where the line ends. This tells us the location and length of the line, which we use to convert it to Unicode and continue our parsing.

    Exercise: Why don't we have to worry about the carriage return that comes before the linefeed?

    Exercise: Why don't we have to worry about possibly reading past the end of the file when we check *pchBuf != '#'?

    With this change, the program now loads the dictionary in 480ms (or 550ms if you include the time it takes to destroy the dictionary). That's over twice as fast as the previous version.

    But it's still not fast enough. A half-second delay between hitting Enter and getting the visual feedback is still unsatisfying. We can do better.

    [Raymond is currently on vacation; this message was pre-recorded.]

  • The Old New Thing

    The Microsoft corporate network: 1.7 times worse than hell

    • 17 Comments

    Today I'm going to tell a story from 1996. Why? Because I can.

    One of the tests performed by Windows Hardware Quality Labs (WHQL) was the NCT packet stress test which had the nickname "Hell". The purpose of the test was to flood a network card with an insane number of packets, in order to see how it handled extreme conditions. It uncovered packet-dropping bugs, timing problems, all sorts of great stuff. Network card vendors used it to determine what size internal hardware buffers should be in order to cover "all reasonable network traffic scenarios".

    It so happened that at the time this test had currency (1996 era), the traffic on the Microsoft corporate network was approximately 1.7 times worse than the NCT packet stress test. A card could pass the Hell test with flying colors, yet drop 90% of its packets when installed on a computer at Microsoft because the card simply couldn't keep up with the traffic.

    The open secret among network card vendors was, "If you want your card to work with Windows, submit one card to WHQL and send another to a developer on the Windows team."

    (This rule applied to hardware other than network cards. I was "gifted" a sound card from a major manufacturer and installed it on my main machine. It wasn't long before I found and fixed a crashing bug in their driver.)

    [Raymond is currently on vacation; this message was pre-recorded.]

  • The Old New Thing

    Loading the dictionary, part 2: Character conversion

    • 24 Comments

    When you want to optimize a program, you first need to know where the time is being spent. There's no point optimizing a function that isn't actually responsible for your poor performance. For example, if a particular function is responsible for 2% of your CPU time, then even if you optimized it down to infinite speed, your program would speed up by only a little over 2%. In the comments to yesterday's entry, several people put forth suggestions as to how the program could be optimized, in the process quite amply demonstrating this principle. None of the people who made suggestions actually investigated the program to see where the glaring bottleneck was.

    (Rico Mariani points out that you also need to take performance into account when doing high-level designs, choosing algorithms and data structures that are suitable for the level of performance you need. If profiling reveals that a fundamental design decision is responsible for a performance bottleneck, you're in big trouble. You will see this sort of performance-guided design as the program develops. And you should check out Performance Quiz #6, which starts with the very program we're developing here.)

    Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in getline. Clearly this is where the focus needs to be. Everything else is just noise.

    Digging a little deeper, it turns out that 29% of the CPU time was spent by getline doing character set conversion in codecvt::do_in. Some debugging revealed that codecvt::do_in was being called millions of times, each time converting just one or two characters. In fact, for each character in the file, codecvt::do_in was called once and sometimes twice!

    Let's get rid of the piecemeal character set conversion and instead convert entire lines at a time.

    Dictionary::Dictionary()
    {
     std::ifstream src;
     typedef std::codecvt<wchar_t, char, mbstate_t> widecvt;
     std::locale l(".950");
     const widecvt& cvt = _USE(l, widecvt); // use_facet<widecvt>(l);
     src.open("cedict.b5");
     string s;
     while (getline(src, s)) {
  if (s.length() > 0 && s[0] != '#') {
       wchar_t* buf = new wchar_t[s.length()];
       mbstate_t state = 0;
    const char* nextsrc;
       wchar_t* nextto;
       if (cvt.in(state, s.data(), s.data() + s.length(), nextsrc,
                       buf, buf + s.length(), nextto) == widecvt::ok) {
        wstring line(buf, nextto - buf);
        DictionaryEntry de;
        if (de.Parse(line)) {
         v.push_back(de);
        }
       }
       delete[] buf;
      }
     }
    }
    

    Instead of using a wifstream, we just use a non-Unicode ifstream and convert each line to Unicode manually. Doing it a line at a time rather than a character at a time, we hope, will be more efficient.

    We ask code page 950 for a converter, which we call cvt. Notice that the Microsoft C++ compiler requires you to use the strange _USE macro instead of the more traditional use_facet.

    For each line that isn't a comment, we convert it to Unicode. Our lives are complicated by the fact that codecvt::in requires pointers to elements rather than iterators, which means that we can't use a wstring or a vector; we need a plain boring wchar_t[] array. (Notice that we can cheat on the "from" buffer and use the string::data() function to get at a read-only array representation of the string.) If the conversion succeeds, we convert the array into a proper string and continue as before.

    With this tweak, the program now loads the dictionary in 1120ms (or 1180ms if you include the time it takes to destroy the dictionary). That's nearly twice as fast as the previous version.

    You might think that we could avoid redundant allocations by caching the temporary conversion buffer between lines. I tried that, and surprisingly, it actually slowed the program down by 10ms. Such is the counter-intuitive world of optimization. That's why it's important to identify your bottlenecks via measurement instead of just guessing at them.

  • The Old New Thing

    Loading the dictionary, part 1: Starting point

    • 33 Comments

    The first thing we'll need to do in our little dictionary program is to load the dictionary into memory. The dictionary file is a plain text file, each line of which is of the form

    Chinese [pinyin] /English 1/English 2/.../
    舉例 [ju3 li4] /to give an example/
    

    Since it was the Big5 dictionary we downloaded, the Chinese characters are in Big5 format, known to Windows as code page 950. Our program will be Unicode, so we'll have to convert it as we load the dictionary. Yes, I could've used the Unicode version of the dictionary, but it so happens that when I set out to write this program, there was no Unicode version available. Fortunately, this oversight opened up the opportunity to illustrate some other programming decisions and techniques.

    The first stage in our series of exercises will be loading the dictionary into memory.

    #define UNICODE
    #define _UNICODE
    #include <windows.h>
    #include <string>
    #include <fstream>
    #include <iostream> // for cin/cout
    #include <vector>
    
    using std::string;
    using std::wstring;
    using std::vector;
    
    struct DictionaryEntry
    {
     bool Parse(const wstring& line);
     wstring trad;
     wstring simp;
     wstring pinyin;
     wstring english;
    };
    
    bool DictionaryEntry::Parse(const wstring& line)
    {
        wstring::size_type start = 0;
        wstring::size_type end = line.find(L' ', start);
        if (end == wstring::npos) return false;
        trad.assign(line, start, end - start);
        start = line.find(L'[', end);
        if (start == wstring::npos) return false;
        end = line.find(L']', ++start);
        if (end == wstring::npos) return false;
        pinyin.assign(line, start, end - start);
        start = line.find(L'/', end);
        if (start == wstring::npos) return false;
        start++;
        end = line.rfind(L'/');
        if (end == wstring::npos) return false;
        if (end <= start) return false;
        english.assign(line, start, end-start);
        return true;
    }
    
    class Dictionary
    {
    public:
     Dictionary();
     int Length() { return v.size(); }
     const DictionaryEntry& Item(int i) { return v[i]; }
    private:
     vector<DictionaryEntry> v;
    };
    
    Dictionary::Dictionary()
    {
     std::wifstream src;
     src.imbue(std::locale(".950"));
     src.open("cedict.b5");
     wstring s;
     while (getline(src, s)) {
      if (s.length() > 0 && s[0] != L'#') {
       DictionaryEntry de;
       if (de.Parse(s)) {
        v.push_back(de);
       }
      }
     }
    }
    
    int __cdecl main(int argc, const char* argv[])
    {
     DWORD dw = GetTickCount();
     {
      Dictionary dict;
      std::cout << dict.Length() << std::endl;
      std::cout << GetTickCount() - dw << std::endl;
     }
     std::cout << GetTickCount() - dw << std::endl;
     return 0;
    }
    

    Our dictionary is just a list of words with their English definitions. The Chinese words are written in three forms (traditional Chinese, simplified Chinese, and Pinyin romanization). For those who are curious, there are two writing systems for the Mandarin Chinese language and two phonetic systems. Which one a particular Mandarin-speaking population follows depends on whether they fell under the influence of China's language reform of 1956. Traditional Chinese characters and the Bopomo phonetic system (also called Bopomofo) are used on Taiwan; simplified Chinese characters and the Pinyin system are used in China. Converting Pinyin to Bopomo isn't interesting, so I've removed that part from the program I'm presenting here.

    (The schism in the spelling of the English language follows a similar pattern. Under the leadership of Noah Webster, the United States underwent its own spelling reform, but countries which were under the influence of the British crown retained the traditional spellings. Spelling reform continues in other languages even today, and the subject is almost always highly contentious, with traditionalists and reformists pitted against each other in a battle over a language's—and by proxy, a culture's—identity.)

    The program itself is fairly straightforward. It creates a Unicode file stream wifstream and "imbues" it with code page 950 (Big5). This instructs the runtime to interpret the bytes of the file according to the specified code page. We read strings out of the file, ignore the comments, and parse the rest, appending them to our vector of dictionary entries.

    Parsing the line consists of finding the spaces, brackets, and slashes, and splitting the line into the traditional Chinese, Pinyin, and English components. (We'll deal with simplified Chinese later.)

    When I run this program on my machine, the dictionary loads in 2080ms (or 2140ms if you include the time to run the destructor). This is an unacceptably long startup time, so the first order of business is to make startup faster. That will be the focus of this stage.

    Notice that as a sanity check, I print the total number of words in the dictionary. The number should match the number of lines in the cedict.b5 file (minus the one comment line). If not, then I know that something went wrong. This is an important sanity check: You might make a performance optimization that looks great when you run it past a stopwatch, only to discover that your "optimization" actually introduced a bug. For example, one of my attempted optimizations of this program resulted in a phenomenal tenfold speedup, but only because of a bug that caused it to think it was finished when it had in reality processed only 10% of the dictionary!

    As my colleague Rico Mariani is fond of saying, "It's easy to make it fast if it doesn't have to work!"

  • The Old New Thing

    Developing a Chinese/English dictionary: Introduction

    • 28 Comments

    The other day, one of my colleagues mentioned that his English name "Ben" means "stupid" in Chinese: 笨/bèn/ㄅㄣˋ. (His wife is Chinese; that's why he knows this in the first place.) Knowing that the Chinese language is rich in homophones, I fired up my Chinese/English dictionary program to see if we could find anything better. (Unfortunately, the best I could come up with was 賁/贲/bēn/ㄅㄣ, which means "energetic".)

    Ben seemed to take his appellative fate in stride; he seemed much more interested in the little dictionary program I had written. So, as an experiment, instead of developing tiny samples that illustrate a very focused topic, I'll develop a somewhat larger-scale program (though still small by modern standards) so you can see how multiple techniques come together. The task will take many stages, some of which may take only a day or two, others of which can take much longer. If a particular stage is more than two or three days long, I'll break it up with other articles, and I'll try to leave some breathing room between stages.

    Along the way, we'll learn about owner-data (also known as "virtual") listviews, listview custom-draw, designing for accessibility, window subclassing, laying out child windows, code pages, hotkeys, and optimization.

    If you're going to play along at home, beware that you're going to have to install Chinese fonts to see the program as it evolves, and when you're done, you'll have a Chinese/English dictionary program, which probably won't be very useful unless you're actually studying Chinese...

    If you're not into Win32 programming at all, then, well, my first comment to you is, "So what are you doing here?" And my second comment is, "I guess you're going to be bored for a while." You may want to go read another blog during those boring stretches, or just turn off the computer and go outside for some fresh air and exercise.

    Those who have decided to play along at home will need the following: a copy of the CEDICT Chinese-English dictionary in Big5 format (note: Big5 format) and the Chinese Encoding Converter source code (all we need is the file hcutf8.txt). We'll start digging in next time.
