Welcome to MSDN Blogs Sign in | Join | Help

Loading the dictionary, part 4: Character conversion redux

Getting rid of getline was a big help, but 480ms is still not quite peppy enough. You need to respond to user actions within a tenth of a second for thing to seem responsive.

Profiling the latest endeavor reveals that 40% of our CPU time is spent in codecvt::in. Some debugging reveals that codecvt::in ultimately calls MultiByteToWideChar but uses it to convert only one or two characters at a time, even though we handed it a whole line.

Let's get rid of codecvt::in and convert the characters ourselves, calling MultiByteToWideChar exactly once to convert the entire line at a single go.

#define CP_BIG5 950

Dictionary::Dictionary()
{
 MappedTextFile mtf(TEXT("cedict.b5"));
 // typedef std::codecvt<wchar_t, char, mbstate_t> widecvt;
 // std::locale l(".950");
 // const widecvt& cvt = _USE(l, widecvt); // use_facet<widecvt>(l);
 const CHAR* pchBuf = mtf.Buffer();
 const CHAR* pchEnd = pchBuf + mtf.Length();
 while (pchBuf < pchEnd) {
  const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\n');
  if (*pchBuf != '#') {
   size_t cchBuf = pchEOL - pchBuf;
   wchar_t* buf = new wchar_t[cchBuf];
   DWORD cchResult = MultiByteToWideChar(CP_BIG5, 0,
                          pchBuf, cchBuf, buf, cchBuf);
   if (cchResult) {
    wstring line(buf, cchResult);
    DictionaryEntry de;
    if (de.Parse(line)) {
     v.push_back(de);
    }
   }
   delete[] buf;
  }
  pchBuf = pchEOL + 1;
 }
}

Instead of using the codecvt::in method to perform character conversion, we go straight to the MultiByteToWideChar function. Notice that we assume that the Big5 string will not generate more Unicode characters than its length in bytes. This happens to be a safe assumption based on our external knowledge of the Big5 encoding. (If the encoding were something else, the assumption may no longer be valid.)

With this change, the dictionary load time has dropped to 240ms (or 300ms if you include the time it takes to destroy the dictionary). That's twice as fast the previous version, but still not quite close enough to the 100ms goal. We still have some work ahead of us.

[Raymond is currently on vacation; this message was pre-recorded.]

Published Monday, May 16, 2005 7:00 AM by oldnewthing
Filed under:

Comments

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 10:06 AM by Anders Dalvander
Dictionary::Dictionary()
{
MappedTextFile mtf(TEXT("cedict.b5"));
const CHAR* pchBuf = mtf.Buffer();
const CHAR* pchEnd = pchBuf + mtf.Length();
std::vector<wchar_t> buf;
while (pchBuf < pchEnd) {
const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\n');
if (*pchBuf != '#') {
size_t cchBuf = pchEOL - pchBuf;
buf.resize(cchBuf + 1); // +1 to prevent zero length buffer, which is undefined.
DWORD cchResult = MultiByteToWideChar(CP_BIG5, 0,
pchBuf, cchBuf, &buf[0], cchBuf);
if (cchResult) {
wstring line(&buf[0], cchResult);
DictionaryEntry de;
if (de.Parse(line)) {
v.push_back(de);
}
}
}
pchBuf = pchEOL + 1;
}
}

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 12:25 PM by CornedBee
I can't help but think that so far, all you have demonstrated is how inefficient the MS STL implementation is.

Perhaps it's not possible to make it much better. But I don't really believe that.

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 1:20 PM by TrueTom
C++ and its lack of support of Unicode is a big, big pain... You cannot rely on nothing.

# Performance Quiz #6 -- Looking at the fourth cut

Monday, May 16, 2005 2:09 PM by Rico Mariani's Performance Tidbits
Raymond is definately making some great headway.&amp;nbsp; Having targetted the single-character at a time...

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 2:59 PM by niven
I've run the four versions, but get no discernable difference in performance:

version 3: 210ms
version 4: 210ms

(with Borland CPP builder)

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 3:40 PM by Simon Cooke
niven - sounds like you're using GetTickCount() to do your perf timing.

You might want to use the instruction cycle timer or the performance counter instead.

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 4:09 PM by AlienRancher
Just for reference here http://blogs.msdn.com/ricom/ the managed unoptimized version (whith plain old stream ReadLine ) clocking at 124ms which seems to me that the small allocations are the weakest part of the standard library.

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 7:36 PM by Michael Pryhodko
Some notes about code in your example:
wstring's constructor could throw
DictionaryEntry's contructor could throw
de.Parse could throw
v.push_back could throw

Any of these events will give you memory leak, because nobody will call 'delete[] buf'.

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 7:58 PM by Tim Smith
Hold it, I thought I just had to use exceptions and the world became a better place without me having to do anymore work.

*snicker*

Once again showing that switching between return values and exceptions just means that the developer has to check for something different.

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 8:46 PM by Michael Pryhodko
> Once again showing that switching between
> return values and exceptions just means that
> the developer has to check for something
> different.

No -- that means that developer must be always **prepared**. :) That is different.

Other notes:
- it is a good idea to do 'v.reserve(N)' with some meaningfuill value N (in case if 'v' is std::vector)
- if size of v could potentially grow large I'd strongly suggest to use 'deque' (in case if 'v' is std::vector)

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 9:14 PM by Michael Pryhodko
And finally (I hope):
it is a good idea to allocate buffer from stack (with '_alloca()') -- this will give significant boost in speed depending on type of CRT you are linking.
To prvent stack overuse(and overflow), each 'while' iteration should be contained on separate stack frame -- for example containing it in a separate __fastcall __decspec(noinline) function. On modern PCs cost of function call is insignificant in comparison with lock/unlock in malloc (even if it is amortized by any mechanism).

Basically -- just use your mind, Luke... :)

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 11:22 PM by asdf
Microsoft's implementation of _alloca is ass-backwards as can be. They just reused (literally, same address, different symbol names) the code they call on functions with large stacks (which just touches the stack in 4k increments) to implement it. Instead of fixing _alloca, they added another function _resetstkoflw (that looks like it was written by a summer intern) which you have to call in a SEH block that tests for stack overflow.

It would be interesting to see how much of a further improvement this code gets just by using Doug Lea's malloc instead of using VC's allocator and maybe even add STLPort to the mix instead of VC's STL/string classes.

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 11:36 PM by Michael Pryhodko
1. you need to call '_resetstkoflw' ONLY if you wish to recover from stack overflow (which is nonsense by itself from C++'s point of view)
2. _alloca's implementation looks reasonable to me... i.e. this intrinsic is doing almost nothing, leaving stack-growth logic to underlying system.
3. In this specific example stack should not ever overflow, and there should be no need to call '_resetstkoflw', and using _alloca is ok
4. It is unconvenient to force system to increase stack size, because AFAIK all Windows systems never reclaims committed memory from stack while it's owner thread is still running.


> It would be interesting to see how much of a
> further improvement this code gets just by
> using Doug Lea's malloc instead of using VC's
> allocator and maybe even add STLPort to the
> mix instead of VC's STL/string classes.

It will be slower. _alloca() is intrinsic and should end up with 1-3 asm commands...

# re: Loading the dictionary, part 4: Character conversion redux

Monday, May 16, 2005 11:39 PM by Mihai
My guess is that in the end we will have something like this:

open MemoryMappedFile
Allocate wcharBuffer // big enough for full file
MultiByteToWideChar // one single conversion
parse the wcharBuffer and just remember
pointers inside the wcharBuffer, no allocation,
no nothing.

And I bet Raymond has it all written already, just trying to show us how it is done :-)

# re: Loading the dictionary, part 4: Character conversion redux

Tuesday, May 17, 2005 1:00 AM by LarryOsterman
Mihai,
Raymond had it written sometime last year (in September or October, IIRC).

He writes that far ahead.

# re: Loading the dictionary, part 4: Character conversion redux

Tuesday, May 17, 2005 1:54 AM by Michael Pryhodko
I wonder if '\n' could be a part of multibyte character? if yes -- this code is invalid.

AFAIK, Windows supports only DBCS (i.e. up to 2 bytes per character) -- who knows if it is possible for DBCS character to have '\n' in second byte?

# re: Loading the dictionary, part 4: Character conversion redux

Tuesday, May 17, 2005 7:49 AM by Matthew Douglass-Riley
In a production environment, wouldn't it be more efficient to write a converter to preprocess the file, and make the whole thing unicode to begin with?

# re: Loading the dictionary, part 4: Character conversion redux

Tuesday, May 17, 2005 8:42 AM by LarryOsterman
I'm not 100% sure (I don't have my DBCS manual here) but I believe that no characters < 0x1f are legal leading or trailing bytes in Big5

# re: Loading the dictionary, part 4: Character conversion redux

Tuesday, May 17, 2005 12:31 PM by Olaf Ian Davidson
From <http://www.jbrowse.com/text/charsets.html#Big5>

The term 'Big5' has been abused quite a lot over the years. The original Big5 character repertoire is no longer used and the name 'Big5' usually means one of the many extensions. The main extensions are Microsoft's CP950, Big5-ETen, and Big5-HKSCS.

Big5 is both a character set and an encoding. As an encoding, it is a DBCS encoding with lead bytes in the range 0xa1-0xf90 and trails 0x40-0xfe.

# re: Loading the dictionary, part 4: Character conversion redux

Tuesday, May 17, 2005 6:09 PM by Michael Pryhodko
> From <http://www.jbrowse.com/text/charsets.html#Big5>
> Big5 is both a character set and an encoding.
> As an encoding, it is a DBCS encoding with
> lead bytes in the range 0xa1-0xf90 and trails
> 0x40-0xfe.

Anyway, it worth pointing out that for example in 'CP949' '\n' could be a part of DBCS character:

[---- quote begin ----]
CP949 charset encoding

Also called: UHC, Unified Hangul Code
Languages: Korean

This is Microsoft's favored way of representing Korean. It is a derivative of EUC-KR, extended to include all johab precomposed hangul. Like other east asian Microsoft encodings, it allows ASCII trail bytes, and lead bytes in the control code range, thus losing a form of ASCII compatibility.
[---- quote end ----]

# re: Loading the dictionary, part 4: Character conversion redux

Tuesday, May 17, 2005 9:52 PM by A
> Anyway, it worth pointing out that for example in 'CP949' '\n' could be a part of DBCS character

I don't think your source is correct. According to http://www.microsoft.com/globaldev/reference/dbcs/949.mspx , lead bytes start at 0x81, and trail bytes (as far as I can tell) go no lower than 0x41.

http://lists.freebsd.org/pipermail/freebsd-bugs/2003-August/002657.html (random page found via Google) agrees:

+ * CP949 contains four character sets:
+ *
+ * <0x00-0x7f> ASCII equivalent
+ * <0xa1-0xfd><0xa1-0xfe> EUC instance of KS X 1001:2002
+ * <0x81-0xa0><0x41-0xfe> Hangul syllables GAGG-JWAK
+ * <0xa1-0xc6><0x41-0xa0> Hangul syllables JWAT-HIH (ends with C652)

# re: Loading the dictionary, part 4: Character conversion redux

Tuesday, May 17, 2005 10:25 PM by A
Ah ha... Here's a table from MSDN listing the lead and trail byte ranges for the various Far East code pages used by Windows:

http://web.archive.org/web/20000303195204/http://msdn.microsoft.com/library/books/devintl/S24B2_b2.HTM

(Link at archive.org because it seems it's no longer on microsoft.com anymore.)

# The Old New Thing : Very late remarks on the original Chinese dictionary series

# Performance Quiz #6 -- Looking at the fourth cut

Tuesday, January 23, 2007 9:30 PM by Rico Mariani's Performance Tidbits

Raymond is definately making some great headway. Having targetted the single-character at a time conversion

New Comments to this post are disabled
 
Page view tracker