Yesterday Raymond posted his initial version and I posted Performance Quiz #6. Today he posts the first optimization. Quoting him:
Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in getline. Clearly this is where the focus needs to be. Everything else is just noise. Digging a little deeper, it turns out that 29% of the CPU time was spent by getline doing character set conversion in codecvt::do_in. Some debugging revealed that codecvt::do_in was being called millions of times, each time converting just one or two characters. In fact, for each character in the file, codecvt::do_in was called once and sometimes twice!
Upon profiling our dictionary-loader, I discovered that 80% of the CPU time was spent in getline. Clearly this is where the focus needs to be. Everything else is just noise.
getline
Digging a little deeper, it turns out that 29% of the CPU time was spent by getline doing character set conversion in codecvt::do_in. Some debugging revealed that codecvt::do_in was being called millions of times, each time converting just one or two characters. In fact, for each character in the file, codecvt::do_in was called once and sometimes twice!
codecvt::do_in
And supporting this, my own measurements indicate this:
Exclusive time indicates the amount of time in the function itself. Inclusive indicates the amount in the function plus the children of that function.
These measurements were taken with the Visual Studio Enterprise Performance Tool in sampling mode (one sample every 1,000,000 cycles).
Interestingly if you look at where the Exclusive time is (i.e the specific functions taking up the time in the tree) you see something surprising.
Look at that! The top 4 functions, 39.24% of the time, and I can add those because they are all leaf functions (Exclusive = Inclusive), have to do with synchronization operations of one kind or another. Even the interlocked operations have some sort of synchronization intent because of course if they weren't trying to synchronize across threads they wouldn't need to be interlocked. And the best part is this app is single threaded! Yumm :)
MultiByteToWideChar which is arguably part of What We Actually Have To Do (TM) is a mere 9.961% inclusive. Not good.
Naturally the bulk of this is in the line reading as shown by the inclusive time.
But my question remains unanswered: How do you think the managed system will fare?