STL Performance

STL Performance

Rate This
  • Comments 24

Hello, I am Mohammad Usman, a Software Design Engineer in Test on the Visual C++ Libraries team. I recently joined the team. In VS 2008 SP1, we invested heavily in our Standard Template Library (STL) implementation by adding TR1 extensions such as shared_ptr, including performance optimizations for things like vectors of shared_ptrs.  In this post, I will be talking about how we take this investment further by improving our STL performance in VS2010. We have done some re-writing of our libraries to make them more conformant to the C++0x standard, more flexible and extendible, and to improve performance in a few areas. For the first two items we already have a few blog posts and a Channel 9 video, this post will mainly cover the performance part.

Since we have re-written quite a bit of STL code in order to improve flexibility and increase conformance, it was important to make sure that this does not negatively affect the performance. We have taken special care of that. Most of our performance tests are to ensure this. And our tests are showing good results. In fact there are some areas where VS2010 is faster than VS2008. Some of those areas where we improved performance significantly are:

-          Overall deque performance

o   The improvement is evident when used with algorithms such as std::reverse(), std::remove(), std::sort() and std::unique().

-          std::unique() 

o   An optimization change in the loops of the function has given a significant performance boost when used with different containers such as list, deque and vector.

-          Vector reallocations

o   Vector is one of the most commonly used containers because of its attributes such as contiguous memory, random access, fast traversals, and its ability to grow dynamically etc. But it does have some limitations. To maintain contiguous memory, it has to allocate memory in a chunk and when that chunk is full, it has to go through something called “reallocation”. This is a potentially costly process causing the vector to copy all the contents to a new location. Now the good news is that we have made very significant improvements there in VS2010 with the help of rvalue references.  This will come into action when you use vectors with STL objects in VS2010 and you can also take advantage of this when using vectors of your own custom objects by writing simple move constructors, move assignment operators for your class. A significant performance gain will be evident here when used with large objects containing strings. This is explained in more detail below.

Vector Reallocation (UDTs - User Defined Types)

Introduction of rvalue references in STL is one of my favorite enhancements for VS2010. “Rvalue references” is quite a complicated topic. Stephan (our STL developer) recently wrote a very good comprehensive blog post explaining rvalue references, which can be found here. For the sake of this post, I will touch only the parts which impact the performance of STL.

By using rvalue references, we are able to move the contents of objects rather than copying and destroying the original where we don’t have to. We actually “steal” the resources by copying the pointers rather than the whole object when the object we are dealing with is an rvalue. This is most helpful when move constructors and move assignment operators can be used. You can use the move machinery to write your move constructors and move assignment operators. The following simple test shows that we can get an order of magnitude performance boost when we take advantage of rvalue references in large UDTs.

 

#include <iostream>

#include <ostream>

#include <ios>

#include <stddef.h>

#include <string>

#include <vector>

#include <windows.h>

#include <utility>

 

 

//***************************** START: <Helper Code> ***********************************

 

long long counter() {

    LARGE_INTEGER li;

    QueryPerformanceCounter(&li);

    return li.QuadPart;

}

 

long long frequency() {

    LARGE_INTEGER li;

    QueryPerformanceFrequency(&li);

    return li.QuadPart;

}

 

// to confine the test to run on a single processor in order to get consistent results for all tests.

void perf_startup() {

    SetThreadAffinityMask(GetCurrentThread(), 1);

    SetThreadIdealProcessor(GetCurrentThread(), 0);

    Sleep(1);

}

 

 

//***************************** END: <Helper Code> ***********************************

 

// User Defined Type

class myUserObject

{

public:

       std::string name;

       std::string address;

       std::string telephone;

       std::string name2;

       std::string address2;

       std::string telephone2;

 

       // default constructor

       myUserObject()

       {}

      

       // Copy Constructor

       myUserObject(const myUserObject& myObj)

       {

              name = myObj.name;

              telephone = myObj.telephone;

              address = myObj.address;

              name2 = myObj.name2;

              telephone2 = myObj.telephone2;

              address2 = myObj.address2;

       }

 

// copy assignment operator

       myUserObject& operator=(const myUserObject& myObj)

       {

              name = myObj.name;

              telephone = myObj.telephone;

              address = myObj.address;

              name2 = myObj.name2;

              telephone2 = myObj.telephone2;

              address2 = myObj.address2;

             

              return *this;

       }

      

       // As VS2008 does not have rvalue reference support

       #if _MSC_VER >= 1600

      

       // move constructor

// This is where are getting our performance gain.

// The move() machinery is made available in <utility>

       myUserObject(myUserObject&& myObj)

       {

              name = std::move(myObj.name);

              telephone = std::move(myObj.telephone);

              address = std::move(myObj.address);

              name2 = std::move(myObj.name2);

              telephone2 = std::move(myObj.telephone2);

              address2 = std::move(myObj.address2);          

       }

      

       // move assignment operator

       myUserObject& operator=(myUserObject&& myObj)

       {

              name = std::move(myObj.name);

              telephone = std::move(myObj.telephone);

              address = std::move(myObj.address);

              name2 = std::move(myObj.name2);

              telephone2 = std::move(myObj.telephone2);

              address2 = std::move(myObj.address2);

             

              return *this;

       }

 

       #endif

      

};

      

 

int main()

{

       perf_startup();

 

       int total_copy_move_count = 0;

int total_vector_resizes = 0;

double total_pushback_time = 0.0;

 

       std::vector<myUserObject> vec;

       myUserObject obj;

       obj.name = "Stephan T. Lavavej Stephan T. Lavavej Stephan T. Lavavej";

obj.telephone = "314159265 314159265 314159265 314159265 314159265";

       obj.address = "127.0.0.0 127.0.0.0 127.0.0.0 127.0.0.0 127.0.0.0 127.0.0.0";

obj.name2 = "Mohammad Usman. Mohammad Usman. Mohammad Usman. ";

       obj.telephone2 = "1234567890 1234567890 1234567890 1234567890 1234567890";

       obj.address2 = "Republik Of mancunia. Republik Of mancunia Republik Of mancunia";

 

       const long long start = counter();

       for (int i = 0; i<1050000; ++i)

       {            

      

              size_t capacity = vec.capacity();

              size_t size = vec.size();

             

              vec.push_back(obj);

 

              if (capacity == size) // indication that vector reallocation has occurred.

              {

                     ++total_vector_resizes;

                     total_copy_move_count += size;   

              }

       }

       const long long finish = counter();

       total_pushback_time = (finish-start)*1.0 / frequency();

 

       // print result

std::cout << std::endl;

       std::cout << std::fixed;

       std::cout << "Total Pushback time: \t" << total_pushback_time << " sec" << std::endl;

       std::cout << "Total number of vector reallocations: \t" << total_vector_resizes << std::endl;

       std::cout << "Total number of objects copied/moved: \t" << total_copy_move_count << std::endl;

}

 

 

VS2008:

c:\TEMP\STLPerf>cl /EHsc /nologo /W4 /O2 /GL /D_SECURE_SCL=0 rvalue.cpp

rvalue.cpp

Generating code

Finished generating code

 

c:\TEMP\STLPerf> rvalue.exe

 

Total Pushback time:    8.309241 sec

Total number of vector reallocations:   36

Total number of objects copied/moved:   3149622

 

VS2010:

c:\TEMP\STLPerf>cl /EHsc /nologo /W4 /O2 /GL /D_SECURE_SCL=0 rvalue.cpp

rvalue.cpp

Generating code

Finished generating code

 

c:\TEMP\STLPerf> rvalue.exe

 

Total Pushback time:    2.335225 sec
Total number of vector reallocations:   36
Total number of objects copied/moved:   3149622

As you can see that in this specific scenario we are getting more than 3 times performance boost. This gain is due to the fact that the move machinery is used during vector reallocations, which is cheaper as compared to the copy machinery (copy construction and copy assignment). And the bigger your object is and the larger your vector is, the more performance gain you will get. So those of you who use vectors on a big scale, this is something you will surely love.

Vector Reallocation (STL Types)

Now the interesting part is that similar to the sample code above, we have implemented move semantics (move constructors and move assignment operators) for our STL types in VS2010. It is worth mentioning here that for some types, in order to improve performance especially in situations where the copy was taking a lot of time such as vector reallocation, we used a trick called “swaptimization” before VS2010. (see pt.#16 in this blog post) Now with the arrival of rvalue references we don’t need that anymore and we have replaced those swaptimization tricks with the rvalue reference machinery and, guess what, we are now even faster. The following example shows one such scenario where we used swaptimization earlier.

#include <iostream>

#include <ostream>

#include <ios>

#include <stddef.h>

#include <string>

#include <vector>

#include <windows.h>

 

 

//***************************** START: <Helper Code> ***********************************

 

long long counter() {

    LARGE_INTEGER li;

    QueryPerformanceCounter(&li);

    return li.QuadPart;

}

 

long long frequency() {

    LARGE_INTEGER li;

    QueryPerformanceFrequency(&li);

    return li.QuadPart;

}

 

// to confine the test to run on a single processor in order to get consistent results for all tests.

void perf_startup() {

    SetThreadAffinityMask(GetCurrentThread(), 1);

    SetThreadIdealProcessor(GetCurrentThread(), 0);

    Sleep(1);

}

 

//***************************** END: <Helper Code> ***********************************

 

int main()

{

       perf_startup();

 

       long long total_copy_move_count = 0;

long long total_vector_resizes = 0;

double total_pushback_time = 0.0;

 

       std::vector<std::string> v_str;

 

       long long start = counter();

       for (int i = 0; i<12000000; ++i)

       {

              size_t size = v_str.size();

              size_t capacity = v_str.capacity();

              v_str.push_back("I think, therefore I am");

             

              if (size == capacity) // indication that vector reallocation has occurred.

              {

                     ++total_vector_resizes;

                     total_copy_move_count += size;          

              }     

       }     

       long long finish = counter();

       total_pushback_time = ((finish - start)*1.0 / frequency());

      

       std::cout << std::fixed;

       std::cout << std::endl;

       std::cout << "Total Pushback time: \t" << total_pushback_time << " sec" << std::endl;

       std::cout << "Total number of vector reallocations: \t" << total_vector_resizes << std::endl;

       std::cout << "Total number of objects copied/moved: \t" << total_copy_move_count << std::endl;

 

}

 

 

VS2008:

c:\TEMP\STLPerf>cl /EHsc /nologo /W4 /O2 /GL /D_SECURE_SCL=0 swaptimization.cpp

swaptimization.cpp

Generating code

Finished generating code

 

c:\TEMP\STLPerf> swaptimization.exe

 

Total Pushback time:    6.717094 sec

Total number of vector reallocations:   42

Total number of objects copied/moved:   35875989

 

VS2010:

c:\TEMP\STLPerf>cl /EHsc /nologo /W4 /O2 /GL /D_SECURE_SCL=0 swaptimization.cpp

swaptimization.cpp

Generating code

Finished generating code

 

c:\TEMP\STLPerf> swaptimization.exe

 

Total Pushback time:    3.780286 sec

Total number of vector reallocations:   42

Total number of objects copied/moved:   35875989

 

You can see that moving to rvalue references hasn’t impacted the performance. In fact, in this particular example, we are almost twice as fast as VS2008. The credit for this gain also goes to rvalue references, not specifically during vector reallocation, but during push_back() instead. As you can see that we are passing a literal string (an rvalue- see footnote) to the function. So the move machinery comes into play here again, giving us another performance boost.   

This is all what I have to share in my blog. If you have any questions about performance of STL or in general about STL, please feel free to write to me at: Mohammad.Usman at microsoft.com.

Thanks!

 

-Usman

 

* String literals are actually lvalues (C++03 5.1/2), whereas all other literals are rvalues.  However, since we’ve got a vector<string>, push_back() takes const string&.  This constructs a temporary std::string, which is an rvalue.  That’s where move semantics comes into play.

 

  • Moving constructors are really really nice! It's a really useful "performance" feature of C++0x!

    I can't wait for the RTM :)

  • First off, good work! It really is good to see this type of effort being done!

    But, is any energy going into optimizing, extending and improving MFC/ATL data structures?  Or do you recommend legacy code be rewritten into STL?

  • I think the constructors and assignment operators should be written using intitialization lists not assignments. This is a common idiom found in many places.

  • It's great to see some test results confirming that rvalue refs do indeed rock!

    I'm looking forward to simplified code where I can stop using containers of pointers and use containers of objects, safe in the knowledge that there won't be a big performance hit anymore when objects with large/complex members are moved around in the container.

  • I have run a few tests by myself and was also pleased with a performance of new STL with r-value references.

    One thing I wonder is why std::tr1::function does not have move constructor nor move assignment operator? And moreover, there is a version of swap that accepts r-value references, which tells us the class was being ported to C++0x.

    ...

    void swap(_Myt _REFREF _Right)

    { // swap with _Right

    this->_Swap(_Right);

    }

    ...

    Is it just because of beta and you are going to add it later? (I hope so).

  • What about the locking issues in the 2008 STL implementation?  I have an application where the single greatest performance bottleneck is the thread locking in the STL.

    This should never be the case since container locking has to be done in client code. I don't know what shared global structures may be in the MS STL implementation but it is certainly possible (see the SGI STL implementation) to implement the STL without the need for locking.

    Any other performance improvements you may implement are completely worthless until this bottleneck is removed.

  • [longtime mfc/c++ dev]

    > But, is any energy going into optimizing, extending and improving MFC/ATL data structures?

    Move semantics have been given to a few classes (e.g. CComPtr). However, ATL/MFC isn't undergoing the complete overhaul that the STL is undergoing.

    > Or do you recommend legacy code be rewritten into STL?

    The STL tries to provide highly flexible, high performance algorithms and data structures. Ideally, no one should be able to beat the STL at its own game. Therefore, in the vast majority of new code, you should use the STL, except when you have highly specific requirements for behavior or performance that require custom algorithms and data structures.

    However, that doesn't mean that you should go rewrite all of your old code. You probably have better things to do.

    [hoang vu]

    > I think the constructors and assignment operators should be written using intitialization lists not assignments. This is a common idiom found in many places.

    Ctors should indeed use init lists whenever possible. (Assignment operators can't use init lists.)

    [Leo Davidson]

    > I'm looking forward to simplified code where I can stop using containers of pointers and use containers of objects

    In VC9 SP1, you can use vector<shared_ptr<T>>.

    Please note that vector<T *> should not be used to hold new Ts - that's leaktrocity.

    [Alex]

    > One thing I wonder is why std::tr1::function does not have move constructor nor move assignment operator?

    I'll file a bug. I recently noticed that regex is missing move semantics too.

    > Is it just because of beta and you are going to add it later? (I hope so).

    A simple oversight.

    [Ted Jones]

    > What about the locking issues in the 2008 STL implementation?  I have an application where the single greatest performance bottleneck is the thread locking in the STL.

    In VC9 (aka VS 2008), the STL proper doesn't take locks in release mode. To clarify:

    "In VC9" - we recently noticed that in VC10 Beta 1, the STL proper is taking locks in release mode. This was an unintended side effect of fusing _SECURE_SCL and _HAS_ITERATOR_DEBUGGING into _ITERATOR_DEBUG_LEVEL, and we've got a bug tracking this.

    "the STL proper" - as in, things like vectors and algorithms, not iostreams. iostreams takes locks in order to allow cout to be used by multiple threads (since streaming through cout is a modifying operation).

    "in release mode" - as in, /MT or /MD (with _SECURE_SCL, now _ITERATOR_DEBUG_LEVEL, set to either 0 or 1).  In debug mode (/MTd or /MDd) with _HAS_ITERATOR_DEBUGGING enabled (now, _ITERATOR_DEBUG_LEVEL set to 2), locks are used to protect the bookkeeping data structures for the powerful correctness checks. If you're profiling debug mode, don't do that.

    Also, this excludes whatever happens during memory allocation.

  • Do you plan to support unique_ptr in this release?  vector<unique_ptr<T>> is a very interesting data structure.

  • Yes, unique_ptr is in VC10 Beta 1. Try it out!

  • > In VC9 SP1, you can use vector<shared_ptr<T>>.

    Good point! I've still not adopted some of the useful things in TR1/boost, I have to admit.

    I also have to admit to (often, not always) avoiding smartpointers in C++ as I can't remember how all the slightly different ones -- not just in STL -- behave and find I'm not confident that my code is correct and have to keep looking things up.

    That's mostly just me being silly and stuck in my ways, though. If I used the things more I'd remember how the ones I actually used behaved.

    > Please note that vector<T *> should not be used to hold new Ts - that's leaktrocity.

    I think there are/were exceptions, e.g. when the memory overhead of a reference count per object is significant and the vector is a private member of a class which accesses it via methods that take care of the new/delete (inc. the class's destructor, of course).

    Obviously that increases the risk of some code forgetting to delete (or copying a pointer and deleting it twice), so it should only be used where the trade-off is right.

    That's where the rvalue stuff seems great to me. It simplifies the code without adding any memory overhead (and I don't have to remember the semantics of when pointers do and don't become "owned" by a particular smartpointer implementation).

    Maybe there's already (in VC9) a better way to do this which I've not noticed, though. (In VC10, or with Boost, it looks like unique_ptr does the job, presumably at the memory cost of an extra pointer per object.)

    The rvalue stuff has trade-offs of its own, I realise. For an object where the shallow copy might be costly, and where the smart-pointer memory overhead is unimportant, it'll still be better to use a smart-pointer.

    The way I see it, though, using rvalue references by default seems like a good rule of thumb in VC10. I've not actually used VC10 yet, though, and I might be wrong. :)

    Thanks once again, Stephan, for your informative responses to all the questions we ask.

  • So, any idea when we could use all these goods in RTM ? :)

  • I'm glad to hear that the thread locking is gone from release mode builds.  I am still a bit concerned about the approach taken for SECURE_SCL and iterator debugging.  In a large complex program, ensuring that every project and all third party dependencies use the exact same settings can be a real nightmare.  Is the default in release mode now to turn SECURE_SCL off?

  • [Leo Davidson]

    > I also have to admit to (often, not always) avoiding smartpointers in C++

    > as I can't remember how all the slightly different ones -- not just in

    > STL -- behave and find I'm not confident that my code is correct and

    > have to keep looking things up.

    That's the nice thing about shared_ptr - it works very "smoothly". This feeling is difficult to put into words, but basically shared_ptr has been carefully designed to have a simple interface that doesn't contain complicated options or nasty surprises. Because shared_ptr's behavior is not customizable (with one exception), whenever you see a shared_ptr, you know how it's going to behave.

    For example, you can start with a shared_ptr<Derived> and implicitly convert it to a shared_ptr<Base>. (This is part of shared_ptr's interface, something that dumber smart pointers often lack. It isn't built into the language, unlike the conversion from Derived * to Base *.) The shared_ptr<Derived> and shared_ptr<Base> both hold strong refs to the object. If the shared_ptr<Base> (or a copy of it) is the last survivor, the Derived object will be deleted normally. (In fact, shared_ptr is so smart, this happens even if Base and Derived don't have virtual destructors - but they almost always do.)

    The exception to shared_ptr's customizability is custom deleters/allocators - but it's important to notice that they don't affect the shared_ptr's type (shared_ptr is templated on T alone). So after you've constructed a shared_ptr, potentially with a custom deleter/allocator, you can forget about it - it doesn't affect how the shared_ptr behaves at all.

    > (In VC10, or with Boost, it looks like unique_ptr does the job, presumably at the memory cost of an extra pointer per object.)

    unique_ptr stores one raw pointer and has no other overheads.

    [RTM]

    > So, any idea when we could use all these goods in RTM ? :)

    Magic 8 Ball says: Yes.

    Magic 8 Ball says: Better not tell you now.

    [Ted Jones]

    > I'm glad to hear that the thread locking is gone from release mode builds.

    Well, it was never there.

    > I am still a bit concerned about the approach taken for SECURE_SCL and iterator debugging.

    > In a large complex program, ensuring that every project and all third party

    > dependencies use the exact same settings can be a real nightmare.

    Yes, this is painful. (Build systems are nightmares - they don't have to be, if you ask me - but that's how it is.)

    Fortunately, VC10 after Beta 1 will contain "#pragma detect_mismatch", which will detect _ITERATOR_DEBUG_LEVEL mismatch deterministically at link time, instead of crashing mysteriously at run time.

    > Is the default in release mode now to turn SECURE_SCL off?

    In VC10 after Beta 1, _ITERATOR_DEBUG_LEVEL in release mode will default to 0. And there was much rejoicing.

  • Stephan T. Lavavej [MSFT] said:

    "Also, this excludes whatever happens during memory allocation."

    I immediately assumed he was talking about memory allocation, and if so, I fully agree that allocator implementation should be improved. I make heavy use of TBB's allocators, and the difference can be staggering. Is there a particular reason VC++'s default allocators can't be lock-free? Is atomicity that hard to achieve on x86 at this point?

  • [Adam Merz]

    > Is there a particular reason VC++'s default

    > allocators can't be lock-free?

    In VC, new wraps malloc() wraps HeapAlloc(). That is, we do whatever Windows does.

Page 1 of 2 (24 items) 12