STL Performance
Hello, I am Mohammad Usman, a Software Design Engineer in Test on the Visual C++ Libraries team. I recently joined the team. In VS 2008 SP1, we invested heavily in our Standard Template Library (STL) implementation by adding TR1 extensions such as shared_ptr, including performance optimizations for things like vectors of shared_ptrs. In this post, I will be talking about how we take this investment further by improving our STL performance in VS2010. We have done some re-writing of our libraries to make them more conformant to the C++0x standard, more flexible and extendible, and to improve performance in a few areas. For the first two items we already have a few blog posts and a Channel 9 video, this post will mainly cover the performance part.
Since we have re-written quite a bit of STL code in order to improve flexibility and increase conformance, it was important to make sure that this does not negatively affect the performance. We have taken special care of that. Most of our performance tests are to ensure this. And our tests are showing good results. In fact there are some areas where VS2010 is faster than VS2008. Some of those areas where we improved performance significantly are:
- Overall deque performance
o The improvement is evident when used with algorithms such as std::reverse(), std::remove(), std::sort() and std::unique().
- std::unique()
o An optimization change in the loops of the function has given a significant performance boost when used with different containers such as list, deque and vector.
- Vector reallocations
o Vector is one of the most commonly used containers because of its attributes such as contiguous memory, random access, fast traversals, and its ability to grow dynamically etc. But it does have some limitations. To maintain contiguous memory, it has to allocate memory in a chunk and when that chunk is full, it has to go through something called “reallocation”. This is a potentially costly process causing the vector to copy all the contents to a new location. Now the good news is that we have made very significant improvements there in VS2010 with the help of rvalue references. This will come into action when you use vectors with STL objects in VS2010 and you can also take advantage of this when using vectors of your own custom objects by writing simple move constructors, move assignment operators for your class. A significant performance gain will be evident here when used with large objects containing strings. This is explained in more detail below.
Vector Reallocation (UDTs - User Defined Types)
Introduction of rvalue references in STL is one of my favorite enhancements for VS2010. “Rvalue references” is quite a complicated topic. Stephan (our STL developer) recently wrote a very good comprehensive blog post explaining rvalue references, which can be found here. For the sake of this post, I will touch only the parts which impact the performance of STL.
By using rvalue references, we are able to move the contents of objects rather than copying and destroying the original where we don’t have to. We actually “steal” the resources by copying the pointers rather than the whole object when the object we are dealing with is an rvalue. This is most helpful when move constructors and move assignment operators can be used. You can use the move machinery to write your move constructors and move assignment operators. The following simple test shows that we can get an order of magnitude performance boost when we take advantage of rvalue references in large UDTs.
#include <iostream>
#include <ostream>
#include <ios>
#include <stddef.h>
#include <string>
#include <vector>
#include <windows.h>
#include <utility>
//***************************** START: <Helper Code> ***********************************
long long counter() {
LARGE_INTEGER li;
QueryPerformanceCounter(&li);
return li.QuadPart;
}
long long frequency() {
LARGE_INTEGER li;
QueryPerformanceFrequency(&li);
return li.QuadPart;
}
// to confine the test to run on a single processor in order to get consistent results for all tests.
void perf_startup() {
SetThreadAffinityMask(GetCurrentThread(), 1);
SetThreadIdealProcessor(GetCurrentThread(), 0);
Sleep(1);
}
//***************************** END: <Helper Code> ***********************************
// User Defined Type
class myUserObject
{
public:
std::string name;
std::string address;
std::string telephone;
std::string name2;
std::string address2;
std::string telephone2;
// default constructor
myUserObject()
{}
// Copy Constructor
myUserObject(const myUserObject& myObj)
{
name = myObj.name;
telephone = myObj.telephone;
address = myObj.address;
name2 = myObj.name2;
telephone2 = myObj.telephone2;
address2 = myObj.address2;
}
// copy assignment operator
myUserObject& operator=(const myUserObject& myObj)
{
name = myObj.name;
telephone = myObj.telephone;
address = myObj.address;
name2 = myObj.name2;
telephone2 = myObj.telephone2;
address2 = myObj.address2;
return *this;
}
// As VS2008 does not have rvalue reference support
#if _MSC_VER >= 1600
// move constructor
// This is where are getting our performance gain.
// The move() machinery is made available in <utility>
myUserObject(myUserObject&& myObj)
{
name = std::move(myObj.name);
telephone = std::move(myObj.telephone);
address = std::move(myObj.address);
name2 = std::move(myObj.name2);
telephone2 = std::move(myObj.telephone2);
address2 = std::move(myObj.address2);
}
// move assignment operator
myUserObject& operator=(myUserObject&& myObj)
{
name = std::move(myObj.name);
telephone = std::move(myObj.telephone);
address = std::move(myObj.address);
name2 = std::move(myObj.name2);
telephone2 = std::move(myObj.telephone2);
address2 = std::move(myObj.address2);
return *this;
}
#endif
};
int main()
{
perf_startup();
int total_copy_move_count = 0;
int total_vector_resizes = 0;
double total_pushback_time = 0.0;
std::vector<myUserObject> vec;
myUserObject obj;
obj.name = "Stephan T. Lavavej Stephan T. Lavavej Stephan T. Lavavej";
obj.telephone = "314159265 314159265 314159265 314159265 314159265";
obj.address = "127.0.0.0 127.0.0.0 127.0.0.0 127.0.0.0 127.0.0.0 127.0.0.0";
obj.name2 = "Mohammad Usman. Mohammad Usman. Mohammad Usman. ";
obj.telephone2 = "1234567890 1234567890 1234567890 1234567890 1234567890";
obj.address2 = "Republik Of mancunia. Republik Of mancunia Republik Of mancunia";
const long long start = counter();
for (int i = 0; i<1050000; ++i)
{
size_t capacity = vec.capacity();
size_t size = vec.size();
vec.push_back(obj);
if (capacity == size) // indication that vector reallocation has occurred.
{
++total_vector_resizes;
total_copy_move_count += size;
}
}
const long long finish = counter();
total_pushback_time = (finish-start)*1.0 / frequency();
// print result
std::cout << std::endl;
std::cout << std::fixed;
std::cout << "Total Pushback time: \t" << total_pushback_time << " sec" << std::endl;
std::cout << "Total number of vector reallocations: \t" << total_vector_resizes << std::endl;
std::cout << "Total number of objects copied/moved: \t" << total_copy_move_count << std::endl;
}
VS2008:
c:\TEMP\STLPerf>cl /EHsc /nologo /W4 /O2 /GL /D_SECURE_SCL=0 rvalue.cpp
rvalue.cpp
Generating code
Finished generating code
c:\TEMP\STLPerf> rvalue.exe
Total Pushback time: 8.309241 sec
Total number of vector reallocations: 36
Total number of objects copied/moved: 3149622
VS2010:
c:\TEMP\STLPerf>cl /EHsc /nologo /W4 /O2 /GL /D_SECURE_SCL=0 rvalue.cpp
rvalue.cpp
Generating code
Finished generating code
c:\TEMP\STLPerf> rvalue.exe
Total Pushback time: 2.335225 sec
Total number of vector reallocations: 36
Total number of objects copied/moved: 3149622
As you can see that in this specific scenario we are getting more than 3 times performance boost. This gain is due to the fact that the move machinery is used during vector reallocations, which is cheaper as compared to the copy machinery (copy construction and copy assignment). And the bigger your object is and the larger your vector is, the more performance gain you will get. So those of you who use vectors on a big scale, this is something you will surely love.
Vector Reallocation (STL Types)
Now the interesting part is that similar to the sample code above, we have implemented move semantics (move constructors and move assignment operators) for our STL types in VS2010. It is worth mentioning here that for some types, in order to improve performance especially in situations where the copy was taking a lot of time such as vector reallocation, we used a trick called “swaptimization” before VS2010. (see pt.#16 in this blog post) Now with the arrival of rvalue references we don’t need that anymore and we have replaced those swaptimization tricks with the rvalue reference machinery and, guess what, we are now even faster. The following example shows one such scenario where we used swaptimization earlier.
#include <iostream>
#include <ostream>
#include <ios>
#include <stddef.h>
#include <string>
#include <vector>
#include <windows.h>
//***************************** START: <Helper Code> ***********************************
long long counter() {
LARGE_INTEGER li;
QueryPerformanceCounter(&li);
return li.QuadPart;
}
long long frequency() {
LARGE_INTEGER li;
QueryPerformanceFrequency(&li);
return li.QuadPart;
}
// to confine the test to run on a single processor in order to get consistent results for all tests.
void perf_startup() {
SetThreadAffinityMask(GetCurrentThread(), 1);
SetThreadIdealProcessor(GetCurrentThread(), 0);
Sleep(1);
}
//***************************** END: <Helper Code> ***********************************
int main()
{
perf_startup();
long long total_copy_move_count = 0;
long long total_vector_resizes = 0;
double total_pushback_time = 0.0;
std::vector<std::string> v_str;
long long start = counter();
for (int i = 0; i<12000000; ++i)
{
size_t size = v_str.size();
size_t capacity = v_str.capacity();
v_str.push_back("I think, therefore I am");
if (size == capacity) // indication that vector reallocation has occurred.
{
++total_vector_resizes;
total_copy_move_count += size;
}
}
long long finish = counter();
total_pushback_time = ((finish - start)*1.0 / frequency());
std::cout << std::fixed;
std::cout << std::endl;
std::cout << "Total Pushback time: \t" << total_pushback_time << " sec" << std::endl;
std::cout << "Total number of vector reallocations: \t" << total_vector_resizes << std::endl;
std::cout << "Total number of objects copied/moved: \t" << total_copy_move_count << std::endl;
}
VS2008:
c:\TEMP\STLPerf>cl /EHsc /nologo /W4 /O2 /GL /D_SECURE_SCL=0 swaptimization.cpp
swaptimization.cpp
Generating code
Finished generating code
c:\TEMP\STLPerf> swaptimization.exe
Total Pushback time: 6.717094 sec
Total number of vector reallocations: 42
Total number of objects copied/moved: 35875989
VS2010:
c:\TEMP\STLPerf>cl /EHsc /nologo /W4 /O2 /GL /D_SECURE_SCL=0 swaptimization.cpp
swaptimization.cpp
Generating code
Finished generating code
c:\TEMP\STLPerf> swaptimization.exe
Total Pushback time: 3.780286 sec
Total number of vector reallocations: 42
Total number of objects copied/moved: 35875989
You can see that moving to rvalue references hasn’t impacted the performance. In fact, in this particular example, we are almost twice as fast as VS2008. The credit for this gain also goes to rvalue references, not specifically during vector reallocation, but during push_back() instead. As you can see that we are passing a literal string (an rvalue- see footnote) to the function. So the move machinery comes into play here again, giving us another performance boost.
This is all what I have to share in my blog. If you have any questions about performance of STL or in general about STL, please feel free to write to me at: Mohammad.Usman at microsoft.com.
Thanks!
-Usman
* String literals are actually lvalues (C++03 5.1/2), whereas all other literals are rvalues. However, since we’ve got a vector<string>, push_back() takes const string&. This constructs a temporary std::string, which is an rvalue. That’s where move semantics comes into play.