Q&A on our TR1 implementation


Hello.  My name is Stephan and I’m a developer on the Visual C++ libraries team.  As the Visual Studio 2008 Feature Pack Beta (available for download here with documentation available here) contains an implementation of TR1, I thought I’d answer some common questions about this technology.

 

Q. What version of Visual C++ does the Feature Pack work against?

 

A. The Feature Pack is a patch for the RTM version of Visual C++ 2008 (also known as VC9).  The patch can't be applied to:

 

    * VC9 Express.

    * Pre-RTM versions of VC9 (e.g. VC9 Beta 2).

    * Older versions of VC (e.g. VC8).

 

Q: Can I just drop new headers into VC\include instead of applying the patch?

 

A: No. VC9 TR1 consists of new headers (e.g. <regex>), modifications to existing headers (e.g. <memory>), and separately compiled components added to the CRT (e.g. msvcp90.dll). Because it is mostly but not completely header-only, you must apply the patch and distribute an updated CRT with your application. You can think of VC9 TR1 as "Service Pack 0".

 

Q: If I use TR1, will I gain a dependency on MFC? Or, if I use the MFC updates, will I gain a dependency on TR1?

 

A: No. The TR1 and MFC updates don't interact. They're just being distributed together for simplicity.

 

Q: How complete is your TR1 implementation?

 

A: Our implementation contains everything in TR1 except sections 5.2 (the special math functions) and 8 (C99 compatibility); see http://open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1836.pdf . That is, we are shipping the Boost-derived components and the unordered containers.
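To make that concrete, here is a minimal sketch of the shipped components in use. In VC9 TR1 these names live in namespace std::tr1 (e.g. std::tr1::shared_ptr); the sketch uses the later C++11 std:: spellings so it compiles on a modern toolchain.

```cpp
#include <memory>
#include <regex>
#include <string>
#include <unordered_set>

// tr1::regex (shown as std::regex): match a dotted version string like "9.0".
inline bool looks_like_version(const std::string& s) {
    return std::regex_match(s, std::regex("\\d+\\.\\d+"));
}

// tr1::shared_ptr (shown as std::shared_ptr): reference-counted ownership.
inline std::shared_ptr<int> make_counter() {
    return std::shared_ptr<int>(new int(0));
}

// tr1::unordered_set (shown as std::unordered_set): hash-based uniqueness.
inline std::size_t unique_count(const int* first, const int* last) {
    return std::unordered_set<int>(first, last).size();
}
```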

 

Q: Because TR1 modifies existing headers, can it negatively affect me when I'm not using it?

 

A: It shouldn't (if it does, that's a bug, and we really want to hear about it). You shouldn't see any broken behavior at runtime, nor any new compiler warnings or errors (at least under /W4, the highest level that we aim to keep the VC Libraries clean under). Of course, TR1 can slow down your build slightly (more code means more compilation time), and if you are very close to the /Zm limit for PCHs, the additional code can put you over the limit. You can define _HAS_TR1 to 0 project-wide in order to get rid of the new code.

 

Q: Does this patch affect the compiler or IDE?

 

A: No.

 

Q: Did you license Dinkumware's TR1 implementation?

 

A: Yes (see http://dinkumware.com/tr1.aspx ), just as we licensed their Standard Library implementation. So, you can expect the usual high level of quality.

 

Q: So, what has MS been doing?

 

A: Roughly speaking, additional testing and "MS-specific" stuff.

 

1. We've integrated TR1 into VC9, so TR1 lives right next to the STL. (This involved unglamorous work with our build system, to get the TR1 separately compiled components picked up into msvcp90.dll and friends, and with our setup technologies, to get the new headers and sources picked up into the Visual Studio installer.) As a result, users don't have to modify their include paths, or distribute new DLLs with their applications - they just need to apply the patch and update their CRT.

 

2. We've made TR1 play nice with /clr and /clr:pure (which are outside the domain of Standard C++, but which we must support, of course). At first, these switches caused all sorts of compiler errors and warnings. For example, even something as simple as calling a varargs function internally triggered a "native code generation" warning. These errors and warnings took a long time to iron out.

 

3. We're ensuring that TR1 compiles warning-free at /W4, in all supported scenarios. This includes switches like /Za, /Gz, and the like.

 

4. We're ensuring that TR1 is /analyze-clean.

 

As usual, we've preferred real fixes to workarounds to disabling warnings (and when we disable warnings, we do so only in the headers, not affecting user code).

 

5. We're identifying bugs in TR1 and working with Dinkumware to fix them. Dinkumware's code was very solid to begin with - but as ever, more eyes find more bugs. I've even found a couple of bugs in the TR1 spec itself (see http://open-std.org/JTC1/sc22/WG21/docs/lwg-active.html#726 and Issue 727 below).

 

6. We're striving for performance parity with Boost (which serves as a convenient reference; we could compare against GCC's TR1 implementation, but then we'd have to deal with the difference in compilers). In some areas, we won't get there for VC9 TR1 (hopefully, we should for VC10), but we've already made good progress. Thanks to MS's performance testing (which Rob Huyett has been in charge of), we identified a performance problem in regex matching, which Dinkumware has sped up by 4-5x. (Note that this fix didn't make it into the beta, which is roughly 18x slower at matching; current builds are roughly 3.8x slower.) And we've achieved performance parity for function (again, this fix didn't make it into the beta).

 

("Okay," you say, "but will regex::optimize make it faster?" Unfortunately, no. The NFA => DFA transformation suggested by regex::optimize will not be implemented in VC9 TR1, but we will consider it for VC10. In my one cursory test, regex::optimize did nothing with Boost 1.34.1.)
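For reference, regex::optimize is only a hint flag passed at construction time: an implementation may spend more time building the regex in exchange for faster matching, or may legally ignore the flag entirely. A minimal sketch, again using the C++11 std:: spelling rather than std::tr1:::

```cpp
#include <regex>
#include <string>

// regex::optimize is a hint, not a guarantee: construction may get slower,
// matching may get faster, and an implementation may ignore it entirely.
inline bool match_hex(const std::string& s) {
    static const std::regex re("0x[0-9a-fA-F]+",
                               std::regex::ECMAScript | std::regex::optimize);
    return std::regex_match(s, re);
}
```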

 

7. We're identifying select C++0x features to backport into TR1 - for example, allocator support for shared_ptr and function. While not in TR1, this is important to many customers (including our own compiler). This just got checked in, and isn't in the beta.
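The allocator support described above corresponds to what later became std::allocate_shared. A sketch using the C++11 spelling (the helper name make_with_allocator is illustrative, not part of any library):

```cpp
#include <memory>

// Allocator-aware shared_ptr construction: both the control block and the
// object are obtained from the supplied allocator, in a single allocation.
template <typename T>
std::shared_ptr<T> make_with_allocator(const T& value) {
    return std::allocate_shared<T>(std::allocator<T>(), value);
}
```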

 

8. We're implementing IDE debugger visualizers for TR1 types. Like the STL (more so, in some cases), the representations of TR1 types are complicated, so visualizers really help with debugging. I've written visualizers for almost every TR1 type (I am secretly proud of how shared_ptr's visualizer switches between "1 strong ref" and "2 strong refs"). Note that the beta doesn't include any TR1 visualizers.

 

9. We've worked with Dinkumware to fix a small number of bugs present in VC8 SP1 and VC9 RTM ("because we were in the neighborhood"). One which was actually related to TR1 was that stdext::hash_set/etc. had an O(N), throwing, iterator-invalidating swap() (discovered because unordered_set/etc. shares much of its implementation). This has been fixed to be O(1), nofail, non-iterator-invalidating.
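The fixed guarantee can be observed directly: after an O(1), non-invalidating swap(), an iterator obtained before the swap still refers to its element, which now belongs to the other container. A sketch using std::unordered_set (the C++11 spelling):

```cpp
#include <unordered_set>

// After the fix, swap() is O(1), nofail, and non-iterator-invalidating:
// 'it' remains valid across the swap and follows its element into 'b'.
inline bool swap_preserves_iterators() {
    std::unordered_set<int> a = {1, 2, 3};
    std::unordered_set<int> b = {10};
    std::unordered_set<int>::iterator it = a.find(2);
    a.swap(b);
    return *it == 2 && b.count(2) == 1 && a.size() == 1;
}
```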

 

10. Because TR1 lives alongside the STL, we've made them talk to each other in order to improve performance. For example, STL containers of TR1 types (e.g. vector<shared_ptr<T> >, vector<unordered_set<T> >) will avoid copying their elements, just as STL containers of STL containers in VC8 and VC9 avoid copying their elements.

 

This is a little-known feature of the VC8 STL; it's there in the source for everyone to see, except that almost no one reads the Standard Library implementation (nor should they have a reason to).  Basically, this is a library implementation of C++0x "move semantics", although it's naturally much more limited than language support will be.  In VC8, template magic is used to annotate containers (vector, deque, list, etc.) as having O(1) swaps, so containers-of-containers will swap them instead of making new copies and destroying the originals.  (For builtin types, swapping would be less efficient.)

 

We've simply extended this machinery to the new types in TR1. Everything in TR1 with custom swap() implementations will benefit: shared_ptr/weak_ptr (avoiding reference count twiddling), function (avoiding dynamic memory allocation/deallocation), regex (avoiding copying entire finite state machines), match_results (avoiding copying vectors), unordered_set/etc. (avoiding copying their guts), and array (which doesn't have an O(1) swap(), but does call swap_ranges() - so arrays of things with O(1) swaps will benefit).

 

That is to say: if a vector<shared_ptr<T> > undergoes reallocation, the reference counts won't be incremented and decremented. That's really neat, if you ask me.
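The effect is observable with a probe type that counts copies. C++11 expresses the same idea with noexcept move constructors (VC8/VC9 achieved it via swap annotations instead); either way, growth should produce zero element copies. The Probe type below is purely illustrative:

```cpp
#include <vector>

// A probe type that counts copies; with move (or O(1)-swap) machinery,
// vector reallocation transfers elements without copying them.
struct Probe {
    static int copies;
    Probe() {}
    Probe(const Probe&) { ++copies; }
    Probe(Probe&&) noexcept {}                 // C++11 move; VC9 used swap tricks
    Probe& operator=(const Probe&) { ++copies; return *this; }
    Probe& operator=(Probe&&) noexcept { return *this; }
};
int Probe::copies = 0;

inline int copies_during_growth(int n) {
    Probe::copies = 0;
    std::vector<Probe> v;
    for (int i = 0; i < n; ++i)
        v.push_back(Probe());  // temporaries are moved in; reallocations move too
    return Probe::copies;
}
```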

 

If you have any questions about or issues with the Feature Pack Beta, let us know!

 

Thanks,

 

Stephan T. Lavavej, Visual C++ Libraries Developer


  • Great stuff!  Thanks for the detailed rundown.  Looking forward to using it!

  • Looks nice, but where do we submit TR1 bugs?

  • If TR1 is just a bunch of headers and maybe some link libraries, and if there are no compiler changes, then why would TR1 not run on VCExpress? With a little bit of manual work, I imagine that it would not be too difficult to get up and running. Is this merely a business decision, or is there actually a technical reason for this restriction?

Why, you may ask?

I have access to the full version, but I honestly prefer to work in Express. Express removes so much of the unneeded bloat and is faster, and as a result I'm more productive. It's similar to the days of VC6.

  • [Cory Nelson]

    > Looks nice, but where do we submit TR1 bugs?

    Microsoft Connect, or directly to me at stl@microsoft.com .

    Stephan T. Lavavej

    Visual C++ Libraries Developer

  • Stephan T. Lavavej,

    Pretty cool initials you have there. Seems ideal for someone who loves the STL so much.

  • James,

That's a really great idea: use VCExpress to avoid the bloated VS.

    Stephan T. Lavavej,

    Is there an official method for paying customers to use (leverage) VCExpress with MFC and TR1. I also find VS has become very bloated since VC6.  Overall, I think the VC++ team is doing a good job, but much of Visual Studio does not appeal to me.

  • Speaking of STL optimizations: is there any specific reason (aside from "we haven't had time to do that") why std::vector doesn't use __is_pod and friends to switch to a memset/memcpy-using implementation where possible? In my tests, this gives a noticeable speed increase for small elements - it's more than 2x for chars, and 20% for shorts.

  • Dinkumware has implemented nearly everything in sections 5.2 and 8.  Since MS is licensing Dinkumware's implementation, is there a good reason VC9 TR1 does not include sections 5.2 and 8?

  • [Anon.]

    > Pretty cool initials you have there. Seems ideal for someone who loves the STL so much.

    Yes (and my initials are so much more convenient to type)!

    [Jared]

    > Is there an official method for paying customers to use (leverage) VCExpress with MFC and TR1.

    No, this is not supported.

    [Pavel Minaev]

    > Speaking of STL optimizations: is there any specific reason

    > (aside from the "we haven't time to do that") why std::vector

    > doesn't use __is_pod and friends to switch to a

    > memset/memcpy-using implementation where possible?

    I thought it did - vector<T>::push_back() eventually calls vector<T>::_Insert_n(), which (should) call _Umove() to move elements from the old memory block to the new memory block. (As I mentioned in http://blogs.msdn.com/vcblog/archive/2008/01/07/mfc-beta-now-available.aspx#7032589 , this is broken in the VC9 TR1 Beta.) That eventually calls _Uninit_move(), which - if T is something like char which isn't annotated with _Swap_move_tag - calls unchecked_uninitialized_copy() ("move defaults to copy if there is not a more effecient way"). That eventually calls _Uninit_copy(), of which there are two implementations. The first is for _Nonscalar_ptr_iterator_tag, and the second is for _Scalar_ptr_iterator_tag. The latter calls _CRT_SECURE_MEMMOVE().

    Did you define __STDC_WANT_SECURE_LIB__ to 0 when doing your performance comparisons? If not, you might just be seeing the overhead of memmove_s() versus memmove().

    Of course, now I wonder if memcpy() is actually faster, and whether we should be using _CRT_SECURE_MEMCPY().

    Stephan T. Lavavej

    Visual C++ Libraries Developer
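The tag dispatch described in that reply (memmove for scalar/trivially copyable element types, element-wise copy otherwise) can be sketched in modern terms; this is an illustration of the technique, not the actual Dinkumware implementation with its _Scalar_ptr_iterator_tag machinery:

```cpp
#include <cstring>
#include <type_traits>

// Dispatch a range copy the way the library does internally: bulk memmove
// when the element type is trivially copyable, element-wise copy otherwise.
// (The memmove branch is never taken for non-trivial types.)
template <typename T>
T* copy_range(const T* first, const T* last, T* dest) {
    if (std::is_trivially_copyable<T>::value) {
        std::memmove(dest, first, (last - first) * sizeof(T));
        return dest + (last - first);
    }
    for (; first != last; ++first, ++dest)
        *dest = *first;
    return dest;
}
```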

  • [Kevin]

    > Dinkumware has implemented nearly everything in

    > sections 5.2 and 8.  Since MS is licensing

    > Dinkumware's implementation, is there a good reason

    > VC9 TR1 does not include sections 5.2 and 8?

    Development time, testing time, and customer value.

    Shipping C99 compat and special math functions would take more dev and test time (probably not an incredible amount of time, but definitely nonzero).

    The special math functions are of extremely limited interest (and have not been picked up into C++0x), and the C99 compat, while useful, is of less interest than the Boost-derived components. (I'd like to have <cstdint>, but shared_ptr is about a bazillion times more vital.)

    So, given limited resources, we chose the Boost-derived components and unordered containers. Does that sound reasonable?

    Stephan T. Lavavej

    Visual C++ Libraries Developer

  • > Of course, now I wonder if memcpy() is actually faster, and whether we should be using _CRT_SECURE_MEMCPY().

Not really, since memmove() in VC checks the ranges for overlap, and delegates to memcpy() if possible anyway. My tests show no noticeable difference for non-overlapping sequences between the two, even for amounts of data as small as 4K. Not a problem with memmove_s(), either - it also does a very quick check, with no observable difference for moderately large vectors.

    I was also thinking that the difference is because I used memset to zero-initialize my arrays, and vector uses std::fill (it could also use memset to default-initialize PODs, by the way), but again it doesn't seem to make any difference. So I don't know what to make of it. Here's the code I've used to test:

    #include <algorithm>

    #include <cstdio>

    #include <cstring>

    #include <vector>

    #include <windows.h>

    struct timer

    {

     const char* msg;

     DWORD start;

     timer(const char* msg)

       : msg(msg), start(GetTickCount())

     {

     }

     ~timer()

     {

       DWORD end = GetTickCount();

       std::fprintf(stderr, "%s: %u\n", msg, end - start);

     }

    };

    int main()

    {

     typedef char elem_t;

     static const int N = 100000, K = 10000;

     //getchar();

     {

       timer t("vector");

       for (int k = 0; k < K; ++k)

       {

         std::vector<elem_t> v(N);

         v.push_back(0);

       }

     }

     {

       timer t("fill + memmove");

       for (int k = 0; k < K; ++k)

       {

         elem_t* a1 = new elem_t[N];

         std::fill(a1, a1 + N, 0);

         elem_t* a2 = new elem_t[N * 2];

         std::memmove(a2, a1, N * sizeof(elem_t));

         delete[] a1;

         a2[N] = 0;

         delete[] a2;

       }

     }

     {

       timer t("fill + memcpy");

       for (int k = 0; k < K; ++k)

       {

         elem_t* a1 = new elem_t[N];

         std::fill(a1, a1 + N, 0);

         elem_t* a2 = new elem_t[N * 2];

         std::memcpy(a2, a1, N * sizeof(elem_t));

         delete[] a1;

         a2[N] = 0;

         delete[] a2;

       }

     }

     {

       timer t("memset + memcpy");

       for (int k = 0; k < K; ++k)

       {

         elem_t* a1 = new elem_t[N];

         std::memset(a1, 0, N * sizeof(elem_t));

         elem_t* a2 = new elem_t[N * 2];

         std::memcpy(a2, a1, N * sizeof(elem_t));

         delete[] a1;

         a2[N] = 0;

         delete[] a2;

       }

     }

    }

    This is compiled with:

    > cl /EHsc /Ox /Zi /D_SECURE_SCL=0 /D__STDC_WANT_SECURE_LIB__=0

    When run, it gives the following results:

    vector: 1344

    fill + memmove: 422

    fill + memcpy: 422

    memset + memcpy: 437

    From what you say, I would expect vector to be as fast as fill+memmove (since that's pretty much what it does). Curiously, these numbers differ with varying K and N; in general, smaller N make the difference more pronounced, while larger N make it less noticeable. For example, for N=10000 & K=1000000 I get:

    vector: 11391

    fill + memmove: 2343

    fill + memcpy: 2313

    memset + memcpy: 2328

    And for N=1000000 & K=1000, it's:

    vector: 5000

    fill + memmove: 4938

    fill + memcpy: 4937

    memset + memcpy: 4938

    Which makes more sense. I'm more interested in cases with N<100000, since that's where it is going to be in most practical cases. From these numbers, it seems that the issue is not in the filling/copying code itself, but in some checks which remain there even with _SECURE_SCL disabled.

  • > I'd like to have <cstdint>, but shared_ptr is about a bazillion times more vital.

    <cstdint> is by far the most important of all TR1 compat headers, and also the simplest; why not have it and forgo the rest?

  • I'm disappointed at stdint/cstdint being missing, too. I hate typing unsigned __int64 in one-off programs and I've never seen a major project that didn't have to have its own typedefs for sized types. Lots of fun when they collide. Much of the stuff in TR1 section 8 I could do without, but stdint.h I want. I suppose not having the extra printf() size specifiers would be a bummer, though.
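The project-local typedefs that comment describes typically look something like the sketch below (names and version check are illustrative only), falling back to the MS-specific sized types where <cstdint> is unavailable:

```cpp
// Hypothetical pre-<cstdint> typedef header of the kind every major
// project ends up carrying; 'proj_' names are illustrative only.
#if defined(_MSC_VER) && _MSC_VER < 1600   /* pre-VC10: MS-specific sized types */
typedef __int32          proj_int32;
typedef unsigned __int64 proj_uint64;
#else                                       /* relies on the common LLP64/LP64 sizes */
typedef int                proj_int32;
typedef unsigned long long proj_uint64;
#endif
```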

  • Well, for most practical purposes, we've got a de facto standard by now:

    sizeof(short)==2

    sizeof(int)==4

    sizeof(long long)==8

    At least I'm not aware of any still-relevant compiler for which this doesn't hold (I know int was 2 bytes in 16-bit DOS days, but these are hardly relevant today).

    Even so, int32_t is so much more preferable...
