Introducing ‘Vector Calling Convention’


Introduction


In VS2013 we have introduced a new calling convention known as 'Vector Calling Convention'. Before we go on and introduce the new calling convention, let us take a look at the lay of the land today.

There are multiple calling conventions in use today, especially on the x86 platform; the x64 platform, by contrast, has been blessed with only one. On x86, the Visual C/C++ compiler supports several calling conventions (__cdecl, __clrcall, __stdcall, __fastcall, and others), with __cdecl being the default for C/C++ programs. x64, however, just uses the __fastcall calling convention.

The types __m64, __m128 and __m256 define the SIMD data types that fit into the 64-bit MMX registers, the 128-bit XMM registers, and the 256-bit YMM registers, respectively. The Microsoft compiler has partial support for these types today.

 Let us take a look at an example to understand what this (i.e. partial support) really means:  


                                                  
Figure 1: arguments are passed by reference implicitly on x64.

Today, when targeting AMD64, vector arguments passed by value (such as __m128 or __m256) must be turned into arguments passed by the address of a temporary buffer (i.e. $T1, $T2, $T3 in the figure above) allocated in the caller's local stack, as shown in the figure above. We have been receiving increasing concerns about this inefficiency over the past years, especially from the game, graphics, video/audio, and codec domains. A concrete example is the Microsoft XNA library, in which passing vector arguments is a common pattern across many APIs of the XNAMath library. The inefficiency will only intensify on the upcoming AVX2/AVX-512 and future processors with wider vector registers.
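To make this concrete, here is a minimal sketch of the kind of function the figure describes; the function name and body are illustrative and not taken from the original figure:

    #include <intrin.h>

    // With the default x64 convention, each __m128 below is passed by value in
    // the source, but the compiler copies it into a temporary on the caller's
    // stack ($T1, $T2, $T3) and passes the address of that temporary instead of
    // using an XMM register.
    __m128 add3(__m128 a, __m128 b, __m128 c)
    {
        return _mm_add_ps(_mm_add_ps(a, b), c);
    }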

On x86, the convention is a little more advanced: the first three vector arguments passed by value are passed in registers XMM0:XMM2. However, a fourth or subsequent vector argument is not allowed and causes error C2719. Developers today are forced to manually turn it into a pass-by-reference argument to get around this limitation on x86.
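As an illustration (the identifiers below are made up for this sketch), the x86 limitation and the usual workaround look like this:

    #include <intrin.h>

    // On x86 with the default convention, a, b and c go in XMM0..XMM2, but the
    // fourth by-value vector parameter is rejected with error C2719.
    __m128 sum4_byval(__m128 a, __m128 b, __m128 c, __m128 d);       // error C2719 on x86

    // Typical workaround today: pass the extra argument by reference.
    __m128 sum4_byref(__m128 a, __m128 b, __m128 c, const __m128& d);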

 

How to make use of Vector Calling Convention?

 
The new calling convention focuses on utilizing vector registers for passing vector type arguments. With Vector Calling Convention the design consideration was to avoid creating a totally different convention and to remain compatible with the existing convention for integer and floating-point arguments. This design consideration was further extended to avoid changing the stack layout or dealing with padding and alignment. Please note that the vector calling convention is only supported for native amd64/x86 targets; it does not apply to the MSIL (/clr) target.

The new calling convention can be triggered in the following two ways: 
 

  • __vectorcall: Use the new __vectorcall keyword to control the calling convention of specific functions. For example, take a look at Figure 2 below (and the sketch after this list): 

                                        Figure 2: '__vectorcall' denotes the use of Vector Calling Convention

  • The other way vector calling convention can be used is by specifying the /Gv compiler switch. Using the /Gv compiler option causes each function in the module to compile as __vectorcall unless the function is declared with a conflicting attribute, or the name of the function is main.  
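Since Figure 2 may not be reproduced here, the following is a minimal sketch of the first option; the function name and body are illustrative, not taken from the original figure:

    #include <intrin.h>

    // Marking the function __vectorcall lets the __m128 arguments arrive
    // directly in XMM registers instead of being spilled to stack temporaries
    // and passed by address.
    __m128 __vectorcall add3(__m128 a, __m128 b, __m128 c)
    {
        return _mm_add_ps(_mm_add_ps(a, b), c);
    }

For the second option, the same effect can be had without touching the source, for example (a hypothetical command line, file name illustrative):

    cl /O2 /Gv example.cpp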

In addition to SIMD data types, Vector Calling Convention can also be used for the Homogeneous Vector Aggregate data type (HVA) and the Homogeneous Float Aggregate data type (HFA). An HVA/HFA is a composite type whose members all have the same fundamental type, and that type is a vector or floating-point type (__m128, __m256, float/double). An HVA/HFA data type can have at most four members. Some examples of what constitutes an HVA/HFA data type are listed below.

                                        
Figure 3: HVA/HFA examples
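In case the figure is not visible, here are representative examples; these particular structs are illustrative and not necessarily the ones shown in the original figure:

    #include <immintrin.h>

    // HVA/HFA: all members share the same vector or floating-point type,
    // and there are at most four of them.
    struct hva2 { __m128 x; __m128 y; };                          // HVA of two __m128
    struct hva4 { __m256 a; __m256 b; __m256 c; __m256 d; };      // HVA of four __m256
    struct hfa3 { float x; float y; float z; };                   // HFA of three floats

    // Not an HVA/HFA: mixed member types (or more than four members).
    struct mixed { __m128 x; float y; };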

For both architectures (x86 and amd64), HVA/HFA arguments will be passed by value in vector registers if the unallocated volatile vector registers (XMM0:XMM5/YMM0:YMM5) are sufficient to hold the entire aggregate. Otherwise, they will be passed by reference, the same way as in the existing convention. A return value of HVA/HFA type is returned via XMM0:XMM3 (or YMM0:YMM3), one register per element.

 

Vector Calling Convention /Disasm

 
Now that we understand a little about what Vector Calling Convention is really about, let us take a look at the disassembly generated with the use of Vector Calling Convention in the figure below.

 

As you can see, the disassembly generated as a result of using Vector Calling Convention is simplified. The number of instructions with and without vector calling convention is displayed below.


In addition to the instructions saved, the stack allocation (96 bytes for $T1, $T2 and $T3) is also saved by using vector calling convention, which adds to the general goodness and results in performance gains.
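If you want to reproduce a comparison like this yourself, one simple approach (the file name below is illustrative) is to generate assembly listings with and without the vector calling convention and diff them:

    cl /c /O2 /FAs example.cpp       (default convention)
    cl /c /O2 /FAs /Gv example.cpp   (vector calling convention)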

 

Wrap Up

 
This blog post should provide you with an introduction to what Vector Calling Convention is all about. As you can see, there is a lot of goodness in using this convention if you perform a lot of vector calculations in your code, especially on the x64 platform. One quick way to validate the performance gain from Vector Calling Convention for vector code, without changing the source, is to use the /Gv compiler switch. At this point you should have everything you need to get started! Additionally, if you would like us to blog about some other compiler technology, please let us know; we are always interested in learning from your feedback. 

 

  • Interesting, thanks for the update!

    Out of curiosity, perhaps this is as good place to ask as any ;-)

    Were there any changes made in VC++2013 that would improve the situation in the following:

    www.g-truc.net/post-0571.html

    // VC++2012 (and comparable versions of Clang and GCC) were unable to vectorize seemingly simple code.

  • I thought that Windows x64 SEH only supported __fastcall.

    So is it the case then that you cannot use SEH from within __vectorcall methods?

  • What about making the default heaps on x86 (and ARM 32) 16-byte aligned instead of 8?  That's an even bigger obstacle to widespread SIMD development on Win32 (and by extension Win64 to share the same code).   We currently have to overload global new/delete and malloc/free to aligned versions, or supply cumbersome alignment operators to all STL items and classes.  I seem to recall that the VS 2010 push_back didn't even support aligned data.

  • This looks like a good change, but a lot of the low-level intrinsic work could be made lower priority if in-line assembly existed with a smarter asm keyword.  The problem with the current asm keyword is that it makes the compiler "forget" almost all of its state, and the compiler ends up distrusting all current register state.  GCC has an asm keyword that lets you specify what gets clobbered, so that the compiler can optimize across the asm statement.

    If such a keyword existed in visual studio, motivated developers and library writers could use processor instructions efficiently before intrinsics are added.

  • __vectorcall was really a nice surprise when it was announced. Thank you for this. :)

    I was always completely baffled that whoever came up with your default x64 calling convention got it so wrong. We finally have an architecture where SSE2 is *guaranteed* to be supported. And then they carefully design a calling convention to *not* use it. (Out of curiosity, does anyone happen to know if there ever was a rationale for this?)

    Is all of this supposed to work in the preview though? I played around with it a week or two ago, and I couldn't get it working with your hfa examples. Worked perfectly with structs containing __m128's, but not if they contained floats. I guess I must have done something wrong.

    About the /Gv compiler switch, how does that handle function declarations from third-party headers? Say I include a header from another library (which does not use __vectorcall, and does not explicitly specify a calling convention in their function declarations), won't they then be interpreted as __vectorcall, and won't that give me linker errors?

    (I suppose a way around that would be if the mangled function name was only decorated as __vectorcall by the compiler if it actually contains one or more arguments affected by the calling convention)

    As @Alecazam points out, being unable to use std::vector or other standard library containers with 16-byte aligned datatypes in x86 code has been a major headache. Could the necessary std lib functions be defined as __vectorcall to fix this?

    Anyway, __vectorcall is a very welcome addition. And thanks for the blog post about it :)

  • Watcom's inline assembly?  Powerful stuff.  15 years back in time and far more powerful than any MS inline.  Oh, wait, you don't even do inline x64. Wow!  Go find those Watcom people.  Hire them.  Get things really moving forward.

  • sooo... the days of "there is only one calling convention on x64" are gone now ;-)

  • It may be worth highlighting that this also applies to scalar arguments and returns on x86. For example:

    float sum(float x, float y) {

       // of course, such a function is normally inlined but let's ignore that for the sake of example

       return x + y;

    }

    In the usual x86 calling conventions the arguments are passed on the stack and the return is made via an x87 stack register even if the SSE instruction set is enabled. The use of the x87 stack is particularly nasty because it can prevent the compiler from using SSE instructions or it requires an additional copy from an SSE register to an x87 register.

    If the __vectorcall convention is used then the 2 arguments are passed via xmm0 and xmm1 and the return is done via xmm0. The whole function reduces to a "addss xmm0, xmm1" instruction and doesn't require any memory accesses.

    Note that if the function is not exported, its address is never taken and you have LTCG enabled, then the compiler will ignore a calling convention like __fastcall and will use a custom calling convention that's more or less similar to __vectorcall.

  • To JDT:

    You can use SEH within __vectorcall methods. __vectorcall functions can call functions with other calling conventions.

    To Alecazam:

    We realize the inconvenience that the default heap allocation from the CRT is 8-byte aligned on x86, and we have evaluated several ways to address this issue. One of the approaches is to make 16-byte alignment the default. The problem is that it would have a code size impact on current applications, especially Windows. We also evaluated introducing a declaration keyword on heap-allocated objects, or letting the compiler figure out the alignment and generate a trampoline to call aligned_malloc for new. But that is limited by the C++ conformance requirement that the new operator needs to call the global new. So as of today, you have to overload global new/delete and malloc/free to address this headache.  We would like to hear your input on this.

    To Grumpy:

    1) Thanks for trying __vectorcall using Preview. Can you share the hfa case which fails to you?

    2) I don't fully understand your first question, "they carefully design a calling convention to *not* use it." /arch:SSE2 is the default option, which generates SSE2 instructions on x64.

    3) When you use /Gv, the compiler will treat any function that does not have a calling convention decoration within the compilation unit as __vectorcall, unless it is a vararg function. This includes function declarations from third-party headers, so you could get linker errors if an inconsistency occurs. The expected behavior is similar to /Gr and /Gd on x86.

    4) It is a good suggestion to make the STL functions __vectorcall. We will discuss this with the library team.

    To Mike:

    You make a good point that __vectorcall can avoid x87 instructions when passing/returning floating-point arguments. That is one of the goals of introducing __vectorcall. It not only has a performance impact but also lowers power consumption, because the x87 unit does not need to be powered up as frequently.

    Thank you all for your comments.

    Charles Fu

    Microsoft Visual C++ Team

  • @Charles Fu: "I don't fully understand your first question "they carefully design a calling convention to *not* use it. ".

    The default x64 calling convention uses SSE registers to pass scalar values like float and double but forces __m128 values to be passed on stack instead of using the SSE registers. Basically __vectorcall is what the default x64 calling convention should have been in the first place.

  • As a game developer this is a very welcome addition but it is very inconvenient to use.

    Using /Gv will not be possible for us until we get 3rd party libraries that are also compiled with it, which might take a while, and attributing every function that takes or returns a type containing SIMD data is just madness.

    It would be much more useful if the type could be attributed:

    struct __vectorcall Particle // any function that takes or returns a Particle will use __vectorcall convention

    {

    __m128 x;

    __m128 y;

    };

    Are there reasons for not enabling this? If not, I would like to strongly encourage you to add support for it.

  • @Simon, that is an interesting suggestion and good feedback. I will discuss this internally and get back to you :), is it possible to connect to you through email ? (I can be reached at aasthan@microsoft.com).

  • Hi, thanks for the post on how it works. Considering that performance is the whole point of this, could we please see some measures, perhaps even some graphs, on how this helps with performance?

  • DirectXMath 3.05 in the Windows SDK for Windows 8.1 Preview included in the VS 2013 Preview--phew! What a mouthful. Let's just say the 18.00 compiler with the Windows 8.1 SDK--has been updated for __vectorcall. I also updated the DirectXMath processor specific extensions series, Spherical Harmonics (SH) math library, the XDSP.h digital signal processing helper library, and the DirectX Tool Kit. Many of them are 'inline' so when it gets inlined it doesn't matter what the calling convention was set to, but it does make a difference in codegen whenever the compiler decides not to inline.

    blogs.msdn.com/.../directxmath-sse-sse2-and-arm-neon.aspx

    go.microsoft.com/.../p

    blogs.msdn.com/.../xdsp-h.aspx

    go.microsoft.com/fwlink

    @Ben Craig: The asm keyword is only supported for x86. Intrinsics support x86, x64, and ARM -and- it doesn't break the compiler's optimizations contexts. In some sense, intrinsics are the 'smart asm' keyword. Writing assembly for x64 and ARM is also really painful to deal with all the exception handler requirements. DirectXMath, SH math, and XDSP.H all use intrinsics and no asm.

    @grumpy: The original x64 __fastcall calling convention was created by analyzing 'standard' code across a broad spectrum of usage in Windows applications. One of the driving factors was also minimizing marshaling costs for things like .NET interop as well, which is what __fastcall does (you only have to marshal XMM0L - XMM3L, not the whole SSE register file or have to worry about ever-increasing SSE register widths ala AVX)

    @Azarien: __fastcall is still the best choice for 'system OS' style functions, and is still the one and only x64 calling convention for .NET interop, WinRT, and numerous other cases. __vectorcall is really best when used for "internal" code within a DLL or EXE. You can use it to cross such a boundary, but it's complicated and requires an agreement on a 'new' calling convention between the components.

    @Simon: /Gv is currently a little problematic until everyone gets the news on the existence of __vectorcall and decides to explicitly annotate their headers (either with __fastcall or __vectorcall or any of the various aliases like __cdecl, __stdcall, STDAPI, STDMETHOD, etc.) as appropriate. If you are rebuilding all your code, it's not a problem, but of course most "really useful engines" consume some binaries that are only available as static libraries or import libs.

  • @Charles Fu: sorry, it looks like I'd gotten my test cases mixed up before. It does work just fine with HFA.

    :)
