New Intrinsic Support in Visual Studio 2008

Hello. This is Dylan Birtolo, a UE writer for Visual C++. This is my first vcblog entry, but hopefully I will become a regular contributor. One of my most recent tasks was to incorporate the documentation for all of the new intrinsic functions being added to Visual C++ in Visual Studio 2008. It is very exciting, since support for over 100 intrinsics was added.

Before getting to the intrinsics themselves, it is important to mention why you should prefer intrinsics over inline assembly (inline asm) when either could be used to access the instructions directly. Here are some reasons to consider using the intrinsics (a short example follows the list):

  • Inline asm is not supported by the 64-bit Visual C++ compiler. Therefore, if you want your code to be 64-bit compatible, you need to use intrinsics.
  • Ease of use. The intrinsics do not require you to be aware of registers or manage memory directly. Instead, you have a function that is complete with inputs and return values. This makes the instructions more accessible to a wider range of technical expertise.
  • The intrinsics are updated in the compiler. What this means from a user perspective is that if the compiler improves how it handles the intrinsics, you receive this benefit immediately. Otherwise, if you are using inline asm, you will be responsible for making any improvements.
  • The optimizer does not work well with inline asm code, so it is recommended that you write inline asm code in its own function, assemble it, and link it in. With the intrinsics, those additional steps are not necessary.
  • Intrinsics are also more portable than code that uses inline asm.
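
To make the first point concrete, here is a minimal sketch of the same byte swap written both ways; the __asm version compiles only for x86, while the _byteswap_ulong intrinsic (declared in <stdlib.h>) compiles for both x86 and x64. The function names here are illustrative, not from the documentation.

    #include <stdlib.h>

    // x86 only: the 64-bit compiler rejects __asm blocks.
    #ifndef _M_X64
    unsigned long bswap_asm(unsigned long v)
    {
        __asm {
            mov   eax, v
            bswap eax
        }
        // Result is returned in EAX (expect warning C4035 here).
    }
    #endif

    // Compiles for both x86 and x64.
    unsigned long bswap_intrin(unsigned long v)
    {
        return _byteswap_ulong(v);
    }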

Now let's get back to the new intrinsics. For the most part, these functions provide support for the Supplemental Streaming SIMD Extensions 3 (SSSE3), Streaming SIMD Extensions 4.1 (SSE4.1), SSE4.2, and SSE4A instruction sets. A handful of intrinsics were also added to support advanced bit manipulation (ABM) instructions not available on earlier chipsets. All of these new intrinsics are first supported by the Penryn and Nehalem architectures for Intel and the third-generation AMD Opteron processors for AMD. However, regardless of your processor, you should always verify that a given intrinsic is supported before you attempt to use it. Not doing so could result in an invalid-instruction fault at run time.

To facilitate this verification process, the documentation for the __cpuid intrinsic has been updated. The latest Visual Studio 2008 documentation contains a sample program in the __cpuid topic that you can copy, compile, and run; it prints out in plain text which technologies your processor supports.
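
For example, a minimal check might look like the following sketch; the feature-flag bit positions come from CPUID leaf 1, and real code should prefer the fuller sample in the __cpuid topic:

    #include <intrin.h>
    #include <stdio.h>

    int main()
    {
        int info[4];
        __cpuid(info, 1);  // leaf 1: feature flags in ECX (info[2]) and EDX (info[3])

        printf("SSSE3:  %s\n", (info[2] & (1 << 9))  ? "yes" : "no");
        printf("SSE4.1: %s\n", (info[2] & (1 << 19)) ? "yes" : "no");
        printf("SSE4.2: %s\n", (info[2] & (1 << 20)) ? "yes" : "no");
        printf("POPCNT: %s\n", (info[2] & (1 << 23)) ? "yes" : "no");
        return 0;
    }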

All of the intrinsics are straightforward and have documentation as well as code samples. Take a look. Tables for the new intrinsics can be found in the following three topics: SSE4A and Advanced Bit Manipulation Intrinsics, Streaming SIMD Extensions 4 Instructions, and Supplemental Streaming SIMD Extensions 3 Instructions.

Here is a list of the new intrinsics, organized by the instruction they support. Several of the instructions are very similar and differ only in the size of their input parameters; to save space, these are listed together. The one unusual case that bears special consideration is POPCNT, which is listed under both SSE4.2 and ABM so that the intrinsics remain compatible with both AMD and Intel processors. A brief usage sketch follows the list.

  • SSE
    • CVTSI2SS - Converts a 64-bit signed integer to a floating point value and inserts it into a 128-bit parameter. Intrinsics: _mm_cvtsi64_ss
    • CVTSS2SI - Extracts a 32-bit floating point value and rounds it to a 64-bit integer. Intrinsics: _mm_cvtss_si64
    • CVTTSS2SI - Extracts a 32-bit floating point value and truncates it to a 64-bit integer. Intrinsics: _mm_cvttss_si64
  • SSE2
    • CVTSD2SI - Extracts the lowest 64-bit floating point value and rounds it to a 64-bit integer. Intrinsics: _mm_cvtsd_si64
    • CVTSI2SD - Converts a 64-bit signed integer to a floating point value and inserts it into a 128-bit parameter. Intrinsics: _mm_cvtsi64_sd
    • CVTTSD2SI - Extracts the lowest 64-bit floating point value and truncates it to a 64-bit integer. Intrinsics: _mm_cvttsd_si64
    • MOVNTI - Writes 64 bits to a specified memory location. Intrinsics: _mm_stream_si64
    • MOVQ - Moves a 64-bit integer either to or from a 128-bit parameter. Intrinsics: _mm_cvtsi64_si128, _mm_cvtsi128_si64
  • SSSE3
    • PABSB / PABSW / PABSD - Gets the absolute value of signed integers. Intrinsics: _mm_abs_epi8, _mm_abs_epi16, _mm_abs_epi32, _mm_abs_pi8, _mm_abs_pi16, _mm_abs_pi32
    • PALIGNR - Combines two parameters and right-shifts the result. Intrinsics: _mm_alignr_epi8, _mm_alignr_pi8
    • PHADDSW - Adds two parameters that contain 16-bit signed integers, saturating the result at the maximum value for 16 bits. Intrinsics: _mm_hadds_epi16, _mm_hadds_pi16
    • PHADDW / PHADDD - Adds two parameters that contain signed integers. Intrinsics: _mm_hadd_epi16, _mm_hadd_epi32, _mm_hadd_pi16, _mm_hadd_pi32
    • PHSUBSW - Subtracts two parameters that contain 16-bit signed integers, saturating the result at the maximum value for 16 bits. Intrinsics: _mm_hsubs_epi16, _mm_hsubs_pi16
    • PHSUBW / PHSUBD - Subtracts two parameters that contain signed integers. Intrinsics: _mm_hsub_epi16, _mm_hsub_epi32, _mm_hsub_pi16, _mm_hsub_pi32
    • PMADDUBSW - Multiplies and adds together 8-bit integers. Intrinsics: _mm_maddubs_epi16, _mm_maddubs_pi16
    • PMULHRSW - Multiplies 16-bit signed integers and right shifts the results. Intrinsics: _mm_mulhrs_epi16, _mm_mulhrs_pi16
    • PSHUFB - Selects and shuffles 8-bit chunks from a 128-bit parameter. Intrinsics: _mm_shuffle_epi8, _mm_shuffle_pi8
    • PSIGNB / PSIGNW / PSIGND - Negates, zeroes, or preserves signed integers. Intrinsics: _mm_sign_epi8, _mm_sign_epi16, _mm_sign_epi32, _mm_sign_pi8, _mm_sign_pi16, _mm_sign_pi32
  • SSE4A
    • EXTRQ - Extracts specified bits from the parameter. Intrinsics: _mm_extract_si64, _mm_extracti_si64
    • INSERTQ - Inserts specified bits into a given parameter. Intrinsics: _mm_insert_si64, _mm_inserti_si64
    • MOVNTSD / MOVNTSS - Writes bits directly to a specified memory location without polluting the caches. Intrinsics: _mm_stream_sd, _mm_stream_ss
  • SSE4.1
    • DPPD / DPPS - Calculates the dot product of two parameters. Intrinsics: _mm_dp_pd, _mm_dp_ps
    • EXTRACTPS - Extracts a specified 32-bit floating point value from the parameter. Intrinsics: _mm_extract_ps
    • INSERTPS - Inserts a 32-bit floating point value into a 128-bit parameter and potentially zeroes out some bits. Intrinsics: _mm_insert_ps
    • MOVNTDQA - Loads 128 bits of data from a specified memory location. Intrinsics: _mm_stream_load_si128
    • MPSADBW - Calculates eight offset sums of absolute difference. Intrinsics: _mm_mpsadbw_epu8
    • PACKUSDW - Converts 32-bit signed integers to signed 16-bit integers using 16-bit saturation. Intrinsics: _mm_packus_epi32
    • PBLENDW / BLENDPD / BLENDPS / PBLENDVB / BLENDVPD / BLENDVPS - Blends two parameters together at various chunk sizes. Intrinsics: _mm_blend_epi16, _mm_blend_pd, _mm_blend_ps, _mm_blendv_epi8, _mm_blendv_pd, _mm_blendv_ps
    • PCMPEQQ - Compares 64-bit integers for equality. Intrinsics: _mm_cmpeq_epi64
    • PEXTRB / PEXTRW / PEXTRD / PEXTRQ - Extracts an integer from the input parameter. Intrinsics: _mm_extract_epi8, _mm_extract_epi16, _mm_extract_epi32, _mm_extract_epi64
    • PHMINPOSUW - Selects the minimum 16-bit unsigned integer and determines its index. Intrinsics: _mm_minpos_epu16
    • PINSRB / PINSRD / PINSRQ - Inserts an integer into a 128-bit parameter. Intrinsics: _mm_insert_epi8, _mm_insert_epi32, _mm_insert_epi64
    • PMAXSB / PMAXSD - Takes signed integers from two parameters and selects the maximum. Intrinsics: _mm_max_epi8, _mm_max_epi32
    • PMAXUW / PMAXUD - Takes unsigned integers from two parameters and selects the maximum. Intrinsics: _mm_max_epu16, _mm_max_epu32
    • PMINSB / PMINSD - Takes signed integers from two parameters and selects the minimum. Intrinsics: _mm_min_epi8, _mm_min_epi32
    • PMINUW / PMINUD - Takes unsigned integers from two parameters and selects the minimum. Intrinsics: _mm_min_epu16, _mm_min_epu32
    • PMOVSXBW / PMOVSXBD / PMOVSXBQ / PMOVSXWD / PMOVSXWQ / PMOVSXDQ - Converts signed integers of one size to a larger size. Intrinsics: _mm_cvtepi8_epi16, _mm_cvtepi8_epi32, _mm_cvtepi8_epi64, _mm_cvtepi16_epi32, _mm_cvtepi16_epi64, _mm_cvtepi32_epi64
    • PMOVZXBW / PMOVZXBD / PMOVZXBQ / PMOVZXWD / PMOVZXWQ / PMOVZXDQ - Converts unsigned integers of one size to a larger size. Intrinsics: _mm_cvtepu8_epi16, _mm_cvtepu8_epi32, _mm_cvtepu8_epi64, _mm_cvtepu16_epi32, _mm_cvtepu16_epi64, _mm_cvtepu32_epi64
    • PMULDQ - Multiplies 32-bit signed integers and stores the result as 64-bit signed integers. Intrinsics: _mm_mul_epi32
    • PMULLD - Multiplies 32-bit signed integers, storing the low 32 bits of each result. Intrinsics: _mm_mullo_epi32
    • PTEST - Performs a bitwise test of two 128-bit parameters and returns a value based on the CF and ZF flags. Intrinsics: _mm_testc_si128, _mm_testnzc_si128, _mm_testz_si128
    • ROUNDPD / ROUNDPS - Rounds floating point values. Intrinsics: _mm_ceil_pd, _mm_ceil_ps, _mm_floor_pd, _mm_floor_ps, _mm_round_pd, _mm_round_ps
    • ROUNDSD / ROUNDSS - Combines two parameters, rounding a floating point value from one of them. Intrinsics: _mm_ceil_sd, _mm_ceil_ss, _mm_floor_sd, _mm_floor_ss, _mm_round_sd, _mm_round_ss
  • SSE4.2
    • CRC32 - Calculates the CRC-32C checksum of a parameter. Intrinsics: _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32, _mm_crc32_u64
    • PCMPESTRI / PCMPESTRM - Compares two strings of explicitly specified lengths. Intrinsics: _mm_cmpestra, _mm_cmpestrc, _mm_cmpestri, _mm_cmpestrm, _mm_cmpestro, _mm_cmpestrs, _mm_cmpestrz
    • PCMPGTQ - Compares 64-bit integers for greater-than. Intrinsics: _mm_cmpgt_epi64
    • PCMPISTRI / PCMPISTRM - Compares two implicit-length (NUL-terminated) strings. Intrinsics: _mm_cmpistra, _mm_cmpistrc, _mm_cmpistri, _mm_cmpistrm, _mm_cmpistro, _mm_cmpistrs, _mm_cmpistrz
    • POPCNT - Counts the number of bits set to 1. Intrinsics: _mm_popcnt_u32, _mm_popcnt_u64, __popcnt16, __popcnt, __popcnt64
  • Advanced Bit Manipulation
    • LZCNT - Counts the number of leading zero bits in a parameter. Intrinsics: __lzcnt16, __lzcnt, __lzcnt64
    • POPCNT - Counts the number of bits set to 1. Intrinsics: _mm_popcnt_u32, _mm_popcnt_u64, __popcnt16, __popcnt, __popcnt64
  • Other new intrinsics
    • _InterlockedCompareExchange128 - Performs a 128-bit interlocked compare-and-exchange.
    • _mm_castpd_ps / _mm_castpd_si128 / _mm_castps_pd / _mm_castps_si128 / _mm_castsi128_pd / _mm_castsi128_ps - Reinterprets between 32-bit floating point values (ps), 64-bit floating point values (pd), and 32-bit integers (si128).
    • _mm_cvtsd_f64 - Extracts the lowest 64-bit floating point value from the parameter.
    • _mm_cvtss_f32 - Extracts a 32-bit floating point value.
    • _rdtscp - Generates the RDTSCP instruction, writes TSC_AUX[31:0] to the specified memory location, and returns the 64-bit time stamp counter result.
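
As a brief illustration, here is a hedged sketch using two of the new intrinsics; it assumes the host CPU supports SSE4.1 and POPCNT, which real code should first confirm with the __cpuid check shown above:

    #include <intrin.h>     // __popcnt
    #include <smmintrin.h>  // SSE4.1

    int main()
    {
        // POPCNT: count the bits set to 1 in a 32-bit value (8 here).
        unsigned int bits = __popcnt(0xF0F0u);

        // PMULLD via _mm_mullo_epi32: per-element 32-bit multiply that
        // keeps the low 32 bits of each product: {10, 40, 90, 160}.
        __m128i a = _mm_set_epi32(1, 2, 3, 4);
        __m128i b = _mm_set_epi32(10, 20, 30, 40);
        __m128i p = _mm_mullo_epi32(a, b);

        return (int)bits + _mm_cvtsi128_si32(p);  // 8 + 160
    }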

  • Intrinsics do have a lot of advantages over inline assembly, but I'd like to provide some counterarguments based on my experiences with intrinsics implemented in VC++. In the past, I've had to avoid intrinsics for the following reasons:

    * Incorrect code generation. VC7.1 had a bug with the global optimizer that made MMX intrinsics unusable, because it would move MMX instructions above x87 ones or past EMMS. VC8, even SP1, continues to suffer from a bug that causes it to sometimes misalign return value temporaries that require 16-byte alignment.

    * Poor code generation. VC6-VC7.1 often produced output that was more than 50% move instructions. VC8 is better, but the difference is still a concern when you consider that code that uses intrinsics is usually the most performance critical in the program. I've found that Intel C/C++ produces substantially tighter vector code, with a very low proportion of move instructions.

    * Poor cooperation with scalar code generation. My experience has been that attempts to mix float and __m128 lead to values being bounced to memory, even with /arch:SSE. Also, in image processing code that uses MMX/SSE I often need fractional stepping for addresses, which is hindered by awful VC++ code generation for __int64 math on x86 -- the worst case being that (x << 32) is sometimes converted to a multiply!

    * Incomplete intrinsics support. VC++ supports most or all vector instructions, but scalar instruction coverage is spotty. Until VC7, it wasn't possible to do an endian swap (BSWAP), bit search (BSF/BSR), or interlocked logic op (LOCK AND/OR/XOR). I still can't do a FISTP for a rounded conversion to int or a CDQ or SBB for a branchless test (both of which are required for fast floor/ceil to int). I can do an extended precision multiply, but not an extended precision add (ADC). I often need to use these operations in a critical inner loop, and not having them available means I have to write the entire loop in asm.

    I'm hoping that the compiler redesign in Orcas+N will allow for some major improvements in these areas.

    The silliest assembly I ever wrote was for an AMD64 port of a program which needed to compute Y = (A*B+C)/D, with all vars 64-bit and a 96-bit intermediate result for the dividend. Since half the instructions were not accessible by intrinsics and the compiler doesn't support inline assembly, I ended up with a single .asm file with a dinky five-instruction function in it.
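
    A hedged sketch of that scenario on x64 (asm_div128by64 is a hypothetical stand-in for the external .asm helper; _umul128 is an x64-only intrinsic):

        #include <intrin.h>

        // Hypothetical external helper in the .asm file:
        // computes (hi:lo) / d with a single DIV instruction.
        extern "C" unsigned __int64 asm_div128by64(unsigned __int64 hi,
                                                   unsigned __int64 lo,
                                                   unsigned __int64 d);

        unsigned __int64 muldiv(unsigned __int64 a, unsigned __int64 b,
                                unsigned __int64 c, unsigned __int64 d)
        {
            unsigned __int64 hi;
            unsigned __int64 lo = _umul128(a, b, &hi);  // hi:lo = a*b
            unsigned __int64 oldlo = lo;
            lo += c;
            if (lo < oldlo)
                ++hi;  // carry out of the low 64 bits
            // No 128/64 divide intrinsic exists, hence the .asm file.
            return asm_div128by64(hi, lo, d);
        }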

  • "you should always verify that a given intrinsic is supported before you attempt to use it. Not doing so could result in a run-time error."

    Is there an option to get it emulated if an exception occurs?  For comparison, in some architectures including at least one big famous one and one formerly big famous one, hardware floating point used to be optional.  Programmers still wanted to write code using floating point, and get emulation automatically if the hardware wasn't present.

    > Incorrect code generation.

    Interlocked________

  • Does Orcas support SSE3 or SSE4 code generation? Like /arch:SSE4?

  • If possible, emulate intrinsics if they do not exist on the platform.  This gives a consistent programming interface across different platforms.

  • Thank you for your comments and questions. Here are some answers I was able to find out from our team.

    "Does Orcas support SSE3 or SSE4 code generation? like /arch:SSE4?"

    No, it does not. Access to the SSE3 and SSE4 instructions is provided, but the compiler does not provide any mechanism to take advantage of the new instructions automatically (as was done with the previous /arch:SSEx switches).

    "Is there an option to get an intrinsic emulated if an exception occurs?"

    As far as I know, there is not. I suspect that this emulation is something that would have to be provided at the OS level (as I believe was done with floating point) and not at the level of the compiler toolset.

    I hope this helps.

  • "Is there an option to get an intrinsic emulated if an exception occurs?"

    I imagine it is possible to catch an invalid instruction exception and perform the operation internally. The performance cost would be huge however, since an exception would be thrown on most instructions in a tight loop, so you're probably best off to simply write both and test for support before starting the operation.

    Much of my code looks like this:

    | if(CPU::Features::supportsSSE2())
    | {
    |     __asm {
    |         // Optimised version
    |     }
    | }
    | else
    | {
    |     // FPU version (compiled)
    | }

    Since inline assembly is not supported on the 64-bit compiler, I would simply move the optimised version into a separate file and call the function.

    Dylan, is it possible to add compiler flags to emulate particular intrinsics? This way different builds can be made for different instruction sets from a single source. IMHO, this would be the best reason for using intrinsics.
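
    For comparison, a sketch of the same dispatch written with intrinsics rather than __asm, which compiles unchanged for x64; cpu_supports_sse2 is a hypothetical stand-in for a __cpuid-based check (CPUID leaf 1, EDX bit 26):

        #include <emmintrin.h>  // SSE2
        #include <cstddef>

        extern bool cpu_supports_sse2();  // hypothetical runtime check

        // Adds two float arrays; n is assumed a multiple of 4 for brevity.
        void add_arrays(float *dst, const float *a, const float *b, size_t n)
        {
            if (cpu_supports_sse2())
            {
                for (size_t i = 0; i < n; i += 4)  // optimised version
                    _mm_storeu_ps(dst + i,
                                  _mm_add_ps(_mm_loadu_ps(a + i),
                                             _mm_loadu_ps(b + i)));
            }
            else
            {
                for (size_t i = 0; i < n; ++i)     // scalar fallback
                    dst[i] = a[i] + b[i];
            }
        }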

  • Do you guys plan on optimizing the _mm_set_ps/_mm_set_ps1 intrinsics for the constant case, as both ICL and GCC compilers do?

    It's very convenient to define local __m128 constants using them. E.g. the following scalar code:

    | float x;
    | ...
    | x = x*1.5f;

    could have the following vector analogy:

    | __m128 x;
    | ...
    | x = _mm_mul_ps(x,_mm_set_ps1(1.5f));

    Or, if you have various factors for different channels:

    | x = _mm_mul_ps(x,
    |   _mm_set_ps(1.5f,2.5f,3.1f,4.0f));

    Unfortunately, this is not yet possible with Visual C++: instead of automatically generating a constant __m128 value, the compiler performs a full-blown run-time conversion.
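
    A common workaround in the meantime is to hoist the constant into a static, 16-byte-aligned array and load it directly, which reliably compiles to a single MOVAPS; a hedged sketch (kFactors is an illustrative name, and note that _mm_load_ps reads element 0 first, so the array is the reverse of the _mm_set_ps argument order):

        #include <xmmintrin.h>

        // Same values as _mm_set_ps(1.5f, 2.5f, 3.1f, 4.0f).
        __declspec(align(16)) static const float kFactors[4] =
            { 4.0f, 3.1f, 2.5f, 1.5f };

        __m128 scale(__m128 x)
        {
            return _mm_mul_ps(x, _mm_load_ps(kFactors));  // MOVAPS + MULPS
        }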

  • I am thinking of operations that have been in the IA32 architecture from its very inception: add/subtract/shift with carry; rotates of all native integral types (byte/short/long, and 64-bit on a 64-bit arch); efficient multiplies (char*char to short, short*short to long, long*long to __int64) and the corresponding reverse divides; and __int64 as an x87 floating point type (the speed advantage or disadvantage depends on the CPU and on whether driver code is being built).

    For some of the above operations it would be even better if these were not intrinsics but automatic peephole optimizations, so a programmer could simply write things like:

    long x,y,w;
    ...
    x = (x << y) + (x >> (32 - y));
    __int64 z = w * (__int64)x;

    and the compiler would recognize this as a 32 bit left rotate without carry and a 32x32 to 64 IMUL instruction.

    Another common case to peephole optimize is extracting the bytes/words/dwords from a larger int type using portable shift and bitmask operations, like the HIWORD() and LOWORD() macros in windows.h do.
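
    For the rotate, at least, the compiler does already expose _rotl and _rotr intrinsics (declared in <stdlib.h>) that map to single ROL/ROR instructions, and the cast-before-multiply idiom above is the usual way to request the widening IMUL; a small sketch:

        #include <stdlib.h>

        unsigned long rotate_left(unsigned long x, int y)
        {
            return _rotl(x, y);     // a single ROL via the intrinsic
        }

        __int64 widening_mul(long w, long x)
        {
            return (__int64)w * x;  // 32x32->64 IMUL when recognized
        }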

  • I have some questions about business cases and inline assembly.

    Does Microsoft want people to write fast codecs and game engines? If so, do you expect VC++ to be part of the tool set?

    Without automatic vectorization and increased support for inline assembly, I believe one day we may have to use Intel's or AMD's compilers, and VC++ will no longer be a viable option for optimization work.

    You can argue that this is a small segment of the overall dev picture, but the number of users of codecs and game engines is very large.

  • Inline assembly is a must in some cases.

    Please consider bringing the feature back that  you took away that was so valuable.

    In the meantime, I may be looking at Intel's compiler for inline asm support.

  • I tried to use intrinsics in VS2005 and here are the results (time taken):

    code in c++         100% (compiled without sse)
    hand optimized asm   80% (with sse)
    intrinsics          160%

    So with the Microsoft compiler it is a complete waste of time...

  • >Inline asm is not supported by Visual C++ on 64-bit machines.

    Just because someone was too lazy to support it. It is certainly doable.

    >The intrinsics are updated in the compiler.

    This can (especially in Microsoft's case!) mean that our code's speed can get worse over time for no apparent reason.

    >The optimizer does not work well with inline asm code

    Why not fix the damn optimizer then, instead of making us use stupid intrinsics?

    >Intrinsics are also more portable over code that uses inline asm.

    Yeah, right... what to do if some of them are supported in 64-bit mode (__inbyte, __readmsr) but not in 32-bit? We still have to resort to ugly #ifdef's in that case. You call that portable?!?

    As others have said, you have lost your edge on development tools. Soon people will use only the IDE part (or perhaps not even that) of your toolchain.

    NEW COMPILER HAS TO TAKE ADVANTAGE OF NEW HARDWARE FEATURES. IT IS ALMOST 2008 FOR GOD SAKE, WAKE UP!!!

    Or alternatively, name your product adequately Visual Studio 1998 because that's where your compiler has been stuck performance wise.

    Nice to hear that it optimizes for Barcelona by default, meaning you and AMD will go down together.

  • Are all the previous (VS2005) intrinsics still supported?  Are they supported for 64-bit targets?

    http://msdn2.microsoft.com/en-us/library/hd9bdb82(VS.80).aspx
