The SSE3 instruction set adds about a dozen instructions (intrinsics are in the pmmintrin.h header). The main operation these instructions provide is the ability to do “horizontal” adds and subtracts (ARM-NEON refers to these as ‘pairwise’ operations) for float4 and double2 data.

 Result = _mm_hadd_ps(V1,V2);
->
Result[0] = V1[0] + V1[1];
Result[1] = V1[2] + V1[3];
Result[2] = V2[0] + V2[1];
Result[3] = V2[2] + V2[3];

There is also a subtracting variant, _mm_hsub_ps, which computes V1[0] - V1[1], V1[2] - V1[3], and so on, plus double2 versions of both (_mm_hadd_pd and _mm_hsub_pd); otherwise they work the same way.
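To make the lane layout concrete, here is a small sketch that runs the horizontal add and its subtracting counterpart on plain float arrays. The HAdd and HSub wrappers and the SSE3_TARGET macro are illustrative names, not part of DirectXMath. The macro is only there on the assumption that a GCC or Clang build is not already compiled with -msse3; MSVC exposes the intrinsics without a switch.

```cpp
#include <cassert>
#include <pmmintrin.h>  // SSE3 intrinsics

// Assumption: GCC/Clang need SSE3 enabled for the functions that use it
// (this attribute, or -msse3 for the whole file). MSVC needs nothing.
#if defined(__GNUC__) || defined(__clang__)
#define SSE3_TARGET __attribute__((target("sse3")))
#else
#define SSE3_TARGET
#endif

// out = { a[0]+a[1], a[2]+a[3], b[0]+b[1], b[2]+b[3] }
SSE3_TARGET inline void HAdd(const float* a, const float* b, float* out)
{
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_hadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// out = { a[0]-a[1], a[2]-a[3], b[0]-b[1], b[2]-b[3] }
SSE3_TARGET inline void HSub(const float* a, const float* b, float* out)
{
    assert(a && b && out);
    _mm_storeu_ps(out, _mm_hsub_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, HAdd produces {3, 7, 30, 70} and HSub produces {-1, -1, -10, -10}.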

The majority of the DirectXMath library is designed around avoiding the need for these operations, but they are useful for dot-product operations (VMX128 on the Xbox 360 had a specific instruction for doing dot-products across a vector, but not a general pairwise add).

The existing SSE/SSE2 dot-product for float4:

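inline XMVECTOR XMVector4Dot(FXMVECTOR V1, FXMVECTOR V2)
{
    XMVECTOR vTemp2 = V2;
    XMVECTOR vTemp = _mm_mul_ps(V1,vTemp2);                     // Per-component products X, Y, Z, W
    vTemp2 = _mm_shuffle_ps(vTemp2,vTemp,_MM_SHUFFLE(1,0,0,0)); // Copy X to the Z position and Y to the W position
    vTemp2 = _mm_add_ps(vTemp2,vTemp);                          // Z = X+Z; W = Y+W
    vTemp = _mm_shuffle_ps(vTemp,vTemp2,_MM_SHUFFLE(0,3,0,0));  // Copy W to the Z position
    vTemp = _mm_add_ps(vTemp,vTemp2);                           // Add Z and W together
    return XM_PERMUTE_PS(vTemp,_MM_SHUFFLE(2,2,2,2));           // Splat Z and return
}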

can be rewritten using SSE3 as:

inline XMVECTOR XMVector4Dot(FXMVECTOR V1, FXMVECTOR V2)
{
    XMVECTOR vTemp = _mm_mul_ps(V1,V2);     // Per-component products X, Y, Z, W
    vTemp = _mm_hadd_ps( vTemp, vTemp );    // X+Y, Z+W, X+Y, Z+W
    return _mm_hadd_ps( vTemp, vTemp );     // X+Y+Z+W in all four components
}

This version has the same number of multiply/add operations, but there are three fewer shuffles required. As we’ll see in a future installment, there are actually some better options than this in SSE 4.1.
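For experimenting outside the library, the same two-step reduction can be sketched on raw __m128 values. Dot4_SSE3, Dot4, and SSE3_TARGET are hypothetical names used only for this illustration. The macro assumes a GCC or Clang build that is not already compiled with -msse3.

```cpp
#include <cassert>
#include <pmmintrin.h>  // SSE3 intrinsics

// Assumption: GCC/Clang need SSE3 enabled for the functions that use it
// (this attribute, or -msse3 for the whole file). MSVC needs nothing.
#if defined(__GNUC__) || defined(__clang__)
#define SSE3_TARGET __attribute__((target("sse3")))
#else
#define SSE3_TARGET
#endif

// Returns the 4D dot product replicated into all four lanes.
SSE3_TARGET inline __m128 Dot4_SSE3(__m128 v1, __m128 v2)
{
    __m128 t = _mm_mul_ps(v1, v2);  // { x, y, z, w } products
    t = _mm_hadd_ps(t, t);          // { x+y, z+w, x+y, z+w }
    return _mm_hadd_ps(t, t);       // x+y+z+w in every lane
}

// Convenience wrapper for unaligned float[4] inputs; writes all four lanes.
SSE3_TARGET inline void Dot4(const float* a, const float* b, float* out)
{
    assert(a && b && out);
    _mm_storeu_ps(out, Dot4_SSE3(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Note that the sum is formed as (x+y) + (z+w), so for general floating-point inputs the last bits may differ from a strictly left-to-right scalar sum.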

There are also two new instructions which can be used as a special-case substitute for the XMVectorSwizzle<> template. We’ll make use of these in a future installment.

XMVectorSwizzle<0,0,2,2>(V)  ->  _mm_moveldup_ps(V)
XMVectorSwizzle<1,1,3,3>(V)  ->  _mm_movehdup_ps(V)
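As a rough check of that mapping, the sketch below compares the two SSE3 instructions against the general SSE shuffle that produces the same lanes. The function names and the SSE3_TARGET macro are made up for this example, and the macro assumes a GCC or Clang build without -msse3.

```cpp
#include <cassert>
#include <pmmintrin.h>  // SSE3 intrinsics (also pulls in the SSE/SSE2 headers)

// Assumption: GCC/Clang need SSE3 enabled for the functions that use it
// (this attribute, or -msse3 for the whole file). MSVC needs nothing.
#if defined(__GNUC__) || defined(__clang__)
#define SSE3_TARGET __attribute__((target("sse3")))
#else
#define SSE3_TARGET
#endif

// SSE3: out = { v[0], v[0], v[2], v[2] }
SSE3_TARGET inline void DupEven(const float* v, float* out)
{
    assert(v && out);
    _mm_storeu_ps(out, _mm_moveldup_ps(_mm_loadu_ps(v)));
}

// SSE3: out = { v[1], v[1], v[3], v[3] }
SSE3_TARGET inline void DupOdd(const float* v, float* out)
{
    assert(v && out);
    _mm_storeu_ps(out, _mm_movehdup_ps(_mm_loadu_ps(v)));
}

// SSE only: the same <0,0,2,2> swizzle through a general shuffle.
inline void DupEvenShuffle(const float* v, float* out)
{
    assert(v && out);
    const __m128 x = _mm_loadu_ps(v);
    _mm_storeu_ps(out, _mm_shuffle_ps(x, x, _MM_SHUFFLE(2, 2, 0, 0)));
}

// SSE only: the same <1,1,3,3> swizzle through a general shuffle.
inline void DupOddShuffle(const float* v, float* out)
{
    assert(v && out);
    const __m128 x = _mm_loadu_ps(v);
    _mm_storeu_ps(out, _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 1, 1)));
}
```

For v = {1, 2, 3, 4}, both DupEven and DupEvenShuffle give {1, 1, 3, 3}, and both DupOdd and DupOddShuffle give {2, 2, 4, 4}.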

The Supplemental SSE3 (SSSE3) instruction set adds the equivalent “horizontal” adds and subtracts for various integer vectors, so they are not particularly useful for DirectXMath. These intrinsics are located in the tmmintrin.h header. There are also some other useful integer operations that make life simpler for implementing algorithms like Fast Block Compress, codecs, or other image processing on integer data which are a bit out of scope for DirectXMath.

There is one SSSE3 intrinsic of interest for DirectXMath: _mm_shuffle_epi8. The purpose of this instruction is to be able to rearrange the bytes in a vector, which makes it an excellent function for doing vector-based Big-Endian/Little-Endian swaps without having to ‘spill’ the vector to memory and reload it.

inline XMVECTOR XMVectorEndian( FXMVECTOR V )
{
    // Each 32-bit index selects the source bytes of one component in reverse order
    static const XMVECTORU32 idx = { 0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F };
    __m128i Result = _mm_shuffle_epi8( _mm_castps_si128(V), idx );
    return _mm_castsi128_ps( Result );
}

There’s not enough use for this kind of operation to make this function part of the library (Windows x86, Windows x64, and Windows RT are all Little-Endian platforms), but it can be useful for some cross-platform tools processing (Xbox 360 is Big-Endian).
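For a tool that does not link against DirectXMath, the same byte shuffle can be sketched on a plain array of four 32-bit values. ByteSwap32x4 and SSSE3_TARGET are hypothetical names for this example. The macro assumes a GCC or Clang build that is not already compiled with -mssse3, and the index vector is built with _mm_set_epi8 rather than XMVECTORU32.

```cpp
#include <cassert>
#include <cstdint>
#include <tmmintrin.h>  // SSSE3 intrinsics

// Assumption: GCC/Clang need SSSE3 enabled for the functions that use it
// (this attribute, or -mssse3 for the whole file). MSVC needs nothing.
#if defined(__GNUC__) || defined(__clang__)
#define SSSE3_TARGET __attribute__((target("ssse3")))
#else
#define SSSE3_TARGET
#endif

// Reverses the byte order inside each of four 32-bit values.
SSSE3_TARGET inline void ByteSwap32x4(const uint32_t* src, uint32_t* dst)
{
    assert(src && dst);
    // _mm_set_epi8 takes its arguments from byte 15 down to byte 0, so the
    // last four arguments (0,1,2,3) say: output bytes 3..0 come from source
    // bytes 0..3, which reverses the first 32-bit value. Same for the rest.
    const __m128i idx = _mm_set_epi8(12, 13, 14, 15,
                                      8,  9, 10, 11,
                                      4,  5,  6,  7,
                                      0,  1,  2,  3);
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    v = _mm_shuffle_epi8(v, idx);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
}
```

For example, an input of 0x11223344 comes back as 0x44332211, and applying the function twice returns the original values.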

Processor Support

SSE3 is supported by Intel Pentium 4 processors (“Prescott”), AMD Athlon 64 (“revision E”), AMD Phenom, and later processors. This means most, but not quite all, x64 capable CPUs should support SSE3.

Supplemental SSE3 (SSSE3) is supported by Intel Core 2 Duo, Intel Core i7/i5/i3, Intel Atom, AMD Bulldozer, and later processors.

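int CPUInfo[4] = {-1};
__cpuid( CPUInfo, 0 );                   // Leaf 0: CPUInfo[0] is the highest supported standard leaf
bool bSSE3 = false;
bool bSSSE3 = false;
if ( CPUInfo[0] > 0 )
{
    __cpuid(CPUInfo, 1 );                // Leaf 1: feature flags, ECX is in CPUInfo[2]
    bSSE3 = (CPUInfo[2] & 0x1) != 0;     // ECX bit 0
    bSSSE3 = (CPUInfo[2] & 0x200) != 0;  // ECX bit 9
}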

You can also use the IsProcessorFeaturePresent Win32 API with PF_SSE3_INSTRUCTIONS_AVAILABLE on Windows Vista or later to detect SSE3 support. This API does not report support for SSSE3.
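The two detection routes can be combined in one helper. The sketch below is one possible arrangement, not library code: CpuFeatures and DetectSse3Features are hypothetical names. It assumes MSVC's __cpuid from intrin.h, and on other compilers the __get_cpuid helper from the cpuid.h header that GCC and Clang provide. On Windows it also cross-checks SSE3 against IsProcessorFeaturePresent, which is a conservative choice rather than a requirement.

```cpp
#if defined(_WIN32)
#include <windows.h>   // IsProcessorFeaturePresent
#endif
#if defined(_MSC_VER)
#include <intrin.h>    // __cpuid
#else
#include <cpuid.h>     // __get_cpuid (GCC/Clang)
#endif

struct CpuFeatures
{
    bool sse3;
    bool ssse3;
};

inline CpuFeatures DetectSse3Features()
{
    CpuFeatures f = { false, false };

#if defined(_MSC_VER)
    int info[4] = { 0, 0, 0, 0 };
    __cpuid(info, 0);                    // highest supported standard leaf
    if (info[0] >= 1)
    {
        __cpuid(info, 1);                // feature flags; ECX is info[2]
        f.sse3  = (info[2] & 0x1) != 0;   // ECX bit 0
        f.ssse3 = (info[2] & 0x200) != 0; // ECX bit 9
    }
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))  // returns 0 if leaf 1 is unsupported
    {
        f.sse3  = (ecx & 0x1u) != 0;     // ECX bit 0
        f.ssse3 = (ecx & 0x200u) != 0;   // ECX bit 9
    }
#endif

#if defined(_WIN32)
    // Optional cross-check with the OS (Windows Vista or later).
    // There is no equivalent flag for SSSE3, so that stays CPUID-only.
    f.sse3 = f.sse3 &&
        (IsProcessorFeaturePresent(PF_SSE3_INSTRUCTIONS_AVAILABLE) != FALSE);
#endif

    return f;
}
```

A caller would typically run this once at startup and pick the SSE3 or SSE2 code path based on the result.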

Utility Code

The source code attached to this blog post is bound to the Microsoft Public License (MS-PL).

See Also: SSE, SSE2, and ARM-NEON; SSE4.1 and SSE4.2; AVX; F16C and FMA