The JIT finally proposed. JIT and SIMD are getting married.

Processor speed no longer follows Moore’s law. So in order to optimize the performance of your applications, it’s increasingly important to embrace parallelization. Or, as Herb Sutter phrased it, the free lunch is over.

You may think that task-based programming or offloading work to threads is already the answer. While multi-threading is certainly a critical part, it’s still important to optimize the code that runs on each core. SIMD is a technology that employs data parallelization at the CPU level. Multi-threading and SIMD complement each other: multi-threading allows parallelizing work over multiple cores while SIMD allows parallelizing work within a single core.

Today we’re happy to announce a new preview of RyuJIT that provides SIMD functionality. The SIMD APIs are exposed via a new NuGet package, Microsoft.Bcl.Simd, which is also released as a preview.

Here is an example of how you would use it:

// Initialize some vectors

Vector<float> values = GetValues();
Vector<float> increment = GetIncrement();

// The next line will leverage SIMD to perform the
// addition of multiple elements in parallel:

Vector<float> result = values + increment;

What’s SIMD and why do I care?

SIMD is by far the most popular code gen request and still a fairly popular request overall (~2,000 votes on UserVoice).

It’s so popular because, for certain kinds of apps, SIMD offers a profound speed-up. For example, rendering the Mandelbrot set gets faster by a factor of 2-3 on SSE2-capable hardware and by a factor of 4-5 on AVX-capable hardware.

Introduction to SIMD

SIMD stands for “single instruction, multiple data”. It’s a set of processor instructions that operate over vectors instead of scalars. This allows mathematical operations to execute over a set of values in parallel.

At a high level, SIMD enables data parallelization at the CPU level. For example, imagine you need to increment a set of floating point numbers by a given value. Normally, you’d write a for loop to perform this operation sequentially:

float[] values = GetValues();
float increment = GetIncrement();

// Perform increment operation as manual loop:
for (int i = 0; i < values.Length; i++)
{
    values[i] += increment;
}

SIMD allows adding multiple values simultaneously by using CPU-specific instructions. This is often exposed as a vector operation:

Vector<float> values = GetValues();
Vector<float> increment = GetIncrement();

// Perform addition as a vector operation:
Vector<float> result = values + increment;
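
The vector snippet above leaves out how an array actually gets into vectors. Here is a minimal sketch of the same increment operation over a whole array. It assumes the System.Numerics.Vector<T> type as it later shipped, where the static element count is called Count (this preview calls it Length); the class and method names are made up for illustration.

```csharp
using System;
using System.Numerics;

public static class IncrementExample
{
    // Adds `increment` to every element of `values` in place.
    public static void AddInPlace(float[] values, float increment)
    {
        // Number of floats that fit into one vector on this machine.
        int width = Vector<float>.Count;

        // Broadcast the scalar into every lane of a vector.
        var incrementVector = new Vector<float>(increment);

        int i = 0;

        // Process the array in chunks of `width` elements.
        for (; i <= values.Length - width; i += width)
        {
            var chunk = new Vector<float>(values, i);
            (chunk + incrementVector).CopyTo(values, i);
        }

        // Handle the leftover elements that do not fill a whole vector.
        for (; i < values.Length; i++)
        {
            values[i] += increment;
        }
    }
}
```

The scalar loop at the end matters: the array length is rarely an exact multiple of the vector width, and the width itself differs between machines.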

It’s interesting to note that there isn’t a single SIMD specification. Rather, each processor has a specific implementation of SIMD. They differ by the number of elements that can be operated on as well as by the set of available operations. The most commonly available implementation of SIMD on Intel/AMD hardware is SSE2.

Here is a simplified model of how SIMD is exposed at the CPU level:

  1. There are SIMD-specific CPU registers. They have a fixed size. For SSE2, the size is 128 bits.

  2. The processor has SIMD-specific instructions, specific to the operand size. As far as the processor is concerned, a SIMD value is just a bunch of bits. However, a developer wants to treat those bits as a vector of, say, 32-bit integer values. For this purpose, the processor has instructions that are specific to the operation, e.g. addition, and the operand type, e.g. 32-bit integers.

An area where SIMD operations are very useful is graphics and gaming as:

  • These apps are very computation-intensive.
  • Most of the data structures are already represented as vectors.

However, SIMD is applicable to any application type that performs numerical operations on a large set of values; this also includes scientific computing and finance.

Designing SIMD for .NET

Most .NET developers don’t have to write CPU-specific code. Instead, the CLR abstracts the hardware by providing a virtual machine that translates your code into machine instructions, either at runtime (JIT) or at install time (NGEN). By leaving the code generation to the CLR, you can share the same MSIL between machines with different processors without having to give up on CPU-specific optimizations.

This separation is what enables a library ecosystem because it tremendously simplifies code sharing. We believe the library ecosystem is a key part of why .NET is such a productive environment.

In order to keep this separation, we needed to come up with a programming model for SIMD that allows you to express vector operations without tying you to a specific processor implementation, such as SSE2. We came up with a model that provides two categories of vector types:

  • Vectors with a fixed size, such as Vector3f
  • Vectors with a hardware-dependent size, Vector<T>

Both categories of types are what we call JIT intrinsics. That means the JIT knows about these types and treats them specially when emitting machine code. However, all types are also designed to work perfectly in cases where the hardware doesn’t support SIMD (unlikely today) or the application doesn’t use this new version of RyuJIT.

Our goal is to ensure the performance in those cases is roughly identical to sequentially written code. Unfortunately, in this preview we aren’t there yet.

Vectors with a fixed size

Let’s talk about the fixed-size vectors first. There are many apps that already define their own vector types, especially graphics-intensive applications, such as games or ray tracers. In most cases these apps use single-precision floating point values.

The key aspect is that those vectors have a specific number of elements, usually two, three or four. The two-element vectors are often used to represent points or similar entities, such as complex numbers. Vectors with three and four elements are typically used for 3D (the 4th element is used to make the math work). The bottom line is that those domains require vectors with a specific number of elements.

To get a sense of what these types look like, here is the simplified shape of Vector3f:

public struct Vector3f
{
    public Vector3f(float value);
    public Vector3f(float x, float y, float z);
    public float X { get; }
    public float Y { get; }
    public float Z { get; }
    public static bool operator ==(Vector3f left, Vector3f right);
    public static bool operator !=(Vector3f left, Vector3f right);
    // With SIMD, these element-wise operations are done in parallel:
    public static Vector3f operator +(Vector3f left, Vector3f right);
    public static Vector3f operator -(Vector3f left, Vector3f right);
    public static Vector3f operator -(Vector3f value);
    public static Vector3f operator *(Vector3f left, Vector3f right);
    public static Vector3f operator *(Vector3f left, float right);
    public static Vector3f operator *(float left, Vector3f right);
    public static Vector3f operator /(Vector3f left, Vector3f right);
}

I’d like to highlight the following aspects:

  • We’ve designed the fixed-size vectors so that they can easily replace the ones defined in apps.
  • For performance reasons, we’ve defined those types as immutable value types.
  • The idea is that after replacing your vector with our vector, your app behaves the same, except that it runs faster. For more details, have a look at our Ray Tracer sample application.
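
To illustrate the drop-in idea, here is a small sketch. MyVector3 is a hypothetical app-defined type of the kind described above, and System.Numerics.Vector3 (the fixed-size type as it later shipped) stands in for the preview's Vector3f. The method names are invented for this example.

```csharp
using System;
using System.Numerics;

// Hypothetical hand-written vector, as many apps define today.
public struct MyVector3
{
    public readonly float X, Y, Z;

    public MyVector3(float x, float y, float z) { X = x; Y = y; Z = z; }

    public static MyVector3 operator +(MyVector3 l, MyVector3 r)
        => new MyVector3(l.X + r.X, l.Y + r.Y, l.Z + r.Z);

    public static MyVector3 operator *(MyVector3 l, float r)
        => new MyVector3(l.X * r, l.Y * r, l.Z * r);
}

public static class FixedSizeExample
{
    // One integration step with the hand-written type:
    // three scalar multiplies and three scalar adds.
    public static MyVector3 StepScalar(MyVector3 position, MyVector3 velocity, float dt)
        => position + velocity * dt;

    // The same expression with the library type. The source code is identical;
    // a SIMD-aware JIT can perform the element-wise operations in parallel.
    public static Vector3 StepSimd(Vector3 position, Vector3 velocity, float dt)
        => position + velocity * dt;
}
```

Because both versions use the same operators, switching over is mostly a matter of changing the type name.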

Vectors with a hardware-dependent size

While the fixed-size vector types are convenient to use, their maximum degree of parallelization is limited by the number of components. For example, an application that uses Vector2f can get a speed-up of at most a factor of two, even if the hardware would be capable of performing operations on eight elements at a time.

In order for an application to scale with the hardware capabilities, the developer has to vectorize the algorithm. Vectorizing an algorithm means that the developer needs to break the input into a set of vectors whose size is hardware-dependent. On a machine with SSE2, this means the app could operate over vectors of four 32-bit floating point values. On a machine with AVX, the same app could operate over vectors with eight 32-bit floating point values.

To get a sense for the differences, here is a simplified version of the shape of Vector<T>:

public struct Vector<T> where T : struct {
    public Vector(T value);
    public Vector(T[] values);
    public Vector(T[] values, int index);
    public static int Length { get; }
    public T this[int index] { get; }
    public static bool operator ==(Vector<T> left, Vector<T> right);
    public static bool operator !=(Vector<T> left, Vector<T> right);
    // With SIMD, these element-wise operations are done in parallel:
    public static Vector<T> operator +(Vector<T> left, Vector<T> right);
    public static Vector<T> operator &(Vector<T> left, Vector<T> right);
    public static Vector<T> operator |(Vector<T> left, Vector<T> right);
    public static Vector<T> operator /(Vector<T> left, Vector<T> right);
    public static Vector<T> operator ^(Vector<T> left, Vector<T> right);
    public static Vector<T> operator *(Vector<T> left, Vector<T> right);
    public static Vector<T> operator *(Vector<T> left, T right);
    public static Vector<T> operator *(T left, Vector<T> right);
    public static Vector<T> operator ~(Vector<T> value);
    public static Vector<T> operator -(Vector<T> left, Vector<T> right);
    public static Vector<T> operator -(Vector<T> value);
}

Key aspects of this type include the following:

  • It’s generic. To increase flexibility and avoid combinatorial explosion of types, we’ve defined the hardware-dependent vector as a generic type, Vector<T>. For practical reasons, T can only be a primitive numeric type. In this preview, we only support int, long, float and double. The final version will also include support for all other integral numeric types, including their unsigned counterparts.

  • The length is static. Since the length is hardware-dependent but fixed, the length is exposed via a static Length property. The value is defined as sizeof(SIMD-register) / sizeof(T). In other words, two vectors Vector<T1> and Vector<T2> have the same length if T1 and T2 have the same size. This allows us to correlate elements in vectors of different element types, which is a very useful property in vectorized code.
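
The length rule can be checked directly. This sketch assumes the shipped System.Numerics.Vector<T>, where the property is named Count instead of Length and byte is a supported element type (it is not in this preview). The helper names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class VectorLengthExample
{
    // A vector of bytes has one element per byte, so its element
    // count equals the vector size in bytes.
    public static int VectorSizeInBytes() => Vector<byte>.Count;

    // Element types of the same size yield vectors of the same length.
    public static bool SameSizeTypesHaveSameLength() =>
        Vector<float>.Count == Vector<int>.Count &&
        Vector<double>.Count == Vector<long>.Count;

    // Length == sizeof(vector) / sizeof(T).
    public static bool LengthMatchesFormula() =>
        Vector<float>.Count == VectorSizeInBytes() / sizeof(float) &&
        Vector<double>.Count == VectorSizeInBytes() / sizeof(double);
}
```

With 128-bit SSE2 registers this works out to four floats or two doubles per vector; with 256-bit registers, eight floats or four doubles.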

Vectorization is a complicated topic and, as such, it’s well beyond the scope of this blog post. Nonetheless, let me give you a high-level overview of what this would mean for a specific app. Let’s look at a Mandelbrot renderer. Conceptually, Mandelbrot works over complex numbers which can be represented as vectors with two elements. Based on a mathematical algorithm, these complex numbers are color-coded and rendered as a single point in the resulting picture.

In the naïve usage of SIMD, one would vectorize the algorithm by representing the complex numbers as a Vector2f. A more sophisticated algorithm would vectorize over the points to render (which is unbounded) instead of the dimension (which is fixed). One way to do it is to represent the real and imaginary components as vectors. In other words, one would vectorize the same component over multiple points.

For more details, have a look at our Mandelbrot sample. In particular, compare the scalar version to the vectorized version.
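
To make the idea of vectorizing over points concrete, here is a rough sketch. It is not the code from the sample. It assumes the shipped System.Numerics API (Vector.LessThanOrEqual, Vector.ConditionalSelect and Vector.EqualsAll, which may not exist in this preview), and the class and method names are hypothetical. Each lane of the vectors holds a different point; the real parts live in one vector and the imaginary parts in another.

```csharp
using System;
using System.Numerics;

public static class MandelbrotExample
{
    // Scalar version: number of iterations before a single point c = (cr, ci) escapes.
    public static int IterationsScalar(float cr, float ci, int maxIterations)
    {
        float zr = 0f, zi = 0f;
        int n = 0;
        while (n < maxIterations && zr * zr + zi * zi <= 4f)
        {
            float nextZr = zr * zr - zi * zi + cr;
            zi = (zr * zi) * 2f + ci;
            zr = nextZr;
            n++;
        }
        return n;
    }

    // Vectorized version: the same computation for Vector<float>.Count points at once.
    public static Vector<int> IterationsVector(Vector<float> cr, Vector<float> ci, int maxIterations)
    {
        var zr = Vector<float>.Zero;
        var zi = Vector<float>.Zero;
        var count = Vector<int>.Zero;
        var four = new Vector<float>(4f);

        for (int n = 0; n < maxIterations; n++)
        {
            // Lanes that have not escaped yet have all bits set in the mask.
            Vector<int> active = Vector.LessThanOrEqual(zr * zr + zi * zi, four);
            if (Vector.EqualsAll(active, Vector<int>.Zero)) break;

            Vector<float> nextZr = zr * zr - zi * zi + cr;
            Vector<float> nextZi = (zr * zi) * 2f + ci;

            // Advance only the active lanes; escaped lanes keep their last value.
            zr = Vector.ConditionalSelect(active, nextZr, zr);
            zi = Vector.ConditionalSelect(active, nextZi, zi);

            // Add 1 to the iteration count of each active lane.
            count += active & Vector<int>.One;
        }
        return count;
    }
}
```

The mask is a Vector<int> while the data is a Vector<float>. This only lines up lane by lane because int and float have the same size, which is the length property described above at work.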

Using the SIMD preview

In this preview we provide the following two pieces:

  1. A new release of RyuJIT that provides SIMD support
  2. A new NuGet library that exposes the SIMD support

The NuGet library was explicitly designed to work without SIMD support provided by the hardware/JIT. In that case, all methods and operations are implemented as pure IL. However, you obviously only get the best performance when using this library in conjunction with the new release of RyuJIT.
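
Since the same code runs with or without SIMD support, it can be handy to check at runtime which path you are on. The sketch below assumes the Vector.IsHardwareAccelerated property from the shipped System.Numerics API, which may not be part of this preview; the class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class FallbackExample
{
    // Reports whether vector operations are backed by SIMD instructions.
    public static string DescribeSupport() =>
        Vector.IsHardwareAccelerated
            ? $"SIMD accelerated: {Vector<float>.Count} floats per vector"
            : "No SIMD acceleration: vector operations run as ordinary code";

    // Sums an array. The code is the same on both paths; only the speed differs.
    public static float Sum(float[] values)
    {
        int width = Vector<float>.Count;
        var accumulator = Vector<float>.Zero;

        int i = 0;
        for (; i <= values.Length - width; i += width)
        {
            accumulator += new Vector<float>(values, i);
        }

        // Add up the lanes of the accumulator via a dot product with ones.
        float total = Vector.Dot(accumulator, Vector<float>.One);

        // Add the leftover elements that did not fill a whole vector.
        for (; i < values.Length; i++)
        {
            total += values[i];
        }
        return total;
    }
}
```

Note that a vectorized sum adds the values in a different order than a plain loop, so floating point results can differ slightly in the last digits for general inputs.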

In order to use SIMD, you need to perform the following steps:

  1. Download and install the latest preview of RyuJIT from http://aka.ms/RyuJIT

  2. Set some environment variables to enable the new JIT and SIMD for your process. The easiest way to do this is by creating a batch file that starts your application:

    @echo off
    set COMPLUS_AltJit=*
    set COMPLUS_FeatureSIMD=1
    start myapp.exe

  3. Add a reference to the Microsoft.Bcl.Simd NuGet package. You can do this by right-clicking your project and selecting Manage NuGet References. In the following dialog make sure you select the tab named Online. You also need to select Include Prerelease in the drop-down at the top. Then use the textbox in the top right corner to search for Microsoft.Bcl.Simd. Click Install.

Since this is a preview, there are certain limitations you may want to be aware of:

  • SIMD is only enabled for 64-bit processes. So make sure your app either is targeting x64 directly or is compiled as Any CPU and not marked as 32-bit preferred.

  • The Vector type only supports int, long, float and double. Instantiating Vector<T> with any other type will cause a type load exception.

  • Currently, SIMD only takes advantage of SSE2 hardware. Due to some implementation restrictions, the RyuJIT CTP cannot automatically switch the size of the type based on local hardware capability. Full AVX support should arrive with a release of the .NET runtime that includes RyuJIT. We have prototyped this, but it just doesn’t work with our CTP deployment model.

Summary

We’ve released a preview that brings the power of SIMD to the world of managed code. The programming model is exposed via a set of vector types, made available via the new Microsoft.Bcl.Simd NuGet package. The processor support for the operations is provided with the new preview of RyuJIT.

We’d love to get your feedback on both pieces. How do you like the programming model? What’s missing? What does the performance look like for your app? Please use the comments for providing feedback or send us mail at ryujit(at)microsoft.com.

Comments
  • What about ARM NEON support?  Will this support it?

  • Presumably the from-array constructor does a by-value copy? Is there a way to perform SIMD operations in-place on a float []?  I'm thinking about legacy systems where the API signature is to accept an array.

  • Wonderful, it is a big jump in the right direction....

    There is one more feature not mentioned above. This package may become a base used by all packages that use vectors, which makes sharing code a lot easier.

    But what about other more complicated vector manipulation elements like matrices...?

    SIMD powered matrix multiplication will be a big boost for games and 3D applications...

  • Awesome stuff. One little request, can we also have float overloads added to the Math class with this?

  • It would be great if I could use the byte and short types, because the operations I need are executed on an 8-bit array that is unpacked (also missing) to a 16-bit array.

    Therefore:

     support for byte, short,

     support for SIMD instructions (e.g. unpack...)

  • Awesome! I have been waiting for this a long time. However, I am somewhat concerned with the Vector<T> type and the initialization of this, which requires a managed array as input if multiple values are used.

    We are doing high-performance image processing and usually have data as unsafe memory. Having to allocate an array every time (or outside the loop) a Vector<T> is to be used seems incredibly wasteful, not to mention slow (it requires an extra copy, etc.). I would love to see an unsafe ctor such as:

      Vector(T* data)

    I know this is not possible in the current .NET version due to generic pointer, but getting pointer support for generic value types would greatly simplify a lot of our other code, thus yielding a lot less and a lot faster code if both these issues would be addressed.

    Otherwise, it would be great if other patterns not requiring a managed array for using variable length vector operations could be added.

    Thanks.

  • I guess the public Vector(T[] values) constructor should use the params keyword for values.

  • Never mind, bad idea.

  • Will the new JIT auto-vectorize loops like the VC++ compiler?

    Is this planned for future versions?

  • Finally; THANK YOU!

    This is huge. For some, it's even bigger than New Year!

    Would be great to see SIMD (SSE2 & AVX) native support in v4.5.1+...

  • What about static methods for Add/Subtract/Multiply/Divide that take ref parameters? E.g. Add(ref Vector3f left, ref Vector3f right, out Vector3f). Usually you see that in addition to the operators (a la XNA or SharpDX) to avoid copying parameters.

  • Good to see .NET catching up with Mono.

  • @Nicholas Woodfield, agree with you. But SIMD is optimizing it nicely. The only gripe is that if SIMD is not accessible and the code falls back to the default IL bytecode and the JIT does not inline those methods (as it did not optimize them in the past), that would indeed hurt performance without by-ref parameters.

    If SIMD in .NET is getting mainstream, available on all platforms (x86/x64/ARM, W7/W8/WP8) we will definitely consider it in SharpDX.

  • @zezba9000: This preview only supports x64 with SSE2. However, NEON is certainly on our radar.

    @Jay: Yes, the array ctors will copy by value. We don't have a direct way of operating on arrays but we have methods that allow you to operate on chunks:

       public struct Vector<T>
       {
           public Vector(T[] values);
           public Vector(T[] values, int index);
           public void CopyTo(T[] array);
           public void CopyTo(T[] array, int index);
           // ...
       }

    I'm not even sure whether operating on a raw array is possible; I think the processor requires a load into the register anyways.

    @Fadi: Glad you like it! You're spot on -- we indeed hope that the types in Microsoft.Bcl.Simd will become the exchange currency for vector types. Yes, we're also thinking of adding support for matrices as well.

    @John Katz: We'd love to but we try to avoid building features that require updates to the existing framework as this complicates the story for developers wanting to take a dependency on our feature. The NuGet package provides a System.Numerics.VectorMath type that provides methods like Abs(), Min() and SquareRoot().

    @Darko Jurić: Vector<T> will eventually support all integral types, including byte and short. And yes: we don't consider this preview API-complete. We'll add additional SIMD primitives.

    @Harry: An unsafe constructor that takes a pointer is certainly on our list. To avoid the generic pointer issue we'd probably declare it like this:

       public struct Vector<T>
       {
           public unsafe Vector(void* values);
       }

    Not sure whether this design is feasible though.

    @Azarien: The reason we don't provide params[] based version was to discourage patterns that can easily result in bugs when your program runs on hardware with a smaller/larger SIMD register. In other words, we want to encourage size-independent programming. Judging from your last comment, you've probably arrived at a similar conclusion :-)

    @Nick: Ah, good one! Auto vectorization isn't the kind of optimization we're currently planning for RyuJIT. The reason being that this kind of optimization isn't a natural fit for the JIT as the JIT has a constraint on compilation time. However, we aren't ruling it out either.

    We do, however, very much think about auto vectorization for the new static compilation pipeline, called .NET Native. We demoed a very early preview of auto vectorization support at //build.

    @Adeel: You're welcome!

    @Nicholas Woodfield: As xoofx mentioned: we don't have ref/out versions because the types are JIT intrinsics. In many cases the JIT will be able to inline the methods and simply put the values directly in SIMD registers without ever having to copy the values of the stack.

    @xoofx: We should certainly talk more :-) I'd love to see your take on what we have so far.

  • @Immo Thank you for your reply.

    I agree that methods with the new SIMD types in the arguments should not be added to the Math class. The VectorMath class is a nice addition but doesn't sort out all issues.

    My request was about adding overloads to the scalar methods in the Math class. This could be done independently of the SIMD work you guys are doing.

    The benefit would be that we could keep the precision and performance we desire in our code, which falls in line with the work around the new JIT and SIMD.

    Take for example the method Util.Magnitude in your RayTracer sample:

       public static float Magnitude(this Vector3f v)
       {
           return (float)Math.Abs(Math.Sqrt(VectorMath.DotProduct(v,v)));
       }

    In that method the result of VectorMath.DotProduct is a float that is implicitly converted to double so it can be passed to Math.Sqrt. The end result then needs to be cast back down to float.
