The best of the two worlds: How to use the raw power of C++ to improve performance of your Windows 8/8.1 HTML5/CSS3/JavaScript application - Eternal Coding - HTML5 / JavaScript / 3D development - Site Home - MSDN Blogs

The best of the two worlds: How to use the raw power of C++ to improve performance of your Windows 8/8.1 HTML5/CSS3/JavaScript application


 

The best of the two worlds: How to use the raw power of C++ to improve performance of your Windows 8/8.1 HTML5/CSS3/JavaScript application

  • Comments 1

The wonderful thing about developing apps for Windows 8/8.1 with HTML5/CSS3/JavaScript is that, thanks to WinRT projection, you can also reference and use additional libraries written using others languages like C#, VB.NET or even C++.

And when it comes to speak about performance, C++ is the golden language.

During this article, I will show you how you can reference a C++ project in order to go down to the metal. The initial idea came first when I was trying to optimize some parts of code of babylon.js. Among others things, I discovered that the multiplication of two matrices was a bit too expensive. And I dreamt of a way to use some extra power like SIMD instructions provide by modern CPU (SIMD stands for Single Instruction Multiple Data and it is a way to process multiple elements that perform the same operation on multiple data points simultaneously, like for instance a multiplication of matrices).

  1. How to represent a matrix with JavaScript?
  2. Initial version
  3. Testing performance?
  4. Let’s optimize things a bit
  5. Going big with C++
  6. Removing projection overhead
  7. Using DirectX Math
  8. Conclusion

 

You want to discuss about this article: reach me on Twitter: @deltakosh

How to represent a matrix with JavaScript?

A matrix is nothing more than an array of 16 floating values. You can then decide if data are stored using a row first or column first approach. In my case I decided to use a row first approach. It means that values are stored like this:

m[0] = first row / first column
m[1] = first row / second column
….
m[4] = second row / first column
etc…

So here is the code used:

var JSPerfTest = JSPerfTest || {};

if (!JSPerfTest.MatrixType) {
    MatrixType = (typeof Float32Array !== 'undefined') ? Float32Array : Array;
}

JSPerfTest.Matrix = function () {
    this.m = new MatrixType(16);

    for (var index = 0; index < 16; index++) {
        this.m[index] = Math.random();
    }
};

A good optimization here is to use a Float32Array to specify the type of values (this is really helpful for the JIT compiler).

For the purpose of our upcoming comparison, I filled the matrix with random values.

Initial version

The first version used a standard (someone said naïve?) way to do multiplication of matrices (like stated in Wikipedia):

JSPerfTest.Matrix.prototype.multiply = function (other, result) {
    result.m[0] = this.m[0] * other.m[0] + this.m[1] * other.m[4] + this.m[2] * other.m[8] + this.m[3] * other.m[12];
    result.m[1] = this.m[0] * other.m[1] + this.m[1] * other.m[5] + this.m[2] * other.m[9] + this.m[3] * other.m[13];
    result.m[2] = this.m[0] * other.m[2] + this.m[1] * other.m[6] + this.m[2] * other.m[10] + this.m[3] * other.m[14];
    result.m[3] = this.m[0] * other.m[3] + this.m[1] * other.m[7] + this.m[2] * other.m[11] + this.m[3] * other.m[15];

    result.m[4] = this.m[4] * other.m[0] + this.m[5] * other.m[4] + this.m[6] * other.m[8] + this.m[7] * other.m[12];
    result.m[5] = this.m[4] * other.m[1] + this.m[5] * other.m[5] + this.m[6] * other.m[9] + this.m[7] * other.m[13];
    result.m[6] = this.m[4] * other.m[2] + this.m[5] * other.m[6] + this.m[6] * other.m[10] + this.m[7] * other.m[14];
    result.m[7] = this.m[4] * other.m[3] + this.m[5] * other.m[7] + this.m[6] * other.m[11] + this.m[7] * other.m[15];

    result.m[8] = this.m[8] * other.m[0] + this.m[9] * other.m[4] + this.m[10] * other.m[8] + this.m[11] * other.m[12];
    result.m[9] = this.m[8] * other.m[1] + this.m[9] * other.m[5] + this.m[10] * other.m[9] + this.m[11] * other.m[13];
    result.m[10] = this.m[8] * other.m[2] + this.m[9] * other.m[6] + this.m[10] * other.m[10] + this.m[11] * other.m[14];
    result.m[11] = this.m[8] * other.m[3] + this.m[9] * other.m[7] + this.m[10] * other.m[11] + this.m[11] * other.m[15];

    result.m[12] = this.m[12] * other.m[0] + this.m[13] * other.m[4] + this.m[14] * other.m[8] + this.m[15] * other.m[12];
    result.m[13] = this.m[12] * other.m[1] + this.m[13] * other.m[5] + this.m[14] * other.m[9] + this.m[15] * other.m[13];
    result.m[14] = this.m[12] * other.m[2] + this.m[13] * other.m[6] + this.m[14] * other.m[10] + this.m[15] * other.m[14];
    result.m[15] = this.m[12] * other.m[3] + this.m[13] * other.m[7] + this.m[14] * other.m[11] + this.m[15] * other.m[15];
};

The multiply function takes two parameter:

  • The second matrix to multiply to
  • The result matrix to get rid off creating a new matrix on every call.

Not creating a new matrix on very call is really important because creating a matrix means allocating 16 floats and if you do that a lot of time, the pressure on the garbage collector will become huge and the overall performance can be greatly reduced. If you want more insights about this point, you can read my another article on this subject.

Testing performance

To evaluate further versions, I created a small Windows 8 application which will launch a multiplication between matrices (1 million) a certain number of times (20) and will then measure the average length of the operation.

var resultText = document.getElementById("result");
var cyclesCount = 20;
var repeatCount = 1000000;

resultText.innerHTML = "";

// JS
var m1 = new JSPerfTest.Matrix();
var m2 = new JSPerfTest.Matrix();
var result = new JSPerfTest.Matrix();

var sum = 0;
for (var index = 0; index < cyclesCount; index++) {
    var duration = doPureJSTest(m1, m2, result, repeatCount);
    resultText.innerHTML += "Operation lasted " + duration + "ms using pure Javascript<br>";
    sum += duration;
}

resultText.innerHTML += "Average duration: " + (sum / cyclesCount) + "ms using pure Javascript<br><br>";

The doPureJSTest function is just a call to multiply function:

var doPureJSTest = function (m1, m2, result, repeatCount) {
    var startDate = new Date().getTime();
    for (var index = 0; index < repeatCount; index++) {
        m1.multiply(m2, result);
    }
    var endDate = new Date().getTime();

    return (endDate - startDate);
};
On my Lenovo X1 Carbon (Core I7), I have the following result (time is in milliseconds):

image

The average duration of the multiplication of 1.000.000 matrices is about 130.25 ms.

Let’s optimize things a bit

Before trying to use C++, I wanted to try to go as far as I could with JavaScript. That is why I produced this second version with a new optimization: I reduced to the minimum the array reads. The code is a bit uglier but there is only one array read per cell:

JSPerfTest.Matrix.prototype.multiplyOneAccess = function (other, result) {
    var tm0 = this.m[0];
    var tm1 = this.m[1];
    var tm2 = this.m[2];
    var tm3 = this.m[3];
    var tm4 = this.m[4];
    var tm5 = this.m[5];
    var tm6 = this.m[6];
    var tm7 = this.m[7];
    var tm8 = this.m[8];
    var tm9 = this.m[9];
    var tm10 = this.m[10];
    var tm11 = this.m[11];
    var tm12 = this.m[12];
    var tm13 = this.m[13];
    var tm14 = this.m[14];
    var tm15 = this.m[15];

    var om0 = other.m[0];
    var om1 = other.m[1];
    var om2 = other.m[2];
    var om3 = other.m[3];
    var om4 = other.m[4];
    var om5 = other.m[5];
    var om6 = other.m[6];
    var om7 = other.m[7];
    var om8 = other.m[8];
    var om9 = other.m[9];
    var om10 = other.m[10];
    var om11 = other.m[11];
    var om12 = other.m[12];
    var om13 = other.m[13];
    var om14 = other.m[14];
    var om15 = other.m[15];

    result.m[0] = tm0 * om0 + tm1 * om4 + tm2 * om8 + tm3 * om12;
    result.m[1] = tm0 * om1 + tm1 * om5 + tm2 * om9 + tm3 * om13;
    result.m[2] = tm0 * om2 + tm1 * om6 + tm2 * om10 + tm3 * om14;
    result.m[3] = tm0 * om3 + tm1 * om7 + tm2 * om11 + tm3 * om15;

    result.m[4] = tm4 * om0 + tm5 * om4 + tm6 * om8 + tm7 * om12;
    result.m[5] = tm4 * om1 + tm5 * om5 + tm6 * om9 + tm7 * om13;
    result.m[6] = tm4 * om2 + tm5 * om6 + tm6 * om10 + tm7 * om14;
    result.m[7] = tm4 * om3 + tm5 * om7 + tm6 * om11 + tm7 * om15;

    result.m[8] = tm8 * om0 + tm9 * om4 + tm10 * om8 + tm11 * om12;
    result.m[9] = tm8 * om1 + tm9 * om5 + tm10 * om9 + tm11 * om13;
    result.m[10] = tm8 * om2 + tm9 * om6 + tm10 * om10 + tm11 * om14;
    result.m[11] = tm8 * om3 + tm9 * om7 + tm10 * om11 + tm11 * om15;

    result.m[12] = tm12 * om0 + tm13 * om4 + tm14 * om8 + tm15 * om12;
    result.m[13] = tm12 * om1 + tm13 * om5 + tm14 * om9 + tm15 * om13;
    result.m[14] = tm12 * om2 + tm13 * om6 + tm14 * om10 + tm15 * om14;
    result.m[15] = tm12 * om3 + tm13 * om7 + tm14 * om11 + tm15 * om15;
};

By doing this we are able to get the following response times:

image

The average duration of the multiplication of 1.000.000 matrices is now about 69.3ms down from 130.25ms which represents a decrease of 53 %! Not so bad for a so simple optimization.

Going big with C++

To go even farther, we will have to use a new tool: C++. Indeed, by using first grade super powerful compilers and by being closer to the metal, C++ will be able to give us even more power.

To use C++ code from our JavaScript code, I created a new project: a WinRT component:

image

This WinRT component can be used and call from our JavaScript code: This is clearly the magic of WinRT.

Here is the code I used to define the header of my component:

#pragma once

namespace CPPPerfTest
{
    public ref class Matrix sealed
    {
    public:
        Matrix();

        void Multiply(Matrix^ other, Matrix^ result);


    private:
        float m[16];
    };
}

Almost the same thing as JavaScript. The associated implementation also hides no secret:

#include "pch.h"
#include "Matrix.h"

using namespace CPPPerfTest;
using namespace Platform;

Matrix::Matrix()
{
    for (int index = 0; index < 16; index++) {
        m[index] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
    }
}

void Matrix::Multiply(Matrix^ other, Matrix^ result)
{
    result->m[0] = m[0] * other->m[0] + m[1] * other->m[4] + m[2] * other->m[8] + m[3] * other->m[12];
    result->m[1] = m[0] * other->m[1] + m[1] * other->m[5] + m[2] * other->m[9] + m[3] * other->m[13];
    result->m[2] = m[0] * other->m[2] + m[1] * other->m[6] + m[2] * other->m[10] + m[3] * other->m[14];
    result->m[3] = m[0] * other->m[3] + m[1] * other->m[7] + m[2] * other->m[11] + m[3] * other->m[15];
          
    result->m[4] = m[4] * other->m[0] + m[5] * other->m[4] + m[6] * other->m[8] + m[7] * other->m[12];
    result->m[5] = m[4] * other->m[1] + m[5] * other->m[5] + m[6] * other->m[9] + m[7] * other->m[13];
    result->m[6] = m[4] * other->m[2] + m[5] * other->m[6] + m[6] * other->m[10] + m[7] * other->m[14];
    result->m[7] = m[4] * other->m[3] + m[5] * other->m[7] + m[6] * other->m[11] + m[7] * other->m[15];
          
    result->m[8] = m[8] * other->m[0] + m[9] * other->m[4] + m[10] * other->m[8] + m[11] * other->m[12];
    result->m[9] = m[8] * other->m[1] + m[9] * other->m[5] + m[10] * other->m[9] + m[11] * other->m[13];
    result->m[10] = m[8] * other->m[2] + m[9] * other->m[6] + m[10] * other->m[10] + m[11] * other->m[14];
    result->m[11] = m[8] * other->m[3] + m[9] * other->m[7] + m[10] * other->m[11] + m[11] * other->m[15];
          
    result->m[12] = m[12] * other->m[0] + m[13] * other->m[4] + m[14] * other->m[8] + m[15] * other->m[12];
    result->m[13] = m[12] * other->m[1] + m[13] * other->m[5] + m[14] * other->m[9] + m[15] * other->m[13];
    result->m[14] = m[12] * other->m[2] + m[13] * other->m[6] + m[14] * other->m[10] + m[15] * other->m[14];
    result->m[15] = m[12] * other->m[3] + m[13] * other->m[7] + m[14] * other->m[11] + m[15] * other->m[15];
}

Once again the code looks like JavaScript,

To use it from our JavaScript application, just add a reference to the project and call this code:

// Pure CPP
var m3 = new CPPPerfTest.Matrix();
var m4 = new CPPPerfTest.Matrix();
var result2 = new CPPPerfTest.Matrix();

var sum = 0;
for (var index = 0; index < cyclesCount; index++) {
    var duration = doPureCPPTest(m3, m4, result2, repeatCount);
    resultText.innerHTML += "Operation lasted " + duration + "ms using pure c++<br>";
    sum += duration;
}

Where doPureCPPTest is the following function:

var doPureCPPTest = function (m1, m2, result, repeatCount) {
    var startDate = new Date().getTime();
    for (var index = 0; index < repeatCount; index++) {
        m1.multiply(m2, result);
    }
    var endDate = new Date().getTime();

    return (endDate - startDate);
};

By using this code, we are able to get the following response times:

image

But wait! There is something wrong here, isn’t it? The average duration is around 339.2ms. Which is far too much and sound illogical.

There is a reason for that: the WinRT language projection. This technology allows you to call C++ or .NET code from your JavaScript code (or C++ from .NET also). But, even if WinRT language projection is really efficient there is always a small overhead when you go back and forth between languages.

Removing projection overhead

So this is a really important thing to note: If you want to unleash the power of C++ through WinRT projection, you will have to reduce the number of round trip you will do.

In my simple case, here is a new function for computing the multiplications:

void Matrix::MultiplyTest(Matrix^ other, Matrix^ result, int repeatCount)
{
    for (int index = 0; index < repeatCount; index++) {
        Multiply(other, result);
    }
}
Instead of calling 1.000.000 times the Multiply function, you just have to call MultiplyTest once with the number of multiplication you want.

Obviously, it will be far complex from a "real life" code to be able to create big chunks of code, but you will see that the gain worth the pain.

With this optimizations, here are the results:

image

The average duration is now 34.3ms down from 69.3ms (almost 50% better again).

Using DirectXMath

If we want to achieve the maximum performance we now have to call the hardware for help. And to do so, we can use DirectXMath. The primary goals of this library is to provide an excellent C++ SIMD math library for use for Windows Store apps. The DirectXMath API provides SIMD-friendly C++ types and functions for common linear algebra and graphics math operations common to DirectX applications. The library provides optimized versions for Windows 32-bit (x86), Windows 64-bit (x64), and Windows RT through SSE2 and ARM-NEON intrinsics support in the Visual Studio compiler.

SIMD (Single Instruction Multiple Data) operations are really useful for matrices multiplication because you apply the same treatment to each row and column.

To use DirectXMath within our WinRT component, we just have to include the following line:

#include <DirectXMath.h>

You are now able to access a new struct called XMMATRIX:

#pragma once

namespace CPPPerfTest
{
    public ref class DXMatrix sealed
    {
    public:
        DXMatrix();

        void Multiply(DXMatrix^ other, DXMatrix^ result);
        void MultiplyTest(DXMatrix^ other, DXMatrix^ result, int repeatCount);

    private:
        DirectX::XMMATRIX m;
    };
}

The implementation of our new class is pretty straightforward:

#include "pch.h"
#include "DXMatrix.h"

using namespace DirectX;
using namespace CPPPerfTest;

DXMatrix::DXMatrix()
{
    XMFLOAT4X4 temp;
    for (int x = 0; x < 4; x++) {
        for (int y = 0; y < 4; y++) {
            temp.m[x][y] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
        }
    }

    m = XMLoadFloat4x4(&temp);
}

void DXMatrix::Multiply(DXMatrix^ other, DXMatrix^ result)
{
    result->m = XMMatrixMultiply(m, other->m);
}

void DXMatrix::MultiplyTest(DXMatrix^ other, DXMatrix^ result, int repeatCount)
{
    for (int index = 0; index < repeatCount; index++) {
        Multiply(other, result);
    }
}

The signature of the MultiplyTest method is the same as previously, so we can use our DXMatrix directly from the JavaScript code:

// DirectX(SIMD) CPP
var m5 = new CPPPerfTest.DXMatrix();
var m6 = new CPPPerfTest.DXMatrix();
var result3 = new CPPPerfTest.DXMatrix();

var sum = 0;
for (var index = 0; index < cyclesCount; index++) {
    var duration = doPureCPPTest(m5, m6, result3, repeatCount);
    resultText.innerHTML += "Operation lasted " + duration + "ms using DirectX c++<br>";
    sum += duration;
}
And the result is, let’s say, as expected:

image

We are now able to compute 1.000.000 multiplication of matrices in about 9.5ms!!!! We have optimize the performance by more than 1400%

Here is a global view on what we did:

image

Conclusion

I did not speak about GPU in this article but with the help of DirectX 11 and the Compute Shaders it would have been possible to get even better performance.

Nevertheless, with the help of C++, DirectX Math and WinRT, you are now able to create wonderful HTML5/CSS3/JavaScript application and still be able to get the maximum power of your CPU!

Obviously there is no magic here, you must have a code in your app that can be offloaded to C++ and DirectXMath (heavy computation, 3D, etc.) but in this case you will be able to get astonishing large performance improvements.

Feel free to run the complete tests (Visual Studio 2013 solution available here) and share your results using comments.

Leave a Comment
  • Please add 1 and 8 and type the answer here:
  • Post
  • I see one good candidate that could benefit from this in BabylonJS : It's for Skinned animation blending (for keyframe based skeleton animations, for transitioning between different animation loops, for mixing animations - eg. shoot + run - and IK). All these operations involve many matrix operations at once and so the WinRT boundaries crossing should be amortized by the complexity of the computation. Would be glad to help you on that subject !

Page 1 of 1 (1 items)