Hi, my name is Jerry Higgins and I am the test lead for C++ AMP and PPL. In this post, I will describe a fun little project that distributes matrix multiplication across multiple accelerators, or even across a single accelerator for matrices that are too large to fit in the accelerator's memory. I've taken advantage of fork-join parallelism in PPL as well as the massive parallelism unleashed by C++ AMP.

## Overview of solution

I used a chunk and stream approach, chunking the A & C matrices to the accelerators in parallel by taking advantage of one of my favorite tools from PPL, *parallel_for*, then streaming the B matrix across each accelerator with the C++ AMP *parallel_for_each*. This approach limits the amount of redundant data transfers but does not eliminate them.

One of the things that I really like about the programming model for C++ AMP is that it allows you to remain in the dimensionality of your data as well as the dimensionality of your computation. Rather than nesting linear constructs to iterate over multiple dimensions, our data containers and the way parallel_for_each is expressed with an *extent* and *index*, allow the programmer to remain closer to the original problem domain. In this context, it is easier to think of matrix multiplication as a two dimensional computation across the C matrix. See chunking data across multiple C++ AMP kernels for more insight into this and other strategies.

In an earlier iteration of this algorithm, I partitioned the matrices across accelerators before chunking and streaming. This had a couple of disadvantages: it added an unnecessary constraint that the matrix dimensions be divisible by the number of accelerators, and CPU-side memory locality suffered because the partitioning created larger gaps in memory during concurrent memory accesses. The current approach achieves good data locality with simple load balancing between accelerators, and the control flow is more straightforward.

I'll be describing the code top down, starting from main.

## Main

First we create the 3 single precision floating point matrices in flat memory on the CPU, initializing the input matrices A and B with random values using the Mersenne Twister random number generator that comes with the standard template library, and zero-initializing the output matrix C for cleanliness. Next we get all available accelerators and, for the ones that are not emulated, push their default accelerator views into a vector that we will later pass to the *mxm_amp_distribute* function.

```cpp
int main() {
    const int MATRIX_SZ = 1024;
    const int STREAM_WIDTH = 512;

    const unsigned int M = MATRIX_SZ;
    const unsigned int N = MATRIX_SZ;
    const unsigned int W = MATRIX_SZ;

    float *A = new float[M*N];
    float *B = new float[N*W];
    float *C = new float[M*W];

    rand_init(A, M*N); // initialize with random numbers
    rand_init(B, N*W); // initialize with random numbers
    zero_init(C, M*W); // initialize with zero

    std::vector<accelerator> accelerators = accelerator::get_all();
    std::vector<accelerator_view> accelerator_views;
    std::for_each(accelerators.begin(), accelerators.end(), [&] (accelerator& accl) {
        if(!accl.is_emulated) { // if not emulated
            std::wcout << accl.description << std::endl;
            accelerator_views.push_back(accl.default_view); // get the default accelerator view
        }
    });

    mxm_amp_distribute(accelerator_views, STREAM_WIDTH, A, B, C, M, N, W);

    delete[] A;
    delete[] B;
    delete[] C;
}
```
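The *rand_init* and *zero_init* helpers are not listed in this post; a minimal sketch of what they might look like, using the std::mt19937 Mersenne Twister mentioned above (the names match the calls in main, but the seed and value range are assumptions):

```cpp
#include <random>
#include <algorithm>

// Hypothetical implementations of the helpers called from main() above.
void rand_init(float* arr, unsigned int n) {
    std::mt19937 gen(2012);                                  // fixed seed; the post does not specify one
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);  // assumed value range
    std::generate(arr, arr + n, [&] { return dist(gen); });
}

void zero_init(float* arr, unsigned int n) {
    std::fill(arr, arr + n, 0.0f);
}
```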

## mxm_amp_distribute

The *mxm_amp_distribute* function is where the coarse-grained parallelism that distributes the work across accelerators occurs. First the parameters are checked, then each accelerator_view gets its own staging buffer for streaming the B matrix from the CPU, and these are pushed into a vector. Using staging arrays for the B matrix streams increased performance for this workload.

```cpp
void mxm_amp_distribute(std::vector<accelerator_view> acc_views, const unsigned int stream_width,
                        const float* A, const float* B, float* C,
                        const unsigned int M, const unsigned int N, const unsigned int W) {
    // Performs matrix multiplication C = A * B

    // throw runtime_error if M, N and W are not all multiples of stream_width
    if( ((M % stream_width) != 0) || ((N % stream_width) != 0) || ((W % stream_width) != 0)) {
        throw std::runtime_error("M, N and W must all be multiples of stream_width");
    }

    if(acc_views.size() == 0) {
        throw std::runtime_error("acc_views is empty, nowhere to run");
    }

    // Create a vector of staging arrays for streams of matrix B, one for each accelerator
    std::vector<array<float, 2>> b_vect;
    b_vect.reserve(acc_views.size());
    accelerator cpu_accelerator = accelerator(accelerator::cpu_accelerator);
    std::for_each(acc_views.begin(), acc_views.end(), [&] (accelerator_view acc_vw) {
        b_vect.emplace_back(extent<2>(N, stream_width), cpu_accelerator.default_view, acc_vw);
    });
```

The outer *while* loop continues execution until the end of the data has been reached. The inner *parallel_for* loop calculates the offset where each accelerator will start its work using the accelerator's index and the *stream_width* parameter, with each task checking that it does not overreach its bounds. This last check is necessary because the matrix size may not be evenly divisible by the number of accelerators. Next, the *dispatch_chunk* function is called and the party moves to the accelerator.

```cpp
    unsigned int w_start = 0; // indicates where to start processing
    while(w_start < W) {
        parallel_for((size_t) 0, acc_views.size(), [&] (size_t accelerator_ix) {
            unsigned int stream_offset = w_start + (stream_width * accelerator_ix);
            if (stream_offset < W) {
                dispatch_chunk(acc_views[accelerator_ix], A, B, C, M, N, W,
                               stream_width, stream_offset, b_vect[accelerator_ix]);
            }
        });
        w_start += (acc_views.size() * stream_width);
    }
}
```


## dispatch_chunk

The *dispatch_chunk* function creates C++ AMP arrays that are fixed to the underlying storage for the A & C matrices. Extents are created for the data as well as the computation to be done for each invocation of the parallel_for_each. The shape of the computation is square in this case and its size is fixed to the *stream_width* parameter.

```cpp
void dispatch_chunk(accelerator_view acc_vw, const float* A, const float* B, float* C,
                    const unsigned int M, const unsigned int N, const unsigned int W,
                    const unsigned int stream_width, const unsigned int stream_offset,
                    array<float,2> &staging_b) {
    extent<2> extent_A(stream_width, N);
    extent<2> extent_C(stream_width, M);
    extent<2> extent_compute(stream_width, stream_width);

    // chunk storage on GPU for A & C
    array<float, 2> matrix_a(extent_A, A + (stream_offset * N), acc_vw);
    array<float, 2> matrix_c(extent_C, C + (stream_offset * M), acc_vw);
```

Next, we wrap an array_view around this accelerator’s staging array for the B matrix, which was passed in as a parameter.

```cpp
    array_view<float,2> matrix_b(staging_b);
```

The outer *for* loop is used to break the B matrix into *stream_width*-wide streams, using the *populate_stream* function to fill the staging array and passing it to the *mxm_calc_stream* function, which calculates that stream's corresponding columns of C. This is done until an entire *stream_width* chunk of C is resolved and copied back out to its CPU-backed storage.

```cpp
    // stream B to GPU
    for(unsigned int start_column = 0; start_column < M; start_column += stream_width) {
        populate_stream(staging_b.data(), B, start_column, stream_width, N, M);
        mxm_calc_stream(extent_compute, N, start_column, matrix_a, matrix_b, matrix_c);
    }

    copy(matrix_c, C + (stream_offset * M));
}
```

## populate_stream

The *populate_stream* function copies a column stream from a matrix to a staging array, maintaining row-major order. The underlying data pointer is used for performance reasons, because on the CPU side using the index operator is slower (note that on the accelerator this is not an issue).

```cpp
// populate B matrix column stream
void populate_stream(float* column_stream_ptr, const float* const arr_in, unsigned int start_column,
                     unsigned int stream_width, unsigned int arr_length, unsigned int arr_width) {
    for(unsigned int row = 0; row < arr_length; row++) {
        memcpy(column_stream_ptr + (row * stream_width),
               arr_in + (row * arr_width) + start_column,
               stream_width * sizeof(float));
    }
}
```

## mxm_calc_stream

The *mxm_calc_stream* function is just another matrix multiplication implementation. I currently use a tiled matrix multiplication that takes advantage of *tile_static* memory.

Since the A matrix is read from within the parallel_for_each kernel, it is copied to the accelerator. As C is only written to, data for C is not copied from the CPU to the accelerator. However, since C is captured in the lambda and written to, storage is allocated on the accelerator.

```cpp
// matrix multiply, resolve C for this stream
void mxm_calc_stream(extent<2> extent_compute, unsigned int N, unsigned int start_column,
                     array<float,2> const &matrix_a, array_view<float,2> const &matrix_b,
                     array<float,2> &matrix_c) {
    parallel_for_each(extent_compute.tile<TILE_SIZE,TILE_SIZE>(),
                      [=, &matrix_a, &matrix_c] (tiled_index<TILE_SIZE,TILE_SIZE> t_idx) restrict(amp) {
        float accum_c = 0; // accumulate sum in local register to avoid
                           // extra round trips to global memory
        for (unsigned int i = 0; i < N; i += TILE_SIZE) {
            tile_static float local_B[TILE_SIZE][TILE_SIZE]; // tile memory for partial sum of matrix B
            tile_static float local_A[TILE_SIZE][TILE_SIZE]; // tile memory for partial sum of matrix A

            local_A[t_idx.local[0]][t_idx.local[1]] = matrix_a(t_idx.global[0], i + t_idx.local[1]);
            local_B[t_idx.local[0]][t_idx.local[1]] = matrix_b(i + t_idx.local[0], t_idx.global[1]);

            t_idx.barrier.wait(); // wait for all threads in the tile to pull their
                                  // corresponding values into tile_static memory

            for (unsigned int k = 0; k < TILE_SIZE; k++) {
                accum_c += local_A[t_idx.local[0]][k] * local_B[k][t_idx.local[1]]; // accumulate partial sum
            }

            t_idx.barrier.wait(); // wait for all threads in the tile
        }

        matrix_c(t_idx.global[0], t_idx.global[1] + start_column) = accum_c; // update global memory
    });
}
```

That's all for this post. Thanks for tuning in and I hope you enjoy programming in C++ AMP as much as we do. Feedback as always welcome in the comments or in our MSDN Forum.