Parallel Programming in Native Code

Parallel programming using C++ AMP, PPL and Agents libraries.

Scan algorithms in C++ AMP Algorithms Library

Scan algorithms in C++ AMP Algorithms Library

  • Comments 0

Hi there! I wanted to let you know that we recently added important algorithms to our C++ AMP Algorithms Library: these are scan, segmented scan, multiscan. If you have not heard before about C++ AMP Algorithms Library, then I encourage you to visit our CodePlex page. In essence, it is a collection of STL-like parallel algorithms that developers can freely use in their own projects. In this blog post I will briefly describe what the scan algorithm does, present an application of scan and show you the interface of scan algorithm in C++ AMP.

Introduction to scan

The scan algorithm (also known as prefix sum) takes an input array A and produces the output array B of the same size where each value at index i, is the sum of all elements that precede i, but not including value at i. Here is the sequential version of the algorithm:

B[0] = 0;

for (int i=1; i<A.size(); ++i)

{

B[i] = B[i-1] + A[i-1];

}

 

So, given input A with following values: [1, 1, 4, 2, 1] the output B would contain: [0, 1, 2, 6, 8].

In the example above we preformed addition operation between the elements of the input array A to calculate the resulting value in output B, but in general it can be any binary associative operator. Additionally, please note that we have not accessed the last value in the input array A, this form of scan is called exclusive scan. Similar version of this algorithm is called inclusive scan, when the output value at index i is the sum of all elements that proceed i, including the value at i.

If you are interested to learn how the output array can be efficiently computed in parallel, then I encourage you to read "Fast Scan Algorithms on Graphic Processors" which is what C++ AMP Algorithms Library uses under the covers.

Applications of scan

Although it might not be obvious at first, scan is the building block for many other algorithms such as radix sort, sparse matrix vector multiply, stream compaction etc.

To build a better understanding, take a look how stream compaction can be achieved with a scan algorithm. Given an input array A of size n, we want to copy an element at index i to first available spot in the output array B only if that element meets certain characteristics. Here is the sequential code for stream compaction:

int j = 0;

for (int i=0; i<A.size(); ++i)

{

if (predicate(A[i]))

{

B[j++] = A[i];

}

}

So, given the input A with the following values: [1, 3, 4, 2, 7] and the predicate that returns true if value is odd and false if even, the output B would contain: [1, 3, 7].

Now how would we achieve the same in parallel? Well, we can certainly run the predicate function in parallel. All the difficulty comes in the fact that we do not know what would be the output index for the element that we evaluated, this is where scan algorithm comes in. We use intermediate array C, in which we will store 1 at index i if the element at index i from A passes the predicate and 0 otherwise. So for our example above our intermediate array would have following values: C = [1, 1, 0, 0, 1]. Now we exclusive scan as defined above on array C and store the results in yet another intermediate array D = [0, 1, 2, 2, 2]. Values in D represent indices at which elements from A should be placed in B. Additionally notice that element from array A is copied to B only if its value in C is equal to 1, so there is no race for space at index 2 in B. Here are the steps for each thread:

  • Thread 0 looks at index 0 in C and since the value is 1 it copies element from A to B using index in D. B[D[0]] = A[0].
  • Thread 1 looks at index 1 in C and since the value is 1 it copies element from A to B using index in D. B[D[1]] = A[1].
  • Thread 2 looks at index 2 in C and since the value is 0 it stops.
  • Thread 3 looks at index 3 in C and since the value is 0 it stops.
  • Thread 4 looks at index 4 in C and since the value is 1 it copies the element from A to B using index in D. B[D[4]] = A[4];

We are done, the output array B contains [1, 3, 7].

Scan in C++ AMP Algorithms Library

Now that we built the understanding for scan algorithm and went through the application of scan, we can now take a look at the API design for scan algorithm in C++ AMP Algorithms Library. The scan algorithm is declared in amp_algorithms.h available in amp_algorithms namespace, here are few sample usages:

void test_scan(array<int> &A, array<int> &B)

{

// creates scan object

// we need information about the maximum number of elements to prep temp array

scan s(A.extent.size());

 

// performs scan on A and stores the results in B

// the default operator is amp_algorithms::sum<T>()

s.scan_exclusive(A, B);

 

// performs in-place prefix sum on A

s.scan_exclusive(A, A);

 

// performs scan with multiplication as binary operation between the elements

s.scan_exclusive(A, B, amp_algorithms::mul<int>());

 

// performs scan with bit xor operation in backwards direction

s.scan_exclusive(A, B, scan_direction::backward, amp_algorithms::bit_xor<int>());

}

 

Another two similar algorithms that we added are segmented scan and multiscan, they are exposed as member function on scan object (segmented_scan_exclusive and multi_scan_exclusive). Segmented scan takes additional input array that denotes sub-arrays (segments) in the input array A and computes scan operation within each segment of the input array. Multiscan on the other hand operates on 2D input array and performs scan operation on every row.

 

Please leave us a comment below if there are any specific algorithms you would like us to add to the algorithms library or if you would like to suggest any changes to the existing set of algorithms. Thanks!

Blog - Comment List MSDN TechNet
  • Loading...
Leave a Comment
  • Please add 8 and 2 and type the answer here:
  • Post