This blog post shows how parallel prefix sum or “Scan” is implemented using C++ AMP.
In main(), create an instance of scan, compute prefix sum (execute) and validate results (verify). There are 3 instances of scan each of different data type. In this sample, the 1 byte data is generated in constructor. The data is 1 byte in size to avoid overflow.
In this function, the input buffer and the output buffer are wrapped in a concurrency::array_view object. Now a global function “scan_tiled” is called with tile size as its template parameter, input and output concurrency::array_view parameters.
The function calls “scan_tiled” function. The 2nd parameter of “scan_tiled” is created on the thread stack as a concurrency::array and a concurrency::array_view of it is passed as output parameter. The output data is then copied back to the input array. This function is used to recursively compute prefix sum of prefix sum.
This function does a tile wise prefix sum of elements in the “_tile_size” portion of the input data at an offset unique to each tile. The number of tiles is rounded up (based on “input” concurrency::array) to a multiple of “_tile_size” and then the kernel is invoked. Inside the kernel, each GPU threads checks their bound before accessing any input data. This function uses tile_static memory to do cooperative reads from global memory and compute on elements in tile_static memory. When the data is read from the GPU global memory, all threads in the tile wait on barrier before proceeding. The for loop part of kernel has the prefix sum calculation. In each iteration threads synchronize using tile barrier. After the loop is complete, each thread will updates their prefix sum, using their global thread id, to an output buffer (named “tilewise_scans”). Then the “_tile_size-1”-th thread in each tile update tile prefix sum, based on tile index, to another buffer (named “sums_of_tiles”) for further calculation.
In this function the number of tiles is rounded up (based on “input” parameter) and “compute_tilewise_scan” is called to scan. The per tile scan result is stored in another temporary array “scan_of_sums_of_tiles”. If the number of tiles is more than one, then scan of scan (new scan of old scan) should be computed and the tile prefix sum should be propagated. Scan of scan is taken care by calling “prefix_scan” recursively and passing in the local temporary array (“scan_of_sums_of_tiles”.) as parameter. The prefix sum is propagated in a different kernel (inline in this function) where prefix sum of previous tile (computed based on thread index) is applied to all elements.
In this function, the prefix sum is calculated on the CPU and compared with the GPU results sequentially.
Please download the attached sample of the Scan that we discussed here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.