Parallel Programming in Native Code

Parallel programming using C++ AMP, PPL and Agents libraries.

Histogram using C++ AMP

Histogram using C++ AMP

  • Comments 1

In statistics, Histogram helps understand distribution of data or help gauge density of certain data. This is mainly used in data mining applications and image processing. In this blog post I am sharing a C++ AMP sample of a histogram implementation ported from histogram_256 implementation from CUDA SDK samples.

main – Entry point

This function creates an object of the histogram user defined type, generates data (histogram::setup) for calculation, calculates the histogram (histogram::run) and verifies (histogram::verify) the results of the computation on the CPU with results on the accelerator. If the results don’t match this application prints out an error message.

 

histogram - C++ Implementation

This class is an interface to the actual histogram calculation class. This class has a constructor, methods to calculate histogram (setup, run, verify and cleanup) and member variables to store the data and result. The constructor of this class sets member variables to their default value. Below are the key methods:

histogram::setup

This method generates random data for calculation and maintains or stores them in histogram::data member variable. In this sample we are using unsigned char type data.

histogram::cleanup

This method resets all data and information maintained by an instance. This is similar to constructor.

histogram::verify

This method helps determine the correctness of the C++ AMP implementation. In this method, the histogram is calculated using the CPU and the results are compared with results from the accelerator.

histogram::run

In this function, data on the host memory is encapsulated using concurrency::array and then the computation is offloaded to an instance of histogram_amp (covered later in this post). When the computation is done, the result is copied from the accelerator memory to the CPU memory. Note that unsigned char data type is not supported in C++ AMP kernels, as it is not in the list of supported fundamental types in restrict (amp) code. We have documented separately how to work around the char type restriction.

 

histogram_amp - C++ AMP kernels

When a kernel is part of member function, the compiler tries to marshal in “this” pointer. If a class has unsupported data types, you will receive a compiler error about unsupported data types. So I created a new class with member functions implementing AMP kernels.

histogram_amp::addword_256

This function understands the internal structure of the user data, in this case an unsigned int consist of 4 unsigned chars’. For each packed input data, this function extracts actual user data and uses it to determine its bin. Once bin is determined, bin count is atomically incremented to ensure correctness across threads.

histogram_amp::histo_kernel

This function is called from histogram::run member function. The kernel in this function divides the input data among _no_of_tiles*HISTOGRAM256_TILE_SIZE number of threads. Initially each thread in the tile will initialize its portion of tile_static memory, then reads data from the GPU global memory and calls histogram_amp::addword_256 to calculate partial histogram in tile_static memory. The per-tile partial histogram is stored in tile_static memory. In the kernel, tile_static memory is virtually organized as a 2D array of THREAD_COUNT rows and HISTOGRAM_BIN_COUNT columns, where a sub group of threads is responsible for a partial histogram. At the end, each thread will sum up bin counts, within a tile, in tile_static memory and save it to an array on the GPU global memory.

histogram_amp::histo_merge_kernel

This function is also called from histogram::run member function. Output of histogram_amp::histo_kernel is bin count merged among threads in a tile. In the last phase per tile partial histogram is merged to get a complete histogram. Here each tile calculates one of the bins’ count. Each thread in a tile sums one bin data across different partial results. And then all threads go through parallel reduction where number of threads participating in each iteration halves. Eventually zero-th thread in the tile will update final bin count.

Download the sample

Please download the attached project of Histogram using C++ AMP that we discussed here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.

As with all the samples we post, they are meant to help you get started with C++ AMP. Once you are up and running, you may be able to come up with a better C++ AMP implementation of one of the algorithms. If you do, please share it with us.


Attachment: histogram.zip
Blog - Comment List MSDN TechNet
  • Loading...
Leave a Comment
  • Please add 8 and 7 and type the answer here:
  • Post