Convolution operations appear in many areas of engineering, such as image processing and signal processing, to mention a couple. In general, convolution combines two input functions to produce a third function. In this sample, I used C++ AMP to implement a separable convolution filter.
Let me start with function main. Here I randomly generate a filter matrix (1D) and an image matrix (2D), call different implementations of convolution (simple and tiled implementations using C++ AMP), and eventually verify the results using the CPU implementation as a reference. Here our data plays the role of an image, and the filter represents certain characteristics applied to the image. For simplicity, both inputs are randomly generated rather than using an actual image and filter.
I am starting with the CPU C++ implementation because it will make it easier to understand how we can move to C++ AMP.
Given the image and filter matrices, I apply the filter along each dimension of the image (first the rows and then the columns). The two outer loops move across the image, and the innermost loop applies the filter to each element of the matrix (or pixel). The dimension along which the filter is applied (row or column) is determined in the innermost loop.
For each element in the data, we apply the filter from radius elements before the selected element through radius elements after it, along one dimension. For example, if we are applying the filter along columns, element 5 is selected by this thread, and the radius is 2, then we apply the filter to elements 3 through 7 of the image. When we talk about “applying the filter” to a point in the image, what we mean is simply summing the products of the data elements with the corresponding filter elements. The following pseudo-code shows what I just described.
foreach element [x, y] in image[width, height]
    sum = 0
    for (k = -radius; k <= radius; k++)
        xi = clamp(x + k, 0, width - 1)    // keep 0 <= xi < width
        sum += image[xi, y] * filter[k + radius]
    result[x, y] = sum
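To make the pseudo-code concrete, here is a minimal plain-C++ sketch of one pass of the separable filter (along rows), with border clamping. The function name convolve_row is mine for illustration and does not come from the attached project:

```cpp
#include <algorithm>
#include <vector>

// One pass of a separable convolution along the rows of a width x height
// image stored in row-major order. Indices that fall outside the image
// are clamped to the border, as in the pseudo-code above.
std::vector<float> convolve_row(const std::vector<float>& image,
                                int width, int height,
                                const std::vector<float>& filter, int radius)
{
    std::vector<float> out(image.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
        {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k)
            {
                int xi = std::clamp(x + k, 0, width - 1); // stay in bounds
                sum += image[y * width + xi] * filter[k + radius];
            }
            out[y * width + x] = sum;
        }
    return out;
}
```

Running the same function a second time with the roles of x and y swapped gives the column pass, which is exactly the two-pass structure the C++ AMP version uses.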
This is our wrapper call to the simple convolution kernel. There are two input array objects (image and filter), one output array (the resulting image), and a temporary array. For each input array object, the data source is one of the function's parameters. At the end of the function, there is a copy operation to copy the data from the output array to the output parameter. There are two parallel_for_each invocations, which call the kernel convolution_dim_simple along the columns and then along the rows. The temporary array is shared across the parallel_for_each invocations: in the first parallel_for_each the temporary array is used as the output, and it then becomes the input for the second parallel_for_each without being copied back to the CPU. As you can see, there is no copy out of the temporary array's data between the parallel_for_each calls; this is because an array, once copied (or, in this case, assigned values in the first parallel_for_each), resides in GPU global memory.
Let’s compare the parallel_for_each in this function with the for loops in convolution_cpu (described earlier). In convolution_cpu there are three nested for loops: the two outer loops traverse the image, and the innermost applies the filter. Here in convolution_simple, I have replaced the two outer for loops with a parallel_for_each, which means that one GPU thread will be created for each pixel of the image (the first parameter specifies the number of GPU threads created).
This function is called from the parallel_for_each invocations of the convolution_simple function. It is a template kernel function, where the template parameter is the dimension along which to apply the filter. Other than being a template function, it is similar to the innermost loop of convolution_cpu: we sum the products of 2 * radius + 1 elements along a dimension with their corresponding filter elements.
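The idea of selecting the dimension through a template parameter can be sketched in plain C++ (the AMP kernel works on index<2> objects; the helper name apply_filter_at is my own):

```cpp
#include <algorithm>
#include <vector>

// Plain C++ sketch of a dimension-templated filter step: the template
// parameter picks the dimension to convolve at compile time
// (1 = along the row / x axis, 0 = along the column / y axis).
template <int DimToConvolve>
float apply_filter_at(const std::vector<float>& image, int width, int height,
                      const std::vector<float>& filter, int radius,
                      int x, int y)
{
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k)
    {
        int xi = x, yi = y;
        if (DimToConvolve == 1)
            xi = std::clamp(x + k, 0, width - 1);   // move along the row
        else
            yi = std::clamp(y + k, 0, height - 1);  // move along the column
        sum += image[yi * width + xi] * filter[k + radius];
    }
    return sum;
}
```

Instantiating the template twice, once per dimension, gives two specialized kernels from one piece of source code, which is exactly why the sample makes the kernel a template.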
This is our wrapper call to the tiled version of the convolution kernel. The number of arrays and how they are used are similar to the convolution_simple implementation. But the number of GPU threads and how they are organized into tiles is different: each tile has TILE_SIZE threads, and there are dim / (TILE_SIZE - 2 * radius) tiles, which means more threads than pixels. In the picture below, if the light shade represents our data/image, the dark shade plus the light shade represents the total number of threads that are created. The dark shade represents the additional threads (radius of them on each side) that are created along each side of the image when the image is laid out in 1D.
Then, as each tile executes, it divides the data into smaller chunks (of TILE_SIZE - 2 * radius elements) and operates on them. The picture below shows a tile while applying the convolution filter along a column: the blue-colored region of threads only reads in boundary data and does not participate in applying the filter, while the light-colored region of threads reads from global memory and also applies the filter.
This function is called from both parallel_for_each invocations of convolution_tiling. In the convolution_dim_simple kernel, the image data is read from global memory while applying the filter, but in the tiled version we try to reduce global memory accesses. Each thread picks a pixel to operate on based on its tile index and local index. Below is the code for this:
int idx_convolve = (tile_idx[dim_to_convolve]) * (TILE_SIZE - 2 * radius) +
                   (int)(local_idx[dim_to_convolve]) - radius;
In the above code there is a possibility of overflowing or underflowing the image/array at the boundaries, so we use clamping to protect against going out of bounds. Note that, because of the clamped indexes, some of the threads (in the first and the last tiles) will end up reading the same data again (especially along the borders), but that is OK. This is what the clamping code looks like:
a_idx[dim_to_convolve] = direct3d::clamp(idx_convolve, 0, max_idx_convolve-1);
The code above calculates the index of the pixel we use to start applying the filter. Next, each thread will use the resulting index to read the image data into tile_static memory.
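The index arithmetic and the clamping behavior can be verified in plain C++, where std::clamp plays the role of direct3d::clamp (the helper name tiled_read_index is mine):

```cpp
#include <algorithm>

// Plain C++ reproduction of the tiled index math: a thread with tile
// index tile_idx and local index local_idx starts reading radius elements
// before its tile's output region; clamping keeps border threads inside
// the image, so threads in the first and last tiles re-read border data.
int tiled_read_index(int tile_idx, int local_idx,
                     int tile_size, int radius, int dim)
{
    int idx_convolve = tile_idx * (tile_size - 2 * radius)
                     + local_idx - radius;
    return std::clamp(idx_convolve, 0, dim - 1); // as direct3d::clamp does
}
```

For instance, with tile_size = 8, radius = 2, and dim = 16, thread 0 of tile 0 computes the raw index -2, which clamps to 0; thread 7 of the last tile computes 17, which clamps to 15.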
Next, all threads in a tile wait at a synchronization point (using tile_barrier) to ensure that all the necessary data is in tile_static memory before applying the filter.
Finally, the filter function is executed by only TILE_SIZE - 2 * radius of the TILE_SIZE threads.
Please download the attached project of the convolution we discussed here, run it on your hardware, and try to understand what the code does and learn from it. As always, you will need Visual Studio 11.