Parallel Programming in Native Code

Parallel programming using C++ AMP, PPL and Agents libraries.

Transferring Data between accelerator and host memory



Hello, I am Hasibur Rahman, an engineer on the C++ AMP team. Because C++ AMP is meant to harness the computing power of accelerators such as the GPU, any application that leverages this power through C++ AMP will naturally involve transferring data between host memory and accelerator memory whenever the accelerator is a discrete device. To make this convenient, C++ AMP provides a rich set of APIs for transferring data between host and accelerator, both synchronously and asynchronously. In this blog post I will walk you through the complete list of copy APIs with some examples.

For this blog post I am assuming that you are familiar with concurrency::array and concurrency::array_view, the two types C++ AMP provides for handling and manipulating data. C++ AMP provides a set of global concurrency::copy functions returning void (and corresponding concurrency::copy_async functions returning a concurrency::completion_future) for copying data between arrays, array views, STL iterators, etc. residing on the same or different accelerators.

Depending on where the source data is allocated, copying data between host and accelerator may involve creating a CPU-side staging buffer. The data is first copied from the source container to the staging buffer and then from the staging buffer to the destination container. The copy APIs abstract away these details and provide a copying experience analogous to std::copy in the STL. Copying data is not supported while executing on an accelerator, so the concurrency::copy functions can only be invoked on the CPU (they carry only a restrict(cpu) clause).
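To make the two-hop path concrete, here is a plain standard C++ model (no C++ AMP types) of a copy routed through a CPU-side staging buffer. The function name staged_copy, the fixed staging size and the chunk-by-chunk loop are assumptions for illustration only; the C++ AMP runtime decides on its own whether and how to stage a transfer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model only (hypothetical helper, not a C++ AMP API):
// copies src into dest by way of an intermediate staging buffer of
// staging_size elements, one chunk at a time. The chunking is an
// assumption of this sketch. Returns the number of chunks staged.
template <typename T>
std::size_t staged_copy(const std::vector<T>& src, std::vector<T>& dest,
                        std::size_t staging_size)
{
    assert(staging_size > 0);
    assert(src.size() <= dest.size());

    std::vector<T> staging(staging_size);  // CPU-side staging buffer
    std::size_t chunks = 0;

    for (std::size_t offset = 0; offset < src.size(); offset += staging_size) {
        std::size_t n = std::min(staging_size, src.size() - offset);
        // Hop 1: source container -> staging buffer.
        std::copy_n(src.begin() + offset, n, staging.begin());
        // Hop 2: staging buffer -> destination container.
        std::copy_n(staging.begin(), n, dest.begin() + offset);
        ++chunks;
    }
    return chunks;
}
```

The point of the model is that the caller only ever sees one call, just as with concurrency::copy; the intermediate buffer is an internal detail.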

Copying between array(s) and array_view(s)

C++ AMP provides the following APIs for copying data between array and array_view.

1.  template <typename ValueType, int Rank> void copy(const array<ValueType, Rank>& Src, array<ValueType, Rank>& Dest)

2.  template <typename ValueType, int Rank> void copy(const array_view<ValueType, Rank>& Src, array_view<ValueType, Rank>& Dest)

3.  template <typename ValueType, int Rank> void copy(const array_view<const ValueType, Rank>& Src, array_view<ValueType, Rank>& Dest)

4.  template <typename ValueType, int Rank> void copy(const array<ValueType, Rank>& Src, array_view<ValueType, Rank>& Dest)

5.  template <typename ValueType, int Rank> void copy(const array_view<const ValueType, Rank>& Src, array<ValueType, Rank>& Dest)

6.  template <typename ValueType, int Rank> void copy(const array_view<ValueType, Rank>& Src, array<ValueType, Rank>& Dest)

The first API copies data from one array to another (including staging arrays). The next two APIs copy data from one array_view to another, and the last three copy data between an array and an array_view.

array_view<const T, N> is a partial specialization of array_view<T, N> and represents a view over elements of type const T with rank N. It differs from array_view<T, N> in that it provides read-only access to the underlying data.

In all of the above APIs, the source and destination containers can each reside either in host memory or in accelerator memory. For copies between array and array_view objects, the following conditions should be satisfied:

  • Source and destination should have the same rank.
  • The type of value stored in source and destination should be the same.
  • The extent (and hence the size) of source and destination should match.

std::vector<int> v(20, 5);

array_view<int, 2> srcArrayView(10, 2, v);

// Gets the handle to the accelerator (GetGpuDevice is an application-defined helper, not part of C++ AMP)
accelerator gpuDevice = GetGpuDevice();

array<int, 2> destArray(10, 2, gpuDevice.default_view);

copy(srcArrayView, destArray);

In the case of copying from one array to another, the third condition does not need to hold. Copying from an array to another array works as long as both arrays have the same rank and value type and the source has no more elements than the destination (i.e. srcArray.extent.size() <= destArray.extent.size()). This is because a concurrency::array lays out its underlying data contiguously in memory; the rank and extent are only used to calculate the index into that contiguous data. The same cannot be said for an array_view, which is just a wrapper over a real data source and does not own the data. Hence, a copy involving array_view objects requires the extents (and therefore the sizes) of source and destination to match.

std::vector<int> v(15);

array<int, 2> srcArray(3, 5, v.begin());

// Gets the handle to the GPU (application-defined helper)
accelerator gpuDevice = GetGpuDevice();

array<int, 2> destArray(5, 3, gpuDevice.default_view);

copy(srcArray, destArray); // This works.

array<int, 2> destArray1(10, 2, gpuDevice.default_view);

copy(srcArray, destArray1); // This works.

std::vector<int> v_dest(15);

array_view<int, 2> destArrayView(5, 3, v_dest);

// This will fail with runtime error: ‘Failed to copy because extents do not match.’
copy(srcArray, destArrayView);
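The extent rules above can be modelled in plain standard C++ without C++ AMP. The Extent2 struct and the helper functions below are hypothetical names for illustration: they encode the size-only check for array-to-array copies, the exact-extent check for copies involving an array_view, and the row-major indexing (last index varying fastest, assumed here) that lets a contiguous 3x5 source land in a 5x3 destination.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a rank-2 extent (rows x cols).
struct Extent2 {
    int rows;
    int cols;
    std::size_t size() const { return static_cast<std::size_t>(rows) * cols; }
};

// array -> array: only the element counts matter, because both
// containers store their elements contiguously.
bool can_copy_array_to_array(Extent2 src, Extent2 dest)
{
    return src.size() <= dest.size();
}

// Any copy involving an array_view: the extents must match exactly.
bool can_copy_with_view(Extent2 src, Extent2 dest)
{
    return src.rows == dest.rows && src.cols == dest.cols;
}

// Row-major linearization: index (i, j) maps to i * cols + j.
std::size_t linear_index(Extent2 e, int i, int j)
{
    return static_cast<std::size_t>(i) * e.cols + j;
}

// Element-by-element copy of contiguous storage; the extents play no
// part, only the sizes do.
std::vector<int> flat_copy(const std::vector<int>& src, std::size_t dest_size)
{
    assert(src.size() <= dest_size);
    std::vector<int> dest(dest_size, 0);
    for (std::size_t k = 0; k < src.size(); ++k) dest[k] = src[k];
    return dest;
}
```

For example, element (1, 2) of the 3x5 source sits at linear position 7, which is element (2, 1) of the 5x3 destination after the flat copy.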

Copying between STL Iterators and array/array_view

C++ AMP provides the following APIs for copying data between STL iterators and C++ AMP data types: array and array_view.

1.  template <typename InputIterator, typename ValueType, int Rank> void copy(InputIterator SrcFirst, InputIterator SrcLast, array<ValueType, Rank> &Dest)

2.  template <typename InputIterator, typename ValueType, int Rank> void copy(InputIterator SrcFirst, InputIterator SrcLast, array_view<ValueType, Rank> &Dest)

3.  template <typename InputIterator, typename ValueType, int Rank> void copy(InputIterator SrcFirst, array_view<ValueType, Rank> &Dest)

4.  template <typename InputIterator, typename ValueType, int Rank> void copy(InputIterator SrcFirst, array<ValueType, Rank> &Dest)

5.  template <typename OutputIterator, typename ValueType, int Rank> void copy(const array<ValueType, Rank> &Src, OutputIterator DestIter)

6.  template <typename OutputIterator, typename ValueType, int Rank> void copy(const array_view<ValueType, Rank> &Src, OutputIterator DestIter)

The first API copies data from an input iterator range to an array, and the second API copies such a range to an array_view. The first argument is the iterator pointing to the start of the data, the second argument is the iterator pointing to the end of the data, and the last argument is the destination array or array_view object. The next two APIs do the same for an array_view and an array respectively, except that they only take the start of the data as input and calculate the number of elements to be copied from the size of the destination. The last two APIs copy data from an array and an array_view respectively to an output iterator; the number of elements to be copied is calculated from the size of the source array or array_view object.

For copying between STL iterators and array/array_view objects, the following conditions should be satisfied:

  • The type of value stored in the source and the destination should be the same.
  • The size (i.e. number of elements) of the source must be less than or equal to the size of the destination.
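The iterator overloads differ mainly in where the element count comes from. The sketch below models the three shapes with std::vector standing in for array/array_view. The names model_copy and model_copy_out are hypothetical, the std::out_of_range exception is a choice of this sketch (C++ AMP reports its own errors), and the range-measuring overload assumes forward iterators, since measuring a pure input range would consume it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical model of the iterator-based copy shapes. A std::vector
// stands in for the array/array_view; this is not C++ AMP code.

// Shape of copy(SrcFirst, SrcLast, Dest): the range length is measured
// (assumes forward iterators) and checked against the destination size
// before anything is written.
template <typename ForwardIt, typename T>
void model_copy(ForwardIt first, ForwardIt last, std::vector<T>& dest)
{
    auto n = std::distance(first, last);
    if (n < 0 || static_cast<std::size_t>(n) > dest.size())
        throw std::out_of_range("source range exceeds destination size");
    std::copy(first, last, dest.begin());
}

// Shape of copy(SrcFirst, Dest): the element count comes from the
// destination, so the caller must ensure the source holds at least
// dest.size() elements.
template <typename InputIt, typename T>
void model_copy(InputIt first, std::vector<T>& dest)
{
    std::copy_n(first, dest.size(), dest.begin());
}

// Shape of copy(Src, DestIter): the element count comes from the source.
template <typename T, typename OutputIt>
OutputIt model_copy_out(const std::vector<T>& src, OutputIt out)
{
    return std::copy(src.begin(), src.end(), out);
}
```

Note that the single-iterator shape cannot detect a short source at all, which is one reason the two-iterator form is the safer choice when the source length is known.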

std::vector<int> v(10);

std::fill(v.begin(), v.end(), 5);

// Gets the handle to the GPU (application-defined helper)
accelerator gpuDevice = GetGpuDevice();

array<int, 2> srcArray(2, 5, v.begin(), gpuDevice.default_view);

std::vector<int> v_dest(10);

copy(srcArray, v_dest.begin());

std::vector<int> data(15);

array_view<int, 2> destArrayView(3, 5, data);

copy(v.begin(), v.end(), destArrayView);

The APIs above expect the source iterator to support all the operations required by an input iterator and the destination iterator to support all the operations required by an output iterator (you can read more about iterators on MSDN). Hence, we can also use these APIs to copy to and from raw pointers.

std::vector<int> v(10);

std::fill(v.begin(), v.end(), 5);

int* ptr = v.data();

array<int, 2> destArray(2, 5);

array_view<int, 2> destArrayView(destArray);

copy(ptr, destArrayView);

std::vector<int> v_dest(10);

int* ptr1 = v_dest.data();

copy(destArray, ptr1);

However, when you pass raw pointers to these APIs directly and compile the code with the /MDd flag, you will get a warning stating: “Function call with parameters that may be unsafe - this call relies on the caller to check that the passed values are correct. To disable this warning, use -D_SCL_SECURE_NO_WARNINGS. See documentation on how to use Visual C++ 'Checked Iterators'”.

This is the same behavior as for std::copy. Raw pointers are unchecked iterators and can inadvertently write past the bounds of a container, so the compiler warns about them. The solution is to convert the raw pointer into a checked iterator by wrapping it with the stdext::checked_array_iterator class, as shown below.

copy(destArray, stdext::checked_array_iterator<int*>(ptr1, 10));

You can read more about checked iterators on MSDN.
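To see why carrying a size alongside the pointer catches an overrun that a raw pointer cannot, here is a minimal, hypothetical bounded output iterator in standard C++. Like checked_array_iterator<int*>(ptr1, 10), it pairs the pointer with the buffer size; this sketch throws std::out_of_range instead of writing past the end. It illustrates the idea only and is not the stdext implementation, whose exact failure behavior is described in the MSDN documentation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>

// Hypothetical minimal bounds-checked output iterator over a raw array.
// Not stdext::checked_array_iterator; it supports output use only.
template <typename T>
class bounded_output_iterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    bounded_output_iterator(T* base, std::size_t size, std::size_t pos = 0)
        : base_(base), size_(size), pos_(pos) {}

    // Dereferencing yields the iterator itself so that "*it = value"
    // routes through the checked assignment below.
    bounded_output_iterator& operator*() { return *this; }

    // Checked write: refuses to touch memory past the end of the buffer.
    bounded_output_iterator& operator=(const T& value)
    {
        if (pos_ >= size_)
            throw std::out_of_range("write past the end of the array");
        base_[pos_] = value;
        return *this;
    }

    bounded_output_iterator& operator++() { ++pos_; return *this; }
    bounded_output_iterator operator++(int)
    {
        bounded_output_iterator old = *this;
        ++pos_;
        return old;
    }

private:
    T* base_;
    std::size_t size_;
    std::size_t pos_;
};
```

Used with std::copy, a three-element buffer accepts three writes; a fourth write throws instead of silently corrupting adjacent memory.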

Copying asynchronously with copy_async

For large data sizes, the copy operation may take a while to complete, and we may want to use the current thread for some other task in the application during this period rather than waiting for the copy to finish. In other words, we may wish to copy asynchronously. C++ AMP provides a complete set of global asynchronous concurrency::copy_async functions, one for each of the synchronous copy functions we have already seen. The asynchronous copy APIs have the same copying semantics as their synchronous counterparts, except that instead of void they return a concurrency::completion_future object that can be waited on. The asynchronous copy APIs are:

·   template <typename ValueType, int Rank> concurrency::completion_future copy_async(const array<ValueType,Rank>& Src, array<ValueType,Rank>& Dest)

·  template <typename ValueType, int Rank> concurrency::completion_future copy_async(const array_view<ValueType, Rank>& Src, array_view <ValueType, Rank>& Dest)

·  template <typename ValueType, int Rank> concurrency::completion_future copy_async(const array_view <const ValueType, Rank>& Src, array_view <ValueType, Rank>& Dest)

·  template <typename ValueType, int Rank> concurrency::completion_future copy_async(const array<ValueType, Rank>& Src, array_view <ValueType, Rank>& Dest)

·  template <typename ValueType, int Rank> concurrency::completion_future copy_async(const array_view <const ValueType, Rank>& Src, array<ValueType, Rank>& Dest)

·  template <typename ValueType, int Rank> concurrency::completion_future copy_async(const array_view <ValueType, Rank>& Src, array<ValueType, Rank>& Dest)

·  template <typename InputIterator, typename ValueType, int Rank> concurrency::completion_future copy_async(InputIterator SrcFirst, InputIterator SrcLast, array<ValueType, Rank> &Dest)

·  template <typename InputIterator, typename ValueType, int Rank> concurrency::completion_future copy_async(InputIterator SrcFirst, InputIterator SrcLast, array_view <ValueType, Rank> &Dest)

·  template <typename InputIterator, typename ValueType, int Rank> concurrency::completion_future copy_async(InputIterator SrcFirst, array<ValueType, Rank> &Dest)

·  template <typename InputIterator, typename ValueType, int Rank> concurrency::completion_future copy_async(InputIterator SrcFirst, array_view <ValueType, Rank> &Dest)

·  template <typename OutputIterator, typename ValueType, int Rank> concurrency::completion_future copy_async(const array<ValueType, Rank> &Src, OutputIterator DestIter)

·  template <typename OutputIterator, typename ValueType, int Rank> concurrency::completion_future copy_async(const array_view <ValueType, Rank> &Src, OutputIterator DestIter)


The following example uses an asynchronous copy:

std::vector<int> v(15);

std::fill(v.begin(), v.end(), 5);

array_view<int, 2> srcArrayView(3, 5, v);

array<int, 2> destArray(3, 5);

completion_future w = copy_async(srcArrayView, destArray);

/***** Do some other operation here ******/

// wait for copy to complete
w.wait();

This blog post only briefly introduces the asynchronous copy APIs. For a detailed explanation of the C++ AMP asynchronous model, please read asynchronous operations and continuations in C++ AMP.
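The start/overlap/wait shape of the copy_async example can be sketched in standard C++ with std::async and std::future. In this sketch std::future plays the role of concurrency::completion_future, and the function names start_copy_async and other_work are hypothetical; it shows the control flow only and says nothing about how C++ AMP schedules the actual transfer.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Standard C++ analogue of the copy_async pattern (hypothetical helper):
// start the copy on another thread and hand back a future to wait on.
// The caller must keep src and dest alive until the future is ready.
std::future<void> start_copy_async(const std::vector<int>& src, std::vector<int>& dest)
{
    return std::async(std::launch::async, [&src, &dest] {
        std::copy(src.begin(), src.end(), dest.begin());
    });
}

// Stand-in for "some other operation" done while the copy is in flight:
// sums the integers 1..n.
long long other_work(int n)
{
    std::vector<int> tmp(n);
    std::iota(tmp.begin(), tmp.end(), 1);
    return std::accumulate(tmp.begin(), tmp.end(), 0LL);
}
```

A caller starts the copy, calls other_work on the current thread, and then calls wait() on the returned future before touching the destination, mirroring the w.wait() call in the C++ AMP example.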

Transferring data without using the global copy (and copy_async) methods

Beyond using the global concurrency copy and copy_async methods for transferring data as discussed in this blog post, data transfers may happen under the covers in other ways including the following:

a) By capturing an array_view in the lambda passed to a concurrency::parallel_for_each call.

b) By passing data to an array constructor, e.g. array<int,1> a(10, myCpuContainerOfData).

c) By calling the copy_to methods on array and array_view (which under the covers call the copy functions described in this post).

d) By calling synchronize and synchronize_async on an array_view or simply accessing the array_view after a parallel_for_each call or upon destruction of the last copy of an array_view object. For details, please read synchronizing array_view in C++ AMP.

In closing

I hope you now have a better understanding of how to transfer data between the host and accelerator memory.

Although the copy APIs above make transferring data between host and accelerator easier and more convenient, keep in mind that such data transfers can be expensive and that unwanted or redundant copying can hurt performance. Hence, while writing algorithms using C++ AMP, you should try to minimize the movement of data between host and accelerator and eliminate any redundant copy operations.

I would love to read your comments below or in our MSDN forum.
