Parallel Programming in Native Code

Parallel programming using C++ AMP, PPL and Agents libraries.

concurrency::array_view - Introduction


This is the first in a series of posts where I will take you through some of the finer semantic aspects of the C++ AMP concurrency::array_view template type and its inner workings, knowledge that I hope you will find useful in writing functionally correct and well-performing C++ AMP code using array_views. Along the way, I will share key guidelines on using array_views in your C++ AMP code, and illustrate the application of these guidelines through example code.

Introduction

C++ AMP provides the array_view template type as a primary vehicle for reading and writing dense multi-dimensional rectangular data collections in your C++ AMP code. An array_view provides the abstraction of a reference to existing memory locations of a data source (such as a CPU memory pointer, an STL container like std::vector or std::array, or a concurrency::array container in host CPU or accelerator memory). This abstraction enables programmers to access the data on any accelerator_view and/or the CPU, without coding explicit data transfers.

Let us look at some example code illustrating how an array_view can be accessed on an accelerator_view and the CPU without requiring any explicit data transfers.

// matrixC = matrixA X matrixB
// matrixD = matrixC X matrixA
 
// pA, pB, pC and pD are pointers and contain the contents of 
// matrixA, matrixB, matrixC and matrixD respectively in CPU memory
 
array_view<const float, 2> matrixA(N, N, pA);
array_view<const float, 2> matrixB(N, N, pB);
array_view<float, 2> matrixC(N, N, pC);
matrixC.discard_data();
 
// array_views "matrixA" and "matrixB" are captured in this parallel_for_each
// and the data is implicitly transferred from the CPU (pA, pB) to the accelerator
parallel_for_each(matrixC.extent, [=](index<2> idx) restrict(amp) {
    MatrixMultiplyKernel(idx, matrixA, matrixB, matrixC);
});
 
array_view<float, 2> matrixD(N, N, pD);
matrixD.discard_data();
 
// array_views "matrixC" and "matrixA" are captured in the parallel_for_each.
// Since their contents are already cached on the accelerator no data transfers
// are performed by the runtime.
parallel_for_each(matrixD.extent, [=](index<2> idx) restrict(amp) {
    MatrixMultiplyKernel(idx, matrixC, matrixA, matrixD);
});
 
// The contents of "matrixC" and "matrixD" are implicitly transferred
// from the accelerator to the CPU when the array_views are accessed on the CPU
float firstC = matrixC(0, 0);
float firstD = matrixD(0, 0);

 

Deferring the management of data transfers to the runtime not only leads to concise code (in the absence of any explicit data transfers), but is also recommended for performance portability. On architectures where an accelerator is capable of accessing memory local to the CPU (such as AMD APUs or Intel Ivy Bridge processors) or to other accelerators, some or all of these data transfers may be unnecessary, and the C++ AMP runtime could elide them without requiring any changes in your application code. In its first release (Visual Studio 2012) the C++ AMP runtime does not take advantage of this ability; but the point is that when it does, your application will reap the benefits without requiring any code changes if you are using array_views. In contrast, if the data transfers are explicit in your application code, you are responsible for conditionally avoiding them when they are not required on the target accelerator.

Some performance fanatics may argue that giving up control over data transfer management will often result in sub-optimal performance. While acknowledging the need for explicit data transfer management in some advanced scenarios (served by the concurrency::array type and the concurrency::copy APIs), I dare to claim that optimal performance is mostly achievable using array_views, which, as I mentioned earlier, are also preferable for reasons of code conciseness and performance portability across diverse architectures.

array_view semantics

Sections 8.2, 8.3 and 8.4 of the C++ AMP open specification describe the semantics regarding array_view access in host and accelerator code. Let us look at some key aspects of array_view access semantics.

An array_view represents the whole or a part of its underlying data source, and can be used to read and/or write the underlying data on the host CPU and/or one or more accelerator_views (each referred to as a location henceforth).

A data source is defined as:

a) When the underlying storage referenced by an array_view is a concurrency::array (bound to host CPU or another accelerator), that array is the data source.

b) When the underlying storage referenced by an array_view is a CPU memory pointer or an STL container, the data source is the original top-level array_view created using the host memory pointer or container.

Data Coherence between array_view objects

Two array_view objects created from the same concurrency::array have the same data source, viz. the array, and the runtime maintains coherence between multiple array_views of the same data source as long as all accesses to the array_view are legal per the semantics described later in the post.

// Multiple array_views created from the same array share the same data source
// and are guaranteed to be coherent
array<int> array_data(10);
array_view<int> array_data_view1(array_data); // OK, array_data is the data source
array_view<int> array_data_section_view(array_data.section(0, 5)); // OK, array_data is the data source
array_view<int> array_data_view1_alias(array_data);                // OK, array_data is the data source
 
parallel_for_each( extent<1>(1), [=] (index<1>) restrict(amp) { array_data_section_view[2] = 15; });
 
assert(array_data_view1[2] == array_data_section_view[2]);       // OK, never fails, both equal 15
assert(array_data_section_view[2] == array_data_view1_alias[2]); // OK, never fails, both equal 15

 

However, when two different array_views are created independently on top of the same CPU memory pointer or STL container, these top-level array_views are themselves considered to be different and independent data sources and are not guaranteed to appear coherent. Coherence is guaranteed only between array_views that reference the same data source. The code below illustrates this behavior:

// Creating multiple array_views independently from a raw CPU pointer
// results in incorrectly aliased array_views which are not
// guaranteed to be coherent
int storage[10];
array_view<int> storage_view(10, &storage[0]);
 
// Note: "storage_bad_alias_view" is a top-level array_view and incorrectly aliases
// another top level array_view "storage_view" since both of them are created
// over the same raw "storage" independently. array_views aliased in this fashion
// are not guaranteed to appear coherent; i.e. modifications made through one
// array_view may not be visible when accessing the other array_view
array_view<int> storage_bad_alias_view(10, &storage[0]); 
 
parallel_for_each( extent<1>(1), [=] (index<1>) restrict(amp) { storage_bad_alias_view[7] = 16; });
parallel_for_each( extent<1>(1), [=] (index<1>) restrict(amp) { storage_view[7] = 17; });
 
assert(storage_view[7] == storage_bad_alias_view[7]); // undefined results
 
// The array_view "storage_good_alias_view" is OK as it is 
// created from the top-level array_view "storage_view"
array_view<int> storage_good_alias_view(storage_view);
 
parallel_for_each( extent<1>(1), [=] (index<1>) restrict(amp) { storage_good_alias_view[7] = 16; });
parallel_for_each( extent<1>(1), [=] (index<1>) restrict(amp) { storage_view[7] = 22; });
 
assert(storage_view[7] == storage_good_alias_view[7]); // OK, never fails, both equal 22

 

 

In the code above, “storage_view” and “storage_bad_alias_view” are considered to have different data sources (as they are independently created from raw CPU memory) though both of the array_views actually reference the same CPU memory. Consequently, when one of these array_views is modified on an accelerator_view inside the parallel_for_each invocation, the other array_view is oblivious of this modification as it is made to a different data source. Only modifications made through array_views referencing the same “data source” are synchronized, as is the case for the array_views “storage_view” and “storage_good_alias_view”.
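To make the notion of independent data sources more tangible, here is a toy model in plain standard C++. It is only a sketch under my own assumptions and says nothing about how the C++ AMP runtime is actually implemented: the names ToySource and ToyView are hypothetical, and each "data source" is modeled as host memory plus a single accelerator-side copy that is written back to the host the next time the view is read on the CPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical model of a data source: host memory plus one
// accelerator-side copy of that memory.
struct ToySource {
    int* host;
    std::vector<int> device_copy;
    bool device_is_newer = false;
    ToySource(int* p, std::size_t n) : host(p), device_copy(p, p + n) {}
};

// Hypothetical model of an array_view<int, 1>.
class ToyView {
public:
    // Creating a view directly over raw memory starts a new,
    // independent data source.
    ToyView(int* p, std::size_t n) : src_(std::make_shared<ToySource>(p, n)) {}

    // The implicit copy constructor shares the existing data source,
    // which models creating one view from another.

    // Models a write performed inside a parallel_for_each.
    void write_on_accelerator(std::size_t i, int v) {
        src_->device_copy[i] = v;
        src_->device_is_newer = true;
    }

    // Models a CPU access: pending accelerator-side changes of THIS data
    // source are written back to host memory first.
    int read_on_cpu(std::size_t i) {
        if (src_->device_is_newer) {
            std::copy(src_->device_copy.begin(), src_->device_copy.end(), src_->host);
            src_->device_is_newer = false;
        }
        return src_->host[i];
    }

private:
    std::shared_ptr<ToySource> src_;
};
```

In this model, two ToyView objects constructed independently over the same int buffer each keep their own accelerator-side copy, so a write through one is never seen by the other. A ToyView copy-constructed from an existing one shares its source and always agrees with it. Keep in mind that the model is deterministic only because it is a toy; with real array_views the outcome of the incorrectly aliased case is undefined.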

array_view access semantics

Now let us see what constitutes an array_view access. An array_view is accessed on a location through one of the following operations. The access can be a read-access, a write-access or both (read-write) depending on the array_view object type or on how and which API the array_view is accessed through.

a) Capturing an array_view in a parallel_for_each. The type of access (read or write) is determined by the captured array_view object type; capturing an array_view<const T> object constitutes a read-access while capturing an array_view<T> object constitutes a read-write access or just a write-access if the array_view’s contents have been discarded using the discard_data API.

b) Using an array_view object in a concurrency::copy or concurrency::copy_async operation. Using an array_view object as the source of the copy constitutes a read-access and its use as the copy destination constitutes a write.

c) Accessing an array_view on the CPU using the subscript operator[], function operator() or the data() member method. Again the type of the access is determined by the array_view object type; array_view<const T> access is read-only and array_view<T> access constitutes a read-write access (even if it is only read from and not written to).

d) A synchronize or synchronize_async operation constitutes a read-only access.

e) An array_view::refresh operation constitutes a write-only access.
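The list above can also be read as a small lookup table. The sketch below restates it in plain standard C++; the Access and Operation enumerations and the classify function are hypothetical names of my own and are not part of the C++ AMP API.

```cpp
// Kind of access an operation performs on an array_view (illustrative only).
enum class Access { Read, Write, ReadWrite };

// The operations listed in (a) through (e) above.
enum class Operation {
    CaptureInParallelForEach,
    CopySource,
    CopyDestination,
    CpuElementAccess,   // operator[], operator() or data()
    Synchronize,        // synchronize or synchronize_async
    Refresh
};

// const_element: true for array_view<const T>, false for array_view<T>.
// discarded:     true if discard_data was called on the view beforehand.
inline Access classify(Operation op, bool const_element, bool discarded) {
    switch (op) {
    case Operation::CaptureInParallelForEach:
        if (const_element) return Access::Read;
        return discarded ? Access::Write : Access::ReadWrite;
    case Operation::CopySource:
        return Access::Read;
    case Operation::CopyDestination:
        return Access::Write;
    case Operation::CpuElementAccess:
        // A non-const view counts as read-write even if it is only read from.
        return const_element ? Access::Read : Access::ReadWrite;
    case Operation::Synchronize:
        return Access::Read;
    case Operation::Refresh:
        return Access::Write;
    }
    return Access::ReadWrite; // not reached for valid enumerators
}
```

For instance, capturing an array_view<float> on which discard_data has been called corresponds to classify(Operation::CaptureInParallelForEach, false, true), which yields Access::Write.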

Note that an array_view just represents a reference to the data source, and creating other array_view objects from an existing array or array_view using APIs such as section, projection, view_as, reinterpret_as and the copy constructor does not constitute an access of the array_view.

Two array_views are said to overlap when they have the same data source and at least one element of the underlying data source is referenced by both the array_views.
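For one-dimensional sections the overlap definition boils down to a range intersection test. Here is a minimal sketch in plain standard C++, assuming each view is described by a data source identity, an origin and an extent; SectionModel and overlaps are hypothetical names used only for illustration and are not C++ AMP APIs.

```cpp
#include <cstddef>

// Hypothetical stand-in for a 1-D array_view section.
struct SectionModel {
    int source_id;        // identity of the data source (illustrative only)
    std::size_t origin;   // index of the first element covered
    std::size_t extent;   // number of elements covered
};

// Two views overlap when they share a data source and their half-open
// ranges [origin, origin + extent) have at least one element in common.
inline bool overlaps(const SectionModel& a, const SectionModel& b) {
    if (a.source_id != b.source_id) return false;
    if (a.extent == 0 || b.extent == 0) return false;
    return a.origin < b.origin + b.extent && b.origin < a.origin + a.extent;
}
```

With a 10-element data source, a section with origin 0 and extent 6 overlaps a section with origin 4 and extent 6 (they share elements 4 and 5), whereas sections covering elements 0 to 4 and 5 to 9 do not overlap.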

Two key rules with regard to array_view accesses are:

a) From a data coherence perspective, accessing any element of an array_view using one of the aforementioned operations constitutes an access to the entire portion of the underlying data source represented by that array_view.

b) An array_view or any other array_view overlapping it cannot be concurrently accessed if any of the accesses is a write access. Such concurrent accesses constitute a data race and have undefined behavior. Concurrent accesses can be made from multiple CPU threads or on a single thread through any of the async APIs, and should be properly synchronized for correctness.

std::vector<int> sourceData(size, 2);
array_view<int> view(size, sourceData);
array_view<int> view1 = view.section(0, size/2 + 1);
array_view<int> view2 = view.section(size/2 - 1, size/2 + 1); // Overlaps "view1"
 
concurrency::task<void> asyncCPUTask([&]() {
    view1[0]++;
});
 
// Undefined behavior: Accessing "view1" when another access to "view1" 
// (by asyncCPUTask above) may be concurrently in progress,
// even if different elements of "view1" are accessed by the concurrent
// operations
parallel_for_each(extent<1>(1), [=](index<1> idx) restrict(amp) {
    view1[1]++;
});
 
asyncCPUTask.get();
 
// OK to access "view1" after the concurrent accesses are synchronized (asyncCPUTask.get())
view1[0]++;
 
std::vector<int> outData(size);
auto fut = copy_async(view2, outData.begin());
 
// Incorrect: Illegal to access "view1" since an asynchronous copy operation
// on an overlapping array_view "view2" is in flight.
parallel_for_each(extent<1>(1), [=](index<1> idx) restrict(amp) {
    view1[1]++;
});
 
 
array_view<const int> readOnlyView1(view1);
// OK to access "readOnlyView1" in read-only fashion when the concurrent copy
// operation is accessing the overlapping array_view "view2" only for reading
int value = readOnlyView1[0];
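The two rules can be condensed into a single predicate over a pair of concurrent accesses. The following is a sketch in plain standard C++ for the one-dimensional case; AccessMode, AccessModel and is_data_race are hypothetical names of my own, not C++ AMP APIs. Per rule (a), each access is described by the whole range of the view it goes through, not by the individual element touched.

```cpp
#include <cstddef>

enum class AccessMode { Read, Write, ReadWrite };

// Hypothetical description of one access to a 1-D array_view.
struct AccessModel {
    int source_id;        // identity of the data source (illustrative only)
    std::size_t origin;   // first element covered by the accessed view
    std::size_t extent;   // number of elements covered by the accessed view
    AccessMode mode;      // how the view is accessed
};

inline bool writes(AccessMode m) { return m != AccessMode::Read; }

// Rule (b): two concurrent accesses race when the views overlap (same data
// source, at least one common element) and at least one access writes.
inline bool is_data_race(const AccessModel& a, const AccessModel& b) {
    const bool same_source = a.source_id == b.source_id;
    const bool share_element = a.extent > 0 && b.extent > 0 &&
        a.origin < b.origin + b.extent && b.origin < a.origin + a.extent;
    return same_source && share_element && (writes(a.mode) || writes(b.mode));
}
```

Taking size to be 10 in the example above, "view1" covers elements 0 to 5 and "view2" covers elements 4 to 9. The read-write access to "view1" in the parallel_for_each races with the read of "view2" by copy_async, while the read-only access through "readOnlyView1" does not.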
 

In closing

Having looked at the introductory concepts pertaining to array_view, we will dive deeper into the functional and performance aspects of array_view in subsequent posts in this series - stay tuned!

I would love to hear your feedback, comments and questions below or in our MSDN forum.
