Parallel Programming in Native Code

Parallel programming using C++ AMP, PPL and Agents libraries.

Binomial Option using C++ AMP

Binomial Option using C++ AMP

  • Comments 11

In the financial industry, Binomial Option is a one of the methods to valuate options. Another one is Black-Scholes, which is faster than Binomial Option, but is not as accurate.

main – Program entry point

Let’s start with main(). First it generates random data (financial options), then it normalizes random options (meaning updates the options’ move up or down factor based on volatility, and the discount factor based on time increment), then main calls two functions (the binomial_options_cpu function that executes on the CPU and the binomial_options_gpu function that executes on the GPU), and finally it verifies correctness of the results.

Below is the explanation both a CPU implementation and GPU implementation (using C++ AMP).

binomial_options_cpu – Regular C++ implementation (on CPU)

Here we go through each option one at a time, creating a binomial tree for each, and walking up this tree calculating the result of the value of the option. For more please see the actual code.

binomial_options_gpu - C++ AMP implementation

In our sample we have a small data set and normal data set (defined in macros at the beginning of file). The MAX_OPTIONS macro defines the number of options and we use that to specify the number of GPU threads to execute our kernel. We also use that as the size of our input data.

For the input data we are using array_view instances that wrap the original data that resides in std::vector instances. The last container (vector/array_view) is the result of the computation, while the rest are inputs to the computation.

Instead of iterating through each option, like we did for the CPU case, we use concurrency::parallel_for_each (passing our kernel algorithm as a lambda, that captures the array_view instances) with restrict(amp). Executing our kernel on the GPU will result in a GPU thread calculating the result of the value of the option, i.e. create binomial tree and walk up its tree to get the value of this option. Each thread calls binomial_options_kernel.

Below is the source of the function we just described.

void binomial_options_gpu(std::vector<float>& v_s, std::vector<float>& v_x,
	std::vector<float>& v_vdt, std::vector<float>& v_pu_by_df,
	std::vector<float>& v_pd_by_df,
	std::vector<float>& call_value)
{
	const unsigned data_size = v_s.size();

	// all data size should be same
	assert(data_size == MAX_OPTIONS); // kernel uses this macro directly
	assert((v_s.size() == data_size) && (v_x.size() == data_size) && (v_vdt.size() == data_size) && 
		(v_pu_by_df.size() == data_size) && (v_pd_by_df.size() == data_size) && (call_value.size() == data_size));

	extent<1> e(data_size);

	array_view<float, 1> av_call_value(e, call_value);
        av_call_value.discard_data();
	array_view<const float, 1> av_s(e, v_s);
	array_view<const float, 1> av_x(e, v_x);
	array_view<const float, 1> av_vdt(e, v_vdt);
	array_view<const float, 1> av_pu_by_df(e, v_pu_by_df);
	array_view<const float, 1> av_pd_by_df(e, v_pd_by_df);

	// Used as temporary buffer on GPU
	extent<1> ebuf(MAX_OPTIONS*(NUM_STEPS + 16));
	array<float, 1> a_call_buffer(ebuf);

	// Totally we have
	// number of tiles = (CACHE_SIZE * MAX_OPTIONS)/CACHE_SIZE
	// number of threads/tile = CACHE_SIZE;
	extent<1> compute_extent(CACHE_SIZE * MAX_OPTIONS);
	parallel_for_each(compute_extent.tile<CACHE_SIZE>(), [=, &a_call_buffer](tiled_index<CACHE_SIZE> ti) restrict(direct3d)
	{
		binomial_options_kernel(ti, av_s, av_x, av_vdt, av_pu_by_df, av_pd_by_df, av_call_value, a_call_buffer);
	});
	av_call_value.synchronize();
}

Before we examine the binomial_options_kernel, let’s briefly revisit our usage of array_view and also use of the single array object that we have not talked about until now.

Note that for all input std::vector objects we have created corresponding array_view wrapper objects as follows: array_view<const float, 1>. By using const before the element type we indicate that data can be discarded after the kernel finishes and there is no need to sync data back to the host – so in effect this can save unnecessary copy out and help our performance.

For our lone output parameter we have created the array_view as follows:

          array_view<float, 1> av_call_value(e, call_value);

          av_call_value.discard_data();

The method discard_data() indicates to runtime that data from backing store doesn’t need to be copied to the GPU – so in effect this can save unnecessary copy in and help our performance. After kernel execution, the result can be copied back to system memory: we can explicitly call the synchronize() member function of array_view in a CPU thread, or let the variable go out of scope (so the destructor will call the synchronize function).

Finally, we are also using an array object, because we need a temporary buffer to which we can read or write from our kernel. Observe that the array is not backed by host memory, so we don’t need to copy it between the CPU and GPU.

binomial_options_kernel - C++ AMP Kernel implementation

You’ll remember the binomial_options_kernel function that we called from the binomial_options_gpu function that we examined earlier.

The entire data is divided into chunks of tile size (CACHE_SIZE) and each thread in each tile will operate on this chunk calculating a partial result until it finishes will all chunks (at which point it has the final result). Notice that we copy the data from the captured array_view instances (which reside in global memory) into the array buffer (also in global memory), and then into tile_static arrays. Following that, repeated accesses for the computation are performed on the tile_static arrays. This was the reason we introduced tiles. Accessing tile_static memory is much faster than accessing global memory repeatedly.

 

expiry_call_value – restrict based overloaded function

Both the GPU kernel and CPU implementation call a function called expiry_call_value. The fast_math function used here uses function overload based on the restrict modifier. Hence implementing this function as restrict(amp, cpu) enables calling it from GPU kernel and/or CPU function.

//----------------------------------------------------------------------------
// GPU/CPU implementation - Call value at period t : V(t) = S(t) - X
//----------------------------------------------------------------------------
float expiry_call_value(float s, float x, float vdt, int t) restrict(amp, cpu)
{
	float d = s * direct3d::fast_exp(vdt * (2.0f * t - NUM_STEPS)) - x;
	return (d > 0) ? d : 0;
}

Download the sample

Please download the attached sample of the binomial options that we discussed here and run it on your hardware, and try to understand what the code does and to learn from it. You will need, as always, Visual Studio 11.


Attachment: BinomialOptions.zip
Blog - Comment List MSDN TechNet
  • Loading...
Leave a Comment
  • Please add 4 and 7 and type the answer here:
  • Post