Asynchronous Agents Library - Intro to Message Blocks
02 July 09 07:22 PM | Steve Gates | 0 Comments   

In my previous post I talked about the agent class. Now I will introduce the Agents Library’s message blocks, how to use them, and the fundamentals of what they do. I will cover some basics that apply to all of the message blocks, introduce the messaging APIs, and then specifically explain three blocks: unbounded_buffer, overwrite_buffer, and single_assignment.

Message Blocks

In the Agents Library we have created a set of interfaces and defined a protocol for message blocks to communicate and exchange messages. Message blocks are intended to be used to establish defined communication protocols between isolated components and develop concurrent applications based on data flow. The message blocks provided by the Agents Library can be used in conjunction with the agents class itself or separately.

For more information on data flow - http://en.wikipedia.org/wiki/Dataflow

ISource and ITarget Interfaces

ISource and ITarget are base interfaces message blocks work with. They declare functions for handling linking and exchanging messages. A source block implements ISource, a target block ITarget, and a propagator block, a block that is a source and a target, implements both. In this blog I will only be discussing the functions needed to use the message blocks provided by the Agents Library. A more comprehensive knowledge of the functions in the ISource and ITarget interfaces are required when creating your own message blocks; this will be discussed in a later post.

Linking/Unlinking and Creating Messaging Networks

Often it is desirable to link message blocks together to create a network. One such reason for building messaging networks is to create data flow pipelines.  The following three functions, declared on the ISource interface, are used to connect source blocks to target blocks:

void link_target(ITarget<_Type> * _PTarget); – adds a link to the specified target. This causes the source block to offer any of its messages to the target until it is unlinked.

void unlink_target(ITarget<_Type> * _PTarget); – removes an existing link with the specified target, once unlinked no more messages will be presented to the target.

      void unlink_targets(); – removes all links with any target blocks.

Here is a simple example connecting and then disconnecting two unbounded_buffers:

            unbounded_buffer<int> buffer1, buffer2;

      buffer1.link_target(&buffer2);

      ...

      buffer2.unlink_target(&buffer2);

Some blocks have restrictions on the number of targets allowed; invalid_link_target exception is thrown in these cases.

Basic Message Propagation

Messages are exchanged between blocks using light weight tasks (LWTs) on a scheduler, because of this message propagation works cooperatively with any other work in the same scheduler. This means long running tasks that never block or yield can slow down or cease the forward progress of message delivery. Thus is the nature of working in a cooperative environment.

All of the message blocks built into the Agents Library guarantee in-order message delivery. It is possible to create your own blocks which do not preserve order, however all of built in ones do.

Messaging APIs - send, asend, receive, and try_receive

Once creation of messaging networks and propagation of messages is understood the only other thing needed to start programming is how to insert and remove messages directly from individual blocks.

Two global functions are used to create and insert messages:

            bool send(ITarget<_Type> &_Trg, const _Type &_Data);

      bool asend(ITarget<_Type> &_Trg, const _Type &_Data);

Each of these takes a target and the data to transmit. Send synchronously originates a message with a target, whereas asend asynchronously does. This means a send call will block until the target either takes the message or declines it. Send returns true if the message was delivered and false otherwise. Asend will not block until the target takes the message, it offers the message and immediately returns. A return value of true means the target has accepted the message and will eventually take it, otherwise false means the target either declined the message or postponed the decision on whether or not to take it until later.

Likewise here are the two global functions for removing or extracting messages:

template <class _Type>

_Type receive(ISource<_Type> & _Src, unsigned int _Timeout = COOPERATIVE_TIMEOUT_INFINITE);

 

template <class _Type>

bool try_receive(ISource<_Type> & _Src, _Type & _value);

 

Receive takes a source block to extract a message from and an optional timeout, the extracted value is the return value. If a message is currently not available in the source then receive will block until one is, optionally a timeout also may be specified. Correspondingly try_receive will only obtain a message if the source has one at that instance, otherwise it returns immediately. A return value of true on try_receive indicates a message was received, false means one was not.

All of these functions when blocking do so cooperatively with the Concurrency Runtime. To learn more about working cooperatively with the Concurrency Runtime take a look at the series of posts on Synchronization with the Concurrency Runtime.

unbounded_buffer

Unbounded_buffer is one of the most basic messaging blocks; it acts very similar to a queue. As its name suggests unbounded_buffer can store any number of messages, limited only by memory, collected from its source links (links to blocks that have the unbounded_buffer as a target). Unbounded_buffer always accepts all messages offered to it. Messages propagated to an unbounded_buffer are collected into a queue and then offered one at a time to each of its targets. Each message in an unbounded_buffer will only be given to one of its targets based on link ordering. This means targets of an unbounded_buffer compete for messages.

Unbounded_buffer provides two utility functions:

            bool enqueue(_Type const& _Item);

      _Type dequeue();

Each of these is equivalent to send and receive respectively, and basically are wrappers around them.

unbounded_buffer<int> buffer;

      // These are equivalent.

      buffer.enqueue(1);

      send(buffer, 1);

 

      // And so are these.

      int value = buffer.dequeue();

      int value = receive(buffer);

Unbounded_buffers are excellent for producer/consumer patterns. In a previous post Introduction to Asynchronous Agents Library the FindString agents sample makes use of unbounded_buffers to communicate between the individual agents.

overwrite_buffer

Essentially overwrite_buffer is a simple broadcaster. Overwrite_buffer is a message block that holds one message at a time, very similar to a variable. Every time an overwrite_buffer receives a message it offers a copy of it to any of its targets and then stores the message internally, replacing any previously stored message. The important thing to note here is there is no competition for data. Every time a message comes into an overwrite_buffer it is offered to all of its targets, then afterwards it can be overwritten at any point.

Overwrite_buffer also provides two utility functions:

bool has_value() const ;

_Type value();

Has_value returns true or false indicating whether or not the overwrite_buffer has received its first message. Value is a wrapper around receive. Has_value can be used to check if overwrite_buffer has a message, if overwrite_buffer does then calling value or receive will not be a blocking call.

Uses of overwrite_buffer include tracking state, continuously monitoring status, or broadcasting messages. In the Agents Library the agent class takes advantage of this using an overwrite_buffer internally to track its state. Calling the start, cancel, done, and wait functions work with agent’s the internal overwrite_buffer.

single_assignment

Single_assignment behaves very similar to overwrite_buffer, except it will only accept one message from any of its sources. Once single_assignment accepts a message all subsequent offered messages will be declined. Just like overwrite_buffer, single_assignment gives a copy of the message to each of its targets; there is no competition for data.

Single_assignment provides the following two utility functions:

            bool has_value() const ;

      _Type const & value();

These perform exactly same as in overwrite_buffer.

Single_assignments are useful when a single value is read by many, similar to a const variable. A single_assignment can also be used to pick the first available message from a group of blocks. The offered message will be accepted and all others will be declined. Used in this form single_assignment can act as a choice or chooser from a group of blocks.

In following posts I will introduce more of the message blocks and provide sample applications.

Synchronization with the Concurrency Runtime - Part 3
04 June 09 09:51 PM | vinodsu | 0 Comments   

In my previous posts, I addressed the motivation behind using concurrency runtime’s synchronization primitives and also introduced Critical Section and reader writer lock. In this blog, I will cover concurrency runtime’s event.

Event

This is a bi-state type class which, unlike Critical Section or Reader Writer Lock, does not protect access to shared data. Events synchronize flow of execution and use concurrency runtime’s facilities to enable cooperative schedule of work. They behave similar to Win32 manual-reset event. The main difference between the concurrency runtime’s event and Win32 event is that the concurrency runtime’s event are designed to cooperatively yield to other cooperative tasks in the runtime when blocked in addition to preempting whereas Win32 events are, by design purely pre-emptive in nature.

 

Example:

 This example enables the scheduler to create two threads and then calls DemoEvent function that takes event class and a Win32 manual-reset event class as template parameters. The Demo function first creates several tasks that simulate some work and then wait for a shared event to become signaled.

 

Special thanks to Rick Molloy for sharing this example.

// event.cpp : Defines the entry point for the console application.

//

// compile with: /EHsc

#include <windows.h>

#include <concrt.h>

#include <concrtrm.h>

#include <ppl.h>

 

 

using namespace Concurrency;

using namespace std;

 

class WindowsEvent

{

    HANDLE m_event;

public:

    WindowsEvent()

        :m_event(CreateEvent(NULL,TRUE,FALSE,TEXT("WindowsEvent")))

    {

    }

 

    ~WindowsEvent()

    {

        CloseHandle(m_event);

    }

 

    void set()

    {

        SetEvent(m_event);

    }

 

    void wait(int count = INFINITE)

    {

        WaitForSingleObject(m_event,count);

    }

};

 

template<class EventClass>

void DemoEvent()

{

    EventClass e;

    LONG volatile taskCtr = 0;

 

    //create a taskgroup and schedule multiple copies of the task

    task_group tg;

    for(int i = 0;i < 8; ++i)

        tg.run([&e, &taskCtr]{

            //Simulate some work

            Sleep(100);

 

            printf_s("\tTask %d waiting for the event\n", InterlockedIncrement(&taskCtr));

 

            e.wait();

    });

 

    Sleep(1000);

 

    printf_s("  Setting the event\n");

 

    //Set the event

    e.set();

 

    //wait for the tasks

    tg.wait();

}

 

void main ()

{

    // Create a scheduler that uses two threads.

    CurrentScheduler::Create(SchedulerPolicy(2, MinConcurrency, 2, MaxConcurrency, 2));

 

    printf_s("Cooperative Event\n");

    DemoEvent<event>();

 

    printf_s("Windows Event\n");

    DemoEvent<WindowsEvent>();   

}

 

Sample Output:

Cooperative Event

        Task 1 waiting for the event

        Task 2 waiting for the event

        Task 3 waiting for the event

        Task 4 waiting for the event

        Task 5 waiting for the event

  Setting the event

Windows Event

        Task 1 waiting for the event

        Task 2 waiting for the event

  Setting the event

        Task 3 waiting for the event

        Task 4 waiting for the event

        Task 5 waiting for the event

We observe that when using cooperative event, we execute all 5 tasks, each task when waiting on the event that is not set, cooperatively yields so that another task can be scheduled and run in the thread’s quantum. When using windows event, we observe that the 2 tasks scheduled block the thread until the event is set.

Introduction to Asynchronous Agents Library
03 June 09 11:08 PM | Steve Gates | 1 Comments   

One of the native concurrency components coming in Visual Studio 2010 is the Asynchronous Agents Library. The goal of the Agents Library is to improve developer productively and enable developers to take advantage of concurrency with an agent-based message passing model to build isolated and composable components. In this blog, I will provide an introduction to the features available in the Asynchronous Agents Library and go into more detail about the agent class itself with an example program.

An Agent Based Model

Programming with an agent-based model allows concurrency to be inherently baked into the program from the start. Separating parts of a program into composable and reusable pieces that follow well-defined boundaries to communicate with message passing can tolerate latencies and effectively use parallel resources. By avoiding sharing memory when possible and focusing on data dependencies, scaling performance can be obtained using higher level abstractions, like agents, for parallelism. For more information on agent based models take a look here:

http://en.wikipedia.org/wiki/Actor_model#Fundamental_concepts

Overview of Agents Library Features

The Agents Library can be broken down into three basic parts:

The Agent Class – The agent class itself is intended for course grained parallelism/components that handle larger computationally intensive tasks or collections of smaller tasks. Fundamentally, agents are tasks that have an observable lifecycle and communicate with other agents by using message passing. Agents are NOT intended to be used for fine-grained parallelism; for that, the patterns and constructs in the Parallel Patterns Library are better suited.

Message Blocks – To enable mechanisms for communicating between asynchronous agents and isolated program components, we provide a collection of various message blocks:

  • unbounded_buffer
  • overwrite_buffer
  • single_assignment
  • call
  • transformer
  • timer
  • join
  • multitype_join
  • choice

For example, an unbounded_buffer works very similar to a queue. It collects messages from each of its source blocks and offers them in order to its target blocks.

In addition, infrastructure for easily creating your own message blocks is included.

Coordination Primitives – For inserting and removing messages from a group of messaging blocks (a messaging network), we include a set of APIs:

  • send
  • asend
  • receive
  • try_receive

In later blogs, I will dive into the specifics of each message block and coordination primitive explaining how and when to use them.

Agent Class in More Detail

Essentially an agent is a light weight task (LWT) with observable state and cancellation. The lifecycle of an agent is as follows:

AgentStateDiagram

To see how an agent passes through the states, let’s look at the simplest case creating, starting, and waiting on an agent:

    MyAgent myagent();

myagent is now in the ‘agent_created’ state.

    myagent.start();

Calling start() moves the agent to ‘agent_runnable’.

    agent::wait(&myagent);

Wait blocks until myagent reaches one of the final states (agent_cancelled, agent_done, or agent_faulted).

The agent class contains a protected pure virtual run() method which is called when the agent is executed:

    virtual void run() = 0;

When start() is called a light weight task is scheduled and the function will immediately return; once there are available resources the scheduler will start executing the agent task, which will call the run() method when it executes. When an agent is finished, it must call the done() function to signal completion of its work. Commonly, done() is called at the end of the run() function; however, in some circumstances, it may be desirable to decouple the lifetime of the agent from the execution of the run method, such as if the agent does work asynchronously.

A single agent can be waited on by using the wait function, or multiple agents using the wait_for_one and wait_for_all functions on an array of agents. Waiting on an agent is one way of observing an agents state or status. The two following methods allow for directly querying the status and obtaining a message block to receive from:

    agent_status status(); 
    ISource<agent_status> * status_port();

Agents also can be canceled:

    bool cancel();

If an agent has not started yet and cancel is called it will not be run and its status will be canceled, however if the agent has already started then calling cancel has no effect. Already started agents can’t be stopped.

Agents Sample:

To better illustrate agents, consider this implementation of a find string. The program works by searching a directory recursively for files matching an expression and then searching each file for a given string, very similar to ‘findstr’. The program is separated into several agents with specific jobs. One agent is for locating matching files, several for parsing the files looking for the specified string, and one for writing out found matches to the console. This sample also uses unbounded_buffer, asend, and receive.

FindStringDiagram

FileFinder.h:

This agent takes a string containing the type of files to search for. Each matching file found is sent to an unbounded_buffer m_pFileNames which the FileReader agents receive from.

Here is the run() method. It searches each directory recursively for matching files. A special end message is also used at the end to signal there are no more files. Notice at the end of run() the done() function is called.

void run()
{
    wchar_t currentDirectory[MAX_FILE_NAME] = L".\\";
    FindFilesRecursively(currentDirectory);

    // Send END_MSG to signal done processing.
    asend(m_pFileNames, (wchar_t *)END_MSG);
    done(agent_done);
}

Here is the code where matching file names are sent to the unbounded_buffer:

while(moreFiles)
{
    // Copy file name, receiver will be responsible for cleanup.
    pFileName = new wchar_t[MAX_FILE_NAME];
    wcscpy_s(pFileName, MAX_FILE_NAME, pDirectory);
    wcscat_s(pFileName, MAX_FILE_NAME, findFileData.cFileName);

    //
    // Send the filename to the unbounded_buffer.
    //
    asend(m_pFileNames, pFileName);
    moreFiles = FindNextFile(hFind, &findFileData);
}

FileReader.h:

The FileReader agent keeps receiving file names to parse from an input unbounded_buffer and sends any matches to another unbounded_buffer which the ConsoleWriter will receive from.

Here is the run() method. Notice the while loop that keeps receiving messages, until the end message is reached, from the same unbounded_buffer the FileFinder agent was using.

void run()
{
    wchar_t *pFileName;

    //
    // Repeatedly pull filename messages out of the 
    // unbounded_buffer to parse until the END_MSG is 
    // reached. Note this will ConcRT aware block until
    // a message is avaliable in the buffer.
    //
    size_t currentLineNum;
    const size_t lineSize = 2000;
    wchar_t line[lineSize];
    while((int)(pFileName = receive(m_pFileNames)) != END_MSG)
    {
        currentLineNum = 1;
        wifstream inputFile;
        inputFile.open(pFileName, ifstream::in);
        while(inputFile.good() 
            && inputFile.getline(line, lineSize))
        {
            if(wcsstr(line, m_pSearchString) != NULL)
            {
                //
                // Create a new message payload and send it to 
                // the unbounded_buffer for the ConsoleWriter 
                // to receive from.
                //
                asend(m_pFoundBuffer, new Payload(pFileName, 
                    wcsnlen(pFileName, MAX_FILE_NAME)+1,
                    currentLineNum,
                    line,
                    wcsnlen(line, lineSize)+1));
            }

            ++currentLineNum;
            line[0] = '\0';
        }
        inputFile.close();
        delete pFileName;
    }

    // Resend the END_MSG for any other FileReaders.
    asend(m_pFileNames, (wchar_t *)END_MSG);
    done(agent_done);
}

ConsoleWriter.h:

This agent receives messages containing information about found matches and prints the information to the console.

Here is a snippet from its run method. It loops receiving messages from the unbounded_buffer the FileReaders send to.

//
// Keep receiving messages to print to the console until
// the END_MSG is received. Note this will ConcRT aware
// block until a message is avaliable in the buffer.
//
while((int)(pPayload = receive(m_pFoundBuffer)) != END_MSG)
{
    SetConsoleTextAttribute(hConsole, FOREGROUND_BLUE 
        | FOREGROUND_GREEN | FOREGROUND_RED | FOREGROUND_INTENSITY); 
    printf("%ls:%lu:", pPayload->m_pFileName, 
        (unsigned long)pPayload->m_lineNumber);
    SetConsoleTextAttribute(hConsole, FOREGROUND_BLUE 
        | FOREGROUND_GREEN | FOREGROUND_RED);
    printf(":%ls\n", pPayload->m_pLine);
    delete pPayload;
}

Payload.h:

Defines a class used to pass messages between FileReaders and the ConsoleWriter. It holds the file name a match was found in, the line number, and the line itself.

FindString.cpp

Driver of the program creates, starts, and then waits on all the agents.

Here is the code to create the unbounded_buffers:

    unbounded_buffer<wchar_t *> fileBuffer;
    unbounded_buffer<Payload *> foundBuffer;

Creation of the agents:

// Agent to find files.
FileFinder fileFinder(pFilePattern, &fileBuffer);
fileFinder.start();

// Agents to handle parsing files.
FileReader **ppFileReaderAgents = new FileReader*[numAgents];
for(unsigned int i = 0; i < numAgents; ++i)
{
    ppFileReaderAgents[i] = new FileReader(&fileBuffer, 
        pStringPattern, &foundBuffer);
    ppFileReaderAgents[i]->start();
}

// Agent to handle writing to the console.
ConsoleWriter consoleWriter(&foundBuffer);
consoleWriter.start();

Waiting on the agents to finish:

// Wait for agents to finish.
agent::wait(&fileFinder);
for(unsigned int i = 0; i < numAgents; ++i)
{
    agent::wait(ppFileReaderAgents[i]);
    delete ppFileReaderAgents[i];
}
delete [] ppFileReaderAgents;

// Now that all other agents have finished signal the ConsoleWriter
// it can finish.
send(foundBuffer, (Payload *)END_MSG);
agent::wait(&consoleWriter);

In this program, I created agents to each handle large chunks of isolated work and I/O. Notice how parallelism is built in to the program by using higher levels of abstraction from the beginning. I created a pipeline of agents that work concurrently together with almost no knowledge of each other. Each agent takes in messages, performs some action, and then accordingly sends a message. Since each agent can be running at the same time I can exploit any available cores on the system; notice how there are many FileReader agents used, this means the program will be parsing multiple files concurrently at the same time. By designing my program in this fashion I can easily add, remove, or modify individual agents without needed to restructure anything else. I also avoided having to deal explicitly with locks and other difficult forms of synchronization.

To see all the code and play around with this sample download the project and source files here:

Code Samples

Example command line execution of this program is:

FindString.exe agents *.txt

In future posts, I will go into more detail about each message block the Asynchronous Agents Library provides, how to use them, the coordination primitives, and how to take advantage of our infrastructure and easily construct your own message blocks.

Samples posted for the Parallel Pattern Library and Concurrency Runtime
02 June 09 02:40 PM | rickmolloy | 1 Comments   

Following the release of Visual Studio 2010, we've just published a set of sample applications for using the Parallel Pattern Library, the Agents Library and the Concurrency Runtime on code gallery.  These supplement the documentation and samples provided on msdn for Beta1.

The samples include using task parallelism with PPL, a few based examples of using the Agents Library (Dining Philosophers, an example using choice) and a find in files example (which will be blogged about here in the next day or so).  There are also some examples of using the Concurrency Runtime itself, a simple event based application demonstrating cooperative blocking and UMS threads and a task based example (fibonacci) which shows the Concurrent Suballocator.

As always your feedback here or in the forums is welcome.

-Rick

Auction Simulation written in the Asynchronous Agents Library
30 May 09 12:11 AM | rickmolloy | 1 Comments   

One of our architects, Niklas Gustafsson has just posted an actor based auction simulation written in the Asynchronous Agents Library. If you follow the blog trail back, you'll see that this was implemented originally in scala and that Matthew Podwysocki has implemented this in F#. Niklas also alluded to, but hasn't shared yet an implementation built in Axum.  I find it fascinating to compare and contrast the implementations for similarities and differences.

It’s also worth stressing his concluding remarks here which really highlight why we’re providing an asynchronous message passing and actor based programming model:

As a parting thought, I want to draw your attention to one simple fact: we just wrote an application with 17 agents working together on a problem in parallel, interacting in non-trivial patterns updating state and arriving at a shared understanding of “their world” without a single lock, mutex, or critical section.

The ability to write such complex and concurrent code without dealing with the headaches of managing synchronization primitives is incredibly exciting (at least to me it is), so go check this out.

-Rick

Implementing Dining Philosophers with the Agents Library
28 May 09 12:34 PM | rickmolloy | 1 Comments   

The latest issue of msdn magazine includes an article that I wrote which illustrates implementing the Dining Philosophers purely in message passing and without using any explicit locking.  If you have Visual Studio 2010 Beta1 installed, you can also download the code and build the project, even if you don't you can browse the code online.

For folks that are curious about how locks are avoided, I'll state that the chopsticks themselves are messages and rather than the classic solution with semaphores, the philosopher uses the join message block to receive both messages from the table when it is time to pick them up. i.e. the code looks like this:

vector<Chopstick*> PickupChopsticks()

{

    //create the join

    join<Chopstick*,non_greedy> j(2);

    m_LeftChopstickProvider->link_target(&j);

    m_RightChopstickProvider->link_target(&j);

 

    //pickup the chopsticks

    return receive(j);

}

I'd encourage you to read the article and provide feedback either here or in our forums

-Rick

What's new in the Concurrency Runtime and the Parallel Patterns and Asynchronous Agents Libraries
21 May 09 01:23 PM | rickmolloy | 2 Comments   

Visual Studio 2010 Beta1 has been released, and it is a full install version.  The team has been busy, busy busy since the CTP release last fall to deliver most of the APIs and objects we’ve blogged about here into Beta1.  So I wanted to take a moment to summarize what’s new in the Beta for the Parallel Pattern Library, the Asynchronous Agents Library and the Concurrency Runtime. 

You can find the official reference documentation and walkthroughs in the MSDN Library here.

What’s new in the Parallel Patterns Library

The Parallel Patterns Library (PPL) provides an imperative programming model that promotes scalability and ease-of-use for developing concurrent applications.  In general the library’s surface area remains largely unchanged, with a few high impact additions.  You will also find it more stable, more robust, faster, and more scalable – lots of goodness!

What’s new in the Asynchronous Agents Library

The Asynchronous Agents Library (or just Agents Library) provides a programming model that enables you to increase the robustness of concurrency-enabled application development. The Agents Library is a C++ template library that promotes an actor-based programming model and in-process message passing for fine-grained dataflow and pipelining tasks. The Agents Library builds upon the scheduling and resource management components of the Concurrency Runtime.

Here’s a summary of what’s we’ve added and revised in the Agents library:

  • Added the choice, join and multitype_join message blocks to allow users to wait on a set of messages; choice waits for any message, join waits for all messages of a single type, and multitype_join waits for all messages and allows for multiple different types.
  • Renamed the transform message block to transformer (we didn’t want Concurrency::transformer to conflict with std::transform)
  • Added functor support to the pipeline message blocks call and transformer
  • Building custom message blocks is now easier due to some refactoring. 

What’s new and improved in the Concurrency Runtime itself

The underlying Concurrency Runtime has undergone significant feature improvements and performance enhancements, many of which we have been discussing already on this blog.  These changes will have improvements to everything built on top of the runtime, including the Parallel Patterns Library and the Agents Library.

Call to Action

Download and install Microsoft Visual Studio 2010 Beta 1 today.  Impress your friends and coworkers by being among the first to learn and use our libraries.  Provide early and critical feedback on our forum that will shape the way that these libraries are ultimately released. 

 

See this blog post on the value of your feedback:  On Achieving Perfection –or– Why We Love Your Feedback (and Why You Can Love Giving It).

 

Happy coding!

 

Debugging PPL Tasks
15 May 09 12:20 PM | dmccrady | 1 Comments   

You have read several articles here about how the Microsoft Concurrency Runtime enables task-based parallelism.  Now Daniel Moth has an article on his blog that describes the debugging functionality offered by VS 2010's Parallel Tasks window.  The parallel tasks window works with both native PPL tasks and with .NET TPL tasks.  Check it out. 

Synchronization with the Concurrency Runtime - Part 2
14 May 09 07:46 PM | vinodsu | 2 Comments   

In my previous post, I addressed the motivation behind using concurrency runtime’s synchronization primitives and also introduced Critical Section. In this blog, I will cover concurrency runtime’s reader writer lock.

Reader Writer Lock

This class enables multiple threads to read from a shared resource at the same time but only allows one thread to write to it at a time. They share many characteristics with concurrency runtime’s critical section, reader writer locks are non-reentrant and block cooperatively. The reader writer lock resembles the Win32 Slim reader/writer locks (SRWLock).  Reader writer lock performs better than critical section in read-mostly environments.

Similarity to Win32 Slim reader/writer locks:

-          Can be used only by threads of a single process.

-          The reader writer lock can be owned by multiple reader-threads or only one writer thread at any time.

-          Non-reentrant.

-          Do not support upgrades or downgrades.

Differences with Win32 Slim reader/writer locks:

-          The concurrency runtime’s reader writer lock object guarantees that the order of exclusive (writer) lock-ownership is on a first-come, first-serve basis.

-          There is no need to explicitly call Initialization of resources before use of the concurrency runtime’s reader writer lock and release after the use of the reader writer lock.

-          Cannot specify spin count for the concurrency runtime’s reader writer lock object.

-          The concurrency runtime’s reader writer lock enforces cooperative blocking where they yield to other cooperative tasks in the runtime when blocked.

-          The concurrency runtime’s reader writer locks give writer preference over readers; i.e. if there are readers and writer(s) simultaneously waiting for the lock, the lock would be handed over to the first writer in queue.

-          Exceptions are thrown by the concurrency runtime’s reader writer lock object; on recursive calls, or if unlock is called when the lock is not held, or if a lock is destroyed when being held.

 

Example:

Given below is a code sample illustrating reader_writer_lock using 4 readers and 1 writer. The readers output the value of the shared data and the writer updates the value of the shared data and outputs it.

 

// reader_writer_lock.cpp

// compile with: /EHsc

#include <ppl.h>

#include <stdio.h>

#include <windows.h>

 

using namespace std;

using namespace Concurrency;

 

//number of iterations each thread performs

static const int NUM_ITERATIONS = 2;

 

//the shared data that needs protection from race/tearing

static unsigned int sharedData = 0;

 

//Demonstrates the use of the reader lock

void Reader(reader_writer_lock* pRWLock)

{

    for( int i = 0; i < NUM_ITERATIONS; ++i)

    {

        //use the reader lock

        pRWLock->lock_read();

 

        printf_s("Reading %d\n", sharedData);

 

        //Sleep for some time, this is to simulate potential work done while holding the lock

        Sleep(100);

 

        //release the lock

        pRWLock->unlock();

    }

}

 

//Demonstrates the use of the writer lock

void Writer(reader_writer_lock* pRWLock)

{

    for( int i = 0; i < NUM_ITERATIONS; ++i)

    {

        //use the writer lock

        pRWLock->lock();

 

        printf_s("\tWriting %d\n", ++sharedData);

 

        //Sleep for some time, this is to simulate potential work done while holding the lock

        Sleep(100);

 

        //release the lock

        pRWLock->unlock();

    }

}

 

int main()

{

    reader_writer_lock rwlock;

 

    //performs reader writer operations in parallel

    parallel_invoke(

        [&] { Reader(&rwlock); },

        [&] { Reader(&rwlock); },

        [&] { Reader(&rwlock); },

        [&] { Reader(&rwlock); },

        [&] { Writer(&rwlock); }

    );

 

    return 0;

}

 

 

Sample output:

Reading 0

Reading 0

Reading 0

        Writing 1

Reading 1

Reading 1

Reading 1

Reading 1

        Writing 2

Reading 2

 

An interesting point to note here is that even though we use 4 readers, we notice that only 3 readers output their value and the lock is taken by the writer. This happens because the lock prefers writers. We do not guarantee the order of task execution but if such a guarantee is required, you could consider using events, which I will cover in the next blog post.

 

Synchronization with the Concurrency Runtime - Part 1
22 April 09 09:21 PM | vinodsu | 3 Comments   

In a concurrent world multiple entities work together to achieve a common goal. A common way to interact and coordinate is to use shared data.  However, shared data must be accessed carefully. This can be achieved through synchronization, primarily using:

i)                    Blocking methods such as locks and mutexes  

ii)                   Non-blocking methods such as lock-free programming techniques.

I will talk about the synchronization using blocking methods within a process, using constructs provided as part of the concurrency runtime and exposed through the Parallel Pattern Library (PPL). In this blog, I will address the concurrency runtime’s critical section and will cover reader writer lock and events in subsequent blog posts.

For a general picture of the native concurrency runtime, and high level roles of each of its components please refer to this post.

 

Motivation

Goals of the concurrency runtime’s synchronization primitives:

1.       Simple APIs

Unlike their Win32 equivalent, concurrency runtime’s synchronization primitives don’t have C-style initialization and release/destroy type of resource management calls. The exposed interfaces are simple and conform to the C++0x standards and the synchronization objects throw meaningful exceptions on certain illegal operations.

2.       Block in a cooperative manner

The synchronization objects are cooperative in nature, in that they yield to other cooperative tasks in the runtime in addition to preempting.  For an illustration of this scenario, refer to this post.

 

Critical Section

This represents a non-reentrant, cooperative mutual exclusion object that uses concurrency runtime’s facilities to enable cooperative scheduling of work when blocked. This class satisfies all Mutex requirements specified in C++0x standards. The concurrency runtime’s critical section provides a C++ façade as compared to its C-styled Win32 equivalent:  Windows CRITICAL_SECTION

Similarity to the Win32 CRITICAL_SECTION:

-          Can be used only by threads of a single process.

-          The critical section object can only be owned by one thread at a time.

Differences with Win32 CRITICAL_SECTION:

-          The concurrency runtime’s critical sections are non-recursive. Exceptions are thrown upon recursive calls.

-          The concurrency runtime’s critical section object guarantees that threads waiting on a critical section acquire it on a first-come, first-serve basis.

-          There is no need to explicitly call Initialization/allocation of resources before use of the concurrency runtime’s critical section and release resources after the use of the critical section.

-          Cannot specify spin count for the concurrency runtime’s critical section object.

-          The concurrency runtime’s critical section enforces cooperative blocking where they yield to other cooperative tasks in the runtime when blocked.

-          Exceptions are thrown by the concurrency runtime’s critical section object; on unlock calls when the lock is not held, or if a lock is destroyed when being held.

 

Example:

The sample below alternates between printing to standard output from FunctionA and FunctionB.

 

// critical_section.cpp

// compile with: /EHsc

#include <ppl.h>

#include <stdio.h>

#include <windows.h>

 

using namespace std;

using namespace Concurrency;

 

//number of iterations each thread performs

static const int NUM_ITERATIONS = 5;

 

//Demonstrates use of critical section

void FunctionA(critical_section* pMutex)

{

    for( int i = 0; i < NUM_ITERATIONS; ++i)

    {

        //use exclusive lock

        pMutex->lock();

 

        printf_s("A %d\n", i);

 

        //Sleep for some time, this is to simulate potential work done while holding the lock

        Sleep(100);

 

        //release exclusive lock

        pMutex->unlock();

    }

}

 

//Demonstrates use of critical section

void FunctionB(critical_section* pMutex)

{

    for( int i = 0; i < NUM_ITERATIONS; ++i)

    {

        //use exclusive lock

        pMutex->lock();

 

        printf_s("\tB %d\n", i);

 

        //Sleep for some time, this is to simulate potential work done while holding the lock

        Sleep(100);

 

        //release exclusive lock

        pMutex->unlock();

    }

}

 

int main()

{

    critical_section mutex;

 

    //call FunctionA and FunctionB in parallel

    parallel_invoke(

        [&] { FunctionA(&mutex); },

        [&] { FunctionB(&mutex); }

    );

 

    return 0;

}

 

 

Sample output:

A0

        B0

A1

        B1

A2

        B2

A3

        B3

A4

        B4

 

Note: There is a possibility that the order may be swapped, where B gets the lock before A; the locks are handed out on a first-come, first-serve basis, it’s a race to try and get the lock at the beginning. One way of guaranteeing consistency is using events.
Resource Management in the Concurrency Runtime – Part 1
10 March 09 11:03 AM | Atilla Gunal | 4 Comments   

In my previous blog post, I gave an introduction to the native concurrency runtime in Visual Studio 2010. For a general picture of the native concurrency runtime, and high level roles of each of its components please refer to this post. Today, I will talk about the lowest level in that component stack: the Resource Manager (RM).

Resource Manager Responsibilities

In short, there are two main responsibilities of the Resource Manager:

1-      Initial Allocation: Allocating resources to schedulers when schedulers are created.

2-      Dynamic Migration: Constantly monitoring utilization of resources by schedulers, and dynamically migrating resources between schedulers.

This post will focus on the initial allocation and the dynamic migration will be detailed in my next blog post.

What is a Resource Anyway?

Before going into details about the “Resource Manager” for the Concurrency Runtime, let’s define what is meant by a resource in this context. A resource is the unit of the processing power of the machine which is owned by the RM and used by schedulers. A resource is abstracted in the form of a “Virtual Processor” in Concurrency Runtime. At a given time each virtual processor is associated with a thread running on a processor core (see below - Resource Boundary). The RM creates virtual processors and hands them to schedulers; meanwhile, scheduler instances use them to execute tasks in parallel.


Why do we need a Resource Manager?

Clearly, we wouldn’t need a Resource Manager if it was possible for a single scheduler to fit all the needs of parallel tasks. However, there are a couple of reasons why it is feasible for an app to use multiple schedulers:

·         The hotter the cache, the faster the processing: Virtual processors have affinity to group of processor cores; the sockets or nodes. The more they do related work, the more likely they are to find that the data they need is already in the cache. 

o   Assume that your app uses a bunch of libraries. It is highly probable that work performed by a given library is related. Having a scheduler per library will keep the cache warm and enable faster processing.

·         Not all schedulers are same: A scheduler instance is policy driven. This means different schedulers might have different characteristics.

o   Assume that a game engine is to be implemented. As an example, it could have an image rendering component, a sound rendering component and an object state update component. All these will execute in parallel; however, we might want state update to have a higher thread priority and would like to decouple sound processing tasks from image rendering tasks. Having multiple schedulers will enable us to customize scheduler behavior and separate task execution.

Now given that there can be multiple schedulers in a process for the reasons mentions above, let’s talk about if a resource manager was not present what would happen:

1-      Schedulers would be free to use any resource

2-      Schedulers would implement a way of sharing resources

Apparently the first one would end up with a non-optimum utilization of resources; some processors cores would be over utilized and some would be underutilized due to no communication between schedulers. On the other hand, the second one would end up with a complex design contradicting with design patterns like ‘Low Coupling’ or ‘High Cohesion’ (http://en.wikipedia.org/wiki/GRASP_(Object_Oriented_Design) ). Thanks to RM that it helps resources to be utilized evenly meanwhile still keeping the design of a scheduler simpler. By the way it is important to keep the design of the scheduler simpler because users can also implement a custom scheduler on top of RM (see below for more on this).

Additionally, the scheduler implemented as part of the concurrency runtime is a general purpose scheduler; for specialty scenarios, it may be possible to implement a scheduler tuned for a particular purpose, and it’s important that such a scheduler be usable in conjunction with other work in the same process and without oversubscribing on resources. Thanks to RM again that by using its APIs custom schedulers can be built meanwhile respecting the resource usage of other schedulers in the process.

 

The Initial Allocation of Resources

Now we can describe initial allocation, one of the core functions of the Resource Manager.

As mentioned in the Resource Manager Responsibilities paragraph, initial allocation is about providing the resources to a scheduler at its creation time. A diagram of initial allocation steps can be found below.

 

A scheduler instance will place its request for resources by providing a set of policies to the resource manager. The resource manager will do the allocation depending on the policies and taking into account the other schedulers in the process. Eventually the resources will be provided to the scheduler in need.

The policy values given in step-1 play an important role in how many virtual processors the scheduler will have. Let’s go through those policies that effect initial allocation:

MinConcurrency and MaxConcurrency: These two policies specify the concurrency level of the scheduler. The number of virtual processors allocated to the scheduler by the RM will be bounded by these policy values. Depending on the existence and policy of the other schedulers, any number of resources can be allocated in between.

TargetOversubscriptionFactor: Oversubscription is defined as the number of threads associated with a particular processor core. RM will try to allocate resources up to MaxConcurrency by oversubscribing each core with the given TargetOversubscriptionFactor.

Here is an example where the scheduler actually asks for minimum of 1 processor core and maximum of 2 processor cores both oversubscribed by a factor of 2. If there are no other schedulers, on a 2 core machine the allocation will be as shown below.

ResourceAllocationPriority: RM will try to satisfy the requests of the schedulers with higher priority at the cost of lower priority schedulers.

Now let’s take a look at a general overview of the rules of initial allocation and some examples of it in order to understand the usage of the policy values.

Rules of Initial Allocation

Initial allocation has a set of rules that defines its behavior. I will mention those rules and then in the next paragraph will try to give examples to explain in more detail.

1)      RM will never allocate less resources then MinConcurrency and more resources than MaxConcurrency

2)      RM will share cores within schedulers as minimum as possible

3)      The resources will be allocated as close as possible. The closeness criteria here is that the processor cores in a NUMA node / processor socket are closer with respect to processor cores in other NUMA nodes / processor sockets.

4)      The resources will be proportionally distributed with respect to MaxConcurrency within the schedulers of equal priority if MaxConcurrency cannot be met.

5)      Until MaxConcurrency resources are provided to a higher priority scheduler, lower priority schedulers will be reduced to their MinConcurrency

Examples of Initial Allocation

We have mentioned that multiple schedulers may reside in a process. What if there is an existing Scheduler (S1) when a Scheduler (S2) with the same ResourceAllocationPriority requests resources?

RM will try to avoid sharing of processor cores between S1 and S2 as much as possible.

Scenario1: If there are no resources available to satisfy S2 then a proportional allocation will be performed between all schedulers that have the same ResourceAllocationPriority (S1 and S2). Allocation is proportional to each scheduler’s policy value of MaxConcurrency.

Scenraio2: If there are N available resources such that MinConcurrency of S2 <= N <= MaxConcurrency of S2, N resources are allocated to S2, and S1 is left untouched.

 

 

Please note here that S1’s number of allocated cores is reduced from 4 to 2 when a scheduler with equal ResourceAllocationPriority is created.

 

Resources of S1 is not touched and S2 gets only 3 of its requested resources when there are at least MinConcurrency number of resources available for S2.

What if ResourceAllocationPriority of  S2  is greater than S1?

In this case RM will lower S1 to MinConcurrency, and try to allocate N resources where MinConcurrency of S2 <= N <= MaxConcurrency to S2. If N is less than the minimum that S2 requires, RM will share processors cores with S1, to satisfy S2’s minimum request.

What if the machine supports NUMA architecture? Is locality taken into account?

Yes, RM will try to localize the allocation as much as possible. For example, on a machine with 2 sockets and 2 processor cores in each socket, the allocations will be made within socket boundaries.

 

What if both S1 and S2 has MinConcurrency=MaxConcurrency=Number of processor cores of the machine?

Since RM can’t allocate less than MinConcurrency and both schedulers have requested all cores, all the cores will be shared between S1 and S2. In general, RM will share cores in the final step of initial allocation algorithm, and share as little as possible.

What if one of S1 shutdowns? Are the resources made available to S2?

If S2 doesn’t have MaxConcurrency allocated at the time S1 shutdowns, dynamic migration will increase the number of resources of S2. However dynamic resource management is a whole topic unto itself, and I will write about it in my next blog.

Concurrency Runtime and Windows 7
04 February 09 06:30 PM | dmccrady | 17 Comments   

Microsoft has recently released a beta for Windows 7, and a look at the official web site (http://www.microsoft.com/windows/windows-7/default.aspx) will show you a pretty impressive list of new features and usability enhancements. I encourage you to go look them over for yourself, but in this article I’m going to focus on a couple of Win7 features that are of particular importance to achieving the most performance of your parallel programs.

1. Support for more than 64 processors

2. User-Mode Scheduled Threads

Both of the new features I’m going to talk about will be supported in the Microsoft Concurrency Runtime, which will be delivered as part of Visual Studio 10.

An important note about each of these features is that they’re only supported on the 64-bit Windows 7 platform.

More Than 64 Processors

The most straightforward of these new features is Windows 7’s support for more than 64 processors. With earlier Windows OS’s, even high end servers could only schedule threads among a maximum of 64 processors. Windows 7 will allow threads to run on more than 64 processors by allowing threads to be affinitized to both a processor group, and a processor index within that group. Each group can have up to 64 processors, and Windows 7 supports a maximum of 4 processor groups. The mechanics of this new OS support are detailed in a white paper which is available at http://www.microsoft.com/whdc/system/Sysinternals/MoreThan64proc.mspx.

However, unless you actively modify your application to affinitize work amongst other processor groups, you’ll still be stuck with a maximum of 64 processors. The good news is that if you use the Microsoft Concurrency Runtime on Windows 7, you don’t need to be concerned at all with these gory details. As always, the runtime takes care of everything for you, and will automatically determine the total amount of available concurrency (e.g., total number of cores), and utilize as many as it can during any parallel computation. This is an example of what we call a “light-up” scenario. Compile your Concurrency Runtime-enabled application once, and you can run it on everything, from your Core2-Duo up to your monster Win7 256-core server.

User Mode Scheduling

User Mode Scheduled Threads (UMS Threads) is another Windows 7 feature that “lights up” in the Concurrency Runtime.

As the name implies, UMS Threads are threads that are scheduled by a user-mode scheduler (like Concurrency Runtime’s scheduler), instead of by the kernel. Scheduling threads in user mode has a couple of advantages:

1. A UMS Thread can be scheduled without a kernel transition, which can provide a performance boost.

2. Full use of the OS’s quantum can be achieved if a UMS Thread blocks for any reason.

To illustrate the 2nd point, let’s assume a very simple scheduler with a single work queue. In this example, I’ll also assume that we have 100 tasks that can be run in parallel on 2 CPU’s.

ConcRT and Win7 UMS

Here we’ve started with 100 items in our task queue, and two threads have picked up Task 1 and Task 2 and are running them in parallel. Unfortunately, Task 2 is going to block on a critical section. Obviously, we would like the scheduler (i.e. the Concurrency Runtime) to use CPU 2 to run the other queued tasks while Task 2 is blocked. Alas, with ordinary Win32 threads, the scheduler cannot tell the difference between a task that is performing a very long computation and a task that is simply blocked in the kernel. The end result is that until Task 2 unblocks, the Concurrency Runtime will not schedule any more tasks on CPU 2. Our 2-core machine just became a 1-core machine, and in the worst case, all 99 remaining tasks will be executed serially on CPU 1.

This situation can be improved somewhat by using the Concurrency Runtime’s cooperative synchronization primitives (critical_section, reader_writer_lock, event) instead of Win32’s kernel primitives. These runtime-aware primitives will cooperatively block a thread, informing the Concurrency Runtime that other work can be run on the CPU. In the above example, Task 2 will cooperatively block, but Task 3 can be run on another thread on CPU 2. All this involves several trips through the kernel to block one thread and unblock another, but it’s certainly better than wasting the CPU.

The situation is improved even further on Windows 7 with UMS threads. When Task 2 blocks, the OS gives control back to Concurrency Runtime. It can now make a scheduling decision and create a new thread to run Task 3 from the task queue. The new thread is scheduled in user-mode by the Concurrency Runtime, not by the OS, so the switch is very fast. Now both CPU 1 and CPU 2 can now be kept busy with the remaining 99 non-blocking tasks in the queue. When Task 2 gets unblocked, Win7 places its host thread back on a runnable list so that the Concurrency Runtime can schedule it – again, from user-mode – and Task 2 can be continued on any available CPU.

You might say, “hey my task doesn’t do any kernel blocking, so does this still help me?” The answer is yes. First, it’s really difficult to know whether your task will block at all. If you call “new” or “malloc” you may block on a heap lock. Even if you didn’t block, the operation might page-fault. An I/O operation will also cause a kernel transition. All these occurrences can take significant time and can stall forward progress on the core upon which they occur. These are opportunities for the scheduler to execute additional work on the now-idle CPU. Windows 7 UMS Threads enables these opportunities, and the result is greater throughput of tasks and more efficient CPU utilization.

(Some readers may have noticed a superficial similarity between UMS Threads and Win32 Fibers. The key difference is that unlike a Fiber, a UMS Thread is a real bona-fide thread, with a backing kernel thread, its own thread-local state, and the ability to run code that can call arbitrary Win32 API’s. Fibers have severe restrictions on the kinds of Win32 API’s they can call.)

Obviously the above example is highly simplified. A UMS Thread scheduler is an extremely complex piece of software to write, and managing state when an arbitrary page fault can swap you out is challenging to say the least. However, once again users of the Concurrency Runtime don’t have to be concerned with any of these gory details. Write your programs once using PPL or Agents, and your code will run using Win32 threads or UMS Threads.

For more details about UMS Threads, check out Inside Windows 7 - User Mode Scheduler (UMS) with Windows Kernel architect Dave Probert on Channel9.

Better Together

Windows 7 by itself is a great enabler of fine grain parallelism, because of the above two specific features, as well as some other performance improvements made at the kernel level. The Concurrency Runtime helps bring the fullest potential of Windows 7 into the hands of programmers in a very simple and powerful way.

Dave Probert ‘goes deep’ on Win7 User Mode Scheduled Threads
02 February 09 12:32 PM | rickmolloy | 0 Comments   

Charles Torre just posted a great Channel9 session with Windows kernel architect Dave Probert on functionality new to Windows7: User Mode Scheduled (UMS) Threads.  Much like fibers, UMS threads enable a user mode scheduler (like the Concurrency Runtime) to have control over scheduling of threads in user mode; however unlike fibers, UMS threads keep the TEB and thread local information intact and provides notification of kernel blocking which enables more general use.

In the near future we'll be blogging here about how the Concurrency Runtime is taking advantage of UMS threads on Win7, but if you have some time to spare, check out what Dave has to say on this it's a wonderful session.

http://channel9.msdn.com/shows/Going+Deep/Dave-Probert-Inside-Windows-7-User-Mode-Scheduler-UMS/

-Rick

On Channel9: a chat with members from the Visual C++ and Concurrency Runtime teams
13 January 09 05:36 PM | rickmolloy | 0 Comments   

Last week, Don McCrady, Damien Watkins from the C++ team and I sat down with Charles from Channel9 and chatted about the Concurrency Runtime, new C++ work in Visual Studio 2010 and how our teams are working together.

 

As Charles pointed out, we also touched briefly upon new features in the C++0x standard that haven’t been discussed before.

 

Here's the link:

http://channel9.msdn.com/shows/Going+Deep/Parallel-Computing-in-Native-Code-New-Trends-and-Old-Friends/

 

Enjoy and as always your feedback is very welcome,

Rick

An Introduction to Native Concurrency in Visual Studio 2010
12 January 09 03:25 PM | Atilla Gunal | 6 Comments   

In this blog I will be giving an introduction to the native concurrency support in Visual Studio 2010. My motivation is that an architectural understanding of the features will enable the reader to make the most of the underlying infrastructure. In future blogs, I will go deeper into each sub topic.

 

The Goal

 

The ultimate goal of the native concurrency is to provide tools to the developers that will enable them to introduce scalable and maintainable parallelism into their applications.

 

One side of the goal is to inject efficient parallelism as easy as writing the following:

parallel_for(0, 10, [=](int i){ foo(i); } );

where foo() will be evaluated in parallel by the runtime. The other side is to enable productivity via the provided programming models.

 

The Architecture of Native Concurrency in Visual Studio 2010

 

Here’s a picture of the major components in the native concurrency development stack. I’d like to go through each of them and the interactions in between. 

 

Native Concurrency Development Stack

Resource Manager (RM)

 

As the name of the component suggests, the main responsibility of the RM is to manage the resources, where the resources here are the processor cores of the system. These resources are requested by the scheduler(s) to execute parallel work and are then distributed by the RM to the scheduler(s).

 

In order to accomplish the distribution, the RM captures the machine topology (i.e. number of processor cores and the closeness of the processor cores to each other) at its creation time. As the schedulers are created, it takes into account various factors to make the allocation including:

 

i-     Minimum and maximum number of cores requested by each scheduler

ii-  Scheduler priority

iii- Utilization of allocated processor cores per scheduler

iv-  Closeness of the allocated cores of each scheduler

(i.e. on a NUMA Architecture RM will try to allocate cores as close as possible. In other words, RM will span the allocation to a minimum number of NUMA nodes)

 

RM will not only allocate resources at the scheduler's creation time but also continue to monitor all schedulers to maximize the utilization of resources by taking from the ones that don't utilize and by giving to the ones that are in need.

 

Users that are willing to implement their own scheduler will mostly use this interface of the runtime.

 

Scheduler

 

Each parallel application consists of multiple tasks to be executed. As an example for the code given above; foo(0), foo(1), ..., foo(9) can be considered as tasks to be executed in parallel. It is critical to have a component to have this set of work distributed to the processor cores available for execution in an efficient way. Within the Concurrency Runtime this is responsibility of the Scheduler.

 

The Scheduler, despite having a predefined default behavior, can be customized through a set of policies such as: the number of resources it will use, the number of OS threads mapped to its allocated resources, its resource allocation priority, and either it should give priority to execute tasks to increase cache hits or improve fairness across tasks.

 

An important aspect of the Scheduler is that it is cooperative in the sense that it will not preempt a running task until it finishes execution or that task cooperatively yields its execution on behalf of other tasks. This is particularly important since it avoids context switches and cache being thrashed due to randomization introduced by preemption.

 

Users that are willing to implement their own concurrent library will mostly target this interface of the runtime.

 

PPL

 

PPL stands for the Parallel Pattern Library and is meant to provide a convenient interface to execute work in parallel. By using PPL, you can introduce parallelism without even having to manage a scheduler. Here are the common patterns available:

 

  Task execution patterns:

 

  parallel_invoke: To execute from 2 to 10 tasks in parallel

  parallel_for: To execute tasks in parallel over a range of integers

  parallel_for_each: To execute tasks in parallel over a collection of items in an STL container

 

  Synchronization patterns:

 

reader_writer_lock: A cooperative reader-writer lock that yields to other tasks instead of preempting.

critical_section: A cooperative mutual exclusion primitive that yields to other tasks instead of preempting.

 

  Data sharing pattern:

 

combinable: A scalable object that has a local copy for each thread where processing can be done lock free on the local copy and combined afterwards when parallel processing is done. For more info on combinable please refer to this.

 

Application developers will mostly use this interface to inject parallelism.

  

Agents

 

We all know that building software is not trivial. Building software that has to manage concurrent access to shared state is even harder. Agents are a model for decomposing parallel applications to make them easy to design, build and maintain. With Agents, you can implement the main components of the software in isolation, communicate between them by message passing (i.e. via messaging blocks), and in each Agent introduce concurrent computation using finer grain constructs like the PPL. For more info about using Agents and messaging bocks please refer to this blog or this Channel9 video.

 

What next?

 

I hope this blog helped in a better understanding of what the Concurrency Runtime is about. I will continue to detail each component as I give more concrete examples while digging that particular feature. Your comments are valuable for me alot so please feel free to provide feedback.

 

More Posts Next page »
Page view tracker