Parallel Programming in Native Code

Parallel programming using C++ AMP, PPL and Agents libraries.

Walkthrough: Cartoon Effect filter using data-flow network

This topic shows how to use a data-flow network to implement a cartoon-like filter that can be applied to video frames. The filter consists of two stages:

1. Color simplification: each pixel is replaced by a Gaussian-weighted average of the pixels in its neighbor area. This filter is iterative and is applied several times to each frame. In this walkthrough it is applied three times.

2. Edge detection: pixels that lie on an edge are assigned a black color.

image

Here is the serial implementation of both filters:

Color simplification serial code

    void FrameProcessing::ApplyColorSimplifier(unsigned int startHeight,
        unsigned int endHeight, unsigned int startWidth, unsigned int endWidth)
    {
        // Simplify every pixel inside the requested rectangle.
        for (unsigned int j = startHeight; j < endHeight; ++j)
        {
            for (unsigned int i = startWidth; i < endWidth; ++i)
            {
                SimplifyIndexOptimized(m_pBufferImage, i, j);
            }
        }
    }

    void FrameProcessing::SimplifyIndexOptimized(BYTE* pFrame, int x, int y)
    {
        COLORREF orgClr = GetPixel(pFrame, x, y, m_Pitch, m_BPP);
        int shift = m_NeighborWindow / 2;
        double sSum = 0;   // sum of all neighbor weights
        double partialSumR = 0, partialSumG = 0, partialSumB = 0;
        const double standardDeviation = 0.025;
        for (int j = y - shift; j <= (y + shift); ++j)
        {
            for (int i = x - shift; i <= (x + shift); ++i)
            {
                // don't apply filter to the requested index,
                // only to the neighbors
                if (i != x || j != y)
                {
                    COLORREF clr = GetPixel(pFrame, i, j, m_Pitch, m_BPP);
                    double distance = Util::GetDistance(orgClr, clr);
                    // Gaussian weight: neighbors with a similar color
                    // contribute more than dissimilar ones.
                    double sValue = exp(-0.5 *
                        pow(distance / standardDeviation, 2));
                    sSum += sValue;
                    partialSumR += GetRValue(clr) * sValue;
                    partialSumG += GetGValue(clr) * sValue;
                    partialSumB += GetBValue(clr) * sValue;
                }
            }
        }

        // Normalize the weighted sums and clamp each channel to [0, 255].
        COLORREF simplifiedClr;
        int simpleRed, simpleGreen, simpleBlue;
        simpleRed   = (int)min(max(partialSumR / sSum, 0), 255);
        simpleGreen = (int)min(max(partialSumG / sSum, 0), 255);
        simpleBlue  = (int)min(max(partialSumB / sSum, 0), 255);
        simplifiedClr = RGB(simpleRed, simpleGreen, simpleBlue);
        SetPixel(m_pBufferImage, x, y, m_Pitch, m_BPP, simplifiedClr);
    }
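
The weighting used by the color simplifier can be tried in isolation. The following is a minimal sketch in portable C++, not the walkthrough's code: `GaussianWeight` and `WeightedChannelAverage` are hypothetical helper names, the inputs are plain numbers instead of pixels, and the `fallback` value returned when every weight underflows to zero is an assumption of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weight of one neighbor: exp(-0.5 * (distance / sigma)^2).
double GaussianWeight(double distance, double sigma)
{
    assert(sigma > 0.0);
    const double ratio = distance / sigma;
    return std::exp(-0.5 * ratio * ratio);
}

// Weighted average of one color channel.
// values[k] is a neighbor's channel value and distances[k] is that
// neighbor's color distance from the center pixel.
double WeightedChannelAverage(const std::vector<double>& values,
                              const std::vector<double>& distances,
                              double sigma, double fallback)
{
    assert(values.size() == distances.size());
    double weightSum = 0.0;
    double valueSum = 0.0;
    for (std::size_t k = 0; k < values.size(); ++k)
    {
        const double w = GaussianWeight(distances[k], sigma);
        weightSum += w;
        valueSum += values[k] * w;
    }
    // Assumption of this sketch: if no neighbor carries any weight,
    // return the caller-supplied fallback instead of dividing by zero.
    return weightSum > 0.0 ? valueSum / weightSum : fallback;
}
```

With a small sigma such as 0.025, a neighbor whose color distance is large gets a weight that is effectively zero, so the average is dominated by neighbors that already look like the center pixel.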

Edge detection serial code

    // Member function (it uses this-> and the same members as the color
    // simplifier). The scratch buffer requires <vector>.
    void FrameProcessing::ApplyEdgeDetection(BYTE* pImageFrame,
        unsigned int startHeight, unsigned int endHeight,
        unsigned int startWidth, unsigned int endWidth)
    {
        const float alpha = 0.3f;   // blend between Sy and the mean of Su, Sv
        const float beta = 0.8f;    // blend between the two edge responses
        const float s0 = 0.054f;    // smooth-step range for edgeS
        const float s1 = 0.064f;
        const float a0 = 0.3f;      // smooth-step range for edgeA
        const float a1 = 0.7f;

        // Work on a copy so that the output frame is replaced in one step.
        std::vector<BYTE> frame(pImageFrame, pImageFrame + m_Size);
        BYTE* pFrame = frame.data();
        for (unsigned int y = startHeight; y < endHeight; ++y)
        {
            for (unsigned int x = startWidth; x < endWidth; ++x)
            {
                float Sy, Su, Sv;
                float Ay, Au, Av;
                // Sobel responses of the simplified and the current image.
                CalculateSobel(m_pBufferImage, x, y, Sy, Su, Sv);
                CalculateSobel(m_pCurrentImage, x, y, Ay, Au, Av);
                float edgeS = (1 - alpha) * Sy +
                    alpha * (Su + Sv) / 2;
                float edgeA = (1 - alpha) * Ay +
                    alpha * (Au + Av) / 2;
                float i = (1 - beta) * Util::SmoothStep(s0, s1, edgeS)
                    + beta * Util::SmoothStep(a0, a1, edgeA);
                // The stronger the edge, the darker the pixel.
                float oneMinusi = 1 - i;
                COLORREF clr = GetPixel(m_pBufferImage, x, y,
                                        m_Pitch, m_BPP);
                COLORREF newClr = RGB(GetRValue(clr) * oneMinusi,
                    GetGValue(clr) * oneMinusi, GetBValue(clr) * oneMinusi);

                this->SetPixel(pFrame, x, y, m_Pitch, m_BPP, newClr);
            }
        }
        memcpy_s(pImageFrame, m_Size, pFrame, m_Size);
    }
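
The edge detection code relies on `Util::SmoothStep`, which is not shown in this walkthrough. The sketch below assumes it follows the conventional Hermite smooth-step definition, which may differ from the real helper. `SmoothStepSketch` and `EdgeDarkening` are hypothetical names, and the constants are copied from the listing above.

```cpp
#include <algorithm>
#include <cassert>

// Conventional Hermite smooth-step: 0 below edge0, 1 above edge1,
// and a smooth ramp in between.
// Assumption: Util::SmoothStep behaves like this.
float SmoothStepSketch(float edge0, float edge1, float x)
{
    assert(edge0 < edge1);
    const float t = std::min(
        std::max((x - edge0) / (edge1 - edge0), 0.0f), 1.0f);
    return t * t * (3.0f - 2.0f * t);
}

// Blend of the two edge responses, mirroring the formula for `i`
// in the serial edge detection code.
float EdgeDarkening(float edgeS, float edgeA)
{
    const float beta = 0.8f;
    return (1.0f - beta) * SmoothStepSketch(0.054f, 0.064f, edgeS)
         + beta * SmoothStepSketch(0.3f, 0.7f, edgeA);
}
```

Under this assumption the blended value stays between 0 and 1, so multiplying a color channel by one minus that value leaves flat areas untouched and drives strong edges toward black.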
 

Pipelining and data flow network

pipe

The pipeline resembles the above diagram. It has three color simplification stages and one edge detection stage. Because it has only four stages, it can keep at most four cores busy, which is only half of the available computation resources on an eight-core machine. To extend it to use all CPUs, the frame can be divided into chunks; each chunk is passed to a network similar to the one above.

[Figure: the pipeline above repeated in parallel, one copy per chunk]

The number of chunks matches the number of networks, which is the number of CPUs divided by four (the number of stages inside each network). Such a network keeps every CPU busy, but it has a thread-safety problem around edge detection. The two filters applied to a frame depend on each other: color simplification must finish on the whole frame before edge detection begins. The network above does not guarantee this, because some chunks of a frame can be in the edge detection stage while other chunks of the same frame are still inside color simplification stages.

To solve this problem, a frame must complete color simplification before it enters edge detection. A join message block enforces this: it waits for all color simplification stages to finish before it allows the frame to pass to edge detection, as in the diagram below.

After edge detection is done, the video reader block is signaled to send the next frame into the network. This feedback loop prevents the video reader from overwhelming the network with messages while the relatively slower frame processing takes place. Initially, however, the video reader sends several frames to ensure that the network is always busy with more than one frame. In this walkthrough it sends four frames per color simplification stage (12 frames in total), so the network has 12 frames to process for as long as the video has frames left.
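
The sizing rules in this section reduce to two small calculations. The sketch below restates them in portable C++; `PipelineCount` and `InitialFrameCount` are hypothetical names, and clamping the pipeline count to at least one on machines with fewer cores than stages is an assumption of the sketch rather than something the walkthrough states.

```cpp
#include <algorithm>
#include <cassert>

// Number of parallel pipelines: cores divided by stages per pipeline.
// Assumption: never go below one pipeline on small machines.
unsigned PipelineCount(unsigned cores, unsigned stagesPerPipeline)
{
    assert(stagesPerPipeline > 0);
    return std::max(1u, cores / stagesPerPipeline);
}

// Frames sent up front: four per color simplification stage.
unsigned InitialFrameCount(unsigned colorStages)
{
    return 4 * colorStages;
}
```

For example, an eight-core machine with four stages per pipeline yields two pipelines, and three color simplification stages yield the 12 initial frames mentioned above.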

Big network
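
The role of the join block can be illustrated without the Agents library. The class below is a hypothetical, single-threaded stand-in, not the Concurrency Runtime `join`: it simply counts messages and forwards them as one batch once the expected number has arrived, whereas the real block accepts one message from each linked source. The name `SimpleJoin` and its interface are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Collects `inputs` messages, then hands them to the target as one batch.
template <typename T>
class SimpleJoin
{
public:
    SimpleJoin(std::size_t inputs,
               std::function<void(const std::vector<T>&)> target)
        : m_inputs(inputs), m_target(std::move(target))
    {
        assert(inputs > 0);
    }

    // Accept one message; fire the target only when all inputs delivered.
    void Post(const T& message)
    {
        m_pending.push_back(message);
        if (m_pending.size() == m_inputs)
        {
            std::vector<T> batch;
            batch.swap(m_pending);   // reset for the next frame
            m_target(batch);
        }
    }

private:
    std::size_t m_inputs;
    std::function<void(const std::vector<T>&)> m_target;
    std::vector<T> m_pending;
};
```

If each color simplification pipeline posts its finished chunk to such a join, the target (edge detection in this walkthrough) cannot start until the last chunk has arrived, which is the ordering guarantee the network needs.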

The network is now ready for implementation. The code below shows the video agent, which acts as the connection point between the UI and the network and signals the video reader to read the initial frames.

    void VideoMultiFramePipelineAgent::run()
    {
        // Estimate the overhead of the timer itself from two
        // back-to-back reads, then start timing the run.
        QueryPerformanceFrequency(&m_Frequency);
        QueryPerformanceCounter(&m_StartTime);
        QueryPerformanceCounter(&m_EndTime);
        m_Overhead = m_EndTime.QuadPart - m_StartTime.QuadPart;

        QueryPerformanceCounter(&m_StartTime);

        // Settings shared by every frame sent into the network.
        FrameData data;
        data.m_pCartoonAgent = this;
        data.m_fParallel = m_fParallel;
        data.m_PhaseCount = m_nPhases;
        data.m_pVideoReader = &m_VideoReader;
        data.m_neighbourArea =
                ((CCartoonizerDlg*)m_pUIDlg)->m_NeighbourWindow;

        m_pFrameProcessor->SetNeighbourArea(data.m_neighbourArea);

        // Prime the network with four frames per phase.
        for (int i = 0; i < 4 * m_nPhases; ++i)
        {
            m_VideoReader.ReadNextFrame(data);
        }

        done();
    }
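
The throttling idea behind this agent (prime the network, then read one new frame for every finished frame) can be modeled in a few lines. The following is a hypothetical, single-threaded model rather than the walkthrough's video reader: `ThrottledReader` and its members are names invented for this sketch, and frames are represented by their index only.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Models how many frames are inside the network at any time.
struct ThrottledReader
{
    std::size_t nextFrame = 0;          // index of the next frame to read
    std::size_t totalFrames;            // frames available in the video
    std::queue<std::size_t> inFlight;   // frames currently in the network

    explicit ThrottledReader(std::size_t total) : totalFrames(total) {}

    // Read one frame into the network if any remain.
    bool ReadNextFrame()
    {
        if (nextFrame >= totalFrames)
        {
            return false;
        }
        inFlight.push(nextFrame++);
        return true;
    }

    // Prime the network, as run() does with 4 * m_nPhases reads.
    void Prime(std::size_t count)
    {
        for (std::size_t i = 0; i < count; ++i)
        {
            ReadNextFrame();
        }
    }

    // Called when edge detection finishes a frame:
    // retire it and read exactly one replacement.
    void OnFrameFinished()
    {
        assert(!inFlight.empty());
        inFlight.pop();
        ReadNextFrame();
    }
};
```

In this model the number of frames in flight stays at the priming count until the video runs out of frames, after which the network simply drains.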
 
The following code initializes the network. Edge detection and each color simplification stage use a transformer block, because these stages both receive and send data. The video reader is wrapped in a call block, because a call block only receives messages: here it receives the finished frame and reads the next one.
 
    void MultiplePipelineNetworkAgent::InitializeNetwork()
    {
        m_fNetworkInitialized = true;
        m_ProcessedMsgs = 0;
        m_fAllMsgsRecieved = false;

        // One row of transformer pointers per pipeline.
        m_colorSimplifier = new transformer<FrameData,
                 FrameData>** [m_pipeLines];

        // The join waits for one message from every pipeline.
        m_edgeDetectionJoin = new join<FrameData>(m_pipeLines);
        m_edgeDetection = new transformer<vector<FrameData>, FrameData>(
            [this](vector<FrameData> const& arrData) -> FrameData
        {
            // The chunks delivered together are expected to belong to the
            // same frame, so the first one is used to reach the shared
            // frame processor. It exists even with a single pipeline.
            FrameData const& frame = arrData[0];
            if (frame.m_EndHeight != 0)
            {
                frame.m_pFrameProcesser->SetParallelOption(true);
                frame.m_pFrameProcesser->ApplyEdgeDetection();
                frame.m_pFrameProcesser->FrameDone();
                frame.m_pCartoonAgent->FrameFinished(frame);
                m_Finished = true;
            }
            return frame;
        });

        for (int i = 0; i < m_pipeLines; ++i)
        {
            m_colorSimplifier[i] = new transformer<FrameData,
                                 FrameData>* [m_phaseCount];

            for (int count = 0; count < m_phaseCount; ++count)
            {
                m_colorSimplifier[i][count] = new transformer<FrameData,
                 FrameData>([](FrameData const& data) -> FrameData
                {
                    data.m_pFrameProcesser->ApplyColorSimplifier(
                        data.m_StartHeight, data.m_EndHeight,
                        data.m_StartWidth, data.m_EndWidth);

                    return data;
                });

                // Chain each stage to the previous one in the same pipeline.
                if (count > 0)
                {
                    m_colorSimplifier[i][count-1]->link_target(
                                m_colorSimplifier[i][count]);
                }
            }
            // The last stage of every pipeline feeds the join.
            m_colorSimplifier[i][m_phaseCount-1]->link_target(
                            m_edgeDetectionJoin);
        }
        m_edgeDetectionJoin->link_target(m_edgeDetection);

        // Feedback loop: a finished frame triggers the next read.
        m_reader = new call<FrameData>([](const FrameData & data)
        {
            FrameData newData = data;
            if (nullptr != data.m_pVideoReader)
            {
                data.m_pVideoReader->ReadNextFrame(newData);
            }
        });

        m_edgeDetection->link_target(m_reader);
    }

The following code divides a frame into chunks and sends them to the network. The chunks are sent only to the first color simplification column, because the links between the blocks pass each chunk on to the next stage as soon as the previous color simplification stage is done with it.
 
    void MultiplePipelineNetworkAgent::DoWork(FrameData& data)
    {
        m_Finished = false;
        // Rows and columns within `shift` of the border are skipped,
        // because their neighbor window would fall outside the frame.
        unsigned int shift = data.m_neighbourArea / 2;
        int count = 0;
        m_ProcessedMsgs = 0;
        for (unsigned int h = shift; h < (data.m_EndHeight - shift);
                h += m_step, ++count)
        {
            // Each chunk is a horizontal band of at most m_step rows.
            FrameData localData     = data;
            localData.m_StartHeight = h;
            localData.m_EndHeight   = min(h + m_step,
                                    (data.m_EndHeight - shift));
            localData.m_StartWidth  = shift;
            localData.m_EndWidth    = data.m_EndWidth - shift;
            localData.m_final       = (localData.m_EndHeight ==
                                      (data.m_EndHeight - shift));

            // Distribute the chunks round-robin over the pipelines.
            int index = count % m_pipeLines;
            asend(m_colorSimplifier[index][0], localData);
        }
    }
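
The chunking arithmetic can be checked separately from the network. The sketch below reproduces the loop in portable C++ and returns the chunks instead of sending them. `Chunk` and `SplitRows` are hypothetical names, and the early return for frames that are too small for the neighbor window is a guard added in this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Chunk
{
    unsigned startRow;   // first row of the band (inclusive)
    unsigned endRow;     // one past the last row of the band
    unsigned pipeline;   // pipeline that receives the band
};

// Split the rows [shift, height - shift) into bands of `step` rows
// and assign them to the pipelines in round-robin order.
std::vector<Chunk> SplitRows(unsigned height, unsigned neighbourArea,
                             unsigned step, unsigned pipelines)
{
    assert(step > 0 && pipelines > 0);
    const unsigned shift = neighbourArea / 2;
    std::vector<Chunk> chunks;
    // Guard added here: avoid unsigned underflow on tiny frames.
    if (height <= 2 * shift)
    {
        return chunks;
    }
    const unsigned lastRow = height - shift;
    unsigned count = 0;
    for (unsigned h = shift; h < lastRow; h += step, ++count)
    {
        chunks.push_back({h, std::min(h + step, lastRow),
                          count % pipelines});
    }
    return chunks;
}
```

For a 360-row frame with a neighbor window of 5, a step of 89 rows, and four pipelines, this yields four bands covering rows 2 through 357, one per pipeline.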

Results

Using the data-flow network showed linear speedup up to 48 cores on a video stream with a frame size of 640x360.


image

Mohamed Magdy Mohamed, Parallel Computing Platform Team
