Willy's Reflections - Visual Studio ALM Rangers

| Willy-Peter Schaub | Visual Studio ALM Rangers | In search of IT simplicity, quality and tranquility |

Rambling … loops and multi-processing. When things literally slow down…

Rambling … loops and multi-processing. When things literally slow down…

  • Comments 2

Continuing from Rambling … why is my dual processor colleague not twice as fast? I thought I should have a quick look at some basic looping and multi-processor support, released with the Parallel Extensions to the .NET Framework in .NET 3.5. Again I developed some simple and probably ‘silly’ code to experiment with loops and parallel extensions. The code simply increments a long variable many, many times and performs some meaningless maths, just to keep the processor occupied. What I am trying to figure is whether the parallel extensions bring any simple relief.

Snippet 1 – Simple Loop

C# Code MSIL
   1: private static void TestZero(bool log)
   2: {
   3:     DateTime stampStart = DateTime.Now;
   4:     ResetState(log);
   5:     for (int k = 0; k < 2; k++)
   6:     {
   7:         if (log)
   8:         {
   9:             Console.WriteLine("0 -> k:{0}", k);
  10:         }
  11:         for (int i = 0x5f5e100; i > 0; i--)
  12:         {
  13:             SillyProcessing(k);
  14:         }
  15:     }
  16:     if (log)
  17:     {
  18:         Log(stampStart, "TestZero");
  19:     }
  20: }
   1: .method private hidebysig static void TestZero(bool log) cil managed
   2: {
   3:     .maxstack 2
   4:     .locals init (
   5:         [0] valuetype [mscorlib]System.DateTime stampStart,
   6:         [1] int32 k,
   7:         [2] int32 i)
   8:     L_0000: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
   9:     L_0005: stloc.0 
  10:     L_0006: ldarg.0 
  11:     L_0007: call void Test.Program::ResetState(bool)
  12:     L_000c: ldc.i4.0 
  13:     L_000d: stloc.1 
  14:     L_000e: br.s L_003d
  15:     L_0010: ldarg.0 
  16:     L_0011: brfalse.s L_0023
  17:     L_0013: ldstr "0 -> k:{0}"
  18:     L_0018: ldloc.1 
  19:     L_0019: box int32
  20:     L_001e: call void [mscorlib]System.Console::WriteLine(string, object)
  21:     L_0023: ldc.i4 0x5f5e100
  22:     L_0028: stloc.2 
  23:     L_0029: br.s L_0035
  24:     L_002b: ldloc.1 
  25:     L_002c: call void Test.Program::SillyProcessing(int32)
  26:     L_0031: ldloc.2 
  27:     L_0032: ldc.i4.1 
  28:     L_0033: sub 
  29:     L_0034: stloc.2 
  30:     L_0035: ldloc.2 
  31:     L_0036: ldc.i4.0 
  32:     L_0037: bgt.s L_002b
  33:     L_0039: ldloc.1 
  34:     L_003a: ldc.i4.1 
  35:     L_003b: add 
  36:     L_003c: stloc.1 
  37:     L_003d: ldloc.1 
  38:     L_003e: ldc.i4.2 
  39:     L_003f: blt.s L_0010
  40:     L_0041: ldarg.0 
  41:     L_0042: brfalse.s L_004f
  42:     L_0044: ldloc.0 
  43:     L_0045: ldstr "TestZero"
  44:     L_004a: call void Test.Program::Log(valuetype [mscorlib]System.DateTime, string)
  45:     L_004f: ret 
  46: }
  47:  
  48:  

As shown, the first test, TestZero() simply loops and calls the SillyProcessing() 100,000,000 times, whereby SillyProcessing() does nothing more than:

  1. float calculation = (float)index * (float)index * (float)data[index] / (float)3.14; // wake up the maths part of the processor
  2. string result = (Math.Cos(calculation)).ToString(); // use some heavy string processing spices
  3. data[index] = data[index] + 1; // actual logic

… do not analyse the logic :)

Snippet 2 – Simple Loop, getting rid of some of the loops

C# Code
   1: private static void TestOne(bool log)
   2: {
   3:     DateTime stampStart = DateTime.Now;
   4:     ResetState(log);
   5:     for (int k = 0; k < 2; k++)
   6:     {
   7:         if (log)
   8:         {
   9:             Console.WriteLine("1 -> k:{0}", k);
  10:         }
  11:         for (int i = 0x5f5e100; i > 0; i -= 0x19)
  12:         {
  13:             SillyProcessing(k);
  14:             SillyProcessing(k);
  15:             SillyProcessing(k);
  16:             SillyProcessing(k);
  17:             SillyProcessing(k);
  18:             SillyProcessing(k);
  19:             SillyProcessing(k);
  20:             SillyProcessing(k);
  21:             SillyProcessing(k);
  22:             SillyProcessing(k);
  23:             SillyProcessing(k);
  24:             SillyProcessing(k);
  25:             SillyProcessing(k);
  26:             SillyProcessing(k);
  27:             SillyProcessing(k);
  28:             SillyProcessing(k);
  29:             SillyProcessing(k);
  30:             SillyProcessing(k);
  31:             SillyProcessing(k);
  32:             SillyProcessing(k);
  33:             SillyProcessing(k);
  34:             SillyProcessing(k);
  35:             SillyProcessing(k);
  36:             SillyProcessing(k);
  37:             SillyProcessing(k);
  38:         }
  39:     }
  40:     if (log)
  41:     {
  42:         Log(stampStart, "TestOne");
  43:     }
  44: }
  45:  
  46:  

The second test, TestOne(), calls SillyProcessing() 25 times and increments the loop counter by 25, instead of the 1:1 as in testZero,

Snippet 3 – Parallel Extensions one level

C# Code MSIL
   1: private static void TestTwo(bool log)
   2: {
   3:     DateTime stampStart = DateTime.Now;
   4:     ResetState(log);
   5:     Parallel.For(0, 2, 1, delegate (int k) {
   6:         if (log)
   7:         {
   8:             Console.WriteLine("2 -> k:{0}", k);
   9:         }
  10:         for (int i = 0x5f5e100; i > 0; i--)
  11:         {
  12:             SillyProcessing(k);
  13:         }
  14:     });
  15:     if (log)
  16:     {
  17:         Log(stampStart, "TestTwo");
  18:     }
  19: }
  20:  
  21:  
   1: .method private hidebysig static void TestTwo(bool log) cil managed
   2: {
   3:     .maxstack 6
   4:     .locals init (
   5:         [0] valuetype [mscorlib]System.DateTime stampStart,
   6:         [1] class Test.Program/<>c__DisplayClass1 CS$<>8__locals2)
   7:     L_0000: newobj instance void Test.Program/<>c__DisplayClass1::.ctor()
   8:     L_0005: stloc.1 
   9:     L_0006: ldloc.1 
  10:     L_0007: ldarg.0 
  11:     L_0008: stfld bool Test.Program/<>c__DisplayClass1::log
  12:     L_000d: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
  13:     L_0012: stloc.0 
  14:     L_0013: ldloc.1 
  15:     L_0014: ldfld bool Test.Program/<>c__DisplayClass1::log
  16:     L_0019: call void Test.Program::ResetState(bool)
  17:     L_001e: ldc.i4.0 
  18:     L_001f: ldc.i4.2 
  19:     L_0020: ldc.i4.1 
  20:     L_0021: ldloc.1 
  21:     L_0022: ldftn instance void Test.Program/<>c__DisplayClass1::<TestTwo>b__0(int32)
  22:     L_0028: newobj instance void [mscorlib]System.Action`1<int32>::.ctor(object, native int)
  23:     L_002d: call void [System.Threading]System.Threading.Parallel::For(int32, int32, int32, class [mscorlib]System.Action`1<int32>)
  24:     L_0032: ldloc.1 
  25:     L_0033: ldfld bool Test.Program/<>c__DisplayClass1::log
  26:     L_0038: brfalse.s L_0045
  27:     L_003a: ldloc.0 
  28:     L_003b: ldstr "TestTwo"
  29:     L_0040: call void Test.Program::Log(valuetype [mscorlib]System.DateTime, string)
  30:     L_0045: ret 
  31: }
  32:  
  33:  

With TestTwo() we implemented the parallel extensions on the outer loop.

Snippet 4 – Parallel Extensions two levels

C# Code MSIL
   1: private static void TestThree(bool log)
   2: {
   3:     DateTime stampStart = DateTime.Now;
   4:     ResetState(log);
   5:     Parallel.For(0, 2, 1, delegate (int k) {
   6:         if (log)
   7:         {
   8:             Console.WriteLine("3 -> k:{0}", k);
   9:         }
  10:         Parallel.For(0, 0x5f5e100, 1, delegate (int i) {
  11:             SillyProcessing(k);
  12:         });
  13:     });
  14:     if (log)
  15:     {
  16:         Log(stampStart, "TestThree");
  17:     }
  18: }
  19:  
  20:  
   1: .method private hidebysig static void TestThree(bool log) cil managed
   2: {
   3:     .maxstack 6
   4:     .locals init (
   5:         [0] valuetype [mscorlib]System.DateTime stampStart,
   6:         [1] class Test.Program/<>c__DisplayClass5 CS$<>8__locals6)
   7:     L_0000: newobj instance void Test.Program/<>c__DisplayClass5::.ctor()
   8:     L_0005: stloc.1 
   9:     L_0006: ldloc.1 
  10:     L_0007: ldarg.0 
  11:     L_0008: stfld bool Test.Program/<>c__DisplayClass5::log
  12:     L_000d: call valuetype [mscorlib]System.DateTime [mscorlib]System.DateTime::get_Now()
  13:     L_0012: stloc.0 
  14:     L_0013: ldloc.1 
  15:     L_0014: ldfld bool Test.Program/<>c__DisplayClass5::log
  16:     L_0019: call void Test.Program::ResetState(bool)
  17:     L_001e: ldc.i4.0 
  18:     L_001f: ldc.i4.2 
  19:     L_0020: ldc.i4.1 
  20:     L_0021: ldloc.1 
  21:     L_0022: ldftn instance void Test.Program/<>c__DisplayClass5::<TestThree>b__3(int32)
  22:     L_0028: newobj instance void [mscorlib]System.Action`1<int32>::.ctor(object, native int)
  23:     L_002d: call void [System.Threading]System.Threading.Parallel::For(int32, int32, int32, class [mscorlib]System.Action`1<int32>)
  24:     L_0032: ldloc.1 
  25:     L_0033: ldfld bool Test.Program/<>c__DisplayClass5::log
  26:     L_0038: brfalse.s L_0045
  27:     L_003a: ldloc.0 
  28:     L_003b: ldstr "TestThree"
  29:     L_0040: call void Test.Program::Log(valuetype [mscorlib]System.DateTime, string)
  30:     L_0045: ret 
  31: }
  32:  
  33:  

With the last test, TestThree(), we went one step further and implemented the parallel extensions on the inner loop. Normally not needed, but as part of the test I decided to go where no-one would normally venture :)

Pop-Quiz

“Pop-Quiz”, which test will be the quickest and which will be problematic? I was surprised …

Analysis

Let us first look at the processor view, using the good old trusted task manager:
image  
Figure 1 – Processor View (Left:Processor 1, Right:Processor 2)

Blue Male Doctor Holding a Clipboard Clipart Illustration With both TestZero() and TestOne() the total processor utilisation remains around 50% and the processing seems to plod along quite sequentially. With TestTwo() and TestThree() you notice an uprising of threads and the total processor utilisation practically maxes on both processors as expected.

But who is the quickest and who is the problematic case?
image  
Figure 2 – Results

TestZero() and TestOne() basically arrive at the finish line at the same time. I expected TestOne() to be quicker, but with all the sceptical re-runs it always proved to be slightly slower than TestZero(). I used the profiling tools to get a more detailed picture, whereby the results are also not making me feel 100% happy although TestOne() is faster than TestZero() with instrumented code. Using sampling we get a less accurate picture, but with less impact on the application behaviour.
image 

TestTwo() was faster as expected and TestThree() proved to be the champion in terms of speed. It is important to test your application before making extensive use of parallel processing, because if we Blue Man Bending Over to Inspect Something Through a Magnifying Glass Clipart Illustrationremove the first two statements of the SillyProcessing() method, the parallelism actually proves slower than the classic counterparts. Also TestOne() speeded up as expected, when the calculation statement and especially the heavy string statement were removed as shown:
image

What is also evident is that one of the tests has a problem … yes, TestThree() seems to suffer from lost increments. Before we start debugging and analysing the anomaly in great detail, we can emphasise that the code is not thread safe and that the data counters are simply overridden by the inner loops competing for the battle zone.

So what, if anything have we learnt in this chat?

Throwing processors and threads at a problem will not necessarily speed up your application. There are exceptional profiling tools in VSTS 2008 and even better ones in VSTS 2010, which will enable you to profile and analyse your application in a number of environments. Remember that too much of a good thing, i.e. too much parallelism or too much salt, can destroy an exceptional application or an exceptional meal respectively q;-)

Leave a Comment
  • Please add 5 and 4 and type the answer here:
  • Post