.NET 4 is too good to be true

Lately I have been playing around with the Parallel Extensions in .NET 4, and I was almost impressed that they were better than expected :-)
In my initial example I used Thread.Sleep inside the loops, so the speed-up from going parallel was more than 4x …  :-)

Pre-requisites

  • I ran this in VS2010 in a VPC with a single CPU and around 1.5 GB of RAM
  • There is no disk IO or network IO involved in the benchmark
  • We have a long-running process called DoCalc where we tested different algorithms from the port of the SciMark 2.0 benchmark to C#
    The original benchmark was written in Java and can be found at http://math.nist.gov/scimark2
  • I used the code from the “measureSOR” method, because
    • It takes about the same time on each run (stable between consecutive runs)
    • It is the fastest of all the calculation examples
  • The “measureSOR” method comes from a port of the SciMark 2.0 Java benchmark to C# by Chris Re (cmr28@cornell.edu) and Werner Vogels (vogels@cs.cornell.edu)
    Thanks for that!

Code

See the following example (I removed all the debug output and Stopwatch code):

for (int i = 0; i < 5; i++)
{
    DoCalc();
}

And the same code with the Parallel Extensions:

Parallel.For(0, 5, i =>
{
    DoCalc();
});

DoCalc is just a wrapper around measureSOR, to make replacing and testing easy:

private static void DoCalc()
{
    var res = PerformCalculationMeasureSOR();
}

private static double PerformCalculationMeasureSOR()
{
    SciMark2.Random R = new SciMark2.Random(SciMark2.Constants.RANDOM_SEED);
    var res = SciMark2.kernel.measureSOR(SciMark2.Constants.SOR_SIZE, SciMark2.Constants.RESOLUTION_DEFAULT, R);
    return res;
}
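For completeness, the Stopwatch instrumentation I stripped out might look roughly like this (a minimal sketch; the exact message format and the Thread.Sleep stand-in for DoCalc are assumptions, not the original code):

```csharp
using System;
using System.Diagnostics;

class TimingDemo
{
    static void Main()
    {
        // One Stopwatch for the whole run, one per iteration
        var total = Stopwatch.StartNew();
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine("Starting process {0}", i);
            var sw = Stopwatch.StartNew();
            DoCalc();
            sw.Stop();
            Console.WriteLine("Completed process {0} took {1} seconds",
                i, sw.Elapsed.TotalSeconds);
        }
        total.Stop();
        Console.WriteLine("Calculation finished and took {0}", total.Elapsed.TotalSeconds);
    }

    // Stand-in for the real DoCalc/measureSOR call
    private static void DoCalc()
    {
        System.Threading.Thread.Sleep(10);
    }
}
```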

When we run these two versions we get the following output.

Output

Output from running in a sequential for loop:
Calculation process started at 26/08/2009 12:45:28 PM
Starting process 0
Run: 0   Result: 404.89
Completed process 0 took 5.4336615 seconds

Starting process 1
Run: 1   Result: 404.89
Completed process 1 took 4.7462174 seconds

Starting process 2
Run: 2   Result: 407.99
Completed process 2 took 4.7405446 seconds

Starting process 3
Run: 3   Result: 402.85
Completed process 3 took 4.7832635 seconds

Starting process 4
Run: 4   Result: 408.33
Completed process 4 took 4.7051044 seconds

Calculation finished at 26/08/2009 12:45:52 PM and took 24.418825
Hit <Enter>

Output from running in parallel:

Calculation process started at 26/08/2009 12:43:07 PM
Non-parallelized for loop
Starting process 0
Starting process 1
Starting process 2
Starting process 3
Starting process 4
Run: 2   Result: 68.17
Completed process 2 took 5.9277232 seconds

Run: 1   Result: 90.60
Completed process 1 took 9.0088243 seconds   // takes longer because of overlapping

Run: 0   Result: 90.45
Completed process 0 took 9.0282256 seconds   // takes longer because of overlapping

Run: 3   Result: 84.78
Completed process 3 took 5.2585445 seconds

Run: 4   Result: 192.39
Completed process 4 took 6.3435232 seconds

Calculation finished at 26/08/2009 12:43:20 PM and took 13.2818383
Hit <Enter>

 

Some interesting notes:

  • In the NON-parallel loop, each method call takes roughly the same amount of time
  • In the parallelized loop, the methods that overlap take longer (~9 seconds)
  • The parallel run is NOT 4x faster than the iterative run! (as it was with Thread.Sleep :-)
  • We roughly halved the execution time, as expected
    BUT sometimes the execution time is even better than half (around 10 seconds)
     
  • Additionally, if we run the parallel for loop 10 times, we only need another ~17 seconds...
    For an explanation we could take a deeper look at the algorithm behind SOR:
    Jacobi Successive Over-Relaxation (SOR), http://math.nist.gov/scimark2/about.html

Figure: Running the parallel for loop 10 times takes only ~17 seconds
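To give an idea of what the benchmark is actually computing, here is a minimal sketch of one SOR sweep over an N x N grid, modeled on the published SciMark kernel (the method and variable names are my own; the real `SciMark2.kernel.measureSOR` also handles timing and flop counting):

```csharp
using System;

static class SorKernel
{
    // One or more Jacobi SOR sweeps: each interior cell is relaxed toward
    // the average of its four neighbours, weighted by the factor omega.
    static void SorSweep(double omega, double[][] G, int numIterations)
    {
        int M = G.Length;
        int N = G[0].Length;
        double omegaOverFour = omega * 0.25;
        double oneMinusOmega = 1.0 - omega;

        for (int p = 0; p < numIterations; p++)
        {
            for (int i = 1; i < M - 1; i++)
            {
                double[] Gi = G[i];
                double[] Gim1 = G[i - 1];
                double[] Gip1 = G[i + 1];
                for (int j = 1; j < N - 1; j++)
                {
                    Gi[j] = omegaOverFour * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1])
                          + oneMinusOmega * Gi[j];
                }
            }
        }
    }
}
```

The inner loop is pure floating-point arithmetic over an in-memory array, which is why the benchmark is CPU-bound and a good candidate for this comparison.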

 

With the help of Paul we were able to figure out what is going on here…

Findings

Our findings running in the VPC image:

  • If we run the "normal" for loop, the CPU doesn't go crazy (only around 88%–98% usage)
  • If we run the parallel loop, the CPU usage is much higher (~100%)

Our findings running on bare metal (a real dual-core CPU):

Figure 1. Linear vs. parallel CPU utilization for the loops

Conclusion

#1 CPU usage (= performance) is slightly scheduler-dependent

I am not an OS expert, but I guess the above means: “Windows sees more threads and gives them more time on the CPU.”
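If you want to see how much of the effect comes from the number of worker threads the scheduler sees, the Parallel Extensions let you cap the degree of parallelism. A small sketch (whether capping helps or hurts on a single-CPU VPC is something you would have to measure; the Thread.Sleep stand-in for DoCalc is an assumption):

```csharp
using System;
using System.Threading.Tasks;

class DegreeDemo
{
    static void Main()
    {
        // Allow at most two iterations to run concurrently
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

        Parallel.For(0, 5, options, i =>
        {
            Console.WriteLine("Starting process {0}", i);
            DoCalc();
        });
    }

    // Stand-in for the real DoCalc/measureSOR call
    private static void DoCalc()
    {
        System.Threading.Thread.Sleep(10);
    }
}
```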

#2 Using the Parallel Extensions is VERY EASY! Looking forward to the final release!
