Lately I played around with the Parallel Extensions in .NET 4, and I was almost impressed: they worked better than I expected :-)
--> In the initial example we used Thread.Sleep inside the loops, so our speed-up from going parallel was more than 4x … :-)
Pre-requisites
- I ran this on VS2010 in a VPC with a single CPU and around 1.5 GB of RAM
- There is no Disk IO or network IO involved in the benchmark
- We have a long-running method called DoCalc, in which we tested different algorithms from the port of the SciMark 2.0 Benchmark to C#
The original benchmark was in Java and can be found at http://math.nist.gov/scimark2 - I used the code from the “measureSOR” method, because
- It takes the same time on each run (stable between consecutive runs)
- It is the fastest one of all the calculation examples
- The “measureSOR” method is a port of the SciMark2a Java Benchmark to C# by Chris Re (cmr28@cornell.edu) and Werner Vogels (vogels@cs.cornell.edu)
Thanks for that!
Code
See the following example (I removed all Debug output and Stopwatch code)
for (int i = 0; i < 5; i++) { DoCalc(); }
And the same code with the Parallel extensions
Parallel.For(0, 5, i => { DoCalc(); });
DoCalc is just a wrapper around measureSOR, so that the algorithm under test can easily be replaced
private static void DoCalc()
{
    var res = PerformCalculationMeasureSOR();
}

private static double PerformCalculationMeasureSOR()
{
    SciMark2.Random R = new SciMark2.Random(SciMark2.Constants.RANDOM_SEED);
    var res = SciMark2.kernel.measureSOR(SciMark2.Constants.SOR_SIZE, SciMark2.Constants.RESOLUTION_DEFAULT, R);
    return res;
}
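The listings above leave out the Stopwatch code. A minimal sketch of how such timing could be wrapped around both loops is shown here. DoCalcStandIn is a made-up CPU-bound placeholder and not the SciMark kernel, and the class and method names are assumptions for illustration only.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class LoopTiming
{
    // Hypothetical stand-in for DoCalc(): a small CPU-bound loop,
    // NOT the SciMark measureSOR kernel.
    public static double DoCalcStandIn()
    {
        double sum = 0.0;
        for (int k = 1; k <= 5000000; k++)
        {
            sum += Math.Sqrt(k);
        }
        return sum;
    }

    // Runs the plain for loop and returns the elapsed wall-clock time.
    public static TimeSpan TimeSequential(int runs)
    {
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            DoCalcStandIn();
        }
        watch.Stop();
        return watch.Elapsed;
    }

    // Runs the same work through Parallel.For and returns the elapsed wall-clock time.
    public static TimeSpan TimeParallel(int runs)
    {
        var watch = Stopwatch.StartNew();
        Parallel.For(0, runs, i => { DoCalcStandIn(); });
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        Console.WriteLine("for loop:     {0} seconds", TimeSequential(5).TotalSeconds);
        Console.WriteLine("Parallel.For: {0} seconds", TimeParallel(5).TotalSeconds);
    }
}
```

Measuring wall-clock time around the whole loop, rather than summing the per-call times, is what makes the two variants comparable, because in the parallel case the individual calls overlap.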
When we run these two loops we get the following output
Output
Output from running in a for loop
Calculation process started at 26/08/2009 12:45:28 PM
Starting process 0
Run: 0 Result: 404.89
Completed process 0 took 5.4336615 seconds
Starting process 1
Run: 1 Result: 404.89
Completed process 1 took 4.7462174 seconds
Starting process 2
Run: 2 Result: 407.99
Completed process 2 took 4.7405446 seconds
Starting process 3
Run: 3 Result: 402.85
Completed process 3 took 4.7832635 seconds
Starting process 4
Run: 4 Result: 408.33
Completed process 4 took 4.7051044 seconds
Calculation finished at 26/08/2009 12:45:52 PM and took 24.418825
Hit <Enter>
Output from running parallel
Calculation process started at 26/08/2009 12:43:07 PM
Non-parallelized for loop
Starting process 0
Starting process 1
Starting process 2
Starting process 3
Starting process 4
Run: 2 Result: 68.17
Completed process 2 took 5.9277232 seconds
Run: 1 Result: 90.60
Completed process 1 took 9.0088243 seconds // take longer because overlapping
Run: 0 Result: 90.45
Completed process 0 took 9.0282256 seconds // take longer because overlapping
Run: 3 Result: 84.78
Completed process 3 took 5.2585445 seconds
Run: 4 Result: 192.39
Completed process 4 took 6.3435232 seconds
Calculation finished at 26/08/2009 12:43:20 PM and took 13.2818383
Hit <Enter>
Interesting notes here
- In the non-parallel loop, each method call takes roughly the same amount of time (about 4.7 to 5.4 seconds)
- In the parallelized loop, the calls that overlap take longer (about 9 seconds)
- The parallel run is not 4x faster than the iterative run! (as it was with Thread.Sleep :-)
- We almost halved the execution time (13.3 vs. 24.4 seconds), as expected
BUT sometimes the parallel run takes even less than half the time (around 10 seconds)
- Additionally, if we run the parallel for loop 10 times, it takes only about 17 seconds...
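One likely reason the Thread.Sleep version scaled so much better is that a sleeping thread needs no CPU time, so several sleeping iterations can overlap even on a single core, whereas the SOR kernel keeps the CPU busy for its whole run. A minimal sketch of the sleep-only variant follows. SleepWork and the 200 ms delay are made up for illustration and are not the code from the initial example.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class SleepDemo
{
    // Hypothetical "work" that only waits; it uses almost no CPU time.
    public static void SleepWork()
    {
        Thread.Sleep(200);
    }

    // Calls SleepWork 'runs' times one after another.
    public static TimeSpan RunSequential(int runs)
    {
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            SleepWork();
        }
        watch.Stop();
        return watch.Elapsed;
    }

    // Calls SleepWork 'runs' times via Parallel.For and
    // returns how many calls completed.
    public static int RunParallel(int runs, out TimeSpan elapsed)
    {
        int completed = 0;
        var watch = Stopwatch.StartNew();
        Parallel.For(0, runs, i =>
        {
            SleepWork();
            Interlocked.Increment(ref completed); // thread-safe counter
        });
        watch.Stop();
        elapsed = watch.Elapsed;
        return completed;
    }

    public static void Main()
    {
        TimeSpan parallelTime;
        TimeSpan sequentialTime = RunSequential(5);
        RunParallel(5, out parallelTime);
        Console.WriteLine("sequential: {0} seconds", sequentialTime.TotalSeconds);
        Console.WriteLine("parallel:   {0} seconds", parallelTime.TotalSeconds);
    }
}
```

How much faster the parallel variant runs depends on how many worker threads the thread pool decides to use, so the numbers will differ between machines.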
For an explanation we could have a deeper look at the algorithm behind SOR
Jacobi Successive Over-relaxation (SOR) http://math.nist.gov/scimark2/about.html
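To give an idea of the kind of work involved, here is a sketch of a textbook SOR sweep on a 2D grid. It is an illustration under my own assumptions (in-place updates, fixed border values, the names SorSketch and Relax) and not the code of the SciMark port. Each interior cell is blended with the average of its four neighbours, so the loop is pure arithmetic and memory access with nothing to wait on.

```csharp
using System;

public static class SorSketch
{
    // Performs 'iterations' SOR sweeps over the interior cells of grid g.
    // omega is the relaxation factor; the border cells are left unchanged.
    public static void Relax(double omega, double[][] g, int iterations)
    {
        int rows = g.Length;
        int cols = g[0].Length;
        double omegaOverFour = omega * 0.25;
        double oneMinusOmega = 1.0 - omega;

        for (int p = 0; p < iterations; p++)
        {
            for (int i = 1; i < rows - 1; i++)
            {
                for (int j = 1; j < cols - 1; j++)
                {
                    // Weighted mix of the neighbour average and the old value.
                    g[i][j] = omegaOverFour * (g[i - 1][j] + g[i + 1][j] + g[i][j - 1] + g[i][j + 1])
                              + oneMinusOmega * g[i][j];
                }
            }
        }
    }

    public static void Main()
    {
        // 3x3 grid: border cells are 1.0, the centre cell is 0.0.
        double[][] grid =
        {
            new[] { 1.0, 1.0, 1.0 },
            new[] { 1.0, 0.0, 1.0 },
            new[] { 1.0, 1.0, 1.0 }
        };

        // With omega = 1.0 the centre becomes the average of its four neighbours.
        Relax(1.0, grid, 1);
        Console.WriteLine(grid[1][1]);
    }
}
```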
Figure: Running the parallel for loop 10 times takes only ~17 seconds
With the help of Paul we were able to figure out what is going on here…
Findings
Our findings running in the VPC image:
- If we run the "normal" for loop, the CPU doesn't go crazy (only around 88%-98% usage)
- If we run the parallel loop, the CPU usage is higher (~100%)
Our findings running on bare metal (real dual core CPU)
Figure 1. Linear vs Parallel CPU utilization for loops
Conclusion
#1 CPU usage (=performance) is slightly scheduler dependent
I am not an OS expert, but I guess the above means: “Windows sees more threads, and gives them more time on the CPU”