Question

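The program below, which computes pi = 3.1415, was originally written in C++ using OpenMP's #pragma omp for with reduction(+:sum). I am trying to do the same computation in C# (I'm a newbie) using Parallel.For, but for some reason I can't get it to work. How do I make a variable private to each thread in C#?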

using System;
using System.Diagnostics;
using System.Threading.Tasks;


namespace pi
{
    class Program
    {
        static void Main(string[] args)
        {
            long num_steps = 100000000;
            double step;
            double x, pi, sum = 0.0;
            step = 1.0 / num_steps;
            Stopwatch timer = Stopwatch.StartNew();
            Parallel.For(1, num_steps + 1, new ParallelOptions { MaxDegreeOfParallelism = 4 }, i =>
                {
                    x = (i - 0.5) * step;                    
                    sum = sum + 4.0 / (1.0 + x * x);
                });            
            pi = step * sum;
            timer.Stop();
            Console.WriteLine("\n pi with {0} steps is {1} in {2} milliseconds ", num_steps, pi, (timer.ElapsedMilliseconds));
            Console.ReadKey();
        }
    }
}

Answer 1

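I don't think you will get a better (faster) result by parallelizing this task with Parallel.For. But if you still want to do it, take a look at the ThreadLocal class. It provides "private" storage for each thread.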

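// Using directives needed by this snippet:
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

long num_steps = 100000000;
// trackAllValues: true keeps every thread's value so sum.Values can be read after the loop.
var sum = new ThreadLocal<double>(true);
var step = 1.0 / num_steps;
Stopwatch timer = Stopwatch.StartNew();
Parallel.For(1, num_steps + 1, new ParallelOptions { MaxDegreeOfParallelism = 4 }, i =>
{
    var x = (i - 0.5) * step;
    // Each thread accumulates into its own private copy.
    sum.Value = sum.Value + 4.0 / (1.0 + x * x);
});
// Combine the per-thread partial sums.
var pi = step * sum.Values.Sum();
timer.Stop();
sum.Dispose();
Console.WriteLine("\n pi with {0} steps is {1} in {2} milliseconds ", num_steps, pi, (timer.ElapsedMilliseconds));
Console.ReadKey();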

Answer 2

You could simply add a lock to the existing algorithm:

// Don't do this!
lock (lockObject)
{
    x = (i - 0.5) * step;
    sum = sum + 4.0 / (1.0 + x * x);
}

But it is slow.

Try this approach instead:

object lockObject = new object();

long num_steps = 100000000;
Stopwatch timer = Stopwatch.StartNew();
double step = 1.0 / num_steps;
double sum = 0;

Parallel.For(1, num_steps + 1, () => 0.0, (i, loopState, partialResult) =>
{
    var x = (i - 0.5) * step;
    return partialResult + 4.0 / (1.0 + x * x);
},
localPartialSum =>
{
    lock (lockObject)
    {
        sum += localPartialSum;
    }
});

var pi = step * sum;
timer.Stop();
Console.WriteLine("\n pi with {0} steps is {1} in {2} milliseconds ", num_steps, pi, (timer.ElapsedMilliseconds));

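It uses the overload of the Parallel.For method that implements the parallel aggregation pattern: each thread accumulates into its own partial result, and the lock is taken only once per thread when the partial results are combined.

On my system this is 7 times faster than the original algorithm with the lock added, and 2 times faster than Aleksey L's version.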

Answer 3

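You have provided a solution to the problem as an answer to your own question, and that solution is the correct way to perform this task with Parallel.For. However, you seem to be getting inconsistent (and slow) results, most likely because you compiled in debug mode. Switching to release mode will give better performance and consistent results, where each additional thread increases throughput.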
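Whether a run is affected can be checked from inside the program. The following is a minimal sketch, not part of the original answer; the BuildCheck class and its method names are hypothetical. It assumes the DEBUG symbol is defined only in the Debug configuration (the project default) and uses Debugger.IsAttached to detect an attached debugger.

```csharp
using System;
using System.Diagnostics;

static class BuildCheck
{
    // True when compiled with the DEBUG symbol defined (default for the Debug configuration).
    public static bool IsDebugBuild()
    {
#if DEBUG
        return true;
#else
        return false;
#endif
    }

    // Returns a warning when timings are likely unreliable, or an empty string otherwise.
    public static string DescribeTimingRisk(bool isDebugBuild, bool debuggerAttached)
    {
        if (isDebugBuild)
            return "Debug build: JIT optimizations are disabled, timings are not representative.";
        if (debuggerAttached)
            return "Debugger attached: the JIT may produce unoptimized code.";
        return "";
    }
}

static class Program
{
    static void Main()
    {
        var warning = BuildCheck.DescribeTimingRisk(BuildCheck.IsDebugBuild(), Debugger.IsAttached);
        Console.WriteLine(warning.Length > 0
            ? warning
            : "Release build without debugger: timings should be meaningful.");
    }
}
```

Building in the Release configuration (for example with dotnet run -c Release) and starting without the debugger should print the last message.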

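Parallel.For provides a general way to split a computation into N different parts that execute in parallel. However, when using Parallel.For the loop body is a delegate that is invoked once per iteration, and that (and possibly other things) adds some overhead. You can avoid this by doing the partitioning yourself.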
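A possible middle ground, which the original answer does not use, is to let Partitioner.Create hand out contiguindex ranges and run a plain for loop inside each range, so the delegate is called once per range instead of once per index. This is only a sketch; the RangePartitionedPi class name is hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class RangePartitionedPi
{
    public static double Compute(long numSteps)
    {
        double step = 1.0 / numSteps;
        double sum = 0.0;
        object gate = new object();

        // Partitioner.Create splits [1, numSteps + 1) into (fromInclusive, toExclusive) ranges.
        Parallel.ForEach(
            Partitioner.Create(1L, numSteps + 1),
            () => 0.0,                              // per-thread partial sum
            (range, loopState, local) =>
            {
                // Plain loop over the range: no delegate call per index.
                for (long i = range.Item1; i < range.Item2; i++)
                {
                    double x = (i - 0.5) * step;
                    local += 4.0 / (1.0 + x * x);
                }
                return local;
            },
            local => { lock (gate) { sum += local; } }); // combine once per thread

        return step * sum;
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(RangePartitionedPi.Compute(100000000));
    }
}
```

The range sizes are chosen by the library, so this trades some control for less code than fully manual partitioning.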

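Let's assume you want to split the problem into two parts (e.g. use two threads). You can then compute the partial sum of indices 0-499,999,999 on one thread and the partial sum of indices 500,000,000-999,999,999 on another thread. The final result is computed by adding the partial sums. The idea extends to more threads.

You need a function that computes a partial sum: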

Double ComputePartialSum(Int32 startIndex, Int32 count, Double step) {
  var endIndex = startIndex + count;
  var partialSum = 0D;
  for (var i = startIndex; i < endIndex; i += 1) {
    // Indices are 0-based here, so the midpoint of interval i is (i + 0.5)*step.
    var x = (i + 0.5D)*step;
    partialSum += 4D/(1D + x*x);
  }
  return partialSum;
}

Then you need to start some tasks (one per degree of parallelism) to compute all the partial sums:

var degreesOfParallelism = Environment.ProcessorCount; // Or another value
var stepCount = 1000000000;
var step = 1D/stepCount;
var stopwatch = Stopwatch.StartNew();
var partitionSize = stepCount/degreesOfParallelism;
var tasks = Enumerable
  .Range(0, degreesOfParallelism)
  .Select(
    partition => {
      var count = partition < degreesOfParallelism - 1
        ? partitionSize
        : stepCount - (degreesOfParallelism - 1)*partitionSize;
      return Task.Run(() => ComputePartialSum(partition*partitionSize, count, step));
    }
  )
  .ToArray();
Task.WaitAll(tasks);
var sum = tasks.Sum(task => task.Result);
stopwatch.Stop();
var pi = step*sum;
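To make the partition arithmetic above concrete, here is a small sketch that reproduces the same start index and count calculation and prints it for a tiny input. The PartitionBounds class and Split method are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;

static class PartitionBounds
{
    // Returns (start, count) pairs; the last partition absorbs the remainder of the division.
    public static List<(int Start, int Count)> Split(int stepCount, int partitions)
    {
        var partitionSize = stepCount / partitions;
        var result = new List<(int Start, int Count)>();
        for (var p = 0; p < partitions; p++)
        {
            var count = p < partitions - 1
                ? partitionSize
                : stepCount - (partitions - 1) * partitionSize;
            result.Add((p * partitionSize, count));
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        // 10 steps over 3 partitions: the sizes are 3, 3 and 4.
        foreach (var (start, count) in PartitionBounds.Split(10, 3))
            Console.WriteLine($"start={start} count={count}");
    }
}
```

For 10 steps and 3 partitions this prints start=0 count=3, start=3 count=3 and start=6 count=4, so every index is covered exactly once even when the step count is not divisible by the degree of parallelism.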

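If you want to take it up a notch, you can use System.Numerics to get SIMD instructions in the for loop of each partition. You have to do this in a 64-bit process (using RyuJIT, which I believe is a relatively recent change to the .NET JIT). On my computer a Vector<Double> can hold 4 elements, so if a partition has to process 100,000 elements, the for loop can be reduced to 25,000 iterations by computing 4 doubles in parallel inside the loop.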
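The vector width depends on the machine, so it may be worth checking before relying on the factor of 4. The sketch below is an illustration rather than part of the original answer; the SimdInfo class is a hypothetical helper. It uses Vector.IsHardwareAccelerated and Vector<double>.Count from System.Numerics.

```csharp
using System;
using System.Numerics;

static class SimdInfo
{
    // Number of full vector iterations needed for elementCount elements.
    public static int VectorIterations(int elementCount) => elementCount / Vector<double>.Count;

    // Elements left over that must be handled by a scalar loop.
    public static int ScalarRemainder(int elementCount) => elementCount % Vector<double>.Count;
}

static class Program
{
    static void Main()
    {
        Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
        Console.WriteLine($"Doubles per vector:   {Vector<double>.Count}");
        Console.WriteLine($"Vector iterations for 100000 elements: {SimdInfo.VectorIterations(100000)}");
        Console.WriteLine($"Scalar remainder for 100000 elements:  {SimdInfo.ScalarRemainder(100000)}");
    }
}
```

If IsHardwareAccelerated is false, the Vector<T> operations still work but run without SIMD, so no speedup should be expected.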

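The code is not pretty, but it gets the job done. Heap allocations are needed to get values into and out of the vectors, but I have been very careful not to perform any allocation inside the loop body: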

// Requires: using System.Linq; using System.Numerics;
Double ComputePartialSum(Int32 startIndex, Int32 count, Double step) {
  var vectorSize = Vector<Double>.Count;
  var remainder = count%vectorSize;
  // The vectorized loop covers the largest multiple of vectorSize; the rest is handled below.
  var endIndex = startIndex + count - remainder;
  var partialSumVector = Vector<Double>.Zero;
  // iVector holds vectorSize consecutive indices: startIndex, startIndex + 1, ...
  var iVector = new Vector<Double>(Enumerable.Range(startIndex, vectorSize).Select(i => (Double) i).ToArray());
  var loopIncrementVector = new Vector<Double>(Enumerable.Repeat((Double) vectorSize, vectorSize).ToArray());
  var point5Vector = new Vector<Double>(Enumerable.Repeat(0.5D, vectorSize).ToArray());
  var stepVector = new Vector<Double>(Enumerable.Repeat(step, vectorSize).ToArray());
  var fourVector = new Vector<Double>(Enumerable.Repeat(4D, vectorSize).ToArray());
  for (var i = startIndex; i < endIndex; i += vectorSize) {
    // Indices are 0-based here, so the midpoint of interval i is (i + 0.5)*step.
    var xVector = (iVector + point5Vector)*stepVector;
    partialSumVector += fourVector/(Vector<Double>.One + xVector*xVector);
    iVector += loopIncrementVector;
  }
  // Add up the lanes of the vector accumulator.
  var partialSumElements = new Double[vectorSize];
  partialSumVector.CopyTo(partialSumElements);
  var partialSum = partialSumElements.Sum();
  // Scalar loop for the elements that did not fill a whole vector.
  for (var i = endIndex; i < startIndex + count; i += 1) {
    var x = (i + 0.5D)*step;
    partialSum += 4D/(1D + x*x);
  }
  return partialSum;
}

On my computer, which has 4 cores with hyper-threading enabled, I get the following results (durations in seconds):

Parallelism | Parallel.For | Partitioning | SIMD
------------+--------------+--------------+------
1           | 6.541        | 3.951        | 1.985
2           | 3.278        | 1.998        | 1.045
3           | 2.218        | 1.422        | 0.739
4           | 1.909        | 1.245        | 0.637
5           | 1.748        | 1.140        | 0.586
6           | 1.579        | 1.039        | 0.523
7           | 1.435        | 0.991        | 0.492
8           | 1.392        | 0.968        | 0.491

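Obviously, these numbers vary slightly from one test run to the next.

As you can see, there are diminishing returns as the number of threads increases, but that is no big surprise because my computer only has 4 physical cores.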
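The diminishing returns can be quantified as speedup (single-thread time divided by n-thread time) and efficiency (speedup divided by n). The sketch below applies this to the Partitioning column of the table above; the Scaling class is a hypothetical helper written for this illustration.

```csharp
using System;

static class Scaling
{
    // Speedup relative to the single-threaded run.
    public static double Speedup(double singleThreadSeconds, double seconds)
        => singleThreadSeconds / seconds;

    // Fraction of ideal linear scaling achieved with the given thread count.
    public static double Efficiency(double singleThreadSeconds, double seconds, int threads)
        => Speedup(singleThreadSeconds, seconds) / threads;
}

static class Program
{
    static void Main()
    {
        // "Partitioning" column from the table, for 1 to 8 threads.
        double[] seconds = { 3.951, 1.998, 1.422, 1.245, 1.140, 1.039, 0.991, 0.968 };
        for (var n = 1; n <= seconds.Length; n++)
        {
            var s = Scaling.Speedup(seconds[0], seconds[n - 1]);
            var e = Scaling.Efficiency(seconds[0], seconds[n - 1], n);
            Console.WriteLine($"{n} thread(s): speedup {s:F2}, efficiency {e:F2}");
        }
    }
}
```

With these numbers the speedup is roughly 3.2 at 4 threads and only about 4.1 at 8 threads, which matches the observation that the extra hyper-threads add little on 4 physical cores.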

Answer 4

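Thanks guys, I have made some improvements, but the performance is still inconsistent. When I run the original code in C++, it consistently gives me a speedup with more threads. The results are as follows: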

C++ results

#include "stdafx.h"
#include <stdio.h>
#include <iostream>
#include <omp.h>
using namespace std;

static long num_steps = 1e9;
long double step;

int main()
{
    int i;
    double x, pi, sum = 0.0;
    double start_time, run_time;

    step = 1.0 / (double)num_steps;
    for (i = 1;i <= omp_get_num_procs();i++) {

        sum = 0.0;
        omp_set_num_threads(i);
        start_time = omp_get_wtime();
#pragma omp parallel 
        {

#pragma omp for reduction(+:sum) private(x) schedule(static) 
            for (i = 1;i <= num_steps; i++) {
                x = (i - 0.5)*step;
                sum = sum + 4.0 / (1.0 + x*x);
            }
        }
        pi = step * sum;
        run_time = omp_get_wtime() - start_time;
        printf("%d thread(s) ---> pi  %f calculated in %f seconds\n", i, pi, run_time);
    }
    system("pause");
}

However, when I run the C# code, the results are inconsistent.

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using System.Threading;
namespace pi
{

    class Program
    {        
        static void Main(string[] args)
        {
            object lockObject = new object();
            long num_steps = (long)1E9;
            double step = 1.0 / num_steps;
            for (int j = 1; j <= Environment.ProcessorCount; j++)
            {
                double sum = 0.0;
                Stopwatch timer = Stopwatch.StartNew();
                Parallel.For(1, num_steps + 1, new ParallelOptions { MaxDegreeOfParallelism = j }, () => 0.0, (i, loopState, partialResult) =>
                {
                    var x = (i - 0.5) * step;
                    return partialResult + 4.0 / (1.0 + x * x);
                },
                localPartialSum =>
                {
                    lock (lockObject)
                    {
                        sum += localPartialSum;
                    }
                });
                var pi = step * sum;
                timer.Stop();
                Console.WriteLine($"{j} thread(s) ----> pi = {pi} calculated in {(timer.ElapsedMilliseconds)/1000.0} seconds");
            }
            Console.ReadKey();
        }
    }
}

The results look like this:

C# results

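Apart from C# being much slower than C++ in this case, do you have any suggestions as to why the C# results are inconsistent? Every run gives different results.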

Parallelizing the calculation of pi
