I wrote some code that performs a summation in both a single-threaded and a multi-threaded way, as shown below:
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Diagnostics;

namespace ParallelFor
{
    class Program
    {
        static void Main()
        {
            Console.WriteLine("Program started");
            var step_time = Stopwatch.StartNew();
            step_time.Start();
            double R1 = 0.0;
            double R2 = 0.0;
            double R3 = 0.0;
            var t1 = new Thread(() => TestCounter(2000, ref R1, 1));
            var t2 = new Thread(() => TestCounter(2000, ref R2, 2));
            var t3 = new Thread(() => TestCounter(2000, ref R3, 3));
            t1.Start();
            t2.Start();
            t3.Start();
            do
            {
            } while (t1.IsAlive == true || t2.IsAlive == true || t3.IsAlive == true);
            Console.WriteLine("inside R1: {0}", R1);
            Console.WriteLine("inside R2: {0}", R2);
            Console.WriteLine("inside R3: {0}", R3);
            Console.WriteLine("Program finished");
            step_time.Stop();
            Console.WriteLine("multi-thread last {0} (MilSec)\n", step_time.ElapsedMilliseconds);
            step_time.Reset();
            step_time.Start();
            R1 = 0.0;
            R2 = 0.0;
            R3 = 0.0;
            TestCounter(2000, ref R1, 1);
            TestCounter(2000, ref R2, 2);
            TestCounter(2000, ref R3, 3);
            Console.WriteLine("inside R1: {0}", R1);
            Console.WriteLine("inside R2: {0}", R2);
            Console.WriteLine("inside R3: {0}", R3);
            step_time.Stop();
            Console.WriteLine("single thread last {0} (MilSec)\n", step_time.ElapsedMilliseconds);
            Console.ReadLine();
        }

        static void TestCounter(int counter, ref double result, int No)
        {
            for (int i = 0; i < counter + 1; i++)
                for (int j = 0; j < counter; j++)
                    for (int k = 0; k < counter; k++)
                        result += (double)i;
        }
    }
}
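I found that the single-threaded part takes less time! (I also ran the code with counter = 10000 and got the same result.) Why does the single-threaded version run faster?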
Answer 0 (score: 3)
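I think the overhead of setting up the threads and waiting for them to finish outweighs the benefit of running the code on multiple threads. Also, if you start more threads than there are cores available, your code will slow down because of the many context switches.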
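One optimization you can try is to get rid of manual synchronization such as Monitor.Wait() calls. Instead, create the Thread objects yourself (as many as there are CPU cores), start all of them, and then wait for them to finish by calling Thread.Join(). That way you need no synchronization code inside your threads: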
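// One worker per logical core; MyThread and threadData are placeholders
// for your own worker method (void MyThread(object data)) and its argument.
int NumCores = Environment.ProcessorCount;
Thread[] threads = new Thread[NumCores];
for (int i = 0; i < NumCores; i++)
{
    threads[i] = new Thread(MyThread);
    threads[i].Start(threadData);
}
// Join() blocks until each thread has finished, without spinning.
for (int i = 0; i < NumCores; i++)
{
    threads[i].Join();
}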
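Applied to the question's code, the empty do { } while (... IsAlive ...) loop could be replaced by Join() calls, so the main thread sleeps instead of occupying a core while it waits. The following is only a minimal sketch: JoinSketch, Sum, and RunAll are hypothetical names that do not appear in the question, and the worker is a simple one-level loop rather than the original TestCounter.

```csharp
using System;
using System.Threading;

class JoinSketch
{
    // Hypothetical worker: sums 0..n-1 locally, then stores the total in its own slot.
    static void Sum(int n, double[] results, int slot)
    {
        double local = 0;
        for (int i = 0; i < n; i++) local += i;
        results[slot] = local;
    }

    // Starts one thread per slot and blocks on Join() instead of polling IsAlive.
    public static double[] RunAll(int workers, int n)
    {
        var results = new double[workers];
        var threads = new Thread[workers];
        for (int t = 0; t < workers; t++)
        {
            int slot = t; // copy so each lambda captures its own index
            threads[t] = new Thread(() => Sum(n, results, slot));
            threads[t].Start();
        }
        foreach (Thread th in threads) th.Join();
        return results;
    }

    static void Main()
    {
        double[] r = RunAll(3, 2000);
        for (int t = 0; t < r.Length; t++)
            Console.WriteLine("inside R{0}: {1}", t + 1, r[t]);
    }
}
```

Copying the loop variable into slot matters: without it, all lambdas would capture the same variable t and could read it after it has changed.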
Answer 1 (score: 1)
The problem is most likely cache thrashing. I changed the TestCounter method to the following:
static void TestCounter(int counter, ref double result, int No)
{
    double internalResult = 0;
    for (int i = 0; i < counter + 1; i++)
        for (int j = 0; j < counter; j++)
            for (int k = 0; k < counter; k++)
                internalResult += (double)i;
    result += internalResult;
}
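Here the value is first accumulated in a local variable and only written to the passed-in result variable at the end. As a result, each thread spends most of its time working on its own stack rather than on memory locations (more precisely, cache lines) that other threads also access. In the original code, R1, R2, and R3 are captured by the lambdas, so they most likely sit next to each other in memory and share a cache line that all three threads keep writing to.
With this change I get nearly the expected speedup of a factor of 3.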
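The same idea can be taken one step further by having each worker return its sum instead of writing through a ref parameter, so no result variable is shared while the loops run. The sketch below uses Task.Run for this. TaskSketch and Count are hypothetical names, and the counter is deliberately small so the sketch finishes quickly. It is not the answerer's code.

```csharp
using System;
using System.Threading.Tasks;

class TaskSketch
{
    // Same triple loop as TestCounter, but the sum lives in a local and is returned.
    public static double Count(int counter)
    {
        double sum = 0;
        for (int i = 0; i < counter + 1; i++)
            for (int j = 0; j < counter; j++)
                for (int k = 0; k < counter; k++)
                    sum += i;
        return sum;
    }

    static void Main()
    {
        const int counter = 200; // assumption: small value chosen only for a quick run
        Task<double>[] tasks =
        {
            Task.Run(() => Count(counter)),
            Task.Run(() => Count(counter)),
            Task.Run(() => Count(counter)),
        };
        Task.WaitAll(tasks); // blocks until all three tasks have completed
        for (int t = 0; t < tasks.Length; t++)
            Console.WriteLine("inside R{0}: {1}", t + 1, tasks[t].Result);
    }
}
```

Tasks run on the thread pool, so this also avoids creating dedicated threads by hand. Whether it is faster than explicit threads for this workload would need to be measured.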