How can I make this C# loop faster?

Asked: 2011-05-12 19:42:57

Tags: c# performance

Executive summary: Reed's answer below is the fastest if you want to stay in C#. If you're willing to marshal to C++ (which I am), that's a faster solution.

I have two 55 MB ushort arrays in C#. I am combining them using the following loop:

float b = (float)number / 100.0f;
for (int i = 0; i < length; i++)
{
      image.DataArray[i] = 
          (ushort)(mUIHandler.image1.DataArray[i] + 
          (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
}

According to DateTime.Now calls added before and after it, this code takes 3.5 seconds to run. How can I make it faster?

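EDIT: Here is some code that, I think, shows the root of the problem. When the following code is run in a brand new WPF application, I get these timing results: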

Time elapsed: 00:00:00.4749156 //arrays added directly
Time elapsed: 00:00:00.5907879 //arrays contained in another class
Time elapsed: 00:00:02.8856150 //arrays accessed via accessor methods

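So when the arrays are walked directly, the loop is much faster than when the arrays sit inside another object and are reached through accessor methods. This suggests that my original loop is somehow going through accessor methods rather than accessing the arrays directly. Even so, the fastest I seem to be able to get is about half a second. When I run the second code listing in C++ with icc, I get: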

Run time for pointer walk: 0.0743338

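In this case, then, C++ is roughly 6 to 7 times faster (0.475 s versus 0.074 s). That is with icc; I'm not sure whether msvc can reach the same performance, as I'm not as familiar with its optimizations. Is there any way to get C# near that level of C++ performance, or should I just have C# call my C++ routine?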
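For the "have C# call my C++ routine" option, the native side could look roughly like the sketch below. This is only an illustration: the name blend_u16, the BLEND_API macro, and the parameter order are assumptions and do not appear in the original post. It wraps the loop from Listing 2 in an unmangled, exported function so that a C# [DllImport] declaration could bind to it.

```cpp
#include <cstdint>

// BLEND_API is a hypothetical macro: extern "C" keeps the symbol unmangled,
// and on Windows __declspec(dllexport) exports it from the DLL.
#if defined(_WIN32)
#define BLEND_API extern "C" __declspec(dllexport)
#else
#define BLEND_API extern "C"
#endif

// Hypothetical routine (name and signature are assumptions).
// Assumes 0 <= b <= 1 so the scaled value always fits in 16 bits.
BLEND_API void blend_u16(const std::uint16_t* in1,
                         const std::uint16_t* in2,
                         std::uint16_t* out,
                         int length,
                         float b)
{
    for (int i = 0; i < length; ++i) {
        // Same arithmetic as Listing 2: scale, truncate, add, keep the low 16 bits.
        const std::uint16_t scaled =
            static_cast<std::uint16_t>(static_cast<float>(in2[i]) * b);
        out[i] = static_cast<std::uint16_t>(in1[i] + scaled);
    }
}
```

Passing the three arrays as plain pointers plus a length keeps the interface simple to marshal; ushort[] arrays are blittable, so no element-by-element conversion should be needed on the C# side.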

Listing 1, C# code:

public class ArrayHolder
{
    int length;
    public ushort[] output;
    public ushort[] input1;
    public ushort[] input2;
    public ArrayHolder(int inLength)
    {
        length = inLength;
        output = new ushort[length];
        input1 = new ushort[length];
        input2 = new ushort[length];
    }

    public ushort[] getOutput() { return output; }
    public ushort[] getInput1() { return input1; }
    public ushort[] getInput2() { return input2; }
}


/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();


        Random random = new Random();

        int length = 55 * 1024 * 1024;
        ushort[] output = new ushort[length];
        ushort[] input1 = new ushort[length];
        ushort[] input2 = new ushort[length];

        ArrayHolder theArrayHolder = new ArrayHolder(length);

        for (int i = 0; i < length; i++)
        {
            output[i] = (ushort)random.Next(0, 16384);
            input1[i] = (ushort)random.Next(0, 16384);
            input2[i] = (ushort)random.Next(0, 16384);
            theArrayHolder.getOutput()[i] = output[i];
            theArrayHolder.getInput1()[i] = input1[i];
            theArrayHolder.getInput2()[i] = input2[i];
        }

        Stopwatch stopwatch = new Stopwatch(); 
        stopwatch.Start();
        int number = 44;
        float b = (float)number / 100.0f;
        for (int i = 0; i < length; i++)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * (float)input2[i]));
        } 
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.output[i] =
                (ushort)(theArrayHolder.input1[i] +
                (ushort)(b * (float)theArrayHolder.input2[i]));
        }
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.getOutput()[i] =
                (ushort)(theArrayHolder.getInput1()[i] +
                (ushort)(b * (float)theArrayHolder.getInput2()[i]));
        }
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
    }
}

Listing 2, C++ equivalent:

// looptiming.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <stdio.h>
#include <iostream>


int _tmain(int argc, _TCHAR* argv[])
{

    int length = 55*1024*1024;
    unsigned short* output = new unsigned short[length];
    unsigned short* input1 = new unsigned short[length];
    unsigned short* input2 = new unsigned short[length];
    unsigned short* outPtr = output;
    unsigned short* in1Ptr = input1;
    unsigned short* in2Ptr = input2;
    int i;
    const int max = 16384;
    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = rand()%max;
        *in1Ptr = rand()%max;
        *in2Ptr = rand()%max;
    }

    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick1, tick2;   // A point in time
    LARGE_INTEGER time;   // For converting tick into real time


    QueryPerformanceCounter(&tick1);

    outPtr = output;
    in1Ptr = input1;
    in2Ptr = input2;
    int number = 44;
    float b = (float)number/100.0f;


    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = *in1Ptr + (unsigned short)((float)*in2Ptr * b);
    }
    QueryPerformanceCounter(&tick2);
    QueryPerformanceFrequency(&ticksPerSecond);

    time.QuadPart = tick2.QuadPart - tick1.QuadPart;

    std::cout << "Run time for pointer walk: " << (double)time.QuadPart/(double)ticksPerSecond.QuadPart << std::endl;

    return 0;
}

EDIT 2: Enabling /QxHost in the second example drops the time to 0.0662714 seconds. Modifying the first loop as @Reed suggested gets me down to

Time elapsed: 00:00:00.3835017

So, still not fast enough for a slider. That time comes from this code:

        stopwatch.Start();
        Parallel.ForEach(Partitioner.Create(0, length),
         (range) =>
         {
             for (int i = range.Item1; i < range.Item2; i++)
             {
                 output[i] =
                     (ushort)(input1[i] +
                     (ushort)(b * (float)input2[i]));
             }
         });

        stopwatch.Stop();

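EDIT 3: As per @Eric Lippert's suggestion, I've rerun the C# code in a release build and, rather than running with a debugger attached, just printed the results to a dialog. They are: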

  • Simple arrays: ~0.273s
  • Contained arrays: ~0.330s
  • Accessor arrays: ~0.345s
  • Parallel arrays: ~0.190s

(These numbers are averages over 5 runs.)
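The same averaging could be applied to the C++ listing. The sketch below is a hypothetical helper (average_seconds is not part of the original post) that uses std::chrono::steady_clock instead of QueryPerformanceCounter, runs a callable a given number of times, and returns the mean wall-clock time in seconds.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` `runs` times and returns the mean
// wall-clock duration of one run, in seconds.
template <typename Work>
double average_seconds(int runs, Work work)
{
    using clock = std::chrono::steady_clock;  // monotonic, suited to interval timing
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return runs > 0 ? total / runs : 0.0;
}
```

When timing a loop this way, it helps to read from the output array afterwards so that an optimizing compiler cannot discard the work being measured.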

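So the parallel solution is definitely faster than the 3.5 seconds I was getting before, but it still falls well short of the 0.074 seconds achieved by the icc-compiled C++. It seems, therefore, that the fastest solution is to compile in release mode and then marshal to an icc-compiled C++ routine, which makes using a slider feasible.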

EDIT 4: Three more suggestions from @Eric Lippert: change the loop bound in the for loop from length to array.Length, use doubles, and try unsafe code.

For those three, the timings are now:

  • Length: ~0.274s
  • Doubles, not floats: ~0.290s
  • Unsafe: ~0.376s

So far, the parallel solution is the big winner. Although if I could add these via a shader, maybe I could see some kind of speedup there...

Here's the additional code:

        stopwatch.Reset();

        stopwatch.Start();

        double b2 = ((double)number) / 100.0;
        for (int i = 0; i < output.Length; ++i)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b2 * (double)input2[i]));
        }

        stopwatch.Stop();
        DoubleArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        stopwatch.Reset();

        stopwatch.Start();

        for (int i = 0; i < output.Length; ++i)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * input2[i]));
        }

        stopwatch.Stop();
        LengthArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        unsafe
        {
            fixed (ushort* outPtr = output, in1Ptr = input1, in2Ptr = input2){
                ushort* outP = outPtr;
                ushort* in1P = in1Ptr;
                ushort* in2P = in2Ptr;
                for (int i = 0; i < output.Length; ++i, ++outP, ++in1P, ++in2P)
                {
                    *outP = (ushort)(*in1P + b * (float)*in2P);
                }
            }
        }

        stopwatch.Stop();
        UnsafeArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);

2 answers:

Answer 0: (score: 19)

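This should be perfectly parallelizable. However, given the small amount of work being done per element, you'll need to handle this with extra care.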

The proper way to do this (in .NET 4) would be to use Parallel.ForEach in conjunction with a Partitioner:

float b = (float)number / 100.0f;
Parallel.ForEach(Partitioner.Create(0, length), 
(range) =>
{
   for (int i = range.Item1; i < range.Item2; i++)
   {
      image.DataArray[i] = 
          (ushort)(mUIHandler.image1.DataArray[i] + 
          (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
   }
});

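This will efficiently partition the work across the available processing cores in your system, and should provide a decent speedup if you have multiple cores.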
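To make the idea of range partitioning concrete, here is a simplified sketch written in C++ to match Listing 2. It is not how Parallel.ForEach is implemented: the helper name blend_parallel and the static one-chunk-per-thread split are assumptions made for illustration, and the .NET partitioner may size and hand out ranges differently. What it does show is each worker running a tight loop over its own contiguous index range rather than paying a delegate call per element.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper (not from the original post): splits [0, length) into
// one contiguous chunk per worker thread. Assumes 0 <= b <= 1.
void blend_parallel(const std::uint16_t* in1, const std::uint16_t* in2,
                    std::uint16_t* out, std::size_t length, float b,
                    unsigned workers = std::thread::hardware_concurrency())
{
    if (workers == 0) workers = 1;  // hardware_concurrency() may return 0
    const std::size_t chunk = (length + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(length, w * chunk);
        const std::size_t end = std::min(length, begin + chunk);
        if (begin == end) break;  // no elements left to hand out
        pool.emplace_back([=] {
            // Each worker touches only its own [begin, end) slice, so no locking is needed.
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = static_cast<std::uint16_t>(
                    in1[i] + static_cast<std::uint16_t>(static_cast<float>(in2[i]) * b));
            }
        });
    }
    for (std::thread& t : pool) t.join();
}
```

Because every thread writes to a disjoint part of the output, the only synchronization is the final join.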

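That being said, this will, at best, only speed up the operation by a factor equal to the number of cores in your system. If you need to speed it up more, you'll likely need to resort to a mix of parallelization and unsafe code. At that point, it might be worth thinking about alternatives to trying to present this in real time.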

Answer 1: (score: 7)

Assuming you have a lot of these elements and are using .NET 4, you can try parallelizing the operation:

Parallel.For(0, length, i=>
   {
       image.DataArray[i] = 
      (ushort)(mUIHandler.image1.DataArray[i] + 
      (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
   });

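Of course, this all depends on whether parallelizing this is worth it. The statement looks fairly short computationally, and accessing elements by index is already pretty fast. You might still see gains because the loop runs so many times over that much data.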