Question

I'd like to optimize this code:

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{            
        for (int x = 0; x < Width; x++)
        {
            for (int y = 0; y < Height; y++)
            {
                Byte  pixelValue = image.GetPixel(x, y).B;
                this.sumOfPixelValues[x, y] += pixelValue;
                this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
            }
        }
}

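This is for image processing, and we currently run it over roughly 200 images. We have already optimized GetPixel to use unsafe code, and we do not read image.Width or image.Height, because those property accesses were adding to our runtime cost.

However, it is still slow. The problem is that our images are 640x480, so the body of the inner loop executes about 640x480x200 times. Is there some way to speed this up, or can someone convince me that it is already fast enough? Perhaps some kind of fast matrix addition would help, or is matrix addition inherently an n^2 operation that cannot be sped up?

Perhaps accessing the arrays through unsafe code would speed it up, but I'm not sure how to go about it or whether it would be worth the time. Probably not. Thanks.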

EDIT: Thank you for all your answers.

Here is the GetPixel method we are using:

 public Color GetPixel(int x, int y)
    {
        int offsetFromOrigin = (y * this.stride) + (x * 3);
        unsafe
        {
            return Color.FromArgb(this.imagePtr[offsetFromOrigin + 2], this.imagePtr[offsetFromOrigin + 1], this.imagePtr[offsetFromOrigin]);
        }
    }

Answer 1

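Despite the unsafe code, GetPixel may well be the bottleneck here. Have you looked at ways of getting all the pixels in the image in one call rather than once per pixel? For instance, Bitmap.LockBits may be your friend...

On my netbook, a very simple loop iterating 640 * 480 * 200 times takes only about 100 milliseconds - so if you're finding it's all going slowly, you should take another look at the bit inside the loop.

Another optimization you might want to look at: avoid multi-dimensional arrays. They are significantly slower than single-dimensional arrays.

In particular, you can have a single-dimensional array of size Width * Height and just keep an index: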

int index = 0;
for (int x = 0; x < Width; x++)
{
    for (int y = 0; y < Height; y++)
    {
        Byte pixelValue = image.GetPixel(x, y).B;
        this.sumOfPixelValues[index] += pixelValue;
        this.sumOfPixelValuesSquared[index] += pixelValue * pixelValue;
        index++;
    }
}

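Using the same simple test harness, adding a write to a 2-D rectangular array took the total time for looping 200 * 640 * 480 times up to around 850ms; using a 1-D rectangular array took it back down to around 340ms - so it is somewhat significant, and you currently have two of those writes per loop iteration.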
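The comparison above can be reproduced with a small harness along these lines. This is only a sketch, not the harness used for the numbers quoted: the class name ArrayLayoutBenchmark, the single-byte-per-pixel frame buffer and the uint accumulators are all assumptions made for illustration, and timings will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;

public static class ArrayLayoutBenchmark
{
    // Adds one frame into a rectangular (2-D) accumulator.
    public static void AddFrame2D(byte[] frame, uint[,] sums, int width, int height)
    {
        int i = 0;
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                sums[y, x] += frame[i++];
    }

    // Adds one frame into a flat (1-D) accumulator using a running index.
    public static void AddFrame1D(byte[] frame, uint[] sums)
    {
        for (int i = 0; i < frame.Length; i++)
            sums[i] += frame[i];
    }

    // Times both layouts over `frames` passes and returns elapsed milliseconds.
    public static (long ms2D, long ms1D) Run(int width, int height, int frames)
    {
        var frame = new byte[width * height];
        new Random(1).NextBytes(frame);          // hypothetical test data
        var sums2D = new uint[height, width];
        var sums1D = new uint[width * height];

        var sw = Stopwatch.StartNew();
        for (int f = 0; f < frames; f++) AddFrame2D(frame, sums2D, width, height);
        long ms2D = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int f = 0; f < frames; f++) AddFrame1D(frame, sums1D);
        long ms1D = sw.ElapsedMilliseconds;

        return (ms2D, ms1D);
    }
}
```

Calling ArrayLayoutBenchmark.Run(640, 480, 200) in a release build gives a rough feel for the gap on a given machine.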

Answer 2

Read this article, which also has some code and mentions the slowness of GetPixel.

link text

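From the article, this is the code for simply inverting the bits. It also shows the usage of LockBits.

It is important to note that unsafe code will not allow you to run your code remotely.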

public static bool Invert(Bitmap b)
{

BitmapData bmData = b.LockBits(new Rectangle(0, 0, b.Width, b.Height), 
                               ImageLockMode.ReadWrite, PixelFormat.Format24bppRgb); 

int stride = bmData.Stride; 
System.IntPtr Scan0 = bmData.Scan0; 
unsafe 
{ 
    byte * p = (byte *)(void *)Scan0;
    int nOffset = stride - b.Width*3; 
    int nWidth = b.Width * 3;
    for(int y=0;y < b.Height;++y)
    {
        for(int x=0; x < nWidth; ++x )
        {
            p[0] = (byte)(255-p[0]);
            ++p;
        }
        p += nOffset;
    }
}

b.UnlockBits(bmData);

return true;

}

Answer 3

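I suggest you profile this code and find out what is taking the most time.

You may find that it is the subscripting operation, in which case you might want to change your data structures from:

long[,] sumOfPixelValues = new long[n, m];
long[,] sumOfPixelValuesSquared = new long[n, m];

to

struct Sums
{
    public long sumOfPixelValues;
    public long sumOfPixelValuesSquared;
}

Sums[,] sums = new Sums[n, m];

This depends on what you find once you profile the code.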
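As a rough illustration of how the combined structure might be used, the sketch below accumulates one frame into a flat array of structs, so each pixel needs a single subscript operation instead of two. The names InterleavedSums and Accumulate, the flat byte[] frame and the shortened field names are assumptions for the example; whether this helps at all is exactly what profiling would have to show.

```csharp
public struct Sums
{
    public long Sum;         // running sum of pixel values
    public long SumSquared;  // running sum of squared pixel values
}

public static class InterleavedSums
{
    // Accumulates one frame of pixel values into a flat array of Sums structs.
    public static void Accumulate(byte[] frame, Sums[] totals)
    {
        for (int i = 0; i < frame.Length; i++)
        {
            long v = frame[i];
            ref Sums s = ref totals[i]; // index once; both fields are adjacent in memory
            s.Sum += v;
            s.SumSquared += v * v;
        }
    }
}
```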

Answer 4

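Profiling the code is the best place to start.

Matrix addition is a highly parallel operation and can be sped up by spreading the work across multiple threads.

I would recommend Intel's IPP library, which contains a threaded, highly optimized API for this sort of operation. Perhaps surprisingly, it only costs about $100 - but it would add significant complexity to your project.

If you don't want to trouble yourself with mixed-language programming and IPP, you could try CenterSpace's C# math libraries. The NMath API contains easy-to-use, forward-scaling matrix operations.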

保

Answer 5

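System.Drawing.Color is a structure, which on current versions of .NET kills most optimizations. Since you are only interested in the blue component anyway, use a method that only gets the data you need.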

public byte GetPixelBlue(int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        return this.imagePtr[offsetFromOrigin];
    }
}

Now, exchange the order of iteration of x and y:

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{            
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            Byte  pixelValue = image.GetPixelBlue(x, y);
            this.sumOfPixelValues[y, x] += pixelValue;
            this.sumOfPixelValuesSquared[y, x] += pixelValue * pixelValue;
        }
    }
}

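Now you are accessing all values within a scan line sequentially, which makes much better use of the CPU cache for all three matrices involved (image.imagePtr, sumOfPixelValues, and sumOfPixelValuesSquared). [Thanks to Jon for noticing that when I fixed access to image.imagePtr, I broke the other two. The output array indexing is now swapped to keep it optimal.]

Next, get rid of the member references. Another thread could theoretically set sumOfPixelValues to a different array midway through, which does horrible, horrible things to optimizations.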

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{          
    uint [,] sums = this.sumOfPixelValues;
    ulong [,] squares = this.sumOfPixelValuesSquared;
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            Byte  pixelValue = image.GetPixelBlue(x, y);
            sums[y, x] += pixelValue;
            squares[y, x] += (ulong)(pixelValue * pixelValue); // byte * byte is int; adding int to ulong needs a cast
        }
    }
}

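Now the compiler can generate optimal code for moving through the two output arrays, and after inlining and optimization, the inner loop can step through the image.imagePtr array with a stride of 3 instead of recalculating the offset all the time. Now an unsafe version for good measure, doing the optimizations that I think .NET ought to be smart enough to do, but probably isn't: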

unsafe public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{          
    byte* scanline = image.imagePtr;
    fixed (uint* sumsStart = &this.sumOfPixelValues[0,0])
    fixed (ulong* squaresStart = &this.sumOfPixelValuesSquared[0,0])
    {
        // Pointers declared by a fixed statement are read-only, so walk copies of them.
        uint* sums = sumsStart;
        ulong* squares = squaresStart;
        for (int y = 0; y < Height; y++)
        {
            byte* blue = scanline;
            for (int x = 0; x < Width; x++)
            {
                byte pixelValue = *blue;
                *sums += pixelValue;
                *squares += (ulong)(pixelValue * pixelValue);
                blue += 3;
                sums++;
                squares++;
            }
            scanline += image.stride;
        }
    }
}

Answer 6

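Where are the images stored? If each one is on disk, then part of your processing time may be going into fetching them from disk. You might check whether that is an issue, and if so, rewrite to pre-fetch the image data so that the array processing code does not have to wait for the data...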
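One way to overlap loading and processing is sketched below: while frame i is being processed, frame i + 1 is already loading on a thread-pool thread. The class PrefetchPipeline and the load and process delegates are hypothetical stand-ins for whatever reads an image and whatever accumulates it; only Task.Run and Task<T>.Result are actual .NET APIs here.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PrefetchPipeline
{
    // Processes items in order while the next one loads in the background.
    public static void Run<T>(IReadOnlyList<string> paths, Func<string, T> load, Action<T> process)
    {
        if (paths.Count == 0) return;

        Task<T> pending = Task.Run(() => load(paths[0]));
        for (int i = 0; i < paths.Count; i++)
        {
            T current = pending.Result;               // wait for item i to finish loading
            if (i + 1 < paths.Count)
            {
                string next = paths[i + 1];
                pending = Task.Run(() => load(next)); // start loading item i + 1
            }
            process(current);                         // runs while the next load is in flight
        }
    }
}
```

This only pays off if loading is a meaningful share of the total time, which is worth measuring first.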

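If the overall application logic allows it (is each matrix addition independent, or does it depend on the output of a previous matrix addition?) and the additions are independent, I would look at executing them on separate threads, or in parallel.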

Answer 7

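The only feasible way I can think of to speed it up would be to do some of the additions in parallel, which at your size could be beneficial despite the threading overhead.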

Answer 8

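Matrix addition is certainly an n^2 operation, but you can speed it up by using unsafe code, or at least by using jagged arrays instead of multidimensional ones.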
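A jagged-array version might look roughly like the following. It is a sketch under assumptions: the JaggedSums class, the row-major byte[] frame with one byte per pixel, and the uint accumulators are all made up for the example rather than taken from the question.

```csharp
public static class JaggedSums
{
    // Allocates a height-by-width jagged array (an array of row arrays).
    public static uint[][] Create(int width, int height)
    {
        var rows = new uint[height][];
        for (int y = 0; y < height; y++) rows[y] = new uint[width];
        return rows;
    }

    // Adds one frame (row-major, one byte per pixel) into the jagged accumulators.
    public static void Accumulate(byte[] frame, int width, uint[][] sums, uint[][] squares)
    {
        for (int y = 0; y < sums.Length; y++)
        {
            uint[] sumRow = sums[y];        // fetch each row once, then index it as a plain 1-D array
            uint[] squareRow = squares[y];
            int rowStart = y * width;
            for (int x = 0; x < width; x++)
            {
                uint v = frame[rowStart + x];
                sumRow[x] += v;
                squareRow[x] += v * v;
            }
        }
    }
}
```

Hoisting the row lookup out of the inner loop is the point of the design: the inner loop then only touches ordinary single-dimensional arrays.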

Answer 9

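About the only way to effectively speed up matrix multiplication is to use the right algorithm. There are more efficient ways to do matrix multiplication: take a look at the Strassen and Coppersmith-Winograd algorithms. It has also been noted [in previous replies] that you can parallelize the code, which helps quite a bit.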

Answer 10

I'm not sure whether it is faster, but you could write something like:

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{            
        Byte pixelValue;
        for (int x = 0; x < Width; x++)
        {
            for (int y = 0; y < Height; y++)
            {
                pixelValue = image.GetPixel(x, y).B;
                this.sumOfPixelValues[x, y] += pixelValue;
                this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
            }
        }
}

Answer 11

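This is a classic case of micro-optimization failing. You are not going to get anything out of staring at that loop. To get real speed benefits you need to start by looking at the big picture:

Can you asynchronously preload image[n + 1] while processing image[n]?
Can you load just the B channel from the image? That would reduce memory bandwidth.
Can you load the B value and update the sumOfPixelValues(Squared) arrays directly, i.e. read the file and update, instead of read file, store, read, update? Again, this reduces memory bandwidth.
Can you use one-dimensional arrays instead of two-dimensional ones? Maybe create your own array class that works either way.
Perhaps you could look into using Mono and the SIMD extensions?
Can you process the image in chunks and assign them to idle CPUs in a multi-CPU environment?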

EDIT:

Try having specialized image accessors so that you are not wasting memory bandwidth:

public byte GetBPixel (int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        // Blue is the first byte of each B, G, R triple, as in the question's GetPixel.
        return this.imagePtr [offsetFromOrigin];
    }
}

Or, even better:

public byte GetBPixel (int offset)
{
    unsafe
    {
        // offset points at the blue byte, the first of each B, G, R triple.
        return this.imagePtr [offset];
    }
}

And use the above in a loop like this:

for (int start_offset = 0, y = 0 ; y < Height ; start_offset += stride, ++y)
{
   for (int x = 0, offset = start_offset ; x < Width ; offset += 3, ++x)
   {
      pixel = GetBPixel (offset);
      // do stuff
   }
}
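For reference, here is a self-contained, safe-code sketch of the same offset-stepping loop with the "do stuff" part filled in. It assumes a managed byte[] holding 24bpp pixels laid out B, G, R (matching the question's GetPixel), a stride that may include padding, and flat accumulator arrays; the class name StridedBlue and its parameters are hypothetical.

```csharp
public static class StridedBlue
{
    // Accumulates the blue byte of every pixel into flat sum and sum-of-squares arrays.
    // `stride` is the number of bytes per scan line, which may exceed width * 3.
    public static void Accumulate(byte[] pixels, int stride, int width, int height,
                                  uint[] sums, ulong[] squares)
    {
        int index = 0;
        for (int startOffset = 0, y = 0; y < height; startOffset += stride, ++y)
        {
            for (int x = 0, offset = startOffset; x < width; offset += 3, ++x)
            {
                uint blue = pixels[offset];   // blue is the first byte of each pixel
                sums[index] += blue;
                squares[index] += blue * blue;
                index++;
            }
        }
    }
}
```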

Answer 12

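If you are only doing matrix addition, you may want to consider using multiple threads to take advantage of multi-core processors. Also use a one-dimensional index instead of two.

If you want to do more complicated operations, you need a highly optimized math library such as NMath.Net, which uses native code rather than .NET.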

Answer 13

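Sometimes doing things in native C#, even with unsafe calls, is slower than using methods that have already been optimized.

No results guaranteed, but you may want to investigate the System.Windows.Media.Imaging namespace and look at your whole problem in a different way.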

Answer 14

Although it is a micro-optimization and thus may not add much, you might want to study how likely you are to get a zero when you do

Byte  pixelValue = image.GetPixel(x, y).B;

Obviously, if pixelValue = 0 then there is no reason to do the summations, so your routine might become

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{
    for (int x = 0; x < Width; x++)
    {
        for (int y = 0; y < Height; y++)
        {
            Byte pixelValue = image.GetPixel(x, y).B;

            if (pixelValue != 0)
            {
                this.sumOfPixelValues[x, y] += pixelValue;
                this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
            }
        }
    }
}

However, the question is how often you will see pixelValue = 0, and whether the saving on the compute-and-store will offset the cost of the test.

Answer 15

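Matrix addition has complexity O(n^2), in number of additions.

However, since there are no intermediate results, you can parallelize the additions using threads:

It is easy to prove that the resulting algorithm will be lock-free
You can tune the optimal number of threads to use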
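A minimal sketch of that idea, assuming flat arrays and one byte per pixel: each row is handled by exactly one loop iteration, so no two threads ever write the same element and no locks are needed. The ParallelSums class and its maxThreads parameter are hypothetical; Parallel.For and ParallelOptions.MaxDegreeOfParallelism are the standard .NET APIs for running and capping the parallel loop.

```csharp
using System.Threading.Tasks;

public static class ParallelSums
{
    // Adds one frame into the accumulators, splitting the work by row.
    // maxThreads caps the parallelism so the thread count can be tuned.
    public static void Accumulate(byte[] frame, int width, int height,
                                  uint[] sums, ulong[] squares, int maxThreads)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };
        Parallel.For(0, height, options, y =>
        {
            int end = (y + 1) * width;
            for (int i = y * width; i < end; i++)
            {
                uint v = frame[i];
                sums[i] += v;            // row y is written only by this iteration
                squares[i] += v * v;
            }
        });
    }
}
```

Frames still have to be accumulated one after another (or into separate accumulators), since two frames processed at the same time would write the same elements.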

Speeding up matrix addition in C#
