Question

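I am using Dipperstein's bitarray.cpp class to work with bi-level (black and white) images, where the image data is natively stored as one bit per pixel.

I need to iterate through each and every bit using a for loop, on the order of 4-9 megapixels per image, over hundreds of images, like so: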

for( int i = 0; i < imgLength; i++) {
    if( myBitArray[i] == 1 ) {
         //  ... do stuff ...
    }
}

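Performance is usable, but not amazing. I ran the program through gprof and found a significant amount of time, and millions of calls, going to std::vector methods such as the iterator constructor and begin(). Here are the top sampled functions: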

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 37.91      0.80     0.80        2     0.40     1.01  findPattern(bit_array_c*, bool*, int, int, int)
 12.32      1.06     0.26 98375762     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char const*, std::vector<unsigned char, std::allocator<unsigned char> > >::__normal_iterator(unsigned char const* const&)
 11.85      1.31     0.25 48183659     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char const*, std::vector<unsigned char, std::allocator<unsigned char> > >::operator+(int const&) const
 11.37      1.55     0.24 49187881     0.00     0.00  std::vector<unsigned char, std::allocator<unsigned char> >::begin() const
  9.24      1.75     0.20 48183659     0.00     0.00  bit_array_c::operator[](unsigned int) const
  8.06      1.92     0.17 48183659     0.00     0.00  std::vector<unsigned char, std::allocator<unsigned char> >::operator[](unsigned int) const
  5.21      2.02     0.11 48183659     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char const*, std::vector<unsigned char, std::allocator<unsigned char> > >::operator*() const
  0.95      2.04     0.02                             bit_array_c::operator()(unsigned int)
  0.47      2.06     0.01  6025316     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator<unsigned char> > >::__normal_iterator(unsigned char* const&)
  0.47      2.06     0.01  3012657     0.00     0.00  __gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator<unsigned char> > >::operator*() const
  0.47      2.08     0.01  1004222     0.00     0.00  std::vector<unsigned char, std::allocator<unsigned char> >::end() const
... remainder omitted ...

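I'm not very familiar with the C++ STL, but can anyone explain why, for instance, std::vector::begin() is being called tens of millions of times? And, of course, is there anything I can do to speed it up?

Edit: I just gave up and optimized the search function (the loop) instead.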

Answer 1

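The fact that you see many inline functions in your profile output implies that they are not being inlined, that is, you are not compiling with optimization turned on. So the simplest thing you can do to speed up your code is to compile with -O2 or -O3.

Profiling unoptimized code is rarely worthwhile, because the execution profiles of optimized and unoptimized code are likely to be completely different.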

Answer 2

A quick peek at the code for bitarray.cpp shows:

bool bit_array_c::operator[](const unsigned int bit) const
{
    return((m_Array[BIT_CHAR(bit)] & BIT_IN_CHAR(bit)) != 0);
}

m_Array is of type std::vector<unsigned char>.

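The [] operator on STL vectors has constant complexity, but it is likely implemented as a call to vector::begin to get the base address of the array, followed by an offset calculation to reach the desired value. Since bitarray.cpp calls the [] operator on EVERY bit access, you are getting a lot of calls.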
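To make the call chain in the profile concrete, here is a minimal sketch of how one bit lookup can fan out into begin(), a pointer addition and a dereference when nothing is inlined. TinyVector and bit_at are hypothetical names, not the actual libstdc++ or bitarray.cpp source, and the bit layout (bit 0 is the most significant bit of byte 0) is an assumption.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an unoptimized vector-style container: every
// element access goes through begin(), a pointer addition and a dereference.
class TinyVector {
public:
    explicit TinyVector(std::vector<unsigned char> data) : data_(std::move(data)) {}

    const unsigned char* begin() const { return data_.data(); }

    const unsigned char& operator[](std::size_t n) const {
        assert(n < data_.size());
        return *(begin() + n);  // begin(), then +, then * for each access
    }

private:
    std::vector<unsigned char> data_;
};

// Assumed layout: bit 0 is the most significant bit of byte 0.
inline bool bit_at(const TinyVector& bytes, std::size_t bit) {
    const unsigned char mask =
        static_cast<unsigned char>(1u << (CHAR_BIT - 1 - bit % CHAR_BIT));
    return (bytes[bit / CHAR_BIT] & mask) != 0;
}
```

Without optimization each of these small functions is a real call, which is consistent with the roughly equal call counts in the profile; with -O2 a compiler will typically inline all of them.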

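Given your use case, I would create a custom implementation of the functionality contained in bitarray.cpp and tune it for your sequential, bit-by-bit access pattern.

Don't use unsigned chars; use 32-bit or 64-bit values to reduce the number of memory accesses required.
I would use a plain array rather than a vector, to avoid the lookup overhead.
Create a sequential access function, nextbit(), that doesn't redo the whole lookup each time. Store a pointer to the current "value"; it only needs to be incremented at each 32/64-bit boundary, and every access between boundaries is a simple mask/shift operation that should be very fast.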
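The three suggestions above could be sketched roughly as follows. This is only an illustration under stated assumptions: BitReader and countSetBits are hypothetical names, the image is assumed to be already packed into 64-bit words, and bit 0 is assumed to be the least significant bit of word 0 (which may differ from the layout bitarray.cpp uses).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sequential reader over an image packed into 64-bit words.
// Assumption: bit 0 of the image is the least significant bit of word 0.
class BitReader {
public:
    BitReader(const std::uint64_t* words, std::size_t bitCount)
        : next_(words), remaining_(bitCount), current_(0), bitsLeftInWord_(0) {}

    bool done() const { return remaining_ == 0; }

    // Returns the next bit; memory is read only once per 64 bits.
    bool nextbit() {
        assert(!done());
        if (bitsLeftInWord_ == 0) {
            current_ = *next_++;   // crossed a word boundary: load the next word
            bitsLeftInWord_ = 64;
        }
        const bool bit = (current_ & 1u) != 0;
        current_ >>= 1;            // between boundaries: just mask and shift
        --bitsLeftInWord_;
        --remaining_;
        return bit;
    }

private:
    const std::uint64_t* next_;
    std::size_t remaining_;
    std::uint64_t current_;
    unsigned bitsLeftInWord_;
};

// Example use: count the set pixels in a packed image.
inline std::size_t countSetBits(const std::uint64_t* words, std::size_t bitCount) {
    std::size_t count = 0;
    for (BitReader r(words, bitCount); !r.done(); ) {
        if (r.nextbit()) ++count;
    }
    return count;
}
```

The reader never loads a word beyond the one holding the last requested bit, so bitCount does not have to be a multiple of 64.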

Answer 3

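Without seeing your code, it's hard to make specific comments on how to speed up what you're doing. However, vector::begin() returns an iterator to the first element in a vector; it's a standard routine when iterating over a vector.

I would actually recommend using a more modern profiler, such as OProfile, which will give you much finer-grained information about where your program spends its time, down to the actual C++ line or even the individual asm instruction, depending on how you run it.

As an aside, why did you choose bitarray.cpp over a vanilla std::vector<bool>? I haven't used it myself, but a quick scan of the link above suggests that bitarray.cpp supports extra functionality beyond std::vector<bool>, which, if you're not using it, may be adding overhead compared to the STL vector class.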
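For comparison, the loop from the question written against a plain std::vector<bool> might look like the sketch below. countSetPixels is a hypothetical helper, and whether this is actually faster than bitarray.cpp is something only a measurement with optimization enabled can tell.

```cpp
#include <cstddef>
#include <vector>

// Counts set pixels in a bi-level image held in a std::vector<bool>.
inline std::size_t countSetPixels(const std::vector<bool>& image) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < image.size(); ++i) {
        if (image[i]) ++count;  // stands in for "do stuff" from the question
    }
    return count;
}
```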

Answer 4

You could probably improve performance by using a pointer/iterator (I'm not sure exactly what bitarray.cpp does for you), like so:

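bool *ptr = myBitArray;  // assumes myBitArray is a plain array of bool
for (int i = 0; i != imgLength; ++i, ++ptr)
{
   if (*ptr)  // test the element the pointer currently refers to
   {
       //handle
   }
}

I'm only using the int i here because I'm not sure whether your bit array is null terminated, in which case your loop condition could simply be

*ptr != '\0';

Or you could work out a better end condition. Using a std::iterator would be best, but I doubt your bitarray supports it.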

Edit:

Normally this would be a micro-optimization, but if you loop over enough things, it will probably improve performance slightly.

Answer 5

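If performance matters enough that you have to worry about accessing individual bits, then you should probably parallelize your code. Since you describe this as image processing, chances are the state of bit i won't affect how you handle bits i+1 through i+6, so you can probably rewrite your code to operate on bytes and words at a time. Just being able to increment the counter 8 to 64 times less often should give a measurable performance increase, and it will also make it easier for the compiler to optimize your code.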
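As a rough illustration of working on words rather than single bits, the sketch below examines 64 pixels per outer iteration and skips empty words entirely. setBitIndices is a hypothetical function, and the layout (bit 0 is the least significant bit of word 0) is an assumption, not something bitarray.cpp guarantees.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Collects the indices of all set bits, examining 64 pixels per outer iteration.
// Assumption: bit 0 of the image is the least significant bit of word 0.
inline std::vector<std::size_t> setBitIndices(const std::uint64_t* words,
                                              std::size_t wordCount) {
    std::vector<std::size_t> indices;
    for (std::size_t w = 0; w < wordCount; ++w) {
        std::uint64_t word = words[w];
        // An all-zero word costs one comparison instead of 64 bit tests, and
        // the inner loop stops as soon as no set bits remain in the word.
        for (std::size_t bit = 0; word != 0; ++bit, word >>= 1) {
            if (word & 1u) indices.push_back(w * 64 + bit);
        }
    }
    return indices;
}
```

How much this helps depends on how sparse the images are; for mostly white or mostly black pages, most words can be dismissed with a single comparison.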

Optimizing bit array access

5 answers: