Question

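Is there a function (SSEx intrinsics are fine) that fills memory with a specified int32_t value? For example, when the value is 0xAABBCC00, the resulting memory should look like this: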

AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
AABBCC00AABBCC00AABBCC00AABBCC00AABBCC00
...

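I could use std::fill or a simple for loop, but neither is fast enough.

The vector is resized only once, at program start, so that is not an issue. The bottleneck is filling the memory.

Simplified code: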

struct X
{
  typedef std::vector<int32_t> int_vec_t;
  int_vec_t buffer;

  X() : buffer( 5000000 ) { /* some more action */ }
  ~X() { /* some code here */ }

  // the following function is called 25 times per second
  const int_vec_t& process( int32_t background, const SOME_DATA& data );
};

const X::int_vec_t& X::process( int32_t background, const SOME_DATA& data )
{
    // the following line takes 30% of the total time of the #process function
    std::fill( buffer.begin(), buffer.end(), background );

    // some processing
    // ...

    return buffer;
}

Answer 1

This is how I would do it (please excuse the Microsoft-ness of it):

VOID FillInt32(__out PLONG M, __in LONG Fill, __in ULONG Count)
{
    __m128i f;

    // Fix mis-alignment.
    if ((ULONG_PTR)M & 0xf)
    {
        switch ((ULONG_PTR)M & 0xf)
        {
            case 0x4: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0x8: if (Count >= 1) { *M++ = Fill; Count--; }
            case 0xc: if (Count >= 1) { *M++ = Fill; Count--; }
        }
    }

    f.m128i_i32[0] = Fill;
    f.m128i_i32[1] = Fill;
    f.m128i_i32[2] = Fill;
    f.m128i_i32[3] = Fill;

    while (Count >= 4)
    {
        _mm_store_si128((__m128i *)M, f);
        M += 4;
        Count -= 4;
    }

    // Fill remaining LONGs.
    switch (Count & 0x3)
    {
        case 0x3: *M++ = Fill;
        case 0x2: *M++ = Fill;
        case 0x1: *M++ = Fill;
    }
}
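
For readers not on MSVC, here is a minimal sketch of the same idea using only the standard SSE2 intrinsics from `<emmintrin.h>`. The name `fill_int32` and the `FILL_HAVE_SSE2` macro are made up for this illustration, and the code falls back to std::fill when SSE2 is not available. It is a sketch under those assumptions, not a tuned implementation, and it may well perform no better than what the compiler already generates for std::fill.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define FILL_HAVE_SSE2 1   // hypothetical macro, used only in this sketch
#endif

// Hypothetical helper: fills `count` int32_t elements starting at `dst`.
void fill_int32(std::int32_t* dst, std::int32_t value, std::size_t count)
{
#ifdef FILL_HAVE_SSE2
    // Scalar stores until dst reaches a 16-byte boundary (or count runs out).
    while (count > 0 && (reinterpret_cast<std::uintptr_t>(dst) & 0xF) != 0) {
        *dst++ = value;
        --count;
    }

    // Broadcast the value into all four 32-bit lanes, then do aligned 16-byte stores.
    const __m128i pattern = _mm_set1_epi32(value);
    for (; count >= 4; count -= 4, dst += 4) {
        _mm_store_si128(reinterpret_cast<__m128i*>(dst), pattern);
    }
#endif
    // Remaining tail elements (or everything, when SSE2 is unavailable).
    std::fill(dst, dst + count, value);
}
```

In the question's code it would be called as `fill_int32(&buffer[0], background, buffer.size());`.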

Answer 2

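I have to ask: have you actually profiled std::fill and shown it to be the performance bottleneck? I would guess it is implemented quite efficiently, so that the compiler can automatically generate the appropriate instructions (for example with -march on gcc).

If it is the bottleneck, you may still get more benefit from an algorithmic redesign (if that is possible) that avoids setting so much memory (apparently over and over again), so that it no longer matters which fill mechanism you use.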
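
To check whether the fill really is the bottleneck, something like the following could be used as a starting point. It is only a rough sketch: `average_fill_ms` is a hypothetical helper, it measures wall-clock time with std::chrono::steady_clock, and a real profiler will give a more reliable picture than a hand-rolled timer.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical helper: average wall-clock milliseconds per std::fill pass.
// Assumes repeats > 0.
double average_fill_ms(std::vector<std::int32_t>& buffer, std::int32_t value, int repeats)
{
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        // XOR with the pass index so each pass writes a different value.
        std::fill(buffer.begin(), buffer.end(), value ^ i);
    }
    const auto stop = clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

Comparing this number for a 5000000-element buffer against the time of a whole process() call would show how much the fill contributes.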

Answer 3

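Thanks to everyone for your answers. I've checked wj32's solution, but its timing is very similar to std::fill. My current solution, which uses memcpy, runs 4 times faster than std::fill (in Visual Studio 2008):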

 // fill the first quarter by the usual way
 std::fill(buffer.begin(), buffer.begin() + buffer.size()/4, background);
 // copy the first quarter to the second (very fast)
 memcpy(&buffer[buffer.size()/4], &buffer[0], buffer.size()/4*sizeof(background));
 // copy the first half to the second (very fast)
 memcpy(&buffer[buffer.size()/2], &buffer[0], buffer.size()/2*sizeof(background));

In production code you need to check whether buffer.size() is divisible by 4 and add appropriate handling for the case where it is not.
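
One way to avoid the divisibility requirement is to fill a small prefix and then keep doubling the filled region with memcpy until the buffer is full. The sketch below is an illustration only: `fill_by_doubling` and the seed size of 1024 elements are assumptions of mine, not part of the measured solution above, and I have not benchmarked it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: works for any buffer size, including 0.
void fill_by_doubling(std::vector<std::int32_t>& buffer, std::int32_t value)
{
    const std::size_t n = buffer.size();
    if (n == 0) return;

    // Fill a small prefix the usual way (1024 elements is an arbitrary choice).
    const std::size_t seed = std::min<std::size_t>(n, 1024);
    std::fill(buffer.begin(), buffer.begin() + seed, value);

    // Copy the filled prefix onto the unfilled part. Each copy is at most as
    // long as the filled region, so source and destination never overlap.
    std::size_t filled = seed;
    while (filled < n) {
        const std::size_t chunk = std::min(filled, n - filled);
        std::memcpy(&buffer[filled], &buffer[0], chunk * sizeof(std::int32_t));
        filled += chunk;
    }
}
```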

Answer 4

Have you considered using

vector<int32_t> myVector;
myVector.reserve( sizeIWant );

and then using std::fill? Or perhaps the std::vector constructor that takes the number of items to hold and the value to initialize them with as arguments?
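
One caveat: reserve() only changes the vector's capacity, not its size, so std::fill over begin()..end() right after reserve() writes nothing. A minimal sketch of the constructor route and of refilling an existing vector with assign() follows; `make_filled` and `refill` are hypothetical names used only here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds a vector of `count` elements, each equal to `value`.
std::vector<std::int32_t> make_filled(std::size_t count, std::int32_t value)
{
    return std::vector<std::int32_t>(count, value);
}

// Hypothetical helper: overwrites every element of an existing vector,
// keeping its current size.
void refill(std::vector<std::int32_t>& v, std::int32_t value)
{
    v.assign(v.size(), value);
}
```

Whether assign() beats std::fill depends on the standard library; both end up writing every element.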

Answer 5

I'm not entirely sure how to set 4 bytes at a time, but if you want to fill memory with a single repeated byte, you can use memset.

void * memset ( void * ptr, int value, size_t num );
Fill block of memory

Sets the first num bytes of the block of memory pointed to by ptr to the specified value (interpreted as an unsigned char).
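
Because memset writes one byte value everywhere, it can only produce an int32_t pattern whose four bytes are all equal (0, -1, 0x01010101 and so on). A value like 0xAABBCC00 does not qualify. The sketch below makes that check explicit and falls back to std::fill otherwise; `has_uniform_bytes` and `fill_int32_memset` are hypothetical names for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when all four bytes of `value` are identical.
bool has_uniform_bytes(std::int32_t value)
{
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    return std::all_of(bytes, bytes + sizeof value,
                       [&bytes](unsigned char b) { return b == bytes[0]; });
}

// Hypothetical helper: uses memset when the pattern allows it, std::fill otherwise.
void fill_int32_memset(std::int32_t* dst, std::int32_t value, std::size_t count)
{
    if (has_uniform_bytes(value)) {
        unsigned char byte;
        std::memcpy(&byte, &value, 1);  // any byte of the value will do
        std::memset(dst, byte, count * sizeof(std::int32_t));
    } else {
        std::fill(dst, dst + count, value);
    }
}
```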

Answer 6

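Assuming your background parameter takes only a limited number of values (or, even better, only one), maybe you should try allocating a static vector and simply using memcpy.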

const int32_t sBackground = 1234;
static vector<int32_t> sInitalizedBuffer(n, sBackground);

const X::int_vec_t& X::process( const SOME_DATA& data )
{
    // this line replaces the std::fill call that took 30% of the total time of #process;
    // it copies the pre-filled template into the member buffer (assumes buffer.size() == n)
    std::memcpy( &buffer[0], &sInitalizedBuffer[0], n * sizeof(sBackground) );

    // some processing
    // ...

    return buffer;
}
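
If the background value changes occasionally rather than never, the same idea could be wrapped in a small class that rebuilds the template only when the value differs from the last one. This is a sketch under that assumption; `BackgroundTemplate` and its members are hypothetical, and note that memcpy has to read the template as well as write the buffer, so it is worth measuring before adopting it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper class: keeps a pre-filled copy of the background pattern.
class BackgroundTemplate
{
public:
    // The pattern starts out as all zeros, matching value_ == 0.
    explicit BackgroundTemplate(std::size_t count) : pattern_(count, 0), value_(0) {}

    // Copies the pattern into `buffer`, rebuilding it only when `background` changed.
    void apply(std::vector<std::int32_t>& buffer, std::int32_t background)
    {
        if (background != value_) {
            std::fill(pattern_.begin(), pattern_.end(), background);
            value_ = background;
        }
        // Copy no more than either vector holds.
        const std::size_t count = std::min(buffer.size(), pattern_.size());
        if (count != 0) {
            std::memcpy(&buffer[0], &pattern_[0], count * sizeof(std::int32_t));
        }
    }

private:
    std::vector<std::int32_t> pattern_;
    std::int32_t value_;
};
```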

Answer 7

I just tested std::fill with g++ with full optimizations (SSE etc. enabled):

#include <algorithm>
#include <inttypes.h>

int32_t a[5000000];

int main(int argc,char *argv[])
{
    std::fill(a,a+5000000,0xAABBCC00);
    return a[3];
}

And the inner loop looked like:

L2:
    movdqa  %xmm0, -16(%eax)
    addl    $16, %eax
    cmpl    %edx, %eax
    jne L2

It looks like 0xAABBCC00 x 4 is loaded into xmm0 and stored 16 bytes at a time.

Answer 8

It might be a bit non-portable, but you could use an overlapping memory copy. Fill the first four bytes with the pattern you want and use memcpy().

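int32_t* p = (int32_t*) malloc( size );   // size is in bytes
*p = 1234;
// p + 1 is 4 bytes past p (p + 4 would skip 16 bytes);
// note that overlapping memcpy is undefined behavior in C and C++
memcpy( p + 1, p, size - 4 );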

I don't think you can get much faster than that.

How do I quickly fill memory with an `int32_t` value?
