Question

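I have a square boolean matrix M of size N, stored by rows, and I want to count the number of bits set to 1 in each column.

For instance, for n = 4: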

1101
0101
0001
1001

M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };

result = { 2, 2, 0, 4};

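I can obviously

transpose the matrix M into a matrix M'
popcount each row of M'.

Good algorithms exist for matrix transposition and for popcounting through bit manipulation.

My question is: is it possible to "merge" such algorithms into a single one?

Note that N could be quite large (say 1024 and more) on a 64-bit architecture.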
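As a baseline for the answers that follow, the two-step idea from the question (transpose, then popcount each row of the transpose) can be sketched as below. This is only a minimal scalar sketch under stated assumptions: the function name column_counts_transpose is hypothetical, the matrix has at most 32 columns, and row r is packed into one uint32_t with bit c (counting from the least significant bit) holding column c.

```cpp
#include <algorithm>
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: per-column bit counts for a matrix of at most 32 columns.
// rows[r] holds row r, with bit c (bit 0 = least significant) holding column c.
std::array<uint32_t, 32> column_counts_transpose(const std::vector<uint32_t>& rows)
{
    std::array<uint32_t, 32> counts{};
    for (std::size_t base = 0; base < rows.size(); base += 32) {
        const std::size_t n = std::min<std::size_t>(32, rows.size() - base);
        assert(n >= 1 && n <= 32);

        // Step 1: transpose a block of up to 32 rows, so transposed[c] holds column c.
        uint32_t transposed[32] = {};
        for (std::size_t r = 0; r < n; ++r)
            for (int c = 0; c < 32; ++c)
                transposed[c] |= ((rows[base + r] >> c) & 1u) << r;

        // Step 2: popcount each row of the transposed block.
        for (int c = 0; c < 32; ++c)
            counts[c] += static_cast<uint32_t>(std::bitset<32>(transposed[c]).count());
    }
    return counts;
}
```

With the 4x4 example above, the rows 1101, 0101, 0001 and 1001 are packed as 11, 10, 8 and 9 under this bit order, and the first four counts come out as 2, 2, 0, 4.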

Answer 1

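I have another idea that I haven't finished writing up properly.

Godbolt link to messy work-in-progress: it doesn't have correct loop bounds or cleanup, but for large buffers it runs about 3x faster than @edrezen's version on my Skylake i7-6700k, with g++ 7.3 -O3 -march=native. See the test_SWAR_avx2 function. (I know it doesn't compile on Godbolt; Agner Fog's asmlib.h isn't present.)

I might also have some of the columns in the wrong order, but from single-stepping through the asm I think it's doing the right amount of work, i.e. any necessary bugfixes won't slow it down.

I used 16-bit accumulators, so another outer loop may be needed if you care about inputs large enough to overflow 16-bit per-column counters.

Interesting observation: an earlier buggy version of my loop used sum0123 twice in store_globalsums_from_vec16, leaving sum4567 unused, so it was optimized away in the main loop. With less work to do, gcc fully unrolled the big for(int i=0 ; i<5 ; i++) loop, and the code ran slower, at about 1 cycle per byte instead of 0.5. The loop was probably too big for the uop cache or something (I haven't profiled yet, but a front-end decode bottleneck would explain it). For some reason @edrezen's version only runs at about 1.5c/B for me, not the ~1.25 reported in the answer. My CPU actually runs at 3.9GHz while Agner Fog's library detects it at 4.0, but that's not enough to explain the difference.

Also, gcc spills sum4567_16bit to the stack, so we're already pushing the limits of register pressure without AVX512. It's updated infrequently so this isn't a problem, but needing more accumulators in the inner loop could be.

Your data layout isn't clear for the case where the number of columns isn't 32.

It seems that for each uint32_t chunk of 32 columns, all the rows are stored contiguously in memory, i.e. looping over the rows of a given column is efficient. If you had more than 32 columns, the rows for columns 32..63 would be contiguous and would come after all the rows for columns 0..31.

(If instead you have all the columns of a single row stored contiguously, you could still use this idea, but you might need to spill/reload some accumulators to memory, or let the compiler do that for you if it makes good choices.)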
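To make the assumed layout concrete, here is a minimal sketch of how a single bit would be located in it. The accessor name get_bit and its parameters are hypothetical; the bit order within a word (bit 0 = lowest column of the chunk) is an assumption that matches the (bitmap[row] >> col) & 1 convention used in the benchmark code of Answer 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical accessor for the layout described above: the matrix is split into
// chunks of 32 columns, and each chunk stores all of its rows contiguously.
bool get_bit(const std::vector<uint32_t>& bitmap, std::size_t nbRows,
             std::size_t row, std::size_t col)
{
    const std::size_t chunk = col / 32;              // which block of 32 columns
    const std::size_t index = chunk * nbRows + row;  // rows of one chunk are adjacent
    assert(row < nbRows && index < bitmap.size());
    return ((bitmap[index] >> (col % 32)) & 1u) != 0;
}
```

With this layout, eight consecutive words starting at index chunk * nbRows + row are eight consecutive rows of the same 32 columns, which is what a single 32-byte vector load picks up.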

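So loading a 32-byte (8-dword) vector gets 8 rows of data for one chunk of columns. That's extremely convenient: it allows widening from 1-bit (in memory) to 2-bit accumulators, then grabbing more data before widening to 4-bit, and so on, summing along the way so that significant work gets done while the data is still dense. (Rather than only adding 1 bit (0 or 1) per byte to vector accumulators.)

The more we unroll, the more data we can grab from memory to make better use of the coding space in our vectors, i.e. our variables have higher entropy. Throwing around more data (in terms of bits of memory that contributed to it) per vpaddb/w/d/q or unpack/shuffle instruction is a good thing.

Accumulators narrower than 1 byte within a SIMD vector are basically an https://en.wikipedia.org/wiki/SWAR technique, where you have to AND away the bits that you shift past an element boundary, because there are no SIMD element boundaries to do it for us. (And we avoid overflow anyway, so an ADD carrying into the next element isn't a problem.)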
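The first widening step (1-bit data to 2-bit counters) can be illustrated with plain 32-bit integers standing in for SIMD vectors. This is a sketch under that simplifying assumption, not the code behind the Godbolt link; the names Counters2, sum3_rows and column_count are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Two words of 2-bit counters: 'even' counts columns 0,2,4,..., 'odd' counts 1,3,5,...
struct Counters2 {
    uint32_t even;
    uint32_t odd;
};

// Add three 32-column rows into 2-bit SWAR counters.
// Each 2-bit field receives at most 1+1+1 = 3, so no carry crosses a field boundary.
Counters2 sum3_rows(uint32_t r0, uint32_t r1, uint32_t r2)
{
    const uint32_t mask = 0x55555555u;  // 0b0101...: low bit of every 2-bit field
    Counters2 c;
    c.even = (r0 & mask) + (r1 & mask) + (r2 & mask);
    // Shift first, then AND away the bit that moved into the neighbouring field.
    c.odd  = ((r0 >> 1) & mask) + ((r1 >> 1) & mask) + ((r2 >> 1) & mask);
    return c;
}

// Read back the count for one column (0..31) from the packed counters.
unsigned column_count(const Counters2& c, unsigned col)
{
    assert(col < 32);
    const uint32_t word = (col & 1u) ? c.odd : c.even;
    return (word >> (col & ~1u)) & 3u;
}
```

Three rows are the most that fit, since a fourth set bit in the same column would overflow the 2-bit field into its neighbour.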

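Each inner loop iteration:

Take a vector of data from the same columns in each of 2 or 3 (groups of) rows. So you have either 3 * 8 rows from one chunk of 32 columns, or 3 rows of 256 columns.

Mask them with set1(0b01010101) to get the even (low) bits, and use (vec>>1) & mask (_mm256_srli_epi32(v,1)) to get the odd (high) bits. Use _mm256_add_epi8 to accumulate within those 2-bit accumulators. They can't overflow with only 3 ones, so carry-propagation boundaries don't actually matter.

Each byte of your vector then holds 4 separate vertical sums, and you have two vectors (odd/even).

Repeat the above to get another pair of vectors from 3 more vectors of data from memory.

Combine again to get 4 vectors of 4-bit accumulators (with possible values 0..6). Still without mixing bits from within a single 32-bit element, of course, because we must never do that. Shifts only move the bits for odd / high columns to the bottom of the 2-bit or 4-bit unit that contains them, so they can be added to bits that were moved the same way in other vectors.

_mm256_unpacklo/hi_epi8 and mask, or shift + mask, to get 8-bit accumulators.

Put the above in a loop that runs up to 5 times, so the 0..12 accumulator values go up to 0..60 (i.e. leaving 2 bits of headroom for unpacking the 8-bit accumulators, using all their coding space).

With the data layout from your answer, we can add data from dword elements within the same vector. We can do that so we don't run out of registers when widening the accumulators up to 16-bit (because x86-64 only has 16 YMM registers, and we need some for constants).

_mm256_unpacklo/hi_epi16 and add, to interleave pairs of 8-bit counters so that a group of counters for the same column has expanded from a dword to a qword.

Repeat this general idea to reduce the number of registers (or __m256i variables) your accumulators are spread across.

Efficiently dealing with the lack of a lane-crossing 2-input byte or word shuffle is inconvenient, but it's a pretty minor part of the total work. vextracti128 / vpaddb xmm -> vpmovzxbw worked well.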
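The chain of widening steps listed above can be modelled in portable scalar code, with one uint32_t playing the role of one vector. This is a simplified sketch and not the AVX2 routine from the answer: the function name column_counts_swar is hypothetical, the 8-bit stage is fed by rounds of 6 rows (42 rounds, so at most 252 per counter) instead of the 5-iteration schedule described above, and the 16-bit stage is replaced by a direct flush into 32-bit totals.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model: rows[r] is one row of 32 columns, bit c = column c.
std::array<uint32_t, 32> column_counts_swar(const std::vector<uint32_t>& rows)
{
    const uint32_t m1 = 0x55555555u, m2 = 0x33333333u, m4 = 0x0F0F0F0Fu;
    std::array<uint32_t, 32> totals{};
    const std::size_t n = rows.size();
    std::size_t i = 0;
    auto next = [&]() -> uint32_t { return i < n ? rows[i++] : 0u; };  // zero-pad the tail

    while (i < n) {
        // acc8[k] holds four 8-bit counters, for columns k, k+8, k+16 and k+24.
        uint32_t acc8[8] = {};
        // Each round adds at most 6 to every 8-bit counter: 42 * 6 = 252 <= 255.
        for (int round = 0; round < 42 && i < n; ++round) {
            // 1-bit -> 2-bit: two groups of three rows (values 0..3).
            uint32_t even[2], odd[2];
            for (int g = 0; g < 2; ++g) {
                const uint32_t a = next(), b = next(), c = next();
                even[g] = (a & m1) + (b & m1) + (c & m1);
                odd[g]  = ((a >> 1) & m1) + ((b >> 1) & m1) + ((c >> 1) & m1);
            }
            // 2-bit -> 4-bit: acc4[j] counts columns j, j+4, j+8, ... (values 0..6).
            uint32_t acc4[4];
            acc4[0] = (even[0] & m2) + (even[1] & m2);
            acc4[1] = (odd[0] & m2) + (odd[1] & m2);
            acc4[2] = ((even[0] >> 2) & m2) + ((even[1] >> 2) & m2);
            acc4[3] = ((odd[0] >> 2) & m2) + ((odd[1] >> 2) & m2);
            // 4-bit -> 8-bit: split every word into its low and high nibbles.
            for (int j = 0; j < 4; ++j) {
                acc8[j]     += acc4[j] & m4;
                acc8[j + 4] += (acc4[j] >> 4) & m4;
            }
        }
        // Flush the 8-bit counters into the wide per-column totals.
        for (int k = 0; k < 8; ++k)
            for (int b = 0; b < 4; ++b)
                totals[k + 8 * b] += (acc8[k] >> (8 * b)) & 0xFFu;
    }
    assert(i == n);
    return totals;
}
```

The point of the structure is that the cheap, dense stages run most often: the 2-bit stage sees every row, while the flush into wide totals happens only once per 252 rows.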

Answer 2

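I ran a benchmark comparing two approaches:

transpose + popcount
update row by row

I wrote a naive version and an AVX2 version of each approach. For the AVX2 "transpose + popcount" approach I used some functions found on stackoverflow or elsewhere.

In my tests, I assume that the input is an nbRows x 32 matrix in bit-packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array of uint32_t.

The code is the following: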

#include <cinttypes>
#include <cstdio>
#include <cstdlib>   // atoi, rand, srand, exit, EXIT_SUCCESS
#include <cstring>
#include <ctime>     // time
#include <cmath>
#include <cassert>
#include <chrono>
#include <immintrin.h>
#include <asmlib.h>

using namespace std;
using namespace std::chrono;

// see https://stackoverflow.com/questions/24225786/fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector
static __m256i expand_bits_to_bytes (uint32_t x);

// see https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/
static void sse_trans(char const *inp, char *out);

static double deviation (double n, double sum2, double sum);

////////////////////////////////////////////////////////////////////////////////
// Naive approach (matrix transposition)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums)
{
    assert (nbRows%32==0);

    uint8_t transpo[32][32];  memset (transpo, 0, sizeof(transpo));

    for (uint64_t k=0; k<nbRows; k+=32)
    {
        // We unpack and transpose the input into a 32x32 bytes matrix
        for (size_t row=0; row<32; row++)
        {
            for (size_t col=0; col<32; col++)  {  transpo[col][row] = (bitmap[k+row] >> col) & 1 ;  }
        }

        for (size_t row=0; row<32; row++)
        {
            // We popcount the current row
            uint8_t sum=0;
            for (size_t col=0; col<32; col++)  {  sum += transpo[row][col];  }

            // We update the corresponding global sum
            globalSums[row] += sum;
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
// Naive approach (row by row)
////////////////////////////////////////////////////////////////////////////////
void test_update_row_by_row_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums)
{
    for (uint64_t row=0; row<nbRows; row++)
    {
        for (size_t col=0; col<32; col++)
        {
            globalSums[col] += (bitmap[row] >> col) & 1;
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
// AVX2 (matrix transposition + popcount)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums)
{
    assert (nbRows%32==0);

    uint32_t transpo[32];

    const uint32_t* loop = bitmap;
    for (uint64_t k=0; k<nbRows; loop+=32, k+=32)
    {
        // We transpose the input as a 32x32 bits matrix
        sse_trans ((const char*)loop, (char*)transpo);

        // We update the global sums
        for (size_t i=0; i<32; i++)
        {
            globalSums[i] += __builtin_popcount (transpo[i]);
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
// AVX2 approach (update totals row by row)
////////////////////////////////////////////////////////////////////////////////

// Note: we use template specialization to unroll some portions of a loop
template<int N>
void UpdateLocalSums (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
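    // We update the local sums with the current row: expand_bits_to_bytes yields
    // 0xFF (-1) in each byte whose bit is set, so subtracting it adds 1 to that counter
    localSums = _mm256_sub_epi8 (localSums, expand_bits_to_bytes (bitmap[k++]));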

    // Go recursively
    UpdateLocalSums<N-1>(localSums, bitmap, k);
}

template<>
void UpdateLocalSums<0> (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
}

// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
#define USE_AVX2_FOR_GRAND_TOTALS 1

void test_update_row_by_row_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums)
{
    union U256i {  __m256i v;   uint8_t a[32];  uint32_t b[8];  };

    // We use 1 register for updating local totals
    __m256i   localSums = _mm256_setzero_si256();

#ifdef USE_AVX2_FOR_GRAND_TOTALS
    // Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
    __m256i   globalSumsReg[4];  for (size_t r=0; r<4; r++)  {   globalSumsReg[r] = _mm256_setzero_si256(); }
#endif

    uint64_t steps = nbRows / 255;
    uint64_t k=0;

    const int divisorOf255 = 5;

    // We iterate over all rows
    for (uint64_t i=0; i<steps; i++)
    {
        // we update the local totals (255*32=8160 additions)
        for (int j=0; j<255/divisorOf255; j++)
        {
            // unroll some portion of the 255 loop through template specialization
            UpdateLocalSums<divisorOf255>(localSums, bitmap, k);
        }

#ifdef USE_AVX2_FOR_GRAND_TOTALS
        // Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums

        // We take the 128 high bits of the local sums
        __m256i   localSums2 = _mm256_broadcastsi128_si256(_mm256_extracti128_si256(localSums,1));

        globalSumsReg[0] = _mm256_add_epi32 (globalSumsReg[0],
            _mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 0)))
        );
        globalSumsReg[1] = _mm256_add_epi32 (globalSumsReg[1],
            _mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 8)))
        );
        globalSumsReg[2] = _mm256_add_epi32 (globalSumsReg[2],
            _mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 0)))
        );
        globalSumsReg[3] = _mm256_add_epi32 (globalSumsReg[3],
            _mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 8)))
        );
#else
        // we update the global totals
        U256i tmp = { localSums };
        for (size_t k=0; k<32; k++)  {  globalSums[k] += tmp.a[k];  }
#endif
        // we reset the local totals
        localSums = _mm256_setzero_si256();
    }

#ifdef USE_AVX2_FOR_GRAND_TOTALS
    // We update the global totals into the final uint32_t array
    for (size_t r=0; r<4; r++)
    {
        U256i tmp = { globalSumsReg[r] };
        for (size_t k=0; k<8; k++)  {  globalSums[r*8+k] += tmp.b[k];  }
    }
#endif

    // we update the remaining local totals
    for (uint64_t i=steps*255; i<nbRows; i++)
    {
        UpdateLocalSums<1>(localSums, bitmap, k);
    }

    // we update the global totals
    U256i tmp = { localSums };
    for (size_t k=0; k<32; k++)  {  globalSums[k] += tmp.a[k];  }
}

////////////////////////////////////////////////////////////////////////////////
void execute (
    const char* name,
    void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums),
    size_t nbRuns,
    uint64_t nbRows,
    uint32_t* bitmap
)
{
    uint64_t  sums[32];

    double timeTotal=0;
    double cycleTotal=0;
    double timeTotal2=0;
    double cycleTotal2=0;
    uint64_t check=0;

    for (size_t n=0; n<nbRuns; n++)
    {
        memset(sums,0,sizeof(sums));

        // We want both time and cpu cycles information
        milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
        uint64_t c0 = ReadTSC();

        // We run the test
        (*fct) (nbRows, bitmap, sums);

        uint64_t c1 = ReadTSC();
        milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());

        timeTotal  += (t1-t0).count();
        cycleTotal += (double)(c1-c0) / nbRows;

        timeTotal2  += (t1-t0).count() * (t1-t0).count();
        cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);

        // We compute some dummy checksum
        for (size_t k=0; k<32; k++)  {  check += sums[k];  }
    }

    printf ("%-21s |  %5.0lf (%5.1lf)            |  %5.2lf (%4.2lf)          |  %.3lf           |  0x%lx\n",
        name,
        timeTotal / nbRuns,
        deviation (nbRuns, timeTotal2, timeTotal),
        cycleTotal/nbRuns,
        deviation (nbRuns, cycleTotal2, cycleTotal),
        nbRows * cycleTotal / timeTotal / 1000000.0,  // estimated frequency in GHz (matches %.3lf)
        check                                         // checksum (matches 0x%lx)
    );
}

////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    // We set rows number as 2^n where n is the provided argument
    // For simplification, we assume that the rows number is a multiple of 32
    uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
    size_t   nbRuns = argc>2 ? atoi(argv[2]) : 10;

    // We build an bitmap of size nbRows*32
    uint32_t* bitmap = new uint32_t[nbRows];
    if (bitmap==nullptr)
    {
        fprintf(stderr, "unable to allocate the bitmap\n");
        exit(1);
    }

    // We fill the bitmap with random values
    srand(time(nullptr));
    for (uint64_t i=0; i<nbRows; i++)    {  bitmap[i] = rand() & 0xFFFFFFFF;  }

    printf ("\n");
    printf ("nbRows=%" PRIu64 "  nbRuns=%zu\n", nbRows, nbRuns);
    printf ("------------------------------------------------------------------------------------------------------------\n");
    printf ("name                  | time in msec : mean (sd)  | cycles/row : mean (sd) | frequency in GHz | checksum\n");
    printf ("------------------------------------------------------------------------------------------------------------\n");

    // We launch the benchmark
    execute ("naive (transpo)   ", test_transpose_popcnt_naive,  nbRuns, nbRows, bitmap);
    execute ("naive (row by row)", test_update_row_by_row_naive, nbRuns, nbRows, bitmap);
    execute ("AVX2  (transpo)   ", test_transpose_popcnt_avx2,   nbRuns, nbRows, bitmap);
    execute ("AVX2  (row by row)", test_update_row_by_row_avx2,  nbRuns, nbRows, bitmap);

    printf ("\n");

    // Some clean up
    delete[] bitmap;

    return EXIT_SUCCESS;
}

////////////////////////////////////////////////////////////////////////////////
__m256i expand_bits_to_bytes(uint32_t x)
{
    __m256i xbcast = _mm256_set1_epi32(x);

    // Each byte gets the source byte containing the corresponding bit
    __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf     = _mm256_shuffle_epi8(xbcast, shufmask);
    __m256i andmask  = _mm256_set1_epi64x(0x8040201008040201);  // every 8 bits -> 8 bytes, pattern repeats.
    __m256i isolated_inverted = _mm256_and_si256(shuf, andmask);

    // Avoid an _mm256_add_epi8 thanks to Peter Cordes's comment
    return _mm256_cmpeq_epi8(isolated_inverted, andmask);
}

////////////////////////////////////////////////////////////////////////////////
void sse_trans(char const *inp, char *out)
{
#define INP(x,y) inp[(x)*4 + (y)/8]
#define OUT(x,y) out[(y)*4 + (x)/8]

    int cc, i;
    union { __m256i x; uint8_t b[32]; } tmp;

    for (cc = 0; cc < 32; cc += 8)
    {
        for (i = 0; i < 32; ++i)
            tmp.b[i] = INP(i, cc);

        for (i = 8; i--; tmp.x = _mm256_slli_epi64(tmp.x, 1))
            *(uint32_t*)&OUT(0, cc + i) = _mm256_movemask_epi8(tmp.x);
    }
}

////////////////////////////////////////////////////////////////////////////////
double deviation (double n, double sum2, double sum)  {  return sqrt (sum2/n - (sum/n)*(sum/n)); }

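Some remarks:

I used Agner Fog's asmlib to get a function that returns CPU cycles
the compilation command is g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64
the gcc version is 7.3.1
the CPU is an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
I compute a dummy checksum in order to compare the results of the different tests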
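The structure of test_update_row_by_row_avx2 (narrow local counters that are flushed into wide totals every 255 rows) can be seen more easily in a scalar model. The sketch below is only an illustration under that simplification, with a hypothetical function name; it mirrors the flushing logic, not the AVX2 bit expansion.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of the row-by-row update: 32 narrow (8-bit) local
// counters absorb one bit per row and are flushed into wide totals every 255 rows,
// which is the most an 8-bit counter can hold before wrapping.
std::array<uint64_t, 32> column_counts_row_by_row(const std::vector<uint32_t>& rows)
{
    std::array<uint64_t, 32> totals{};
    std::array<uint8_t, 32> local{};
    std::size_t pending = 0;  // rows accumulated since the last flush

    auto flush = [&]() {
        for (int c = 0; c < 32; ++c) { totals[c] += local[c]; local[c] = 0; }
        pending = 0;
    };

    for (uint32_t row : rows) {
        for (int c = 0; c < 32; ++c)
            local[c] = static_cast<uint8_t>(local[c] + ((row >> c) & 1u));
        if (++pending == 255) flush();
    }
    flush();  // remaining rows (fewer than 255)
    assert(pending == 0);
    return totals;
}
```

In the AVX2 version the 32 local counters live in a single __m256i, so the inner loop over columns becomes one byte-wise subtraction per row.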

Now the results:

------------------------------------------------------------------------------------------------------------
name                  | time in msec : mean (sd)  | cycles/row : mean (sd) | frequency in GHz | checksum
------------------------------------------------------------------------------------------------------------
naive (transpo)       |   4548 ( 36.5)            |  43.91 (0.35)          |  2.592           |  0x9affeb5a6
naive (row by row)    |   3033 ( 11.0)            |  29.29 (0.11)          |  2.592           |  0x9affeb5a6
AVX2  (transpo)       |    767 ( 12.8)            |   7.40 (0.12)          |  2.592           |  0x9affeb5a6
AVX2  (row by row)    |    130 (  4.0)            |   1.25 (0.04)          |  2.591           |  0x9affeb5a6

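So far, the AVX2 "row by row" approach seems to be the best.

Note that when I saw this result (fewer than 2 cycles per row), I made no further effort to optimize the AVX2 "transpose + popcount" approach, which could be improved by computing several popcounts in parallel (I may test that later).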

Answer 3

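I eventually wrote another implementation following the high entropy SWAR approach suggested by Peter Cordes. This implementation is recursive and relies on C++ template specialization.

The global idea is to fill N-bit accumulators to their maximum without carry overflow (this is where recursion is used). When these accumulators are filled, we update the grand totals and start again with fresh N-bit accumulators, until all rows have been processed.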
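As a side check of the "no carry overflow" claim, the arithmetic can be worked out separately from the implementation. The sketch below assumes what the RecursiveSWAR code in this answer does: an accumulator at a given depth is 2^depth bits wide and is fed (2^depth + 1) partial sums from the level below (the i<N+1 loop). The helper names width, max_value, capacity, magic and fits are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Width in bits of one accumulator at a given depth: 1, 2, 4, 8, 16, 32.
constexpr unsigned width(unsigned depth) { return 1u << depth; }

// Largest value an accumulator can reach when every level adds (width + 1)
// partial sums coming from the level below; depth 0 is a raw bit (0 or 1).
constexpr uint64_t max_value(unsigned depth)
{
    return depth == 0 ? 1 : (width(depth) + 1) * max_value(depth - 1);
}

// Largest value an accumulator of that width can represent: 2^width - 1.
constexpr uint64_t capacity(unsigned depth)
{
    return (uint64_t{1} << width(depth)) - 1;
}

// Mask used when widening from `depth` to `depth + 1`, built with the same
// recurrence as MAGIC_NUMBER (valid here for depth 0..4).
constexpr uint32_t magic(unsigned depth)
{
    return depth == 0 ? 0x55555555u
                      : magic(depth - 1) * (1 + (1u << (1u << (depth - 1))))
                                         / (1 + (1u << (1u << depth)));
}

// True when filling an accumulator of this depth as described cannot overflow it.
bool fits(unsigned depth)
{
    assert(depth <= 5);
    return max_value(depth) <= capacity(depth);
}

static_assert(max_value(1) == 3 && max_value(2) == 15, "2-bit and 4-bit fields are filled exactly");
static_assert(max_value(3) == 135 && max_value(4) == 2295, "8-bit and 16-bit fields keep headroom");
static_assert(magic(1) == 0x33333333u && magic(2) == 0x0F0F0F0Fu, "classic SWAR masks");
```

Since each depth-0 load brings 8 rows into a 256-bit register, one full fill at DEPTH = 4 would consume 8 * 2295 = 18360 rows under these assumptions.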

Here is the code (see the function test_SWAR_recursive):

#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstdlib>   // atoi, rand, exit, EXIT_SUCCESS
#include <cstring>

using namespace std;
using namespace std::chrono;

// avoid the #include <asmlib.h>
extern "C" uint64_t ReadTSC();

static double deviation (double n, double sum2, double sum)  {  return sqrt (sum2/n - (sum/n)*(sum/n)); }

////////////////////////////////////////////////////////////////////////////////
// Recursive SWAR approach (with template specialization)
////////////////////////////////////////////////////////////////////////////////

template<int DEPTH>
struct RecursiveSWAR
{
    // Number of accumulators for current depth
    static const int N = 1<<DEPTH;

    // Array of N-bit accumulators
    typedef __m256i Array[N];

    // Magic numbers (0x55555555, 0x33333333, ...) computed recursively
    static const u_int32_t MAGIC_NUMBER =
        RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER
            * (1 + (1<<(1<<(DEPTH-1))))
            / (1 + (1<<(1<<(DEPTH+0))));

    static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array accumulators)
    {
        // We reset the N-bit accumulators
        for (int i=0; i<N; i++)  {  accumulators[i] = _mm256_setzero_si256();  }

        // We check (only for depth big enough) that we have still rows to process
        if (DEPTH>=3)  if (begin>=end)  { return; }

        typename RecursiveSWAR<DEPTH-1>::Array accumulatorsMinusOne;

        // We load a register with the mask
        __m256i mask = _mm256_set1_epi32 (RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER);

        // We fill the N-bit accumulators to their maximum capacity without carry overflow
        for (int i=0; i<N+1; i++)
        {
            // We fill (N-1)-bit accumulators recursively
            RecursiveSWAR<DEPTH-1>::fillAccumulators (begin, end, accumulatorsMinusOne);

            // We update the N-bit accumulators from the (N-1)-bit accumulators
            for (int j=0; j<RecursiveSWAR<DEPTH-1>::N; j++)
            {
                // LOW part
                accumulators[2*j+0] = _mm256_add_epi32 (
                    accumulators[2*j+0],
                    _mm256_and_si256 (
                        accumulatorsMinusOne[j],
                        mask
                    )
                );

                // HIGH part
                accumulators[2*j+1] = _mm256_add_epi32 (
                    accumulators[2*j+1],
                    _mm256_and_si256 (
                        _mm256_srli_epi32 (
                            accumulatorsMinusOne[j],
                            RecursiveSWAR<DEPTH-1>::N
                        ),
                        mask
                    )
                );
            }
        }
    }
};

// Template specialization for DEPTH=0
template<>
struct RecursiveSWAR<0>
{
    static const int N = 1;

    typedef __m256i Array[N];

    static const u_int32_t MAGIC_NUMBER = 0x55555555;

    static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array result)
    {
        // We just load 8 rows in the AVX2 register
        result[0] = _mm256_loadu_si256 ((__m256i*)begin);

        // We update the iterator
        begin += 1*sizeof(__m256i)/sizeof(u_int32_t);
    }
};

template<int DEPTH> struct TypeInfo  { };
template<> struct TypeInfo<3>  {  typedef u_int8_t  Type; };
template<> struct TypeInfo<4>  {  typedef u_int16_t Type; };
template<> struct TypeInfo<5>  {  typedef u_int32_t Type; };

unsigned char reversebits (unsigned char b)
{
    return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}

void test_SWAR_recursive (uint64_t nbRows, const uint32_t* bitmap, uint32_t*  globalSums)
{
    static const int DEPTH = 4;

    RecursiveSWAR<DEPTH>::Array accumulators;

          uint32_t* begin = (uint32_t*) bitmap;
    const uint32_t* end   = bitmap + nbRows;

    // We reset the grand totals
    for (int i=0; i<32; i++)  { globalSums[i] = 0; }

    while (begin < end)
    {
        // We fill the N-bit accumulators to the maximum without overflow
        RecursiveSWAR<DEPTH>::fillAccumulators (begin, end, accumulators);

        // We update grand totals from the filled N-bit accumulators
        for (int i=0; i<RecursiveSWAR<DEPTH>::N; i++)
        {
            int r = reversebits(i) >> (8-DEPTH);
            u_int32_t* sums   = globalSums+r;
            TypeInfo<DEPTH>::Type*  values = (TypeInfo<DEPTH>::Type*) (accumulators+i);

            for (int j=0; j<8*(1<<(5-DEPTH)); j++)
            {
                sums[(j*RecursiveSWAR<DEPTH>::N) % 32] += values[j];
            }
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
void execute (
    const char* name,
    void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint32_t*  globalSums),
    size_t nbRuns,
    uint64_t nbRows,
    u_int32_t* bitmap
)
{
    uint32_t  sums[32];

    double timeTotal=0;
    double cycleTotal=0;
    double timeTotal2=0;
    double cycleTotal2=0;
    uint64_t check=0;

    for (size_t n=0; n<nbRuns; n++)
    {
        memset(sums,0,sizeof(sums));

        // We want both time and cpu cycles information
        milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
        uint64_t c0 = ReadTSC();

        // We run the test
        (*fct) (nbRows, bitmap, sums);

        uint64_t c1 = ReadTSC();
        milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());

        timeTotal  += (t1-t0).count();
        cycleTotal += (double)(c1-c0) / nbRows;

        timeTotal2  += (t1-t0).count() * (t1-t0).count();
        cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);

        // We compute some dummy checksum
        for (size_t k=0; k<32; k++)  {  check += (k+1)*sums[k];  }
    }

    printf ("%-21s |  %5.0lf (%5.1lf)            |  %5.2lf (%5.3lf)         |  %.3lf           |  0x%lx\n",
        name,
        timeTotal / nbRuns,
        deviation (nbRuns, timeTotal2, timeTotal),
        cycleTotal/nbRuns,
        deviation (nbRuns, cycleTotal2, cycleTotal),
        nbRows * cycleTotal / timeTotal / 1000000.0,
        check/nbRuns
    );
}


////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    // We set rows number as 2^n where n is the provided argument
    // For simplification, we assume that the rows number is a multiple of 32
    uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
    size_t   nbRuns = argc>2 ? atoi(argv[2]) : 10;

    // We build an bitmap of size nbRows*32
    uint64_t actualNbRows = nbRows + 100000;
    uint32_t* bitmap = (uint32_t*)_mm_malloc(sizeof(uint32_t)*actualNbRows, 256);
    if (bitmap==nullptr)
    {
        fprintf(stderr, "unable to allocate the bitmap\n");
        exit(1);
    }
    memset (bitmap, 0, sizeof(u_int32_t)*actualNbRows);

    // We fill the bitmap with random values
    //    srand(time(nullptr));
    for (uint64_t i=0; i<nbRows; i++)    {  bitmap[i] = rand() & 0xFFFFFFFF;  }


    printf ("\n");
    printf ("nbRows=%" PRIu64 "  nbRuns=%zu\n", nbRows, nbRuns);
    printf ("------------------------------------------------------------------------------------------------------------\n");
    printf ("name                  | time in msec : mean (sd)  | cycles/row : mean (sd) | frequency in GHz | checksum\n");
    printf ("------------------------------------------------------------------------------------------------------------\n");

    // We launch the benchmark
    execute ("AVX2  (SWAR rec)  ", test_SWAR_recursive,          nbRuns, nbRows, bitmap);

    printf ("\n");

    // Some clean up
    _mm_free (bitmap);

    return EXIT_SUCCESS;
}

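In this code, the accumulators are 2^DEPTH bits wide. Note that this implementation is valid up to DEPTH = 5. For DEPTH = 4, here are the performance results compared with Peter Cordes's implementation (named high entropy SWAR):

The graph gives the number of cycles needed to process one row (of 32 items) as a function of the number of rows in the matrix. As expected, the results are very similar since the main idea is the same. It is interesting to note the three parts of the graph:

a constant value for log2(n) <= 20
an increasing value for log2(n) between 20 and 22
a constant value for log2(n) >= 22

I guess that CPU cache properties can explain this behaviour.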

Matrix transposition and population count
