Question

Consider the following algorithm from the C++ standard library, std::shuffle, which has the following signature:

template <class RandomIt, class URBG>
void shuffle(RandomIt first, RandomIt last, URBG&& g);

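It reorders the elements in the given range [first, last) so that each possible permutation of those elements has an equal probability of appearance.

I am trying to implement the same algorithm, but working at the bit level: randomly shuffling the bits of the words of the input sequence. Given a sequence of 64-bit words, I am trying to implement:

template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g);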

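Question: how can this be done as efficiently as possible (using compiler intrinsics if necessary)? I am not necessarily looking for a complete implementation, but rather for suggestions and directions for research, because it is really not clear to me whether this can be implemented efficiently at all.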

Answer 1

Clearly the asymptotic speed is O(N), where N is the number of bits. Our goal is to improve the constants involved.

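Disclaimer: the algorithm described here is only a rough sketch. There is a lot to add, and in particular many details need care to make it work correctly. The approximate execution times should nevertheless match the ones stated here.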

Baseline algorithm

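The most obvious one is the textbook approach, which performs N operations. Each operation calls random_generator, taking R milliseconds, and reads the values of two different bits and writes new values for them, taking 4 * A milliseconds in total (A is the time to read or write one bit). Suppose an array lookup takes C milliseconds. The total time of this algorithm is then roughly N * (R + 4 * A + 2 * C) milliseconds. It is also reasonable to assume that random number generation takes much more time than the rest, i.e. R >> A == C.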
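For concreteness, here is a minimal sketch of that textbook (Fisher-Yates) baseline written against the question's bit_shuffle signature. It assumes the words form one array of 64 * (last - first) bits, with bit 0 being the least significant bit of *first; get_bit and set_bit are hypothetical helper names introduced only for this illustration. Each iteration performs one random draw, two bit reads and two bit writes, which is where the N * (R + 4 * A + 2 * C) estimate comes from.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical helper: read bit i of the word array w
// (bit 0 is the least significant bit of w[0]).
inline bool get_bit(const std::uint64_t* w, std::size_t i) {
    return (w[i / 64] >> (i % 64)) & 1u;
}

// Hypothetical helper: write value v into bit i of the word array w.
inline void set_bit(std::uint64_t* w, std::size_t i, bool v) {
    const std::uint64_t mask = std::uint64_t{1} << (i % 64);
    if (v) w[i / 64] |= mask;
    else   w[i / 64] &= ~mask;
}

// Textbook Fisher-Yates over individual bits: one random number per position.
template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g) {
    const std::size_t n = static_cast<std::size_t>(last - first) * 64;
    for (std::size_t i = n; i > 1; --i) {
        // Pick j uniformly in [0, i - 1] and swap bits i - 1 and j.
        std::uniform_int_distribution<std::size_t> dis(0, i - 1);
        const std::size_t j = dis(g);
        const bool a = get_bit(first, i - 1);
        const bool b = get_bit(first, j);
        set_bit(first, i - 1, b);
        set_bit(first, j, a);
    }
}
```

Every bit is touched and every position costs a call into the random engine, which is the cost the proposed algorithm below tries to reduce.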

Proposed algorithm

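Suppose the bits are stored in byte storage, i.e. we will work with blocks of bytes:

constexpr std::size_t field_size = N / 8;  // N: total number of bits, assumed to be a multiple of 8
unsigned char bit_field[field_size];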

First, let's count the number of 1 bits in the bitset. For that we can use a lookup table and iterate over the bitset as a byte array:

// Generate the lookup table; you may modify it with `constexpr`
// to make it run at compile time.
int bitcount_lookup[256];
for (int i = 0; i < 256; ++i) {
  bitcount_lookup[i] = 0;
  for (int b = 0; b < 8; ++b)
    bitcount_lookup[i] += (i >> b) & 1;
}
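As the comment suggests, the same table can be built at compile time. A minimal sketch, assuming C++17 (where the non-const std::array::operator[] is constexpr); make_bitcount_lookup is a hypothetical name:

```cpp
#include <array>

// Build the 256-entry "number of set bits in this byte" table at compile time.
constexpr std::array<int, 256> make_bitcount_lookup() {
    std::array<int, 256> table{};
    for (int i = 0; i < 256; ++i) {
        int count = 0;
        for (int b = 0; b < 8; ++b)
            count += (i >> b) & 1;
        table[i] = count;
    }
    return table;
}

// Evaluated by the compiler; no runtime initialization loop is needed.
constexpr std::array<int, 256> bitcount_lookup = make_bitcount_lookup();

static_assert(bitcount_lookup[0x00] == 0, "no bits set");
static_assert(bitcount_lookup[0xFF] == 8, "all bits set");
static_assert(bitcount_lookup[0xA5] == 4, "10100101 has four set bits");
```

The static_asserts only compile if the table really was computed during compilation, which is what justifies counting this step as 0 milliseconds.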

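We can treat this as preprocessing overhead (it can also be computed at compile time) and say it takes 0 milliseconds. Now counting the number of 1 bits is easy (the following takes (N / 8) * C milliseconds):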

int bitcount = 0;
for (auto *it = bit_field; it != bit_field + field_size; ++it)
  bitcount += bitcount_lookup[*it];
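Since the question mentions compiler intrinsics, note that on 64-bit words the per-byte table can be replaced by a population count. The sketch below assumes the data is held as std::uint64_t words (as in the question's signature) and uses std::bitset<64>::count(), which standard library implementations commonly implement with the compiler's popcount builtin. C++20 also provides std::popcount in <bit>, and GCC/Clang offer __builtin_popcountll. count_set_bits is a hypothetical name.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Count the 1 bits in [first, last), one 64-bit word at a time.
inline std::size_t count_set_bits(const std::uint64_t* first, const std::uint64_t* last) {
    std::size_t total = 0;
    for (const std::uint64_t* it = first; it != last; ++it)
        total += std::bitset<64>(*it).count();
    return total;
}
```

This processes 64 bits per iteration instead of 8, so the counting step drops from N / 8 to N / 64 iterations.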

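Now we randomly generate N / 8 numbers (call the resulting array gencnt[N / 8]), each in the range [0..8], such that they sum to bitcount. This is somewhat tricky and hard to do uniformly (the "correct" algorithm for generating a uniform distribution is quite slow compared with the baseline algorithm). A fairly uniform but fast solution is roughly: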

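Fill the gencnt[N / 8] array with the value v = bitcount / (N / 8).
Randomly choose N / 16 "black" cells; the rest are "white". The algorithm is similar to a random permutation, but applied to only half of the array.
Generate N / 16 random numbers in the range [0..v]; call them tmp[N / 16].
Pair the i-th black cell with the i-th white cell: increase the black cell by tmp[i] and decrease the white cell by tmp[i]. This keeps the total sum at v * (N / 8), which equals bitcount when bitcount is divisible by N / 8.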
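The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the answer's exact method: generate_counts is a hypothetical name; the remainder lost by the integer division bitcount / (N / 8) is handed out as +1 to distinct random cells (the steps above leave it out); tmp is drawn from [0..min(8 - black, white)] instead of [0..v], so that no cell leaves [0..8]; and a full std::shuffle of cell indices is used for the coloring instead of a half-length permutation.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: returns `cells` per-byte bit counts in [0, 8] whose sum
// is exactly `bitcount`. Assumes cells > 0 and bitcount <= 8 * cells.
template <class URBG>
std::vector<int> generate_counts(std::size_t cells, std::size_t bitcount, URBG& g) {
    // Step 1: fill with v = bitcount / cells, then give the remainder
    // (lost by integer division) as +1 to distinct random cells.
    std::vector<int> gencnt(cells, static_cast<int>(bitcount / cells));
    std::vector<std::size_t> order(cells);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), g);
    for (std::size_t i = 0; i < bitcount % cells; ++i)
        ++gencnt[order[i]];

    // Step 2: random coloring; the first half of a fresh random order is
    // "black", the second half is "white".
    std::shuffle(order.begin(), order.end(), g);
    const std::size_t half = cells / 2;

    // Steps 3-4: move tmp bits from the i-th white cell to the i-th black cell.
    // tmp is capped so that both cells stay inside [0, 8].
    for (std::size_t i = 0; i < half; ++i) {
        int& black = gencnt[order[i]];
        int& white = gencnt[order[half + i]];
        std::uniform_int_distribution<int> dis(0, std::min(8 - black, white));
        const int tmp = dis(g);
        black += tmp;
        white -= tmp;
    }
    return gencnt;
}
```

Each pair moves bits from one cell to another, so the sum never changes; only the spread of the values is randomized.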

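After that we have a uniform-ish, random-ish array gencnt[N / 8] whose values are the number of 1 bits in each particular "cell" (byte). All of it is generated in: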

(N / 8) * C   +  (N / 16) * (4 * C)  +  (N / 16) * (R + 2 * C)
^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^
filling step      random coloring              filling

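milliseconds (this estimate assumes a concrete implementation I have in mind). Finally, we can prepare a byte lookup table that lists, for each bit count, the byte values with exactly that many bits set to 1 (it can be built as preprocessing overhead, or even stored as constexpr at compile time, so let's assume it takes 0 milliseconds):

std::vector<std::vector<unsigned char>> random_lookup(9);
for (int c = 0; c <= 8; c++)
  random_lookup[c] = { /* numbers with `c` bits set to `1` */ };

Then we can fill bit_field as follows (taking approximately (N / 8) * (R + 3 * C) milliseconds):

for (std::size_t i = 0; i < field_size; i++) {
  const auto& candidates = random_lookup[gencnt[i]];
  bit_field[i] = candidates[rand() % candidates.size()];
}

Summing everything up, the total execution time is:

(N / 8) * C + (N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C) + (N / 8) * (R + 3 * C) = N * (C + 3 * R / 16)

milliseconds, compared with N * (R + 4 * A + 2 * C) for the baseline.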
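A sketch of these last two steps, assuming gencnt comes from the previous step and using the URBG that the question's signature passes in instead of rand(). build_random_lookup and fill_from_counts are hypothetical names. The table has nine rows because a byte can hold anywhere from 0 to 8 set bits.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// random_lookup[c] lists every byte value with exactly c bits set (c = 0..8).
inline std::vector<std::vector<unsigned char>> build_random_lookup() {
    std::vector<std::vector<unsigned char>> random_lookup(9);
    for (int value = 0; value < 256; ++value) {
        int bits = 0;
        for (int b = 0; b < 8; ++b)
            bits += (value >> b) & 1;
        random_lookup[bits].push_back(static_cast<unsigned char>(value));
    }
    return random_lookup;
}

// Replace each byte with a uniformly chosen byte that has gencnt[i] set bits.
template <class URBG>
void fill_from_counts(unsigned char* bit_field, const std::vector<int>& gencnt,
                      const std::vector<std::vector<unsigned char>>& random_lookup,
                      URBG& g) {
    for (std::size_t i = 0; i < gencnt.size(); ++i) {
        const auto& candidates = random_lookup[gencnt[i]];
        std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);
        bit_field[i] = candidates[pick(g)];
    }
}
```

Using std::uniform_int_distribution avoids the modulo bias of rand() % size; the table is built once and can be reused for every call.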
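Although it is not truly uniformly random, it does spread the bits out evenly and randomly, and it is fairly fast, so hopefully it does the job for your use case.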

Answer 2

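Observe that actually shuffling the bits (which involves swapping via Fisher-Yates) is not necessary to produce an exact equivalent, i.e. a random distribution of those bits.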

#include <iostream>
#include <vector>
#include <random>

// shuffle a vector of bools. This requires only counting the number of trues in the vector
// followed by clearing the vector and inserting bool trues to produce an equivalent to
// a bit shuffle. This is cache line friendly and doesn't require swapping.
std::vector<bool> DistributeBitsRandomly(std::vector<bool> bvector)
{
    // mersenne_twister_engine, seeded once from std::random_device
    static std::mt19937 gen(std::random_device{}());

    // count the number of set bits and clear bvector
    int set_bits_count = 0;
    for (int i=0; i < bvector.size(); i++)
        if (bvector[i])
        {
            set_bits_count++;
            bvector[i] = 0;
        }

    // set a bit if a random value in the range [0, bvector.size()-bit_loc-1] is
    // less than the number of bits remaining to be placed. This produces exactly the same
    // distribution as a random shuffle but only does an insertion of a 1 bit rather than
    // a swap. It requires counting the number of 1 bits. There are efficient ways
    // of doing this. See https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
    for (int bit_loc = 0; set_bits_count; bit_loc++)
    {
        std::uniform_int_distribution<int> dis(0, bvector.size()-bit_loc-1);
        auto x = dis(gen);
        if (x < set_bits_count)
        {
            bvector[bit_loc] = true;
            set_bits_count--;
        }
    }
    return bvector;
}

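This is equivalent to shuffling the bools in a vector<bool>. It is cache-line friendly and involves no swaps. It is presented, as the OP asked, as an executable but simple algorithm. Much can be done to optimize it, such as speeding up the bit counting and the clearing of the array.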
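The same idea can be applied directly to the question's std::uint64_t range, without copying into a std::vector<bool>. The following is a sketch under that assumption, with the hypothetical name bit_shuffle_by_selection: it counts and clears the set bits word by word, then walks the positions and sets each one with probability (ones still to place) / (positions still left), the same sequential selection sampling used above. It still draws one random number per visited position, so the number of random draws stays O(N).

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <random>

template <class URBG>
void bit_shuffle_by_selection(std::uint64_t* first, std::uint64_t* last, URBG&& g) {
    const std::size_t words = static_cast<std::size_t>(last - first);
    const std::size_t n = words * 64;

    // Count and clear: only the number of 1 bits matters for the result.
    std::size_t ones = 0;
    for (std::size_t w = 0; w < words; ++w) {
        ones += std::bitset<64>(first[w]).count();
        first[w] = 0;
    }

    // Set position pos with probability ones / (n - pos). When the remaining
    // positions equal the remaining ones, every draw succeeds, so the loop
    // never runs past the last bit.
    for (std::size_t pos = 0; ones > 0; ++pos) {
        std::uniform_int_distribution<std::size_t> dis(0, n - pos - 1);
        if (dis(g) < ones) {
            first[pos / 64] |= std::uint64_t{1} << (pos % 64);
            --ones;
        }
    }
}
```

Working on whole words makes the count-and-clear phase a handful of popcounts and stores instead of N individual vector<bool> accesses.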

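The following sets 4 of 10 bits, calls the "shuffle" routine 100,000 times, and prints how many times a 1 bit appeared in each of the 10 positions. Each position should get about 40,000.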

int main()
{
    std::vector<bool> initial{ 1,1,1,1,0,0,0,0,0,0 };
    std::vector<int> totals(initial.size());
    for (int i = 0; i < 100000; i++)
    {
        auto a_distribution = DistributeBitsRandomly(initial);
        for (int ii = 0; ii < totals.size(); ii++)
            if (a_distribution[ii])
                totals[ii]++;
    }
    for (auto cnt : totals)
        std::cout << cnt << "\n";
}

Possible output:
