Question

I have implemented an isPermutation function that, given two strings, returns true if they are permutations of each other and false otherwise.

One version uses the C++ sort algorithm twice, while the other uses an int array to keep track of character counts.

I ran the code several times, and the sorting method was faster every time. Is my array implementation wrong?

Here is the output:

1
0
1
Time: 0.088 ms
1
0
1
Time: 0.014 ms

The code:

#include <iostream> // cout
#include <string>   // string
#include <cstring> // memset
#include <algorithm> // sort
#include <ctime> // clock_t

using namespace std;

#define MAX_CHAR 255


void PrintTimeDiff(clock_t start, clock_t end) {
    std::cout << "Time: " << (end - start) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
}


// using array to keep a count of used chars
bool isPermutation(string inputa, string inputb) {
    int allChars[MAX_CHAR];
    memset(allChars, 0, sizeof(int) * MAX_CHAR);

    for(int i=0; i < inputa.size(); i++) {
        allChars[(int)inputa[i]]++;
    }

    for (int i=0; i < inputb.size(); i++) {
        allChars[(int)inputb[i]]--;
        if(allChars[(int)inputb[i]] < 0) {
            return false;
        }   
    }

    return true;
}


// using sorting and comparing
bool isPermutation_sort(string inputa, string inputb) {

    std::sort(inputa.begin(), inputa.end());
    std::sort(inputb.begin(), inputb.end());

    if(inputa == inputb) return true;
    return false;
}



int main(int argc, char* argv[]) {

    clock_t  start = clock();
    cout << isPermutation("god", "dog") << endl;
    cout << isPermutation("thisisaratherlongerinput","thisisarathershorterinput") << endl;
    cout << isPermutation("armen", "ramen") << endl;
    PrintTimeDiff(start, clock());


    start = clock();
    cout << isPermutation_sort("god", "dog") << endl;
    cout << isPermutation_sort("thisisaratherlongerinput","thisisarathershorterinput") << endl;
    cout << isPermutation_sort("armen", "ramen") << endl;
    PrintTimeDiff(start, clock());

    return 0;
}

Answer 1

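To benchmark this, you have to eliminate all the noise. The simplest way is to wrap the calls in a loop that repeats each one about 1000 times, and to print the value only once every 10 iterations. That way each method runs with a similar cache profile. Discard bogus values (for example, blowouts caused by the OS context-switching your process).

Doing this, the array method comes out slightly faster. An excerpt: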

method 1 array Time: 0.768 us
method 2 sort Time: 0.840333 us

method 1 array Time: 0.621333 us
method 2 sort Time: 0.774 us

method 1 array Time: 0.769 us
method 2 sort Time: 0.856333 us

method 1 array Time: 0.766 us
method 2 sort Time: 0.850333 us

method 1 array Time: 0.802667 us
method 2 sort Time: 0.89 us

method 1 array Time: 0.778 us
method 2 sort Time: 0.841333 us
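
The batching idea above can be sketched as follows. This is a minimal illustration, not the code that produced the numbers in this answer: the helper name medianMicrosPerCall, the batch sizes, and the use of std::chrono::steady_clock and of the median as the outlier filter are all assumptions made for the example.

```cpp
#include <algorithm>  // std::sort
#include <chrono>     // std::chrono::steady_clock
#include <cstddef>    // std::size_t
#include <vector>

// Hypothetical helper: runs `batches` batches of `reps` calls to fn() and
// returns the median time per call in microseconds.
template <typename F>
double medianMicrosPerCall(F&& fn, std::size_t batches = 101, std::size_t reps = 1000) {
    using clock = std::chrono::steady_clock;
    std::vector<double> samples;
    samples.reserve(batches);
    volatile unsigned sink = 0;  // makes every result observable

    for (std::size_t b = 0; b < batches; ++b) {
        const auto start = clock::now();
        for (std::size_t r = 0; r < reps; ++r)
            sink = sink + static_cast<unsigned>(fn());
        const auto stop = clock::now();
        const std::chrono::duration<double, std::micro> elapsed = stop - start;
        samples.push_back(elapsed.count() / static_cast<double>(reps));
    }

    if (samples.empty()) return 0.0;
    // Taking the median sidelines batches inflated by a context switch.
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

It could be called as medianMicrosPerCall([] { return isPermutation("god", "dog"); }). With constant arguments an optimizing compiler may still hoist part of the work out of the loop, so varying the inputs between calls is safer.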

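I used rdtsc, which works better for me on this system. 3000 cycles per microsecond is close enough for this purpose, but if you care about the accuracy of the readings, replace that figure with a more accurate one.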

#include <stdint.h>  // uint64_t
#include <iostream>  // std::cout

#if defined(__x86_64__)
static uint64_t rdtsc()
{
    uint64_t    hi, lo;

    __asm__ __volatile__ (
                            "xor %%eax, %%eax\n"
                            "cpuid\n"
                            "rdtsc\n"
                            : "=a"(lo), "=d"(hi)
                            :: "ebx", "ecx");

    return (hi << 32)|lo;
}
#else
#error wrong architecture - implement me
#endif

void PrintTimeDiff(uint64_t start, uint64_t end) {
    std::cout << "Time: " << (end - start)/double(3000)  << " us" << std::endl;
}

Answer 2

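You cannot measure performance differences between implementations when the calls are mixed with calls to std::cout. isPermutation and isPermutation_sort are some orders of magnitude faster than a call to std::cout (and in any case, prefer \n over std::endl).
You must enable compiler optimizations when testing. Once you do, the compiler will apply loop-invariant code motion, and you will probably get the same result for both implementations.

A more effective way to test is: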

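#include <cstddef>   // std::size_t
#include <ctime>     // std::clock
#include <iostream>  // std::cout
#include <random>    // std::mt19937, std::uniform_int_distribution
#include <string>
#include <vector>

int main()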
{
  const std::vector<std::string> bag
  {
    "god", "dog", "thisisaratherlongerinput", "thisisarathershorterinput",
    "armen", "ramen"
  };

  static std::mt19937 engine;
  std::uniform_int_distribution<std::size_t> rand(0, bag.size() - 1);

  const unsigned stop = 1000000;

  unsigned counter = 0;
  std::clock_t start = std::clock();
  for (unsigned i(0); i < stop; ++i)
    counter += isPermutation(bag[rand(engine)], bag[rand(engine)]);

  std::cout << counter << '\n';
  PrintTimeDiff(start, clock());

  counter = 0;
  start = std::clock();
  for (unsigned i(0); i < stop; ++i)
    counter += isPermutation_sort(bag[rand(engine)], bag[rand(engine)]);

  std::cout << counter << '\n';
  PrintTimeDiff(start, clock());

  return 0;
}

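I get 2.4s for isPermutation_sort versus 2s for isPermutation (somewhat similar to Hal's results). The same holds with both g++ and clang++.

Printing the value of counter has a double benefit:

it triggers the as-if rule (the compiler cannot remove the for loops);
it allows a first check of your implementations (the two values cannot be too far apart).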
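
Because the two loops draw different random pairs, the counters can only be compared roughly. A stricter check, sketched below, runs both implementations over every pair in the bag and counts the disagreements. The helper countDisagreements is hypothetical (it is not part of the answer's benchmark) and takes the two implementations as callables.

```cpp
#include <cstddef>  // std::size_t
#include <string>
#include <vector>

// Hypothetical helper: returns how many ordered pairs (a, b) drawn from
// `bag` make the two implementations return different answers.
template <typename F, typename G>
std::size_t countDisagreements(const std::vector<std::string>& bag, F implA, G implB) {
    std::size_t disagreements = 0;
    for (const auto& a : bag)
        for (const auto& b : bag)
            if (static_cast<bool>(implA(a, b)) != static_cast<bool>(implB(a, b)))
                ++disagreements;
    return disagreements;
}
```

Calling countDisagreements(bag, isPermutation, isPermutation_sort) before timing anything should return 0. A non-zero result means at least one of the two functions is wrong on these inputs.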

You should change a few things in your implementation of isPermutation:

Pass the arguments as const references
```
bool isPermutation(const std::string &inputa, const std::string &inputb)
```
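This change alone brings the time down to 0.8s (of course, you cannot do the same with isPermutation_sort, which needs its own copies to sort).
You can use std::array and std::fill instead of memset (this is C++ :-)
Avoid premature pessimization and prefer preincrement. Postincrement is only worth using if you need the original value.
Do not mix signed and unsigned values in the for loops (inputa.size() and i). i should be declared as std::size_t.
Better still, use a range-based for loop.

Something like: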

bool isPermutation(const std::string &inputa, const std::string &inputb)
{
  std::array<int, 256> allChars;  // one slot per unsigned char value; MAX_CHAR (255) is one too few
  allChars.fill(0);

  for (auto c : inputa)
    ++allChars[(unsigned char)c];

  for (auto c : inputb)
  {
    --allChars[(unsigned char)c];
    if (allChars[(unsigned char)c] < 0)
      return false;
  }

  return true;
}

In any case, both isPermutation and isPermutation_sort should start with this preliminary check:

  if (inputa.length() != inputb.length())
    return false;
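
The check is also needed for correctness, not only for speed: the counting loop only notices characters that the second string has in excess, so a shorter second string whose characters all occur in the first one passes. A minimal sketch of the difference, using the hypothetical names countsOnly and withLengthCheck:

```cpp
#include <array>
#include <string>

// Counting loop only, without the preliminary length check.
bool countsOnly(const std::string& a, const std::string& b) {
    std::array<int, 256> counts{};  // value-initialized to all zeros
    for (unsigned char c : a) ++counts[c];
    for (unsigned char c : b)
        if (--counts[c] < 0) return false;  // b has more of c than a does
    return true;  // also reached when b merely uses a subset of a's characters
}

// Same loop, guarded by the length comparison.
bool withLengthCheck(const std::string& a, const std::string& b) {
    if (a.length() != b.length()) return false;
    return countsOnly(a, b);
}
```

Here countsOnly("abc", "ab") returns true although the strings are not permutations of each other, while withLengthCheck("abc", "ab") returns false.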

Now it is 0.55s for isPermutation versus 1.1s for isPermutation_sort.

Last but not least, consider std::is_permutation:

for (unsigned i(0); i < stop; ++i)
{
  const std::string &s1(bag[rand(engine)]), &s2(bag[rand(engine)]);

  counter += std::is_permutation(s1.begin(), s1.end(), s2.begin());
}

(0.6s)

EDIT

As observed in BeyelerStudios' comment, the Mersenne Twister is overkill in this situation.

You can change the engine to a simpler one.

static std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647> engine;

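This lowers the times further. Fortunately, the relative speeds stay the same.

To be sure, I also checked a non-random access scheme and obtained the same relative results.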
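
As a side note, the linear congruential engine written out above has the same parameters as the standard typedef std::minstd_rand, so the shorter name can be used instead. A small sketch; the helper randomIndex is hypothetical and assumes size is greater than zero:

```cpp
#include <cstddef>      // std::size_t
#include <cstdint>      // std::uint_fast32_t
#include <random>       // std::minstd_rand, std::uniform_int_distribution
#include <type_traits>  // std::is_same

// std::minstd_rand is defined by the standard with exactly these parameters.
static_assert(
    std::is_same<std::minstd_rand,
                 std::linear_congruential_engine<std::uint_fast32_t, 48271, 0, 2147483647>>::value,
    "std::minstd_rand matches the engine used in the benchmark");

// Hypothetical helper: draws an index in [0, size - 1]; size must be > 0.
std::size_t randomIndex(std::minstd_rand& engine, std::size_t size) {
    std::uniform_int_distribution<std::size_t> dist(0, size - 1);
    return dist(engine);
}
```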

Answer 3

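Your idea amounts to running a Counting Sort on both strings, but comparing the count arrays instead of comparing the strings after writing out the sorted results.

It works well because a byte can only take one of 255 non-zero values. Zeroing 256B of memory, or even 4 * 256B, is very cheap, so it works well even for fairly short strings where most of the count array is never touched.

It should be fairly good for very long strings too, at least in some cases. It depends heavily on a good, heavily pipelined L1 cache, because the scattered increments to the count array produce scattered read-modify-writes. Repeated occurrences of the same character create a dependency chain with a store-load round trip in it. That is a big weak spot (a glass jaw) for this algorithm, even on CPUs where many loads and stores can be in flight at once (so that their latencies overlap). Modern x86 CPUs should run it pretty well, since they can sustain a load plus a store every clock cycle.

The initial count of inputa compiles to a very tight loop: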

.L15:
        movsx   rdx, BYTE PTR [rax]
        add     rax, 1
        add     DWORD PTR [rsp-120+rdx*4], 1
        cmp     rax, rcx
        jne     .L15

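This brings us to the first major bug in your code: char can be signed or unsigned. In the x86-64 ABI, char is signed, so allChars[(int)inputa[i]]++; sign-extends it for use as an array index (movsx instead of movzx). Your code will write outside the array bounds for non-ASCII characters that have their high bit set. So you should write allChars[(unsigned char)inputa[i]]++;. Note that casting to (unsigned) does not give the result we want (see the comments).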
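
The effect of each cast can be checked directly. The sketch below uses three hypothetical helper functions that return the index each cast would produce. It assumes 8-bit bytes, and the values in the comments hold on platforms where char is signed and unsigned is 32 bits wide, such as x86-64 Linux.

```cpp
// Index produced for a byte under each of the casts discussed above.
// For the byte 0xE9 on a platform with signed char and 32-bit unsigned:
long long indexViaInt(char c)          { return static_cast<int>(c); }            // -23
long long indexViaUnsignedChar(char c) { return static_cast<unsigned char>(c); }  // 233
long long indexViaUnsigned(char c)     { return static_cast<unsigned>(c); }       // 4294967273
```

Only the unsigned char cast maps every byte into the range 0 to 255, which also means the count array needs 256 entries rather than 255.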

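Note that clang makes much worse code (v3.7.1 and v3.8, both with -O3), calling std::basic_string<...>::_M_leak_hard() inside the inner loop. (Leak as in leaking a reference, I think.) @manlio's version does not have this problem, so I guess the for (auto c : inputa) syntax helps clang figure out what is going on.

Also, taking std::string when your callers have a char[] forces them to construct a std::string. That is a bit silly, but being able to compare string lengths is helpful.

GNU libstdc++'s `std::is_permutation` uses a very different strategy:

First, it skips any common prefix that is identical, without permutation, in both strings.

Then, for each element in inputa:

Count the occurrences of that element in inputb. Check that it matches the count in inputa.

There are a couple of optimizations:

Only compare counts the first time an element is seen: find duplicates by searching from the beginning of inputa, and if the match position is not the current position, this element has already been checked.
Check that the match count in inputb is != 0 before counting matches in the rest of inputa.

This does not need any temporary storage, so it also works when the elements are large (e.g. an array of int64_t, or an array of structs).

If there is a mismatch, this is likely to find it early, before doing much work. There are probably some input cases where the counting version takes less time, but for most inputs the library algorithm is probably best.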
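
The strategy described above can be sketched for strings as follows. This is a simplified rendition written for illustration, not the libstdc++ source; the function name isPermutationQuadratic is hypothetical, and the explicit length check is an addition that keeps the three-iterator std::mismatch call within bounds.

```cpp
#include <algorithm>  // std::count, std::find, std::mismatch
#include <string>

// Simplified sketch of the quadratic, storage-free strategy.
bool isPermutationQuadratic(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;

    // Skip the common prefix that is already identical in both strings.
    const auto firstDiff = std::mismatch(a.begin(), a.end(), b.begin());
    const auto aRest = firstDiff.first;
    const auto bRest = firstDiff.second;

    for (auto it = aRest; it != a.end(); ++it) {
        // If this value already occurred earlier in the remainder of a,
        // its counts were compared then, so skip it.
        if (std::find(aRest, it, *it) != it) continue;

        // The value must occur in the remainder of b at all...
        const auto inB = std::count(bRest, b.end(), *it);
        if (inB == 0) return false;

        // ...and exactly as often as in the remainder of a.
        if (std::count(it, a.end(), *it) != inB) return false;
    }
    return true;
}
```

For example, comparing "aab" with "abb" skips the shared leading 'a', then fails on the next 'a' because it no longer occurs in the remainder of the second string.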

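std::is_permutation uses std::count, which should be easy to implement very well with SSE / AVX vectors. Unfortunately, both gcc and clang auto-vectorize it in a really stupid way. They unpack the bytes to 64-bit integers before accumulating them in vector elements, to avoid overflow. So the loop spends most of its instructions shuffling data around, and is probably slower than a scalar implementation (which you can get by compiling with -O2, or with -O3 -fno-tree-vectorize).

It could and should widen only once every few iterations, so the inner loop of count could be something like pcmpeqb / psubb, with a psadbw every 255 iterations. Or pcmpeqb / pmovmskb / popcnt / add, but that is slower.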
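
A hand-written version of the pcmpeqb / psubb / psadbw idea might look like the sketch below. It is an illustration under stated assumptions, not code from any library: the name countBytes is hypothetical, SSE2 is detected through the __SSE2__ macro that gcc and clang define, and on other targets the function falls back to the plain scalar loop.

```cpp
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uint64_t
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: counts the bytes equal to `needle` in data[0, n).
std::size_t countBytes(const unsigned char* data, std::size_t n, unsigned char needle) {
    std::size_t total = 0;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i pattern = _mm_set1_epi8(static_cast<char>(needle));
    __m128i wide = _mm_setzero_si128();  // two 64-bit partial totals
    while (n - i >= 16) {
        __m128i narrow = _mm_setzero_si128();  // sixteen 8-bit counters
        // At most 255 iterations, so no 8-bit counter can overflow.
        for (int k = 0; k < 255 && n - i >= 16; ++k, i += 16) {
            const __m128i chunk =
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
            const __m128i eq = _mm_cmpeq_epi8(chunk, pattern);  // pcmpeqb: 0xFF where equal
            narrow = _mm_sub_epi8(narrow, eq);                  // psubb: subtracting -1 adds 1
        }
        // psadbw against zero sums each group of 8 counters into a 64-bit lane.
        wide = _mm_add_epi64(wide, _mm_sad_epu8(narrow, _mm_setzero_si128()));
    }
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), wide);
    total += static_cast<std::size_t>(lanes[0] + lanes[1]);
#endif
    for (; i < n; ++i)  // scalar tail, or the whole input without SSE2
        total += (data[i] == needle);
    return total;
}
```

Whether this beats the compiler's output depends on the input size and the CPU, so it should be measured rather than assumed.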

Template specializations in the library could help std::count a lot for 8, 16, and 32-bit types whose equality can be checked with bitwise equality (integer ==).

Running speed of permutation functions using different methods gives unexpected results
