Question

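I'm writing a C++ algorithm that takes two strings and returns true if you can get from string a to string b by changing a single character to another character. The two strings must be the same size and differ in exactly one position. I also need access to the index that changed and to the character of strA that was changed. I found an algorithm that works, but it iterates over every pair of words and runs far too slowly on any large input.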

bool canChange(std::string const& strA, std::string const& strB, char& letter)
{
    int dif = 0;
    int position = 0;
    int currentSize = (int)strA.size();
    if(currentSize != (int)strB.size())
    {
        return false;
    }
    for(int i = 0; i < currentSize; ++i)
    {
        if(strA[i] != strB[i])
        {
            dif++;
            position = i;
            if(dif > 1)
            {
                return false;
            }
        }
    }
    if(dif == 1)
    {
        letter = strA[position];
        return true;
    }
    else return false;
}

Any suggestions for optimizing this?

Answer 1

Unless you can accept occasional wrong results, it is hard to get away from checking every character in the strings.

I suggest using the facilities of the standard library rather than trying to count the mismatches yourself. For example:

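#include <string>
#include <algorithm>
#include <cstddef>    // std::size_t
#include <iterator>   // std::distance, std::advance
#include <utility>    // std::pair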

bool canChange(std::string const& strA, std::string const& strB, char& letter, std::size_t &index)
{
     bool single_mismatch = false;
     if (strA.size() == strB.size())
     {
         typedef std::string::const_iterator ci; 
         typedef std::pair<ci, ci> mismatch_result;

         ci begA(strA.begin()), endA(strA.end());

         mismatch_result result = std::mismatch(begA, endA, strB.begin());

         if (result.first != endA)    //  found a mismatch
         {
             letter = *(result.first);
             index = std::distance(begA, result.first);

             // now look for a second mismatch

             std::advance(result.first, 1);
             std::advance(result.second, 1);

             single_mismatch = (std::mismatch(result.first, endA, result.second).first == endA);
         }
    }
    return single_mismatch;
}

This works with all versions of C++. It can be simplified a little in C++11.
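As one possible reading of the C++11 simplification mentioned above, here is a minimal sketch. The name canChangeCpp11 is hypothetical, and replacing the second std::mismatch call with std::equal is my own choice, not something the answer prescribes. It keeps the same contract: letter and index are written as soon as a first mismatch is found.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical name; same parameters and contract as canChange above.
bool canChangeCpp11(std::string const& strA, std::string const& strB,
                    char& letter, std::size_t& index)
{
    if (strA.size() != strB.size())
        return false;

    auto const endA = strA.end();

    // Locate the first mismatch, if there is one.
    auto const first = std::mismatch(strA.begin(), endA, strB.begin());
    if (first.first == endA)
        return false;                       // the strings are identical

    letter = *first.first;
    index  = static_cast<std::size_t>(std::distance(strA.begin(), first.first));

    // Exactly one mismatch means everything after it is identical.
    return std::equal(std::next(first.first), endA, std::next(first.second));
}
```

With strA = "cat" and strB = "cot", this returns true and sets letter to 'a' and index to 1.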

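If the function above returns true, exactly one mismatch was found.

If it returns false, either the strings have different sizes or the number of mismatches is not equal to 1 (the strings are equal, or there is more than one mismatch).

If the strings have different lengths or are exactly equal, letter and index are left unchanged; otherwise they identify the first mismatch (the value of the character in strA, and its index).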

Answer 2

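If you want to optimize for strings that are mostly identical, you can use x86 SSE/AVX vector instructions. Your basic idea looks fine: break out as soon as you detect a second difference.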

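To find and count character differences, a sequence like PCMPEQB / PMOVMSKB / test-and-branch is probably good. (Use C/C++ intrinsic functions to get those vector instructions.) When the vector loop detects a non-zero difference in the current block, POPCNT the bitmask to see whether you have just found the first difference, or whether there are two differences in the same block.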

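I threw together an untested and not fully fleshed-out AVX2 version of what I'm describing. This code assumes the string lengths are a multiple of 32. Stopping early and handling the last chunk with a cleanup epilogue is left as an exercise for the reader.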

#include <immintrin.h>
#include <cstddef>   // size_t
#include <cstdint>   // uint32_t
#include <string>

// not tested, and doesn't avoid reading past the end of the string.
// TODO: epilogue to handle the last up-to-31 left-over bytes separately.
bool canChange_avx2_bmi(std::string const& strA, std::string const& strB, char& letter) {
    size_t size = strA.size();
    if (size != strB.size())
        return false;

    int diffs = 0;
    size_t diffpos = 0;
    size_t pos = 0;
    do {
        uint32_t diffmask = 0;
        while( pos < size ) {
            __m256i vecA  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(& strA[pos]));
            __m256i vecB  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(& strB[pos]));
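            __m256i veq   = _mm256_cmpeq_epi8(vecA, vecB);   // 0xFF in every byte that is EQUAL
            // Invert so that set bits mark DIFFERING bytes. The loop test then amounts to
            // comparing the raw mask against all-ones rather than testing it for zero.
            diffmask = ~static_cast<uint32_t>(_mm256_movemask_epi8(veq));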
            pos += 32;
            if (diffmask) break;  // gcc makes worse code if you include && !diffmask in the while condition, instead of this break
        }
        if (diffmask) {
            diffpos = (pos - 32) + _tzcnt_u32(diffmask);  // pos has already been advanced past this block.  tzcnt gives the position of the lowest set bit.  Could safely use BSF rather than TZCNT here, since we only run when diffmask is non-zero.
            diffs += _mm_popcnt_u32(diffmask);
        }
    } while(pos < size && diffs <= 1);

    if (diffs == 1) {
        letter = strA[diffpos];
        return true;
    }
    return false;
}

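The ugly break, instead of putting that test in the while condition, apparently helps gcc generate better code. The do{}while() also matches up with how I expect the asm to come out. I didn't try using a for or while loop to see what gcc would do with it.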

The inner loop is really tight:

.L14:
        cmp     rcx, r8
        jnb     .L10      #  the while(pos<size) condition
.L6: # entry point for first iteration, because gcc duplicates the pos<size test ahead of the loop

        vmovdqu ymm0, YMMWORD PTR [r9+rcx]        # tmp118,* pos
        vpcmpeqb        ymm0, ymm0, YMMWORD PTR [r10+rcx]       # tmp123, tmp118,* pos
        add     rcx, 32   # pos,
        vpmovmskb       eax, ymm0     # tmp121, tmp123
        test    eax, eax        # tmp121
        je      .L14        #,

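In theory, this should run at one iteration per 2 clocks (Intel Haswell). There are 7 fused-domain uops in the loop. (It would be 6, but 2-register addressing modes apparently can't micro-fuse on SnB-family CPUs.) Since two of the uops are loads rather than ALU operations, this throughput might be achievable on SnB/IvB as well.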

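This should be very good for flying over regions where the two strings are identical. The overhead of correctly handling arbitrary string lengths could make this slower than a simple scalar function if the strings are short and/or have multiple differences early on.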
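To illustrate the "block loop plus cleanup epilogue" structure without depending on AVX2, here is a portable sketch that swaps in plain 8-byte word comparisons for the vector compare. It is not the AVX2 code above and will not match its throughput. The names canChangeBlocks and scanBytes are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: scan [from, to) byte by byte, stopping once a second
// difference has been seen. Only the first difference's position is recorded.
static void scanBytes(char const* a, char const* b,
                      std::size_t from, std::size_t to,
                      int& diffs, std::size_t& diffpos)
{
    for (std::size_t i = from; i < to && diffs <= 1; ++i) {
        if (a[i] != b[i]) {
            if (diffs == 0) diffpos = i;
            ++diffs;
        }
    }
}

bool canChangeBlocks(std::string const& strA, std::string const& strB,
                     char& letter, std::size_t& index)
{
    std::size_t const size = strA.size();
    if (size != strB.size())
        return false;

    char const* a = strA.data();
    char const* b = strB.data();
    std::size_t const kBlock = sizeof(std::uint64_t);
    int diffs = 0;
    std::size_t diffpos = 0;
    std::size_t pos = 0;

    // Main loop: whole blocks only, so it never reads past the end.
    for (; pos + kBlock <= size && diffs <= 1; pos += kBlock) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + pos, kBlock);   // memcpy sidesteps alignment issues
        std::memcpy(&wb, b + pos, kBlock);
        if (wa != wb)                        // some byte in this block differs
            scanBytes(a, b, pos, pos + kBlock, diffs, diffpos);
    }

    // Epilogue: the leftover bytes that do not fill a whole block.
    scanBytes(a, b, pos, size, diffs, diffpos);

    if (diffs != 1)
        return false;
    letter = a[diffpos];
    index  = diffpos;
    return true;
}
```

The same shape carries over to the vector version: run the wide loop while at least 32 bytes remain, then finish the tail with a scalar scan.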

Answer 3

How big is your input?

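I would think strA[i] and strB[i] carry function-call overhead unless they are inlined. So make sure you do your performance testing with inlining enabled and a release build. Otherwise, try getting the bytes as a char* with strA.c_str().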

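If all of that fails and it is still not fast enough, try breaking the strings into chunks and using memcmp or strncmp on the chunks. If a chunk shows no difference, move on to the next chunk until you reach the end or find a difference. If a difference is found, do your trivial byte-by-byte comparison until you locate it. I suggest this route because memcmp is often faster than a trivial hand-written loop, since it can make use of the processor's SSE extensions and so forth to do very fast comparisons.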
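The chunked approach described above could look roughly like the following sketch. The function name canChangeChunked and the chunk size of 64 bytes are assumptions chosen for illustration; the best chunk size would need to be measured.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical name; returns true when the strings differ in exactly one byte.
bool canChangeChunked(std::string const& strA, std::string const& strB,
                      char& letter, std::size_t& index)
{
    std::size_t const size = strA.size();
    if (size != strB.size())
        return false;

    std::size_t const kChunk = 64;   // assumed chunk size; tune by measurement
    char const* a = strA.data();
    char const* b = strB.data();
    int diffs = 0;

    for (std::size_t pos = 0; pos < size; pos += kChunk) {
        std::size_t const len = std::min(kChunk, size - pos);
        if (std::memcmp(a + pos, b + pos, len) == 0)
            continue;                        // chunk is identical, skip it

        // The chunk differs: compare byte by byte inside this chunk only.
        for (std::size_t i = pos; i < pos + len; ++i) {
            if (a[i] != b[i]) {
                if (++diffs > 1)
                    return false;            // second difference, stop early
                index  = i;
                letter = a[i];
            }
        }
    }
    return diffs == 1;
}
```

Because memcmp only reports whether a chunk differs, the byte-by-byte pass is still needed to locate the position and to count a possible second difference within the same chunk.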

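Also, there is a problem with your code. You are assuming strA is longer than strB, and you only check the length of A for the array accessors.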

Fastest way to determine whether two strings differ by a single character
