Question

I profiled one of my programs and found that the real hot spot was levenshtein_distance, which is called recursively. I decided to try to optimize it.

lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 )
{
    const size_t len1 = s1.size(), len2 = s2.size();
    std::vector<unsigned int> col( len2+1 ), prevCol( len2+1 );

    const size_t prevColSize = prevCol.size();
    for( unsigned int i = 0; i < prevColSize; i++ )
        prevCol[i] = i;

    for( unsigned int i = 0, j; i < len1; ++i )
    {
        col[0] = i+1;
        const char s1i = s1[i];
        for( j = 0; j < len2; ++j )
        {
            const auto minPrev = 1 + std::min( col[j], prevCol[1 + j] );
            col[j+1] = std::min( minPrev, prevCol[j] + ( static_cast<unsigned int>( s1i != s2[j] ) ) );
        }
        col.swap( prevCol );
    }
    return prevCol[len2];
}

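TL;DR: I changed std::vector → std::array

War story: after running vtune, I found that the line updating col[j+1] was the one slowing everything down (90% of the time was spent on it). I thought: OK, maybe it's an aliasing problem. Maybe the compiler cannot determine that the character arrays inside the string objects are unaliased, since they are hidden behind the string interface, and it spends 90% of its time checking that no other part of the program has modified them.

So I replaced the dynamically allocated buffers with static arrays, because there is no dynamic memory there, and the next step would have been to use restrict to help the compiler. In the meantime, I decided to check whether I had gained any performance by doing so.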

lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 )
{
    const size_t len1 = s1.size(), len2 = s2.size();
    static constexpr unsigned MAX_STRING_SIZE = 512;
    assert(len1 < MAX_STRING_SIZE && len2 < MAX_STRING_SIZE);
    static std::array<unsigned int, MAX_STRING_SIZE> col, prevCol;

    for( unsigned int i = 0; i < len2+1; ++i )
        prevCol[i] = i;

    // the rest is unchanged
}

TL;DR: now it runs slower.

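What happened is that I lost performance. A lot. Instead of running in about 6 seconds, my sample program now runs in 44 seconds. Profiling again with vtune shows one function being called over and over: std::swap (for you gcc folks, this is in bits/move.h), which is in turn called by std::swap_ranges (bits/stl_algobase.h).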

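I suppose std::min is implemented using quicksort, which would explain why there is swapping going on, but I don't understand why the swapping, in this case, takes so much time.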

EDIT: compiler options: I am using gcc with the options "-O2 -g -DNDEBUG" and a bunch of warning specifiers.

Answer 1

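As an experiment, I ran a largely unmodified version of the original code with a pair of short strings, and got timings of about 36s for the array version and about 8s for the vector version.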

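Your version seems to depend heavily on the choice of MAX_STRING_SIZE. When I used 50 instead of 512 (which was just large enough for my strings), the time for the array version dropped to about 16s.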

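I then performed this manual transformation of the main loop to get rid of the explicit swap. This further reduced the time for the array version to 11s and, more interestingly, made the timing of the array version independent of the choice of MAX_STRING_SIZE. With the value set back to 512, the array version still took 11s.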

This is good evidence that the explicit swap of the arrays is where most of the performance problem in your version lies.

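There was still a significant difference between the array and vector versions, with the array version taking about 40% longer. I have not had a chance to investigate exactly why.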

for( unsigned int i = 0, j; i < len1; ++i )
{
    {
        col[0] = i+1;
        const char s1i = s1[i];
        for( j = 0; j < len2; ++j )
        {
            const auto minPrev = 1 + std::min( col[j], prevCol[1 + j] );
            col[j+1] = std::min( minPrev, prevCol[j] + ( static_cast<unsigned int>( s1i != s2[j] ) ) );
        }
    }

    if (!(++i < len1))
        return col[len2];

    {
        prevCol[0] = i+1;
        const char s1i = s1[i];
        for( j = 0; j < len2; ++j )
        {
            const auto minPrev = 1 + std::min( prevCol[j], col[1 + j] );
            prevCol[j+1] = std::min( minPrev, col[j] + ( static_cast<unsigned int>( s1i != s2[j] ) ) );
        }
    }
}
return prevCol[len2];
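
A different way to avoid the element-wise swap, shown here only as a sketch and not as the code that was timed above, is to keep both buffers where they are and swap two pointers into them. The function name levenshtein_distance_ptrswap is made up for this illustration, the buffers are deliberately local rather than static, and the return type is assumed to be unsigned int.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical variant: the two columns live in fixed-size local arrays,
// and only the pointers to them are exchanged after each row.
unsigned int levenshtein_distance_ptrswap(const std::string& s1, const std::string& s2)
{
    constexpr std::size_t MAX_STRING_SIZE = 512;
    const std::size_t len1 = s1.size(), len2 = s2.size();
    assert(len2 < MAX_STRING_SIZE);  // only len2 + 1 entries of each buffer are used

    std::array<unsigned int, MAX_STRING_SIZE> bufA, bufB;
    unsigned int* prevCol = bufA.data();
    unsigned int* col = bufB.data();

    for (std::size_t j = 0; j <= len2; ++j)
        prevCol[j] = static_cast<unsigned int>(j);

    for (std::size_t i = 0; i < len1; ++i)
    {
        col[0] = static_cast<unsigned int>(i + 1);
        const char s1i = s1[i];
        for (std::size_t j = 0; j < len2; ++j)
        {
            const unsigned int minPrev = 1 + std::min(col[j], prevCol[j + 1]);
            col[j + 1] = std::min(minPrev, prevCol[j] + static_cast<unsigned int>(s1i != s2[j]));
        }
        std::swap(col, prevCol);  // exchanges two pointers, not 512 elements
    }
    return prevCol[len2];
}
```

Whether this matches the timing of the unrolled loop above has not been measured here; it only removes the cost that scales with MAX_STRING_SIZE.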

Answer 2

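First of all: @DanielFischer has very likely already pointed out what causes the performance degradation: swapping std::arrays is a linear-time operation, while swapping std::vectors is a constant-time operation. Some compilers might be able to optimize this away, but your gcc seems unable to do so.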
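
To make the difference concrete, here is a minimal sketch (both function names are invented for this illustration) that checks where the storage lives before and after a swap. A std::vector hands over its heap buffer, so the data pointers trade places; a std::array has nowhere to hand its storage to, so the elements themselves are exchanged one by one.

```cpp
#include <array>
#include <vector>

// True if swapping two vectors exchanged their heap buffers (O(1)).
bool vector_swap_exchanges_buffers()
{
    std::vector<unsigned int> a(512, 1u), b(512, 2u);
    const unsigned int* pa = a.data();
    const unsigned int* pb = b.data();
    a.swap(b);
    return a.data() == pb && b.data() == pa;
}

// True if swapping two arrays left the storage in place and exchanged
// the element values instead (O(N), here N = 512).
bool array_swap_moves_elements()
{
    std::array<unsigned int, 512> a, b;
    a.fill(1u);
    b.fill(2u);
    const unsigned int* pa = a.data();
    const unsigned int* pb = b.data();
    a.swap(b);
    return a.data() == pa && b.data() == pb && a[0] == 2u && b[0] == 1u;
}
```

With short input strings the inner loop only runs a handful of iterations per row, so exchanging all 512 elements after every row can easily dominate the run time.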

Just as important: by using static arrays the way you do here, your code is inherently not thread-safe. That is usually not a good idea.
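
If the goal of the static buffer was to avoid an allocation on every call, one possible middle ground is thread_local storage, so that each thread reuses its own buffer. The helper below is a hypothetical sketch (scratch_column is not part of the original code) and assumes the distance function never re-enters itself while the buffer is in use.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a per-thread scratch buffer holding at
// least n elements. The buffer only grows, so repeated calls on the same
// thread reuse the same allocation.
std::vector<unsigned int>& scratch_column(std::size_t n)
{
    thread_local std::vector<unsigned int> buf;
    if (buf.size() < n)
        buf.resize(n);
    return buf;
}
```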

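Removing one of the arrays (or vectors) together with the associated swap, and using a dynamically allocated C array instead, is actually quite simple and gives superior performance (at least on my setup).
A few transformations (such as consistently using size_t) lead to the following function: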

unsigned int levenshtein_distance3( const std::string & s1, const std::string & s2 )
{
    const size_t len1 = s1.size(), len2 = s2.size();
    ::std::unique_ptr<size_t[]> col(new size_t[len2 + 1]);

    for(size_t i = 0; i < len2+1; ++i )
        col[i] = i;

    for(size_t i = 0; i < len1; ++i )
    {
        size_t lastc = col[0];
        col[0] = i+1;
        const char s1i = s1[i];
        for(size_t j = 0; j < len2; ++j )
        {
            const auto minPrev = 1 + (::std::min)(col[j], col[j + 1]);
            const auto newc = (::std::min)(minPrev, lastc + (s1i != s2[j] ? 1 : 0));
            lastc = col[j+1];
            col[j + 1] = newc;
        }
    }
    return static_cast<unsigned int>(col[len2]);
}
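
For readers who prefer not to manage a raw new[] buffer, the same single-row recurrence can also be written with a std::vector owning the memory. This is only a sketch under that assumption: the name levenshtein_single_row is hypothetical, and it has not been benchmarked against the function above.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical single-row variant: one buffer, no swap at all.
std::size_t levenshtein_single_row(const std::string& s1, const std::string& s2)
{
    const std::size_t len1 = s1.size(), len2 = s2.size();
    std::vector<std::size_t> col(len2 + 1);
    for (std::size_t j = 0; j <= len2; ++j)
        col[j] = j;

    for (std::size_t i = 0; i < len1; ++i)
    {
        std::size_t diag = col[0];  // previous row, position j
        col[0] = i + 1;
        const char s1i = s1[i];
        for (std::size_t j = 0; j < len2; ++j)
        {
            const std::size_t above = col[j + 1];  // previous row, position j + 1
            const std::size_t substitution = diag + (s1i != s2[j] ? 1 : 0);
            // col[j] already holds the current row, so col[j] + 1 is an insertion.
            col[j + 1] = std::min({col[j] + 1, above + 1, substitution});
            diag = above;
        }
    }
    return col[len2];
}
```

For example, levenshtein_single_row("kitten", "sitting") evaluates to 3.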

Changing a vector to an array makes my program slower
