Question

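I am developing an application that needs the Levenshtein algorithm to calculate the similarity of two strings.

A while ago I adapted a C# version (which can easily be found on the internet) to VB.NET. It looks like this: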

Public Function Levenshtein1(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d(n, m) As Integer
    Dim cost As Integer
    Dim s1c As Char

    For i = 1 To n
        d(i, 0) = i
    Next
    For j = 1 To m
        d(0, j) = j
    Next

    For i = 1 To n
        s1c = s1(i - 1)

        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function

Then, trying to tweak it and improve its performance, I ended up with this version:

Public Function Levenshtein2(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d(n, m) As Integer
    Dim s1c As Char
    Dim cost As Integer

    For i = 1 To n
        d(i, 0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            d(0, j) = j

            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function

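Basically, I figured that the distance array d(,) could be initialized inside the main loops, without needing the two initial (and extra) loops. I really thought this would be a huge improvement... Unfortunately, not only does it fail to improve on the original, it actually runs slower!

I have already tried to analyze both versions by looking at the generated IL code, but I can't make sense of it.

So I am hoping someone can shed some light on this and explain why the second version (even though it has fewer For loops) runs slower than the original.

Note: The time difference is about 0.15 nanoseconds. That doesn't look like much, but when you have to check tens of millions of strings... the difference becomes quite noticeable.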
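For anyone wanting to reproduce a comparison like this, a rough Stopwatch-based harness could look like the sketch below. The module name TimingSketch and the helper TimeIt are hypothetical, and the sketch assumes the function being measured has the same signature as Levenshtein1 and Levenshtein2.

```vbnet
Imports System
Imports System.Diagnostics

Module TimingSketch
    ' Hypothetical helper: calls f(a, b) the given number of times
    ' and returns the total elapsed time in milliseconds.
    Function TimeIt(f As Func(Of String, String, Double),
                    a As String, b As String,
                    iterations As Integer) As Double
        ' Warm-up call so that JIT compilation is not part of the measurement.
        Dim sink As Double = f(a, b)

        Dim sw As Stopwatch = Stopwatch.StartNew()
        For k = 1 To iterations
            ' Accumulate the result so the call has an observable effect.
            sink += f(a, b)
        Next
        sw.Stop()

        Return sw.Elapsed.TotalMilliseconds
    End Function
End Module
```

It could be called as TimeIt(AddressOf Levenshtein1, "kitten", "sitting", 1000000) and again with Levenshtein2. Dividing the result by the iteration count gives a per-call figure, which is more meaningful in a Release build run outside the debugger.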

Answer 1

It's because of this:

    For i = 1 To n
        d(i, 0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            d(0, j) = j 'THIS LINE HERE

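In the original you initialized this row of the array once; now you are initializing it n times. Accessing memory in an array like this has a cost, and you are now paying it n times per element instead of once.

You could change the line to: If i = 1 Then d(0, j) = j. In my tests, however, you still end up with a version that is slightly slower than the original. That makes sense too: you are now executing this If statement n * m times, and that again has a cost.

Moving the initialization out of the nested loops, as the original version does, is a lot cheaper, because it ends up being O(n + m). Since the overall algorithm is O(n * m), any work you can move out of the nested loops into a linear step is a win.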
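To make the difference concrete, here is a minimal sketch that only counts how often the row-0 assignment runs in each layout. The module name and the lengths 6 and 7 (for example "kitten" and "sitting") are illustrative assumptions, not part of the question.

```vbnet
Imports System

Module InitWriteCount
    Sub Main()
        Dim n As Integer = 6 ' assumed length of s1
        Dim m As Integer = 7 ' assumed length of s2

        ' Layout of Levenshtein1: row 0 is filled by its own loop.
        Dim hoistedWrites As Integer = 0
        For j = 1 To m
            hoistedWrites += 1 ' stands in for d(0, j) = j
        Next

        ' Layout of Levenshtein2: the assignment sits in the inner loop.
        Dim nestedWrites As Integer = 0
        For i = 1 To n
            For j = 1 To m
                nestedWrites += 1 ' stands in for d(0, j) = j
            Next
        Next

        ' Prints "hoisted: 7, nested: 42"
        Console.WriteLine($"hoisted: {hoistedWrites}, nested: {nestedWrites}")
    End Sub
End Module
```

The nested layout performs n * m writes to a row that only needs m, and the gap grows with the length of s1.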

Answer 2

You can split the following line:

d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)

as follows:

tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + cost)

This avoids one addition.

You can also move the last Min comparison into the If statement and avoid assigning cost:

tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
If s1c = s2(j - 1) Then
    d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
Else
    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
End If

So you save the addition whenever s1c = s2(j - 1).
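Put together, the two suggestions might look like the sketch below when applied to the original function. The names Levenshtein2b and SplitMinSketch are hypothetical, and whether this version is measurably faster would need to be checked by timing it.

```vbnet
Imports System

Module SplitMinSketch
    ' Levenshtein1 with the Min call split and the cost variable removed.
    Public Function Levenshtein2b(s1 As String, s2 As String) As Double
        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length

        Dim d(n, m) As Integer
        Dim s1c As Char
        Dim tmp As Integer

        ' Initialize the first column and the first row once, as in Levenshtein1.
        For i = 1 To n
            d(i, 0) = i
        Next
        For j = 1 To m
            d(0, j) = j
        Next

        For i = 1 To n
            s1c = s1(i - 1)

            For j = 1 To m
                ' Deletion and insertion both cost 1, so add it once after Min.
                tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1

                If s1c = s2(j - 1) Then
                    ' Characters match: the diagonal move is free.
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
                Else
                    ' Characters differ: substitution costs 1.
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
                End If
            Next
        Next

        Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
    End Function

    Sub Main()
        ' "kitten" -> "sitting" has edit distance 3, so this prints about 57.14.
        Console.WriteLine(Levenshtein2b("kitten", "sitting"))
    End Sub
End Module
```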

Answer 3

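This is not a direct answer to your question, but for better performance you should consider using a jagged array (an array of arrays) instead of a multidimensional array. See "What are the differences between a multidimensional array and an array of arrays in C#?" and "Why are multi-dimensional arrays in .NET slower than normal arrays?"

There you will see that the code size is 7 for the jagged array and 10 for the multidimensional array.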
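As a small illustration of the two array kinds, the sketch below declares both and writes one element to each. The module name is hypothetical, and the comments about the emitted IL describe what the compilers typically generate; they are worth confirming with a disassembler for your own build.

```vbnet
Imports System

Module ArrayAccessSketch
    Sub Main()
        ' Multidimensional (rectangular) array: one object with two dimensions.
        Dim multi(2, 3) As Integer

        ' Jagged array: an outer array whose elements are separate
        ' one-dimensional arrays, each allocated individually.
        Dim jagged()() As Integer = New Integer(2)() {}
        For i = 0 To 2
            jagged(i) = New Integer(3) {}
        Next

        ' Typically compiled to a call to the array type's Set method.
        multi(1, 2) = 5

        ' Typically compiled to plain element instructions
        ' (load the inner array, then store the element).
        jagged(1)(2) = 5

        ' Prints 10
        Console.WriteLine(multi(1, 2) + jagged(1)(2))
    End Sub
End Module
```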

The code below uses a jagged array, that is, an array of one-dimensional arrays:

Public Function Levenshtein3(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d()() As Integer = New Integer(n)() {}
    Dim cost As Integer
    Dim s1c As Char

    For i = 0 To n
        d(i) = New Integer(m) {}
    Next

    For j = 1 To m
        d(0)(j) = j
    Next

    For i = 1 To n
        d(i)(0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i)(j) = Math.Min(Math.Min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
End Function

Performance difference between two implementations of the same algorithm
