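Performance difference between two implementations of the same algorithm

Asked: 2012-09-12 23:04:13

Tags: vb.net performance algorithm

I'm working on an application that needs the Levenshtein algorithm to calculate the similarity between two strings.

Some time ago I adapted a C# version (which can easily be found on the Internet) to VB.NET, and it looks like this: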

Public Function Levenshtein1(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d(n, m) As Integer
    Dim cost As Integer
    Dim s1c As Char

    For i = 1 To n
        d(i, 0) = i
    Next
    For j = 1 To m
        d(0, j) = j
    Next

    For i = 1 To n
        s1c = s1(i - 1)

        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function

Then, trying to tweak it and improve its performance, I ended up with this version:

Public Function Levenshtein2(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d(n, m) As Integer
    Dim s1c As Char
    Dim cost As Integer

    For i = 1 To n
        d(i, 0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            d(0, j) = j

            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function

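Basically, I thought the distance array d(,) could be initialized inside the main loops, instead of needing two initial (and extra) loops. I really thought this would be a huge improvement... Unfortunately, not only is it no improvement over the original, it actually runs slower!

I have already tried to analyze both versions by looking at the generated IL code, but I just can't make sense of it.

So I'm hoping someone can shed some light on this and explain why the second version (even though it has fewer loops) runs slower than the original.

Note: the time difference is about 0.15 nanoseconds. That doesn't look like much, but when you have to check tens of millions of strings... the difference becomes quite significant.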
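For anyone who wants to reproduce the comparison, here is a minimal timing sketch based on System.Diagnostics.Stopwatch. It is not how the numbers above were measured; the module name TimingSketch and the helper AverageNanoseconds are hypothetical, and the workload in Main is only a placeholder to be replaced with a call to Levenshtein1 or Levenshtein2.

```vbnet
Imports System
Imports System.Diagnostics

Module TimingSketch
    ' Hypothetical helper: runs `work` `iterations` times and returns the
    ' average wall-clock time per call in nanoseconds.
    Public Function AverageNanoseconds(work As Action, iterations As Integer) As Double
        ' One warm-up call so JIT compilation is not part of the measurement.
        work()

        Dim sw As Stopwatch = Stopwatch.StartNew()
        For k As Integer = 1 To iterations
            work()
        Next
        sw.Stop()

        ' Milliseconds -> nanoseconds, then divide by the number of calls.
        Return sw.Elapsed.TotalMilliseconds * 1000000.0 / iterations
    End Function

    Sub Main()
        ' Placeholder workload (assumption): swap in the function under test.
        Dim sink As Integer = 0
        Dim ns As Double = AverageNanoseconds(Sub() sink += "kitten".Length, 1000000)
        Console.WriteLine("Average per call: {0:F2} ns", ns)
    End Sub
End Module
```

Building in Release mode and running outside the debugger usually gives more representative numbers.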

3 answers:

Answer 0: (score: 2)

It's because of this:

For i = 1 To n
    d(i, 0) = i
    s1c = s1(i - 1)

    For j = 1 To m
        d(0, j) = j 'THIS LINE HERE

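Before, you initialized this row of the array only once, at the start; now you are initializing it n times, once for every iteration of the outer loop. Accessing memory in an array like this has a cost, and you are now paying it n times over. You could change the line to: If i = 1 Then d(0, j) = j. In my tests, however, you still end up with a version that is slightly slower than the original. That also makes sense: you are executing that If statement n * m times, and again that has a cost. Hoisting the initialization out of the loops, as the original version does, is much cheaper: it becomes a single linear pass (O(m) for this row). Since the overall algorithm is O(n * m), any work you can move out into a linear-time step is a win.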
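To make the count concrete, here is a small sketch that only tallies how often the d(0, j) = j assignment would run under each loop structure. The module and function names are hypothetical, and the counter stands in for the real array write.

```vbnet
Imports System

Module RowZeroWriteCount
    ' Row 0 initialized once, before the main loops (structure of Levenshtein1).
    Public Function WritesWhenHoisted(n As Integer, m As Integer) As Long
        Dim writes As Long = 0
        For j As Integer = 1 To m
            writes += 1          ' stands in for d(0, j) = j
        Next
        Return writes
    End Function

    ' Assignment placed in the inner loop (structure of Levenshtein2).
    Public Function WritesWhenInInnerLoop(n As Integer, m As Integer) As Long
        Dim writes As Long = 0
        For i As Integer = 1 To n
            For j As Integer = 1 To m
                writes += 1      ' stands in for d(0, j) = j
            Next
        Next
        Return writes
    End Function

    Sub Main()
        Console.WriteLine(WritesWhenHoisted(20, 30))       ' prints 30
        Console.WriteLine(WritesWhenInInnerLoop(20, 30))   ' prints 600
    End Sub
End Module
```

For two strings of length 20 and 30, the inner-loop placement performs 600 writes where 30 are enough.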

Answer 1: (score: 2)

You can split the following line:

d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)

into:

tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + cost)

This avoids one addition.

You can also move the last "min" comparison into the If block and avoid assigning cost altogether:

tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
If s1c = s2(j - 1) Then
    d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
Else
    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
End If

So you save an addition whenever s1c = s2(j - 1).
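Put together, the two changes could look like the sketch below, which is Levenshtein1 with only the inner loop body altered. The name Levenshtein1b is hypothetical, and the early return for two empty strings is an added assumption (not in the original) that avoids dividing by zero. Whether it is actually faster would need to be measured.

```vbnet
Imports System

Module LevenshteinSplitMin
    ' Hypothetical name: Levenshtein1 with the shared "+ 1" and no cost variable.
    Public Function Levenshtein1b(s1 As String, s2 As String) As Double
        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length
        ' Assumption: two empty strings are treated as 100% similar.
        If n = 0 AndAlso m = 0 Then Return 100.0

        Dim d(n, m) As Integer
        For i As Integer = 1 To n
            d(i, 0) = i
        Next
        For j As Integer = 1 To m
            d(0, j) = j
        Next

        For i As Integer = 1 To n
            Dim s1c As Char = s1(i - 1)
            For j As Integer = 1 To m
                ' One shared "+ 1" covers both the deletion and insertion cases.
                Dim tmp As Integer = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
                If s1c = s2(j - 1) Then
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1))       ' match: no extra cost
                Else
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)   ' substitution
                End If
            Next
        Next

        Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
    End Function

    Sub Main()
        ' Edit distance 3 over a maximum length of 7, so roughly 57.14.
        Console.WriteLine(Levenshtein1b("kitten", "sitting"))
    End Sub
End Module
```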

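Answer 2: (score: 0)

Not a direct answer to your question, but for better performance you should consider using a jagged array (an array of arrays) instead of a multidimensional array. See the questions "What are the differences between a multidimensional array and an array of arrays in C#?" and "Why are multi-dimensional arrays in .NET slower than normal arrays?".

You will see that the jagged array has a code size of 7, while the multidimensional array has a code size of 10.

The code below uses a jagged array built from single-dimensional arrays: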

Public Function Levenshtein3(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length

    Dim d()() As Integer = New Integer(n)() {}
    Dim cost As Integer
    Dim s1c As Char

    For i = 0 To n
        d(i) = New Integer(m) {}
    Next

    For j = 1 To m
        d(0)(j) = j
    Next

    For i = 1 To n
        d(i)(0) = i
        s1c = s1(i - 1)

        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i)(j) = Math.Min(Math.Min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
        Next
    Next

    Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
End Function
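
One possible refinement of the jagged-array version, offered as an untested assumption rather than a measured improvement: because each row is an ordinary one-dimensional array, the inner loop can hold references to the previous and current rows instead of indexing d(i) each time. The name Levenshtein3b is hypothetical, and the early return for two empty strings is an added assumption that avoids dividing by zero.

```vbnet
Imports System

Module LevenshteinCachedRows
    ' Hypothetical variant of Levenshtein3 with cached row references.
    Public Function Levenshtein3b(s1 As String, s2 As String) As Double
        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length
        ' Assumption: two empty strings are treated as 100% similar.
        If n = 0 AndAlso m = 0 Then Return 100.0

        Dim d()() As Integer = New Integer(n)() {}
        For i As Integer = 0 To n
            d(i) = New Integer(m) {}
            d(i)(0) = i              ' first column: i deletions
        Next
        For j As Integer = 1 To m
            d(0)(j) = j              ' first row: j insertions
        Next

        For i As Integer = 1 To n
            Dim prev() As Integer = d(i - 1)   ' row above
            Dim cur() As Integer = d(i)        ' row being filled
            Dim s1c As Char = s1(i - 1)

            For j As Integer = 1 To m
                Dim cost As Integer = If(s1c = s2(j - 1), 0, 1)
                cur(j) = Math.Min(Math.Min(prev(j) + 1, cur(j - 1) + 1), prev(j - 1) + cost)
            Next
        Next

        Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
    End Function
End Module
```

It returns the same percentage as Levenshtein3 for the same inputs; any speed difference should be confirmed with a benchmark on real data.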