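I am developing an application that needs the Levenshtein algorithm to compute the similarity of two strings.
Some time ago I adapted a C# version (easy to find on the internet) to VB.NET. It looks like this: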
Public Function Levenshtein1(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length
    Dim d(n, m) As Integer
    Dim cost As Integer
    Dim s1c As Char
    For i = 1 To n
        d(i, 0) = i
    Next
    For j = 1 To m
        d(0, j) = j
    Next
    For i = 1 To n
        s1c = s1(i - 1)
        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next
    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
Then, while trying to tweak it and improve its performance, I ended up with this version:
Public Function Levenshtein2(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length
    Dim d(n, m) As Integer
    Dim s1c As Char
    Dim cost As Integer
    For i = 1 To n
        d(i, 0) = i
        s1c = s1(i - 1)
        For j = 1 To m
            d(0, j) = j
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next
    Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
End Function
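Basically, I figured that the distance array d(,) could be initialized inside the main loop, without the two initial (and extra) loops. I really thought this would be a huge improvement... Unfortunately, not only does it fail to improve on the original, it actually runs slower!
I have tried to analyze both versions by looking at the generated IL code, but I cannot make sense of it.
So I am hoping someone can shed some light on this and explain why the second version, even though it has fewer loops, runs slower than the original.
Note: the time difference is about 0.15 nanoseconds. That does not look like much, but when you have to check tens of millions of strings... the difference becomes very significant.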
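A difference this small is far below what a single Stopwatch reading can resolve, so it only becomes visible as an average over many calls. The sketch below shows one way such a measurement might be set up. The module name, `AverageNanoseconds`, and the `Dummy` workload are hypothetical and not from the post; it assumes the functions under test match `Func(Of String, String, Double)`.

```vbnet
Imports System
Imports System.Diagnostics

Module LevenshteinTiming
    ' Hypothetical helper, not part of the original post: returns the
    ' average wall-clock time of one call to f, in nanoseconds.
    Public Function AverageNanoseconds(f As Func(Of String, String, Double),
                                       s1 As String, s2 As String,
                                       iterations As Integer) As Double
        Dim sink As Double = 0.0
        ' Warm-up calls so that JIT compilation is not part of the measurement.
        For k As Integer = 1 To 1000
            sink += f(s1, s2)
        Next
        Dim sw As Stopwatch = Stopwatch.StartNew()
        For k As Integer = 1 To iterations
            sink += f(s1, s2)
        Next
        sw.Stop()
        ' Use the accumulated result so the calls cannot be discarded.
        GC.KeepAlive(sink)
        ' Milliseconds to nanoseconds, then divide by the number of calls.
        Return sw.Elapsed.TotalMilliseconds * 1000000.0 / iterations
    End Function

    ' Stand-in workload so the sketch runs on its own.
    Private Function Dummy(a As String, b As String) As Double
        Return a.Length + b.Length
    End Function

    Sub Main()
        Dim ns As Double = AverageNanoseconds(AddressOf Dummy, "kitten", "sitting", 1000000)
        Console.WriteLine("average ns per call: " & ns.ToString("F2"))
    End Sub
End Module
```

To compare the two versions, one would pass `AddressOf Levenshtein1` and `AddressOf Levenshtein2` in place of `Dummy`, ideally in a Release build and with inputs representative of the real data.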
Answer 0 (score: 2)
It is because of this:
For i = 1 To n
    d(i, 0) = i
    s1c = s1(i - 1)
    For j = 1 To m
        d(0, j) = j 'THIS LINE HERE
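Before, you initialized this row of the array once; now you initialize it n times. Writing to array memory like this has a cost, and you are now paying it n times over. You could change that line to: If i = 1 Then d(0, j) = j
However, in my tests you still end up with a version that is slightly slower than the original. That makes sense too: you are now executing this If statement n * m times, which again has a cost. Moving the initialization out of the nested loop, as the original version does, is much cheaper because it becomes a linear step (O(m) for this row). Since the overall algorithm is O(n * m), any work you can move into a linear step is a win.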
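For reference, the guarded variant described above might look like the following sketch. The name `Levenshtein2Guarded` is made up here; everything else is copied from Levenshtein2. One caveat applies to both Levenshtein2 and this variant: if s1 is empty the outer loop never runs, row 0 is never initialized, and the function returns 100 for any non-empty s2, whereas Levenshtein1 returns 0 in that case.

```vbnet
Imports System

Module GuardedInit
    ' Hypothetical name; identical to Levenshtein2 except for the guarded line.
    Public Function Levenshtein2Guarded(s1 As String, s2 As String) As Double
        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length
        Dim d(n, m) As Integer
        Dim s1c As Char
        Dim cost As Integer
        For i As Integer = 1 To n
            d(i, 0) = i
            s1c = s1(i - 1)
            For j As Integer = 1 To m
                ' Row 0 is written only during the first outer iteration,
                ' but this test itself still runs n * m times.
                If i = 1 Then d(0, j) = j
                If s1c = s2(j - 1) Then
                    cost = 0
                Else
                    cost = 1
                End If
                d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
            Next
        Next
        Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
    End Function
End Module
```

The guard removes the redundant writes but replaces them with an equally frequent comparison, which is consistent with the observation that it does not beat the original.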
Answer 1 (score: 2)
You can split the following line:
    d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
as follows:
    tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + cost)
This avoids one addition.
You can also move the last "min" comparison into the If branches and avoid assigning cost at all:
    tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
    If s1c = s2(j - 1) Then
        d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
    Else
        d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
    End If
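Dropped into the original function, the two suggestions might combine as in the sketch below. The name `Levenshtein1Split` is hypothetical, and the initialization loops are kept as in Levenshtein1. The post gives no timings for this change, so any gain would need to be measured.

```vbnet
Imports System

Module SplitMin
    ' Hypothetical name; Levenshtein1 with the split Min and no cost variable.
    Public Function Levenshtein1Split(s1 As String, s2 As String) As Double
        Dim n As Integer = s1.Length
        Dim m As Integer = s2.Length
        Dim d(n, m) As Integer
        Dim tmp As Integer
        Dim s1c As Char
        For i As Integer = 1 To n
            d(i, 0) = i
        Next
        For j As Integer = 1 To m
            d(0, j) = j
        Next
        For i As Integer = 1 To n
            s1c = s1(i - 1)
            For j As Integer = 1 To m
                ' One shared "+ 1" covers both the deletion and insertion cases.
                tmp = Math.Min(d(i - 1, j), d(i, j - 1)) + 1
                If s1c = s2(j - 1) Then
                    ' Characters match: the diagonal is taken without adding anything.
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1))
                Else
                    ' Characters differ: the substitution costs 1.
                    d(i, j) = Math.Min(tmp, d(i - 1, j - 1) + 1)
                End If
            Next
        Next
        Return (1.0 - (d(n, m) / Math.Max(n, m))) * 100
    End Function
End Module
```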
This way you save an addition whenever s1c = s2(j - 1).
Answer 2 (score: 0)
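This is not a direct answer to your question, but to improve performance you should consider using a jagged array (an array of arrays) instead of a multidimensional array. See "What are the differences between a multidimensional array and an array of arrays in C#?" and "Why are multi-dimensional arrays in .NET slower than normal arrays?"
There you will see that the code size is 7 for the jagged array and 10 for the multidimensional array.
The code below uses a jagged array, that is, an array of single-dimensional arrays: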
Public Function Levenshtein3(s1 As String, s2 As String) As Double
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length
    Dim d()() As Integer = New Integer(n)() {}
    Dim cost As Integer
    Dim s1c As Char
    For i = 0 To n
        d(i) = New Integer(m) {}
    Next
    For j = 1 To m
        d(0)(j) = j
    Next
    For i = 1 To n
        d(i)(0) = i
        s1c = s1(i - 1)
        For j = 1 To m
            If s1c = s2(j - 1) Then
                cost = 0
            Else
                cost = 1
            End If
            d(i)(j) = Math.Min(Math.Min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
        Next
    Next
    Return (1.0 - (d(n)(m) / Math.Max(n, m))) * 100
End Function