计算Levenshtein距离的最有效方法

时间:2010-07-06 02:27:12

标签: algorithm optimization levenshtein-distance

我刚刚实现了一个最佳匹配文件搜索算法,以找到与字典中字符串最接近的匹配项。在对我的代码进行概要分析后,我发现绝大部分时间都花在计算查询与可能结果之间的距离上。我目前正在使用2-D数组实现算法来计算Levenshtein距离,这使得实现成为O(n ^ 2)运算。我希望有人可以提出更快的方法来做同样的事情。

这是我的实施:

public int calculate(String root, String query)
{
  int arr[][] = new int[root.length() + 2][query.length() + 2];

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = (int) root.charAt(i - 2);
    arr[i][1] = (i - 1);
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = (int) query.charAt(i - 2);
    arr[1][i] = (i - 1);
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return (int) Math.min(n1, Math.min(n2, n3));
}

6 个答案:

答案 0 :(得分:23)

Levenshtein距离上的wikipedia entry对于优化计算提供了有用的建议 - 在您的情况下最适用的是如果您可以在最大距离上设置一个约束k(超出任何范围)这可能是无限的!)你可以将计算减少到O(n times k)而不是O(n squared)(基本上只要最小可能距离变为> k就放弃)。

由于您正在寻找最接近的匹配,因此您可以逐渐将k减少到目前为止找到的最佳匹配距离 - 这不会影响最坏情况的行为(因为匹配可能按距离递减顺序,这意味着你永远不会拯救,但平均情况应该有所改善。

我相信,如果您需要获得基本更好的性能,您可能必须接受一些计算更近似距离的强大妥协(因此获得“相当不错的匹配”而非必然最佳的一个)。

答案 1 :(得分:7)

根据此博客上的评论Speeding Up Levenshtein,您可以使用VP-Trees并实现O(nlogn)。同一博客上的另一条评论指向python implementation of VP-Trees and Levenshtein。如果有效,请告知我们。

答案 2 :(得分:3)

我修改了此post上的Levenshtein距离VBA函数,以使用一维数组。它的执行速度要快得多。

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)

Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j        'setup array positions 0,1,2,3,4,...

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost

        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance2 = D(LD)
End Function

我用字符串&#39; s1&#39;测试了这个函数。长度为11,304和&#39; s2&#39;长度为5,665(> 6400万字符比较)。使用上述单维版本的函数,我的机器上的执行时间约为24秒。我在上面的链接中引用的原始二维函数对于相同的字符串需要约37秒。我已经进一步优化了单维函数,如下所示,对于相同的字符串,它需要大约10秒钟。

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long         'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long                       'loop counters, loop step
Dim ssL As Long, cost As Long                               'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long                      'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long                              'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String                      'Character arrays for string S1 & S2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                    'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j            'setup array positions 0,1,2,3,4,...

ReDim sss1(1 To L1)                                         'Size character array S1
ReDim sss2(1 To L2)                                         'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i    'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i    'Fill S2 character array

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost
        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance = D(LD)
End Function

答案 3 :(得分:2)

Wikipedia article讨论了您的算法和各种改进。但是,至少在一般情况下, O(n ^ 2)似乎是最好的。

如果你可以限制你的问题,有一些改进(例如,如果你只是对距离感兴趣,如果它小于 d ,复杂性是 O(dn) - 这可能是有意义的,因为距离接近字符串长度的匹配可能不是很有趣)。看看你是否可以利用问题的具体细节......

答案 4 :(得分:2)

Commons-lang的实施速度非常快。请参阅http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm

以下是我对Scala的翻译:

// The code below is based on code from the Apache Commons lang project.
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with this
 * work for additional information regarding copyright ownership. The ASF
 * licenses this file to You under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the
 * License. You may obtain a copy of the License at
 * 
 * http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 */
/**
* assert(levenshtein("algorithm", "altruistic")==6)
* assert(levenshtein("1638452297", "444488444")==9)
* assert(levenshtein("", "") == 0)
* assert(levenshtein("", "a") == 1)
* assert(levenshtein("aaapppp", "") == 7)
* assert(levenshtein("frog", "fog") == 1)
* assert(levenshtein("fly", "ant") == 3)
* assert(levenshtein("elephant", "hippo") == 7)
* assert(levenshtein("hippo", "elephant") == 7)
* assert(levenshtein("hippo", "zzzzzzzz") == 8)
* assert(levenshtein("hello", "hallo") == 1)
*
*/
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
import scala.annotation.tailrec
def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
  // Inside impl n <= m!
  val p = new Array[Int](n + 1) // 'previous' cost array, horizontally
  val d = new Array[Int](n + 1) // cost array, horizontally

  @tailrec def fillP(i: Int) {
    p(i) = i
    if (i < n) fillP(i + 1)
  }
  fillP(0)

  @tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = {
    d(0) = j
    @tailrec def eachI(i: Int) {
      val a = d(i - 1) + 1
      val b = p(i) + 1
      d(i) = if (a < b) a else {
        val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1
        if (b < c) b else c
      }
      if (i < n)
        eachI(i + 1)
    }
    eachI(1)

    if (j < m)
      eachJ(j + 1, t.charAt(j), p, d)
    else
      d(n)
  }
  eachJ(1, t.charAt(0), d, p)
}

val n = s.length
val m = t.length
if (n == 0) m else if (m == 0) n else {
  if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}

}

答案 5 :(得分:1)

我知道这已经很晚了,但它与手头的讨论有关。

正如其他人所提到的,如果您要做的只是检查两个字符串之间的编辑距离是否在某个阈值k内,您可以将时间复杂度降低到 O(kn)。更精确的表达式是 O((2k + 1)n)。你取一个跨越对角线单元两侧的k个单元的条带(条带长度为2k + 1)并计算出位于该条带上的单元格的值。

有趣的是,李等人有improvement。人。并且这已经进一步减少到O((k + 1)n)。