找出一个单词和一千个数组之间的最小Levenshtein距离

时间:2016-01-29 03:04:34

标签: excel vba nlp levenshtein-distance edit-distance

所以我的用户在注册表中写了他们的地址,但很多人都有拼写错误。我从城市记录中检索了另一个列表,其中包含这些地址的正确拼写。所以,让我说我有" Brooklny"由他们输入,我有正确的名字列表:布鲁克林,曼哈顿,布朗克斯,史坦顿岛,皇后区(这是一个例子,实际地址是西班牙语,并参考墨西哥城的邻里)。

我想找到布鲁克林和每个自治市镇名称之间的编辑距离,然后找到布鲁克林有最小编辑距离的单词。

所以编辑之间的距离:Brooklny-Brooklyn是2,Brooklny-Bronx是4,依此类推。布鲁克林当然最低2分。

想象一下,布鲁克林在小区A1和布鲁克林,曼哈顿,布朗克斯,史坦顿岛,皇后区各有一个小区,来自B1:B6

我在VBA中为Excel中的用户定义函数执行此操作,到目前为止,我有这段代码,但它不起作用。

Function Minl(ByVal string1 As String, ByVal correctos As Range) As Variant

Dim distancias(3) As Integer
Dim i, minimo As Integer
i = 0
For Each c In correctos.Cells
    distancias(i) = Levenshtein(string1, c.Value)
    i = i + 1
Next c

Minl = Minrange(distancias)

End Function

Function Levenshtein(ByVal string1 As String, ByVal string2 As String) As Long

Dim i As Long, j As Long
Dim string1_length As Long
Dim string2_length As Long
Dim distance() As Long

string1_length = Len(string1)
string2_length = Len(string2)
ReDim distance(string1_length, string2_length)

For i = 0 To string1_length
distance(i, 0) = i
Next

For j = 0 To string2_length
    distance(0, j) = j
Next

For i = 1 To string1_length
    For j = 1 To string2_length
        If Asc(Mid$(string1, i, 1)) = Asc(Mid$(string2, j, 1)) Then
            distance(i, j) = distance(i - 1, j - 1)
        Else
            distance(i, j) = Application.WorksheetFunction.Min _
            (distance(i - 1, j) + 1, _
            distance(i, j - 1) + 1, _
            distance(i - 1, j - 1) + 1)
        End If
    Next
Next

Levenshtein = distance(string1_length, string2_length)

End Function

Function Minrange(ParamArray values() As Variant) As Variant
Dim minValue, Value As Variant
minValue = values(0)
For Each Value In values
   If Value < minValue Then minValue = Value
Next
Minrange = minValue
End Function

我认为算法是正确的但我认为我可能在使用语法时遇到问题。 levenshtein函数有效,但我不确定其他两个。

1 个答案:

答案 0 :(得分:0)

要获得最接近的输出,您可以使用它:

Function get_match(ByVal str As String, rng As Range) As String
  Dim itm As Variant, outp(0 To 2) As Variant
  outp(1) = 0: outp(2) = ""
  For Each itm In rng.Text
    outp(0) = Levenshtein(itm, str)
    If outp(0) = 0 Then
      get_match = itm
      Exit Function
    ElseIf outp(1) = 0 Or outp(0) < outp(1) Then
      outp(1) = outp(0)
      outp(2) = itm
    End If
  Next
  get_match = outp(1)
End Function

稍后获取距离,您只需运行Levenshtein(string,get_match(string,range))

即可

仍然......我不确定你在寻找什么:/