Question

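I want to compute the Levenshtein distance between sequences containing up to 6 values. The order of these values should not affect the distance.

How would I implement this as an iterative or recursive algorithm?

Example: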

# Currently 
>>> LDistance('dog', 'god')
2

# Sorted
>>> LDistance('dgo', 'dgo')
0

# Proposed
>>> newLDistance('dog', 'god')
0

'dog' and 'god' have exactly the same letters, so sorting the strings beforehand returns the desired result. However, this doesn't always work:

# Currently 
>>> LDistance('doge', 'gold')
3

# Sorted
>>> LDistance('dego', 'dglo')
2

# Proposed
>>> newLDistance('doge', 'gold')
1

'doge' and 'gold' have 3 of 4 letters in common, so the distance should be 1. Here is my current recursive code:

def mLD(s, t):
    memo = {}
    def ld(s, t):
        if not s: return len(t)
        if not t: return len(s)
        if s[0] == t[0]: return ld(s[1:], t[1:])
        if (s, t) not in memo:
            l1 = ld(s, t[1:])
            l2 = ld(s[1:], t)
            l3 = ld(s[1:], t[1:])
            memo[(s,t)] = 1 + min(l1, l2, l3)
        return memo[(s,t)]
    return ld(s, t)

Edit: follow-up question: Adding exceptions to Levenshtein-Distance-like algorithm

Answer 1

You don't need the Levenshtein machinery.

import collections
def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2   # can be omitted if len(s1) == len(s2)
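
As a quick sanity check, the same counting idea can be run against the examples from the question. The sketch below is a self-contained restatement, not a replacement: it builds the differences with `Counter.subtract` instead of two explicit loops, and the names `mismatched` and `length_gap` are introduced here only for illustration. The third call, with strings of unequal length, is an extra case that does not appear in the question.

```python
import collections

def distance(s1, s2):
    # Positive counts are letters left over in s1,
    # negative counts are letters left over in s2.
    cnt = collections.Counter(s1)
    cnt.subtract(s2)
    mismatched = sum(abs(diff) for diff in cnt.values())  # leftovers on both sides
    length_gap = abs(sum(cnt.values()))                   # abs(len(s1) - len(s2))
    # Both terms always have the same parity, so this equals
    # (mismatched + length_gap) / 2, the larger of the two leftover totals.
    return mismatched // 2 + (length_gap + 1) // 2

print(distance('dog', 'god'))    # 0
print(distance('doge', 'gold'))  # 1
print(distance('dog', 'do'))     # 1
```

When the two strings have the same length, `length_gap` is 0 and the second term vanishes, which is why it can be omitted in that case.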

Answer 2

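Why not count how many letters the strings have in common and derive the answer from that? Compute the frequency of each character in each string, then work out how many "extra" characters each string has relative to the other, and take the larger of the two "extra" counts.

Pseudocode: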

for c in s1:
    cnt1[c]++
for c in s2:
    cnt2[c]++
extra1 = 0
extra2 = 0
for c in all_chars:
    if cnt1[c]>cnt2[c]
        extra1 += cnt1[c]-cnt2[c]
    else
        extra2 += cnt2[c]-cnt1[c]
return max(extra1, extra2)
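
The pseudocode above could be written in Python roughly as follows. This is a minimal sketch, assuming the inputs are strings or other sequences of hashable values; the function name `unordered_distance` is made up for illustration. It relies on the fact that subtracting one `collections.Counter` from another keeps only the positive counts, which is exactly the "extra" characters on one side.

```python
from collections import Counter

def unordered_distance(s1, s2):
    cnt1, cnt2 = Counter(s1), Counter(s2)
    # Counter subtraction drops zero and negative results,
    # leaving only the characters one side has in surplus.
    extra1 = sum((cnt1 - cnt2).values())
    extra2 = sum((cnt2 - cnt1).values())
    return max(extra1, extra2)

print(unordered_distance('dog', 'god'))    # 0
print(unordered_distance('doge', 'gold'))  # 1
```

Each surplus character on the shorter side can be paired with one on the longer side as a substitution, and whatever remains on the longer side costs one insertion or deletion each, which is why the maximum of the two totals is the result.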

Modify Levenshtein-Distance to ignore order