I need to know how far apart two strings are according to some measurable metric.
Answer 0: (score: 6)
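The most commonly used metric is probably the Levenshtein Distance, sometimes called "edit distance". Simply put, it measures the number of edits (insertions, deletions, substitutions or, in generalized variants, also transpositions) needed to make the two strings being compared identical.
The algorithm has simple, efficient and well-known implementations. The pseudocode here is taken directly from the Wikipedia article on Levenshtein distance: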
int LevenshteinDistance(char s[1..m], char t[1..n])
{
    // for all i and j, d[i,j] will hold the Levenshtein distance between
    // the first i characters of s and the first j characters of t;
    // note that d has (m+1)x(n+1) values
    declare int d[0..m, 0..n]

    for i from 0 to m
        d[i, 0] := i // the distance of any first string to an empty second string
    for j from 0 to n
        d[0, j] := j // the distance of any second string to an empty first string

    for j from 1 to n
    {
        for i from 1 to m
        {
            if s[i] = t[j] then
                d[i, j] := d[i-1, j-1] // no operation required
            else
                d[i, j] := minimum
                           (
                               d[i-1, j] + 1,  // a deletion
                               d[i, j-1] + 1,  // an insertion
                               d[i-1, j-1] + 1 // a substitution
                           )
        }
    }

    return d[m,n]
}
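Since the question is likely about Python, here is a rough translation of the pseudocode above. Treat it as a minimal sketch rather than a tuned implementation: the function name levenshtein_distance is my own choice, and because Python strings are 0-indexed, the pseudocode's s[i] and t[j] become s[i - 1] and t[j - 1].

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Return the Levenshtein distance between s and t (illustrative sketch)."""
    m, n = len(s), len(t)

    # d[i][j] holds the distance between the first i characters of s
    # and the first j characters of t; d has (m+1) x (n+1) entries.
    d = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        d[i][0] = i  # i deletions turn s[:i] into the empty string
    for j in range(n + 1):
        d[0][j] = j  # j insertions turn the empty string into t[:j]

    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # characters match, no edit needed
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j],      # a deletion
                    d[i][j - 1],      # an insertion
                    d[i - 1][j - 1],  # a substitution
                )

    return d[m][n]


print(levenshtein_distance("kitten", "sitting"))  # prints 3
```

This keeps the full (m+1) x (n+1) table to stay close to the pseudocode. If only the final distance is needed, keeping just the previous and current rows is enough.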
See also this related SO question: Good Python modules for fuzzy string comparison?
Answer 1: (score: 5)
Fortunately, Python comes with the difflib module :)
Take a look at the get_close_matches function.
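A quick sketch of how that might look. The word list is made up for illustration, and n=3 and cutoff=0.6 are simply the documented defaults written out explicitly. Note that difflib scores a similarity ratio between 0 and 1 (via SequenceMatcher), not an edit distance, so higher means closer.

```python
import difflib

# Hypothetical candidate list, just for illustration.
candidates = ["ape", "apple", "peach", "puppy"]

# Return up to n candidates whose similarity to "appel" is at least cutoff,
# ordered from most to least similar.
matches = difflib.get_close_matches("appel", candidates, n=3, cutoff=0.6)
print(matches)  # prints ['apple', 'ape']

# If you want the raw score for a single pair, SequenceMatcher exposes it.
ratio = difflib.SequenceMatcher(None, "appel", "apple").ratio()
print(ratio)
```

If you need an actual number of edits rather than a ranking of candidates, the Levenshtein approach from the other answer is the better fit.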