Question

According to the python-Levenshtein.ratio source:

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722

it is computed as (lensum - ldist) / lensum. This works for

distance('ab', 'a') = 1
ratio('ab', 'a') = 0.666666

However, it seems to break for

distance('ab', 'ac') = 1
ratio('ab', 'ac') = 0.5

I feel I must be missing something very simple, but why isn't it 0.75?

Answer 1

The Levenshtein distance for 'ab' and 'ac' works out as follows.

The alignment is:

  a c
  a b

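Alignment length = 2
Number of mismatches = 1

The Levenshtein distance is 1, because a single substitution is enough to turn ac into ab (or the reverse).

Distance ratio = (Levenshtein distance) / (alignment length) = 0.5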

EDIT

You are writing

(lensum - ldist) / lensum = (1 - ldist/lensum) = 1 - 0.5 = 0.5.

But this is matching (not distance).
In the reference, you may notice it is written as

Matching %

p = (1 - l/m) × 100

where l is the Levenshtein distance and m is the length of the longer of the two words:

(Note: some authors use the longer of the two; I used the alignment length.)

(1 - 3/7) × 100 = 57.14...  

   Word 1     Word 2     Ratio   Mismatches   Match %
   AB         AB         0       0            (1 - 0/2) * 100 = 100%
   CD         AB         1       2            (1 - 2/2) * 100 = 0%
   AB         AC         0.5     1            (1 - 1/2) * 100 = 50%
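
To make the Match % column concrete, here is a minimal sketch that recomputes it. It assumes m is the length of the longer word (which equals the alignment length for these three rows); `levenshtein` and `match_percent` are names I made up for illustration, not part of any library.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance: insert, delete and substitute all cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # match (0) or substitute (1)
            ))
        prev = cur
    return prev[-1]


def match_percent(a, b):
    """p = (1 - l/m) * 100, with m assumed to be the longer length."""
    m = max(len(a), len(b))
    if m == 0:
        return 100.0  # two empty strings: treated as a full match here
    return (1 - levenshtein(a, b) / m) * 100


for w1, w2 in [("AB", "AB"), ("CD", "AB"), ("AB", "AC")]:
    print(w1, w2, levenshtein(w1, w2), match_percent(w1, w2))
```

The three printed rows should agree with the Mismatches and Match % columns of the table.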

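(Why do some authors divide by the alignment length and others by the maximum length of the two? Because Levenshtein does not consider gaps. Distance = number of edits (insertions + deletions + replacements), whereas the Needleman–Wunsch algorithm, the standard global alignment, does consider gaps. That gap handling is the difference between Needleman–Wunsch and Levenshtein, so many papers use the maximum length of the two sequences. But this is my own understanding, and I am not 100% sure.)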

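Here is a paper from IEEE Transactions on Pattern Analysis and Machine Intelligence: Computation of Normalized Edit Distance and Applications. It defines the normalized edit distance as follows:

Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d(X, Y), is defined as the minimum of W(P) / L(P), where P is an editing path between X and Y, W(P) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (the length of P).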
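
A rough sketch of that definition, as I read it: for every possible path length k, find the cheapest editing path with exactly k operations, then take the smallest W/k. The unit weights for insert, delete and substitute, the weight 0 for a match, and the convention that a match still counts as one operation in L(P) are all my assumptions; the paper allows general weights. The function name is hypothetical.

```python
def normalized_edit_distance(x, y):
    """min over editing paths P of W(P) / L(P), assuming unit weights."""
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0  # no operations at all; defined as 0 in this sketch
    inf = float("inf")
    max_len = m + n  # longest path: delete everything, then insert everything
    # best[k][i][j] = least weight of a path with exactly k operations
    # that turns x[:i] into y[:j]
    best = [[[inf] * (n + 1) for _ in range(m + 1)] for _ in range(max_len + 1)]
    best[0][0][0] = 0
    for k in range(1, max_len + 1):
        for i in range(m + 1):
            for j in range(n + 1):
                candidates = []
                if i > 0:
                    candidates.append(best[k - 1][i - 1][j] + 1)  # delete
                if j > 0:
                    candidates.append(best[k - 1][i][j - 1] + 1)  # insert
                if i > 0 and j > 0:
                    # match costs 0, substitution costs 1
                    candidates.append(best[k - 1][i - 1][j - 1] + (x[i - 1] != y[j - 1]))
                if candidates:
                    best[k][i][j] = min(candidates)
    return min(best[k][m][n] / k
               for k in range(1, max_len + 1)
               if best[k][m][n] < inf)


print(normalized_edit_distance("ab", "ac"))  # 0.5: one match + one substitution
```

For 'ab' and 'ac' the best path has two operations (match a, substitute b with c), so W/L = 1/2, which agrees with the distance ratio computed above.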

Answer 2

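By looking more carefully at the C code, I found that this apparent contradiction comes from the fact that ratio treats the "replace" edit operation differently from the other operations (it gives it a cost of 2), whereas distance treats all operations with a cost of 1.

This can be seen in the call to the internal levenshtein_common function made inside ratio_py: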

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L727

static PyObject*
ratio_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "ratio", 1, &lensum)) < 0) //Call
    return NULL;

  if (lensum == 0)
    return PyFloat_FromDouble(1.0);

  return PyFloat_FromDouble((double)(lensum - ldist)/(lensum));
}

and in the distance_py function:

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L715

static PyObject*
distance_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "distance", 0, &lensum)) < 0)
    return NULL;

  return PyInt_FromLong((long)ldist);
}

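which ultimately results in a different value of the cost argument being passed to another internal function, lev_edit_distance, whose documentation contains the following snippet: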

@xcost: If nonzero, the replace operation has weight 2, otherwise all
        edit operations have equal weights of 1.

lev_edit_distance() code:

/**
 * lev_edit_distance:
 * @len1: The length of @string1.
 * @string1: A sequence of bytes of length @len1, may contain NUL characters.
 * @len2: The length of @string2.
 * @string2: A sequence of bytes of length @len2, may contain NUL characters.
 * @xcost: If nonzero, the replace operation has weight 2, otherwise all
 *         edit operations have equal weights of 1.
 *
 * Computes Levenshtein edit distance of two strings.
 *
 * Returns: The edit distance.
 **/
_LEV_STATIC_PY size_t
lev_edit_distance(size_t len1, const lev_byte *string1,
                  size_t len2, const lev_byte *string2,
                  int xcost)
{
  size_t i;

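[ANSWER]

So in my example,

ratio('ab', 'ac') implies one replacement operation (cost of 2) over the total length of the strings (4), hence (4 - 2) / 4 = 0.5.

That explains the "how". I guess the only remaining aspect is the "why", but for the moment I am satisfied with this understanding.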
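
To check this reading without the C extension, here is a minimal pure-Python sketch. It is not the library's code: `edit_cost`, `replace_cost` and `ratio` are names I picked, and `replace_cost=2` stands in for a nonzero xcost.

```python
def edit_cost(a, b, replace_cost=1):
    """Edit cost with insert/delete = 1 and a configurable replace cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                                    # delete ca
                cur[j - 1] + 1,                                 # insert cb
                prev[j - 1] + (0 if ca == cb else replace_cost),  # match/replace
            ))
        prev = cur
    return prev[-1]


def ratio(a, b):
    """(lensum - ldist) / lensum, where ldist uses replace cost 2."""
    lensum = len(a) + len(b)
    if lensum == 0:
        return 1.0  # mirrors the lensum == 0 branch in ratio_py
    return (lensum - edit_cost(a, b, replace_cost=2)) / lensum


print(edit_cost("ab", "ac"))                  # 1  (what distance reports)
print(edit_cost("ab", "ac", replace_cost=2))  # 2  (what ratio uses as ldist)
print(ratio("ab", "ac"))                      # 0.5
```

With replace_cost=1 the function gives the plain distance, and with replace_cost=2 it gives the ldist that ratio plugs into the formula, which is where the 0.5 comes from.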

Answer 3

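Although there is no absolute standard, normalized Levenshtein distance is most commonly defined as ldist / max(len(a), len(b)). That yields .5 for both examples.

The max makes sense because it is the lowest upper bound on the Levenshtein distance: to obtain a from b where len(a) > len(b), you can always substitute the first len(b) elements of b with the corresponding elements of a, then insert the missing part a[len(b):], for a total of len(a) edit operations.

This argument extends in the obvious way to the case where len(a) <= len(b). To turn the normalized distance into a similarity measure, subtract it from one: 1 - ldist / max(len(a), len(b)).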
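
A small sketch to sanity-check the bound and the similarity formula. It brute-forces all strings of length 0 to 3 over a two-letter alphabet, which is only a spot check and not a proof; the helper names are mine.

```python
from itertools import product


def levenshtein(a, b):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_similarity(a, b):
    """1 - ldist / max(len(a), len(b)); two empty strings count as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


# Every string of length 0..3 over the alphabet {a, b}
words = ["".join(p) for n in range(4) for p in product("ab", repeat=n)]

# The distance never exceeds the longer length ...
bound_holds = all(levenshtein(a, b) <= max(len(a), len(b))
                  for a in words for b in words)
# ... and the bound is actually reached, e.g. when no characters are shared
bound_reached = levenshtein("aaa", "bbb") == 3

print(bound_holds, bound_reached)
print(normalized_similarity("ab", "ac"), normalized_similarity("ab", "a"))
```

Note that under this normalization both of the question's examples give 0.5, whereas the library's ratio gives 0.5 and 0.666..., so the two measures are not interchangeable.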

Answer 4

(lensum - ldist) / lensum

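ldist is not the distance, it is the sum of the costs.

Each number in the matrix that is not a match comes from above, from the left, or from the diagonal.

If the number comes from the left it is an insertion, if it comes from above it is a deletion, and if it comes from the diagonal it is a replacement.

Insertion and deletion cost 1, and replacement costs 2. The replacement cost is 2 because it is a deletion plus an insertion.

The cost for ab and ac is 2 because it is a replacement.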

>>> import Levenshtein as lev
>>> lev.distance("ab","ac")
1
>>> lev.ratio("ab","ac")
0.5
>>> (4.0-1.0)/4.0    # Wrong: the distance is 1, but the cost is 2 because it is a replacement
0.75
>>> lev.ratio("ab","a")
0.6666666666666666
>>> lev.distance("ab","a")
1
>>> (3.0-1.0)/3.0    # Coincidence: the distance equals the cost of one insertion or deletion, which is 1
0.6666666666666666
>>> x="ab"
>>> y="ac"
>>> lev.editops(x,y)
[('replace', 1, 1)]
>>> ldist = sum([2 for item in lev.editops(x,y) if item[0] == 'replace'])+ sum([1 for item in lev.editops(x,y) if item[0] != 'replace'])
>>> ldist
2
>>> ln=len(x)+len(y)
>>> ln
4
>>> (4.0-2.0)/4.0
0.5

For more details: python-Levenshtein ratio calculation

Another example:

The cost is 9 (4 replacements => 4 * 2 = 8, and 1 deletion => 1 * 1 = 1; 8 + 1 = 9)

str1=len("google") #6
str2=len("look-at") #7
str1 + str2 #13

Distance = 5 (the value in cell (7, 6) of the matrix)

The ratio is (13 - 9) / 13 = 0.3076923076923077

>>> c="look-at"
>>> d="google"
>>> lev.editops(c,d)
[('replace', 0, 0), ('delete', 3, 3), ('replace', 4, 3), ('replace', 5, 4), ('replace', 6, 5)]
>>> lev.ratio(c,d)
0.3076923076923077
>>> lev.distance(c,d)
5
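
One more way to look at it, offered as my own reading rather than anything stated in the library source: if a replacement costs the same as a deletion plus an insertion, the total cost is just the number of characters that are not part of a longest common subsequence (LCS), so ratio should equal 2 * LCS / (len(a) + len(b)). A minimal sketch with hypothetical helper names:

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a char from a or b
        prev = cur
    return prev[-1]


def ratio_from_lcs(a, b):
    """2 * LCS / lensum, assuming replace = delete + insert (cost 2)."""
    lensum = len(a) + len(b)
    if lensum == 0:
        return 1.0
    return 2 * lcs_length(a, b) / lensum


print(lcs_length("look-at", "google"))  # 2 (the shared "oo")
print(ratio_from_lcs("ab", "ac"))       # 0.5
print(ratio_from_lcs("look-at", "google"))
```

For "look-at" and "google" the LCS has length 2, so the cost is 13 - 2 * 2 = 9 and the ratio is 4 / 13, matching the numbers above.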
