Python 3.6: How does SequenceMatcher.get_matching_blocks() work?

Asked: 2018-01-08 23:29:03

Tags: python python-3.x python-3.6 difflib sequencematcher

I am trying to use SequenceMatcher.ratio() to get the similarity of two strings, "86418648" and "86488648":

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None,"86418648","86488648").ratio()
0.5

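The returned ratio is 0.5, which is much lower than I expected, because the two strings differ in only one character.

It seems the ratio is computed from the matching blocks, so I tried running SequenceMatcher.get_matching_blocks():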

>>> SequenceMatcher(None,"86418648","86488648").get_matching_blocks()
[Match(a=4, b=0, size=4), Match(a=8, b=8, size=0)]
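
As a side note on how the two outputs relate: the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched elements and T is the total number of elements in both sequences. A minimal sketch that recomputes the ratio from the blocks (the variable names are only illustrative):

```python
from difflib import SequenceMatcher

a, b = "86418648", "86488648"
sm = SequenceMatcher(None, a, b)

# M: total number of matched characters across all matching blocks.
matched = sum(block.size for block in sm.get_matching_blocks())
# T: combined length of both sequences.
total = len(a) + len(b)

ratio_from_blocks = 2.0 * matched / total
print(matched, total, ratio_from_blocks)
```

With a single block of size 4 and 16 characters in total, this gives 2 * 4 / 16, which agrees with the 0.5 reported above.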

But I expected the result to be:

[Match(a=0, b=0, size=3), Match(a=4, b=4, size=4), Match(a=8, b=8, size=0)]

Can anyone help explain why the first three digits, "864", are not matched?

1 Answer:

Answer 0 (score: 0)

SequenceMatcher.get_matching_blocks() works by repeatedly applying SequenceMatcher.find_longest_match() to the as-yet-unmatched blocks of the two sequences.

Quoting the docstring of find_longest_match():
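
The repeated application can be sketched as a short recursive function. This is a simplified illustration rather than difflib's actual code: the real method uses an explicit queue, sorts the result, and merges adjacent blocks. The name matching_blocks_sketch is hypothetical.

```python
from difflib import SequenceMatcher

def matching_blocks_sketch(a, b):
    """Simplified illustration of how matching blocks are collected."""
    sm = SequenceMatcher(None, a, b)
    blocks = []

    def recurse(alo, ahi, blo, bhi):
        # Find the longest match inside the current window of a and b.
        i, j, k = sm.find_longest_match(alo, ahi, blo, bhi)
        if k == 0:
            return
        # Only material to the left of the match in BOTH sequences,
        # or to the right in BOTH sequences, is searched further.
        recurse(alo, i, blo, j)
        blocks.append((i, j, k))
        recurse(i + k, ahi, j + k, bhi)

    recurse(0, len(a), 0, len(b))
    blocks.append((len(a), len(b), 0))  # terminating dummy block
    return blocks

print(matching_blocks_sketch("86418648", "86488648"))
```

For the strings in the question, the first match leaves an empty window in b on its left side and an empty window in a on its right side, so both recursive calls return immediately.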

Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
    alo <= i <= i+k <= ahi
    blo <= j <= j+k <= bhi
and for all (i',j',k') meeting those conditions,
    k >= k'
    i <= i'
    and if i == i', j <= j'

In other words, of all maximal matching blocks, return one that
starts earliest in a, and of all those maximal matching blocks that
start earliest in a, return the one that starts earliest in b.

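For the two sequences a = "86418648" and b = "86488648", the longest block in a that matches some block in b is 8648 at a[4], and the earliest match in b is the first of the two possible matches, at b[0].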
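
This first step can be reproduced by calling find_longest_match() directly over the full range of both strings. A small sketch; the bounds are passed explicitly because they are required arguments in Python 3.6:

```python
from difflib import SequenceMatcher

a, b = "86418648", "86488648"
sm = SequenceMatcher(None, a, b)

# Search all of a and all of b for the longest common block.
match = sm.find_longest_match(0, len(a), 0, len(b))
print(match)  # Match(a=4, b=0, size=4)

# "8648" occurs twice in b (at b[0] and b[4]); the earlier start wins.
print(a[4:8], b[0:4], b[4:8])
```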

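Once this match has been determined, no further matches are possible, in accordance with the guarantee provided by SequenceMatcher.get_matching_blocks() that "the triples are monotonically increasing in i and j". The only unmatched material left is a[0:4], which lies before the match in a, and b[4:8], which lies after the match in b.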

For example, matching the as-yet-unmatched 864 at a[0] with the as-yet-unmatched 864 at b[4] would require i to decrease while j increases (or vice versa), violating the guarantee above.
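
The monotonicity argument can be checked with a small helper. This is only a sketch: is_monotonic and the hypothetical block list are illustrative names, not part of difflib.

```python
from difflib import SequenceMatcher

def is_monotonic(blocks):
    """True if the a-start and b-start never decrease from block to block."""
    return all(p[0] <= q[0] and p[1] <= q[1]
               for p, q in zip(blocks, blocks[1:]))

a, b = "86418648", "86488648"
actual = [tuple(m) for m in SequenceMatcher(None, a, b).get_matching_blocks()]

# Hypothetical: also pair the leftover "864" at a[0] with the one at b[4].
hypothetical = sorted(actual + [(0, 4, 3)])

print(is_monotonic(actual))        # True
print(is_monotonic(hypothetical))  # False
```

Sorted by position in a, the hypothetical list starts with (0, 4, 3) followed by (4, 0, 4), so j drops from 4 to 0 while i rises from 0 to 4.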