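difflib SequenceMatcher misses a common substring

Asked: 2018-10-05 21:30:15

Tags: python difflib sequencematcher

When trying to find the common substrings of two strings, SequenceMatcher does not return all of the expected common substrings.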

s1 = '++%2F%2F+Prints+%22Hello%2C+World%22+to+the+terminal+window.%0A++++++++System.out.pr%29%3B%0A++++%7D%0A%7D%0ASample+program%0Apublic+static+voclass+id+main%28String%5B%5D+args%29+'
s2 = 'gs%29+%7B%0A++++++++%2F'
# The common substrings are '+%', '%0A++++++++', '%2' and 'gs%29+',
# but 'gs%29+' is not matched.

import difflib as d

seqmatch = d.SequenceMatcher(None,s1,s2)
matches = seqmatch.get_matching_blocks()

for match in matches:
    # Each match is a named tuple (a, b, size); the last one is a zero-length sentinel.
    apos, bpos, matchlen = match
    print(s1[apos:apos+matchlen])

Output:

+%
%0A++++++++
%2

'gs%29+' is a common substring of s1 and s2, but SequenceMatcher does not find it.

Am I missing something?

Thanks
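
A minimal sketch for inspecting the problem, assuming the same s1 and s2 as above. It prints where each reported block starts in both strings, next to the positions of 'gs%29+'. The difflib documentation states that the triples returned by get_matching_blocks() are monotonically increasing in both strings, so this view may help explain the result. The names target, pos_in_s1, pos_in_s2 and blocks are illustrative only.

```python
import difflib

s1 = '++%2F%2F+Prints+%22Hello%2C+World%22+to+the+terminal+window.%0A++++++++System.out.pr%29%3B%0A++++%7D%0A%7D%0ASample+program%0Apublic+static+voclass+id+main%28String%5B%5D+args%29+'
s2 = 'gs%29+%7B%0A++++++++%2F'

# Where the expected substring sits in each string.
target = 'gs%29+'
pos_in_s1 = s1.find(target)
pos_in_s2 = s2.find(target)
print(repr(target), 'starts at', pos_in_s1, 'in s1 and at', pos_in_s2, 'in s2')

matcher = difflib.SequenceMatcher(None, s1, s2)

# Drop the zero-length sentinel that always ends the list.
blocks = [m for m in matcher.get_matching_blocks() if m.size > 0]

# Print the start in s1, the start in s2, and the matched text.
for m in blocks:
    print(m.a, m.b, repr(s1[m.a:m.a + m.size]))
```

In this input 'gs%29+' sits at the very end of s1 but at the very start of s2, while '%0A++++++++' appears earlier in s1 and later in s2. An ordered list of blocks cannot contain both, which would explain why the longer block is kept and 'gs%29+' is dropped.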

1 answer:

Answer 0 (score: 0)

Maybe junk characters are confusing the algorithm. I passed a lambda function as the isjunk argument of SequenceMatcher():

s1 = '++%2F%2F+Prints+%22Hello%2C+World%22+to+the+terminal+window.%0A++++++++System.out.pr%29%3B%0A++++%7D%0A%7D%0ASample+program%0Apublic+static+voclass+id+main%28String%5B%5D+args%29+'
s2 = 'gs%29+%7B%0A++++++++%2F'
# The expected substring is 'gs%29+'

import difflib as d

seqmatch = d.SequenceMatcher(lambda x: x in "+", s1, s2)
matches = seqmatch.get_matching_blocks()

for match in matches:
    apos, bpos, matchlen = match
    print(s1[apos:apos+matchlen])

The output is now:

gs%29+
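
If the goal is every common substring rather than an ordered alignment, SequenceMatcher may not be the right tool, with or without isjunk. With '+' marked as junk, the junk-free 'gs%29' becomes the longest candidate and is then extended by the trailing '+', but the other blocks are no longer reported. Below is a sketch of a different technique, a plain dynamic-programming scan over both strings. The names common_substrings and min_len are hypothetical and not part of difflib.

```python
def common_substrings(a, b, min_len=2):
    """Return the set of maximal common substrings of a and b
    that are at least min_len characters long (hypothetical helper)."""
    found = set()
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                k = prev[j - 1] + 1
                cur[j] = k
                # Record the run only where it cannot be extended to the right.
                at_end = i == len(a) or j == len(b) or a[i] != b[j]
                if at_end and k >= min_len:
                    found.add(a[i - k:i])
        prev = cur
    return found


s1 = '++%2F%2F+Prints+%22Hello%2C+World%22+to+the+terminal+window.%0A++++++++System.out.pr%29%3B%0A++++%7D%0A%7D%0ASample+program%0Apublic+static+voclass+id+main%28String%5B%5D+args%29+'
s2 = 'gs%29+%7B%0A++++++++%2F'

result = common_substrings(s1, s2, min_len=2)

# Longest substrings first.
for sub in sorted(result, key=len, reverse=True):
    print(repr(sub))
```

This runs in time proportional to len(a) * len(b), which is fine for strings of this size. Unlike get_matching_blocks(), it ignores ordering, so substrings that cross each other between the two strings are all reported.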