I have two examples of string pairs:
YHFLSPYVY      # answer
   LSPYVYSPR   # prediction
+++******ooo

  YHFLSPYVS    # answer
VEYHFLSPY      # prediction
oo*******++
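As shown above, I would like to find the overlapping region (*) and the non-overlapping regions in the answer (+) and in the prediction (o).
How can I do this in Python?
I'm stuck with this: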
import re
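# This is for example 1
ans = "YHFLSPYVY"
pred = "LSPYVYSPR"
# Lookahead search for the whole of pred inside ans
matches = re.finditer(r'(?=(%s))' % re.escape(pred), ans)
print([m.start(1) for m in matches])
# prints [] because pred as a whole never occurs inside ans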
The result I would like to get is, for example:
plus_len = 3
star_len = 6
ooo_len = 3
Answer 0 (score: 3)
Use difflib.SequenceMatcher.find_longest_match:
from difflib import SequenceMatcher

def f(answer, prediction):
    sm = SequenceMatcher(a=answer, b=prediction)
    # Longest block of characters common to both strings (the '*' region)
    match = sm.find_longest_match(0, len(answer), 0, len(prediction))
    star_len = match.size
    return (len(answer) - star_len, star_len, len(prediction) - star_len)
The function returns a 3-tuple of integers (plus_len, star_len, ooo_len):
f('YHFLSPYVY', 'LSPYVYSPR') -> (3, 6, 3)
f('YHFLSPYVS', 'VEYHFLSPY') -> (2, 7, 2)
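One caveat: find_longest_match looks for the longest common block anywhere in the two strings, not only at their ends. If the overlap has to be anchored the way the diagrams in the question suggest (the end of one string meeting the start of the other), a plain suffix/prefix check may be closer to what is wanted. The following is only a minimal sketch under that assumption; end_overlap and overlap_lengths are hypothetical helper names, not part of difflib.

```python
def end_overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def overlap_lengths(answer, prediction):
    """Return (plus_len, star_len, ooo_len) using an end-anchored overlap.

    Both orderings are tried (answer first, or prediction first) and the
    larger overlap is kept.
    """
    star_len = max(end_overlap(answer, prediction),
                   end_overlap(prediction, answer))
    return (len(answer) - star_len, star_len, len(prediction) - star_len)


print(overlap_lengths('YHFLSPYVY', 'LSPYVYSPR'))  # (3, 6, 3)
print(overlap_lengths('YHFLSPYVS', 'VEYHFLSPY'))  # (2, 7, 2)
```

For the two examples in the question both approaches agree. They differ on inputs such as 'AXYZB' and 'CXYZD', where the common block 'XYZ' sits in the middle of both strings: the anchored version reports no overlap there.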
Answer 1 (score: 1)
You can use difflib:
import difflib

ans = "YHFLSPYVY"
pred = "LSPYVYSPR"

def get_overlap(s1, s2):
    # Longest block of characters common to both strings
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2))
    return s1[pos_a:pos_a + size]

overlap = get_overlap(ans, pred)
# Remove only one occurrence, in case the overlap text repeats elsewhere
plus = ans.replace(overlap, "", 1)
oo = pred.replace(overlap, "", 1)
print(len(overlap))  # 6
print(len(plus))     # 3
print(len(oo))       # 3
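If the marker line from the question (+, * and o) is wanted as well as the lengths, the positions returned by find_longest_match can tell which string comes first. This is a rough sketch, not a definitive implementation: overlap_diagram is a hypothetical name, and it assumes the shared block sits at the end of one string and the start of the other, as in the two examples.

```python
from difflib import SequenceMatcher


def overlap_diagram(answer, prediction):
    """Build a marker line: '+' answer only, '*' shared, 'o' prediction only.

    Assumption: the shared block ends one string and starts the other.
    """
    sm = SequenceMatcher(None, answer, prediction, autojunk=False)
    m = sm.find_longest_match(0, len(answer), 0, len(prediction))
    plus = '+' * (len(answer) - m.size)
    star = '*' * m.size
    ooo = 'o' * (len(prediction) - m.size)
    if m.a >= m.b:
        # The match starts later in the answer, so the answer is leftmost
        return plus + star + ooo
    # Otherwise the prediction is leftmost
    return ooo + star + plus


print(overlap_diagram('YHFLSPYVY', 'LSPYVYSPR'))  # +++******ooo
print(overlap_diagram('YHFLSPYVS', 'VEYHFLSPY'))  # oo*******++
```

The lengths of the three runs of markers are the same numbers as len(plus), len(overlap) and len(oo) above.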