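I have two sequences, for example:
Seq 1: MAT--LA-B
Seq 2: MATATLAB
Is it possible in Python to compare the two sequences and then insert the missing parts into sequence 1 without changing the rest of sequence 1, so that sequence 1 ends up as MATAT--LA-B?
The insertions may occur at several positions. (I have a multiple sequence alignment from which parts of the sequences were dropped, and I want to reinsert those parts.)
Thanks in advance!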
Answer 0 (score: 0)
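I'd suggest starting from the opcodes that transform one sequence into the other. You can generate them with difflib.SequenceMatcher.get_opcodes. Each opcode is a tuple (tag, i1, i2, j1, j2): the tag is one of 'replace', 'delete', 'insert' or 'equal', and the indices are the start/stop positions in the two sequences that the change applies to. One possible problem is a quirk of the SequenceMatcher algorithm: among equally long matches, the leftmost one always takes priority over potential matches to its right, which may produce unwanted results in your case. You can always write your own function to process the opcodes. I did notice that, for the example given, simply reversing both strings before generating the opcodes with SequenceMatcher gives the desired result with plain opcodes, because the expected answer requires the rightmost matches to take priority. Just a thought.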
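To make the opcode idea concrete, here is a minimal sketch of a custom opcode-processing function. The names `show_opcodes` and `reinsert_with_opcodes` are made up for this illustration, and the handling of 'replace' blocks (missing letters first, then the existing hyphens) is my assumption about the desired layout, not something difflib prescribes. It only accepts 'delete' and 'replace' blocks that cover pure runs of hyphens and raises otherwise, since SequenceMatcher does not guarantee that for arbitrary inputs.

```python
from difflib import SequenceMatcher


def show_opcodes(a, b):
    """Return the opcodes turning a into b as (tag, a_slice, b_slice) tuples."""
    # autojunk=False turns off the "popular element" heuristic, which
    # only applies to sequences of 200 or more items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


def reinsert_with_opcodes(gapped, full, gap='-'):
    """Hypothetical helper: rebuild gapped with the letters it lacks from full."""
    out = []
    for tag, a_part, b_part in show_opcodes(gapped, full):
        if tag == 'equal':
            out.append(a_part)            # letters present in both strings
        elif tag == 'insert':
            out.append(b_part)            # letters missing from gapped
        else:                             # 'delete' or 'replace'
            if a_part.strip(gap):         # anything other than gap characters?
                raise ValueError('opcode %r would drop letters %r' % (tag, a_part))
            # Assumption: missing letters go in front of the existing hyphens.
            out.append(b_part + a_part)
    return ''.join(out)


for op in show_opcodes('MAT--LA-B', 'MATATLAB'):
    print(op)
print(reinsert_with_opcodes('MAT--LA-B', 'MATATLAB'))  # MATAT--LA-B
```

For the example in the question this rebuilds the requested string without reversing anything, but for other inputs it is worth printing the opcodes first to see whether the matches SequenceMatcher picked are the ones you want.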
Answer 1 (score: 0)
This is a bit more modest than the previous answer, but it looked like an interesting problem, so I thought I'd give it a try anyway:
import re


def find_start_of(needle, haystack):
    """
    @param needle    Search on first char of string
    @param haystack  Longer string to search in

    Look for first char of needle in haystack; return offset
    """
    if needle == '':
        return 0
    offs = haystack.find(needle[0])
    if offs == -1:
        return len(haystack)
    else:
        return offs


def find_end_of(lst, letterset):
    """
    @param lst        Chars to search for
    @param letterset  String to search through

    lst contains some chars of letterset in order;
    Return offset in letterset of last char of lst
    """
    offs = 0
    for ch in lst:
        t = letterset.find(ch, offs)
        if t == -1:
            raise ValueError('letterset (%s) is not an ordered superset of lst (%s)' % (letterset, lst))
        else:
            offs = t + 1
    return offs - 1


def alignSeq(s1, s2):
    """
    @param s1  A string consisting of letters and hyphens
    @param s2  A string containing only letters

    The letters in s1 are an in-sequence subset of s2
    Returns s1 with the missing letters from s2 inserted
    in-sequence and greedily preceding hyphens.
    """
    # break s1 into letter-chunks and hyphen-chunks
    r = r'([^-]*)([-]*)'     # string of letters followed by string of hyphens
    seq = re.findall(r, s1)  # break string into list of tuples
    seq = seq[:-1]           # discard final empty pair
    # eg: "MAT--LA-B" becomes [('MAT', '--'), ('LA', '-'), ('B', '')]

    # find start of corresponding letter-chunks in s2
    offs = 0
    chunkstart = []
    for letters, hyphens in seq:
        offs += find_start_of(letters, s2[offs:])
        chunkstart.append(offs)
        offs += find_end_of(letters, s2[offs:]) + 1

    # get end+1 for each letter-chunk
    chunkend = chunkstart[1:] + [len(s2)]

    # get replacement letter-chunks
    chunks = [s2[st:en] for st, en in zip(chunkstart, chunkend)]

    # do replacement for each chunk
    outp = [c + s[1] for c, s in zip(chunks, seq)]
    return ''.join(outp)
Then
alignSeq('MAT--LA-B','MATATLAB')
returns
'MATAT--LA-B'
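The same greedy idea can also be sketched as a single pass without regular expressions. This is an illustrative variant rather than the code above: `reinsert_missing` and its `gap` parameter are hypothetical names, and it assumes, as alignSeq does, that the letters of the gapped string appear in the same order in the full string.

```python
def reinsert_missing(gapped, full, gap='-'):
    """Insert the letters of full that gapped lacks, ahead of pending gaps."""
    out = []
    pending_gaps = 0   # gap characters seen since the last letter
    pos = 0            # next unread position in full
    for ch in gapped:
        if ch == gap:
            pending_gaps += 1
            continue
        nxt = full.find(ch, pos)
        if nxt == -1:
            raise ValueError('%r is not an ordered superset of the letters in %r'
                             % (full, gapped))
        out.append(full[pos:nxt])        # letters missing from gapped
        out.append(gap * pending_gaps)   # then the gaps that were already there
        out.append(ch)
        pending_gaps = 0
        pos = nxt + 1
    out.append(full[pos:])               # trailing letters missing from gapped
    out.append(gap * pending_gaps)       # trailing gaps, if any
    return ''.join(out)


print(reinsert_missing('MAT--LA-B', 'MATATLAB'))  # MATAT--LA-B
print(reinsert_missing('M-LA-B', 'MATLAB'))       # MAT-LA-B
```

Like alignSeq, it matches each letter at the earliest possible position in the full string, so when a letter could match in more than one place the result depends on that greedy choice.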