Inserting the missing parts of a sequence in Python

Time: 2010-12-07 16:23:14

Tags: python

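I have two sequences, for example:

Seq 1: MAT--LA-B
Seq 2: MATATLAB

Is it possible in Python to compare the two sequences and then insert the missing parts into sequence 1 without changing the rest of sequence 1, so that sequence 1 ends up as MATAT--LA-B?

The insertions may be at multiple positions. (I have a multiple sequence alignment from which parts of the sequences were dropped, and I want to re-insert those parts.)

Thanks in advance!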

2 answers:

Answer 0 (score: 0)

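I would suggest starting your search for a solution by getting the opcodes that transform one sequence into the other. The opcodes can be generated with difflib.SequenceMatcher.get_opcodes. Each one is a tuple holding an instruction (insert, delete, replace or equal) together with start/stop indices into both sequences, and together they describe the changes needed to turn one sequence into the other.

One possible problem is that, because of the way the SequenceMatcher algorithm works, the leftmost matches always take priority over potential matches to their right, which may produce unwanted results in your case. You can always write your own opcode-processing function. I noticed that for this example the result can be obtained with the plain opcodes simply by reversing both strings before generating the opcodes with SequenceMatcher, since the expected answer requires the rightmost match to take priority. Just an idea.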
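As a rough illustration of the opcode idea, here is a minimal sketch of one possible opcode-processing function. The name merge_gapped and the rule of placing inserted letters before the hyphens are assumptions made for this example, not part of difflib, and the sketch assumes that the letters of s1 appear in s2 in the same order. It does not use the string-reversal trick; it simply raises an error when the matcher's alignment does not reproduce s2.

```python
from difflib import SequenceMatcher

def merge_gapped(s1, s2, gap='-'):
    """Hypothetical helper: rebuild s1 with the letters it is missing from s2.

    Assumes the letters of s1 are an in-order subset of s2.
    """
    # autojunk=False stops the matcher from treating frequent characters
    # as junk when the sequences are long (200+ items).
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            out.append(s1[i1:i2])                # shared letters, unchanged
        elif tag == 'insert':
            out.append(s2[j1:j2])                # letters only s2 has
        elif tag == 'delete':
            out.append(s1[i1:i2])                # hyphens only s1 has: keep them
        else:                                    # 'replace'
            out.append(s2[j1:j2] + s1[i1:i2])    # missing letters, then the hyphens
    result = ''.join(out)
    # Sanity check: without the gap characters we must get s2 back exactly.
    if result.replace(gap, '') != s2:
        raise ValueError('could not align %r against %r' % (s1, s2))
    return result

print(merge_gapped('MAT--LA-B', 'MATATLAB'))     # MATAT--LA-B
```

For the example from the question, the opcodes are equal, replace, equal, delete, equal, so the 'AT' from sequence 2 lands in front of the first pair of hyphens. For other inputs the matcher may pick a different alignment than the one you want, which is where the final check or a custom strategy comes in.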

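Answer 1 (score: 0)

A little more modest than the previous answer, but it looked like an interesting problem, so I thought I would give it a try anyway: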

import re

def find_start_of(needle, haystack):
    """
    @param needle    Search on first char of string
    @param haystack  Longer string to search in

    Look for first char of needle in haystack; return offset
    """

    if needle=='':
        return 0

    offs = haystack.find(needle[0])
    if offs==-1:
        return len(haystack)
    else:
        return offs

def find_end_of(lst, letterset):
    """
    @param lst       Chars to search for
    @param letterset String to search through

    lst contains some chars of letterset in order;
    Return offset in letterset of last char of lst
    """

    offs = 0
    for ch in lst:
        t = letterset.find(ch, offs)

        if t==-1:
            raise ValueError('letterset (%s) is not an ordered superset of lst (%s)' % (letterset, lst))
        else:
            offs = t+1

    return offs-1

def alignSeq(s1, s2):
    """
    @param s1 A string consisting of letters and hyphens
    @param s2 A string containing only letters

    The letters in s1 are an in-sequence subset of s2

    Returns s1 with the missing letters from s2 inserted
    in-sequence and greedily preceding hyphens.
    """

    # break s1 into letter-chunks and hyphen-chunks
    r = r'([^-]*)(-*)'      # a run of letters followed by a run of hyphens
    seq = re.findall(r, s1) # break string into a list of (letters, hyphens) tuples
    seq = seq[:-1]          # discard the final empty pair matched at end of string
    # eg: "MAT--LA-B" becomes [('MAT', '--'), ('LA', '-'), ('B', '')]

    # find start of corresponding letter-chunks in s2
    offs = 0
    chunkstart = []
    for letters,hyphens in seq:
        offs += find_start_of(letters, s2[offs:])
        chunkstart.append(offs)
        offs += find_end_of(letters, s2[offs:]) + 1

    # get end+1 for each letter-chunk
    chunkend = chunkstart[1:] + [len(s2)]
    # get replacement letter-chunks
    chunks = [s2[st:en] for st,en in zip(chunkstart,chunkend)]

    # do replacement for each chunk
    outp = [c+s[1] for c,s in zip(chunks, seq)]

    return ''.join(outp)

Then

alignSeq('MAT--LA-B','MATATLAB')

returns

'MATAT--LA-B'
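
Since the question mentions insertions at several positions, here is a compact single-pass sketch of the same greedy idea for comparison. The function name fill_gaps is hypothetical, and the sketch assumes, as alignSeq does, that the letters of s1 occur in s2 in the same order. Hyphens are buffered until the next letter of s1 has been located in s2, so the missing letters are emitted before them.

```python
def fill_gaps(s1, s2, gap='-'):
    """Hypothetical one-pass variant: insert s2's extra letters into s1.

    Assumes the letters of s1 are an in-order subset of s2.
    """
    out = []
    pending = []   # hyphens seen since the last letter of s1
    j = 0          # position in s2 just past the last matched letter
    for ch in s1:
        if ch == gap:
            pending.append(ch)
            continue
        k = s2.find(ch, j)
        if k == -1:
            raise ValueError('%r is not an ordered superset of %r' % (s2, s1))
        out.append(s2[j:k])    # letters missing from s1 go first
        out.extend(pending)    # then the buffered hyphens
        pending = []
        out.append(ch)         # then the matched letter itself
        j = k + 1
    out.append(s2[j:])         # trailing letters missing from s1
    out.extend(pending)        # trailing hyphens stay at the end
    return ''.join(out)

print(fill_gaps('MAT--LA-B', 'MATATLAB'))    # MATAT--LA-B
print(fill_gaps('M-T--LA-', 'MATATLABS'))    # MA-TAT--LABS-
```

The second call shows gaps being filled at three different positions in one pass. Like alignSeq, this matches each letter at its first possible position in s2, so ambiguous cases are resolved in favour of the leftmost match.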