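Pythonic way to "merge" strings, handling all possible lengths

Time: 2016-07-01 17:28:18

Tags: python string bioinformatics

I am trying to solve a fairly common problem in bioinformatics without resorting to a pile of if statements.

The problem at hand:

I am given two overlapping strings and the length of the expected output, and I want to produce a merged string. Here are all the ways the strings can overlap. (In the examples below, - means that the string has nothing at that position. The consensus() part is explained after the examples.)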

# size=13
xxxOVERLAP---
---OVERLAPyyy
# expected output: xxx + consensus(xOVERLAP, yOVERLAP) + yyy


# size=7
---OVERLAPxxx
yyyOVERLAP---
# expected output: consensus(xOVERLAP, yOVERLAP)


# size=7
OVERLAP
OVERLAP
# expected output: consensus(xOVERLAP, yOVERLAP)

# size=10
xxxOVERLAP
---OVERLAP
# expected output: xxx + consensus(xOVERLAP, yOVERLAP)

# size=10
OVERLAP---
OVERLAPyyy
# expected output: consensus(xOVERLAP, yOVERLAP) + yyy

# size > len(x) + len(y)
# no overlap, produce error:
xxx---
---yyy
# expected output: error

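The resulting merged string needs to start at the beginning of x and end at the end of y. The overlapping region needs to be passed to another function, consensus(), which handles merging the overlapping region.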
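The question does not show consensus() itself. Purely as an illustration, and not the asker's actual function, a hypothetical version could compare the two overlapping regions position by position and emit a placeholder where they disagree. The name of the `unknown` parameter and the choice of "N" are assumptions made for this sketch.

```python
def consensus(x_overlap, y_overlap, unknown="N"):
    """Hypothetical stand-in: keep characters that agree, mark the rest."""
    if len(x_overlap) != len(y_overlap):
        raise ValueError("overlapping regions must have the same length")
    # Walk both regions in lockstep; a mismatch becomes the placeholder.
    return "".join(a if a == b else unknown
                   for a, b in zip(x_overlap, y_overlap))
```

Any function that takes two equal-length regions and returns one string would fit the same slot.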

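I could write a pile of if statements to identify each case and handle it separately, but I have been struggling to find a more elegant solution. One approach I considered is padding the strings (the end of x and the beginning of y) so that every case looks like the second example. That seems too inefficient, though, because padding creates new strings and I am applying this function to millions of strings.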

3 answers:

Answer 0: (score: 0)

I would start with a generator that yields each character:

def merge_gen(x, y, overhang):
    buffer = ' ' * overhang
    for s in map(set, zip(buffer + x, y + buffer)):
        yield max(s)

where overhang is len(x) - size (see below).

It works as follows:

>>> list(merge_gen('OVERLAPXXX', 'YYYOVERLAP', 3))
['Y', 'Y', 'Y', 'O', 'V', 'E', 'R', 'L', 'A', 'P', 'X', 'X', 'X']
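To see why max() returns the useful character, it may help to look at a single step in isolation. Each zipped pair is turned into a set, and a space sorts before every letter, so the padding always loses. The pairs below are made up for illustration.

```python
# Each zipped pair of characters becomes a set, then max() picks one element.
pairs = [(' ', 'Y'), ('O', 'O'), ('X', ' ')]

for left, right in pairs:
    chars = {left, right}  # identical characters collapse: ('O', 'O') -> {'O'}
    print(repr(left), repr(right), '->', repr(max(chars)))

# A space (code point 32) compares lower than any letter, so padding is dropped.
picked = [max({a, b}) for a, b in pairs]
```

Note that where two different letters meet, max() silently keeps the one that sorts later, so a mismatch inside the overlap goes unnoticed at this stage.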

You can then implement the merge function, including the call to consensus, as follows:

def merge(x, y, size):
    length = len(x)
    overhang = size - length
    overlap = length - overhang
    gen = merge_gen(x, y, overhang)

    result = ''
    result += ''.join(next(gen) for _ in range(overhang))
    result += consensus(''.join(next(gen) for _ in range(overlap)))
    result += ''.join(next(gen) for _ in range(overhang))
    return result

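I would expect this to be fairly efficient in Python 3: lots of generators, few wasted strings, and so on.

(*) Apparently this is a shortcut for getting a single item from a set. In this case we know the set has only one element, and we just want to extract it.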
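For comparison, here are a few standard ways of pulling the only element out of a one-element set. The variable names are illustrative only.

```python
s = {'A'}

via_max = max(s)          # works, but hides the intent
via_iter = next(iter(s))  # takes the first (here: the only) element
(via_unpack,) = s         # raises ValueError unless s has exactly one element
```

The unpacking form doubles as a check, since it fails loudly if the set unexpectedly holds more than one element.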

Answer 1: (score: 0)

Is this the function you are looking for?

def consensus(left, right, ignore_blank_padding=True):
    if ignore_blank_padding:
        left = left.strip()
        right = right.strip()

    slide = len(left) + len(right) - 1

    #slides the strings over each other one spot at a time
    solutions = []
    for i in range(slide):
        lft_test = left[-(i+1):]
        rgt_test = right[:min(len(right), i+1)]
        #print(lft_test, rgt_test)
        if lft_test == rgt_test:
            lft_garbage = left[:-(i+1)]
            rgt_garbage = right[min(len(right), (i+1)):]
            solutions.append((lft_garbage, lft_test, rgt_garbage))

    #if more than one overlap combo is found, keeps only the longest
    if len(solutions) > 1:
        sol_lengths = [len(i[1]) for i in solutions]
        longest_index = sol_lengths.index(max(sol_lengths))
        solutions = solutions[longest_index]
        return solutions
    elif len(solutions) == 0:
        return None
    else:
        return solutions[0]
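The same sliding search can be sketched more compactly by trying the longest candidate first and stopping at the first hit, using str.endswith. This is an alternative formulation, not the answerer's code, and the names longest_overlap and split_on_overlap are made up for this sketch.

```python
def longest_overlap(left, right):
    """Length of the longest suffix of left that is also a prefix of right."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k  # longest candidates are tried first
    return 0


def split_on_overlap(left, right):
    """Return (left remainder, overlap, right remainder), or None."""
    k = longest_overlap(left, right)
    if k == 0:
        return None
    return left[:-k], right[:k], right[k:]
```

Because it searches from the longest candidate downward, no second pass is needed to pick the longest match.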

left = 'xxxxHEY'
right = 'HEYxx'
consensus(left, right)
> ('xxxx', 'HEY', 'xx')

left = 'xxHEYHEY'
right = 'HEYHEYxxx'
consensus(left, right)
> ('xx', 'HEYHEY', 'xxx')

left = 'xxHEY '
right = '  HEYHEYxxxx'
consensus(left, right)
> ('xx', 'HEY', 'HEYxxxx')

left = 'HEY'
right = '  HEYHEYxxxx'
consensus(left, right)
> ('', 'HEY', 'HEYxxxx')

I am leaving the old sliding-window answer in place, but here is a version that takes the overlap size as a parameter:

def consensus(left, right, size, ignore_blank_padding=True):
    if ignore_blank_padding:
        left = left.strip()
        right = right.strip()

    solutions = None
    lft_test = left[-(size):]
    rgt_test = right[:size]
    if lft_test == rgt_test:
        lft_garbage = left[:-(size)]
        rgt_garbage = right[min(len(right), (size)):]
        solutions = (lft_garbage, lft_test, rgt_garbage)

    return solutions

left = 'xxxxHEY'
right = 'HEYxx'
consensus(left, right, 3)
> ('xxxx', 'HEY', 'xx')

left = 'xxHEYHEY'
right = 'HEYHEYxxx'
consensus(left, right, 6)
> ('xx', 'HEYHEY', 'xxx')

left = 'xxHEY '
right = '  HEYHEYxxxx'
consensus(left, right, 3)
> ('xx', 'HEY', 'HEYxxxx')

left = 'HEY'
right = '  HEYHEYxxxx'
consensus(left, right, 3)
> ('', 'HEY', 'HEYxxxx')

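Answer 2: (score: 0)

Here is a working example, but it uses "way too many if statements". This approach is hard to read, hard to reason about, and extremely inelegant: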

def extra_left(x, y, size):
    if size - len(y) > 0:
        return x[:size - len(y)]
    else:
        return ""


def extra_right(x, y, size):
    if size - len(x) > 0:
        return y[len(x) - size:]
    else:
        return ""

def overlap(x, y, size):

    if len(x) < size and len(y) < size:
        x_overlap = x[size - len(y):]
        y_overlap = y[:len(x) - size]
    if len(x) < size and len(y) == size:
        x_overlap = x
        y_overlap = y[:len(x) - size]
    if len(x) < size and len(y) > size:
        x_overlap = x
        y_overlap = y[len(y)-size:size]

    if len(x) == size and len(y) < size:
        x_overlap = x[size - len(y):]
        y_overlap = y
    if len(x) == size and len(y) == size:
        x_overlap = x
        y_overlap = y
    if len(x) == size and len(y) > size:
        x_overlap = x
        y_overlap = y[len(y) - size:]

    if len(x) > size and len(y) < size:
        x_overlap = x[size - len(y):size]
        y_overlap = y
    if len(x) > size and len(y) == size:
        x_overlap = x[:size]
        y_overlap = y
    if len(x) > size and len(y) > size:
        x_overlap = x[:size]
        y_overlap = y[-size:]

    if len(x) + len(y) < size:
        raise RuntimeError("x and y do not overlap with this size")

    return consensus(x_overlap, y_overlap)

def consensus(x, y):
    assert len(x) == len(y)
    return x


def merge(x, y, size):
    return extra_left(x, y, size) + overlap(x, y, size) + extra_right(x, y, size)

Here are some unit tests (using pytest):

import pytest


class Tests:

    def test1(self):
        """
        len(x) < size and len(y) < size:
        xxxOVERLAP---
        ---OVERLAPyyy
        # expected output: xxx + consensus(xOVERLAP, yOVERLAP) + yyy
        """
        x = "AAAATTTTTTT"
        y = "TTTTTTTCCC"
        size = 14
        assert merge(x, y, size) == "AAAA" + consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test2(self):
        """
        if len(x) < size and len(y) == size:
        # size=10
        OVERLAP---
        OVERLAPyyy
        # expected output: consensus(xOVERLAP, yOVERLAP) + yyy
        """
        x = "TTTTTTT"
        y = "TTTTTTTCCC"
        size = 10
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test3(self):
        """
        if len(x) < size and len(y) > size:
        ---OVERLAP---
        yyyOVERLAPyyy
        """
        x = "TTTTTTT"
        y = "CCCTTTTTTTCCC"
        size = 10
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT") + "CCC"

    def test4(self):
        """
        if len(x) == size and len(y) < size:
        # size=10 
        xxxOVERLAP
        ---OVERLAP
        # expected output: xxx + consensus(xOVERLAP, yOVERLAP)
        """
        x = "AAATTTTTTT"
        y = "TTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test5(self):
        """
        if len(x) == size and len(y) == size:
        # size=7
        OVERLAP
        OVERLAP
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "TTTTTTT"
        y = "TTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")

    def test6(self):
        """
        if len(x) == size and len(y) > size:
        # size=10
        --xxxOVERLAP
        yyyyyOVERLAP
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "AAATTTTTTT"
        y = "CCCCCTTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test7(self):
        """
        if len(x) > size and len(y) < size:
        xxxOVERLAPxxx
        ---OVERLAP---
        """
        x = "AAATTTTTTTAAA"
        y = "TTTTTTT"
        size = 10
        assert merge(x, y, size) == "AAA" + consensus("TTTTTTT", "TTTTTTT")

    def test8(self):
        """
        if len(x) > size and len(y) == size:
        ---OVERLAPxxx
        ---OVERLAP---
        """
        x = "TTTTTTTAAA"
        y = "TTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")

    def test9(self):
        """
        if len(x) > size and len(y) > size:
        ---OVERLAPxxx
        yyyOVERLAP---
        # expected output: consensus(xOVERLAP, yOVERLAP)
        """
        x = "TTTTTTTAAA"
        y = "CCCTTTTTTT"
        size = 7
        assert merge(x, y, size) == consensus("TTTTTTT", "TTTTTTT")


    def test_error(self):
        """
        # no overlap, produce error:
        xxx---
        ---yyy
        # expected output: error
        """
        x = "AAA"
        y = "TTT"
        size = 7
        with pytest.raises(RuntimeError):
            merge(x, y, size)

They all pass:

test_merge.py::Tests::test1 PASSED
test_merge.py::Tests::test2 PASSED
test_merge.py::Tests::test3 PASSED
test_merge.py::Tests::test4 PASSED
test_merge.py::Tests::test5 PASSED
test_merge.py::Tests::test6 PASSED
test_merge.py::Tests::test7 PASSED
test_merge.py::Tests::test8 PASSED
test_merge.py::Tests::test9 PASSED
test_merge.py::Tests::test_error PASSED

====================================================================== 10 passed in 0.02 seconds =======================================================================
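The nine branches can probably be collapsed into plain index arithmetic. The sketch below assumes that x starts at position 0 of the output and y ends at position size, so that x covers output positions [0, len(x)) and y covers [size - len(y), size). The names start, end and shift are introduced here for illustration, and consensus is the same placeholder stub as above.

```python
def consensus(x, y):
    # Placeholder mirroring the stub above; the real function is not shown.
    assert len(x) == len(y)
    return x


def merge(x, y, size):
    if size > len(x) + len(y):
        raise RuntimeError("x and y do not overlap with this size")
    start = max(0, size - len(y))  # output index where y begins (clamped)
    end = min(len(x), size)        # output index where x ends (clamped)
    shift = len(y) - size          # y index = output index + shift
    return (x[:start]
            + consensus(x[start:end], y[start + shift:end + shift])
            + y[end + shift:])
```

Clamping with max and min is what absorbs the "shorter than", "equal to" and "longer than" cases, so no explicit branching on the lengths is needed. Under the stated assumption, the slice indices into y never go negative once the error check has passed.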