How do I find consecutive repeats in a DNA sequence using Python?

Date: 2012-10-23 04:18:04

Tags: python

I have a DNA sequence like this:

seq = 'ATCGTTTTTCGAAACTGCCCCCCACTGGGGA'

I want to print every nucleotide run in Python, but only when the same nucleotide repeats more than two times in a row.

For this sequence, the output should be:

TTTTT
AAA
CCCCCC
GGGG

3 answers:

Answer 0: (score: 3)

You may want to take a look at itertools.groupby, which groups consecutive identical items of an iterable.

Example usage:

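import itertools

# With no key function, groupby yields one group per run of identical characters
for _, group in itertools.groupby(seq):
    group = ''.join(group)
    if len(group) > 2:  # keep only runs of three or more
        print(group)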
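
The same idea can be wrapped in a small function that returns the runs instead of printing them. This is only a minimal sketch: the name find_runs and the min_len parameter are made up for illustration and are not part of itertools.

```python
from itertools import groupby


def find_runs(seq, min_len=3):
    """Return every run of one repeated character that is at least min_len long."""
    runs = []
    for _, group in groupby(seq):
        run = ''.join(group)  # e.g. 'TTTTT'
        if len(run) >= min_len:
            runs.append(run)
    return runs


print(find_runs('ATCGTTTTTCGAAACTGCCCCCCACTGGGGA'))
```

Returning a list makes the threshold easy to change, for example find_runs(seq, min_len=2) to include doubled bases as well.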

Answer 1: (score: 1)

You can easily find the repeats with a regular expression that uses a backreference, via the findall method:

import re

seq = 'ATCGTTTTTCGAAACTGCCCCCCACTGGGGA'

# The outer group captures the whole run; \2 refers back to the single letter in the inner group
hits = re.findall(r'(([A-Z])\2\2+)', seq)
# findall returns (whole run, single letter) tuples, so keep only the whole run
print([hit[0] for hit in hits])

['TTTTT', 'AAA', 'CCCCCC', 'GGGG']
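
If the position of each run matters too, re.finditer gives access to the match objects. The following is a rough sketch, assuming the sequence contains only the uppercase letters A, C, G and T; the names RUN_PATTERN and find_runs_with_positions are illustrative only.

```python
import re

# Assumption: the sequence only uses the uppercase letters A, C, G and T.
# ([ACGT]) captures one base; \1{2,} requires it to repeat at least twice more.
RUN_PATTERN = re.compile(r'([ACGT])\1{2,}')


def find_runs_with_positions(seq):
    """Return (start_index, run) pairs for runs of three or more identical bases."""
    return [(m.start(), m.group(0)) for m in RUN_PATTERN.finditer(seq)]


for start, run in find_runs_with_positions('ATCGTTTTTCGAAACTGCCCCCCACTGGGGA'):
    print(start, run)
```

Using m.group(0) avoids the nested capture group, because the whole match is already the run.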

Answer 2: (score: 0)

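seq = 'ATCGTTTTTCGAAACTGCCCCCCACTGGGGA'
while seq:
    value = seq[0]
    repeats = 1
    # Count how often the first character repeats; the bounds check
    # avoids an IndexError when the sequence ends in a run
    while repeats < len(seq) and seq[repeats] == value:
        repeats += 1
    # Print only runs of three or more, as the question asks
    if repeats > 2:
        print(value * repeats)
    # Drop the run just processed and continue with the rest
    seq = seq[repeats:]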
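
A manual scan can also be written as a single pass over the indices, without re-slicing the string on every run. This is a sketch of that variant rather than the answer's own code; the function name find_runs_manual and the min_len parameter are hypothetical.

```python
def find_runs_manual(seq, min_len=3):
    """Scan seq once and collect runs of at least min_len identical characters."""
    runs = []
    start = 0  # index where the current run begins
    # Walk one position past the end so the final run is flushed as well.
    for i in range(1, len(seq) + 1):
        if i == len(seq) or seq[i] != seq[start]:
            if i - start >= min_len:
                runs.append(seq[start:i])
            start = i
    return runs


print(find_runs_manual('ATCGTTTTTCGAAACTGCCCCCCACTGGGGA'))
```

Because the loop runs up to len(seq) inclusive, a sequence that ends in a run (such as 'TTGGGG') is handled without a special case.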