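Find all (k,d)-motifs in a collection of DNA strings

Time: 2016-02-06 23:37:08

Tags: python string motif

Basically, the problem asks for every k-mer (a motif of length k) that appears in each string of a collection of DNA strings with at most d mismatches. I was able to write the code below to find all (k,d)-motifs in a single DNA string. I don't know how to modify my code when the DNA strings are given on multiple lines.

Sample input: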

  

k = 3, d = 1
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT

Sample output:

  

ATA
ATT
GTT
TTT

.....

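1 answer:

Answer 0 (score: 0)

The problem seems to be switching the code from using internal variables to reading its input from a file. You can't concatenate the DNA strands from the file and run the code as before, because that changes the result where the ends of the strands meet: k-mers spanning a junction would show up even though they occur in no individual strand. You also have to handle the first line of the input differently from the rest, because it contains the program parameters, while the remaining lines are the raw data: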

import re
import sys
import collections

mismatch_list = []

def hamming_distance(s1, s2):
    """ Returns the Hamming distance between equal-length sequences """
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

with open(sys.argv[1]) as file:

    kmer = None
    in_mistake = None

    parameters = file.readline().rstrip()  # first line of file has parameters

    matchobj = re.search(r"k\s*=\s*(\d+)", parameters)
    if matchobj:
        kmer = int(matchobj.group(1))

    matchobj = re.search(r"d\s*=\s*(\d+)", parameters)
    if matchobj:
        in_mistake = int(matchobj.group(1))

    assert kmer is not None and in_mistake is not None, "file parameters misread"

    for sequence in file:  # subsequent lines of file are DNA strands
        sequence = sequence.rstrip()
        if not sequence:
            continue  # ignore blank lines

        result = []

        for i in range(len(sequence) - kmer + 1):
            v = sequence[i:i + kmer]
            result.append(v)

        for t_kmer in set(result):
            for s_kmer in result:
                if hamming_distance(t_kmer, s_kmer) <= in_mistake:
                    mismatch_list.append(t_kmer)

mismatch_count = collections.Counter(mismatch_list)

print(mismatch_count)
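
Note that the script above prints a Counter of k-mer tallies gathered strand by strand, not the list of motifs shown in the sample output. As a rough sketch of one way to get that list, assuming the strands are already in a Python list and k is small enough that trying all 4^k candidate k-mers is affordable, a brute-force check could look like the following. The names appears_with_mismatches and find_kd_motifs are made up for illustration and are not part of the original answer.

```python
from itertools import product


def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def appears_with_mismatches(pattern, strand, d):
    """True if some window of strand is within d mismatches of pattern."""
    k = len(pattern)
    return any(
        hamming_distance(pattern, strand[i:i + k]) <= d
        for i in range(len(strand) - k + 1)
    )


def find_kd_motifs(strands, k, d):
    """Brute force (an assumption, not the answer's method):
    test every possible k-mer over ACGT against every strand."""
    motifs = []
    for letters in product("ACGT", repeat=k):
        candidate = "".join(letters)
        # Keep the candidate only if it occurs in *each* strand
        # with at most d mismatches.
        if all(appears_with_mismatches(candidate, s, d) for s in strands):
            motifs.append(candidate)
    return motifs  # product() yields candidates in lexicographic order


strands = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(" ".join(find_kd_motifs(strands, 3, 1)))
```

Each strand is checked separately, so no k-mer ever spans the boundary between two strands. For larger k, generating only the d-neighbours of k-mers that actually occur in the strands would scale better than enumerating all 4^k candidates.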