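Basically, the problem asks for all possible motifs (k-mers of length k) that appear in every string of a DNA collection with at most d mismatches. I was able to write the code below to find all (k, d) motifs for a single DNA string. I don't know how to modify my code when the input contains multiple lines of DNA strings.
Sample input:
k = 3, d = 1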
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT
Sample output:
ATA
ATT
GTT
TTT
.....
.....
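Answer 0: (score: 0)
The problem seems to be switching your code from using internal variables to reading its input from a file. You can't simply concatenate the file's DNA strands and run the code as before, because that would change the results at the points where the ends of strands meet. You also have to handle the first line of input differently from the rest, since it contains the program parameters while the remaining lines are raw data: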
import re
import sys
import collections

mismatch_list = []

def hamming_distance(s1, s2):
    """ Returns the Hamming distance between equal-length sequences """
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

with open(sys.argv[1]) as file:
    kmer = None
    in_mistake = None

    parameters = file.readline().rstrip()  # first line of file has parameters

    matchobj = re.search(r"k\s*=\s*(\d+)", parameters)
    if matchobj:
        kmer = int(matchobj.group(1))
    matchobj = re.search(r"d\s*=\s*(\d+)", parameters)
    if matchobj:
        in_mistake = int(matchobj.group(1))

    assert kmer is not None and in_mistake is not None, "file parameters misread"

    for sequence in file:  # subsequent lines of file are DNA strands
        sequence = sequence.rstrip()
        if not sequence:
            continue  # ignore blank lines

        # collect every k-mer of this strand only, so no k-mer spans two strands
        result = []
        for i in range(len(sequence) - kmer + 1):
            v = sequence[i:i + kmer]
            result.append(v)

        # record each distinct k-mer once per k-mer of the strand it nearly matches
        for t_kmer in set(result):
            for s_kmer in result:
                if hamming_distance(t_kmer, s_kmer) <= in_mistake:
                    mismatch_list.append(t_kmer)

mismatch_count = collections.Counter(mismatch_list)
print(mismatch_count)
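Note that the code above only counts near-matches among k-mers that literally occur within each strand. A (k, d) motif in the sense of the question must appear in every strand with at most d mismatches, and it need not occur exactly in any of them. The following is a minimal brute-force sketch of that idea, not part of the original answer; the names neighbors, appears_with_mismatches and motif_enumeration are hypothetical, and the alphabet is assumed to be ACGT.

```python
from itertools import combinations, product


def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s1, s2))


def neighbors(pattern, d, alphabet="ACGT"):
    """All strings within Hamming distance d of pattern (assumed alphabet)."""
    found = set()
    width = min(d, len(pattern))
    # Pick up to d positions and try every letter there; since the original
    # letter is also tried, this covers all distances from 0 to d.
    for positions in combinations(range(len(pattern)), width):
        for letters in product(alphabet, repeat=width):
            candidate = list(pattern)
            for pos, letter in zip(positions, letters):
                candidate[pos] = letter
            found.add("".join(candidate))
    return found


def appears_with_mismatches(pattern, strand, d):
    """True if some k-mer of strand is within d mismatches of pattern."""
    k = len(pattern)
    return any(
        hamming_distance(pattern, strand[i:i + k]) <= d
        for i in range(len(strand) - k + 1)
    )


def motif_enumeration(strands, k, d):
    """Set of k-mers that appear in every strand with at most d mismatches."""
    motifs = set()
    if not strands:
        return motifs
    first = strands[0]
    # Any motif must be a neighbor of some k-mer in the first strand,
    # so the first strand alone supplies all candidates.
    for i in range(len(first) - k + 1):
        for candidate in neighbors(first[i:i + k], d):
            if all(appears_with_mismatches(candidate, s, d) for s in strands):
                motifs.add(candidate)
    return motifs


if __name__ == "__main__":
    sample = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
    for motif in sorted(motif_enumeration(sample, k=3, d=1)):
        print(motif)
```

On the sample input from the question this yields the four motifs listed in the sample output. The strands could be fed in from the file-reading loop above by appending each stripped line to a list instead of processing it immediately.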