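I've run into a problem on Rosalind: I believe I solved it correctly, but I was told my answer is wrong. The problem can be found here: http://rosalind.info/problems/grph/
It's basic graph theory; more specifically, it deals with returning an adjacency list of overlapping DNA strings.
"For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph O_k in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).
Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
Return: The adjacency list corresponding to O_3. You may return edges in any order."
So, if you have: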
Rosalind_0498 AAATAAA
Rosalind_2391 AAATTTT
Rosalind_2323 TTTTCCC
Rosalind_0442 AAATCCC
Rosalind_5013 GGGTGGG
you must return:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
My Python code, after parsing the FASTA file containing the DNA strings, is as follows:
listTitle = []
listContent = []
#SPLIT is the parsed list of DNA strings
#here I create two new lists, one (listTitle) containing the four numbers identifying a particular string, and the second (listContent) containing the actual strings ('>Rosalind_' has been removed, because it is what I split the file with)
while i < len(SPLIT):
    curr = SPLIT[i]
    title = curr[0:4:1]
    listTitle.append(title)
    content = curr[4::1]
    listContent.append(content)
    i+=1
start = []
end = []
#now I create two new lists, one containing the first three chars of the string and the second containing the last three chars, a particular string's index will be the same in both lists, as well as in the title list
for item in listContent:
    start.append(item[0:3:1])
    end.append(item[len(item)-3:len(item):1])
list = []
#then I iterate through both lists, checking if the suffix and prefix are equal, but not originating from the same string, and append their titles to a last list
p=0
while p<len(end):
    iterator=0
    while iterator<len(start):
        if p!=iterator:
            if end[p] == start[iterator]:
                one=listTitle[p]
                two=listTitle[iterator]
                list.append(one)
                list.append(two)
        iterator+=1
    p+=1
#finally I print the list in the format that they require for the answer
listInc=0
while listInc < len(list):
    print "Rosalind_"+list[listInc]+' '+"Rosalind_"+list[listInc+1]
    listInc+=2
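Where did I go wrong? Sorry that the code is a bit tedious; I've had very little training in Python.
Answer 0 (score: 3)
I'm not sure what's wrong with your code, but here is an approach that might be considered more "pythonic".
I'll assume you've read your data into a dictionary mapping names to DNA strings: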
{'Rosalind_0442': 'AAATCCC',
'Rosalind_0498': 'AAATAAA',
'Rosalind_2323': 'TTTTCCC',
'Rosalind_2391': 'AAATTTT',
'Rosalind_5013': 'GGGTGGG'}
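If the data is not yet in that form, one way to build such a dictionary is sketched below. The helper name parse_fasta and the sample text are assumptions for illustration, not part of the original code. The sketch assumes that header lines start with '>' and that a sequence may be wrapped over several lines. Joining the stripped lines matters for this problem, because a stray newline left at the end of a sequence would end up inside its 3-character suffix.

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its sequence (hypothetical helper)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequence line: may be one of several for the same record
            records[name].append(line)
    # Join wrapped sequence lines into a single string per record
    return {name: "".join(parts) for name, parts in records.items()}


# Illustrative input; the second record is deliberately wrapped over two lines
sample = """>Rosalind_0498
AAATAAA
>Rosalind_2391
AAAT
TTT
>Rosalind_2323
TTTTCCC
"""
data = parse_fasta(sample)
print(data)
```

Keeping the full header as the key also avoids assuming that every identifier has exactly four digits.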
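We define a simple function that checks whether the k-suffix of string s1 matches the k-prefix of string s2:
def is_k_overlap(s1, s2, k):
    # True when the last k characters of s1 equal the first k characters of s2
    return s1[-k:] == s2[:k]
Then we look at all combinations of DNA sequences to find the ones that match. itertools.combinations makes this easy:
import itertools
def k_edges(data, k):
    edges = []
    # combinations yields each unordered pair once, so test both directions
    for u, v in itertools.combinations(data, 2):
        u_dna, v_dna = data[u], data[v]
        if is_k_overlap(u_dna, v_dna, k):
            edges.append((u, v))
        if is_k_overlap(v_dna, u_dna, k):
            edges.append((v, u))
    return edges
For example, on the data above with k = 3, this finds the three edges listed in the question (possibly in a different order).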
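To produce the output format the problem asks for, the edges can be printed one pair per line. The sketch below is self-contained and swaps in itertools.permutations, which yields every ordered pair of distinct names, so each direction is tested once without two explicit checks. The name overlap_edges is hypothetical, and the dictionary is simply the sample data from the question.

```python
import itertools

# Sample data from the question, keyed by the full FASTA name
data = {
    "Rosalind_0442": "AAATCCC",
    "Rosalind_0498": "AAATAAA",
    "Rosalind_2323": "TTTTCCC",
    "Rosalind_2391": "AAATTTT",
    "Rosalind_5013": "GGGTGGG",
}


def overlap_edges(data, k):
    """Return (s, t) name pairs where the k-suffix of s equals the k-prefix of t."""
    return [
        (s, t)
        for s, t in itertools.permutations(data, 2)  # ordered pairs of distinct names
        if data[s][-k:] == data[t][:k]
    ]


edges = overlap_edges(data, 3)
for s, t in edges:
    # One edge per line, names separated by a space
    print(s, t)
```

Both versions compare every pair of strings, which is quadratic in the number of sequences; for input limited to 10 kbp in total that should be acceptable.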