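Suppose there is a file containing: CTYESDYDFYCGDDTTGANDSPAHDTAGMAIHTAA
I need to read 3 letters at a time, starting from the first position of the file (in this case from 'CTY'), until I encounter TGA, TAG or TAA. When I encounter any of these subsequences, reading should stop (in this case it stops at TGA, because that is encountered first) and the sequence should be printed.
Next, when I read the sequence again, I need to start from the 2nd position of the string (this time from 'TYE') and again read 3 letters at a time until I encounter TGA, TAA or TAG. This differs from the previous read because, when reading starts at 'TYE', all the following subsequences (groups of three) change as well. I need to print the sequence again.
Repeat this step a third and final time, this time starting from the 3rd position of the string (i.e. 'YES'), and continue as above.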
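To make the three starting positions concrete, here is a minimal sketch of how the grouping into triplets shifts with the start offset. The helper name `codons` is hypothetical (it is not part of my code below), and dropping a trailing chunk shorter than 3 letters is an assumption.

```python
def codons(sequence, offset):
    """Split sequence into consecutive 3-letter chunks, starting at offset.

    offset is 0-based (0, 1 or 2); an incomplete trailing chunk is dropped
    (an assumption made for this illustration).
    """
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


seq = "CTYESDYDFYCGDDTTGANDSPAHDTAGMAIHTAA"
for offset in range(3):
    # Show the first three triplets of each reading position.
    print(offset + 1, codons(seq, offset)[:3])
```

Reading from position 1 gives 'CTY', 'ESD', 'YDF', ...; from position 2 it gives 'TYE', 'SDY', 'DFY', ...; from position 3 it gives 'YES', 'DYD', 'FYC', ..., so every triplet after the start changes as well.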
This is what I have done so far.
import sys
import pickle

def find_orfs(sequence):
    """ Finds all valid open reading frames in the string 'sequence', and
    returns them as a list"""
    starts = find_all(sequence, 'ATG')
    stop_amber = find_all(sequence, 'TAG')
    stop_ochre = find_all(sequence, 'TAA')
    stop_umber = find_all(sequence, 'TGA')
    stops = stop_amber + stop_ochre + stop_umber
    stops.sort()
    orfs = []
    for start in starts:
        for stop in stops:
            if start < stop \
               and (start - stop) % 3 == 0:  # Stop is in-frame
                orfs.append(sequence[start:stop+3])
                # the +3 includes the stop codon
                break
                # break out of the inner for loop
                # when we hit the first stop codon
    return orfs

def find_all(sequence, subsequence):
    ''' Returns a list of indexes within sequence that are the start of subsequence'''
    start = 0
    idxs = []
    next_idx = sequence.find(subsequence, start)
    while next_idx != -1:
        idxs.append(next_idx)
        start = next_idx + 1  # Move past this on the next time around
        next_idx = sequence.find(subsequence, start)
    return idxs

file = open(sys.argv[1], 'rb')  # Read in from the first command-line argument (binary mode for pickle)
genedict = pickle.load(file)
file.close()
orfdict = {}
for gene in genedict:
    gene_seq = genedict[gene]
    orfs = find_orfs(gene_seq)
    if len(orfs) > 0:
        orfdict[gene] = orfs
print(orfdict)
fout = open('orfs_out', 'wb')  # pickle output is binary as well
pickle.dump(orfdict, fout)
fout.close()
Answer 0 (score: 1)
Reading from the file piece by piece like that is very inefficient. Read the whole file into a string instead. Here is a skeleton:
data = my_file.read()
data_len = len(data)
for start in range(3):
    # stop at data_len - 2 so the last complete triplet of each frame is checked
    for idx in range(start, data_len - 2, 3):
        if data[idx:idx+3] in ('TGA', 'TAA', 'TAG'):
            # <do your worst here :-)>
            break
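One way the skeleton might be filled in is sketched below. The function name `read_frames` is hypothetical, and two behaviours are assumptions rather than anything stated in the question: the returned sequence includes the stop triplet itself, and a frame with no stop triplet yields `None`.

```python
STOP_CODONS = ("TGA", "TAA", "TAG")


def read_frames(data):
    """For each of the three start positions, return the text from that
    position up to and including the first in-frame stop triplet.

    Assumption: a frame that never reaches a stop triplet yields None.
    """
    results = []
    for start in range(3):
        found = None
        # Step through the string three letters at a time from this start.
        for idx in range(start, len(data) - 2, 3):
            if data[idx:idx + 3] in STOP_CODONS:
                found = data[start:idx + 3]
                break  # only the first stop in this frame matters
        results.append(found)
    return results


seq = "CTYESDYDFYCGDDTTGANDSPAHDTAGMAIHTAA"
for frame in read_frames(seq):
    print(frame)
```

For the example string this stops at TGA when starting from position 1, at TAG from position 2, and at the final TAA from position 3. To run it on a file, read the contents first and strip the trailing newline before passing the string in.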