Question

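Suppose I have a file containing: CTYESDYDFYCGDDTTGANDSPAHDTAGMAIHTAA

I need to read the file three letters at a time, starting from the first position (here, from 'CTY'), until I encounter TGA, TAG, or TAA. When I hit any of these subsequences, reading should stop (here it stops at TGA, since that one is encountered first) and the sequence should be printed.

On the next pass, I need to start from the second position of the string (this time from 'TYE') and again read in groups of three until I encounter TGA, TAA, or TAG. This differs from the previous pass because, when reading starts at 'TYE', every subsequent group of three shifts as well. Again, I need to print the sequence.

This step is repeated a third and final time, starting from the third position of the string (i.e. 'YES') and following the same steps.

This is what I have done so far.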

import sys
import pickle
def find_orfs(sequence):
        """ Finds all valid open reading frames in the string 'sequence', and
            returns them as a list"""

        starts = find_all(sequence, 'ATG')
        stop_amber = find_all(sequence, 'TAG')
        stop_ochre = find_all(sequence, 'TAA')
        stop_umber = find_all(sequence, 'TGA')
        stops = stop_amber + stop_ochre + stop_umber
        stops.sort()

        orfs = []

        for start in starts:
                for stop in stops:
                        if start < stop \
                           and (start - stop) % 3 == 0:  # Stop is in-frame
                                orfs.append(sequence[start:stop+3])
                                # the +3 includes the stop codon
                                break
                                # break out of the inner for loop
                                # when we hit the first stop codon
        return orfs


def find_all(sequence, subsequence):
        ''' Returns a list of indexes within sequence that are the start of subsequence'''
        start = 0
        idxs = []
        next_idx = sequence.find(subsequence, start)

        while next_idx != -1:
                idxs.append(next_idx)
                start = next_idx + 1     # Move past this on the next time around
                next_idx = sequence.find(subsequence, start)


        return idxs


# Read the pickled gene dictionary from the first command-line argument.
# Binary mode is what pickle expects.
with open(sys.argv[1], 'rb') as infile:
    genedict = pickle.load(infile)

orfdict = {}

for gene in genedict:
    gene_seq = genedict[gene]
    orfs = find_orfs(gene_seq)
    if len(orfs) > 0:
        orfdict[gene] = orfs

print(orfdict)

with open('orfs_out', 'wb') as fout:
    pickle.dump(orfdict, fout)

Answer 1

Reading from the file piece by piece like this is very inefficient. Read the whole file into a string instead; here is a skeleton:

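data = my_file.read()
data_len = len(data)
for start in range(3):
    # Step through the frame three letters at a time; data_len - 2 is the
    # last index at which a complete triplet can still begin.
    for idx in range(start, data_len - 2, 3):
        if data[idx:idx+3] in ('TGA', 'TAA', 'TAG'):
            print(data[start:idx+3])  # sequence up to and including the stop
            break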
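
For illustration, the same idea can be wrapped in a function and run on the sequence from the question. This is only a sketch: the names `STOP_CODONS` and `read_frames` are made up here, and it assumes that the printed sequence should include the stop triplet itself and that a frame with no stop triplet yields `None`.

```python
STOP_CODONS = ("TGA", "TAA", "TAG")


def read_frames(data):
    """For each of the three frames (offsets 0, 1, 2), return the text from
    the frame start through the first in-frame stop triplet, or None if the
    frame contains no stop triplet."""
    results = []
    for start in range(3):
        found = None
        # Walk non-overlapping triplets; len(data) - 2 is the last index
        # at which a complete triplet can begin.
        for idx in range(start, len(data) - 2, 3):
            if data[idx:idx + 3] in STOP_CODONS:
                found = data[start:idx + 3]  # keep the stop triplet
                break
        results.append(found)
    return results


sequence = "CTYESDYDFYCGDDTTGANDSPAHDTAGMAIHTAA"
for frame, result in enumerate(read_frames(sequence), start=1):
    print(frame, result)
```

On this input the first frame stops at TGA, the second at TAG, and the third only at the final TAA, which is why the inner loop has to run all the way to the last complete triplet.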

String reading and finding different subsequences