Reading a string and finding different subsequences

Time: 2012-10-21 00:31:10

Tags: python string

Suppose there is a file containing: CTYESDYDFYCGDDTTGANDSPAHDTAGMAIHTAA

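I need to read the file 3 letters at a time, starting from the first position (in this case from 'CTY'), until I encounter TGA, TAG, or TAA. When I hit any of these subsequences, reading should stop (in this case at TGA, since it is encountered first) and the sequence read so far should be printed.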

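Next, I need to read the sequence again, starting from the 2nd position of the string (this time from 'TYE'), again 3 letters at a time, until TGA, TAA, or TAG is encountered. This differs from the previous read because, starting from 'TYE', every subsequent group of three letters shifts as well. I need to print this sequence too.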

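This step is repeated a third and final time, this time starting from the 3rd position of the string (i.e. 'YES') and following the same procedure.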
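To make the three reads concrete, here is a minimal sketch of how the example string splits into groups of three for each starting position. The helper name `codons` is hypothetical and used only for illustration; trailing letters that do not fill a complete group of three are dropped, which is an assumption.

```python
def codons(sequence, frame):
    """Split `sequence` into consecutive 3-letter groups, starting at offset `frame`.

    Hypothetical helper for illustration; incomplete trailing groups are dropped.
    """
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


seq = "CTYESDYDFYCGDDTTGANDSPAHDTAGMAIHTAA"

# One list of 3-letter groups per starting position (offsets 0, 1, 2).
frames = [codons(seq, frame) for frame in range(3)]

for frame, groups in enumerate(frames):
    print("start position %d: %s" % (frame + 1, " ".join(groups)))
```

Starting at position 1 the groups begin CTY ESD YDF ..., at position 2 they begin TYE SDY DFY ..., and at position 3 they begin YES DYD FYC ..., so a stop subsequence is only seen when it lines up with the grouping for that start.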

This is what I have done so far.

import sys
import pickle
def find_orfs(sequence):
        """ Finds all valid open reading frames in the string 'sequence', and
            returns them as a list"""

        starts = find_all(sequence, 'ATG')
        stop_amber = find_all(sequence, 'TAG')
        stop_ochre = find_all(sequence, 'TAA')
        stop_umber = find_all(sequence, 'TGA')
        stops = stop_amber + stop_ochre + stop_umber
        stops.sort()

        orfs = []

        for start in starts:
                for stop in stops:
                        if start < stop \
                           and (start - stop) % 3 == 0:  # Stop is in-frame
                                orfs.append(sequence[start:stop+3])
                                # the +3 includes the stop codon
                                break
                                # break out of the inner for loop
                                # when we hit the first stop codon
        return orfs


def find_all(sequence, subsequence):
        ''' Returns a list of indexes within sequence that are the start of subsequence'''
        start = 0
        idxs = []
        next_idx = sequence.find(subsequence, start)

        while next_idx != -1:
                idxs.append(next_idx)
                start = next_idx + 1     # Move past this on the next time around
                next_idx = sequence.find(subsequence, start)


        return idxs


file = open(sys.argv[1], 'r')   # Read in from the first command-line argument


genedict = pickle.load(file)

file.close()

orfdict = {}

for gene in genedict:
    gene_seq = genedict[gene]
    orfs = find_orfs(gene_seq)
    if len(orfs) > 0:
        orfdict[gene] = orfs

print orfdict

fout = open('orfs_out', 'w')
pickle.dump(orfdict, fout)
fout.close()

1 Answer:

Answer 0 (score: 1)

Reading from the file in that manner is very inefficient. Read the whole file into a string instead; here is a skeleton:

data = my_file.read()
data_len = len(data)
for start in range(3):
    # idx + 3 must not run past the end of data, hence the data_len - 2 bound
    for idx in xrange(start, data_len - 2, 3):
        if data[idx:idx+3] in ('TGA', 'TAA', 'TAG'):
            pass  # <do your worst here :-)>
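One way the skeleton might be filled in is sketched below, using the example string from the question rather than a file. The function name `read_until_stop` is hypothetical, and two choices are assumptions not stated in the question: the returned fragment includes the stop subsequence itself (as the asker's own `find_orfs` does with `stop+3`), and `None` is returned when no stop is found in that frame. The code is written to run under both Python 2 and Python 3.

```python
STOP_CODONS = ("TGA", "TAA", "TAG")


def read_until_stop(data, start):
    """Read `data` 3 letters at a time from offset `start`.

    Returns the text from `start` up to and including the first stop
    subsequence that lines up with this offset, or None if there is none.
    (Including the stop and returning None are assumptions.)
    """
    for idx in range(start, len(data) - 2, 3):
        if data[idx:idx + 3] in STOP_CODONS:
            return data[start:idx + 3]
    return None


data = "CTYESDYDFYCGDDTTGANDSPAHDTAGMAIHTAA"

# One fragment per starting position (offsets 0, 1, 2).
results = [read_until_stop(data, start) for start in range(3)]

for start, fragment in enumerate(results):
    print("start position %d: %s" % (start + 1, fragment))
```

For the example string, the first read stops at TGA, the second at TAG, and the third only at the final TAA, because the earlier TGA and TAG do not line up with the groups of three for that start.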