Question

I have sequences like these (more than 9000 of them):

>TsM_000224500 
MTTKWPQTTVTVATLSWGMLRLSMPKVQTTYKVTQSRGPLLAPGICDSWSRCLVLRVYVDRRRPGGDGSLGRVAVTVVETGCFGSAASFSMWVFGLAFVVTIEEQLL
>TsM_000534500 
MHSHIVTVFVALLLTTAVVYAHIGMHGEGCTTLQCQRHAFMMKEREKLNEMQLELMEMLMDIQTMNEQEAYYAGLHGAGMQQPLPMPIQ
>TsM_000355900 
MESGEENEYPMSCNIEEEEDIKFEPENGKVAEHESGEKKESIFVKHDDAKWVGIGFAIGTAVAPAVLSGISSAAVQGIRQPIQAGRNNGETTEDLENLINSVEDDL

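Lines containing ">" are IDs, and lines of letters are amino acid (aa) sequences. I need to delete (or move to another file) the sequences shorter than 40 aa or longer than 4000 aa. The resulting file should then contain only sequences within that range (>= 40 aa and <= 4000 aa).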

I tried writing the following script:

def read_seq(file_name):
    with open(file_name) as file:
        return file.read().split('\n')[0:]

ts = read_seq("/home/tiago/t_solium/ts_phtm0less.txt")

tsf = open("/home/tiago/t_solium/ts_secp-404k", 'w')

for x in range(len(ts)):
    if ([x][0:1] != '>'):
        if (len([x]) > 40 or len([x]) < 4000):

            tsf.write('%s\n'%(x))

tsf.close()

print "OK!"

I made a few modifications, but all I get is either an empty file or all 9000+ sequences.

Answer 1

In your for loop, x is an integer counter, because range() is used (i.e. 0, 1, 2, 3, 4...). Try this instead:

for x in ts:

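This gives you each element of ts as x.

Also, you don't need the brackets around x; Python can index and slice the characters of a string directly. When you put brackets around a string, you put it inside a list, so if you try to get the second character of x with [x][1], Python will instead look for the second element of the list containing x, and will fail because that list has only one element.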
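A quick check makes both points visible. This is only a small sketch: the list ts below is a made-up stand-in for the lines of the real file, and the print() calls are written so that they run on Python 2 and 3 alike.

```python
ts = ['>id_1', 'MKVLAT', '>id_2', 'MHSHIV']   # made-up stand-in for the file's lines

# range(len(ts)) produces positions, not the lines themselves.
for x in range(len(ts)):
    print(x)                 # an integer on each pass

# Iterating over the list directly yields each line.
for x in ts:
    print(x)                 # a string on each pass

x = ts[0]
wrapped = [x]                # brackets build a one-element list
print(len(x))                # length of the string
print(len(wrapped))          # length of the list, which holds a single item
print(x[0:1] == '>')         # slicing the string gives its first character
print(wrapped[0:1] == '>')   # slicing the list gives another list, never '>'
```

Because len([x]) is always 1 and [x][0:1] is never equal to '>', neither test in the original script looks at the sequence at all.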

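Edit: to include the IDs, try the following.

Note: I also changed if (len(x) > 40 or len(x) < 4000) to use and instead of or. With or, every length passes the test (any number is either greater than 40 or less than 4000), which is why you got all the sequences. The comparisons below are also inclusive, to match the range you asked for (>= 40 and <= 4000).

for i, x in enumerate(ts):  # enumerate ts to get the index of each line (stored as i)
    if x and not x.startswith('>'):  # skip blank lines and ID lines
        if 40 <= len(x) <= 4000:  # keep only lengths within the range, inclusive
            tsf.write('%s\n' % ts[i - 1])  # write the ID found on the preceding line
            tsf.write('%s\n' % x)  # then the sequence itself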
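The question also mentions moving the out-of-range sequences to another file. One possible way to do that is sketched below, assuming (as in the sample) that every sequence sits on a single line right after its ID. The names split_by_length and write_records and the demo records are hypothetical, introduced only for illustration.

```python
def split_by_length(lines, min_len=40, max_len=4000):
    """Pair each ID line with the sequence line after it and sort the
    pairs into (kept, rejected) by sequence length (bounds inclusive)."""
    kept, rejected = [], []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith('>'):
            header = line             # remember the ID for the next sequence
        elif header is not None:
            target = kept if min_len <= len(line) <= max_len else rejected
            target.append((header, line))
            header = None
    return kept, rejected


def write_records(path, records):
    """Write (header, sequence) pairs back out, one item per line."""
    with open(path, 'w') as out:
        for header, seq in records:
            out.write('%s\n%s\n' % (header, seq))


# Made-up input: 50 aa (kept), 10 aa (rejected), 4001 aa (rejected).
demo = ['>seq_ok', 'M' * 50, '>seq_short', 'M' * 10, '>seq_long', 'M' * 4001]
kept, rejected = split_by_length(demo)
print([h for h, _ in kept])
print([h for h, _ in rejected])
```

With the real data, the list returned by read_seq() could be passed in place of demo, and write_records() called once for each of the two output paths.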

Answer 2

Try this; it is simple and easy to follow. It doesn't load the whole file into memory, but walks through the file line by line.

tsf = open('output.txt', 'w')  # open the output file
with open("yourfile", 'r') as ts:  # open the input file
    for line in ts:  # iterate over each line of the input file
        line = line.strip()  # remove leading/trailing whitespace (spaces, tabs, newlines, carriage returns)
        if not line or line[0] == '>':  # if the line is blank or an ID
            continue  # move on to the next line
        else:  # otherwise it is a sequence
            if 40 <= len(line) <= 4000:  # if the length is in the required range (inclusive)
                tsf.write('%s\n' % line)  # write it to the output file

tsf.close()  # done
print("OK!")
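The loop above writes only the sequence lines. If the IDs should stay attached to their sequences, a streaming variant along these lines might work. It is a sketch under assumptions: read_fasta and filter_fasta are hypothetical helper names, and sequences are allowed to wrap over several lines (the sample in the question shows one line per sequence, so that part is a precaution).

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining sequences that span several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line, []     # start a new record
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)     # emit the last record


def filter_fasta(src, dst, min_len=40, max_len=4000):
    """Copy records whose length is within [min_len, max_len]; return the count."""
    written = 0
    for header, seq in read_fasta(src):
        if min_len <= len(seq) <= max_len:
            dst.write('%s\n%s\n' % (header, seq))
            written += 1
    return written


# In-memory demo with made-up records; with real files, pass open(...) handles.
src = io.StringIO(u'>a\n' + u'M' * 30 + u'\n' + u'M' * 30 + u'\n>b\nMKV\n')
dst = io.StringIO()
count = filter_fasta(src, dst)
print(count)
```

Only one record is held in memory at a time, so this keeps the line-by-line advantage of the loop above.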

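FYI, if you are working in a Unix environment, you can also use awk as a one-line solution (like the script above, it keeps only the sequence lines):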

cat yourinputfile.txt | grep -v '>' | awk 'length($0)>=40' | awk 'length($0)<=4000' > youroutputfile.txt

How do I write only lines of a specific length to a file?
