Question

My FASTA file contains multiple sequences, as shown below:

>header1
MPANFTE
GSFDSSG
>header2
MDNASFS
EPWPANA

So I am writing code to remove the headers, so that the output in a temporary file looks like this:

MPANFTEGSFDSSG

MDNASFSEPWPANA

So far I have come up with this code, but it does not give me the exact output:

import sys,subprocess

# open the file
my_file = open("gpcr.fasta")
# read the contents
my_gpcr = my_file.readlines()
for line in my_gpcr:
    if '>' == line[0]:
        header = line
    else:   
        tempf = open('temp.fasta', 'w')
        tempf.write(header)
        tempf.write(line)
        tempf.close()
        print line

Answer 1

I assume what you want is to print the file with each sequence on a single line? You can use an awk script:

awk 'BEGIN{newline=0}{if(/^>/){if(newline==1){print ""} newline=1}else printf "%s", $0}END{print ""}' gpcr.fasta

If you want to use Python anyway, this is how I would do it:

import sys

newline = 0
# open the file; the with block closes it when we are done
with open("gpcr.fasta") as my_file:
    for line in my_file:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # after we read the first header, we print a newline at each
            # header to separate the concatenated sequences
            if newline == 1:
                sys.stdout.write('\n')
            newline = 1
        else:
            sys.stdout.write(line)

# terminate the last sequence
sys.stdout.write('\n')

Also, headers that start with the literal text '&gt;' are not real FASTA headers. Only > should be at the start of the line.

Answer 2

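import sys,subprocess is not needed in this snippet.
You never close my_file; you can use with open("gpcr.fasta") as my_file: instead.
Because you open temp.fasta with 'w', each new sequence line overwrites the previous one. You can append the sequences by opening the file with 'a', or use the header as the filename.
You use readlines() to read the whole file at once, which is convenient for small files but is slower and consumes a lot of memory for larger files. Iterating over the file object is faster.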
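The overwrite problem is easy to reproduce in isolation. The following is a minimal sketch, not the asker's exact code: the helper names write_lines and read_text and the scratch file names are made up for illustration. It reopens a file once per line, as the loop in the question does, first with 'w' and then with 'a'.

```python
import os
import tempfile


def write_lines(path, lines, mode):
    """Reopen path once per line with the given mode, like the question's loop."""
    for line in lines:
        with open(path, mode) as handle:
            handle.write(line)


def read_text(path):
    """Return the whole content of a small text file."""
    with open(path) as handle:
        return handle.read()


lines = ["MPANFTE\n", "GSFDSSG\n"]
workdir = tempfile.mkdtemp()  # scratch directory, used only for this demo

overwrite_path = os.path.join(workdir, "overwrite.txt")  # hypothetical name
append_path = os.path.join(workdir, "append.txt")        # hypothetical name

write_lines(overwrite_path, lines, "w")  # every open truncates the file
write_lines(append_path, lines, "a")     # every open adds to the end

overwritten = read_text(overwrite_path)
appended = read_text(append_path)
print(repr(overwritten))  # only the last line survives
print(repr(appended))     # both lines are kept
```

With 'w' only the last line written remains in the file, which matches what the question observes in temp.fasta.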

So here is the suggested code:

my_file = 'gpcr.fasta'
header = ''
tempf = None

with open(my_file, 'r') as my_gpcr:
    for line in my_gpcr:
        if line.startswith('>'):
            # check if we already passed the first header
            if header:
                tempf.close()
                print()
            header = line.strip()
            # open a new file for each sequence, named after the header
            tempf = open('{}.fasta'.format(header[1:]), 'w')
            # remove if you want to skip the header
            tempf.write(header + '\n')
        else:
            # write the sequence line to the file
            # remove the strip() if you want to keep the line breaks
            tempf.write(line.strip())
            # end='' makes sure the sequences are concatenated
            print(line.strip(), end='')

# close the last file and end the last printed sequence
if tempf is not None:
    tempf.close()
    print()

Note: this assumes a correctly formatted FASTA file.
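If the goal is only the output shown in the question (one joined sequence per line, no headers), the parsing can be separated from the writing. The following is a rough sketch under that assumption; read_fasta and strip_headers are hypothetical names, not part of any library, and lines that appear before the first header are simply ignored.

```python
import io


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header ends the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # emit the final record once the input is exhausted
    if header is not None:
        yield header, "".join(chunks)


def strip_headers(in_path, out_path):
    """Write each sequence of in_path on its own line in out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for _header, sequence in read_fasta(src):
            dst.write(sequence + "\n")


# the sample data from the question, held in memory for the demo
example = io.StringIO(
    ">header1\nMPANFTE\nGSFDSSG\n>header2\nMDNASFS\nEPWPANA\n"
)
records = list(read_fasta(example))
print(records)
```

Calling strip_headers("gpcr.fasta", "temp.fasta") would then write the joined sequences to the temporary file, opening it only once so nothing is overwritten.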

Answer 3

Is this what you want?

$ awk '/>/{if (NR>1) print ""; next} {printf "%s", $0} END{print ""}' file
MPANFTEGSFDSSG
MDNASFSEPWPANA
