Automating FASTA sequence retrieval for many files

Time: 2017-02-02 14:26:17

Tags: python loops fasta

I have a FASTA file containing many sequences, like this:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765659|emb|Z78553.1|CIZ78553 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765668|emb|Z78531.3|CIZ78531 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG

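I also have an ID file (id.file) listing some gene IDs. I want to retrieve the matching sequences from the FASTA file and write each gene header with its sequence to an output file. For example, the ID file contains: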

gi|2765658|emb|Z78533.1|CIZ78533
gi|2765659|emb|Z78553.1|CIZ78553

My output file would then be:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765659|emb|Z78553.1|CIZ78553 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG

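However, I have many ID files (500 files: idfile1.txt, idfile2.txt, idfile3.txt, and so on), each listing a different set of genes. I have a Python script that handles only one file at a time, but I want to process all 500 files in one go. My Python script is: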

#!/usr/bin/python
from Bio import SeqIO

fasta_file = "fastafile.fa" # Input fasta file
wanted_file = "idfile1.txt" # Input interesting sequence IDs, one per line
result_file = "out1.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

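How can I put this in a loop? I have only one FASTA file but many ID files. I am new to Python, so I don't know how to do it. Or could I run this script from a shell script with a loop? I'm not sure, because I would need to pass the file names into it.

Any suggestions?

3 answers:

Answer 0: (score: 0)

I'd say os is your friend. Suppose your folder contains several types of files, but every text file ending in txt should be processed. I guess each input file needs its own result file, so we have to take care of that: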

#!/usr/bin/python
from Bio import SeqIO
import os  

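fasta_file = "fastafile.fa" # Input fasta file
# Load all records into a list once so they can be re-scanned for every ID file
fasta_sequences = list(SeqIO.parse(fasta_file, 'fasta'))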

mypath='/home/usr/data/mydatafolder'#your path here
for file in os.listdir(mypath):
    if file.endswith(".txt") and not file.startswith("result_"):#skip other files and results of earlier runs
        wanted_file=os.path.join(mypath,file)#absolute path to file

        wanted = set()###clearing for every new file
        with open(wanted_file) as f:
            for line in f:
                line = line.strip()
                if line:
                    wanted.add(line)

        result_file=os.path.join(mypath, 'result_'+file)#use the bare file name, not the full path
        #this puts the output of, e.g. idfile17.txt in result_idfile17.txt

        with open(result_file, "w") as f:
            for seq in fasta_sequences:
                if seq.id in wanted:
                    SeqIO.write([seq], f, "fasta")

Or, if you know how the file names are constructed, for example:

#!/usr/bin/python
from Bio import SeqIO
import os  

fasta_file = "fastafile.fa" # Input fasta file
# Load all records into a list; a bare SeqIO.parse iterator is exhausted after the first ID file
fasta_sequences = list(SeqIO.parse(fasta_file, 'fasta'))

mypath='/home/usr/data/mydatafolder'#your path here

filenamelist=["idfile{}.txt".format(x) for x in range(100)]
#files from 0 to 99
### or maybe some specific numbers?
# filenamelist=["idfile{}.txt".format(x) for x in [1,20,30,50,117] ]


for file in filenamelist:
    wanted_file=os.path.join(mypath,file)#absolute path to file
    ####now the same thing as before
    wanted = set()###clearing for every new file
    with open(wanted_file) as f:
        for line in f:
            line = line.strip()
            if line:
                wanted.add(line)

    result_file=os.path.join(mypath, 'result_'+file)#use the bare file name, not the full path
    #this puts the output of, e.g. idfile17.txt in result_idfile17.txt
    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            if seq.id in wanted:
                SeqIO.write([seq], f, "fasta")

Note that you could just as easily build a separate list of output file names.
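
One way to build such a list is to generate the input and output names together, so that each ID file stays paired with its result file. This is only a minimal sketch: `build_file_pairs` is a hypothetical helper, and the `out{}.fasta` naming and the 1 to 500 numbering are assumptions taken from the question rather than anything required by Biopython.

```python
import os


def build_file_pairs(folder, numbers,
                     in_pattern="idfile{}.txt", out_pattern="out{}.fasta"):
    """Return a list of (id_file, result_file) path pairs, one per number."""
    return [
        (os.path.join(folder, in_pattern.format(n)),
         os.path.join(folder, out_pattern.format(n)))
        for n in numbers
    ]


# Assumed layout: idfile1.txt ... idfile500.txt in one folder.
pairs = build_file_pairs("/home/usr/data/mydatafolder", range(1, 501))
```

The loop can then read `for wanted_file, result_file in pairs:` and keep the rest of the body unchanged.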

Answer 1: (score: 0)

You can use glob. Add to the preamble (top of the file):

import glob

Then replace

with open(wanted_file) as f:

with

for wanted_file in glob.glob("/path/to/files/id*.txt"):
    with open(wanted_file) as f:
       ...

This assumes all ID file names match the pattern id*.txt and are located in the /path/to/files/ folder. Make sure to indent the code that follows correctly.
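
Because `glob.glob` returns full paths in no guaranteed order, it may help to sort the matches and derive each output name from the ID file's base name. The following is a rough sketch under those assumptions; `result_path_for` and `plan_jobs` are hypothetical names, and the `result_` prefix and `.fasta` extension are illustrative choices.

```python
import glob
import os


def result_path_for(id_path, prefix="result_", ext=".fasta"):
    """Map e.g. /path/idfile17.txt to /path/result_idfile17.fasta."""
    folder, name = os.path.split(id_path)
    stem, _ = os.path.splitext(name)
    return os.path.join(folder, prefix + stem + ext)


def plan_jobs(pattern):
    """Return sorted (id_file, result_file) pairs for every path matching pattern."""
    return [(path, result_path_for(path)) for path in sorted(glob.glob(pattern))]


# Example call (the path is a placeholder):
# for wanted_file, result_file in plan_jobs("/path/to/files/id*.txt"):
#     ...
```

Writing the results with a `.fasta` extension also keeps them from matching the `id*.txt` pattern on a later run.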

Answer 2: (score: 0)

I tried what was suggested, but I probably made a lot of mistakes.

It does not work; the error message is: NameError: name 'result_file' is not defined

I did include a result file, but many result files are expected, one per ID file. I don't know how to do that!
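
A NameError like this usually means `result_file` is never assigned before it is used; assigning it inside the loop, once per ID file, gives one output per input. Below is a minimal sketch of that structure. To keep it runnable with only the standard library it uses a small hand-written FASTA reader in place of Bio.SeqIO, and `read_fasta`, `extract_all`, and the `result_` naming are hypothetical choices, not part of the original script.

```python
import os


def read_fasta(path):
    """Return {sequence_id: record_text} for a FASTA file.

    The ID is the header text between '>' and the first whitespace.
    If an ID occurs twice, the later record replaces the earlier one.
    """
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            line = line.rstrip("\n") + "\n"  # make sure every line ends with a newline
            if line.startswith(">"):
                fields = line[1:].split()
                current_id = fields[0] if fields else ""
                records[current_id] = line
            elif current_id is not None:
                records[current_id] += line
    return records


def extract_all(fasta_file, id_files, out_dir):
    """Write one result file per ID file and return the list of result paths."""
    records = read_fasta(fasta_file)  # parse the FASTA file only once
    written = []
    for id_file in id_files:
        with open(id_file) as handle:
            wanted = {line.strip() for line in handle if line.strip()}
        # result_file is defined here, inside the loop, once per ID file
        stem = os.path.splitext(os.path.basename(id_file))[0]
        result_file = os.path.join(out_dir, "result_" + stem + ".fasta")
        with open(result_file, "w") as out:
            for seq_id, text in records.items():  # keeps FASTA file order
                if seq_id in wanted:
                    out.write(text)
        written.append(result_file)
    return written
```

With the question's layout, a call might look like `extract_all("fastafile.fa", ["idfile1.txt", "idfile2.txt"], ".")`, which would produce result_idfile1.fasta and result_idfile2.fasta in the current folder.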