I have a fasta file containing many sequences, like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765659|emb|Z78553.1|CIZ78553 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765668|emb|Z78531.3|CIZ78531 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
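I also have an id.file with some gene IDs. I want to retrieve the matching sequences from the fasta file and get an output containing each gene and its sequence. For example, with these IDs: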
gi|2765658|emb|Z78533.1|CIZ78533
gi|2765659|emb|Z78553.1|CIZ78553
my output file would be:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
>gi|2765659|emb|Z78553.1|CIZ78553 C.irapeanum 5.8S rRNA gene
AATTTCAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
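However, I have many ID files (500 files: idfile1.txt, idfile2.txt, idfile3.txt and so on), each listing a different group of genes. My Python script can only handle one file at a time, but I would like to process all 500 files in one go. My Python script is: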
#!/usr/bin/python
from Bio import SeqIO
fasta_file = "fastafile.fa"  # Input fasta file
wanted_file = "idfile1.txt"  # Input interesting sequence IDs, one per line
result_file = "out1.fasta"  # Output fasta file
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file), 'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")
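How can I turn this into a loop? I have only one fasta file but many ID files. I am new to Python, so I don't know how to do it. Or could I write a shell script with a loop that runs this script? I'm not sure, because the file names would have to be passed into it.
Any suggestions?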
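Answer 0 (score: 0)
I'd say os is your friend.
Suppose your folder contains several types of files, but every text file ending in txt should be processed. I assume each input file needs its own result file, so we have to take care of that: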
#!/usr/bin/python
from Bio import SeqIO
import os
fasta_file = "fastafile.fa"  # Input fasta file
mypath = '/home/usr/data/mydatafolder'  # your path here
for file in os.listdir(mypath):
    # only ID files; skip other files and the results of earlier runs
    if file.endswith(".txt") and not file.startswith("result_"):
        wanted_file = os.path.join(mypath, file)  # absolute path to file
        wanted = set()  # cleared for every new file
        with open(wanted_file) as f:
            for line in f:
                line = line.strip()
                if line:
                    wanted.add(line)
        # build the name from file, not wanted_file, which already contains the path
        result_file = os.path.join(mypath, 'result_' + file)
        # this puts the output of, e.g. idfile17.txt in result_idfile17.txt
        # re-parse for every ID file: the SeqIO.parse iterator can only be consumed once
        with open(fasta_file) as fasta, open(result_file, "w") as out:
            for seq in SeqIO.parse(fasta, 'fasta'):
                if seq.id in wanted:
                    SeqIO.write([seq], out, "fasta")
Or, if you know how the file names are constructed, for example:
#!/usr/bin/python
from Bio import SeqIO
import os
fasta_file = "fastafile.fa"  # Input fasta file
mypath = '/home/usr/data/mydatafolder'  # your path here
filenamelist = ["idfile{}.txt".format(x) for x in range(100)]
# files from 0 to 99
### or maybe some specific numbers?
# filenamelist = ["idfile{}.txt".format(x) for x in [1, 20, 30, 50, 117]]
for file in filenamelist:
    wanted_file = os.path.join(mypath, file)  # absolute path to file
    #### now the same thing as before
    wanted = set()  # cleared for every new file
    with open(wanted_file) as f:
        for line in f:
            line = line.strip()
            if line:
                wanted.add(line)
    # build the name from file, not wanted_file, which already contains the path
    result_file = os.path.join(mypath, 'result_' + file)
    # this puts the output of, e.g. idfile17.txt in result_idfile17.txt
    # re-parse for every ID file: the SeqIO.parse iterator can only be consumed once
    with open(fasta_file) as fasta, open(result_file, "w") as out:
        for seq in SeqIO.parse(fasta, 'fasta'):
            if seq.id in wanted:
                SeqIO.write([seq], out, "fasta")
Note that you can just as easily build a separate list of output file names.
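As a rough illustration of that last remark, the input and output names can be generated side by side and iterated as pairs. This is only a sketch: the helper name build_file_pairs and the out{}.fasta pattern are assumptions (the latter borrowed from the out1.fasta name in the question), and the numbering 1 to 500 is taken from the question's description.

```python
import os


def build_file_pairs(folder, numbers,
                     in_pattern="idfile{}.txt", out_pattern="out{}.fasta"):
    """Return a list of (id_file, result_file) paths, one pair per number.

    Hypothetical helper: both patterns are assumptions about the naming scheme.
    """
    id_files = [os.path.join(folder, in_pattern.format(n)) for n in numbers]
    result_files = [os.path.join(folder, out_pattern.format(n)) for n in numbers]
    return list(zip(id_files, result_files))


# idfile1.txt ... idfile500.txt, each paired with out1.fasta ... out500.fasta
pairs = build_file_pairs("/home/usr/data/mydatafolder", range(1, 501))
for wanted_file, result_file in pairs[:3]:
    print(wanted_file, "->", result_file)
```

The main loop can then iterate over pairs directly, so each ID file always travels together with its own result file name.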
Answer 1 (score: 0)
You can use glob. Add this to the preamble (top of the file):
import glob
Then replace
with open(wanted_file) as f:
with
for wanted_file in glob.glob("/path/to/files/id*.txt"):
    with open(wanted_file) as f:
        ...
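This assumes that all ID file names have the form id*.txt and that the files are located in the /path/to/files/ folder. Make sure the code that follows is indented correctly, so that it sits inside the for loop.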
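Put together, a glob-driven loop might look like the sketch below. To keep it runnable without Biopython, it uses a small hand-written FASTA reader (iter_fasta) in place of SeqIO.parse; that reader, the function names, and the result_<name>.fasta output naming are all assumptions made for illustration, not part of the original script.

```python
import glob
import os


def read_ids(path):
    """Return the set of non-empty, stripped lines in an ID file."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def iter_fasta(path):
    """Yield (record_id, header_line, sequence_lines) for each FASTA record.

    Minimal stand-in for SeqIO.parse; the ID is the first word after '>'.
    """
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    yield _record(header, lines)
                header, lines = line, []
            elif header is not None:
                lines.append(line)
    if header is not None:
        yield _record(header, lines)


def _record(header, lines):
    words = header[1:].split()
    return (words[0] if words else ""), header, lines


def extract_all(fasta_file, id_pattern, out_dir):
    """Write one result file per ID file matching id_pattern; return their paths."""
    written = []
    for wanted_file in sorted(glob.glob(id_pattern)):
        wanted = read_ids(wanted_file)  # a fresh set for every ID file
        stem = os.path.splitext(os.path.basename(wanted_file))[0]
        result_file = os.path.join(out_dir, "result_" + stem + ".fasta")
        with open(result_file, "w") as out:
            # the FASTA file is re-read from the start for every ID file
            for record_id, header, lines in iter_fasta(fasta_file):
                if record_id in wanted:
                    out.write(header)
                    out.writelines(lines)
        written.append(result_file)
    return written


if __name__ == "__main__":
    # placeholder paths taken from the answer; adjust to your own layout
    extract_all("fastafile.fa", "/path/to/files/id*.txt", "/path/to/files")
```

Writing the results with a .fasta extension keeps them from matching id*.txt on a later run. With Biopython installed, the inner loop could use SeqIO.parse and SeqIO.write as in the original script instead.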
Answer 2 (score: 0)
I tried what was suggested, but I have probably made a lot of mistakes.
It does not work; the error message is: NameError: name 'result_file' is not defined
I did include a result file, but depending on the ID files there should be many result files. I don't know how to do that!
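Without seeing the code it is hard to be sure, but this NameError generally means that result_file is used before anything has been assigned to it. Since every ID file needs its own result file, one option is to derive the name inside the loop, right before it is used. The sketch below assumes a hypothetical helper result_name_for and a result_<name>.fasta naming scheme; neither comes from the original script.

```python
import os


def result_name_for(wanted_file, prefix="result_", extension=".fasta"):
    """Derive a result file path that sits next to the given ID file.

    Hypothetical helper: prefix and extension are illustrative defaults.
    """
    folder, name = os.path.split(wanted_file)
    stem, _ = os.path.splitext(name)
    return os.path.join(folder, prefix + stem + extension)


for wanted_file in ["idfile1.txt", os.path.join("data", "idfile2.txt")]:
    # assigned inside the loop, so the name exists before it is used
    result_file = result_name_for(wanted_file)
    print(wanted_file, "->", result_file)
```

Because the name is computed from wanted_file on every iteration, 500 ID files yield 500 distinct result files without listing any of them by hand.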