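I'm looking for a Python solution that extracts multiple sequences from a FASTA file into multiple files, based on matches against a list of header IDs held in a separate file.
This is a slightly more complex version of the questions posted at "Extract sequences from a FASTA file based on entries in a separate file" and https://www.biostars.org/p/2822/, which only output a single file for all matches.
CAP357_2030_09WPI,CAP357_2040_11WPI,CAP357_2050_13WPI, etc.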
> CAP357_2030_009wpi_v1v3_1_056_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGG
> CAP357_2040_011wpi_v1v3_1_008_00006_001.1
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2040_011wpi_v1v3_1_030_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2040_011wpi_v1v3_1_004_00001_000.2
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2050_013wpi_v1v3_1_047_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
file1:CAP357_2030_009wpi_v1v3.fasta
> CAP357_2030_009wpi_v1v3_1_056_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGG
file2:CAP357_2040_011wpi_v1v3.fasta
> CAP357_2040_011wpi_v1v3_1_008_00006_001.1
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2040_011wpi_v1v3_1_030_00002_000.4
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
> CAP357_2040_011wpi_v1v3_1_004_00001_000.2
GTAAAATTAACCCCACTCTGTGTCACTCTAAATTGTACAACTGCAAAGGGT
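etc.
The code below is from the link above, but I would like:
* the matches to be written to separate outfiles
* not to have to specify each outfile individually, if possible (I will have up to 30 outfiles)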
#!/usr/bin/env python
import sys
from Bio import SeqIO
input_file = sys.argv[1]
id_file = sys.argv[2]
output_file = sys.argv[3]
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)
index = SeqIO.index(input_file, "fasta")
records = (index[r] for r in wanted)
count = SeqIO.write(records, output_file, "fasta")
assert count == len(wanted)
print "Saved %i records from %s to %s" % (count, input_file, output_file)
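This is what I have come up with so far (script below), but I don't know how to do it without manually specifying all the outfiles and variables (I have only included three here):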
from Bio import SeqIO
import pandas as pd
import sys

input_file = sys.argv[1]
id_file = sys.argv[2]
output_file2020 = sys.argv[3]
output_file2030 = sys.argv[4]
output_file2040 = sys.argv[5]

colnames = ["2020", "2030", "2040"]
headerlist = pd.read_csv(id_file, names=colnames, header=None)
infile = list(SeqIO.parse(input_file, "fasta"))

# Python names cannot start with a digit, and columns named with digits
# need bracket access rather than attribute access
seq_2020 = tuple(headerlist["2020"])
seq_2030 = tuple(headerlist["2030"])
seq_2040 = tuple(headerlist["2040"])

count2020 = 0
count2030 = 0
count2040 = 0

# Open each outfile once: passing a file name to SeqIO.write inside the loop
# would overwrite that file on every call
with open(output_file2020, "w") as out2020, open(output_file2030, "w") as out2030, open(output_file2040, "w") as out2040:
    for record in infile:
        if record.id in seq_2020:
            SeqIO.write([record], out2020, "fasta")
            count2020 += 1
        elif record.id in seq_2030:
            SeqIO.write([record], out2030, "fasta")
            count2030 += 1
        elif record.id in seq_2040:
            SeqIO.write([record], out2040, "fasta")
            count2040 += 1
        else:
            print("no matches found")

print("number of 2020 records is", count2020)
print("number of 2030 records is", count2030)
print("number of 2040 records is", count2040)
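Answer 0 (score: 1)
A few short suggestions:
If all headers follow the same pattern, you can extract the distinguishing element:
record.description.split("_")[1]
(this pulls "2040" out of "CAP357_2040_011wpi_v1v3_1_008_00006_001.1")
If you use a dict, you can gather the records into collections: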
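As a small illustration of that header split, the sketch below wraps it in a helper. The name `timepoint_key` and its `field` and `sep` parameters are hypothetical, not part of Biopython; the function works on a plain header string such as the value of `record.description`, and it assumes the field of interest is always at the same position.

```python
def timepoint_key(header, field=1, sep="_"):
    """Return one separator-delimited field of a FASTA header.

    Hypothetical helper: raises ValueError when the header has too few fields.
    """
    parts = header.strip().split(sep)
    if field >= len(parts):
        raise ValueError("header %r has no field %d" % (header, field))
    return parts[field]


print(timepoint_key("CAP357_2040_011wpi_v1v3_1_008_00006_001.1"))  # prints 2040
```

Failing loudly on a malformed header is a design choice here; silently skipping such records would be an equally reasonable option.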
collected = {}
for record in records:
    descr = record.description.split("_")[1]
    try:
        collected[descr].append(record)
    except KeyError:
        collected[descr] = [record]
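The try/except grouping above can also be written with `collections.defaultdict` from the standard library. This is a minimal sketch rather than the answerer's code: to stay free of dependencies it groups plain header strings instead of `SeqRecord` objects, and the `headers` list is sample data copied from the question.

```python
from collections import defaultdict

# Sample headers from the question; real code would loop over SeqIO.parse(...)
headers = [
    "CAP357_2030_009wpi_v1v3_1_056_00002_000.4",
    "CAP357_2040_011wpi_v1v3_1_008_00006_001.1",
    "CAP357_2040_011wpi_v1v3_1_030_00002_000.4",
    "CAP357_2050_013wpi_v1v3_1_047_00002_000.4",
]

collected = defaultdict(list)  # a missing key starts out as an empty list
for header in headers:
    collected[header.split("_")[1]].append(header)

for key in sorted(collected):
    print(key, len(collected[key]))
```

With `defaultdict(list)` the first append for a new key needs no special case, which removes the `KeyError` branch.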
Then you can write each collection to a new file:
import os

file_path = os.getcwd()  # assumption: write to the current working directory
file_name = "outfile%s"
for (descr, records) in collected.items():  # iteritems in python2
    with open(os.path.join(file_path, file_name % descr), 'w') as f:
        SeqIO.write(records, f, 'fasta')
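The expected output in the question names each file after the first four header fields (for example CAP357_2030_009wpi_v1v3.fasta) rather than after the time point alone. A possible way to build such a name is sketched below; `output_name` and its `n_fields` parameter are hypothetical, and the sketch assumes every header has at least four underscore-separated fields.

```python
def output_name(header, n_fields=4, sep="_"):
    """Build an output file name from the leading fields of a FASTA header."""
    return sep.join(header.strip().split(sep)[:n_fields]) + ".fasta"


print(output_name("CAP357_2040_011wpi_v1v3_1_008_00006_001.1"))
# prints CAP357_2040_011wpi_v1v3.fasta
```

The same string could serve as the dict key when collecting records, so that the key and the file name stay in step.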
Answer 1 (score: 0)
For completeness, here is the final script:
#!/usr/bin/env python
# a script to extract fasta records from a fasta file to multiple separate fasta files based on a particular ID (time point) in a particular field, for a given delimiter
# to run, navigate to file location with command prompt and enter: python split_fasta_by_collections.py infile.fasta
from Bio import SeqIO
import os
import sys
records = SeqIO.parse(sys.argv[1], "fasta")
collected = {}
for record in records:
    descr = record.description.split("_")[1]  # "_" sets the delimiter, "1" sets the field (counting starts at 0 for the first field)
    try:
        collected[descr].append(record)
    except KeyError:
        collected[descr] = [record]
file_name = "outfile%s.fasta"
file_path = os.getcwd()  # sets the output file path to your current working directory
for (descr, records) in collected.items():
    with open(os.path.join(file_path, file_name % descr), 'w') as f:
        SeqIO.write(records, f, 'fasta')
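The script above holds every record in memory before writing. For large inputs, a streaming variant may be preferable. The following is a rough sketch using only the standard library instead of Biopython: `split_fasta` is a hypothetical function, the demo sequences are shortened placeholders, and it assumes every header contains the chosen field.

```python
import os
import tempfile
from contextlib import ExitStack


def split_fasta(lines, out_dir, field=1, sep="_"):
    """Stream FASTA lines into one file per header field; return record counts."""
    counts = {}
    handles = {}
    current = None  # handle of the file receiving the current record
    with ExitStack() as stack:  # closes every opened outfile on exit
        for line in lines:
            if line.startswith(">"):
                key = line[1:].strip().split(sep)[field]
                if key not in handles:
                    path = os.path.join(out_dir, "outfile%s.fasta" % key)
                    handles[key] = stack.enter_context(open(path, "w"))
                current = handles[key]
                counts[key] = counts.get(key, 0) + 1
            if current is not None:  # lines before the first header are skipped
                current.write(line if line.endswith("\n") else line + "\n")
    return counts


# Headers from the question, with shortened placeholder sequences
demo = [
    ">CAP357_2030_009wpi_v1v3_1_056_00002_000.4", "GTAAAATT",
    ">CAP357_2040_011wpi_v1v3_1_008_00006_001.1", "GTAAAATTA",
    ">CAP357_2040_011wpi_v1v3_1_030_00002_000.4", "GTAAAATTC",
]
with tempfile.TemporaryDirectory() as tmp:
    print(split_fasta(demo, tmp))
```

On a real file, the `lines` argument could be an open file handle. One handle stays open per group, which should be fine for the roughly 30 outfiles mentioned in the question.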