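I need to filter Blast results from a .tsv file. The parameters for the filter are:
The e-value is in the third column.
The file is saved in .tsv format:
contig-001 [Enterobacteria phage G4 sensu lato] 9.01988e-168 5418 GCATAC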
contig-001 [Enterobacteria phage ID18 sensu lato] 9.97265e-167 5418 GCATACGAAAAGACAGAATCTC
contig-002 [Enterobacteria phage ID2 Moscow/ID/2001] 1.10261e-165 5418 GCATACGAAAAGAC
contig-002 [Enterobacteria phage phiX174 sensu lato] 3.31985e-162 5418 GACTGATCGCAGT
contig-002 [Enterobacteria phage ID2 Moscow/ID/2001] 7.92015e-156 5418 GCATACGAAAAGAC
contig-002 [Enterobacteria phage ID18 sensu lato] 2.38469e-152 5418 GCATACGAAAAGAC
contig-003 [Enterobacteria phage ID2 Moscow/ID/2001] 1.08293e-112 5418 GCATACGAAAAGAC
contig-003 [Sweetpotato badnavirus A] 0.000593081 6592 CATCGTAGCTGAT
contig-003 [Dahlia mosaic virus] 0.000593081 6592 CAAGAAGATAGAGAGTCCCACA
Answer 0 (score: 1)
Assuming the result you want to keep is the nucleotide sequence (the last column), this should work:
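import csv
from collections import defaultdict

threshold = 10E-20  # equals 1e-19
data = defaultdict(dict)  # contig number -> {e-value: nucleotide sequence}
with open('path/to/file') as infile:
    for contig, _ignore, e, _id, nuc in csv.reader(infile, delimiter='\t'):
        contig = int(contig.split('-')[1])
        e = float(e)
        # Skip rows whose e-value is below the threshold.
        # Flip the comparison if you want to keep only those rows instead.
        if e < threshold:
            continue
        # Note: e-values are dict keys, so two rows of one contig
        # with the same e-value overwrite each other.
        data[contig][e] = nuc
        # Keep at most three entries per contig by dropping the smallest e-value.
        if len(data[contig]) > 3:
            data[contig].pop(min(data[contig]))

for contig, d in data.items():
    for e in sorted(d):
        print(contig, e, d[e])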
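The question does not spell out the exact filter rule, so the following is only a sketch of one common reading: keep, for each contig, up to three hits with the lowest e-values, and discard any hit whose e-value is above a cutoff. The function name `best_hits`, the parameters `max_e` and `keep`, and the cutoff value are assumptions made for illustration, not something taken from the question. Rows are stored in a list rather than keyed by e-value, so two hits with the same e-value are both kept.

```python
import csv
import io
from collections import defaultdict
from heapq import nsmallest


def best_hits(lines, max_e=1e-19, keep=3):
    """Return {contig: [(e_value, sequence), ...]} holding up to `keep`
    rows with the lowest e-values per contig (hypothetical helper)."""
    hits = defaultdict(list)
    for contig, _name, e, _col4, nuc in csv.reader(lines, delimiter='\t'):
        e = float(e)
        if e > max_e:  # assumption: hits above the cutoff are discarded
            continue
        hits[contig].append((e, nuc))
    # nsmallest returns the selected rows sorted by ascending e-value
    return {contig: nsmallest(keep, rows, key=lambda row: row[0])
            for contig, rows in hits.items()}


# A few rows from the question, joined with real tab characters.
sample = io.StringIO(
    "contig-002\t[Enterobacteria phage ID2 Moscow/ID/2001]\t1.10261e-165\t5418\tGCATACGAAAAGAC\n"
    "contig-002\t[Enterobacteria phage phiX174 sensu lato]\t3.31985e-162\t5418\tGACTGATCGCAGT\n"
    "contig-002\t[Enterobacteria phage ID2 Moscow/ID/2001]\t7.92015e-156\t5418\tGCATACGAAAAGAC\n"
    "contig-002\t[Enterobacteria phage ID18 sensu lato]\t2.38469e-152\t5418\tGCATACGAAAAGAC\n"
    "contig-003\t[Sweetpotato badnavirus A]\t0.000593081\t6592\tCATCGTAGCTGAT\n"
)

for contig, rows in best_hits(sample).items():
    for e, nuc in rows:
        print(contig, e, nuc)
```

To read from a real file, pass the open file object instead of the `io.StringIO` sample; `csv.reader` accepts any iterable of lines.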