Filter a TSV file to get the top 3 occurrences based on a column value

Time: 2015-05-18 22:32:34

Tags: python parsing csv bioinformatics tsv

I need to filter BLAST results from a .tsv file. The filter criteria are:

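  1. Keep only hits with an E-value < 10E-20 and ignore the rest.
  2. For each contig, keep the top 3 BLAST hits. Not every contig has 3 hits, and many contigs have more than 3.
  3. The E-value is in the third column.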

The file is saved in .tsv format:

    contig-001      [Enterobacteria phage G4 sensu lato]          9.01988e-168    5418    GCATAC
    contig-001      [Enterobacteria phage ID18 sensu lato]        9.97265e-167    5418    GCATACGAAAAGACAGAATCTC
    contig-002      [Enterobacteria phage ID2 Moscow/ID/2001]     1.10261e-165    5418    GCATACGAAAAGAC
    contig-002      [Enterobacteria phage phiX174 sensu lato]     3.31985e-162    5418    GACTGATCGCAGT
    contig-002      [Enterobacteria phage ID2 Moscow/ID/2001]     7.92015e-156    5418    GCATACGAAAAGAC
    contig-002      [Enterobacteria phage ID18 sensu lato]        2.38469e-152    5418    GCATACGAAAAGAC
    contig-003      [Enterobacteria phage ID2 Moscow/ID/2001]     1.08293e-112    5418    GCATACGAAAAGAC
    contig-003      [Sweetpotato badnavirus A]                    0.000593081     6592    CATCGTAGCTGAT
    contig-003      [Dahlia mosaic virus]                         0.000593081     6592    CAAGAAGATAGAGAGTCCCACA
    

1 Answer:

Answer 0 (score: 1)

Assuming the result you want to keep is the nucleotide sequence (the last column), this should work:

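import csv
from collections import defaultdict

threshold = 10E-20  # as given in the question (equal to 1e-19)

# contig number -> {E-value: nucleotide sequence}
data = defaultdict(dict)
with open('path/to/file', newline='') as infile:
    for contig, _ignore, e, _id, nuc in csv.reader(infile, delimiter='\t'):
        contig = int(contig.split('-')[1])
        e = float(e)
        if e >= threshold:
            continue  # keep only hits with an E-value below the threshold
        # Note: hits that share an identical E-value overwrite each other here.
        data[contig][e] = nuc
        if len(data[contig]) > 3:
            # Drop the worst hit (largest E-value) so the 3 best remain.
            data[contig].pop(max(data[contig]))

# Print the surviving hits per contig, best (smallest) E-value first.
for contig, d in data.items():
    for e in sorted(d):
        print(contig, e, d[e])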
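
If two hits for the same contig can share an identical E-value, keying a dict by E-value would silently drop one of them. A minimal alternative sketch, assuming the same five-column layout as the sample, collects the rows in a list per contig and picks the best ones with `heapq.nsmallest`. The function name `top_hits` and the in-memory `sample` list are illustrative only; in practice you would pass an open file object instead.

```python
import csv
import heapq
from collections import defaultdict


def top_hits(lines, threshold=10e-20, n=3):
    """Return {contig: [(e_value, row), ...]} with at most n best rows per contig."""
    hits = defaultdict(list)
    for row in csv.reader(lines, delimiter='\t'):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        e = float(row[2])  # the E-value is in the third column
        if e < threshold:
            hits[row[0]].append((e, row))
    # nsmallest keeps the n lowest E-values; rows with equal E-values
    # remain separate entries instead of overwriting each other.
    return {c: heapq.nsmallest(n, rows, key=lambda t: t[0])
            for c, rows in hits.items()}


# Rows taken from the sample in the question, joined with real tabs.
sample = [
    "contig-001\t[Enterobacteria phage G4 sensu lato]\t9.01988e-168\t5418\tGCATAC",
    "contig-001\t[Enterobacteria phage ID18 sensu lato]\t9.97265e-167\t5418\tGCATACGAAAAGACAGAATCTC",
    "contig-002\t[Enterobacteria phage ID2 Moscow/ID/2001]\t1.10261e-165\t5418\tGCATACGAAAAGAC",
    "contig-002\t[Enterobacteria phage phiX174 sensu lato]\t3.31985e-162\t5418\tGACTGATCGCAGT",
    "contig-002\t[Enterobacteria phage ID2 Moscow/ID/2001]\t7.92015e-156\t5418\tGCATACGAAAAGAC",
    "contig-002\t[Enterobacteria phage ID18 sensu lato]\t2.38469e-152\t5418\tGCATACGAAAAGAC",
    "contig-003\t[Enterobacteria phage ID2 Moscow/ID/2001]\t1.08293e-112\t5418\tGCATACGAAAAGAC",
    "contig-003\t[Sweetpotato badnavirus A]\t0.000593081\t6592\tCATCGTAGCTGAT",
]

for contig, rows in top_hits(sample).items():
    for e, row in rows:
        print(contig, e, row[-1])
```

This version keeps the whole row, so any column can be printed or written back out with `csv.writer`. It holds every passing hit in memory before trimming, which should be fine for typical BLAST tables but is a trade-off against the streaming approach above.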