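I wrote the following script to retrieve the gene counts for each contig. It works fine, but the order of the ID list I use as input is not preserved in the output.
I need to keep the same order because my input contig list is sorted by expression level.
Can anyone help me?
Thanks for your help.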
from collections import defaultdict
import numpy as np

gene_list = {}
for line in open('idlist.txt'):
    columns = line.strip().split()
    gene = columns[0]
    rien = columns[1]
    gene_list[gene] = rien

gene_count = defaultdict(lambda: np.zeros(6, dtype=int))
out_file = open('out.txt', 'w')
esem_file = open('Aquilonia.txt')
esem_file.readline()
for line in esem_file:
    fields = line.strip().split()
    exon = fields[0]
    numbers = [float(field) for field in fields[1:]]
    if exon in gene_list.keys():
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]
input file:
comp54678_c0_seq3
comp56871_c2_seq8
comp56466_c0_seq5
comp57004_c0_seq1
comp54990_c0_seq11
...
output file comes back in numerical order:
comp100235_c0_seq1 [22 13 15 6 15 16]
comp101274_c0_seq1 [55 2 27 26 6 6]
comp101915_c0_seq1 [20 2 34 12 8 7]
comp101956_c0_seq1 [13 21 11 17 17 28]
comp101964_c0_seq1 [30 73 45 36 0 1]
Answer 0 (score: 5)
Use collections.OrderedDict(); it keeps entries in the order in which they were inserted.
from collections import OrderedDict

with open('idlist.txt') as idlist:
    # strip() first so the value does not keep its trailing newline
    gene_list = OrderedDict(line.strip().split(None, 1) for line in idlist)
The code above reads your gene_list into an ordered dictionary in a single line.
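To illustrate the ordering guarantee, here is a minimal sketch in Python 3 syntax. The `id_lines` sample data, the `geneA`/`geneB`/`geneC` values and the `read_id_map` helper are made up for the example and are not part of the original script; it assumes every non-blank line has at least two columns.

```python
from collections import OrderedDict

# Hypothetical two-column lines standing in for idlist.txt
id_lines = [
    "comp54678_c0_seq3 geneA\n",
    "comp56871_c2_seq8 geneB\n",
    "comp56466_c0_seq5 geneC\n",
]

def read_id_map(lines):
    """Map the first column to the second, keeping the order of the lines."""
    # Assumes each non-blank line has at least two whitespace-separated columns.
    return OrderedDict(line.split()[:2] for line in lines if line.strip())

gene_list = read_id_map(id_lines)
ordered_ids = list(gene_list)  # keys come back in insertion order
```

On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict mainly matters on older interpreters such as the Python 2 used in the question.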
However, it looks as if you generate the output file purely in the order of the lines in the input file:
for line in esem_file:
    # ...
    if exon in gene_list:  # no need to call `.keys()` here
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]
Rework your code to collect the counts first, then use a separate loop to write out the data:
with open('Aquilonia.txt') as esem_file:
    next(esem_file, None)  # skip first line
    for line in esem_file:
        fields = line.split()
        exon = fields[0]
        numbers = [float(field) for field in fields[1:]]
        if exon in gene_list:
            gene_count[gene_list[exon]] += numbers

with open('out.txt', 'w') as out_file:
    # gene_count is keyed by the mapped name, so look it up via gene_list
    for exon, gene in gene_list.items():
        print >> out_file, exon, gene_count[gene]
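For readers on Python 3, the same two-pass idea (accumulate first, then write in ID-list order) might look roughly like the sketch below. It is only an illustration under some assumptions: it uses plain lists instead of NumPy arrays, works on in-memory lines rather than the real files, and `collect_counts`, `n_columns` and the sample data are hypothetical names and values chosen for the example.

```python
from collections import OrderedDict, defaultdict

def collect_counts(id_lines, count_lines, n_columns):
    """Sum the count columns per gene, then report them in ID-list order."""
    # First column -> second column, in the order of the ID list.
    gene_list = OrderedDict(line.split()[:2] for line in id_lines if line.strip())

    gene_count = defaultdict(lambda: [0.0] * n_columns)
    rows = iter(count_lines)
    next(rows, None)  # skip the header line
    for line in rows:
        fields = line.split()
        if not fields or fields[0] not in gene_list:
            continue
        totals = gene_count[gene_list[fields[0]]]
        for i, value in enumerate(fields[1:1 + n_columns]):
            totals[i] += float(value)

    # Second pass: walk the ID list, not the count file, so its order wins.
    return [(exon, gene_count[gene]) for exon, gene in gene_list.items()]

# Made-up sample data: the ID list is deliberately not in sorted order.
id_lines = ["comp3 comp3\n", "comp1 comp1\n", "comp2 comp2\n"]
count_lines = [
    "id s1 s2\n",
    "comp1 1 2\n",
    "comp2 3 4\n",
    "comp3 5 6\n",
    "comp1 10 20\n",
]
result = collect_counts(id_lines, count_lines, 2)
```

Because the final list is built by iterating over `gene_list`, the rows follow the ID list even though the count lines arrive in a different order. IDs that never appear in the count data still get a row of zeros.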