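The following program has been running for about 22 hours on two files (txt, ~10MB each). Each file has roughly 100K lines. Can someone tell me where my code is inefficient, and perhaps suggest a faster approach? The input dictionaries are ordered, and preserving that order is necessary: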
import collections

def uniq(input):
    output = []
    for x in input:
        if x not in output:
            output.append(x)
    return output

Su = {}
with open ('Sucrose_rivacombined.txt') as f:
    for line in f:
        (key, val) = line.split('\t')
        Su[(key)] = val
Su_OD = collections.OrderedDict(Su)
Su_keys = Su_OD.keys()

Et = {}
with open ('Ethanol_rivacombined.txt') as g:
    for line in g:
        (key, val) = line.split('\t')
        Et[(key)] = val
Et_OD = collections.OrderedDict(Et)
Et_keys = Et_OD.keys()

merged_keys = Su_keys + Et_keys
merged_keys = uniq(merged_keys)
d3=collections.OrderedDict()
output_doc = open("compare.txt","w+")
for chr_local in merged_keys:
    line_output = chr_local
    if (Et.has_key(chr_local)):
        line_output = line_output + "\t" + Et[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    if (Su.has_key(chr_local)):
        line_output = line_output + "\t" + Su[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    output_doc.write(line_output + "\n")
The input files look like this; not every key is present in both files:
Su:
chr1:3266359 80.64516129
chr1:3409983 100
chr1:3837894 75.70093458
chr1:3967565 100
chr1:3977957 100
Et:
chr1:3266359 95
chr1:3456683 78
chr1:3837894 54.93395855
chr1:3967565 100
chr1:3976722 23
I would like the output to look like this:
chr1:3266359 80.645 95
chr1:3456683 ND 78
Answer 0 (score: 3)
Replace your uniq with this version, since the inputs are hashable:
def uniq(input):
    output = []
    s = set()
    for x in input:
        if x not in s:
            output.append(x)
            s.add(x)
    return output
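This reduces a nearly O(n^2) process to nearly O(n), because membership tests on a set take roughly constant time while `x not in output` scans the whole list.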
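For what it's worth, the same order-preserving de-duplication can also be sketched with `OrderedDict.fromkeys`. This is only an alternative illustration, not something from the answer above: `uniq_fromkeys` is a made-up name, and the sample keys are borrowed from the question.

```python
from collections import OrderedDict

def uniq_fromkeys(items):
    # OrderedDict keys are unique and keep first-insertion order,
    # so later duplicates are dropped while the order is preserved.
    return list(OrderedDict.fromkeys(items))

keys = ['chr1:3266359', 'chr1:3409983', 'chr1:3266359', 'chr1:3456683']
result = uniq_fromkeys(keys)
```

Like the set-based version, this does one hash lookup per element instead of scanning a list each time.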
Answer 1 (score: 1)
You don't need your own uniq function.
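In outline: load one file into an OrderedDict, stream the other file line by line, pop each matching key as you go, and finally write out whatever is left in the dict.
Also, I love list comprehensions... you can read a file like this: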
OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))
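There is only one ordered dict, and 'Sucrose_rivacombined.txt' never even gets loaded into memory. It should be super fast.
Edit: full code (not sure about the output line format)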
from collections import OrderedDict

Et_OD = OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))

with open("compare.txt","w+") as output_doc:
    for line in open('Sucrose_rivacombined.txt'):
        key,val = line.strip().split('\t')
        line_out = '\t'.join((key,val,Et_OD.pop(key,'ND')))
        output_doc.write(line_out+'\n')
    for key,val in Et_OD.items():
        line_out = '\t'.join((key,'ND',val))
        output_doc.write(line_out+'\n')
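If you want to try this logic without the real files, here is a minimal sketch that runs the same steps over in-memory lines. The function name `merge_lines` and its `missing` parameter are hypothetical, and the sample rows assume the columns in the question are tab-separated.

```python
from collections import OrderedDict

def merge_lines(su_lines, et_lines, missing='ND'):
    # Load the Et lines into an ordered dict (key -> value).
    et = OrderedDict(line.strip().split('\t') for line in et_lines)
    out = []
    # Stream the Su lines; pop() removes keys that occur in both inputs.
    for line in su_lines:
        key, val = line.strip().split('\t')
        out.append('\t'.join((key, val, et.pop(key, missing))))
    # Whatever is left in et never appeared in the Su lines.
    for key, val in et.items():
        out.append('\t'.join((key, missing, val)))
    return out

su = ['chr1:3266359\t80.64516129\n', 'chr1:3409983\t100\n']
et = ['chr1:3266359\t95\n', 'chr1:3456683\t78\n']
merged = merge_lines(su, et)
```

The Su keys come out first in file order, followed by the keys that only exist in Et, which is the same order the question's `Su_keys + Et_keys` merge produces.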
Answer 2 (score: 0)
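Your output is a list, but your inputs are dictionaries: their keys are guaranteed to be unique, yet your `not in output` check has to compare against every element of the list, so the total cost grows quadratically. (Because of the `not in` check, you are doing about n^2 comparisons.)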
You can replace uniq entirely with the following:
Su_OD.update(Et_OD)
This works for me:
from collections import OrderedDict

one = OrderedDict()
two = OrderedDict()
one['hello'] = 'one'
one['world'] = 'one'
two['world'] = 'two'
two['cats'] = 'two'
one.update(two)
for k, v in one.iteritems():
    print k, v
Output:
hello one
world two
cats two
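Note that in the output above `update` overwrote the value for `world`. If you need both values per key, as in the question, one option is to use the merged dict only for key order and look the values up in the two originals. This is just a sketch and not part of the answer's code; the `rows` name and the `'ND'` placeholder (taken from the question) are illustrative.

```python
from collections import OrderedDict

one = OrderedDict([('hello', 'one'), ('world', 'one')])
two = OrderedDict([('world', 'two'), ('cats', 'two')])

# Copy first so that `one` keeps its own values.
merged = OrderedDict(one)
merged.update(two)

# Use merged only for key order; fetch values from the originals.
rows = [(k, one.get(k, 'ND'), two.get(k, 'ND')) for k in merged]
```

Each `get` is a constant-time hash lookup, so building `rows` stays roughly linear in the number of keys.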