Efficiency of comparing two dictionaries

Time: 2012-05-02 16:03:13

Tags: python dictionary

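The following program has been running for about 22 hours on two files (txt, ~10 MB each). Each file has roughly 100K lines. Can someone point out where my code is inefficient, and perhaps suggest a faster method? The input dictionaries are ordered, and preserving that order is necessary: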

import collections

def uniq(input):
  output = []
  for x in input:
    if x not in output:
      output.append(x)
  return output

Su = {}
with open ('Sucrose_rivacombined.txt') as f:
    for line in f:
        (key, val) = line.split('\t')
        Su[(key)] = val
    Su_OD = collections.OrderedDict(Su)

Su_keys = Su_OD.keys()
Et = {}

with open ('Ethanol_rivacombined.txt') as g:
    for line in g:
        (key, val) = line.split('\t')
        Et[(key)] = val
    Et_OD = collections.OrderedDict(Et)

Et_keys = Et_OD.keys()

merged_keys = Su_keys + Et_keys
merged_keys =  uniq(merged_keys)

d3=collections.OrderedDict()

output_doc = open("compare.txt","w+")

for chr_local in merged_keys:
    line_output = chr_local
    if (Et.has_key(chr_local)):
        line_output = line_output + "\t" + Et[chr_local]
    else:
        line_output = line_output + "\t" + "ND"
    if (Su.has_key(chr_local)):
        line_output = line_output + "\t" + Su[chr_local]
    else:
        line_output = line_output + "\t" + "ND"

    output_doc.write(line_output + "\n")

The input files look like this (not every key is present in both files):

Su:
chr1:3266359    80.64516129
chr1:3409983    100
chr1:3837894    75.70093458
chr1:3967565    100
chr1:3977957    100


Et:
chr1:3266359    95
chr1:3456683    78
chr1:3837894    54.93395855
chr1:3967565    100
chr1:3976722    23

I would like the output to look like this:

chr1:3266359    80.645    95
chr1:3456683    ND        78

3 answers:

Answer 0 (score: 3)

Replace uniq with this, since the inputs are hashable:

def uniq(input):
  output = []
  s = set()
  for x in input:
    if x not in s:
      output.append(x)
      s.add(x)
  return output

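This reduces a nearly O(n^2) process to a nearly O(n) one, because a set membership test takes roughly constant time while a list membership test scans the whole list.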
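
For readers on Python 3.7 or later, the same order-preserving deduplication can be sketched with dict.fromkeys, which relies on dicts keeping insertion order. This is an alternative to the set-based loop, not part of the original answer, and the sample keys are taken from the question's input. On older versions, collections.OrderedDict.fromkeys behaves the same way.

```python
def uniq(items):
    """Return the items with duplicates removed, keeping first-seen order."""
    # dict keys are unique and (in Python 3.7+) keep insertion order,
    # so building a dict from the items deduplicates them in one pass.
    return list(dict.fromkeys(items))


merged_keys = uniq(["chr1:3266359", "chr1:3409983", "chr1:3266359", "chr1:3456683"])
print(merged_keys)
```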

Answer 1 (score: 1)

You don't need your own uniq function.

Pseudocode, something like:

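  1. Read file 2 as an OrderedDict.
  2. Process file 1, writing out its items (they are already in the right order).
  3. For each item, pop the matching key from the file 2 dict, with a default, to get the last part of the output line.
  4. Once file 1 is consumed, process what is left in the OrderedDict from file 2.

Also, I love list comprehensions... you can read the file with:

    OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))

That way there is only one ordered dict, and 'Sucrose_rivacombined.txt' never even gets loaded into memory. It should be super fast.

EDIT: full code (not sure about the output line format)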

    from collections import OrderedDict
    
    Et_OD = OrderedDict(line.strip().split('\t') for line in open('Ethanol_rivacombined.txt'))
    
    with open("compare.txt","w+") as output_doc:
        for line in open('Sucrose_rivacombined.txt'):
            key,val = line.strip().split('\t')
            line_out = '\t'.join((key,val,Et_OD.pop(key,'ND')))
            output_doc.write(line_out+'\n')
    
        for key,val in Et_OD.items():
            line_out = '\t'.join((key,'ND',val))
            output_doc.write(line_out+'\n')
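
The same streaming idea can be tried without the real files. The following is a minimal Python 3 sketch, assuming a hypothetical helper merge_lines and using io.StringIO objects as stand-ins for the two input files. The sample rows come from the question, and the column order (key, Su value, Et value) follows the code above.

```python
import io
from collections import OrderedDict


def merge_lines(su_lines, et_lines, missing="ND"):
    """Yield 'key<TAB>Su value<TAB>Et value' lines, in Su order first."""
    # Load only the second input into memory, keyed in file order.
    et = OrderedDict(
        line.rstrip("\n").split("\t") for line in et_lines if line.strip()
    )
    # Stream the first input; pop() removes every key that was matched.
    for line in su_lines:
        if not line.strip():
            continue  # skip blank lines
        key, val = line.rstrip("\n").split("\t")
        yield "\t".join((key, val, et.pop(key, missing)))
    # Whatever is left appears only in the second input.
    for key, val in et.items():
        yield "\t".join((key, missing, val))


# Stand-ins for the two files, using rows from the question's samples.
su_file = io.StringIO("chr1:3266359\t80.64516129\nchr1:3409983\t100\n")
et_file = io.StringIO("chr1:3266359\t95\nchr1:3456683\t78\n")
merged = list(merge_lines(su_file, et_file))
print("\n".join(merged))
```

With this approach, keys found only in the second file are all written at the end, after every key from the first file.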
    

Answer 2 (score: 0)

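Your output is a list, but your inputs are dictionaries: their keys are guaranteed to be unique, yet your not in output test has to compare against every element of the list, which is combinatorial. (Because of that not in check, you are doing on the order of n^2 comparisons.)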

You can replace uniq entirely with the following:

Su_OD.update(Et_OD)

This works for me:

from collections import OrderedDict

one = OrderedDict()
two = OrderedDict()

one['hello'] = 'one'
one['world'] = 'one'

two['world'] = 'two'
two['cats'] = 'two'

one.update(two)

for k, v in one.iteritems():
    print k, v

Output:

    hello one
    world two
    cats two
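
Note that update() also overwrites the values of shared keys (world ends up as two in the output above), so building the comparison table still needs both original dicts. A minimal Python 3 sketch, assuming a hypothetical helper merged_rows that updates a copy and uses it only for key order:

```python
from collections import OrderedDict


def merged_rows(su, et, missing="ND"):
    """Return (key, su_value, et_value) rows in first-seen key order."""
    # Copy first so that update() does not overwrite values in `su` itself.
    order = OrderedDict(su)
    # Shared keys keep their position; keys only in `et` are appended.
    order.update(et)
    return [(key, su.get(key, missing), et.get(key, missing)) for key in order]


one = OrderedDict([("hello", "one"), ("world", "one")])
two = OrderedDict([("world", "two"), ("cats", "two")])
rows = merged_rows(one, two)
for row in rows:
    print("\t".join(row))
```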