Question

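I have to compute data from a large file. The file has about 100,000 lines and 3 columns. The program below works on a small test file, but when I run it on the large file it takes ages to show even one result. Any suggestions for speeding up loading and computing on the large data file?

Code: the calculation is correct on the small test file. The input format is shown below.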

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o: 
    numline = 0
    for line in f:
        numline += 1
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
        pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, pairper[pair], pairtime[pair]))

Input file:

5372 2684 460.0
1885 1158 351.0
1349 1174 6375.0
1980 1174 650.0
1980 1349 650.0
4821 2684 469.0
4821 937  459.0
2684 937  318.0
1980 606  390.0
1349 606  750.0
1174 606  750.0

Answer 1

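The main cause of the slowness is that you recreate the pairper dictionary for every line from the paircount dictionary, which keeps growing. That is unnecessary, because only the value calculated after all the lines have been processed is ever used.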
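To get a feel for how much extra work that is, here is a rough sketch (written for Python 3, so it uses .items() rather than .iteritems()) that only counts how many pairper entries get built in each variant. The function name count_entries_built and the synthetic rows are made up for illustration and are not from the question.

```python
from collections import defaultdict


def count_entries_built(rows, rebuild_every_line):
    """Return the total number of pairper entries constructed.

    rows is an iterable of (first, second, time) tuples.
    """
    paircount = defaultdict(int)
    built = 0
    numline = 0
    for numline, (first, second, _time) in enumerate(rows, start=1):
        paircount[(first, second)] += 1
        if rebuild_every_line:
            # What the question's code does: rebuild the whole dict per line.
            pairper = {p: c * 100.0 / numline for p, c in paircount.items()}
            built += len(pairper)
    if not rebuild_every_line:
        # Build the dict a single time, after the last line has been read.
        pairper = {p: c * 100.0 / numline for p, c in paircount.items()}
        built += len(pairper)
    return built


# Hypothetical worst case: 1000 rows, every pair distinct.
rows = [(str(i), str(i + 1), 1.0) for i in range(1000)]
slow = count_entries_built(rows, rebuild_every_line=True)
fast = count_entries_built(rows, rebuild_every_line=False)
print(slow, fast)
```

With 1000 all-distinct rows, the per-line rebuild constructs 500500 entries against 1000 for the single rebuild, so the work grows roughly with the square of the number of lines.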

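I don't fully understand what all the calculations are for, but here is an equivalent that should run much faster because it creates the pairper dictionary only once. I also simplified the logic a little. That probably doesn't affect the run time, but I think it makes the code easier to understand.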

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('easy_input.txt', 'r') as f, open('easy_output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    pairper = dict((pair, c * 100.0 / numline) for (pair, c)
                                                in paircount.iteritems())
    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c,
                                          pairper[pair], pairtime[pair]))
print 'done'

Answer 2

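The pairper calculation is what is killing you, and it isn't needed. You can use enumerate to count the input lines and just use that value at the end. This is similar to martineau's answer, except that it doesn't pull the entire input list into memory (a bad idea) and doesn't compute pairper at all.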

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o: 
    for numline, line in enumerate(f, 1):
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]))
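For anyone on Python 3, the same single-pass idea might look roughly like this. It is only a sketch: summarize_pairs is a name I made up, it accepts any iterable of lines (an open file works), and skipping lines with fewer than three fields is my assumption rather than something the code above does.

```python
from collections import defaultdict


def summarize_pairs(lines):
    """Return {pair: (count, percent_of_lines, total_time)} in one pass."""
    paircount = defaultdict(int)
    pairtime = defaultdict(float)
    numline = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # assumption: ignore blank or malformed lines
        numline += 1
        pair = fields[0], fields[1]
        paircount[pair] += 1
        pairtime[pair] += float(fields[2])
    # The percentage is computed once per pair, after all lines are counted.
    return {
        pair: (count, count * 100.0 / numline, pairtime[pair])
        for pair, count in paircount.items()
    }


# Hypothetical sample: the third row repeats the first pair.
sample = [
    "5372 2684 460.0",
    "1885 1158 351.0",
    "5372 2684 40.0",
    "4821 937  459.0",
]
result = summarize_pairs(sample)
print(result[("5372", "2684")])  # (2, 50.0, 500.0)
```

To run it on a real file, pass the open file object directly, for example summarize_pairs(f) inside a with open('input.txt') as f: block, so the lines are still read one at a time.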

Easy way to do calculations on a large data file in Python
