Question

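I have a multiprocessing program that parses some XML information and returns a dictionary as output (one dictionary object per file). I then merge all the dictionaries into a single final_dword.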

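import multiprocessing as mp

# parse_xml and locate are defined elsewhere in the program
if __name__ == '__main__':
  numthreads = 2
  pool = mp.Pool(processes=numthreads)
  # one dictionary per XML file
  dword_list = pool.map(parse_xml, (locate("*.xml")))
  final_dword = {}
  print "The final Word Count dictionary is "
  # merge every per-file dictionary into final_dword
  map(final_dword.update,dword_list)
  print final_dword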

The above code works fine for smaller data sets. As my data size grows, my program freezes during

map(final_dword.update,dword_list)

That the program freezes while executing the above statement is only my assumption. I tried profiling my code with muppy and found the following.

At the nth iteration (where n > 1200, meaning the program has processed roughly 1200+ files), I get the following stats:

Iteration  1259
                       types |   # objects |   total size
============================ | =========== | ============
                        dict |         660 |    511.03 KB
                         str |        6899 |    469.10 KB
                        code |        1979 |    139.15 KB
                        type |         176 |     77.00 KB
          wrapper_descriptor |        1037 |     36.46 KB
                        list |         307 |     23.41 KB
  builtin_function_or_method |         738 |     23.06 KB
           method_descriptor |         681 |     21.28 KB
                     weakref |         434 |     16.95 KB
                       tuple |         476 |     15.76 KB
                         set |         122 |     15.34 KB
         <class 'abc.ABCMeta |          18 |      7.88 KB
         function (__init__) |         130 |      7.11 KB
           member_descriptor |         226 |      7.06 KB
           getset_descriptor |         213 |      6.66 KB

My laptop has 4 GB of RAM, and I am processing a huge number of small (<1 MB) XML files. I am looking for a better way to merge the smaller dictionaries.

Answer 1

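If you are on Python 3.3 or later, you could try collections.ChainMap as a solution. I haven't used it yet, but it is supposed to be a fast way to link multiple dictionaries together. See the discussion here.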
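As a rough illustration of the ChainMap idea, here is a minimal sketch written for Python 3. The two small dictionaries are made-up stand-ins for the asker's dword_list. Note that a ChainMap returns the value from the first mapping that contains a key, whereas dict.update keeps the last one, so the order has to be reversed to get the same result.

```python
from collections import ChainMap

# Hypothetical per-file dictionaries standing in for dword_list
dword_list = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]

# ChainMap only stores references to the dicts; nothing is copied
chained = ChainMap(*dword_list)

# Lookup returns the value from the first mapping that has the key
first_wins = chained["b"]

# Reversing the order reproduces the "last dict wins" result of update()
last_wins = ChainMap(*reversed(dword_list))["b"]

# Iterating a ChainMap yields each distinct key once
keys = sorted(chained)
```

Building the ChainMap is cheap, but each lookup may have to search the underlying dictionaries one after another, so with well over a thousand maps lookups could become slow. Whether that trade-off pays off here is something to measure rather than assume.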

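You could also try pickling dword_list to a file and using a generator instead of keeping the list in memory. That way you stream the data rather than storing it. It should free up some memory and make the program faster. Something like: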

import pickle

def xml_dict():
    with open("path/to/file.pickle", "rb") as f:  # pickle.load needs a file object, not a path
        for d in pickle.load(f):
            yield d
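One caveat with the generator above: a single pickle.load call still reads the whole list into memory before anything is yielded. A possible variation, sketched here for Python 3, writes each dictionary with its own pickle.dump call and reads them back one at a time. The helper names dump_dicts and iter_dicts, the sample data, and the temporary file path are all illustrative assumptions.

```python
import os
import pickle
import tempfile

def dump_dicts(dicts, path):
    """Write each dictionary as a separate pickle record in one file."""
    with open(path, "wb") as f:
        for d in dicts:
            pickle.dump(d, f)

def iter_dicts(path):
    """Yield the dictionaries one at a time until the file is exhausted."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

# Hypothetical sample data and a throwaway file location
sample = [{"a": 1}, {"b": 2}, {"a": 3}]
path = os.path.join(tempfile.mkdtemp(), "dwords.pickle")
dump_dicts(sample, path)

# Only one small dictionary is held in memory besides the merged result
merged = {}
for d in iter_dicts(path):
    merged.update(d)
```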

Answer 2

Using itertools, you can chain containers:

import itertools

listA = [1,2,3]
listB = [4,5,6]
listC = [7,8,9]

for key in itertools.chain(listA, listB, listC):
    print key,

Output: 1 2 3 4 5 6 7 8 9

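This way you don't need to create a new container; chain walks through each iterable in turn until they are all exhausted. It is the same as what user @roippi commented, just written differently.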

dict(itertools.chain.from_iterable(x.iteritems() for x in dword_list))
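For readers on Python 3, the one-liner above needs items() instead of iteritems(). Also, in Python 3 map is lazy, so a bare map(final_dword.update, dword_list) would not merge anything unless the result is consumed. The sketch below shows the chain-based merge next to a plain loop. The function names and sample data are illustrative assumptions, not the asker's code.

```python
from itertools import chain

def merge_with_chain(dicts):
    """Build one dict from all key/value pairs; later dicts win on duplicates."""
    return dict(chain.from_iterable(d.items() for d in dicts))

def merge_incrementally(dicts):
    """Merge dicts one at a time so any iterable (even a lazy one) can be used."""
    final = {}
    for d in dicts:
        final.update(d)
    return final

# Hypothetical per-file dictionaries standing in for dword_list
sample = [{"a": 1, "b": 2}, {"b": 3}, {"c": 4}]

chained_result = merge_with_chain(sample)
loop_result = merge_incrementally(iter(sample))
```

Because merge_incrementally accepts any iterable, it could in principle be fed the iterator returned by pool.imap(parse_xml, ...) instead of the full list from pool.map, so each per-file dictionary can be discarded once merged. Whether this removes the freeze described in the question is untested.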

Efficiently merging a large number of dictionaries