MemoryError with defaultdict(int)

Asked: 2013-03-14 15:32:03

Tags: python

I'm using defaultdict(int) to record the number of occurrences of each word in a set of books.

Python was consuming 1.5 GB of memory when I got the memory exception:

  File "C:\Python32\lib\collections.py", line 540, in update
    _count_elements(self, iterable)
MemoryError

My counter holds more than 8,000,000 entries.

I expect at least 20,000,000 unique words. What can I do to avoid the memory exception?

1 Answer:

Answer 0 (score: 1):

Even if your 64-bit system has plenty of memory, I don't think keeping track of them all in a dict is a workable idea. You should use a database.

/* If we added a key, we can safely resize.  Otherwise just return!
 * If fill >= 2/3 size, adjust size.  Normally, this doubles or
 * quadruples the size, but it's also possible for the dict to shrink
 * (if ma_fill is much larger than ma_used, meaning a lot of dict
 * keys have been deleted).
 *
 * Quadrupling the size improves average dictionary sparseness
 * (reducing collisions) at the cost of some memory and iteration
 * speed (which loops over every possible entry).  It also halves
 * the number of expensive resize operations in a growing dictionary.
 *
 * Very large dictionaries (over 50K items) use doubling instead.
 * This may help applications with severe memory constraints.
 */
if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2))
    return 0;
return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);

From the code, this says that when too many items are inserted, the dict has to grow, providing room not only for the items it already contains but also open slots for new ones. It says that when more than 2/3 of the dict is filled, its size is doubled (or quadrupled, for dicts with fewer than 50,000 items). Personally, I use dicts to hold no more than a few hundred thousand items; even with under a million items, one consumed several gigabytes and nearly froze my 8 GB Win7 machine.

If you are merely counting the words, you could:

split the words into chunks
count the words in each chunk
update the database

With a reasonable chunk size, performing only a few db queries (assuming database access would otherwise be the bottleneck) will work much better, as in the sketch below.