Question

I need a memory-efficient int-int dict in Python that supports the following operations in O(log n) time:

d[k] = v  # replace if present
v = d[k]  # None or a negative number if not present

I need to hold ~250M pairs, so it really has to be tight.

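Do you happen to know a suitable implementation (Python 2.7)?

EDIT: Removed an impossible requirement and other nonsense. Thanks, Craig and Kylotan!

To rephrase: here is a trivial int-int dictionary with 1M pairs: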

>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
...     d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
... 
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 25165960  51  25165960  51 dict (no owner)
     1 1999521 100 23994252  49  49160212 100 int

On average, a pair of integers uses 49 bytes.

Here is an array of 2M integers:

>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
...     a.append(random.randint(0, sys.maxint))
... 
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   7  8000028 100   8000028 100 array.array

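On average, a pair of integers uses 8 bytes.

I accept that 8 bytes/pair is generally hard to achieve in a dictionary. Rephrased question: is there a memory-efficient int-int dictionary implementation that uses fewer than 49 bytes/pair?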
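The two measurements above can be roughly repeated without guppy. The following is only a sketch, assuming Python 3 and `sys.getsizeof` as a stand-in for the heap report; the helper names are made up for illustration, and the absolute numbers will differ from the Python 2.7 figures quoted in the question.

```python
import array
import random
import sys

MAX_INT32 = 2**31 - 1  # keeps every value inside a 4-byte C int


def dict_bytes_per_pair(n, seed=0):
    """Approximate bytes per pair for a plain dict of n random int pairs."""
    rng = random.Random(seed)
    d = {}
    for _ in range(n):
        d[rng.randint(0, MAX_INT32)] = rng.randint(0, MAX_INT32)
    # The hash table itself plus every key object and value object.
    total = sys.getsizeof(d)
    total += sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())
    return total / float(len(d))


def array_bytes_per_pair(n, seed=0):
    """Approximate bytes per pair for 2*n ints packed into array('i')."""
    rng = random.Random(seed)
    a = array.array('i')
    for _ in range(2 * n):
        a.append(rng.randint(0, MAX_INT32))
    return sys.getsizeof(a) / float(n)


print("dict : %.1f bytes/pair" % dict_bytes_per_pair(100000))
print("array: %.1f bytes/pair" % array_bytes_per_pair(100000))
```

The array figure includes the spare capacity left by `append`, so it can come out slightly above 8 bytes/pair.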

Answer 1

You could use the IIBTree from Zope.

Answer 2

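I don't know whether this is a one-shot solution or part of an ongoing project, but if it's the former, isn't throwing more RAM at it cheaper than the developer time needed to optimize memory usage? Even at 64 bytes per pair, you're still only looking at 15GB, which fits easily enough into most desktop boxes.

I think the correct answer probably lies within the SciPy / NumPy libraries, but I'm not familiar enough with them to tell you exactly where to look.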
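As a quick check of that estimate, here is the arithmetic spelled out, assuming the roughly 250M pairs that the 15GB figure implies. The per-pair costs of 8 and 49 bytes are the ones measured in the question; `total_gib` is just an illustrative helper.

```python
PAIRS = 250 * 10**6  # assumed pair count (~250M)


def total_gib(bytes_per_pair, pairs=PAIRS):
    """Total memory in GiB for the given per-pair cost."""
    return bytes_per_pair * pairs / float(2**30)


for label, per_pair in [("packed array", 8), ("plain dict", 49), ("pessimistic", 64)]:
    print("%-12s %2d bytes/pair -> %5.1f GiB" % (label, per_pair, total_gib(per_pair)))
```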

http://docs.scipy.org/doc/numpy/reference/

You might also find some useful ideas in this thread: Memory Efficient Alternatives to Python Dictionaries

Answer 3

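8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you have no guarantee that the keys are contiguous, then either you waste a lot of space between the keys by using an array representation (and you also need some sort of dead value to indicate an empty key), or you need to maintain a separate index to the key/value pairs, which by definition will exceed your 8 bytes per pair (even if only by a small amount).

I suggest you go with your array method, but the best approach will depend on the nature of the keys, I expect.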
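One way the array method could look is sketched below: two parallel `array.array` objects kept sorted by key and searched with `bisect`. This is an assumption about what such a method might be, not code from the question, and `SortedArrayMap` is a made-up name. Lookups are O(log n), but inserting a new key shifts elements and is O(n), so it suits data that is loaded once (ideally in key order) and then mostly read. Keys and values must fit in a C int because of the `'i'` typecode.

```python
import array
import bisect


class SortedArrayMap(object):
    """int -> int map stored in two parallel arrays sorted by key."""

    def __init__(self, typecode='i'):
        self._keys = array.array(typecode)
        self._values = array.array(typecode)

    def __setitem__(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value        # replace if present
        else:
            self._keys.insert(i, key)      # O(n): shifts the tail
            self._values.insert(i, value)

    def get(self, key, default=-1):
        """Return the stored value, or a negative default if absent."""
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def __len__(self):
        return len(self._keys)


m = SortedArrayMap()
m[5] = 50
m[1] = 10
m[5] = 55          # overwrites the earlier value
print(m.get(5), m.get(1), m.get(7))
```

Because absent keys are detected by comparing the key found at the insertion point, no dead value is needed; the cost is the O(n) insert.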

Answer 4

How about a Judy array, if you're mapping from ints? It is a kind of sparse array... It uses 1/4 of the dictionary implementation's space.

Judy:

$ cat j.py ; time python j.py 
import judy, random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = judy.JudyIntObjectMap()
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)

print h.heap()
Partition of a set of 4000004 objects. Total size = 96000624 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0 4000001 100 96000024 100  96000024 100 int
     1      1   0      448   0  96000472 100 types.FrameType
     2      1   0       88   0  96000560 100 __builtin__.weakref
     3      1   0       64   0  96000624 100 __builtin__.PyJudyIntObjectMap

real    1m9.231s
user    1m8.248s
sys     0m0.381s

Dictionary:

$ cat d.py ; time python d.py   
import random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = {}
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)

print h.heap()
Partition of a set of 8000003 objects. Total size = 393327344 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      1   0 201326872  51 201326872  51 dict (no owner)
     1 8000001 100 192000024  49 393326896 100 int
     2      1   0      448   0 393327344 100 types.FrameType

real    1m8.129s
user    1m6.947s
sys     0m0.559s

~1/4 of the space:

$ echo 96000624 / 393327344 | bc -l
.24407309958089260125

(I'm using 64-bit Python, by the way, so my base numbers may be inflated by 64-bit pointers.)

Answer 5

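Looking at your data above, that's not 49 bytes per int, it's 25. The other 24 bytes per entry are the int objects themselves. So you need something significantly smaller than 25 bytes per entry. Unless you are also going to reimplement the int objects themselves, which is possible for the key hashes, at least. Or implement it in C, where you can skip the objects completely (this is what Zope's IIBTree does, as mentioned above).

To be honest, the Python dictionary is highly tuned in various ways. It will not be easy to beat, but good luck.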
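The breakdown in this answer can be checked directly against the 1M-pair heap report in the question. The constants below are copied from that report; the variable names are only illustrative.

```python
# Figures from the 1M-pair guppy report in the question.
PAIRS = 1000000
DICT_BYTES = 25165960   # the dict's own hash table
INT_BYTES = 23994252    # all int objects together
INT_COUNT = 1999521     # number of int objects

table_per_pair = DICT_BYTES / float(PAIRS)
bytes_per_int = INT_BYTES / float(INT_COUNT)
ints_per_pair = 2 * bytes_per_int   # one key object + one value object

print("hash table : %.1f bytes/pair" % table_per_pair)
print("int objects: %.1f bytes/pair" % ints_per_pair)
print("total      : %.1f bytes/pair" % (table_per_pair + ints_per_pair))
```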

Answer 6

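I have implemented my own int-int dictionary, available here (BSD license). In short, I use array.array('i') to store key-value pairs sorted by key. In fact, instead of one large array I keep a dictionary of smaller arrays (a key-value pair is stored in the key/65536-th array), in order to speed up insertion and the binary search performed during retrieval. Each array stores keys and values in the following way: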

key0 value0 key1 value1 key2 value2 ...

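Actually, it is not only an int-int dictionary but a general object-int dictionary, with objects reduced to their hashes. Thus, the hash-int dictionary can be used as a cache for some persistently stored dictionary.

There are three possible strategies for handling "key collisions", that is, attempts to assign a different value to the same key. The default strategy allows it. The "deleting" strategy removes the key and marks it as colliding, so that any further attempt to assign a value to it has no effect. The "shouting" strategy throws an exception on any overwrite attempt and on any further access to a colliding key.

Please see my answer to a related question for a differently worded description of my approach.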
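The layout described in this answer (a dictionary of small arrays, each holding interleaved `key value` pairs sorted by key) can be sketched roughly as follows. This is not the author's code: `BucketedIntMap` and its methods are hypothetical names, only the default "allow overwrite" behaviour is shown, and the collision strategies and object hashing are left out.

```python
import array


class BucketedIntMap(object):
    """int -> int map: a dict of small array('i') buckets, chosen by key // 65536.

    Each bucket is laid out as key0 value0 key1 value1 ..., sorted by key.
    """

    def __init__(self):
        self._buckets = {}

    @staticmethod
    def _find(bucket, key):
        """Binary search over the pairs; return (pair index, found flag)."""
        n = len(bucket) // 2
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            if bucket[2 * mid] < key:
                lo = mid + 1
            else:
                hi = mid
        return lo, (lo < n and bucket[2 * lo] == key)

    def __setitem__(self, key, value):
        b = key // 65536
        bucket = self._buckets.get(b)
        if bucket is None:
            bucket = self._buckets[b] = array.array('i')
        i, found = self._find(bucket, key)
        if found:
            bucket[2 * i + 1] = value                         # overwrite
        else:
            # Splice the new pair in; only this bucket is shifted.
            bucket[2 * i:2 * i] = array.array('i', [key, value])

    def get(self, key, default=-1):
        bucket = self._buckets.get(key // 65536)
        if bucket is None:
            return default
        i, found = self._find(bucket, key)
        return bucket[2 * i + 1] if found else default


m = BucketedIntMap()
m[70000] = 1
m[3] = 2
m[65536] = 3
m[70000] = 4       # overwrites the earlier value
print(m.get(70000), m.get(3), m.get(4))
```

Splitting by `key // 65536` keeps each array short, so an insert shifts only one small bucket rather than the whole data set, at the price of one regular dict entry per bucket.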

Memory-efficient int-int dict in Python
