Checking for existence in a dict versus a set in Python

Date: 2016-11-01 20:21:28

Tags: python performance python-2.7 dictionary set

It seems that checking membership against a set of the dict's keys is a little faster:

import random
import string
import timeit

repeat = 3
numbers = 1000

def time(statement, _setup=None):
    # Best-of-`repeat` timing; taking min() filters out scheduling noise.
    print min(
        timeit.Timer(statement, setup=_setup or setup).repeat(
            repeat, numbers))

random.seed('slartibartfast')

# Integers
length = 100000
d = {}
for _ in range(length):
    d[random.randint(0, 10000000)] = 0
s = set(d)

setup = """from __main__ import s, d, length
"""

time('for i in xrange(length): check = i in d')
time('for i in xrange(length): check = i in s')

# Strings
d = {}
for _ in range(length):
    d[''.join(random.choice(string.ascii_uppercase) for __ in range(16))] = 0
s = set(d)

test_strings = []
for _ in range(length):
    # ''.join() is required here; appending the bare generator expression
    # would store generator objects instead of 16-character strings.
    test_strings.append(''.join(
        random.choice(string.ascii_uppercase) for __ in range(16)))

setup = """from __main__ import s, d, length, test_strings
"""

time('for i in test_strings: check = i in d')
time('for i in test_strings: check = i in s')

This prints something like:

10.1242966769
9.73939713014
10.5156763102
10.2767765061

Is this expected, or a random artifact?

I'm wondering whether it's worth creating a set of the dict's keys in performance-critical code.

Edit: my measurements really made me wonder about the underlying implementation. I'm not trying to save microseconds, I'm just curious. That said, if it turns out the underlying implementation really does favor sets, I could build a set of those dict keys, or not (I'm actually patching legacy code).
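
For context, a minimal sketch of the two options I see (assuming Python 2.7, per the tags): a copied set(d) versus dict.viewkeys(), which gives a live, set-like view over the keys without copying them:

d = {'a': 1, 'b': 2}

keys_copy = set(d)        # snapshot: O(n) to build, detached from d
keys_view = d.viewkeys()  # set-like view: O(1) to build, tracks d

d['c'] = 3
print 'c' in keys_copy    # False: the snapshot is stale
print 'c' in keys_view    # True: the view sees the new key
print keys_view & set(['a', 'c'])  # key views support set operations

If the dict mutates while the legacy code runs, the view stays correct, whereas the snapshot would have to be rebuilt.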

2 answers:

Answer 0 (score: 1)

It probably depends on a variety of things. In my runs, dict lookups were marginally faster, but not by enough to get excited about:

In [1]: import numpy as np

In [2]: d = {i: True for i in np.random.random(1000)}

In [3]: s = {i for i in np.random.random(1000)}

In [4]: checks = d.keys()[:500] + list(s)[:500]

In [5]: %timeit [k in d for k in checks]
10000 loops, best of 3: 83 µs per loop

In [6]: %timeit [k in s for k in checks]
10000 loops, best of 3: 88.4 µs per loop

In [7]: d = {i: True for i in np.random.random(100000)}

In [8]: s = {i for i in np.random.random(100000)}

In [9]: checks = d.keys()[:5000] + list(s)[:5000]

In [10]: %timeit [k in d for k in checks]
1000 loops, best of 3: 865 µs per loop

In [11]: %timeit [k in s for k in checks]
1000 loops, best of 3: 929 µs per loop

Answer 1 (score: 1)

Honestly, it depends heavily on hardware, OS, and data size/constraints. Overall, performance is nearly identical until you reach very large data sizes. Note that in some of the runs below, dict does slightly better. At larger data-structure sizes, internal implementation details start to dominate the difference, and on my machine set tends to perform better.

The reality is that in most cases the delta simply doesn't matter. If you really want better lookup performance, consider moving to C-level operations with cython or ctypes, or use a library implementation designed for larger data volumes (see the sketch after the session below). Python's base types are not built for performance once you reach millions of elements.

>>> # With empty dict as setup in question
>>> time('for i in xrange(length): check = i in d')
2.83035111427
>>> time('for i in xrange(length): check = i in s')
2.87069892883
>>> d = { random.random(): None for _ in xrange(100000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
3.84766697884
>>> time('for i in xrange(length): check = i in s')
3.97955989838
>>> d = { random.randint(0, 1000000000): None for _ in xrange(100000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
3.96871709824
>>> time('for i in xrange(length): check = i in s')
3.62110710144
>>> d = { random.randint(0, 1000000000): None for _ in xrange(10000000) }
>>> s = set(d)
>>> time('for i in xrange(length): check = i in d')
10.6934559345
>>> time('for i in xrange(length): check = i in s')
5.7491569519
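
To make the library-level suggestion concrete, here is a minimal sketch, assuming numeric keys as in the runs above (np.in1d is standard numpy; the sizes here are illustrative). It replaces the per-element Python loop with one vectorized membership pass:

import numpy as np

# Numeric keys and queries comparable to the runs above; duplicates
# produced by randint are harmless for a membership test.
keys = np.random.randint(0, 1000000000, size=100000)
queries = np.random.randint(0, 1000000000, size=100000)

# One vectorized pass instead of 100000 Python-level lookups. np.in1d
# sorts internally, trading per-element O(1) hashing for a single batch
# operation with far less interpreter overhead.
mask = np.in1d(queries, keys)  # mask[i] == (queries[i] in keys)
print mask.sum(), 'hits out of', len(queries)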