Question

I have a long dictionary that maps strings to strings, like

{'a':'b' , 'c':'d'}

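I want to search for a key efficiently so that I can get the corresponding value from the dictionary, but I want something better than iterating over the whole dictionary. Is there a better way to store the data, for example in a set, so that I can search it efficiently? I looked at sets, but I could only find ways to store single strings, not dictionary entries.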

Answer 1

If you have a dictionary d and want to test whether a key k is in it, use k in d or k not in d. For example:

>>> d = {'a':'b' , 'c':'d'}
>>> 'a' in d
True
>>> 'x' in d
False
>>> 'a' not in d
False
>>> 'x' not in d
True
>>>

These checks should be very efficient, because dictionaries (and sets) are implemented with hash tables.
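Since the question also asks for the value that belongs to a key, here is a minimal sketch of the usual ways to fetch it without scanning the dictionary. The helper name lookup and its default parameter are made up for illustration; d[key] and dict.get are the standard dict operations.

```python
d = {'a': 'b', 'c': 'd'}


def lookup(mapping, key, default=None):
    """Return mapping[key], or default when the key is absent (hypothetical helper)."""
    return mapping.get(key, default)


print(d['a'])               # direct indexing; raises KeyError for a missing key
print(lookup(d, 'x'))       # missing key, falls back to None
print(lookup(d, 'x', '?'))  # missing key, falls back to the given default
```

Like the membership test, both forms hash the key once and go straight to its slot, so no iteration over the entries is involved.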

Answer 2

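The question "How are Python's Built In Dictionaries Implemented" already has an interesting answer ...

But why not try it here and measure (after reading up on the time complexity of the Python implementation):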

#! /usr/bin/env python
from __future__ import division, print_function

import dis  # for disassembling (bottom)
import string  # string.printable as sample character source
import timeit


chars = tuple(string.printable)
cartesian = tuple(a + b for a in chars for b in chars)
assert 10000 == len(cartesian), len(cartesian)  # message shows the actual size
d = dict((a + b, b) for a in cartesian for b in chars)

assert 1000000 == len(d), len(d)  # message shows the actual size
assert d['zzz'] == 'z'

setup = """
import string
chars = tuple(string.printable)
d = dict((a + b, b) for a in chars for b in chars)
"""

assert 1000000 / 10000 == 100
setup_100x = """
import string
chars = tuple(string.printable)
cartesian = tuple(a + b for a in chars for b in chars)
d = dict((a + b, b) for a in cartesian for b in chars)
"""

stmt = """
'zzz' in d
"""


t = timeit.timeit(stmt=stmt, setup=setup, timer=timeit.default_timer,
                  number=timeit.default_number)

print("# Timing[secs] for 1x10000:", t)

t_100x = timeit.timeit(stmt=stmt, setup=setup_100x, timer=timeit.default_timer,
                       number=timeit.default_number)

print("# Timing[secs] for 100x10000:", t_100x)

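# NOTE: on Python 2, dis.dis() treats a str argument as raw bytecode, not
# as source; disassemble compile(disassemble_me, '<string>', 'eval') instead
# to see the opcodes of the expression itself.
disassemble_me = "'zzz' in {'a': 'b'}"
print("# Disassembly(lookup in dict with 1 string entry):")
print("#", disassemble_me)
dis.dis(disassemble_me)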

disassemble_me = "'zzz' in {'a': 'b', 'c': 'd'}"
print("# Disassembly(lookup in dict with 2 string entries):")
print("#", disassemble_me)
dis.dis(disassemble_me)

disassemble_me = "'zzz' in {'a': 'b', 'c': 'd', 'e': 'f'}"
print("# Disassembly(lookup in dict with 3 string entries):")
print("#", disassemble_me)
dis.dis(disassemble_me)

On my machine, running Python 2.7.11, this gives:

# Timing[secs] for 1x10000: 0.0406861305237
# Timing[secs] for 100x10000: 0.0472030639648
# Disassembly(lookup in dict with 1 string entry):
# 'zzz' in {'a': 'b'}
        0 <39>           
        1 SETUP_FINALLY   31354 (to 31358)
        4 <39>           
        5 SLICE+2        
        6 BUILD_MAP        8302
        9 <123>           24871
       12 <39>           
       13 INPLACE_DIVIDE 
       14 SLICE+2        
       15 <39>           
       16 DELETE_GLOBAL   32039 (32039)
# Disassembly(lookup in dict with 2 string entries):
# 'zzz' in {'a': 'b', 'c': 'd'}
        0 <39>           
        1 SETUP_FINALLY   31354 (to 31358)
        4 <39>           
        5 SLICE+2        
        6 BUILD_MAP        8302
        9 <123>           24871
       12 <39>           
       13 INPLACE_DIVIDE 
       14 SLICE+2        
       15 <39>           
       16 DELETE_GLOBAL   11303 (11303)
       19 SLICE+2        
       20 <39>           
       21 DUP_TOPX        14887
       24 SLICE+2        
       25 <39>           
       26 LOAD_CONST      32039 (32039)
# Disassembly(lookup in dict with 3 string entries):
# 'zzz' in {'a': 'b', 'c': 'd', 'e': 'f'}
        0 <39>           
        1 SETUP_FINALLY   31354 (to 31358)
        4 <39>           
        5 SLICE+2        
        6 BUILD_MAP        8302
        9 <123>           24871
       12 <39>           
       13 INPLACE_DIVIDE 
       14 SLICE+2        
       15 <39>           
       16 DELETE_GLOBAL   11303 (11303)
       19 SLICE+2        
       20 <39>           
       21 DUP_TOPX        14887
       24 SLICE+2        
       25 <39>           
       26 LOAD_CONST      11303 (11303)
       29 SLICE+2        
       30 <39>           
       31 LOAD_NAME       14887 (14887)
       34 SLICE+2        
       35 <39>           
       36 BUILD_TUPLE     32039

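So the lookup 'zzz' in d in a dict with 10^4 entries takes about 40 milliseconds in total for 1000000 repetitions (timeit.default_number == 1000000), which is roughly 40 nanoseconds per lookup, and it stays below 50 milliseconds for a dict 100 times as large, that is, with 10^6 entries. Note that 'zzz' is a miss in the small dict, whose keys have two characters, and a hit in the large one.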

# Timing[secs] for 1x10000: 0.0406861305237
# Timing[secs] for 100x10000: 0.0472030639648

Measuring implies repeatability :-) so I ran it again:

# Timing[secs] for 1x10000: 0.0441079139709
# Timing[secs] for 100x10000: 0.0460820198059

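It simply settles (further runs are not shown here). For this key type and these length relations (the larger dict also has longer keys!), no linear worst case shows up here. The run time is only about 10% longer for a dict that is 100 times larger (by entry count) and has keys that are 50% longer.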
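To make the "no linear worst case" argument concrete, the ratio between the large-dict and small-dict totals can be computed from the two runs quoted above. This is only arithmetic on the reported numbers; the variable names are chosen for illustration.

```python
# (small dict, large dict) totals in seconds, copied from the two runs above
runs = [
    (0.0406861305237, 0.0472030639648),
    (0.0441079139709, 0.0460820198059),
]

# A linear-time lookup would give a ratio near 100 for a dict 100 times larger.
ratios = [large / small for small, large in runs]
for ratio in ratios:
    print("large/small = %.3f" % ratio)
```

The ratios come out at about 1.16 and 1.04, nowhere near the factor of 100 by which the entry count grew.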

Looks good. I suggest always measuring and disassembling when in doubt. HTH.

PS: The OP may still want to provide more code context in future questions, since a data structure is best chosen knowing how it will be used ;-)

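PPS: Raymond Hettinger et al. often optimize the CPython implementation with "love for detail" (sorry, no better English expression is available to me), so always expect specific unrolled implementations for small "sizes" of a problem. That is why the disassembly of a toy variant of a problem may differ substantially from the one produced for a real-sized task. It is also why I usually prefer timeit and profile measurements over disassembly, but we should get used to reading byte code, so that we get ideas when the measured performance does not meet our expectations.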

Otherwise: enjoy the disassembly of the lookup :-)
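One caveat about the listings above: `<39>` and `<123>` are the character codes of `'` and `{`, which suggests that Python 2's dis.dis decoded the source text of the expression as if it were raw bytecode. A minimal sketch, assuming the intent was to see the opcodes of the expression itself, compiles the string first. The helper name compile_lookup is made up for illustration; compile and dis.dis are standard.

```python
import dis


def compile_lookup(expr):
    """Compile a source expression into a code object that dis can disassemble."""
    return compile(expr, "<lookup>", "eval")


for expr in ("'zzz' in {'a': 'b'}", "'zzz' in {'a': 'b', 'c': 'd'}"):
    print("#", expr)
    dis.dis(compile_lookup(expr))  # opcode names differ between Python versions
```

On Python 3, dis.dis compiles a str argument by itself, so the explicit compile step mainly matters on Python 2.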

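Update: ... if you change the timed statement so that the small dict gets a hit with 'zz' and the large one does not (or vice versa), you may also encounter timings like these: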

# Timing[secs] for 1x10000: 0.0533709526062
# Timing[secs] for 100x10000: 0.0458760261536

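Here the test 'zz' in d took 53 milliseconds for the small dict and 46 milliseconds for the large one (each figure is the total for 1000000 trials).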
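The hit-versus-miss comparison can be repeated in one go with a sketch like the following. It assumes Python 3, rebuilds dicts of the same sizes with dict comprehensions, and uses a made-up helper time_lookup; the lambda adds call overhead, so the absolute numbers will be higher than with a statement string.

```python
import string
import timeit

chars = tuple(string.printable)                   # 100 sample characters
small = {a + b: b for a in chars for b in chars}  # 10**4 two-character keys
large = {a + b + c: c                             # 10**6 three-character keys
         for a in chars for b in chars for c in chars}


def time_lookup(d, key, number=100000):
    """Total seconds for `number` membership tests of key in d (hypothetical helper)."""
    return timeit.timeit(lambda: key in d, number=number)


for name, d in (("small", small), ("large", large)):
    for key in ("zz", "zzz"):
        # Columns: dict name, key, whether the key is present, total seconds
        print(name, repr(key), key in d, time_lookup(d, key))
```

Each dict is timed once with a key it contains and once with a key it lacks, which separates the effect of dict size from the effect of hit versus miss.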

Answer 3

You can use the str.translate function with a table (in this case a dict) to replace the characters of a string, mapping from keys to values.
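A minimal sketch of that idea, assuming Python 3 and a mapping whose keys are single characters (str.maketrans rejects longer string keys). The helper name replace_chars is made up for illustration.

```python
def replace_chars(text, mapping):
    """Replace each character of text that is a key in mapping (hypothetical helper)."""
    table = str.maketrans(mapping)  # converts one-character keys to code points
    return text.translate(table)


d = {'a': 'b', 'c': 'd'}
print(replace_chars("abcd", d))  # 'a' -> 'b' and 'c' -> 'd'; 'b' and 'd' are unchanged
```

This replaces characters inside a string; for looking up whole keys, d[key] or key in d remains the direct route.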

Efficient search in a dictionary
