A more efficient way to look up dictionary values whose keys start with the same prefix

Time: 2015-08-05 19:33:04

Tags: python performance dictionary lookup startswith

I have a dictionary whose keys come in groups that share the same prefix, like this:

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

Given a query string, I need to look up all the values associated with keys that start with that prefix. For example, for query="key1" I need ["valA", "valB", "valC"].

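My implementation below works, but it is too slow for a large number of queries, because the dictionary d has about 30,000 keys and most keys are longer than 20 characters: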

result = [d[s] for s in d.keys() if s.startswith(query)]

Is there a faster or more efficient way to do this?

3 answers:

Answer 0 (score: 8)

You can avoid building the intermediate list produced by dict.keys() (in Python 2.x):

result = [d[key] for key in d if key.startswith(query)]
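
On Python 3, where d.keys() already returns a view rather than a list, a small variation is to iterate over d.items() so that each match does not need a second lookup. This is only a sketch of the same linear scan and does not change its O(n) cost per query; the helper name values_with_prefix is made up for illustration:

```python
def values_with_prefix(d, query):
    """Linear scan: return the values whose keys start with `query`."""
    return [value for key, value in d.items() if key.startswith(query)]


d = {"key1": "valA", "key123": "valB", "key1XY": "valC",
     "key2": "valD", "key2-22": "valE"}

result = values_with_prefix(d, "key1")
print(sorted(result))  # ['valA', 'valB', 'valC']
```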

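However, you most likely want to use a trie instead of a dictionary, so that you can find all values associated with keys that share a common prefix (a trie is a prefix-based tree).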

Here you can find several different trie implementations.

  

A trie for keys "A", "to", "tea", "ted", "ten", "i", "in", and "inn". (Source: Wikipedia)
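
To make the idea concrete, here is a minimal pure-Python sketch of a trie with a prefix query. It is not how PyTrie or datrie are implemented; the names TrieNode, PrefixTrie and values_with_prefix are hypothetical, and a real library will be considerably faster and more memory-efficient:

```python
class TrieNode:
    __slots__ = ("children", "value", "has_value")

    def __init__(self):
        self.children = {}      # maps one character to the child TrieNode
        self.value = None
        self.has_value = False  # True only if a key ends at this node


class PrefixTrie:
    def __init__(self, mapping=None):
        self.root = TrieNode()
        for key, value in (mapping or {}).items():
            self.insert(key, value)

    def insert(self, key, value):
        node = self.root
        for ch in key:
            child = node.children.get(ch)
            if child is None:
                child = node.children[ch] = TrieNode()
            node = child
        node.value = value
        node.has_value = True

    def values_with_prefix(self, prefix):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no key starts with this prefix
        # Collect every value stored in that subtree (iterative DFS).
        results = []
        stack = [node]
        while stack:
            current = stack.pop()
            if current.has_value:
                results.append(current.value)
            stack.extend(current.children.values())
        return results


d = {"key1": "valA", "key123": "valB", "key1XY": "valC",
     "key2": "valD", "key2-22": "valE"}
trie = PrefixTrie(d)
print(sorted(trie.values_with_prefix("key1")))
```

The query only touches the nodes along the prefix and the subtree below it, so its cost depends on the prefix length and the number of matches rather than on the total number of keys. The order of the returned values is not sorted in this sketch.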

Let's compare the timings of the different solutions:

# create a dictionary with 30k entries
d = {str(x):str(x) for x in xrange(1, 30001)}
query = '108'

# dict with keys()
%timeit [d[s] for s in d.keys() if s.startswith(query)]

    100 loops, best of 3: 8.87 ms per loop
# dict without keys()
%timeit [d[s] for s in d if s.startswith(query)]

    100 loops, best of 3: 7.83 ms per loop

# 11.72% improvement
# PyTrie (https://pypi.python.org/pypi/PyTrie/0.2)
import pytrie
pt = pytrie.Trie(d)

%timeit [pt[s] for s in pt.iterkeys(query)]

    1000 loops, best of 3: 320 µs per loop

# 96.36% improvement
# datrie (https://pypi.python.org/pypi/datrie/0.7)
import datrie
dt = datrie.Trie('0123456789')
for key, val in d.iteritems():
    dt[unicode(key)] = val

%timeit [dt[s] for s in dt.keys(unicode(query))]

    10000 loops, best of 3: 162 µs per loop

# 98.17% improvement

Answer 1 (score: 0)

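The sortedcontainers library has a SortedDict implementation. Once you have a sorted dict, you can use bisect_left to find where to start and bisect_right to find the last position, then use irange to get the keys in that range: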

from sortedcontainers import SortedDict
from operator import itemgetter
from itertools import takewhile


d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
  "key2":"valD", "key2-22":"valE","key3":"foo" }

key = "key2"
d = SortedDict(sorted(d.items(), key=itemgetter(0)))
start = d.bisect_left(key)
print([d[k] for k in takewhile(lambda x: x.startswith(key), d.irange(d.iloc[start]))])
['valD', 'valE']

Once you are maintaining a SortedDict, this is much more efficient:

In [68]: l = ["key{}".format(randint(1,1000000)) for _ in range(100000)] 
In [69]: l.sort()    
In [70]: d = SortedDict(zip(l,range(100000)))

In [71]: timeit [d[s] for s in d.keys() if s.startswith("key2")]
10 loops, best of 3: 124 ms per loop

In [72]: timeit [d[s] for s in d if s.startswith("key2")]
10 loops, best of 3: 24.6 ms per loop

In [73]: %%timeit
key = "key2"
start = d.bisect_left(key)
l2 =[d[k] for k in takewhile(lambda x: x.startswith("key2"),d.irange(d.iloc[start]))]
   ....: 

100 loops, best of 3: 5.57 ms per loop
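
The same "sort once, bisect, then scan while the prefix matches" idea can also be sketched with only the standard library, assuming the dictionary does not change between queries. The helper names build_index and values_with_prefix are made up for this example and are not part of sortedcontainers:

```python
from bisect import bisect_left


def build_index(d):
    """Sort the items once; return parallel lists of keys and values."""
    items = sorted(d.items())  # keys are unique, so values are never compared
    keys = [k for k, _ in items]
    values = [v for _, v in items]
    return keys, values


def values_with_prefix(keys, values, prefix):
    # First position whose key is >= prefix: matches, if any, start here.
    start = bisect_left(keys, prefix)
    # Keys sharing the prefix are contiguous in sorted order, so scan
    # forward until the first key that no longer matches.
    end = start
    while end < len(keys) and keys[end].startswith(prefix):
        end += 1
    return values[start:end]


d = {"key1": "valA", "key123": "valB", "key1XY": "valC",
     "key2": "valD", "key2-22": "valE", "key3": "foo"}
keys, values = build_index(d)
print(values_with_prefix(keys, values, "key2"))  # ['valD', 'valE']
```

Each query costs a binary search plus one step per match. If keys are inserted or removed frequently, a structure such as SortedDict that keeps itself sorted is likely the better fit.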

Answer 2 (score: 0)

You could use a suffix tree:

#!/usr/bin/env python2
from SuffixTree import SubstringDict # $ pip install https://github.com/JDonner/SuffixTree/archive/master.zip

d = { "key1":"valA", "key123":"valB", "key1XY":"valC",
      "key2":"valD", "key2-22":"valE" }

a = '\n' # anchor
prefixes = SubstringDict()
for key, value in d.items(): # populate the tree *once*
    prefixes[a + key] = value # assume there is no '\n' in key

for query in ["key1", "key2"]: # perform queries
    print query, prefixes[a + query]

Output:

key1 ['valC', 'valA', 'valB']
key2 ['valE', 'valD']