Question

I have a list of tuples, each with 3 members, as shown below:

[(-5092793511388848640, 'test1', 1),
 (-5092793511388848639, 'test0', 0), 
 (-5092793511388848638, 'test3', 3), 
 (-5092793511388848637, 'test2', 2), 
 (-5092793511388848636, 'test5', 5)]

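The tuples are sorted in ascending order by their first element, which is the hash of each key (e.g. 'test0'). I want a fast way to search these tuples, using a binary search on the hash values to find a specific key. The problem is that the fastest method I have found uses a for loop: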

def get(key, D, hasher=hash):
    '''
    Returns the value in the dictionary corresponding to the given key.

    Arguments:
    key -- desired key to retrieve the value of.
    D -- intended dictionary to retrieve value from.
    hasher -- the hash function to be used on the key.
    '''
    for item in D:
        if item[0] == hash(key):
            return item[2]
    raise TypeError('Key not found in the dictionary.')

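The function above seems very slow when searching longer lists of tuples, say a list of 6000 different tuples. It also breaks if there are any hash collisions. Is there a more efficient/faster way to search the list for the right tuple?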

Side note: I know that using dictionaries would be a much faster and simpler way to solve my problem, but I would like to avoid them.

Answer 1

First, hash the key once up front; don't do it over and over. Second, you can use next with an unpacking generator expression to optimize a bit:

def get(key, D, hasher=hash):
    keyhash = hasher(key)
    try:
        return next(v for hsh, k, v in D if keyhash == hsh and key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')

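That said, you claim to want a binary search, but the above is still a linear search, just optimized to avoid redundant work and to stop as soon as the desired key is found (it checks the hash first, on the assumption that key comparison is expensive, and only checks key equality on hash matches, since you complained about problems with duplicates). If the goal is a binary search, and D is sorted by hash code, you would want to use the bisect module. Doing so is not trivial (because bisect, at least before Python 3.10, does not take a key argument the way sorted does), but if you can split D into two parts, one holding just the hash codes and one holding the codes, keys and values, you could do: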

import bisect

def get(key, Dhashes, D, hasher=hash):
    keyhash = hasher(key)
    # Search whole list of hashes for beginning of range with correct hash
    start = bisect.bisect_left(Dhashes, keyhash)
    # Search for end point of correct hashes (limit to entries after start for speed)
    end = bisect.bisect_right(Dhashes, keyhash, start)
    try:
        # Linear search of only start->end indices for exact key
        return next(v for hsh, k, v in D[start:end] if key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')

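That gets you a true binary search, but as noted, it requires that the hash codes be separated ahead of time, before the search, from the list of (hashcode, key, value) tuples. Splitting off the hash codes at the time of each search would not be worth it, since the loop that splits them could just as well have found your desired value directly (splitting only pays off if you perform many searches at once).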
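As a rough sketch of the "split ahead of time" idea: the hash list can be built once, whenever D changes, and then reused for many lookups. The helper name build_hashes and the sample data are assumptions made for illustration, the code is written for Python 3, and it raises KeyError instead of TypeError as a choice of this sketch.

```python
import bisect

def build_hashes(D):
    """Extract the hash column once; D must already be sorted by hash."""
    return [hsh for hsh, _key, _value in D]

def get(key, Dhashes, D, hasher=hash):
    keyhash = hasher(key)
    # Binary search for the run of entries that share this hash
    start = bisect.bisect_left(Dhashes, keyhash)
    end = bisect.bisect_right(Dhashes, keyhash, start)
    # Linear scan of that (normally tiny) run for the exact key
    for _hsh, k, v in D[start:end]:
        if k == key:
            return v
    raise KeyError(key)

# Hypothetical sample data, sorted by hash as the question requires
D = sorted((hash(k), k, v) for k, v in [('test0', 0), ('test1', 1), ('test2', 2)])
Dhashes = build_hashes(D)  # pay the O(n) split once...
values = [get(k, Dhashes, D) for k in ('test0', 'test1', 'test2')]  # ...then reuse it
```

Dhashes has to be rebuilt (or updated in step) every time D is modified, which is part of the maintenance cost of this approach.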

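As Padraic notes in his answer, at the expense of giving up the C accelerator code, you could copy and modify the pure Python implementations of bisect.bisect_right and bisect.bisect_left, changing each use of a[mid] to a[mid][0]. That gives you bisect code that works directly on the list of (hashcode, key, value) tuples and does not require you to maintain a separate list of hashes. The memory savings might be worth the higher lookup cost. Don't use itertools.islice to perform the slicing, though, because islice with a start index iterates the whole list up to that point; true slicing only reads and copies what you care about. If you want to avoid the second bisect operation, you could always write your own Sequence-optimized islice and combine it with itertools.takewhile to get a similar effect without needing to precompute the end index. The code might look something like:

from itertools import takewhile

# Copied from bisect.bisect_left, with unused arguments removed and only
# index 0 of each tuple checked
def bisect_idx0_left(a, x):
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid][0] < x:
            lo = mid+1
        else:
            hi = mid
    return lo

def sequence_skipper(seq, start):
    return (seq[i] for i in xrange(start, len(seq)))

def get(key, D, hasher=hash):
    keyhash = hasher(key)
    # Search whole list of hashes for beginning of range with correct hash
    start = bisect_idx0_left(D, keyhash)
    # Make lazy iterator that skips start values in the list
    # and stops producing values when the hash stops matching
    hashmatches = takewhile(lambda x: keyhash == x[0], sequence_skipper(D, start))
    try:
        # Linear search of only indices with matching hashes for exact key
        return next(v for hsh, k, v in hashmatches if key == k)
    except StopIteration:
        raise TypeError('Key not found in the dictionary.')

Note: You could save even more work, at the cost of more memory, by having Dhashes actually hold (hashcode, key) pairs. Assuming uniqueness, this means a single bisect.bisect* call instead of two, and no need to scan between indices for a key match; either the binary search finds it or it is not there. As an example, I generated 1000 key-value pairs and stored them either as (hashcode, key, value) tuples in a list (sorted on the hashcode) or as a dict mapping keys to values. The keys were all 65-bit ints (long enough that the hash code is not a trivial self-mapping). Using the linear search code I provided at the top, finding the value located at index 321 took about 15 microseconds; with the binary search (having copied only the hashes to a separate list), it took just over 2 microseconds. Looking it up in the equivalent dict took about 55 nanoseconds; so the run-time overhead even for the binary search worked out to about 37x, and the linear search ran about 270x slower. And that is before we get into the increased memory cost, the increased code complexity, and the increased overhead of maintaining sorted order (assuming D is ever modified).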
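The idea of searching a list of (hashcode, key) pairs with a single bisect call can be sketched roughly as follows. The names build_index, pairs and values are assumptions for illustration, and the sketch additionally assumes that keys sharing a hash can be ordered against each other (for example, all strings), because tuple comparison falls through to the key when the hashes tie.

```python
import bisect

def build_index(D):
    """Return parallel lists: sorted (hash, key) pairs and their values."""
    ordered = sorted(D, key=lambda t: (t[0], t[1]))
    pairs = [(hsh, k) for hsh, k, _v in ordered]
    values = [v for _hsh, _k, v in ordered]
    return pairs, values

def get(key, pairs, values, hasher=hash):
    target = (hasher(key), key)
    # One binary search: the pair is either exactly at index i or absent
    i = bisect.bisect_left(pairs, target)
    if i < len(pairs) and pairs[i] == target:
        return values[i]
    raise KeyError(key)

# Hypothetical data; len is used as a deliberately weak hash to force a collision
D = [(len(k), k, v) for k, v in [('ab', 1), ('cd', 2), ('efg', 3)]]
pairs, values = build_index(D)
hit = get('cd', pairs, values, hasher=len)
```

Here 'ab' and 'cd' share the hash 2, yet no scan between indices is needed because the key takes part in the ordering.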

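Lastly: you say "I would like to avoid using [dicts]", but give no explanation as to why. A dict is the correct way to solve a problem like this. Assuming no self-hashing (i.e. the key is an int that hashes to itself, which might save the cost of storing the hash code), the memory overhead of the list of tuples alone (not counting a separate list of hash codes) would be roughly twice that of a simple dict mapping keys to values. A dict would also prevent accidentally storing duplicates, and it has ~O(1) insertion cost (even with bisect, an insertion that maintains sorted order has an ~O(log n) lookup plus O(n) memory-move cost) and ~O(1) lookup cost (versus ~O(log n) with bisect). Beyond the big-O differences, it does all of its work in heavily optimized C built-in code, so the real savings would be greater still.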
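To get a feel for the gap on your own machine, a small timing sketch along these lines may help. It is not the benchmark from the answer: it uses string keys rather than 65-bit ints, the names get_bisect, as_dict and probe are assumptions, and the absolute numbers will differ between machines and Python versions.

```python
import bisect
import timeit

items = [('key%d' % i, i) for i in range(1000)]
D = sorted((hash(k), k, v) for k, v in items)  # list of tuples, sorted by hash
Dhashes = [t[0] for t in D]                    # separate hash list for bisect
as_dict = dict(items)                          # the plain dict equivalent

def get_bisect(key):
    keyhash = hash(key)
    start = bisect.bisect_left(Dhashes, keyhash)
    end = bisect.bisect_right(Dhashes, keyhash, start)
    for _hsh, k, v in D[start:end]:
        if k == key:
            return v
    raise KeyError(key)

probe = 'key321'
t_bisect = timeit.timeit(lambda: get_bisect(probe), number=20000)
t_dict = timeit.timeit(lambda: as_dict[probe], number=20000)
print('bisect: %.4fs  dict: %.4fs' % (t_bisect, t_dict))
```

Both lookups return the same value; only the cost of getting there differs.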

Answer 2

You can modify bisect to check only the first element of each tuple:

def bisect_left(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi) // 2
        if a[mid][0] < x:
            lo = mid+1
        else: hi = mid
    return lo

def get_bis(key, d):
    h = hash(key)
    ind = bisect_left(d, h)
    if ind == -1:
        raise KeyError()
    for i in xrange(ind, len(d)):
        if d[i][0] != h:
            raise KeyError()
        if d[i][1] == key:
            return d[i][2]
    raise KeyError()

Throw in some collisions and it does what it should:

In [41]: l = [(-5092793511388848640, 'test1', 1), (-5092793511388848639, 'test9', 0), (-5092793511388848639, 'test0', 3), (-5092793511388848637, 'test2', 2), (-5092793511388848636, 'test5', 5)]

In [42]: get("test0", l)
Out[42]: 3

In [43]: get("test1", l)
Out[43]: 1

In [44]: get(-5092793511388848639, l)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-44-81e928da1ac8> in <module>()
----> 1 get(-5092793511388848639, l)

<ipython-input-30-499e71432196> in get(key, d)
      6     for sub in islice(d, ind, None):
      7         if sub[0] != h:
----> 8             raise KeyError()
      9         if sub[1] == key:
     10             return sub

KeyError:

Some timings:

In [91]: l = sorted((hash(s), s,randint(1,100000)) for s in ("".join(sample(ascii_letters,randint(10,26))) for _ in xrange(1000000)))

In [92]: l[-1]
Out[92]: (9223342880888029755, 'FocWPinpYZXjHhBqRkJxQeGMa', 43768)

In [93]: timeit get_bis(l[-1][1],l)
100000 loops, best of 3: 5.29 µs per loop

In [94]: l[250000]
Out[94]: (-4616437486317828880, 'qXsybdhFPLczWwCQkm', 86136)

In [95]: timeit get_bis(l[250000][1],l)
100000 loops, best of 3: 4.4 µs per loop

In [96]: l[750000]
Out[96]: (4623630109115829672, 'dlQewhpMoBGmn', 39904)

In [97]: timeit get_bis(l[750000][1],l)
100000 loops, best of 3: 4.46 µs per loop

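To get a better idea you would have to throw in collisions, but finding the section of the list where the hash may belong is pretty efficient.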
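One way to exercise the collision path is to plug in a deliberately weak hash function. The sketch below is a Python 3 variant of the approach, with a hasher parameter and the tiny_hash function added as assumptions purely for the experiment.

```python
def bisect_left_idx0(a, x):
    """bisect_left that compares x against index 0 of each tuple."""
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid][0] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

def get_bis(key, d, hasher=hash):
    h = hasher(key)
    for i in range(bisect_left_idx0(d, h), len(d)):
        if d[i][0] != h:
            break  # past the entries sharing this hash
        if d[i][1] == key:
            return d[i][2]
    raise KeyError(key)

def tiny_hash(key):
    """Deliberately weak hash (hypothetical): only 4 buckets, so collisions abound."""
    return sum(map(ord, key)) % 4

keys = ['test%d' % i for i in range(20)]
d = sorted((tiny_hash(k), k, i) for i, k in enumerate(keys))
found = [get_bis(k, d, hasher=tiny_hash) for k in keys]
```

With only four distinct hashes, each lookup degrades to a scan over several colliding entries, which shows how much the approach depends on a well-distributed hash.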

Just typing a few variables and compiling with cython:

def cython_bisect_left(a, long x, long lo=0):
   if lo < 0:
       raise ValueError('lo must be non-negative')
   cdef long hi = len(a)
   while lo < hi:
       mid = (lo + hi) // 2
       if a[mid][0] < x:
           lo = mid + 1
       else:
           hi = mid
   return lo
def cython_get(str key, d):
   cdef long h = hash(key)
   cdef ind = cython_bisect_left(d, h)
   if ind == -1:
       raise KeyError()
   for i in xrange(ind, len(d)):
       if d[i][0] != h:
           raise KeyError()
       if d[i][1] == key:
           return d[i][2]
   raise KeyError()

That gets us down to almost 1 microsecond:

In [13]: timeit cython_get(l[-1][1],l)
The slowest run took 40.77 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.44 µs per loop

In [14]: timeit cython_get(l[250000][1],l)
1000000 loops, best of 3: 1.33 µs per loop

In [15]: timeit cython_get(l[750000][1],l)
1000000 loops, best of 3: 1.33 µs per loop

Answer 3

Try using a list comprehension. I'm not sure it is the most efficient way, but it is the pythonic way and quite effective!
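A minimal sketch of what a list-comprehension lookup could look like. This is an assumption for illustration, not necessarily the answerer's code, and the sample data is made up.

```python
def get(key, D, hasher=hash):
    keyhash = hasher(key)
    # Collect every value whose hash and key both match (still a linear scan)
    matches = [v for hsh, k, v in D if hsh == keyhash and k == key]
    if not matches:
        raise KeyError(key)
    return matches[0]

# Hypothetical sample data
D = [(hash(k), k, v) for k, v in [('test0', 0), ('test1', 1)]]
result = get('test1', D)
```

Unlike next() over a generator expression, the comprehension always walks the entire list, so it remains O(n) and does not stop at the first match.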


What is an efficient way to search a list of lists based on a hash value?
