Python hashable dicts

Date: 2009-07-20 04:04:30

Tags: python

As an exercise, and mostly for my own amusement, I'm implementing a backtracking packrat parser. The inspiration for this is that I'd like to get a better idea of how hygienic macros would work in an Algol-like language (as opposed to the syntax-free Lisp dialects you normally find them in). Because of this, different passes through the input may see different grammars, so cached parse results are invalid unless I also store the current version of the grammar along with the cached parse results. (EDIT: a consequence of this use of key-value collections is that they should be immutable, but I don't intend to expose an interface that allows them to be changed, so either mutable or immutable collections would be fine.)

The problem is that Python dicts cannot be used as keys of other dicts. Even wrapping them in a tuple (as I was going to do) doesn't help.

>>> cache = {}
>>> rule = {"foo":"bar"}
>>> cache[(rule, "baz")] = "quux"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>> 

I guess it has to be tuples all the way down. The Python standard library provides approximately what I need: collections.namedtuple has a very different syntax, but it can be used as a key. Continuing from the session above:

>>> from collections import namedtuple
>>> Rule = namedtuple("Rule",rule.keys())
>>> cache[(Rule(**rule), "baz")] = "quux"
>>> cache
{(Rule(foo='bar'), 'baz'): 'quux'}

OK. But I would have to make a class for each possible combination of keys in the rules I want to use, which isn't so bad, because each parse rule knows exactly what parameters it uses, so the class can be defined at the same time as the function that parses the rule.
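
For instance, a minimal sketch of that idea (the rule name, fields and parse function here are made up purely for illustration):

from collections import namedtuple

# A hypothetical parse rule type, defined right next to the function that
# parses it, so the exact set of parameters is known when the class is created.
SequenceRule = namedtuple("SequenceRule", ["element", "separator"])

cache = {}

def parse_sequence(rule, text, pos):
    key = (rule, pos)           # namedtuples are hashable, so this tuple is a valid key
    if key in cache:
        return cache[key]
    result = ("sequence", pos)  # placeholder for the real parsing work
    cache[key] = result
    return result

parse_sequence(SequenceRule(element="digit", separator=","), "1,2,3", 0)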

Edit: An additional problem with namedtuples is that they are strictly positional. Two tuples that look like they should be different can in fact be the same:

>>> you = namedtuple("foo",["bar","baz"])
>>> me = namedtuple("foo",["bar","quux"])
>>> you(bar=1,baz=2) == me(bar=1,quux=2)
True
>>> bob = namedtuple("foo",["baz","bar"])
>>> you(bar=1,baz=2) == bob(bar=1,baz=2)
False

tl;dr: How do I get dicts that can be used as keys of other dicts?

Having hacked a bit on the answers, here's the more complete solution I'm using. Note that it does a bit of extra work to make the resulting dicts vaguely immutable for most practical purposes. Of course it's still quite easy to hack around it by calling dict.__setitem__(instance, key, value), but we're all adults here.

class hashdict(dict):
    """
    hashable dict implementation, suitable for use as a key into
    other dicts.

        >>> h1 = hashdict({"apples": 1, "bananas":2})
        >>> h2 = hashdict({"bananas": 3, "mangoes": 5})
        >>> h1+h2
        hashdict(apples=1, bananas=3, mangoes=5)
        >>> d1 = {}
        >>> d1[h1] = "salad"
        >>> d1[h1]
        'salad'
        >>> d1[h2]
        Traceback (most recent call last):
        ...
        KeyError: hashdict(bananas=3, mangoes=5)

    based on answers from
       http://stackoverflow.com/questions/1151658/python-hashable-dicts

    """
    def __key(self):
        return tuple(sorted(self.items()))
    def __repr__(self):
        return "{0}({1})".format(self.__class__.__name__,
            ", ".join("{0}={1}".format(
                    str(i[0]),repr(i[1])) for i in self.__key()))

    def __hash__(self):
        return hash(self.__key())
    def __setitem__(self, key, value):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def __delitem__(self, key):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def clear(self):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def pop(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def popitem(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def setdefault(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    def update(self, *args, **kwargs):
        raise TypeError("{0} does not support item assignment"
                         .format(self.__class__.__name__))
    # update is not ok because it mutates the object
    # __add__ is ok because it creates a new object
    # while the new object is under construction, it's ok to mutate it
    def __add__(self, right):
        result = hashdict(self)
        dict.update(result, right)
        return result

if __name__ == "__main__":
    import doctest
    doctest.testmod()
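
With this class, the failing cache example from the top of the post goes through; a quick illustrative session:

>>> cache = {}
>>> rule = hashdict({"foo": "bar"})
>>> cache[(rule, "baz")] = "quux"
>>> cache[(hashdict({"foo": "bar"}), "baz")]
'quux'
>>> rule["foo"] = "changed"
Traceback (most recent call last):
...
TypeError: hashdict does not support item assignment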

9 Answers:

Answer 0 (score: 58):

Here is the easy way to make a hashable dictionary. Just remember, for obvious reasons, not to mutate it after embedding it in another dictionary.

class hashabledict(dict):
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
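
For example, since the hash is built from the sorted items, two instances with the same contents (regardless of insertion order) hash alike and can find each other's entries; a brief illustrative session:

>>> d = {hashabledict(a=1, b=2): "value"}
>>> d[hashabledict(b=2, a=1)]
'value'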

Answer 1 (score: 57):

Hashables should be immutable -- this doesn't enforce that, but it TRUSTS you not to mutate a dict after its first use as a key, in which case the following approach works:

class hashabledict(dict):
  def __key(self):
    return tuple((k,self[k]) for k in sorted(self))
  def __hash__(self):
    return hash(self.__key())
  def __eq__(self, other):
    return self.__key() == other.__key()

If you need to mutate your dicts and STILL want to use them as keys, complexity explodes a hundredfold -- not to say it can't be done, but I'll wait for a very specific indication before I wade into THAT incredible mire!-)
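
To see why mutation is the problem, note that the hash is derived from the items, so changing the dict changes its hash, and any table that already stored it under the old hash can no longer find it reliably. A minimal sketch:

>>> h = hashabledict(a=1)
>>> before = hash(h)
>>> h["b"] = 2
>>> hash(h) == before
False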

Answer 2 (score: 30):

All that is needed to make dictionaries usable for your purpose is to add a __hash__ method:

class Hashabledict(dict):
    def __hash__(self):
        return hash(frozenset(self))

Note that the frozenset conversion works for all dictionaries (i.e. it doesn't require the keys to be sortable). Likewise, there is no restriction on the dictionary values.

If there are many dictionaries with identical keys but distinct values, it is necessary to have the hash take the values into account. The fastest way to do that is:

class Hashabledict(dict):
    def __hash__(self):
        return hash((frozenset(self), frozenset(self.itervalues())))

This is quicker than frozenset(self.iteritems()) for two reasons. First, the frozenset(self) step reuses the hash values stored in the dictionary, saving unnecessary calls to hash(key). Second, using itervalues accesses the values directly and avoids the many memory-allocator calls that items makes to build new key/value tuples in memory on every lookup.
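
To make the trade-off concrete, here is a small illustrative session: with the keys-only hash, dictionaries that share keys but differ in values hash identically, so they collide in a table (lookups remain correct via equality, just slower):

>>> hash(Hashabledict(a=1)) == hash(Hashabledict(a=2))
True
>>> d = {Hashabledict(a=1): "one", Hashabledict(a=2): "two"}
>>> d[Hashabledict(a=2)]
'two'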

Answer 3 (score: 23):

The given answers are okay, but they could be improved by using frozenset(...) instead of tuple(sorted(...)) to generate the hash:

>>> import timeit
>>> timeit.timeit('hash(tuple(sorted(d.iteritems())))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
4.7758948802947998
>>> timeit.timeit('hash(frozenset(d.iteritems()))', "d = dict(a=3, b='4', c=2345, asdfsdkjfew=0.23424, x='sadfsadfadfsaf')")
1.8153600692749023

The performance advantage depends on the contents of the dictionary, but in most cases I've tested, hashing with frozenset is at least 2 times faster (mainly because it doesn't need to sort).

Answer 4 (score: 11):

A reasonably clean, straightforward implementation:

import collections

class FrozenDict(collections.Mapping):
    """Don't forget the docstrings!!"""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        return hash(tuple(sorted(self._d.iteritems())))
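
A quick sketch of how it behaves (illustrative only; under Python 3 the same idea works with collections.abc.Mapping and self._d.items()). The Mapping ABC supplies the read-only dict interface, and the explicit __hash__ makes instances usable as keys:

>>> fd = FrozenDict(apples=1, bananas=2)
>>> fd["apples"]
1
>>> len(fd), sorted(fd)
(2, ['apples', 'bananas'])
>>> {fd: "fruit salad"}[FrozenDict(apples=1, bananas=2)]
'fruit salad'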

Answer 5 (score: 7):

I keep coming back to this topic... Here's another variation. I'm uneasy about subclassing dict just to add a __hash__ method; there's virtually no escaping the problem that dicts are mutable, and trusting that they won't change seems like a weak idea. So instead I looked at building a mapping based on a built-in type that is itself immutable. Although tuple is an obvious choice, accessing values in it implies a sort and a bisect; not a problem, but it doesn't seem to leverage much of the power of the type it's built on.

What if you jam the key/value pairs into a frozenset? What would that require, and how would it work?

Part 1: you need a way of encoding the 'items' such that the frozenset treats them mainly by their keys; I'll make a little class for that.

import collections
class pair(collections.namedtuple('pair_base', 'key value')):
    def __hash__(self):
        return hash((self.key, None))
    def __eq__(self, other):
        if type(self) != type(other):
            return NotImplemented
        return self.key == other.key
    def __repr__(self):
        return repr((self.key, self.value))

This alone gets you within spitting distance of an immutable mapping:

>>> frozenset(pair(k, v) for k, v in enumerate('abcd'))
frozenset([(0, 'a'), (2, 'c'), (1, 'b'), (3, 'd')])
>>> pairs = frozenset(pair(k, v) for k, v in enumerate('abcd'))
>>> pair(2, None) in pairs
True
>>> pair(5, None) in pairs
False
>>> goal = frozenset((pair(2, None),))
>>> pairs & goal
frozenset([(2, None)])

D'oh! Unfortunately, when you use the set operators and the elements are equal but not the same object, which one ends up in the return value is undefined; we'll have to go through a few more gyrations.

>>> pairs - (pairs - goal)
frozenset([(2, 'c')])
>>> iter(pairs - (pairs - goal)).next().value
'c'

However, looking up values this way is cumbersome and, worse, creates lots of intermediate sets; that won't do! We'll create a 'fake' key/value pair to get around it:

class Thief(object):
    def __init__(self, key):
        self.key = key
    def __hash__(self):
        return hash(pair(self.key, None))
    def __eq__(self, other):
        self.value = other.value
        return pair(self.key, None) == other

Which turns out to be less problematic:

>>> thief = Thief(2)
>>> thief in pairs
True
>>> thief.value
'c'

That's all the deep magic; the rest is wrapping it all up into something that has an interface like a dict. Since we're subclassing frozenset, which has a very different interface, there are quite a lot of methods; we get a little help from collections.Mapping, but most of the work is overriding the frozenset methods with versions that work like a dict instead:

class FrozenDict(frozenset, collections.Mapping):
    def __new__(cls, seq=()):
        return frozenset.__new__(cls, (pair(k, v) for k, v in seq))
    def __getitem__(self, key):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        raise KeyError(key)
    def __eq__(self, other):
        if not isinstance(other, FrozenDict):
            return dict(self.iteritems()) == other
        if len(self) != len(other):
            return False
        for key, value in self.iteritems():
            try:
                if value != other[key]:
                    return False
            except KeyError:
                return False
        return True
    def __hash__(self):
        return hash(frozenset(self.iteritems()))
    def get(self, key, default=None):
        thief = Thief(key)
        if frozenset.__contains__(self, thief):
            return thief.value
        return default
    def __iter__(self):
        for item in frozenset.__iter__(self):
            yield item.key
    def iteritems(self):
        for item in frozenset.__iter__(self):
            yield (item.key, item.value)
    def iterkeys(self):
        for item in frozenset.__iter__(self):
            yield item.key
    def itervalues(self):
        for item in frozenset.__iter__(self):
            yield item.value
    def __contains__(self, key):
        return frozenset.__contains__(self, pair(key, None))
    has_key = __contains__
    def __repr__(self):
        return type(self).__name__ + (', '.join(repr(item) for item in self.iteritems())).join('()')
    @classmethod
    def fromkeys(cls, keys, value=None):
        return cls((key, value) for key in keys)

Which does, ultimately, answer my own question:

>>> myDict = {}
>>> myDict[FrozenDict(enumerate('ab'))] = 5
>>> FrozenDict(enumerate('ab')) in myDict
True
>>> FrozenDict(enumerate('bc')) in myDict
False
>>> FrozenDict(enumerate('ab', 3)) in myDict
False
>>> myDict[FrozenDict(enumerate('ab'))]
5

Answer 6 (score: 4):

The accepted answer by @Unknown, as well as the answer by @AlexMartelli, work perfectly fine, but only under the following constraints:

  1. The dictionary's values must be hashable. For example, hash(hashabledict({'a':[1,2]})) will raise a TypeError.
  2. The keys must support comparison. For example, hash(hashabledict({'a':'a', 1:1})) will raise a TypeError.
  3. The comparison operators on the keys must impose a total ordering. For example, if two keys in a dictionary are frozenset((1,2,3)) and frozenset((4,5,6)), they compare unequal in both directions. Sorting the items of a dictionary with such keys may therefore produce an arbitrary order, violating the rule that equal objects must have the same hash value.

The much faster answer by @ObenSonne lifts constraints 2 and 3, but is still bound by constraint 1 (the values must be hashable).

The faster yet answer by @RaymondHettinger lifts all 3 constraints because it does not include .values() in the hash calculation. However, its performance is good only if most of the (non-equal) dictionaries that need to be hashed do not have identical .keys(). If that condition isn't satisfied, the hash function is still valid, but it may cause too many collisions. For example, in the extreme case where all the dictionaries are generated from a website template (field names as keys, user input as values), the keys are always the same, and the hash function returns the same value for every input. As a result, a hash table that relies on such a hash function becomes as slow as a list when retrieving an item (O(N) instead of O(1)).

I think the following solution will work reasonably well even if all 4 of the constraints listed above are violated. It has the additional advantage that it can hash not only dictionaries, but any containers, even if they contain nested mutable containers.

I'd much appreciate any feedback on this, since I've only tested it lightly so far.

      # python 3.4
      import collections
      import operator
      import sys
      import itertools
      import reprlib
      
      # a wrapper to make an object hashable, while preserving equality
      class AutoHash:
          # for each known container type, we can optionally provide a tuple
          # specifying: type, transform, aggregator
          # even immutable types need to be included, since their items
          # may make them unhashable
      
          # transformation may be used to enforce the desired iteration
          # the result of a transformation must be an iterable
          # default: no change; for dictionaries, we use .items() to see values
      
          # usually transformation choice only affects efficiency, not correctness
      
          # aggregator is the function that combines all items into one object
          # default: frozenset; for ordered containers, we can use tuple
      
          # aggregator choice affects both efficiency and correctness
          # e.g., using a tuple aggregator for a set is incorrect,
          # since identical sets may end up with different hash values
          # frozenset is safe since at worst it just causes more collisions
          # unfortunately, no collections.ABC class is available that helps
          # distinguish ordered from unordered containers
          # so we need to just list them out manually as needed
      
          type_info = collections.namedtuple(
              'type_info',
              'type transformation aggregator')
      
          ident = lambda x: x
          # order matters; first match is used to handle a datatype
          known_types = (
              # dict also handles defaultdict
              type_info(dict, lambda d: d.items(), frozenset), 
              # no need to include set and frozenset, since they are fine with defaults
              type_info(collections.OrderedDict, ident, tuple),
              type_info(list, ident, tuple),
              type_info(tuple, ident, tuple),
              type_info(collections.deque, ident, tuple),
              type_info(collections.Iterable, ident, frozenset) # other iterables
          )
      
          # hash_func can be set to replace the built-in hash function
          # cache can be turned on; if it is, cycles will be detected,
          # otherwise cycles in a data structure will cause failure
          def __init__(self, data, hash_func=hash, cache=False, verbose=False):
              self._data=data
              self.hash_func=hash_func
              self.verbose=verbose
              self.cache=cache
              # cache objects' hashes for performance and to deal with cycles
              if self.cache:
                  self.seen={}
      
          def hash_ex(self, o):
              # note: isinstance(o, Hashable) won't check inner types
              try:
                  if self.verbose:
                      print(type(o),
                          reprlib.repr(o),
                          self.hash_func(o),
                          file=sys.stderr)
                  return self.hash_func(o)
              except TypeError:
                  pass
      
              # we let built-in hash decide if the hash value is worth caching
              # so we don't cache the built-in hash results
              if self.cache and id(o) in self.seen:
                  return self.seen[id(o)][0] # found in cache
      
              # check if o can be handled by decomposing it into components
              for typ, transformation, aggregator in AutoHash.known_types:
                  if isinstance(o, typ):
                      # another option is:
                      # result = reduce(operator.xor, map(_hash_ex, handler(o)))
                      # but collisions are more likely with xor than with frozenset
                      # e.g. hash_ex([1,2,3,4])==0 with xor
      
                      try:
                          # try to frozenset the actual components, it's faster
                          h = self.hash_func(aggregator(transformation(o)))
                      except TypeError:
                          # components not hashable with built-in;
                          # apply our extended hash function to them
                          h = self.hash_func(aggregator(map(self.hash_ex, transformation(o))))
                      if self.cache:
                          # storing the object too, otherwise memory location will be reused
                          self.seen[id(o)] = (h, o)
                      if self.verbose:
                          print(type(o), reprlib.repr(o), h, file=sys.stderr)
                      return h
      
              raise TypeError('Object {} of type {} not hashable'.format(repr(o), type(o)))
      
          def __hash__(self):
              return self.hash_ex(self._data)
      
          def __eq__(self, other):
              # short circuit to save time
              if self is other:
                  return True
      
              # 1) type(self) a proper subclass of type(other) => self.__eq__ will be called first
              # 2) any other situation => lhs.__eq__ will be called first
      
              # case 1. one side is a subclass of the other, and AutoHash.__eq__ is not overridden in either
              # => the subclass instance's __eq__ is called first, and we should compare self._data and other._data
              # case 2. neither side is a subclass of the other; self is lhs
              # => we can't compare to another type; we should let the other side decide what to do, return NotImplemented
              # case 3. neither side is a subclass of the other; self is rhs
              # => we can't compare to another type, and the other side already tried and failed;
              # we should return False, but NotImplemented will have the same effect
              # any other case: we won't reach the __eq__ code in this class, no need to worry about it
      
              if isinstance(self, type(other)): # identifies case 1
                  return self._data == other._data
              else: # identifies cases 2 and 3
                  return NotImplemented
      
      d1 = {'a':[1,2], 2:{3:4}}
      print(hash(AutoHash(d1, cache=True, verbose=True)))
      
      d = AutoHash(dict(a=1, b=2, c=3, d=[4,5,6,7], e='a string of chars'),cache=True, verbose=True)
      print(hash(d))
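
As a quick illustrative sketch of how this could serve the original caching use case (the rule contents below are made up), two AutoHash wrappers around equal data hash and compare equal even when the data contains unhashable nested containers, so they work as dict keys:

# illustration: AutoHash-wrapped dicts (with unhashable values) as cache keys
rule = {'name': 'expr', 'alternatives': [['term', '+', 'expr'], ['term']]}
cache = {}
cache[(AutoHash(rule), 42)] = 'cached parse result'
same_rule = {'name': 'expr', 'alternatives': [['term', '+', 'expr'], ['term']]}
print((AutoHash(same_rule), 42) in cache)   # True: equal data gives equal hash and equality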
      

Answer 7 (score: 2):

You might also want to add these two methods to get the v2 pickling protocol working with hashdict instances. Otherwise cPickle will try to use hashdict.__setitem__, resulting in a TypeError. Interestingly, with the other two protocol versions your code works fine.

def __setstate__(self, objstate):
    for k,v in objstate.items():
        dict.__setitem__(self,k,v)
def __reduce__(self):
    return (hashdict, (), dict(self),)
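
For example, a minimal sketch assuming these two methods have been added to the hashdict class from the question; a protocol-2 round trip then works:

import pickle

h = hashdict({"apples": 1, "bananas": 2})
data = pickle.dumps(h, protocol=2)      # __reduce__ avoids the blocked __setitem__
restored = pickle.loads(data)           # rebuilt via __setstate__ / dict.__setitem__
assert restored == h and hash(restored) == hash(h)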

Answer 8 (score: -2):

If you don't put numbers in the dictionary, and you never lose the variables containing your dictionaries, you can do this:

cache[id(rule)] = "whatever"

since id() is unique for every dictionary.

Edit:

Oh sorry, yeah, in that case what the others said would be better. I think you could also serialize your dictionaries into a string, like

cache[ 'foo:bar' ] = 'baz'

If you need to recover the dictionary from the key, then you'd have to do something uglier like

cache[ 'foo:bar' ] = ( {'foo':'bar'}, 'baz' )

I guess the advantage of this is that you wouldn't have to write as much code.
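
A sketch of the serialization idea (the helper name is mine, and it assumes the dict's keys are sortable and its values have a stable repr):

def dict_key(d):
    # deterministic string form of a dict, usable as a hash key
    return repr(sorted(d.items()))

cache = {}
rule = {"foo": "bar"}
# keep the original dict alongside the value if you need to recover it later
cache[(dict_key(rule), "baz")] = ({"foo": "bar"}, "quux")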