Intersecting two dictionaries in Python

Question

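I am working on a search program over an inverted index. The index itself is a dictionary whose keys are terms and whose values are themselves dictionaries of short documents, with ID numbers as keys and their text content as values.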

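To perform an 'AND' search for two terms, I need to intersect their postings lists (dictionaries). What is a clear (not necessarily overly clever) way to do this in Python? I started out by trying it with iter: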

p1 = index[term1]  
p2 = index[term2]
i1 = iter(p1)
i2 = iter(p2)
while ...  # not sure of the 'iter != end' syntax in this case
...

Answer 1

A little-known fact is that you don't need to construct sets to do this:

In Python 2:

In [78]: d1 = {'a': 1, 'b': 2}

In [79]: d2 = {'b': 2, 'c': 3}

In [80]: d1.viewkeys() & d2.viewkeys()
Out[80]: {'b'}

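In Python 3, replace viewkeys with keys; the same renaming applies to viewvalues (values) and viewitems (items). Note that only the keys and items views are set-like; the values view does not support set operators.

From the documentation of viewitems: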

In [113]: d1.viewitems??
Type:       builtin_function_or_method
String Form:<built-in method viewitems of dict object at 0x64a61b0>
Docstring:  D.viewitems() -> a set-like object providing a view on D's items

For larger dicts this is also slightly faster than constructing sets and then intersecting them:

In [122]: d1 = {i: rand() for i in range(10000)}

In [123]: d2 = {i: rand() for i in range(10000)}

In [124]: timeit d1.viewkeys() & d2.viewkeys()
1000 loops, best of 3: 714 µs per loop

In [125]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2

1000 loops, best of 3: 805 µs per loop

For smaller `dict`s `set` construction is faster:

In [126]: d1 = {'a': 1, 'b': 2}

In [127]: d2 = {'b': 2, 'c': 3}

In [128]: timeit d1.viewkeys() & d2.viewkeys()
1000000 loops, best of 3: 591 ns per loop

In [129]: %%timeit
s1 = set(d1)
s2 = set(d2)
res = s1 & s2

1000000 loops, best of 3: 477 ns per loop

We are comparing nanoseconds here, which may or may not matter to you. In any case you get a set back, so using viewkeys / keys removes a bit of clutter.
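Applied to the inverted index from the question, the Python 3 form of this could look like the sketch below. The toy `index` contents and the name `both` are made up for illustration; only the `keys()` view intersection comes from the answer above.

```python
# Hypothetical toy index: term -> {doc_id: text}
index = {
    "python": {1: "python dict views", 2: "python sets", 4: "python iterators"},
    "dict": {1: "python dict views", 3: "dict methods"},
}

p1 = index["python"]
p2 = index["dict"]

# keys() returns a set-like view, so '&' works without building sets first
common_ids = p1.keys() & p2.keys()

# Rebuild a postings dict restricted to the shared document IDs
both = {doc_id: p1[doc_id] for doc_id in common_ids}
print(both)
```

The result of `&` is a plain `set`, so it can be passed on to any further set operation.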

Answer 2

In [1]: d1 = {'a':1, 'b':4, 'f':3}

In [2]: d2 = {'a':1, 'b':4, 'd':2}

In [3]: d = {x:d1[x] for x in d1 if x in d2}

In [4]: d
Out[4]: {'a': 1, 'b': 4}

Answer 3

Intersections of sets are easy to compute, so create sets from the keys and intersect those:

keys_a = set(dict_a.keys())
keys_b = set(dict_b.keys())
intersection = keys_a & keys_b # '&' operator is used for set intersection

Answer 4

In Python 3, you can use

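# All of these compare (key, value) pairs, so the values must be hashable
intersection = dict(dict1.items() & dict2.items())
# If a key has different values in the two dicts, which value wins is arbitrary
union = dict(dict1.items() | dict2.items())
# '^' is the symmetric difference; use '-' for the plain difference
symmetric_difference = dict(dict1.items() ^ dict2.items())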
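One thing to keep in mind is that the items views compare whole (key, value) pairs, not just keys. A small sketch with made-up data (`dict1`, `dict2` and the result names are only for illustration) shows what that means when a key exists in both dicts with different values:

```python
dict1 = {"a": 1, "b": 2, "c": 3}
dict2 = {"b": 2, "c": 30, "d": 4}

# Only (key, value) pairs that are identical in both dicts survive,
# so 'c' is dropped even though the key is present in both
common = dict(dict1.items() & dict2.items())

# Pairs found in exactly one of the dicts; 'c' shows up twice here,
# once per value, so converting this set to a dict would lose one of them
sym_diff_pairs = dict1.items() ^ dict2.items()

# Pairs of dict1 that are not also in dict2
only_in_1 = dict(dict1.items() - dict2.items())

print(common, sym_diff_pairs, only_in_1)
```

If you only care about shared keys, intersecting `keys()` is the safer choice.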

Answer 5

Wrap the dictionary instances in a simple class that returns both of the values you want:

class DictionaryIntersection(object):
    def __init__(self,dictA,dictB):
        self.dictA = dictA
        self.dictB = dictB

    def __getitem__(self,attr):
        if attr not in self.dictA or attr not in self.dictB:
            raise KeyError('Not in both dictionaries,key: %s' % attr)

        return self.dictA[attr],self.dictB[attr]

x = {'foo' : 5, 'bar' :6}
y = {'bar' : 'meow' , 'qux' : 8}

z = DictionaryIntersection(x,y)

print(z['bar'])  # (6, 'meow')

Answer 6

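OK, here is a generalized version of the code above in Python 3. It relies on comprehensions and set-like dict views, which are fast enough.

The function intersects arbitrarily many dicts and returns a dict with the common keys and, for each common key, a set of the corresponding values: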

def dict_intersect(*dicts):
    comm_keys = dicts[0].keys()
    for d in dicts[1:]:
        # intersect keys first
        comm_keys &= d.keys()
    # then build a result dict with nested comprehension
    result = {key:{d[key] for d in dicts} for key in comm_keys}
    return result

Usage example:

a = {1: 'ba', 2: 'boon', 3: 'spam', 4:'eggs'}
b = {1: 'ham', 2:'baboon', 3: 'sausages'}
c = {1: 'more eggs', 3: 'cabbage'}

res = dict_intersect(a, b, c)
# Here is res (the order of values may vary) :
# {1: {'ham', 'more eggs', 'ba'}, 3: {'spam', 'sausages', 'cabbage'}}

The dict values must be hashable here; if they are not, you can simply change the set braces {} to list brackets []:

result = {key:[d[key] for d in dicts] for key in comm_keys}

Answer 7

Your question is not precise enough to give a single answer.

1. Key intersection

If you want to intersect the IDs of the postings (credits to James), do:

common_ids = p1.keys() & p2.keys()

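However, if you want to iterate over the documents, you have to decide which posting has priority; I assume it is p1. To iterate over the documents for common_ids, collections.ChainMap is the most useful: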

from collections import ChainMap
# Iterating a ChainMap directly yields keys only, so use .items()
intersection = {id: document
                for id, document in ChainMap(p1, p2).items()
                if id in common_ids}
for id, document in intersection.items():
    ...

Or, if you don't want to create a separate intersection dictionary:

from collections import ChainMap
posts = ChainMap(p1, p2)
for id in common_ids:
    document = posts[id]
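To make the priority point concrete, here is a small sketch with made-up postings (the document texts and the name `docs` are only for illustration). A `ChainMap` lookup tries its mappings in order, so `p1` wins whenever both postings contain the same ID:

```python
from collections import ChainMap

# Hypothetical postings where the same ID maps to different text
p1 = {1: "doc one (from p1)", 2: "doc two"}
p2 = {1: "doc one (from p2)", 3: "doc three"}

common_ids = p1.keys() & p2.keys()
posts = ChainMap(p1, p2)  # lookups try p1 first, then p2

# Collect the documents for the shared IDs; values come from p1
docs = {doc_id: posts[doc_id] for doc_id in common_ids}
print(docs)
```

In an inverted index the two postings normally hold the same text for the same ID, so the priority only matters if the stored copies can differ.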

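2. Item intersection

If you want to intersect the items of the two postings, meaning that both the ID and the document must match, use the code below (credits to DCPY). However, this is only useful if you are looking for duplicates across terms.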

duplicates = dict(p1.items() & p2.items())
for id, document in duplicates.items():  # .items() is needed to get pairs
    ...

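3. Iterate over `p1` 'AND' `p2`.

If by "'AND' search" together with iter you meant searching through both postings, then collections.ChainMap is again the best way to iterate over (almost) all items of multiple postings: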

from collections import ChainMap
for id, document in ChainMap(p1, p2).items():  # .items() yields (id, document) pairs
    ...
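Note that this walks the union of the postings rather than the intersection, and the "(almost)" refers to shadowed entries: when an ID is in both postings, only the copy from the first mapping is seen. A minimal sketch with made-up data (the name `merged` is only for illustration):

```python
from collections import ChainMap

# Hypothetical postings that share ID 2
p1 = {1: "a b", 2: "a c"}
p2 = {2: "a c (copy in p2)", 3: "b d"}

# Materialize what iterating the ChainMap would visit
merged = dict(ChainMap(p1, p2).items())
print(merged)
```

Every ID from either posting appears exactly once, and the value for ID 2 is taken from `p1`.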

Answer 8

def two_keys(term_a, term_b, index):
    doc_ids = set(index[term_a].keys()) & set(index[term_b].keys())
    doc_store = index[term_a] # index[term_b] would work also
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

def n_keys(terms, index):
    doc_ids = set.intersection(*[set(index[term].keys()) for term in terms])
    doc_store = index[terms[0]]  # any of the terms would work
    return {doc_id: doc_store[doc_id] for doc_id in doc_ids}

In [0]: index = {'a': {1: 'a b'}, 
                 'b': {1: 'a b'}}

In [1]: two_keys('a','b', index)
Out[1]: {1: 'a b'}

In [2]: n_keys(['a','b'], index)
Out[2]: {1: 'a b'}

I would suggest changing your index from

index = {term: {doc_id: doc}}

to two indexes: one for the terms, and a separate one that holds the values

term_index = {term: set([doc_id])}
doc_store = {doc_id: doc}

That way you don't store multiple copies of the same data.
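A rough sketch of how the two-index layout could be built and queried follows. The sample `docs`, the whitespace tokenization and the helper name `and_search` are assumptions for illustration, not part of the suggestion above.

```python
# Hypothetical source documents: doc_id -> text
docs = {1: "a b", 2: "a c", 3: "b c"}

doc_store = {}   # doc_id -> text, each document stored once
term_index = {}  # term -> set of doc_ids containing that term

for doc_id, text in docs.items():
    doc_store[doc_id] = text
    for term in text.split():  # naive whitespace tokenization
        term_index.setdefault(term, set()).add(doc_id)


def and_search(terms, term_index, doc_store):
    """Return {doc_id: text} for documents containing every term."""
    # A term missing from the index matches no documents
    id_sets = [term_index.get(term, set()) for term in terms]
    if not id_sets:
        return {}
    common = set.intersection(*id_sets)
    return {doc_id: doc_store[doc_id] for doc_id in common}


print(and_search(["a", "b"], term_index, doc_store))
```

Because the postings are already sets of IDs, the intersection needs no conversion, and the text is looked up only for the matching documents.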

Answer 9

Find the full intersection by key and value:

d1 = {'a':1}
d2 = {'b':2, 'a':1}
{x:d1[x] for x in d1 if x in d2 and d1[x] == d2[x]}

>> {'a':1}
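One reason to prefer this comprehension over intersecting `items()` views is that it only compares values with `==` and never hashes them, so it also works when the values are lists or dicts. A small sketch with made-up data (the names `full` and `items_failed` are only for illustration):

```python
# Hypothetical dicts with unhashable (list) values
d1 = {"a": [1, 2], "b": [3]}
d2 = {"a": [1, 2], "b": [4], "c": [5]}

# Keys present in both dicts whose values compare equal
full = {k: d1[k] for k in d1 if k in d2 and d1[k] == d2[k]}
print(full)

# The items-view version has to put (key, value) pairs into a set,
# which fails for this data because lists are not hashable
try:
    dict(d1.items() & d2.items())
    items_failed = False
except TypeError:
    items_failed = True
print(items_failed)
```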
