Question

I have a nested dictionary like this:

d = {'DET':{'this':0.5, 'the':0.4}, 'NOUN': {'cat':0.8, 'can':0.2}, 'VERB': {'can':0.6, 'fly':0.3}...}

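Given a list of tokens, I want to check whether each token is in the dictionary and retrieve its value and its parent key. When a token is ambiguous it can have several parent keys (in my example, "can" is a NOUN but also a VERB), and I only want the parent key under which the token has the highest value.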

So far, I have:

sent = ['the', 'cat', 'can', 'fly']
for k, v in d.items():
    for token in sent:
        if token in d[k]:
            print(token, k, v[token])

This gives me every possible tag and its associated value for each token:

cat NOUN 0.8
can NOUN 0.2
can VERB 0.6
fly VERB 0.3
the DET 0.4

but in the case of "can", I only want to get

can VERB 0.6

Answer 1

I would do something like this:

sent = ['the', 'cat', 'can', 'fly']
found = {}
for k, v in d.items():
    for token in sent:
        if token in v:
            if v[token] > found.get(token, {}).get('val', 0):
                found[token] = {'type': k, 'val': v[token]}

The found dictionary now looks like:

{'can': {'type': 'VERB', 'val': 0.6},
 'cat': {'type': 'NOUN', 'val': 0.8},
 'fly': {'type': 'VERB', 'val': 0.3},
 'the': {'type': 'DET', 'val': 0.4}}

Answer 2

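So don't print the results right away; store them in a dictionary, and when the same token comes up again, keep only the entry with the highest value. Once you have finished scanning d, you can walk through this dictionary and print its contents.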
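A minimal sketch of that idea, assuming d contains only the three tags shown in the question (the original dictionary is elided after VERB). The name best is hypothetical; it maps each token to the highest (value, tag) pair seen so far, and printing happens only after the scan is complete.

```python
# Assumption: only the three entries visible in the question.
d = {'DET': {'this': 0.5, 'the': 0.4},
     'NOUN': {'cat': 0.8, 'can': 0.2},
     'VERB': {'can': 0.6, 'fly': 0.3}}
sent = ['the', 'cat', 'can', 'fly']

best = {}  # hypothetical name: token -> (value, tag)
for tag, token_values in d.items():
    for token in sent:
        if token in token_values:
            value = token_values[token]
            # Keep the entry only if it is the first one seen or beats the stored one.
            if token not in best or value > best[token][0]:
                best[token] = (value, tag)

# Print after the scan, in sentence order.
for token in sent:
    if token in best:
        value, tag = best[token]
        print(token, tag, value)
```

Checking "token not in best" instead of comparing against a default of 0 means that tokens whose only values are zero or negative would still be recorded.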

Answer 3

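This prints, for each token, all tags that share the highest value (so if there were a tie, e.g. "can VERB 0.6" and "can NOUN 0.6", both would be printed):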

from collections import defaultdict

by_token = defaultdict(lambda: defaultdict(set))

for tag, token_values in d.items():
    for token, value in token_values.items():
        by_token[token][value].add(tag)

for token, by_value in by_token.items():
    value, tags = max(by_value.items())
    for tag in tags:
        print('{} {} {}'.format(token, tag, value))

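The first loop collects, for each token, all tags that have the same value. It uses defaultdict, which is a convenient way to build nested dictionary structures. For each token, it stores a dictionary mapping values to sets of tags.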
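To make the intermediate structure concrete, here is a small sketch that runs only the first loop and inspects the result. It assumes d holds just the three tags shown in the question; plain is a hypothetical name for a copy converted to ordinary dicts so that it is easier to read.

```python
from collections import defaultdict

# Assumption: only the three entries visible in the question.
d = {'DET': {'this': 0.5, 'the': 0.4},
     'NOUN': {'cat': 0.8, 'can': 0.2},
     'VERB': {'can': 0.6, 'fly': 0.3}}

by_token = defaultdict(lambda: defaultdict(set))
for tag, token_values in d.items():
    for token, value in token_values.items():
        by_token[token][value].add(tag)

# Convert the nested defaultdicts to plain dicts for inspection.
plain = {token: dict(by_value) for token, by_value in by_token.items()}
print(plain['can'])  # {0.2: {'NOUN'}, 0.6: {'VERB'}}
```

The ambiguous token "can" ends up with two values, each mapped to the set of tags that carry it.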

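The second loop reduces the restructured data to the maximum value only. It relies on the fact that tuples (here, the dictionary items) are ordered primarily by their first element, which is the dictionary key, i.e. the value associated with the tags for the token in question.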
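The tuple ordering can be checked in isolation. In this sketch by_value is a hand-written stand-in for the inner dictionary of a single token; because dictionary keys are unique, the comparison is always decided by the first element and the sets are never compared.

```python
# Hand-written stand-in for by_token['can'] (an assumption for illustration).
by_value = {0.2: {'NOUN'}, 0.6: {'VERB'}}

# items() yields (value, tags) tuples; max compares them element by element,
# so the tuple with the largest value wins.
value, tags = max(by_value.items())
print(value, tags)  # 0.6 {'VERB'}
```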

Check if two inner keys are duplicates in a nested dictionary and retrieve only the highest value
