Question

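So I'm taking a natural language processing course, and I need to create a trigram language model that generates random text which looks "realistic" to some degree, based on some sample data.

Essentially, I need to create a "trigram" structure to hold the various three-word combinations. My professor hinted that this can be done with a dictionary of dictionaries of dictionaries, which I tried to create using: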

trigram = defaultdict( defaultdict(defaultdict(int)))

But I get an error message:

trigram = defaultdict( dict(dict(int)))
TypeError: 'type' object is not iterable

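How would I go about creating a 3-layer nested dictionary, that is, a dictionary of dictionaries of dictionaries of int values?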

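I guess people vote down a question on Stack Overflow if they don't know how to answer it. I'll add some background to better explain the question for those willing to help.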

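This trigram is used to keep track of three-word patterns. Such patterns are used in text and language processing software and show up almost everywhere in natural language processing (think Siri or Google Now).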

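If we designate the 3 levels of dictionaries as dict1, dict2 and dict3, then parsing a text file and reading the statement "the boy runs" would give the following:

dict1 has the key "the". Accessing that key returns dict2, which contains the key "boy". Accessing that key returns the final dict3, which contains the key "runs". Accessing that key returns the value 1.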

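This means that "the boy runs" has appeared 1 time in the text. If we encounter it again, we follow the same process and increment the 1 to 2. If we encounter "the girl walks", then the dict2 stored under the "the" key will gain another key, "girl", which holds a dict3 with the key "walks" and the value 1, and so on. Eventually, after parsing a large amount of text (and keeping track of the word counts), you have a trigram that can estimate how likely a given starting word is to lead to a particular three-word combination, based on how often those combinations appeared in the previously parsed text.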
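To make the bookkeeping described above concrete, here is a rough sketch that uses only plain dicts. The helper name `add_trigram` is made up for illustration, and this is just one way to do the counting by hand, not necessarily what the professor had in mind.

```python
def add_trigram(counts, w1, w2, w3):
    """Record one occurrence of the word triple (w1, w2, w3)."""
    dict2 = counts.setdefault(w1, {})   # dict1[w1] -> dict2
    dict3 = dict2.setdefault(w2, {})    # dict2[w2] -> dict3
    dict3[w3] = dict3.get(w3, 0) + 1    # dict3[w3] -> running count


counts = {}
add_trigram(counts, "the", "boy", "runs")
add_trigram(counts, "the", "boy", "runs")
add_trigram(counts, "the", "girl", "walks")

print(counts)
# {'the': {'boy': {'runs': 2}, 'girl': {'walks': 1}}}
```

The two `setdefault` calls are exactly the boilerplate that a nested defaultdict is supposed to remove.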

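This can help you create grammar rules to identify languages or, in my case, create randomly generated text that looks very much like grammatical English. I need a three-layer dictionary because at any position of a three-word combination there can be another word that creates a whole different set of combinations. I've tried my best to explain trigrams and the purpose behind them as well as I can, though I only started this class a couple of weeks ago.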

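Now, with all of that said: how would I go about creating a dictionary of dictionaries of dictionaries in Python whose innermost dictionary holds values of type int?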

trigram = defaultdict(defaultdict(defaultdict(int)))

throws an error for me.

Answer 1

I've tried nested defaultdicts before, and the solution seems to be a lambda call:

trigram = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

trigram['a']['b']['c'] += 1

It's not pretty, but I suspect the nested dictionary suggestion is there for efficient lookup.
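As a minimal sketch of how the lambda-based structure might be filled from a token list and then used to pick a next word, something like the following could work. The function names `build_trigrams` and `next_word` and the sample sentence are assumptions for illustration only; a real model would also need sentence boundaries and a way to handle unseen word pairs.

```python
import random
from collections import defaultdict


def build_trigrams(tokens):
    """Count every consecutive word triple in tokens."""
    trigram = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram[w1][w2][w3] += 1
    return trigram


def next_word(trigram, w1, w2, rng=random):
    """Sample a follower of (w1, w2) in proportion to its count.

    .get() is used so that looking up an unseen pair does not
    create empty entries in the defaultdicts.
    """
    followers = trigram.get(w1, {}).get(w2, {})
    if not followers:
        return None
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


tokens = "the boy runs and the boy runs and the girl walks".split()
trigram = build_trigrams(tokens)

print(trigram["the"]["boy"]["runs"])       # 2
print(next_word(trigram, "the", "girl"))   # walks (the only follower seen)
```

Sampling with `random.choices` weighted by the counts is one simple choice; it reproduces frequent continuations more often than rare ones.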

Answer 2

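In general, the solution already posted will work for creating a nested dictionary of trigrams. If you want to extend the idea to a more general solution, you can do one of the following: one approach borrows Perl's autovivification, the other uses collections.defaultdict.

Solution 1: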

class ngram(dict):
    """Based on perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return super(ngram, self).__getitem__(item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Solution 2:

from collections import defaultdict
class ngram(defaultdict):
    def __init__(self):
        super(ngram, self).__init__(ngram)

Demonstration using Solution 1

>>> trigram = ngram()
>>> trigram['two']['three']['four'] = 4
>>> trigram
{'two': {'three': {'four': 4}}}
>>> trigram['two']
{'three': {'four': 4}}
>>> trigram['two']['three']
{'four': 4}
>>> trigram['two']['three']['four']
4

Demonstration using Solution 2

>>> a = ngram()
>>> a['two']['three']['four'] = 4
>>> a
defaultdict(<class '__main__.ngram'>, {'two': defaultdict(<class '__main__.ngram'>, {'three': defaultdict(<class '__main__.ngram'>, {'four': 4})})})
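One caveat with both classes above: every missing key produces another ngram, so a fresh `a['x']['y']['z'] += 1` fails, because the leaf is an ngram rather than an int. If the goal is counting, a depth-limited factory may be a better fit. The sketch below is only one possible way to do that, and the name `nested_counter` is made up for illustration.

```python
from collections import defaultdict


def nested_counter(depth):
    """Return a defaultdict nested `depth` levels deep with int leaves."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        return defaultdict(int)
    # Each missing key creates a counter that is one level shallower.
    return defaultdict(lambda: nested_counter(depth - 1))


trigram = nested_counter(3)
trigram["the"]["boy"]["runs"] += 1
trigram["the"]["boy"]["runs"] += 1
trigram["the"]["girl"]["walks"] += 1

print(trigram["the"]["boy"]["runs"])    # 2
print(trigram["the"]["girl"]["walks"])  # 1
```

With `depth=3` this behaves like the nested lambda version, and changing the argument gives bigrams or 4-grams without rewriting the expression.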

Answer 3

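The defaultdict __init__ method takes an argument that must be callable. The callable passed to defaultdict must be callable with no arguments and must return an instance of the default value.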

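The problem with nesting defaultdict the way you did is that defaultdict's __init__ takes an argument. Giving the inner defaultdict that argument means that, instead of the wrapping defaultdict having a callable as its __init__ argument, it has an instance of defaultdict, which is not callable.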

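The lambda solution from @pcoving works because it creates an anonymous function that returns a defaultdict initialized with another function, which in turn returns the correct type of defaultdict for each layer of the nesting.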
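The point about callables can be checked directly. The short sketch below uses two levels instead of three to keep it small; the variable names are arbitrary.

```python
from collections import defaultdict

# defaultdict(int) is an *instance*, and dict instances are not callable.
inner = defaultdict(int)
print(callable(inner))  # False

# Passing that instance as the default factory is rejected right away.
message = None
try:
    defaultdict(inner)
except TypeError as exc:
    message = str(exc)
print("TypeError:", message)

# Wrapping the inner construction in a lambda gives the outer defaultdict
# a zero-argument callable that builds a fresh inner dict on demand.
nested = defaultdict(lambda: defaultdict(int))
nested["a"]["b"] += 1
print(nested["a"]["b"])  # 1
```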

Answer 4

If it is only about extracting and retrieving trigrams, you should try NLTK:

>>> import nltk
>>> sent = "this is a foo bar crazycoder"
>>> # wrap in list() so the trigrams can be iterated more than once
>>> trigrams = list(nltk.ngrams(sent.split(), 3))
>>> trigrams
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar'), ('foo', 'bar', 'crazycoder')]
>>> # token "a" in first element of trigram
>>> [i for i in trigrams if i[0] == "a"]
[('a', 'foo', 'bar')]
>>> # token "a" in 2nd element of trigram
>>> [i for i in trigrams if i[1] == "a"]
[('is', 'a', 'foo')]
>>> # token "a" in third element of trigram
>>> [i for i in trigrams if i[2] == "a"]
[('this', 'is', 'a')]
>>> # look for a 2gram in trigrams
>>> [i for i in trigrams if "foo" in i and "bar" in i]
[('a', 'foo', 'bar'), ('foo', 'bar', 'crazycoder')]
>>> # look for a perfect 3gram (compare tuple to tuple, not list to tuple)
>>> [i for i in trigrams if tuple("foo bar crazycoder".split()) == i]
[('foo', 'bar', 'crazycoder')]
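If installing NLTK is not an option, the same tuples can be produced with the standard library alone. This is just a sketch: the helper name `trigrams_of` is made up, and counting the tuples with collections.Counter gives a flat alternative to the three-level dictionary rather than a replacement for it.

```python
from collections import Counter


def trigrams_of(tokens):
    """Return all consecutive word triples in tokens as a list of tuples."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


sent = "this is a foo bar crazycoder"
grams = trigrams_of(sent.split())
print(grams)
# [('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar'), ('foo', 'bar', 'crazycoder')]

# A Counter keyed by the whole tuple keeps the counts in one flat dict.
counts = Counter(grams)
print(counts[("a", "foo", "bar")])  # 1
```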

How to create a dictionary of dictionaries of dictionaries in Python
