Question

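Is there a way to use NLTK to get the set of possible parts of speech for a single string of letters, taking into account that different words may be homonyms?

For example: report -> {Noun, Verb}, kind -> {Adjective, Noun}

I have not been able to find a POS tagger that tags words outside the context of a full sentence. This seems like a very basic thing to ask of NLTK, so I'm confused as to why I've had so much trouble finding it.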

Answer 1

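Because POS models are trained on sentence- or document-level data, the pre-trained model expects a sentence or document as input. When it is given a single word, it treats that word as a one-word sentence, so it returns only one tag for it in that one-word context.

If you are trying to find all possible POS tags for each English word, you need a corpus containing many different uses of the words. Tag that corpus, then count or extract the tags seen for each word. For example: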

>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]

>>> from collections import defaultdict, Counter
>>> from itertools import chain
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]

>>> for word, pos in chain(*tagged_sents):
...     counts[word][pos] += 1
... 

>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})

>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})
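
The counting step above can be wrapped in a small helper that works on any list of tagged sentences. This is only a sketch: the function name `tag_counts` is made up for illustration, and the tagged sentences are copied by hand from the `pos_tag` output shown above so that the snippet runs without downloading a tagger model. In practice you would feed it the output of `pos_tag` over a much larger corpus.

```python
from collections import Counter, defaultdict
from itertools import chain


def tag_counts(tagged_sents):
    """Count how often each word occurs with each tag.

    `tagged_sents` is an iterable of sentences, each a list of
    (word, tag) pairs, e.g. the output of nltk.pos_tag.
    """
    counts = defaultdict(Counter)
    for word, tag in chain.from_iterable(tagged_sents):
        counts[word][tag] += 1
    return counts


# Hand-copied from the pos_tag output above (assumption: your tagger
# version produces the same tags for these two sentences).
tagged_sents = [
    [('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'),
     ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')],
    [('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'),
     ('team', 'NN')],
]

counts = tag_counts(tagged_sents)

# Collapse the counts into a plain word -> set-of-tags mapping.
possible_tags = {word: set(tags) for word, tags in counts.items()}
print(sorted(possible_tags['coaches']))  # ['NNS', 'VBZ']
```

Note that 'The' and 'the' are counted separately here; lower-case the words first if you want them merged.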

Alternatively, there is WordNet:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
['n', 'n', 'n', 'n', 'n', 'v', 'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({'n': 5, 'v': 2})

But note that WordNet is a hand-crafted resource, so you cannot expect every English word to be in it.
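
WordNet reports parts of speech as single-letter codes ('n', 'v', 'a', 's', 'r'), which is not quite the {Noun, Verb} form asked for in the question. Here is a minimal sketch of a mapping to readable names. The names `WORDNET_POS_NAMES` and `readable_pos_set` are hypothetical, and folding the "satellite adjective" code 's' into Adjective is a simplification I am assuming is acceptable here.

```python
# WordNet's single-letter part-of-speech codes, mapped to readable names.
WORDNET_POS_NAMES = {
    'n': 'Noun',
    'v': 'Verb',
    'a': 'Adjective',
    's': 'Adjective',  # satellite adjective, folded into Adjective here
    'r': 'Adverb',
}


def readable_pos_set(pos_codes):
    """Turn an iterable of WordNet POS codes into a set of readable names.

    Unknown codes are passed through unchanged rather than dropped.
    """
    return {WORDNET_POS_NAMES.get(code, code) for code in pos_codes}


# The codes below are the ones shown above for wn.synsets('coaches').
codes = ['n', 'n', 'n', 'n', 'n', 'v', 'v']
print(sorted(readable_pos_set(codes)))  # ['Noun', 'Verb']
```

With NLTK and the WordNet data installed, you would call it as `readable_pos_set(ss.pos() for ss in wn.synsets('coaches'))`. A word that WordNet does not contain yields an empty set.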

Answer 2

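Yes. The simplest way is not to use a tagger at all, but to load one or more tagged corpora and collect the set of all tags for the word you are interested in. If you are interested in more than one word, it is simplest to collect the tags for every word in the corpus and then look up whatever you want. I'll add frequency counts, just because I can. For example, using the Brown corpus and the simple "universal" tagset: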

>>> import nltk
>>> wordtags = nltk.ConditionalFreqDist((w.lower(), t)
...         for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))
>>> wordtags["report"]
FreqDist({'NOUN': 135, 'VERB': 39})
>>> list(wordtags["kind"])
['ADJ', 'NOUN']
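
Since the counts are available, they can also be used to drop tags that occur only rarely for a word, which may be annotation noise or very unusual uses. The sketch below is one possible way to do that, not part of the answer's method: the function name `frequent_tags` and the 5% default threshold are arbitrary choices of mine. It uses a plain `Counter` with the numbers shown above for "report"; an NLTK `FreqDist` such as `wordtags["report"]` should work the same way, since it behaves like a `Counter`.

```python
from collections import Counter


def frequent_tags(tag_freqs, min_share=0.05):
    """Return the tags that make up at least `min_share` of a word's uses.

    `tag_freqs` maps tag -> count for a single word.
    """
    total = sum(tag_freqs.values())
    if total == 0:
        return set()  # word never seen in the corpus
    return {tag for tag, n in tag_freqs.items() if n / total >= min_share}


# Counts copied from the Brown corpus output above for "report".
report = Counter({'NOUN': 135, 'VERB': 39})

print(sorted(frequent_tags(report)))                 # ['NOUN', 'VERB']
print(sorted(frequent_tags(report, min_share=0.5)))  # ['NOUN']
```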
