Question

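I'm new to Python. I want to use a UnigramTagger with a backoff tagger (a RegexpTagger in my case), and I've been struggling to figure out what the error below means. Any help is appreciated.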

>>> train_sents = (['@Sakshi', 'Hi', 'I', 'am', 'meeting', 'my', 'friend', 'today'])    
>>> from tag_util import patterns  
>>> from nltk.tag import RegexpTagger  
>>> re_tagger = RegexpTagger(patterns)  
>>> from nltk.tag import UnigramTagger  
>>> from tag_util import backoff_tagger  
>>> tagger = backoff_tagger(train_sents, UnigramTagger, backoff=re_tagger)

Traceback (most recent call last):  
 File "<pyshell#6>", line 1, in <module>  
    tagger = backoff_tagger(train_sents, UnigramTagger, backoff=re_tagger)  
  File "tag_util.py", line 12, in backoff_tagger  
     for cls in tagger_classes:  
TypeError: 'YAMLObjectMetaclass' object is not iterable

Here is the code in tag_util that I use for patterns and backoff_tagger:

import re  
patterns = [  
    (r'^@\w+', 'NNP'),  
    (r'^\d+$', 'CD'),  
    (r'.*ing$', 'VBG'), # gerunds, i.e. wondering  
    (r'.*ment$', 'NN'),  
    (r'.*ful$', 'JJ'), # i.e. wonderful  
    (r'.*', 'NN')  
]  

def backoff_tagger(train_sents, tagger_classes, backoff=None):
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff

Answer 1

You only need to change a few things.

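The error occurs because you can't iterate over the class UnigramTagger: the loop for cls in tagger_classes expects an iterable of classes, and you passed a single class. I'm not sure whether you had something else in mind, but simply removing the for loop fixes that. In addition, you need to pass UnigramTagger a list of tagged sentences, each of which is a list of (word, tag) tuples, not just a list of words. Otherwise it has no way of knowing how to train. Part of the training data might look like this: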

[[('@Sakshi', 'NN'), ('Hi', 'NN'),...],...[('Another', 'NN'), ('sentence', 'NN')]]

Note that each sentence is itself a list. Alternatively, you can use one of NLTK's tagged corpora (which I recommend).
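To make the expected input shape concrete, here is a minimal sketch that trains a UnigramTagger on a single hand-tagged sentence, with the RegexpTagger from the question as its backoff. The Penn-Treebank-style tags in train_sents are my own illustrative guesses, not data from the question, and one sentence is far too little to train a useful tagger.

```python
from nltk.tag import RegexpTagger, UnigramTagger

# Same patterns as in the question's tag_util.py.
patterns = [
    (r'^@\w+', 'NNP'),
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
    (r'.*', 'NN'),
]

# A list of sentences; each sentence is a list of (word, tag) tuples.
# The tags here are assumed for illustration only.
train_sents = [[
    ('@Sakshi', 'NNP'), ('Hi', 'UH'), ('I', 'PRP'), ('am', 'VBP'),
    ('meeting', 'VBG'), ('my', 'PRP$'), ('friend', 'NN'), ('today', 'NN'),
]]

re_tagger = RegexpTagger(patterns)
tagger = UnigramTagger(train_sents, backoff=re_tagger)

# Words seen in training get their learned tag; unseen words fall
# through to the regular-expression patterns.
tagged = tagger.tag(['Hi', 'I', 'am', 'running', '42'])
print(tagged)
```

Here 'running' and '42' never appear in the training data, so their tags come from the backoff patterns rather than from the unigram model.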

Edit:

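After reading your post, I get the impression that you are confused about the inputs and outputs of some of these functions and are not yet familiar with what training means in the NLP sense. I think you would benefit greatly from reading the NLTK book, starting at the beginning.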

I'm happy to show you how to fix this, but I don't think you will fully understand the underlying mechanics without doing more reading.

tag_util.py (based on your code)

from nltk.tag import RegexpTagger, UnigramTagger
from nltk.corpus import brown

patterns = [
    (r'^@\w+', 'NNP'),
    (r'^\d+$', 'CD'),
    (r'.*ing$', 'VBG'),
    (r'.*ment$', 'NN'),
    (r'.*ful$', 'JJ'),
    (r'.*', 'NN')
]
re_tagger = RegexpTagger(patterns)
tagger = UnigramTagger(brown.tagged_sents(), backoff=re_tagger) # train tagger

In the Python interpreter

>>> import tag_util
>>> tag_util.brown.tagged_sents()[:2]
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')]]

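Note the output here. I'm getting the first two sentences from the Brown corpus of tagged sentences. This is the kind of data you need to pass as input to a tagger (such as UnigramTagger) in order to train it. Now let's use the tagger we trained in tag_util.py.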

Back in the Python interpreter

>>> tag_util.tagger.tag(['I', 'just', 'drank', 'some', 'coffee', '.'])
[('I', 'PPSS'), ('just', 'RB'), ('drank', 'VBD'), ('some', 'DTI'), ('coffee', 'NN'), ('.', '.')]

There you have it: the words of a sentence POS-tagged using your approach.

Answer 2

If you are using the backoff_tagger I'm thinking of, UnigramTagger should be an item in a list, like this:

tagger = backoff_tagger(train_sents, [UnigramTagger], backoff=re_tagger)
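As a rough illustration of why the list matters, the sketch below runs the question's backoff_tagger against a stand-in class. StubTagger is hypothetical and not part of NLTK; it only records its arguments so that the chaining done by the loop can be inspected without any training data or corpora.

```python
class StubTagger:
    """Hypothetical stand-in for an NLTK tagger class (not part of NLTK)."""

    def __init__(self, train_sents, backoff=None):
        self.train_sents = train_sents
        self.backoff = backoff


def backoff_tagger(train_sents, tagger_classes, backoff=None):
    # Each class in the sequence wraps the tagger built so far as its backoff.
    for cls in tagger_classes:
        backoff = cls(train_sents, backoff=backoff)
    return backoff


train_sents = [[("Hi", "UH"), ("there", "RB")]]  # made-up tagged data

# Passing the bare class reproduces the failure: a class is not iterable.
try:
    backoff_tagger(train_sents, StubTagger)
    failed = False
except TypeError:
    failed = True

# Wrapping the class (or several classes) in a list works.
tagger = backoff_tagger(train_sents, [StubTagger, StubTagger], backoff="regexp")

# Walk the chain from the outermost tagger down to the original backoff.
chain = []
node = tagger
while isinstance(node, StubTagger):
    chain.append(type(node).__name__)
    node = node.backoff
print(failed, chain, node)
```

With real NLTK classes the same loop would presumably be called with something like [UnigramTagger, BigramTagger], so that each later tagger falls back to the one built before it and ultimately to re_tagger.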

I hope that helps.

Backoff tagging in nltk
