Question

I have this:

text = word_tokenize("The quick brown fox jumps over the lazy dog")

Running:

nltk.pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ') , ('lazy', 'JJ')

Testing this through their online tool gives the same result; quick, brown and lazy should be adjectives, not nouns.

Answer 1

In short:

NLTK is not perfect. In fact, no model is perfect.

Note:

As of NLTK version 3.1, the default pos_tag function is no longer the old MaxEnt English pickle.

It is now the perceptron tagger from @Honnibal's implementation, see nltk.tag.pos_tag

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

It is better, but still not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

At some point, if someone wants a TL;DR solution, see https://github.com/alvations/nltk_cli

In long:

Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:

HunPos

Stanford POS

Senna

Using the default MaxEnt POS tagger from NLTK, i.e. nltk.pos_tag (the default before NLTK 3.1):

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using the Stanford POS tagger:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]

Using HunPOS (NOTE: the default encoding is ISO-8859-1, not UTF8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

Using Senna (make sure you have the latest version of NLTK, as some changes were made to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
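To see at a glance how the taggers above differ, their quoted outputs can be compared token by token against a reference tagging. The following is only a rough sketch in plain Python: the reference is an assumption (the tags the asker expects, which happen to match the Stanford and Senna outputs), and the helpers mismatches and accuracy are hypothetical names, not part of NLTK.

```python
# Reference tagging ASSUMED for this sketch: the tags the asker expects,
# identical to the Stanford and Senna outputs quoted above.
REFERENCE = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
             ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
             ('dog', 'NN')]

# Tagger outputs copied from the transcripts above.
OUTPUTS = {
    'maxent': [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'),
               ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'),
               ('dog', 'NN')],
    'perceptron': [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'),
                   ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'),
                   ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')],
    'hunpos': [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
               ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
               ('dog', 'NN')],
}


def mismatches(predicted, reference):
    """Return (word, predicted_tag, reference_tag) for every disagreement."""
    if [w for w, _ in predicted] != [w for w, _ in reference]:
        raise ValueError("token sequences differ")
    return [(w, p, r) for (w, p), (_, r) in zip(predicted, reference) if p != r]


def accuracy(predicted, reference):
    """Fraction of tokens whose tag agrees with the reference."""
    if not reference:
        raise ValueError("empty reference")
    return 1.0 - len(mismatches(predicted, reference)) / float(len(reference))


for name in sorted(OUTPUTS):
    tagged = OUTPUTS[name]
    print(name, round(accuracy(tagged, REFERENCE), 2), mismatches(tagged, REFERENCE))
```

On the outputs quoted above, this reports four disagreements for the old MaxEnt tagger and one each for the perceptron tagger (brown) and HunPOS (jumps). A single sentence says little about overall accuracy, so treat it as a quick sanity check rather than a benchmark.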

Or try building a better POS tagger:

Ngram Tagger: http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/

Affix / Regex Tagger: http://streamhacker.com/2008/11/10/part-of-speech-tagging-with-nltk-part-2/

Build your own Brill tagger (read the code, it's a pretty fun tagger, http://www.nltk.org/_modules/nltk/tag/brill.html), see http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/

Perceptron Tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/

LDA Tagger: http://scm.io/blog/hack/2015/02/lda-intentions/

Complaints about pos_tag accuracy on Stack Overflow include:

POS tagging - NLTK thinks noun is adjective

python NLTK POS tagger not behaving as expected

How to obtain better results using NLTK pos tag

pos_tag in NLTK does not tag sentences correctly

Issues about NLTK HunPos include:

How do I tag textfiles with hunpos in nltk?

Does anyone know how to configure the hunpos wrapper class on nltk?

Issues with NLTK and the Stanford POS tagger include:

trouble importing stanford pos tagger into nltk

Java Command Fails in NLTK Stanford POS Tagger

Error using Stanford POS Tagger in NLTK Python

How to improve speed with Stanford NLP Tagger and NLTK

Nltk stanford pos tagger error : Java command failed

Instantiating and using StanfordTagger within NLTK

Running Stanford POS tagger in NLTK leads to "not a valid Win32 application" on Windows

Answer 2

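Solutions such as switching to the Stanford, Senna or HunPOS tagger will definitely yield results, but here is a much simpler way to experiment with different taggers that are also included in NLTK.

The default POS tagger in NLTK right now is the averaged perceptron tagger. Here's a function that opts to use the Maxent Treebank Tagger instead: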

import nltk

def treebankTag(text):
    # Tokenize, then tag with the older MaxEnt treebank model instead of the default.
    words = nltk.word_tokenize(text)
    treebankTagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
    return treebankTagger.tag(words)

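I have found that the averaged perceptron pre-trained tagger in NLTK is biased towards treating some adjectives as nouns, as in your example. The treebank tagger has gotten more adjectives correct for me.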

Answer 3

import nltk

def tagPOS(textcontent, taggedtextcontent, defined_tags):
    # Tag the raw text with NLTK's default tagger.
    token = nltk.word_tokenize(textcontent)
    nltk_pos_tags = nltk.pos_tag(token)

    # Tag the same tokens by plain lookup in a {word: tag} dictionary.
    unigram_pos_tag = nltk.UnigramTagger(model=defined_tags).tag(token)

    # Parse pre-tagged "word/TAG" text into (word, tag) tuples.
    tagged_pos_tag = [nltk.tag.str2tuple(word) for word in taggedtextcontent.split()]

    return (nltk_pos_tags, tagged_pos_tag, unigram_pos_tag)
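The function above returns its three taggings side by side and leaves it to the caller to combine them. One possible way to do that, which is an assumption of this sketch and not something the answer states, is to treat defined_tags as word-level overrides applied on top of the pos_tag output. The helper apply_overrides is a hypothetical name; it is plain Python, takes an already tagged list as input, and makes no calls into NLTK.

```python
def apply_overrides(tagged, overrides):
    """Replace the tag of any word found in `overrides`; keep all other tags."""
    return [(word, overrides.get(word, tag)) for word, tag in tagged]


# Perceptron tagger output quoted in Answer 1.
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN')]

# Hypothetical override dictionary, same {word: tag} shape as defined_tags.
defined_tags = {'brown': 'JJ'}

corrected = apply_overrides(tagged, defined_tags)
print(corrected)
```

Such overrides ignore context, so a word that can legitimately be either a noun or an adjective will always receive the overriding tag. That makes this reasonable for a small, domain-specific vocabulary and a poor substitute for a better model.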

Python NLTK pos_tag not returning the correct part-of-speech tag