In the code below, why does nltk think 'fish' is an adjective rather than a noun?
>>> import nltk
>>> s = "a woman needs a man like a fish needs a bicycle"
>>> nltk.pos_tag(s.split())
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]
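Answer 0 (score: 4)
I'm not sure what the workaround is, but you can check the source here: https://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/
In the meantime, I tried your sentence with a slightly different approach.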
>>> s = "a woman needs a man. A fish needs a bicycle"
>>> nltk.pos_tag(s.split())
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man.', 'NP'), ('A', 'NNP'), ('fish', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('bicycle', 'NN')]
This results in fish being tagged 'NN'.
Answer 1 (score: 4)
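If you first use a Lookup Tagger as described in the NLTK book, chapter 5 (for example, with WordNet as the lookup reference), your tagger already "knows" that fish cannot be an adjective. For all words that have more than one possible POS tag, you can then use the statistical tagger as a backoff tagger.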
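The lookup-then-backoff idea can be sketched in plain Python. This is only an illustration under stated assumptions: `LOOKUP`, `stub_statistical_tagger`, and `tag_with_lookup` are hypothetical names, the lookup table is hand-written rather than derived from WordNet, and the stub simply replays the tags from the question instead of running a real statistical tagger. In NLTK itself, chapter 5 of the book builds a lookup tagger with `nltk.UnigramTagger(model=..., backoff=...)`.

```python
# Hypothetical lookup table: words assumed to have exactly one possible tag.
LOOKUP = {"a": "DT", "woman": "NN", "man": "NN", "fish": "NN", "bicycle": "NN"}


def stub_statistical_tagger(tokens):
    """Stand-in for a statistical tagger; replays the tags from the question."""
    replay = ["DT", "NN", "VBZ", "DT", "NN", "IN", "DT", "JJ", "NNS", "DT", "NN"]
    return list(zip(tokens, replay))


def tag_with_lookup(tokens, lookup, backoff):
    """Use the lookup tag when the word is known, else the backoff tagger's tag."""
    backoff_tags = backoff(tokens)
    return [
        (tok, lookup.get(tok, fallback))
        for tok, (_, fallback) in zip(tokens, backoff_tags)
    ]


tokens = "a woman needs a man like a fish needs a bicycle".split()
tagged = tag_with_lookup(tokens, LOOKUP, stub_statistical_tagger)
print(tagged[7])  # ('fish', 'NN')
```

In this sketch only fish changes. The word needs is not in the table, so it keeps whatever the backoff assigned.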
Answer 2 (score: 3)
This is because you expect a woman needs a man like a fish needs a bicycle to get POS tags for a "parse" like this:
[ [[a woman] needs [a man]] like [[a fish] needs [a bicycle]] ]
But NLTK's default POS tagger is not smart enough, and it gives POS tags that correspond to a parse like this:
[ [[a woman] needs [a man]] like [a fish needs] [a bicycle] ]
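Answer 3 (score: 3)
It depends on how the input is fed to the POS tagger. Take the sentence: "a woman needs a man like a fish needs a bicycle"
If you tokenize it with the default nltk word tokenizer and with a regex tokenizer, the resulting tags differ.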
import nltk
from nltk.tokenize import RegexpTokenizer

# Raw string so the backslash escapes reach the regex engine unchanged.
TOKENIZER = RegexpTokenizer(r'(?u)\W+|\$[\d\.]+|\S+')
s = "a woman needs a man like a fish needs a bicycle"
# Tokenize the same sentence in two ways.
regex_tokenize = TOKENIZER.tokenize(s)
default_tokenize = nltk.word_tokenize(s)
# Tag each token list with the default tagger.
regex_tag = nltk.pos_tag(regex_tokenize)
default_tag = nltk.pos_tag(default_tokenize)
print(regex_tag)
print("\n")
print(default_tag)
The output is as follows:
Regex Tokenizer:
[('a', 'DT'), (' ', 'NN'), ('woman', 'NN'), (' ', ':'), ('needs', 'NNS'), (' ', 'VBP'), ('a', 'DT'), (' ', 'NN'), ('man', 'NN'), (' ', ':'), ('like', 'IN'), (' ', 'NN'), ('a', 'DT'), (' ', 'NN'), ('fish', 'NN'), (' ', ':'), ('needs', 'VBZ'), (' ', ':'), ('a', 'DT'), (' ', 'NN'), ('bicycle', 'NN')]
Default Tokenizer:
[('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'), ('fish', 'JJ'), ('needs', 'NNS'), ('a', 'DT'), ('bicycle', 'NN')]
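With the regex tokenizer, fish is tagged as a noun, while with the default tokenizer it is tagged as an adjective. The regex tokenizer also emits the whitespace between words as separate tokens, so the tagger sees a different token sequence, and that different input leads to different tags.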
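The whitespace tokens in the regex output come from the `\W+` alternative in the pattern. A minimal sketch using only the standard `re` module, on the assumption that `RegexpTokenizer` in its default (non-gap) mode returns the same matches as `re.findall` with this pattern:

```python
import re

# Same pattern as the one passed to RegexpTokenizer in this answer.
PATTERN = r'(?u)\W+|\$[\d\.]+|\S+'
s = "a woman needs a man like a fish needs a bicycle"

tokens = re.findall(PATTERN, s)
print(tokens[:5])   # ['a', ' ', 'woman', ' ', 'needs']
print(len(tokens))  # 21

# Dropping whitespace-only tokens recovers the plain word list.
words = [t for t in tokens if not t.isspace()]
print(words == s.split())  # True
```

Filtering out the whitespace tokens before tagging would give the tagger the same 11 words that `s.split()` produces.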
Answer 4 (score: 2)
If you use the Stanford POS tagger (3.5.1), the phrase is tagged correctly:
from nltk.tag.stanford import POSTagger
st = POSTagger("/.../stanford-postagger-full-2015-01-30/models/english-left3words-distsim.tagger",
"/.../stanford-postagger-full-2015-01-30/stanford-postagger.jar")
st.tag("a woman needs a man like a fish needs a bicycle".split())
yields:
[('a', 'DT'),
('woman', 'NN'),
('needs', 'VBZ'),
('a', 'DT'),
('man', 'NN'),
('like', 'IN'),
('a', 'DT'),
('fish', 'NN'),
('needs', 'VBZ'),
('a', 'DT'),
('bicycle', 'NN')]
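To see exactly where the two taggers disagree, the tag sequences quoted in this thread can be compared position by position. This is a small sketch with the tags copied by hand from the outputs above; the variable names are illustrative only and nothing here calls either tagger.

```python
# Tags copied from the outputs quoted in this thread.
tokens = "a woman needs a man like a fish needs a bicycle".split()
nltk_default = ["DT", "NN", "VBZ", "DT", "NN", "IN", "DT", "JJ", "NNS", "DT", "NN"]
stanford = ["DT", "NN", "VBZ", "DT", "NN", "IN", "DT", "NN", "VBZ", "DT", "NN"]

# Collect (index, token, default tag, Stanford tag) wherever the taggers disagree.
diffs = [
    (i, tok, a, b)
    for i, (tok, a, b) in enumerate(zip(tokens, nltk_default, stanford))
    if a != b
]
for row in diffs:
    print(row)
# (7, 'fish', 'JJ', 'NN')
# (8, 'needs', 'NNS', 'VBZ')
```

The two differences are adjacent: once fish is read as an adjective, the following needs is read as a plural noun.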