I'm trying to tag a bunch of words in a list (POS tagging, to be precise), like this:
pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
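where lw is a list of words (it is very long, otherwise I would post it, but it looks like [['hello'],['world']], i.e. a list of lists where each inner list contains one word). But when I try to run it, I get: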
Traceback (most recent call last):
  File "<pyshell#183>", line 1, in <module>
    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
  File "<pyshell#183>", line 1, in <listcomp>
    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 134, in pos_tag
    return _pos_tag(tokens, tagset, tagger)
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 102, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 240, in normalize
    elif word[0].isdigit():
IndexError: string index out of range
Can someone tell me why I'm getting this error and how to fix it? Thanks a lot.
Answer 0 (score: 1)
A common function for POS-tagging a document:
import nltk

def get_pos(string):
    # Split the raw text into word tokens, then tag each token.
    tokens = nltk.word_tokenize(string)
    return nltk.pos_tag(tokens)

# sentence is the raw text you want to tag
get_pos(sentence)
Hope this helps!
Answer 1 (score: 0)
If your input is a raw string, you can run word_tokenize on it before calling pos_tag:
import nltk
is_noun = lambda pos: pos[:2] == 'NN'
lines = 'You can never plan the future by the past'
lines = lines.lower()
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
print(nouns) # ['future', 'past']
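As for the error in the question: the last frame of the traceback fails on word[0], and indexing a string raises "IndexError: string index out of range" only when the string is empty. That suggests at least one inner list in lw contains an empty string. Below is a minimal sketch of one way to clean the input first; the helper name clean_token_lists and the sample data are made up for illustration, and also dropping whitespace-only tokens is an extra assumption on my part, not something the traceback requires.

```python
def clean_token_lists(token_lists):
    """Strip whitespace from each token and drop tokens that end up empty.

    Inner lists with no tokens left are dropped as well.
    """
    cleaned = []
    for tokens in token_lists:
        # Keep only tokens that still have content after stripping.
        kept = [tok.strip() for tok in tokens if tok.strip()]
        if kept:
            cleaned.append(kept)
    return cleaned


# Hypothetical sample shaped like lw in the question, with some bad entries.
lw_sample = [['hello'], [''], ['world'], ['  '], []]
print(clean_token_lists(lw_sample))  # [['hello'], ['world']]
```

The cleaned list could then be fed to the original comprehension, e.g. pos = [nltk.pos_tag(i, tagset='universal') for i in clean_token_lists(lw)]. Note that dropping inner lists means the result no longer lines up index-by-index with lw; if that alignment matters, it may be better to keep a placeholder for the dropped entries instead.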