Question

So I'm trying to tag a bunch of words in a list (POS tagging, to be exact), like so:

pos = [nltk.pos_tag(i,tagset='universal') for i in lw]

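where lw is a list of words. It's really long or I would post it, but it looks like [['hello'],['world']] (i.e. a list of lists, with each inner list containing one word). When I try to run it, I get: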

Traceback (most recent call last):
  File "<pyshell#183>", line 1, in <module>
    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
  File "<pyshell#183>", line 1, in <listcomp>
    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 134, in pos_tag
    return _pos_tag(tokens, tagset, tagger)
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 102, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 240, in normalize
    elif word[0].isdigit():
IndexError: string index out of range

Can someone tell me why I'm getting this error and how to fix it? Many thanks.

Answer 1

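A common function for tagging a document with POS tags:

import nltk

def get_pos(string):
    # Split the raw string into word tokens, then tag each token
    tokens = nltk.word_tokenize(string)
    pos_string = nltk.pos_tag(tokens)
    return pos_string

# sentence is any raw text string
get_pos(sentence)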

Hope this helps!

Answer 2

If your input is a raw string, you can run word_tokenize on it before calling pos_tag:

import nltk

is_noun = lambda pos: pos[:2] == 'NN'

lines = 'You can never plan the future by the past'

lines = lines.lower()
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

print(nouns) # ['future', 'past']
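
As for the IndexError in the question: the last frame of the traceback fails on word[0], which is what Python raises when the token is an empty string. The full lw isn't shown, so this is only a guess, but if some inner lists contain '' then filtering those out before tagging should avoid the error. A minimal sketch under that assumption (drop_empty_tokens is a hypothetical helper name, not part of NLTK):

```python
def drop_empty_tokens(token_lists):
    """Remove empty-string tokens and skip inner lists that end up empty."""
    cleaned = []
    for tokens in token_lists:
        kept = [t for t in tokens if t]  # '' is falsy, so it is dropped
        if kept:
            cleaned.append(kept)
    return cleaned

# Hypothetical sample in the shape described in the question
lw = [['hello'], [''], ['world'], []]
cleaned = drop_empty_tokens(lw)
print(cleaned)  # [['hello'], ['world']]

# With NLTK and its tagger data installed, tagging would then be:
# import nltk
# pos = [nltk.pos_tag(tokens, tagset='universal') for tokens in cleaned]
```

Note that this changes the length of the outer list, so if the positions in lw matter, it may be better to keep a placeholder for the skipped entries instead.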

How to use pos_tag in NLTK?