How do I use pos_tag in NLTK?

Asked: 2017-11-27 21:21:02

Tags: python nlp nltk pos-tagger

I'm trying to tag a bunch of words in a list (POS tagging, to be exact) like this:

pos = [nltk.pos_tag(i,tagset='universal') for i in lw]

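where lw is a list of words. It's really long, otherwise I would post it, but it looks like [['hello'],['world']], i.e. a list of lists where each inner list contains one word. When I try to run it I get: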

Traceback (most recent call last):
  File "<pyshell#183>", line 1, in <module>
    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
  File "<pyshell#183>", line 1, in <listcomp>
    pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 134, in pos_tag
    return _pos_tag(tokens, tagset, tagger)
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\__init__.py", line 102, in _pos_tag
    tagged_tokens = tagger.tag(tokens)
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in tag
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in <listcomp>
    context = self.START + [self.normalize(w) for w in tokens] + self.END
  File "C:\Users\my system\AppData\Local\Programs\Python\Python35\lib\site-packages\nltk\tag\perceptron.py", line 240, in normalize
    elif word[0].isdigit():
IndexError: string index out of range

Can someone tell me why I'm getting this error and how to fix it? Thanks a lot.

2 answers:

Answer 0: (score: 1)

A common helper for tagging a document with POS tags:

import nltk

def get_pos(text):
    # Split the raw string into word tokens, then tag each token
    tokens = nltk.word_tokenize(text)
    return nltk.pos_tag(tokens)

get_pos(sentence)  # sentence is any raw string
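
As for the traceback in the question: it ends at `word[0].isdigit()` inside `normalize`, and indexing position 0 of a string only fails with `IndexError: string index out of range` when the string is empty. So a likely cause (an assumption, since `lw` itself was not posted) is that at least one inner list contains `''`. Here is a minimal sketch that strips such tokens before tagging. `drop_empty_tokens` and the sample `lw` are hypothetical names used only for illustration:

```python
def drop_empty_tokens(token_lists):
    """Remove empty strings from each inner list and skip lists left empty."""
    cleaned = []
    for tokens in token_lists:
        kept = [t for t in tokens if t]  # '' is falsy, so it is filtered out
        if kept:
            cleaned.append(kept)
    return cleaned

# Hypothetical sample in the shape described in the question,
# with one empty token that would trigger the IndexError
lw = [['hello'], [''], ['world']]
clean_lw = drop_empty_tokens(lw)
print(clean_lw)
```

The cleaned list can then be passed to the same list comprehension as in the question. Note that dropping entries means positions in `clean_lw` no longer line up with positions in `lw`.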

Hope this helps!

Answer 1: (score: 0)

If your input is a raw string, you can run word_tokenize on it before calling pos_tag:

import nltk

is_noun = lambda pos: pos[:2] == 'NN'

lines = 'You can never plan the future by the past'

lines = lines.lower()
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]

print(nouns) # ['future', 'past']
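
If the input is already split into one-word lists, as in the question, rather than a raw string, one option is to flatten it into a single token list first so the tagger sees the words together instead of one at a time. This is only a sketch under that assumption. The sample `lw` mirrors the shape given in the question, and the variable name `tokens` is illustrative:

```python
from itertools import chain

# Shape described in the question: a list of one-word lists
lw = [['hello'], ['world']]

# Flatten into one list of tokens, skipping any empty strings
tokens = [t for t in chain.from_iterable(lw) if t]
print(tokens)  # ['hello', 'world']
```

The flattened `tokens` list can be passed to `nltk.pos_tag` in place of `tokenized` above.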