utf8 Wordnet Lemmatizer - NLTK

Time: 2016-08-02 09:51:59

Tags: python nltk wordnet

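I've just run into trouble with the Wordnet Lemmatizer in NLTK. When I say "just", I mean it ^^ My Python script started crashing 10 minutes ago, and I don't know why. I hope I haven't done anything wrong, well... I hope you can tell me!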

Here is the script:

import nltk

sentence = """Hello, I am George."""

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

wordnet_lem = nltk.stem.WordNetLemmatizer()
for (word, pos) in tagged:
    # Map the Treebank tag to a WordNet POS constant (False if there is none)
    wordnet_pos = get_wordnet_pos(pos)

    # Both branches build a (lemma, tag) pair
    if wordnet_pos:
        couple = (wordnet_lem.lemmatize(word, pos=wordnet_pos), pos)
    else:
        couple = (wordnet_lem.lemmatize(word), pos)

I'm now getting this error:

Traceback (most recent call last):
  File "C:\Users\user\workspace\test.py", line 21, in <module>
    wordnet_pos = get_wordnet_pos(pos)
  File "C:\Users\user\workspace\test.py", line 9, in get_wordnet_pos
    return nltk.corpus.wordnet.NOUN
  File "C:\Python344\lib\site-packages\nltk\corpus\util.py", line 99, in __getattr__
    self.__load()
  File "C:\Python344\lib\site-packages\nltk\corpus\util.py", line 67, in __load
    corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)
  File "C:\Python344\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1055, in __init__
    self._load_lemma_pos_offset_map()
  File "C:\Python344\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1111, in _load_lemma_pos_offset_map
    for i, line in enumerate(self.open('index.%s' % suffix)):
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1188, in __next__
    return self.next()
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1181, in next
    line = self.readline()
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1135, in readline
    new_chars = self._read(readsize)
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1367, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1398, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python344\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 18: invalid start byte
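
For reference, the last line of the traceback can be reproduced in isolation. The sketch below only illustrates what the message means: the byte string `raw` is made up for the example and is not taken from the WordNet files. Byte 0xb0 is a UTF-8 continuation byte, so it cannot begin a character, while in Latin-1 the same byte is the degree sign.

```python
# Hypothetical data: 18 ASCII bytes followed by the byte named in the traceback.
raw = b"x" * 18 + b"\xb0"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The same byte is valid in a single-byte encoding such as Latin-1,
# where 0xb0 is the degree sign.
as_latin1 = raw[18:].decode("latin-1")

if error is not None:
    print(error)        # the decoder's message
    print(error.start)  # offset of the offending byte
print(repr(as_latin1))
```

A byte like this in a file that is read as UTF-8 usually suggests the file was saved in another encoding or altered, but the example alone does not show which file is affected.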

My first thought was that my wordnet corpus got corrupted somehow. What do you think?
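
One way to test the corrupted-corpus idea would be to read the corpus files as raw bytes and report any line that is not valid UTF-8. This is a minimal sketch using only the standard library. The function names `find_non_utf8_lines` and `scan_corpus_dir` are made up for this example, and the corpus folder is left as a parameter because its location depends on where the NLTK data was installed.

```python
import os


def find_non_utf8_lines(path):
    """Return (line_number, byte_offset, reason) for each line that is not valid UTF-8."""
    problems = []
    with open(path, "rb") as handle:  # read raw bytes, no decoding
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the bad byte within this line
                problems.append((line_number, exc.start, exc.reason))
    return problems


def scan_corpus_dir(corpus_dir):
    """Check every regular file directly inside corpus_dir; keep only files with problems."""
    report = {}
    for name in sorted(os.listdir(corpus_dir)):
        full_path = os.path.join(corpus_dir, name)
        if os.path.isfile(full_path):
            problems = find_non_utf8_lines(full_path)
            if problems:
                report[name] = problems
    return report
```

Calling `scan_corpus_dir` on the folder that holds the `index.*` files would list any file and line that fails to decode. An empty result would suggest the files themselves are fine and the cause lies elsewhere. A non-empty one would point toward replacing the corpus files.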

Thanks a lot for your help!

Edit:

I'm adding the definition of get_wordnet_pos:

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif treebank_tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    return False
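
The traceback shows the corpus being loaded from inside `get_wordnet_pos`, because reading `nltk.corpus.wordnet.NOUN` triggers NLTK's lazy corpus loader. As a debugging aid, here is a sketch of the same mapping that uses the plain one-letter strings those constants stand for and returns None when there is no match. The names `TREEBANK_PREFIX_TO_WORDNET` and `get_wordnet_pos_plain` are made up for this example. It would not fix a broken corpus, since `lemmatize` still needs WordNet, but it would move the failure out of the tag mapping.

```python
# One-letter POS codes used by WordNet; in NLTK these are the values of
# wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def get_wordnet_pos_plain(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if unmapped."""
    # Only the first character of the tag decides the mapping.
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1])
```

With this variant the loop could test `if wordnet_pos is not None:` instead of comparing against False.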

0 answers:

No answers yet