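I have just run into trouble with the WordNet Lemmatizer in NLTK. When I say "just", I mean it ^^ My Python script crashed 10 minutes ago (I don't know why). I hope I didn't do anything wrong, well... I hope you can tell me!
Here is the script: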
import nltk

sentence = """Hello, I am George."""
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
wordnet_lem = nltk.stem.WordNetLemmatizer()
for (word, pos) in tagged:
    # Map the Penn Treebank tag to a WordNet POS constant (False if there is no match)
    wordnet_pos = get_wordnet_pos(pos)
    if wordnet_pos != False:
        couple = (wordnet_lem.lemmatize(word, pos=wordnet_pos), pos)
    else:
        couple = (wordnet_lem.lemmatize(word), pos)
And this is the error I get now:
Traceback (most recent call last):
  File "C:\Users\user\workspace\test.py", line 21, in <module>
    wordnet_pos = get_wordnet_pos(pos)
  File "C:\Users\user\workspace\test.py", line 9, in get_wordnet_pos
    return nltk.corpus.wordnet.NOUN
  File "C:\Python344\lib\site-packages\nltk\corpus\util.py", line 99, in __getattr__
    self.__load()
  File "C:\Python344\lib\site-packages\nltk\corpus\util.py", line 67, in __load
    corpus = self.__reader_cls(root, *self.__args, **self.__kwargs)
  File "C:\Python344\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1055, in __init__
    self._load_lemma_pos_offset_map()
  File "C:\Python344\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1111, in _load_lemma_pos_offset_map
    for i, line in enumerate(self.open('index.%s' % suffix)):
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1188, in __next__
    return self.next()
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1181, in next
    line = self.readline()
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1135, in readline
    new_chars = self._read(readsize)
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1367, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python344\lib\site-packages\nltk\data.py", line 1398, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python344\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 18: invalid start byte
My first thought is that my WordNet corpus has somehow been corrupted. What do you think?
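One way to test the corruption hypothesis is to check whether every file in the WordNet corpus directory still decodes as UTF-8, since the traceback fails while reading one of the `index.*` files. The sketch below uses only the standard library. The function name `find_undecodable_files` is hypothetical, and the directory you pass in is an assumption: it should be wherever the unpacked `corpora/wordnet` folder lives on your machine.

```python
from pathlib import Path


def find_undecodable_files(corpus_dir):
    """Return (file name, byte offset, byte value) for each file that is not valid UTF-8.

    corpus_dir is assumed to be the unpacked WordNet corpus folder.
    """
    problems = []
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first byte that failed to decode
            problems.append((path.name, err.start, data[err.start]))
    return problems
```

If this reports any file, the local copy is probably damaged, and deleting the folder and fetching it again with `nltk.download('wordnet')` would be the obvious next step. Note that the position in the traceback (18) may refer to an offset within a chunk being decoded, so it need not match the file offset reported here.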
Thanks a lot for your help!
EDIT:
I am adding the definition of get_wordnet_pos:
def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the matching WordNet POS constant
    if treebank_tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif treebank_tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    # No WordNet equivalent for this tag
    return False
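The traceback shows that merely reading `nltk.corpus.wordnet.NOUN` triggers the lazy load of the whole corpus, which is where the decode error is raised. For comparison, here is a minimal sketch of the same mapping that does not touch the corpus at all. It assumes the WordNet POS constants are the single letters 'a', 'v', 'n' and 'r'; the names `TAG_PREFIX_TO_WORDNET` and `treebank_to_wordnet` are hypothetical, and it returns None instead of False for unmapped tags.

```python
# Assumption: these literals match nltk.corpus.wordnet.ADJ, VERB, NOUN and ADV.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def treebank_to_wordnet(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    # Only the first character of the tag decides the WordNet category
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1])
```

This would not fix the error by itself, because the lemmatizer still has to load the corpus when it is called. It only helps to separate the tag mapping from the corpus loading while debugging.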