I'm using Python 3.5 on both Windows and Linux and get the same error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)

The error log is as follows:

Reloaded modules: lazylinker_ext
Traceback (most recent call last):
File "<ipython-input-2-d60a2349532e>", line 1, in <module>
runfile('C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py', wdir='C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622')
File "C:\Users\YZC\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Users\YZC\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 88, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py", line 70, in <module>
model = gensim.models.Word2Vec.load('model_all_no_lemma')
File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 1485, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 248, in load
obj = unpickle(fname)
File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 912, in unpickle
return _pickle.loads(f.read())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)
1. I checked and found that the default decoding method is utf-8:

import sys
sys.getdefaultencoding()
Out[2]: 'utf-8'
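Note that sys.getdefaultencoding() has no bearing here: unpickling in Python 3 uses its own encoding argument on pickle.loads(), which defaults to strict 'ASCII'. A minimal sketch of the failure mode, using a hand-built byte stream in the format a Python 2 pickle of a str would have (the stream below is constructed for illustration only, it is not taken from the actual model file):

```python
import pickle

# A protocol-0 pickle as Python 2 would write it for the byte string
# '\xc3\x81' (the UTF-8 encoding of 'Á'). Hand-built for illustration.
py2_pickle = b"S'\\xc3\\x81'\np0\n."

# Python 3 decodes legacy Python 2 str objects with 'ascii' by default,
# which fails on any byte >= 0x80 -- the same error as in the traceback.
try:
    pickle.loads(py2_pickle)
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc3 in position 0 ...

# Passing an explicit encoding to pickle.loads() resolves it:
print(pickle.loads(py2_pickle, encoding='utf-8'))  # -> 'Á'
print(pickle.loads(py2_pickle, encoding='bytes'))  # -> b'\xc3\x81'
```

encoding='latin-1' would also succeed here, since latin-1 maps every possible byte to a character; whether the resulting strings are *correct* still depends on how the Python 2 data was originally encoded.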
Here is the code:
# -*- coding: utf-8 -*-
import gensim
import csv
import numpy as np
import math
import string

from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob, Word

class SpeechParser(object):

    def __init__(self, filename):
        self.filename = filename
        self.lemmatize = WordNetLemmatizer().lemmatize
        self.cached_stopwords = stopwords.words('english')

    def __iter__(self):
        with open(self.filename, 'rb', encoding='utf-8') as csvfile:
            file_reader = csv.reader(csvfile, delimiter=',', quotechar='|', )
            headers = file_reader.next()
            for row in file_reader:
                parsed_row = self.parse_speech(row[-2])
                yield parsed_row

    def parse_speech(self, row):
        speech_words = row.replace('\r\n', ' ').strip().lower().translate(None, string.punctuation).decode('utf-8', 'ignore')
        return speech_words.split()

    # -- source: https://github.com/prateekpg2455/U.S-Presidential-Speeches/blob/master/speech.py --
    def pos(self, tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''
if __name__ == '__main__':
    # instantiate object
    sentences = SpeechParser("sample.csv")

    # load an existing model
    model = gensim.models.Word2Vec.load('model_all_no_lemma')

    print('\n-----------------------------------------------------------')
    print('MODEL:\t{0}'.format(model))

    vocab = model.vocab

    # print log-probability of first 10 sentences
    row_count = 0
    print('\n------------- Scores for first 10 documents: -------------')
    for doc in sentences:
        print(sum(model.score(doc))/len(doc))
        row_count += 1
        if row_count > 10:
            break
    print('\n-----------------------------------------------------------')
Answer 0 (score: 0)
It looks like a bug in Gensim that occurs when, under Python 3, you try to use a Python 2 pickle file that contains non-ASCII characters.
It happens when you call:

model = gensim.models.Word2Vec.load('model_all_no_lemma')
In Python 3, during unpickling, it wants to convert the legacy byte strings into (Unicode) strings. The default action is to decode as 'ASCII' in strict mode.
The fix depends on the encoding of your original pickle file, and requires you to patch the gensim code.
I'm not familiar with gensim, so you'll have to try the following two options:
Chances are your non-ASCII data is in UTF-8. In

C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py

change the line to:

return _pickle.loads(f.read(), encoding='utf-8')
Alternatively, Gensim under Python 3 may be happy to work with byte strings. Again in

C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py

change the line to:

return _pickle.loads(f.read(), encoding='bytes')
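Before patching the installed gensim file, it may be worth checking which encoding actually works on your pickle. A small diagnostic sketch (the filename 'model_all_no_lemma' comes from the question; this assumes the model file is a plain pickle, which may not hold if gensim stored large numpy arrays in separate sidecar files):

```python
import pickle

def try_encodings(fname, encodings=('utf-8', 'latin-1', 'bytes')):
    """Diagnostic sketch: try unpickling a Python 2 pickle under several
    encodings, to see which value the gensim patch should use."""
    with open(fname, 'rb') as f:
        data = f.read()
    for enc in encodings:
        try:
            pickle.loads(data, encoding=enc)
        except Exception as e:
            print('%-8s failed: %s' % (enc, e))
        else:
            print('%-8s ok' % enc)

# try_encodings('model_all_no_lemma')  # filename from the question
```

Whichever encoding loads cleanly (and yields sensible vocabulary strings) is the one to put into the _pickle.loads() call in utils.py.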