UnicodeDecodeError: 'ascii' codec can't decode, using gensim, Python 3.5

Date: 2016-12-26 04:56:52

Tags: encoding utf-8 python-3.5 gensim word2vec

I am using Python 3.5 on both Windows and Linux and get the same error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128). The error log is as follows:

    Reloaded modules: lazylinker_ext
    Traceback (most recent call last):

  File "<ipython-input-2-d60a2349532e>", line 1, in <module>
    runfile('C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py', wdir='C:/Users/YZC/Google Drive/sunday/data/RA/data_20100101_20150622')

  File "C:\Users\YZC\Anaconda3\lib\site-    packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)

  File "C:\Users\YZC\Anaconda3\lib\site-    packages\spyderlib\widgets\externalshell\sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)

  File "C:/Users/YZC/Google     Drive/sunday/data/RA/data_20100101_20150622/w2v_coherence.py", line 70, in     <module>
    model = gensim.models.Word2Vec.load('model_all_no_lemma')

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\models\word2vec.py",     line 1485, in load
    model = super(Word2Vec, cls).load(*args, **kwargs)

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 248,     in load
    obj = unpickle(fname)

  File "C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py", line 912, in unpickle
    return _pickle.loads(f.read())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)

  1. I checked and found that the default encoding is utf-8: import sys; sys.getdefaultencoding() gives Out[2]: 'utf-8'.

  2. I also added .decode('utf-8') when reading the file.
  3. I added a shebang line at the top and declared utf-8, so I really do not know why Python cannot read the file. Can anyone help me?

Here is the code:

    # -*- coding: utf-8 -*-
    import gensim
    import csv
    import numpy as np
    import math
    import string
    from nltk.corpus import stopwords, wordnet
    from nltk.stem import WordNetLemmatizer
    from textblob import TextBlob, Word
    
    
    
    class SpeechParser(object):
    
        def __init__(self, filename):
            self.filename = filename
            self.lemmatize = WordNetLemmatizer().lemmatize
            self.cached_stopwords = stopwords.words('english')
    
        def __iter__(self):
    
            # Python 3: csv.reader needs a text-mode file; binary mode ('rb')
            # cannot be combined with encoding=... (that raises ValueError)
            with open(self.filename, 'r', encoding='utf-8', newline='') as csvfile:
                file_reader = csv.reader(csvfile, delimiter=',', quotechar='|')
                headers = next(file_reader)  # Python 3: next(reader), not reader.next()
                for row in file_reader:
                    parsed_row = self.parse_speech(row[-2])
                    yield parsed_row
    
        def parse_speech(self, row):
    
            # Python 3: str has no .decode(), and str.translate() takes a
            # table from str.maketrans(), not the Python 2 (None, chars) form
            speech_words = row.replace('\r\n', ' ').strip().lower().translate(
                str.maketrans('', '', string.punctuation))
    
            return speech_words.split()
    
        # -- source: https://github.com/prateekpg2455/U.S-Presidential-Speeches/blob/master/speech.py --
        def pos(self, tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''
    
    if __name__ == '__main__':
    
        # instantiate object
        sentences = SpeechParser("sample.csv")
    
        # load an existing model
        model = gensim.models.Word2Vec.load('model_all_no_lemma')
    
    
    
        print('\n-----------------------------------------------------------')
        print('MODEL:\t{0}'.format(model))
    
        vocab = model.vocab
    
        # print log-probability of first 10 sentences
        row_count = 0
        print('\n------------- Scores for first 10 documents: -------------')
        for doc in sentences: 
            print(sum(model.score(doc))/len(doc))
            row_count += 1
            if row_count >= 10:  # stop after the first 10 documents
                break
        print('\n-----------------------------------------------------------')
    

1 Answer:

Answer 0 (score: 0)

It looks like a bug in Gensim, triggered when you try to load a pickle file written by Python 2 that contains non-ASCII characters under Python 3.

This happens when you call:

    model = gensim.models.Word2Vec.load('model_all_no_lemma')

In Python 3, during unpickling, pickle wants to convert legacy byte strings into (Unicode) strings, and its default action is to decode them as 'ASCII' in strict mode.
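
To see the mechanism in isolation, here is a minimal sketch (the pickle bytes are built by hand so it runs without Python 2): a protocol-0 pickle of the Python 2 byte string '\xc1' fails to load under the default 'ASCII' codec, while an explicit encoding= argument gets past it:

    import pickle

    # Hand-built protocol-0 pickle of the Python 2 str '\xc1'
    # (equivalent to pickle.dumps('\xc1') under Python 2)
    py2_pickle = b"S'\\xc1'\np0\n."

    try:
        pickle.loads(py2_pickle)  # defaults: encoding='ASCII', errors='strict'
    except UnicodeDecodeError as e:
        print(e)  # 'ascii' codec can't decode byte 0xc1 in position 0 ...

    print(pickle.loads(py2_pickle, encoding='latin1'))  # 'Á'  (decoded to str)
    print(pickle.loads(py2_pickle, encoding='bytes'))   # b'\xc1' (left as bytes)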

The fix depends on the encoding of your original pickle file, and it requires you to patch the gensim code.

I'm not familiar with gensim, so you will have to try the following two options:

Force UTF-8

Chances are, your non-ASCII data is in UTF-8.

  1. Edit C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py
  2. Go to line 912
  3. Change that line to:

    return _pickle.loads(f.read(), encoding='utf-8')
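
If you would rather not edit the installed package, the same effect can be sketched as a runtime monkey-patch of gensim.utils.unpickle applied before loading. This is an assumption-laden workaround, not official gensim API: it assumes the model file is a plain local file (gensim's own unpickle goes through smart_open) and that SaveLoad.load resolves unpickle from the gensim.utils module namespace:

    import pickle
    import gensim.utils

    def unpickle_utf8(fname):
        # Stand-in for gensim.utils.unpickle with an explicit encoding;
        # assumes a local path (no smart_open/S3 handling)
        with open(fname, 'rb') as f:
            return pickle.loads(f.read(), encoding='utf-8')

    gensim.utils.unpickle = unpickle_utf8  # patch before Word2Vec.load()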
    
Byte mode

Gensim in Python 3 might happily work with byte strings:

  1. Edit C:\Users\YZC\Anaconda3\lib\site-packages\gensim\utils.py
  2. Go to line 912
  3. Change that line to:

    return _pickle.loads(f.read(), encoding='bytes')
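
With either patch, the load call from the question should get past the decode error. Note that with encoding='bytes' any Python 2 str keys come back as bytes, so vocabulary lookups may need b'word' instead of 'word'. A quick check, assuming the model file from the question:

    import gensim

    model = gensim.models.Word2Vec.load('model_all_no_lemma')
    # str keys with encoding='utf-8'; bytes keys with encoding='bytes'
    print(type(next(iter(model.vocab))))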