Why do I get a UnicodeDecodeError when calling doc2bow on an array of utf-8 encoded strings?

Time: 2018-11-06 16:11:03

Tags: unicode utf-8 gensim topic-modeling

I'm trying to do some topic modeling, and I run into a problem when I use doc2bow to get the frequency of each word in the corpus:

texta = acelem_array
textd = dcmslem_array

corpusa = [ace_word_id.doc2bow(texta) for text in texta]

The error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-25-67c5535e8680> in <module>()
      2 textd = dcmslem_array
      3
----> 4 corpusa = [ace_word_id.doc2bow(texta) for text in ace_lemmas]

<ipython-input-25-67c5535e8680> in <listcomp>(.0)
      2 textd = dcmslem_array
      3
----> 4 corpusa = [ace_word_id.doc2bow(texta) for text in ace_lemmas]

~\Anaconda3\lib\site-packages\gensim\corpora\dictionary.py in doc2bow(self, document, allow_update, return_missing)
    243         counter = defaultdict(int)
    244         for w in document:
--> 245             counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
    246
    247         token2id = self.token2id

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 8: invalid start byte
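One possible reading of the traceback, sketched with the standard library only: the failing line calls `unicode(w, 'utf-8')` (where `unicode` is presumably an alias for `str` on Python 3) on something that is not a `str`. If `w` is a NumPy array element rather than a Python string, that call may end up decoding the array's raw buffer. The sketch below assumes NumPy stores each character of a `'U'` array as 4 bytes (UTF-32, little-endian on typical x86 machines) and uses a hypothetical token, `smørbrød`, whose third character is `ø` (U+00F8); neither the token nor the byte order comes from the original data.

```python
# Hypothetical token; any word whose third character is "ø" (U+00F8) behaves the same.
token = "smørbrød"

# Assumption: a NumPy 'U' array keeps 4 bytes per character (UTF-32),
# little-endian here. Encoding by hand mimics that raw buffer.
raw = token.encode("utf-32-le")

try:
    # Same call shape as the failing line: str(<bytes-like object>, 'utf-8').
    str(raw, "utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# The first 8 bytes (s, m plus NUL padding) are valid UTF-8; byte 8 is 0xf8.
print(message)
```

If this reading is right, the file encoding is not the culprit: the text is valid Unicode, but its in-memory UTF-32 bytes are being reinterpreted as UTF-8.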

I checked that the file I'm working with was opened with utf-8 encoding:

ace_output = open('C:/Users/Ayan-Yue/Documents/PROPOSAL/TESTCORPUS/btask/bigaceutf.txt', 'w', encoding='utf8')
with open('C:/Users/Ayan-Yue/Documents/PROPOSAL/TESTCORPUS/btask/bigace.txt', 'r', encoding='utf8', errors='ignore') as text:
    for line in text:
        ace_output.write(line)
ace_output.close()

with open('C:/Users/Ayan-Yue/Documents/PROPOSAL/TESTCORPUS/btask/bigaceutf.txt', 'r', encoding='utf8') as myfile:
    ace_text = myfile.read()

I also checked that the array used to create texta contains unicode strings:

def lemmas_array(lemmas):
    lemmas_array = np.zeros((len(lemmas), 1), dtype = 'U15')
    for i in range(len(lemmas)):
        lemmas_array[i] = lemmas[i]
    return lemmas_array

acelem_array = lemmas_array(ace_lemmas)
texta = acelem_array
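A detail that may matter here: `lemmas_array` has shape `(len(lemmas), 1)`, so iterating over it yields length-1 arrays rather than Python `str` objects, and the list comprehension above passes `texta` instead of the loop variable `text`. The following is a minimal sketch of the input shape a bag-of-words step generally expects, namely a list of documents where each document is a plain list of `str` tokens. `to_bow`, `documents`, and `token2id` are hypothetical stand-ins written for illustration, not gensim's implementation.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Hypothetical stand-in for a doc2bow-style call: (token_id, count) pairs."""
    if isinstance(tokens, str):
        raise TypeError("expected a list of tokens, not a single string")
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical data: each document is a plain list of str tokens.
documents = [["smørbrød", "kaffe", "kaffe"], ["kaffe", "te"]]

# Build a vocabulary in first-seen order.
token2id = {}
for doc in documents:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

# The loop variable `doc` is what gets passed, one document at a time.
corpus = [to_bow(doc, token2id) for doc in documents]
print(corpus)  # [[(0, 1), (1, 2)], [(1, 1), (2, 1)]]
```

With the NumPy array from the question, something like `acelem_array.ravel().tolist()` should give a flat list of Python strings; if `ace_lemmas` is already such a list, passing it directly as a single document, for example `[ace_word_id.doc2bow(ace_lemmas)]`, may avoid the array altogether.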

Let me know if I need to provide any more information.

Many thanks!

0 answers:

No answers