So I'm trying to do some topic modelling, and I run into a problem when I try to use doc2bow to get the frequency of the words in my corpus:
texta = acelem_array
textd = dcmslem_array
corpusa = [ace_word_id.doc2bow(texta) for text in texta]
Error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-25-67c5535e8680> in <module>()
2 textd = dcmslem_array
3
----> 4 corpusa = [ace_word_id.doc2bow(texta) for text in ace_lemmas]
<ipython-input-25-67c5535e8680> in <listcomp>(.0)
2 textd = dcmslem_array
3
----> 4 corpusa = [ace_word_id.doc2bow(texta) for text in ace_lemmas]
~\Anaconda3\lib\site-packages\gensim\corpora\dictionary.py in doc2bow(self, document, allow_update, return_missing)
243 counter = defaultdict(int)
244 for w in document:
--> 245 counter[w if isinstance(w, unicode) else unicode(w, 'utf-8')] += 1
246
247 token2id = self.token2id
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 8: invalid start byte
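For reference, the same kind of message can be reproduced with the standard library alone. This is only a minimal sketch: the byte string below is made up for illustration and is not taken from the corpus, and it assumes that on Python 3 gensim's `unicode(w, 'utf-8')` behaves like `str(w, 'utf-8')`.

```python
# Hypothetical bytes, not from the corpus: 0xf8 is 'ø' in Latin-1,
# but it is not a valid first byte of a UTF-8 sequence.
raw = b"sm\xf8r"

try:
    # Assumption: this mirrors the call shape of unicode(w, 'utf-8') on Python 3.
    str(raw, "utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

If this is what happens inside doc2bow, the error would suggest that at least one item it iterates over is a bytes-like object rather than a str.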
I have checked that the files I am using are opened with utf-8 encoding:
ace_output = open('C:/Users/Ayan-Yue/Documents/PROPOSAL/TESTCORPUS/btask/bigaceutf.txt', 'w', encoding='utf8')
with open('C:/Users/Ayan-Yue/Documents/PROPOSAL/TESTCORPUS/btask/bigace.txt', 'r', encoding='utf8', errors='ignore') as text:
    for line in text:
        ace_output.write(line)
ace_output.close()
with open('C:/Users/Ayan-Yue/Documents/PROPOSAL/TESTCORPUS/btask/bigaceutf.txt', 'r', encoding='utf8') as myfile:
    ace_text = myfile.read()
I also checked that the array used to create texta contains unicode strings:
def lemmas_array(lemmas):
    lemmas_array = np.zeros((len(lemmas), 1), dtype = 'U15')
    for i in range(len(lemmas)):
        lemmas_array[i] = lemmas[i]
    return lemmas_array
acelem_array = lemmas_array(ace_lemmas)
texta = acelem_array
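A dtype of 'U15' does not by itself guarantee that each item doc2bow iterates over is a plain str. The following is a rough sketch of a check for that; `find_non_str_tokens` is a hypothetical helper written for illustration (it is not part of gensim), and the sample documents are invented.

```python
def find_non_str_tokens(document):
    """Return (index, type name) for every item that is not a plain str."""
    return [
        (i, type(token).__name__)
        for i, token in enumerate(document)
        if not isinstance(token, str)
    ]


# Invented sample data: a flat list of tokens versus one wrapped per row.
flat_doc = ["stone", "axe", "smør"]
nested_doc = [["stone"], ["axe"], ["smør"]]

print(find_non_str_tokens(flat_doc))    # no offending items
print(find_non_str_tokens(nested_doc))  # every row is flagged as a list
```

A NumPy array of shape (n, 1), like the one built by lemmas_array, behaves like the nested case: iterating over it yields one-element row arrays rather than str. Flattening it first, for example with `acelem_array.ravel().tolist()`, may therefore be worth trying before calling doc2bow.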
Let me know if I need to provide any more information.
Thanks a lot!