Question

I want to use a word2vec model that contains a lot of Indic characters. The model was trained by Facebook: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md. (I am using the Gujarati vectors.)

I installed gensim and tried to load the model, but got the following error:

In [1]: import gensim  

In [2]: from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('wiki.gu/wiki.gu.bin', binary=True,unicode_errors='ignore')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 32: invalid start byte

I tried loading the model in both Python 2.7 and 3.5, and it fails the same way in each. So how can I load the model in gensim? Thanks.

Answer 1

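The FastText binary format is not compatible with Gensim's word2vec format; the former contains additional information about subword units, which word2vec does not use.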

This issue (and a workaround) is discussed on the FastText GitHub page. In short, you have to load the text format instead (https://stackoverflow.com/users/4535284/ashutosh-baheti gave you the link in a comment above).

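Once the text format is loaded, you can use Gensim to save the vectors in binary format, which considerably reduces the model size and speeds up future loading.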
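
The workaround can be sketched roughly as follows. This is a minimal sketch, assuming Python 3 and that the text-format vectors follow the usual word2vec text layout (a first line with the vocabulary size and the dimensionality, then one word per line followed by its values). The helper names `read_vec_header` and `convert_text_vectors_to_binary`, and the tiny sample file, are made up for illustration; only `KeyedVectors.load_word2vec_format` and `save_word2vec_format` are Gensim calls.

```python
import os
import tempfile


def read_vec_header(text_path):
    """Return (vocab_size, dimensions) from the first line of a text-format vector file."""
    with open(text_path, encoding="utf-8") as handle:
        vocab_size, dimensions = handle.readline().split()
    return int(vocab_size), int(dimensions)


def convert_text_vectors_to_binary(text_path, binary_path):
    """Load text-format vectors with Gensim and re-save them in word2vec binary format."""
    # Imported here so the header check above also works without Gensim installed.
    from gensim.models import KeyedVectors

    # binary=False: the input is the plain-text vector file, not the FastText .bin model.
    word_vectors = KeyedVectors.load_word2vec_format(text_path, binary=False)
    # The output is in word2vec binary format, which Gensim can reload with binary=True.
    word_vectors.save_word2vec_format(binary_path, binary=True)
    return word_vectors


# Hypothetical two-word stand-in for the real download, used only for the header check.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.vec")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write("2 3\nકેમ 0.1 0.2 0.3\nછો 0.4 0.5 0.6\n")

print(read_vec_header(sample_path))  # prints (2, 3)
```

With the real download, the call would look something like `convert_text_vectors_to_binary('wiki.gu/wiki.gu.vec', 'wiki.gu/wiki.gu.w2v.bin')`, where both paths are assumptions about where the files live. The converted file should then load with `KeyedVectors.load_word2vec_format(..., binary=True)`, the same call that failed on the FastText `.bin` file in the question.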

https://github.com/facebookresearch/fastText/issues/171#issuecomment-294295302

'utf8' decode error when loading a word2vec model
