Unable to load a Gensim FastText model: UTF-8 Unicode error

Time: 2020-09-04 11:20:58

Tags: python encoding pickle gensim fasttext

I trained a FastText model for French using the Gensim library. Suddenly, the trained model can no longer be loaded into memory.

I am using the following code:

from gensim.models import FastText
fname = "filename"
model = FastText.load(fname)

It raises the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/gensim/models/fasttext.py", line 1070, in load
    model = super(FastText, cls).load(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 1244, in load
    model = super(BaseWordEmbeddingsModel, cls).load(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 603, in load
    return super(BaseAny2VecModel, cls).load(fname_or_handle, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/gensim/utils.py", line 426, in load
    obj = unpickle(fname)
  File "/usr/local/lib/python3.7/site-packages/gensim/utils.py", line 1384, in unpickle
    return _pickle.load(f, encoding='latin1')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x86 in position 14072054: invalid start byte

Since the model was trained on a large dataset, is there any way to recover or load it?

1 answer:

Answer 0: (score: 0)

This error means that the text stored in your model does not conform to the utf-8 encoding.

A solution that works with the already trained model is to set the unicode_errors flag when loading it:

from gensim.models import FastText
fname = "filename"
model = FastText.load(fname, unicode_errors='ignore')

However, this causes the words/characters in question to be ignored, which may not be ideal.

A better approach is to retrain the model with settings that conform to utf-8, but that means training it again.
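To see what ignoring decode errors means in practice, here is a minimal sketch using plain Python bytes rather than Gensim. The sample string and the helper name try_decode are made up for illustration; only the 0x86 byte is taken from the traceback in the question. It shows the same "invalid start byte" failure under strict decoding, and what is lost or substituted with the 'ignore' and 'replace' error handlers.

```python
# Hypothetical sample: valid UTF-8 French text with a stray 0x86 byte in the middle.
raw = "modèle".encode("utf-8") + b"\x86" + " français".encode("utf-8")


def try_decode(data, errors="strict"):
    """Decode bytes as UTF-8; return (text, None) on success or (None, message) on failure."""
    try:
        return data.decode("utf-8", errors=errors), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# Strict decoding fails on the stray byte, as in the question's traceback.
strict_text, strict_error = try_decode(raw)

# 'ignore' silently drops the offending byte.
ignored_text, _ = try_decode(raw, errors="ignore")

# 'replace' keeps a visible U+FFFD marker where the byte was.
replaced_text, _ = try_decode(raw, errors="replace")

print(strict_error)
print(ignored_text)
print(repr(replaced_text))
```

With 'ignore' the damage is invisible afterwards, so an affected word may end up stored under a slightly different spelling. With 'replace' the U+FFFD marker at least makes the affected entries easy to find later.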