Question

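I am reading the GoogleNews-vectors-negative300.bin file with gensim and trying to convert it to PyTorch format using Vectors. However, a ValueError is raised. One workaround is to convert the file to .txt format, but the file then becomes three times larger. Is there a way to fix this error while keeping the file in binary format?

Script: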

from gensim import models
from torchtext.vocab import Vectors
vectors = models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)
models.KeyedVectors.save_word2vec_format(vectors, 'data/GoogleNews-vectors-negative300_pytorch.txt')
vectors = Vectors(name='data/GoogleNews-vectors-negative300_pytorch.bin', cache='./data')

Error:

ValueError: could not convert string to float: b'\x00\x00\x94:\x00\x00k\xba\x00\x00\x
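
For context, the error message is consistent with a text-style parser being fed binary data. The sketch below uses only the standard library and a tiny made-up file, so it is an illustration rather than the actual torchtext or gensim code: `write_word2vec_binary`, `parse_as_text` and `parse_as_binary` are hypothetical helpers, and the assumed layout (a text header, then each word followed by a space and raw little-endian float32 values) is the usual word2vec binary format.

```python
import os
import struct
import tempfile


def write_word2vec_binary(path, vectors):
    """Write {word: [floats]} using the word2vec binary layout (assumed)."""
    dim = len(next(iter(vectors.values())))
    with open(path, "wb") as f:
        # Header line is plain text: "<vocab_size> <dim>\n"
        f.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
        for word, values in vectors.items():
            f.write(word.encode("utf-8") + b" ")
            f.write(struct.pack(f"<{dim}f", *values))  # raw float32 bytes
            f.write(b"\n")


def parse_as_text(path):
    """Text-style parsing: split each line on spaces and call float()."""
    parsed = {}
    with open(path, "rb") as f:
        f.readline()  # skip the header line
        for line in f:
            entries = line.rstrip().split(b" ")
            word = entries[0].decode("utf-8")
            parsed[word] = [float(x) for x in entries[1:]]
    return parsed


def parse_as_binary(path):
    """Binary-style parsing: read the word, then unpack dim float32 values."""
    parsed = {}
    with open(path, "rb") as f:
        vocab_size, dim = map(int, f.readline().split())
        for _ in range(vocab_size):
            word = bytearray()
            while True:
                ch = f.read(1)
                if not ch:
                    raise EOFError("unexpected end of file")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline left by the previous record
                    word.extend(ch)
            values = struct.unpack(f"<{dim}f", f.read(4 * dim))
            parsed[word.decode("utf-8")] = list(values)
    return parsed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.bin")
    write_word2vec_binary(path, {"king": [0.5, -1.25], "queen": [2.0, 0.75]})
    try:
        parse_as_text(path)
        text_parse_error = None
    except ValueError as exc:  # float() cannot parse raw float32 bytes
        text_parse_error = exc
    binary_vectors = parse_as_binary(path)
```

In this toy case the text-style parser fails on the raw bytes with the same kind of ValueError, while the binary-style parser recovers the vectors.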

Answer 1

I eventually solved it by loading the file with gensim, as follows.

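import torch
from gensim.models import KeyedVectors
from torchtext import data

# emb_bin_filename, my_tokenizer, lower, train_data and device are defined elsewhere.
# Load the binary word2vec file directly with gensim.
emb_model = KeyedVectors.load_word2vec_format(emb_bin_filename, binary=True, encoding="ISO-8859-1", unicode_errors='ignore')
# Map each pretrained token to its row in emb_model.vectors (gensim >= 4.0 renames index2word to index_to_key).
word2index = {token: token_index for token_index, token in enumerate(emb_model.index2word)}
TEXT = data.Field(tokenize=my_tokenizer(), lower=lower)
TEXT.build_vocab(train_data)
# Attach the pretrained vectors to the vocabulary built from the training data.
TEXT.vocab.set_vectors(word2index, torch.from_numpy(emb_model.vectors).float().to(device), emb_model.vector_size)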
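
To make the role of word2index clearer, here is a minimal pure-Python sketch of the alignment that set_vectors is expected to perform: each token in the task vocabulary gets the pretrained row found through the token-to-index mapping. This is a conceptual illustration, not torchtext's implementation; `align_vectors`, the toy vocabulary and the toy vectors are hypothetical, and filling missing tokens with zeros is an assumption.

```python
def align_vectors(itos, pretrained_stoi, pretrained_vectors, dim):
    """Build one embedding row per vocabulary token.

    itos: list of tokens in the task vocabulary (index -> token).
    pretrained_stoi: dict mapping a pretrained token to its row index.
    pretrained_vectors: sequence of rows, one per pretrained token.
    """
    aligned = []
    for token in itos:
        row = pretrained_stoi.get(token)
        if row is None:
            # Assumption: tokens without a pretrained vector get zeros.
            aligned.append([0.0] * dim)
        else:
            aligned.append(list(pretrained_vectors[row]))
    return aligned


# Toy pretrained embeddings (stand-in for emb_model.vectors).
pretrained_words = ["king", "queen", "apple"]
pretrained_vectors = [[0.5, -1.25], [2.0, 0.75], [1.0, 1.0]]
word2index = {token: i for i, token in enumerate(pretrained_words)}

# Toy task vocabulary (stand-in for TEXT.vocab.itos).
itos = ["<unk>", "<pad>", "apple", "king"]
embedding_rows = align_vectors(itos, word2index, pretrained_vectors, dim=2)
```

Only tokens that appear in the training vocabulary end up with a row, so the resulting matrix can be much smaller than the full pretrained file.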
