I save a numpy array using export_vectors, defined below. In this function, I load space-separated string values and store them as floats in a numpy array.
def export_vectors(vocab, input_filename, output_filename, dim):
    embeddings = np.zeros([len(vocab), dim])
    with open(input_filename) as f:
        for line in f:
            line = line.strip().split(' ')
            word = line[0]
            embedding = line[1:]
            if word in vocab:
                word_idx = vocab[word]
                embeddings[word_idx] = np.asarray(embedding).astype(float)
    np.savez_compressed(output_filename, embeddings=embeddings)
Here, embeddings is an ndarray of type float64.
However, when I then try to load the file using:
def get_vectors(filename):
    with open(filename) as f:
        return np.load(f)["embeddings"]
I get this error:
File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 10: invalid start byte
Why is this happening?
Answer 0 (score: 4)
You are probably using open incorrectly.
I suspect you need to pass it a flag to use binary mode, like this (docs):
open(filename, 'rb') # r: read-only; b: binary
The docs explain the default behavior: "Normally, files are opened in text mode, that means, you read and write strings from and to the file, which are encoded in a specific encoding."
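To see the difference between the two modes in isolation, here is a minimal sketch. The file name demo.bin and its byte content are made up for illustration; they are not your .npz file, just a few bytes that are not valid UTF-8, which is enough to trigger the same kind of error.

```python
import os
import tempfile

# Hypothetical demo file: 0x99 can never start a UTF-8 character,
# so these bytes cannot be decoded as UTF-8 text.
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"PK\x03\x04\x99\x00")

# Text mode: Python tries to decode the bytes into a string and fails.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode: the raw bytes come back untouched, with no decoding step.
with open(path, "rb") as f:
    raw = f.read()

print(type(text_mode_error).__name__)
print(raw)
```

The encoding is passed explicitly here only so the sketch behaves the same regardless of the platform default.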
But you can simply pass the file path itself (since np.load accepts a file-like object, string, or pathlib.Path):
np.load(filename) # This would be more natural
# as it's kind of the direct inverse of your save-code;
# -> no manual file-handling
(A simplified rule: anything that uses general-purpose compression works with binary files, not text files!)
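Putting both options together, a round trip might look roughly like the sketch below. The sample array, the temporary path and the helper name get_vectors_from_handle are assumptions for illustration only; your real embeddings would come from export_vectors.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embeddings (assumed shape and values).
embeddings = np.arange(6, dtype=float).reshape(3, 2)

# np.savez_compressed appends ".npz" if the name does not already end with it,
# so the extension is spelled out here to keep the load path identical.
path = os.path.join(tempfile.mkdtemp(), "vectors.npz")
np.savez_compressed(path, embeddings=embeddings)


def get_vectors(filename):
    # Option 1: hand the path to np.load and let it manage the file.
    with np.load(filename) as data:
        return data["embeddings"]


def get_vectors_from_handle(filename):
    # Option 2: open the file yourself, but in binary mode.
    with open(filename, "rb") as f:
        return np.load(f)["embeddings"]


loaded_by_path = get_vectors(path)
loaded_by_handle = get_vectors_from_handle(path)
print(loaded_by_path.dtype, loaded_by_path.shape)
```

Both functions should return the same float64 array; the first is shorter and avoids the text-versus-binary question entirely.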