Question

I save a numpy array using export_vectors, defined below. In this function I load space-separated string values and store them as floats in a numpy array.

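import numpy as np

def export_vectors(vocab, input_filename, output_filename, dim):
    # One row per vocabulary entry; rows for words absent from the input stay zero.
    embeddings = np.zeros([len(vocab), dim])
    with open(input_filename) as f:
        for line in f:
            # Each line holds a word followed by its space-separated vector values.
            line = line.strip().split(' ')
            word = line[0]
            embedding = line[1:]
            if word in vocab:
                word_idx = vocab[word]
                embeddings[word_idx] = np.asarray(embedding).astype(float)

    # Writes a compressed (binary) .npz archive with the array stored under "embeddings".
    np.savez_compressed(output_filename, embeddings=embeddings)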

Here embeddings is an ndarray of dtype float64.

However, when I then try to load the file using:

def get_vectors(filename):
    with open(filename) as f:
        return np.load(f)["embeddings"]

I get the following error:

File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 10: invalid start byte

Why does this happen?

Answer 1

You are probably using open incorrectly. I suspect you need to pass it a flag so that it opens the file in binary mode, like this (docs):

open(filename, 'rb')  # r: read-only; b: binary

The docs explain the default behavior: Normally, files are opened in text mode, that means, you read and write strings from and to the file, which are encoded in a specific encoding.
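As a minimal sketch of the binary-mode fix, here is get_vectors from the question with only the open mode changed. The demo file, the names demo_path and demo_embeddings, and the small array are hypothetical stand-ins for the real output of export_vectors. The last lines read the first four bytes of the archive to show that an .npz file starts with a zip signature rather than text, which is why decoding it as UTF-8 fails.

```python
import os
import tempfile

import numpy as np


def get_vectors(filename):
    # Open in binary mode ('rb'): an .npz archive is a zip file, not text.
    with open(filename, "rb") as f:
        # Index inside the with-block so the array is read before f closes.
        return np.load(f)["embeddings"]


# Hypothetical demo data; the real embeddings would come from export_vectors.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_vectors.npz")
demo_embeddings = np.arange(6, dtype=float).reshape(3, 2)
np.savez_compressed(demo_path, embeddings=demo_embeddings)

loaded = get_vectors(demo_path)

# Peek at the raw bytes: the file begins with the zip local-file-header signature.
with open(demo_path, "rb") as f:
    magic = f.read(4)
```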

But you can simply pass the file path itself (np.load accepts a file-like object, string, or pathlib.Path):

np.load(filename)  # This would be more natural
                   # as it's kind of the direct inverse of your save-code;
                   # -> no manual file-handling

(A simplified rule: anything that uses general-purpose compression works with binary files, not text files!)
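A short round-trip sketch of the path-based approach, assuming a temporary directory and random data in place of the real embeddings (out_dir, base_path and saved_path are illustrative names). One detail to watch: when given a string path without the extension, np.savez_compressed appends ".npz", so the name you load may differ from the name you passed when saving.

```python
import os
import tempfile

import numpy as np

out_dir = tempfile.mkdtemp()

# No extension here on purpose: savez_compressed appends ".npz" itself.
base_path = os.path.join(out_dir, "vectors")
embeddings = np.random.rand(4, 3)
np.savez_compressed(base_path, embeddings=embeddings)

saved_path = base_path + ".npz"

# Passing the path lets NumPy open the file in binary mode on its own.
# Using the NpzFile as a context manager closes the archive afterwards.
with np.load(saved_path) as data:
    names = list(data.files)        # keys given as keyword arguments when saving
    restored = data["embeddings"]   # array is fully read into memory here
```

Because the array is read inside the with-block, restored remains usable after the archive is closed.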

How to load a file created with numpy.savez_compressed?
