Question

I'm trying to download the MNIST dataset and decode it without writing it to disk (mostly for fun).

import struct
from gzip import GzipFile
from urllib.request import urlopen

import numpy as np

def load_images():
    request_stream = urlopen('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz')
    zip_file = GzipFile(fileobj=request_stream, mode='rb')
    with zip_file as fd:
        magic, numberOfItems = struct.unpack('>ii', fd.read(8))
        rows, cols = struct.unpack('>II', fd.read(8))
        images = np.fromfile(fd, dtype='uint8')  # < here be dragons
        images = images.reshape((numberOfItems, rows, cols))
        return images

This code fails with OSError: obtaining file position failed, an error that seems hard to look up. What is the problem?

Answer 1

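The problem seems to be that the objects provided by gzip and similar modules are not real file objects (unsurprisingly), whereas numpy tries to read through an actual FILE* pointer, so this cannot work.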

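If it is acceptable to read the whole file into memory (it may not be), you can work around this by reading everything after the header into a bytearray and deserializing from that: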

rows, cols = struct.unpack('>II', fd.read(8))
# Read the remaining (non-header) bytes into memory, then view them as uint8.
b = bytearray(fd.read())
images = np.frombuffer(b, dtype='uint8')
images = images.reshape((numberOfItems, rows, cols))
return images
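
To try the workaround without touching the network, here is a minimal sketch that builds a tiny gzip-compressed payload in memory and parses it the same way. The helper names parse_idx_images and make_fake_idx_gzip, the array sizes, and the pixel values are made up for illustration; the BytesIO object simply stands in for the response returned by urlopen.

```python
import gzip
import io
import struct

import numpy as np


def parse_idx_images(fd):
    """Parse IDX3-style image data from any object with a read() method."""
    magic, number_of_items = struct.unpack('>ii', fd.read(8))
    rows, cols = struct.unpack('>II', fd.read(8))
    # Read the remaining bytes into memory instead of calling np.fromfile,
    # which expects a real OS-level file.
    data = bytearray(fd.read())
    images = np.frombuffer(data, dtype='uint8')
    return images.reshape((number_of_items, rows, cols))


def make_fake_idx_gzip(number_of_items=3, rows=4, cols=5):
    """Build a small gzip-compressed payload with made-up pixel data."""
    # 2051 is the magic number used by the MNIST image files.
    header = struct.pack('>iiII', 2051, number_of_items, rows, cols)
    pixels = bytes(range(number_of_items * rows * cols))
    return gzip.compress(header + pixels)


# Hypothetical stand-in for the urlopen(...) response object.
stream = io.BytesIO(make_fake_idx_gzip())
with gzip.GzipFile(fileobj=stream, mode='rb') as fd:
    images = parse_idx_images(fd)

print(images.shape)  # (3, 4, 5)
```

Because the buffer is a bytearray, the resulting array is writable; passing the bytes object from fd.read() directly to np.frombuffer would also work but yields a read-only array.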

Reading numpy data from a GZip file over the network
