I generated some data using the program at https://leon.bottou.org/projects/infimnist.
As far as I can tell, it is in some kind of binary format:
b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05 ...
I need to extract the labels and images from two datasets generated this way:
https://leon.bottou.org/projects/infimnist
with open("test10k-labels", "rb") as binary_file:
    data = binary_file.read()
    print(data)
>>> b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05 ...
b"\x00\x00\x08\x01 ...".decode('ascii')
>>> "\x00\x00\x08\x01 ..."
I also tried the binascii package, but that did not help.
Thanks for your help!
To create the datasets, I download the package from the following link: https://leon.bottou.org/projects/infimnist.
$ cd dir_of_folder
$ make
Then I took the path of the infimnist executable that the build produced and ran:
$ app_path lab 10000 69999 > mnist60k-labels-idx1-ubyte
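This should place the file I used in the folder.
The command after app_path can be replaced with any of the other commands listed on that page.
It works! With a few numpy functions, the images can be restored to their normal orientation.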
import struct
from array import array
import numpy as np

# for the labels
with open(path, "rb") as binary_file:
    # skip the 8-byte header (magic number and item count) so only labels remain
    magic, size = struct.unpack(">II", binary_file.read(8))
    y_train = np.array(array("B", binary_file.read()))
# for the images
with open("images path", "rb") as binary_file:
    images = []
    emnistRotate = True
    # 16-byte header: magic number, image count, rows, cols (big-endian uint32)
    magic, size, rows, cols = struct.unpack(">IIII", binary_file.read(16))
    if magic != 2051:
        raise ValueError('Magic number mismatch, expected 2051, got {}'.format(magic))
    for i in range(size):
        images.append([0] * rows * cols)
    image_data = array("B", binary_file.read())
    for i in range(size):
        images[i][:] = image_data[i * rows * cols:(i + 1) * rows * cols]
        # for some reason EMNIST is mirrored and rotated
        if emnistRotate:
            x = image_data[i * rows * cols:(i + 1) * rows * cols]
            subs = []
            for r in range(rows):
                subs.append(x[(rows - r) * cols - cols:(rows - r)*cols])
            l = list(zip(*reversed(subs)))
            fixed = [item for sublist in l for item in sublist]
            images[i][:] = fixed
x = []
for image in images:
    x.append(np.rot90(np.flip(np.array(image).reshape((28,28)), 1), 1))
x_train = np.array(x)
A crazy solution for such a simple thing :)
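One likely reason this feels roundabout: on a small test array, the `emnistRotate` step behaves like a plain transpose, and `np.rot90(np.flip(a, 1), 1)` is a transpose as well, so the two cancel. The sketch below checks this on a 2x3 array. `emnist_fix` is a hypothetical helper that mirrors the loop above. If the same holds for the real files, setting `emnistRotate = False` and dropping the final NumPy step should give the same images, though that has not been checked against the infimnist output here.

```python
import numpy as np

def emnist_fix(x, rows, cols):
    """Hypothetical helper mirroring the emnistRotate loop; returns a flat pixel list."""
    subs = []
    for r in range(rows):
        # rows are collected from last to first ...
        subs.append(x[(rows - r) * cols - cols:(rows - r) * cols])
    # ... then put back in order and zipped, which swaps rows and columns
    l = list(zip(*reversed(subs)))
    return [item for sublist in l for item in sublist]

rows, cols = 2, 3
flat = list(range(rows * cols))  # [[0, 1, 2], [3, 4, 5]] in row-major order
original = np.array(flat).reshape(rows, cols)

# The rearranged pixels, read as cols x rows, equal the transpose of the original.
fixed = np.array(emnist_fix(flat, rows, cols)).reshape(cols, rows)
print(np.array_equal(fixed, original.T))   # True

# Flipping left-right and rotating by 90 degrees transposes again.
restored = np.rot90(np.flip(fixed, 1), 1)
print(np.array_equal(restored, original))  # True
```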
Answer 0: (score: 1)
OK, so looking at the python-mnist source, it seems the correct way to unpack the binary format is as follows:
import struct
from array import array
with open("test10k-labels", "rb") as binary_file:
    # 8-byte header: magic number and label count (big-endian uint32)
    magic, size = struct.unpack(">II", binary_file.read(8))
    if magic != 2049:
        raise ValueError("Magic number mismatch, expected 2049, got {}".format(magic))
    # every remaining byte is one label
    labels = array("B", binary_file.read())
print(labels)
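As a quick sanity check, the bytes quoted in the question can be decoded by hand. The sketch below uses only the visible prefix of that output (the rest is elided in the question), and the name `sample` is just for illustration:

```python
import struct
from array import array

# Visible prefix of the bytes printed in the question; the rest is elided there.
sample = b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05"

# The first 8 bytes are two big-endian unsigned 32-bit integers.
magic, size = struct.unpack(">II", sample[:8])
print(magic, size)  # 2049 10000

# Every remaining byte is one label.
labels = array("B", sample[8:])
print(labels.tolist())  # [7, 2, 1, 0, 4, 1, 4, 9, 5]
```

This is why decoding the data as ASCII does not help: the file is a small header followed by raw byte values, not text.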
Update
I have not tested this extensively, but the following code should work. It is adapted from the python-mnist source mentioned above:
from array import array
import struct
with open("mnist8m-patterns-idx3-ubyte", "rb") as binary_file:
    images = []
    emnistRotate = True
    # 16-byte header: magic number, image count, rows, cols (big-endian uint32)
    magic, size, rows, cols = struct.unpack(">IIII", binary_file.read(16))
    if magic != 2051:
        raise ValueError('Magic number mismatch, expected 2051, got {}'.format(magic))
    for i in range(size):
        images.append([0] * rows * cols)
    image_data = array("B", binary_file.read())
    for i in range(size):
        images[i][:] = image_data[i * rows * cols:(i + 1) * rows * cols]
        # for some reason EMNIST is mirrored and rotated
        if emnistRotate:
            x = image_data[i * rows * cols:(i + 1) * rows * cols]
            subs = []
            for r in range(rows):
                subs.append(x[(rows - r) * cols - cols:(rows - r)*cols])
            l = list(zip(*reversed(subs)))
            fixed = [item for sublist in l for item in sublist]
            images[i][:] = fixed
print(images)
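If NumPy is available, the same file layout can be parsed more compactly. This is a minimal sketch, not taken from python-mnist: the helper name `parse_idx3_images` is hypothetical, and it assumes the 16-byte header read above followed by one unsigned byte per pixel in row-major order. It applies no rotation, so any orientation fix would have to be added separately.

```python
import struct
import numpy as np

def parse_idx3_images(raw):
    """Hypothetical helper: turn the bytes of an image file into a (size, rows, cols) uint8 array."""
    magic, size, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("Magic number mismatch, expected 2051, got {}".format(magic))
    # Interpret everything after the header as unsigned bytes, without copying.
    pixels = np.frombuffer(raw, dtype=np.uint8, offset=16)
    return pixels.reshape(size, rows, cols)

# Tiny synthetic input standing in for a real file: two 2x3 images.
header = struct.pack(">IIII", 2051, 2, 2, 3)
raw = header + bytes(range(12))
images = parse_idx3_images(raw)
print(images.shape)  # (2, 2, 3)
```

For a real file, pass in the result of `binary_file.read()` from a file opened in "rb" mode.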
Previous answer:
You can use the python-mnist library:
from mnist import MNIST
mndata = MNIST('./data')
images, labels = mndata.load_training()