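I downloaded the MNIST dataset from LeCun's site. What I want is to write Python code that extracts the gzip files and reads the dataset directly from a local directory, so that I no longer need to download anything or access the MNIST site.
Desired process: access folder/directory -> extract gzip -> read dataset (one-hot encoding)
How can I do this? Almost every tutorial accesses the LeCun or TensorFlow site to download and read the dataset. Thanks in advance!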
Answer 0 (score: 6)
This TensorFlow call
from tensorflow.examples.tutorials.mnist import input_data
input_data.read_data_sets('my/directory')
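...will not download anything if you already have the files in that directory.
But if for some reason you want to decompress the files yourself, here is how to do it: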
from tensorflow.contrib.learn.python.learn.datasets.mnist import extract_images, extract_labels

with open('my/directory/train-images-idx3-ubyte.gz', 'rb') as f:
    train_images = extract_images(f)
with open('my/directory/train-labels-idx1-ubyte.gz', 'rb') as f:
    train_labels = extract_labels(f)

with open('my/directory/t10k-images-idx3-ubyte.gz', 'rb') as f:
    test_images = extract_images(f)
with open('my/directory/t10k-labels-idx1-ubyte.gz', 'rb') as f:
    test_labels = extract_labels(f)
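Note that tensorflow.contrib was removed in TensorFlow 2.0, so the import above only works on 1.x releases. If TensorFlow is not available, the same gzip files can be read with just the standard library and NumPy. The following is a minimal sketch, not the TensorFlow implementation: the helper name read_idx_gz is made up for illustration, and it assumes the files are the standard unsigned-byte IDX files from the MNIST site.

```python
import gzip
import struct

import numpy as np


def read_idx_gz(path):
    """Read a gzip-compressed IDX file (the MNIST format) into a NumPy array."""
    with gzip.open(path, 'rb') as f:
        # Header: two zero bytes, a data-type code, and the number of dimensions.
        zero, type_code, ndim = struct.unpack('>HBB', f.read(4))
        if zero != 0 or type_code != 0x08:
            raise ValueError('not an unsigned-byte IDX file: %s' % path)
        # Each dimension size is a big-endian 32-bit unsigned integer.
        shape = struct.unpack('>' + 'I' * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)


# Hypothetical usage, reusing the directory name from the answer above:
# train_images = read_idx_gz('my/directory/train-images-idx3-ubyte.gz')
# train_labels = read_idx_gz('my/directory/train-labels-idx1-ubyte.gz')
```

The arrays returned this way are read-only views of the decompressed bytes; call .copy() or .astype(...) on them if you need to modify the data.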
Answer 1 (score: 4)
If you have already extracted the MNIST data, you can load the raw files directly with NumPy:
import numpy as np

def loadMNIST( prefix, folder ):
    # IDX headers store big-endian 32-bit integers
    intType = np.dtype( 'int32' ).newbyteorder( '>' )
    nMetaDataBytes = 4 * intType.itemsize

    data = np.fromfile( folder + "/" + prefix + '-images-idx3-ubyte', dtype = 'ubyte' )
    # image header: magic number, image count, rows, columns
    magicBytes, nImages, height, width = np.frombuffer( data[:nMetaDataBytes].tobytes(), intType )
    data = data[nMetaDataBytes:].astype( dtype = 'float32' ).reshape( [ nImages, height, width ] )

    # label header: magic number and label count, then one byte per label
    labels = np.fromfile( folder + "/" + prefix + '-labels-idx1-ubyte',
                          dtype = 'ubyte' )[2 * intType.itemsize:]

    return data, labels

trainingImages, trainingLabels = loadMNIST( "train", "../datasets/mnist/" )
testImages, testLabels = loadMNIST( "t10k", "../datasets/mnist/" )
and convert the labels to one-hot encoding:
def toHotEncoding( classification ):
    # emulates the functionality of tf.keras.utils.to_categorical( y )
    hotEncoding = np.zeros( [ len( classification ),
                              np.max( classification ) + 1 ] )
    hotEncoding[ np.arange( len( hotEncoding ) ), classification ] = 1
    return hotEncoding

trainingLabels = toHotEncoding( trainingLabels )
testLabels = toHotEncoding( testLabels )
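One thing to watch with the function above: the width of the encoding comes from np.max( classification ) + 1, so a batch that happens to contain no 9s would produce fewer than 10 columns. A possible variant with a fixed class count is sketched below; the name to_one_hot and the default of 10 classes are my assumptions for MNIST, not part of the answer.

```python
import numpy as np


def to_one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]


subset = np.array([3, 0, 7], dtype=np.uint8)  # no label 9 in this batch
print(to_one_hot(subset).shape)  # (3, 10)
```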
Answer 2 (score: 3)
I'll show how to load the data from scratch (for a better understanding) and how to display a digit image with matplotlib.pyplot:
# Python 2 code: cPickle and the print statement are not available in Python 3
import cPickle
import gzip
import numpy as np
import matplotlib.pyplot as plt

def load_data():
    path = '../../data/mnist.pkl.gz'
    f = gzip.open(path, 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()

    X_train, y_train = training_data[0], training_data[1]
    print X_train.shape, y_train.shape
    # (50000L, 784L) (50000L,)

    # get the first image and its label
    img1_arr, img1_label = X_train[0], y_train[0]
    print img1_arr.shape, img1_label
    # (784L,) , 5

    # reshape the first image (a 1-D vector) into a 2-D image
    img1_2d = np.reshape(img1_arr, (28, 28))
    # show it
    plt.subplot(111)
    plt.imshow(img1_2d, cmap=plt.get_cmap('gray'))
    plt.show()
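The code above is Python 2. On Python 3 the loading step would look roughly like the sketch below. The helper name load_mnist_pickle is hypothetical, and I'm assuming the file was pickled under Python 2, which is why encoding='latin1' is passed; without it, Python 3 typically fails to decode the pickled NumPy arrays.

```python
import gzip
import pickle


def load_mnist_pickle(path):
    """Load the (training, validation, test) tuples from an mnist.pkl.gz-style file."""
    with gzip.open(path, 'rb') as f:
        # encoding='latin1' lets Python 3 read NumPy arrays pickled by Python 2.
        return pickle.load(f, encoding='latin1')


# Hypothetical usage, reusing the path from the answer above:
# training_data, validation_data, test_data = load_mnist_pickle('../../data/mnist.pkl.gz')
# X_train, y_train = training_data
# print(X_train.shape, y_train.shape)
```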
You can also vectorize a label into a 10-dimensional unit vector with the following example function:
def vectorized_result(label):
    e = np.zeros((10, 1))
    e[label] = 1.0
    return e
Vectorizing the label from above:
print vectorized_result(img1_label)
# output as below:
# [[ 0.]
#  [ 0.]
#  [ 0.]
#  [ 0.]
#  [ 0.]
#  [ 1.]
#  [ 0.]
#  [ 0.]
#  [ 0.]
#  [ 0.]]
If you want to turn the data into CNN input, you can reshape it like this:
def load_data_v2():
    path = '../../data/mnist.pkl.gz'
    f = gzip.open(path, 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()

    X_train, y_train = training_data[0], training_data[1]
    print X_train.shape, y_train.shape
    # (50000L, 784L) (50000L,)

    X_train = np.array([np.reshape(item, (28, 28)) for item in X_train])
    y_train = np.array([vectorized_result(item) for item in y_train])
    print X_train.shape, y_train.shape
    # (50000L, 28L, 28L) (50000L, 10L, 1L)
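The per-item list comprehensions above can also be replaced by whole-array operations. The sketch below (Python 3) is one way to do that, under two assumptions of mine: the network wants a trailing channel axis, giving shape (N, 28, 28, 1), and flat (N, 10) labels rather than the (N, 10, 1) column vectors produced above. The function name to_cnn_arrays is made up for illustration; frameworks that expect channels first would need (N, 1, 28, 28) instead.

```python
import numpy as np


def to_cnn_arrays(X_flat, y, num_classes=10):
    """Reshape flat 784-pixel rows to (N, 28, 28, 1) and labels to (N, num_classes)."""
    X = np.asarray(X_flat, dtype=np.float32).reshape(-1, 28, 28, 1)
    Y = np.zeros((len(y), num_classes), dtype=np.float32)
    # Set one entry per row: column index = class label.
    Y[np.arange(len(y)), np.asarray(y, dtype=np.int64)] = 1.0
    return X, Y


# Synthetic stand-in for X_train / y_train so the sketch runs without the dataset.
X_demo = np.zeros((5, 784), dtype=np.float32)
y_demo = np.array([5, 0, 4, 1, 9])
X_cnn, Y_cnn = to_cnn_arrays(X_demo, y_demo)
print(X_cnn.shape, Y_cnn.shape)  # (5, 28, 28, 1) (5, 10)
```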