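I have a large dataset of numpy integer arrays that I want to analyze on a GPU. The dataset is too large to fit in the GPU's main memory, so I am trying to serialize it into a TFRecord and then use the API to stream the records for processing. The code below is an example: it creates some fake data, serializes it into a TFRecord, and then uses a TF session to read the data back into memory, parsing it with the map() function. My original data is non-homogeneous in the dimensions of the numpy arrays, although each one is a 3D array whose first axis has length 10. I recreated that heterogeneity with random numbers when generating the fake data. The idea is to store the size of each image when I serialize the data, so that I can use it to restore each array to its original shape. For whatever reason, it does not work. Here is the code: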
import numpy as np
from skimage import io
from skimage.io import ImageCollection
import tensorflow as tf
import argparse

#A function for parsing TFRecords
def record_parser(record):
    keys_to_features = {
        'fil' : tf.FixedLenFeature([],tf.string),
        'm' : tf.FixedLenFeature([],tf.int64),
        'n' : tf.FixedLenFeature([],tf.int64)}
    parsed = tf.parse_single_example(record, keys_to_features)
    m = tf.cast(parsed['m'],tf.int32)
    n = tf.cast(parsed['n'],tf.int32)
    fil_shape = tf.stack([10,m,n])
    fil = tf.decode_raw(parsed['fil'],tf.float32)
    fil = tf.reshape(fil,fil_shape)
    return (fil,m,n)

#For writing and reading from the TFRecord
filename = "test.tfrecord"

if __name__ == "__main__":
    #Create the TFRecordWriter
    data_writer = tf.python_io.TFRecordWriter(filename)

    #Create some fake data
    files = []
    i_vals = np.random.randint(20,size=10)
    j_vals = np.random.randint(20,size=10)
    print(i_vals)
    print(j_vals)
    for x in range(5):
        files.append(np.random.rand(10,i_vals[x],j_vals[x]))

    #Serialize the fake data and record it as a TFRecord using the TFRecordWriter
    for fil in files:
        f,m,n = fil.shape
        fil_raw = fil.tostring()
        print("fil.shape: ",fil.shape)
        example = tf.train.Example(
            features = tf.train.Features(
                feature = {
                    'fil' : tf.train.Feature(bytes_list=tf.train.BytesList(value=[fil_raw])),
                    'm' : tf.train.Feature(int64_list=tf.train.Int64List(value=[m])),
                    'n' : tf.train.Feature(int64_list=tf.train.Int64List(value=[n]))
                }
            )
        )
        data_writer.write(example.SerializeToString())
    data_writer.close()

    #Deserialize and report on the fake data
    sess = tf.Session()
    dataset = tf.data.TFRecordDataset([filename])
    dataset = dataset.map(record_parser)
    iterator = dataset.make_initializable_iterator()
    next_element = iterator.get_next()
    sess.run(iterator.initializer)
    while True:
        try:
            #Evaluate the next record and unpack the resulting numpy values
            fil,m,n = sess.run(next_element)
            print("fil.shape: ",fil.shape)
            print("M: ",m)
            print("N: ",n)
        except tf.errors.OutOfRangeError:
            break

The error is thrown in the map() function:
MacBot$ python test.py
/Users/MacBot/anaconda/envs/tflow/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
[ 2 12 17 18 19 15 11 5 0 12]
[13 5 3 5 2 6 5 11 12 10]
fil.shape: (10, 2, 13)
fil.shape: (10, 12, 5)
fil.shape: (10, 17, 3)
fil.shape: (10, 18, 5)
fil.shape: (10, 19, 2)
2018-04-03 09:01:18.382870: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-03 09:01:18.420114: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at iterator_ops.cc:870 : Invalid argument: Input to reshape is a tensor with 520 values, but the requested shape has 260
     [[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32](DecodeRaw, stack)]]
Traceback (most recent call last):
  File "/Users/MacBot/anaconda/envs/tflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/Users/MacBot/anaconda/envs/tflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/Users/MacBot/anaconda/envs/tflow/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 520 values, but the requested shape has 260
     [[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32](DecodeRaw, stack)]]
     [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[10,?,?], [], []], output_types=[DT_FLOAT, DT_INT32, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
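Does anyone have any insight into this problem? Any help is much appreciated! It may be worth noting that the data always seems to be twice as large as I expect...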
Answer 0 (score: 1)
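You appear to be writing the results of np.random.rand. However, that function returns float64 values, while you tell TensorFlow to interpret the bytes as float32. This is a mismatch, and it would explain why there are twice as many values as expected: each value was written with 8 bytes but is read back in 4-byte chunks. In your log, the first array has 10 * 2 * 13 = 260 values, yet the reshape receives 520.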
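The arithmetic can be checked without TensorFlow. The following is a minimal sketch that uses np.frombuffer as a stand-in for tf.decode_raw (an assumption made for illustration; it is not the code path TensorFlow itself runs), with an array of the same shape as the first one in the log:

```python
import numpy as np

# Stand-in for the first array in the question's log: shape (10, 2, 13).
arr = np.random.rand(10, 2, 13)      # np.random.rand returns float64
raw = arr.tobytes()                  # raw bytes, as written into the record

expected_values = arr.size           # 10 * 2 * 13 = 260
bytes_written = len(raw)             # 260 values * 8 bytes each = 2080

# Interpreting the same bytes as 4-byte floats yields twice as many values
# as the stored shape (10, m, n) can hold.
as_float32 = np.frombuffer(raw, dtype=np.float32)

print(arr.dtype, expected_values, bytes_written, as_float32.size)
# float64 260 2080 520
```

The 520 and 260 printed here match the two numbers in the InvalidArgumentError above.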
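Try files.append(np.random.rand(10,i_vals[x],j_vals[x]).astype(np.float32)) instead. float32 is advisable for CUDA anyway. In general you need to be careful: numpy uses float64 by default in most places, and its default integer type depends on the platform (int64 on most 64-bit Linux and macOS builds, int32 on Windows before NumPy 2.0).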
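As a rough illustration of the fix, the sketch below casts to float32 before serializing and then decodes with the same dtype. The helpers to_float32_bytes and from_float32_bytes are hypothetical names introduced here, and np.frombuffer followed by reshape only mimics what tf.decode_raw and tf.reshape do in the parser:

```python
import numpy as np

def to_float32_bytes(arr):
    """Hypothetical helper: cast to float32, return (raw_bytes, shape)."""
    arr32 = np.ascontiguousarray(arr, dtype=np.float32)
    return arr32.tobytes(), arr32.shape

def from_float32_bytes(raw, shape):
    """Hypothetical helper: decode float32 bytes and restore the shape."""
    return np.frombuffer(raw, dtype=np.float32).reshape(shape)

original = np.random.rand(10, 12, 5)         # float64 by default
raw, shape = to_float32_bytes(original)      # 4 bytes per value after the cast
restored = from_float32_bytes(raw, shape)

print(len(raw) == original.size * 4)         # byte count now matches float32
print(restored.shape, restored.dtype)
# The cast loses some precision, so compare with a tolerance.
print(np.allclose(restored, original, atol=1e-6))
```

If the extra precision matters, the alternative is to leave the data as float64 and decode it with tf.float64 in the parser; what matters is that the writer and the reader agree on the dtype.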