I'm seeing strange issues when trying to use tf.data() to generate data in batches with the Keras API. It keeps throwing errors telling me it has run out of training_data.
TensorFlow 2.1
import numpy as np
import nibabel
import tensorflow as tf
from tensorflow.keras.layers import Conv3D, MaxPooling3D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras import Model
import os
import random
"""Configure GPUs to prevent OOM errors"""
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
"""Retrieve file names"""
ad_files = os.listdir("/home/asdf/OASIS/3D/ad/")
cn_files = os.listdir("/home/asdf/OASIS/3D/cn/")
sub_id_ad = []
sub_id_cn = []
"""OASIS AD: 178 Subjects, 278 3T MRIs"""
"""OASIS CN: 588 Subjects, 1640 3T MRIs"""
"""Down-sampling CN to 278 MRIs"""
random.Random(129).shuffle(ad_files)
random.Random(129).shuffle(cn_files)
"""Split files for training"""
ad_train = ad_files[0:276]
cn_train = cn_files[0:276]
"""Shuffle Train data and Train labels"""
train = ad_train + cn_train
labels = np.concatenate((np.ones(len(ad_train)), np.zeros(len(cn_train))), axis=None)
random.Random(129).shuffle(train)
random.Random(129).shuffle(labels)
print(len(train))
print(len(labels))
"""Change working directory to OASIS/3D/all/"""
os.chdir("/home/asdf/OASIS/3D/all/")
"""Create tf data pipeline"""
def load_image(file, label):
    nifti = np.asarray(nibabel.load(file.numpy().decode('utf-8')).get_fdata())
    xs, ys, zs = np.where(nifti != 0)
    nifti = nifti[min(xs):max(xs) + 1, min(ys):max(ys) + 1, min(zs):max(zs) + 1]
    nifti = nifti[0:100, 0:100, 0:100]
    nifti = np.reshape(nifti, (100, 100, 100, 1))
    nifti = tf.convert_to_tensor(nifti, np.float64)
    return nifti, label
@tf.autograph.experimental.do_not_convert
def load_image_wrapper(file, labels):
    return tf.py_function(load_image, [file, labels], [tf.float64, tf.float64])
dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.shuffle(6, 129)
dataset = dataset.repeat(50)
dataset = dataset.map(load_image_wrapper, num_parallel_calls=6)
dataset = dataset.batch(6)
dataset = dataset.prefetch(buffer_size=1)
iterator = iter(dataset)
batch_images, batch_labels = iterator.get_next()
########################################################################################
with tf.device("/cpu:0"):
with tf.device("/gpu:0"):
model = tf.keras.Sequential()
model.add(Conv3D(64,
input_shape=(100, 100, 100, 1),
data_format='channels_last',
kernel_size=(7, 7, 7),
strides=(2, 2, 2),
padding='valid',
activation='relu'))
with tf.device("/gpu:1"):
model.add(Conv3D(64,
kernel_size=(3, 3, 3),
padding='valid',
activation='relu'))
with tf.device("/gpu:2"):
model.add(Conv3D(128,
kernel_size=(3, 3, 3),
padding='valid',
activation='relu'))
model.add(MaxPooling3D(pool_size=(2, 2, 2),
padding='valid'))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss=tf.keras.losses.binary_crossentropy,
optimizer=tf.keras.optimizers.Adagrad(0.01),
metrics=['accuracy'])
########################################################################################
model.fit(batch_images, batch_labels, steps_per_epoch=92, epochs=50)
After creating the dataset, I shuffle it and add a repeat parameter set to the number of epochs, 50 in this case.
This works, but it crashes after the third epoch, and I can't figure out what I'm doing wrong here. Am I supposed to declare the repeat and shuffle statements at the top of the pipeline?
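For reference, here is the arithmetic I would expect (just my own sanity check, assuming the 552 training files implied by the 276 + 276 split above). On paper the counts line up exactly, which is what makes the crash so confusing:

num_files = 552          # 276 AD + 276 CN, per the split above
batch_size = 6
epochs = 50
steps_per_epoch = num_files // batch_size                # 92, as passed to model.fit()
batches_available = (num_files * epochs) // batch_size   # 4600 with repeat(50)
batches_needed = steps_per_epoch * epochs                # 4600 requested by Keras
print(steps_per_epoch, batches_available, batches_needed)
# If batches_needed ever exceeds batches_available, Keras aborts with
# "Out of range: End of sequence", which is what happens below anyway.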
Here is the error:
Epoch 3/50
92/6 [============================================================================================================================================================================================================================================================================================================================================================================================================================================================================] - 3s 36ms/sample - loss: 0.1902 - accuracy: 0.8043
Epoch 4/50
5/6 [========================>.....] - ETA: 0s - loss: 0.2216 - accuracy: 0.80002020-03-06 15:18:17.804126: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[BiasAddGrad_3/_54]]
2020-03-06 15:18:17.804137: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[sequential/conv3d_3/Conv3D/ReadVariableOp/_21]]
2020-03-06 15:18:17.804140: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[Conv3DBackpropFilterV2_3/_68]]
2020-03-06 15:18:17.804263: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[sequential/dense/MatMul/ReadVariableOp/_30]]
2020-03-06 15:18:17.804364: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
[[BiasAddGrad_5/_62]]
2020-03-06 15:18:17.804561: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
[[{{node IteratorGetNext}}]]
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 4600 batches). You may need to use the repeat() f24/6 [========================================================================================================================] - 1s 36ms/sample - loss: 0.1673 - accuracy: 0.8750
Traceback (most recent call last):
File "python_scripts/gpu_farm/tf_data_generator/3D_tf_data_generator.py", line 181, in <module>
evaluation_ad = model.evaluate(ad_test, ad_test_labels, verbose=0)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 930, in evaluate
use_multiprocessing=use_multiprocessing)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 490, in evaluate
use_multiprocessing=use_multiprocessing, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 426, in _model_iteration
use_multiprocessing=use_multiprocessing)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 646, in _process_inputs
x, y, sample_weight=sample_weights)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2383, in _standardize_user_data
batch_size=batch_size)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2489, in _standardize_tensors
y, self._feed_loss_fns, feed_output_shapes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 810, in check_loss_and_target_compatibility
' while using as loss `' + loss_name + '`. '
ValueError: A target array with shape (5, 2) was passed for an output of shape (None, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output.
Update:
So, it turns out that when using tf.data() with model.fit(), the data should be supplied as model.fit(x=data, y=labels). That gets rid of the list out of index error.
Now I'm back to the original error.
It looks like this might be a TensorFlow issue, though:
https://github.com/tensorflow/tensorflow/issues/32
When I increase the batch size from 6 to a larger number and reduce steps_per_epoch accordingly, it gets through more epochs without throwing the StartAbort: Out of range error.
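Roughly, the trade-off looks like this (an illustration only, again assuming 552 training files):

num_files = 552
for batch_size in (6, 12):
    print(batch_size, num_files // batch_size)
# batch_size 6  -> 92 steps per epoch (the original setting)
# batch_size 12 -> 46 steps per epoch (the value used in Update 2 below)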
Update 2:
As per @jkjung13's suggestion, model.fit() takes a single argument when used with a dataset, i.e. model.fit(x=batch), and that is the correct implementation.
However, when you use only the x argument, you are supposed to pass the dataset itself to model.fit(), not an iterable object.
So it should be: model.fit(dataset, epochs=50, steps_per_epoch=46, validation_data=(v, v_labels))
With that, I get a new error: GitHub Issue
To get around it for now, I'm converting the dataset with as_numpy_iterator():
model.fit(dataset.as_numpy_iterator(), epochs=50, steps_per_epoch=46, validation_data=(v, v_labels))
This solves the problem, but the performance is abysmal, much like the old Keras model.fit_generator without multiprocessing. So it defeats the whole purpose of tf.data.
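For completeness, here is a minimal sketch of the "pass the dataset itself" variant I was aiming for (the shuffle/map/batch/repeat order and the unbounded repeat() are my choices rather than exactly what I ran, and v / v_labels are the validation arrays mentioned above, not defined here):

dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.shuffle(len(train), seed=129)
dataset = dataset.map(load_image_wrapper, num_parallel_calls=6)
dataset = dataset.batch(12)
dataset = dataset.repeat()      # unbounded, so the iterator never runs dry mid-training
dataset = dataset.prefetch(1)
model.fit(dataset, epochs=50, steps_per_epoch=46, validation_data=(v, v_labels))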
Answer 0: (score: 0)
TF 2.1
This is now working with the following parameters:
def load_image(file, label):
    nifti = np.asarray(nibabel.load(file.numpy().decode('utf-8')).get_fdata()).astype(np.float32)
    xs, ys, zs = np.where(nifti != 0)
    nifti = nifti[min(xs):max(xs) + 1, min(ys):max(ys) + 1, min(zs):max(zs) + 1]
    nifti = nifti[0:100, 0:100, 0:100]
    nifti = np.reshape(nifti, (100, 100, 100, 1))
    return nifti, label
@tf.autograph.experimental.do_not_convert
def load_image_wrapper(file, label):
    return tf.py_function(load_image, [file, label], [tf.float64, tf.float64])
dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.map(load_image_wrapper, num_parallel_calls=32)
dataset = dataset.prefetch(buffer_size=1)
dataset = dataset.apply(tf.data.experimental.prefetch_to_device('/device:GPU:0', 1))
# So, my dataset size is 522, i.e. 522 MRI images.
# I need to load the entire dataset as a batch.
# This should exceed 60GiBs of RAM, but it doesn't go over 12GiB of RAM.
# I'm not sure how tf.data batch() stores the data, maybe a custom file?
# And also add a repeat parameter to iterate with each epoch.
dataset = dataset.batch(522, drop_remainder=True).repeat()
# Now initialise an iterator
iterator = iter(dataset)
# Create two objects, x & y, from batch
batch_image, batch_label = iterator.get_next()
##################################################################################
with tf.device("/cpu:0"):
with tf.device("/gpu:0"):
model = tf.keras.Sequential()
model.add(Conv3D(64,
input_shape=(100, 100, 100, 1),
data_format='channels_last',
kernel_size=(7, 7, 7),
strides=(2, 2, 2),
padding='valid',
activation='relu'))
with tf.device("/gpu:1"):
model.add(Conv3D(64,
kernel_size=(3, 3, 3),
padding='valid',
activation='relu'))
with tf.device("/gpu:2"):
model.add(Conv3D(128,
kernel_size=(3, 3, 3),
padding='valid',
activation='relu'))
model.add(MaxPooling3D(pool_size=(2, 2, 2),
padding='valid'))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.7))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss=tf.keras.losses.binary_crossentropy,
optimizer=tf.keras.optimizers.Adagrad(0.01),
metrics=['accuracy'])
##################################################################################
# Now supply x=batch_image, y= batch_label to Keras' model.fit()
# And finally, supply your batch_size here!
model.fit(batch_image, batch_label, epochs=100, batch_size=12)
##################################################################################
With this, it takes about 8 minutes before training starts. But once training starts, I see incredible speeds!
Epoch 30/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.3526 - accuracy: 0.8640
Epoch 31/100
522/522 [==============================] - 15s 28ms/sample - loss: 0.3334 - accuracy: 0.8448
Epoch 32/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.3308 - accuracy: 0.8697
Epoch 33/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2936 - accuracy: 0.8755
Epoch 34/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2935 - accuracy: 0.8851
Epoch 35/100
522/522 [==============================] - 14s 28ms/sample - loss: 0.3157 - accuracy: 0.8889
Epoch 36/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.2910 - accuracy: 0.8851
Epoch 37/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2810 - accuracy: 0.8697
Epoch 38/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2536 - accuracy: 0.8966
Epoch 39/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.2506 - accuracy: 0.9004
Epoch 40/100
522/522 [==============================] - 15s 28ms/sample - loss: 0.2353 - accuracy: 0.8927
Epoch 41/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2336 - accuracy: 0.9042
Epoch 42/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2243 - accuracy: 0.9234
Epoch 43/100
522/522 [==============================] - 15s 29ms/sample - loss: 0.2181 - accuracy: 0.9176
15 seconds per epoch, compared to 12 minutes per epoch before!
I'll do some further testing to see whether this actually works and how it affects my test data. If anything is wrong, I'll come back and update this post.
Why does this work? I have no idea. I couldn't find anything about it in the Keras documentation.
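My best guess (an assumption on my part, not something the docs confirm) is that iterator.get_next() on a batch(522) dataset hands back two eager tensors that already contain every decoded volume, so model.fit() treats them like in-memory arrays and does its own mini-batching with batch_size=12, while the expensive nibabel decoding happens only once up front:

# Assumption: the fit() call above behaves roughly like fitting on in-memory arrays.
x = batch_image.numpy()   # shape (522, 100, 100, 100, 1), all volumes decoded once
y = batch_label.numpy()   # shape (522,)
model.fit(x, y, epochs=100, batch_size=12)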