I'm implementing and training a Tiny-DSOD network with TensorFlow + Keras. At the start of the first epoch, training terminates with the error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [7,128,2,2] vs. [7,128,3,3]
The batch size is 8, the image size is (300, 300), and the training dataset is PASCAL VOC 2007 + 2012. The error occurs between one of the outputs of the prediction layers (the head is very similar to SSD) and the loss: [[{{node add_fpn_0_/add}}]] [[{{node loss/add_50}}]]
I'm on TensorFlow 1.13, Keras 2.2.4 and Python 3.6. I have already checked everything in the model itself (the shapes are as expected), inspected the images generated for each batch (every image is as expected), swapped the optimizer (currently Adam, but SGD gives exactly the same problem), and looked at TensorBoard for any hints (everything looks fine right up to the point of termination).
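Training is launched with fit_generator roughly as follows (a sketch: the generator objects, steps_per_epoch and the sample-count names are assumptions for illustration; only the validation_steps expression appears in the traceback and the epoch count in the log below):

import math

# Sketch of the training call; train_generator, val_generator and n_train_samples
# are hypothetical names -- only validation_steps is visible in the traceback.
model.fit_generator(train_generator,
                    steps_per_epoch=math.ceil(n_train_samples / batch_size),
                    epochs=10,
                    validation_data=val_generator,
                    validation_steps=math.ceil(n_val_samples / batch_size))

The full console output up to the crash: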
WARNING:tensorflow:From /home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2019-06-04 15:45:59.614299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-06-04 15:45:59.614330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-04 15:45:59.614337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-06-04 15:45:59.614341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-06-04 15:45:59.614513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2998 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Epoch 1/10
2019-06-04 15:46:28.296307: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.77GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
File "/home/alexandre.pires/PycharmProjects/neural_networks/tiny-dsod.py", line 830, in <module>
validation_steps=math.ceil(n_val_samples/batch_size)
File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1426, in fit_generator
initial_epoch=initial_epoch)
File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 191, in model_iteration
batch_outs = batch_function(*batch_data)
File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1191, in train_on_batch
outputs = self._fit_function(ins) # pylint: disable=not-callable
File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
run_metadata=self.run_metadata)
File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
run_metadata_ptr)
File "/home/alexandre.pires/.conda/envs/neural_network/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [7,128,2,2] vs. [7,128,3,3]
[[{{node add_fpn_0_/add}}]]
[[{{node loss/add_50}}]]
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [8,128,2,2] vs. [8,128,3,3]
[[{{node add_fpn_0_/add}}]]
[[{{node loss/predictions_loss/broadcast_weights/assert_broadcastable/is_valid_shape/has_valid_nonscalar_shape/has_invalid_dims/concat}}]]
One last thing to add: the previous output of the prediction layer does indeed have shape [7,128,2,2], but that has never caused any error before. Any hints on what to debug next, or on where exactly this error comes from?
EDIT: after making some corrections to the model, a new error appeared, but it still involves the same incompatible shapes. This is the standard convolution block:
layer_name = "conv_" + name
output = tf.keras.layers.Conv2D(filters=filter, kernel_size=kernel, padding=pad,
strides=stride, kernel_initializer=self.kernel_initializer,
kernel_regularizer=self.regularize, name=layer_name)(input)
output = tf.keras.layers.BatchNormalization(name=layer_name + "batch_")(output)
output = tf.keras.layers.Activation('relu', name=layer_name + "relu_")(output)
return output
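A sketch of how this helper might be invoked when assembling the backbone (the enclosing method name and the argument values are hypothetical; only the parameter names appear in the snippet above):

# Hypothetical call -- '_conv2d_block' and the argument values are illustrative only.
x = self._conv2d_block(input=x, filter=64, kernel=(3, 3), pad='SAME', stride=1, name="stem_1")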
The depthwise convolution was corrected so that it behaves as expected in the original model (which was built in Caffe):
if stride == 2:
    # Pad explicitly and use 'VALID' so the stride-2 output size matches the Caffe model
    output = tf.keras.layers.ZeroPadding2D(padding=self.correct_pad(input, kernel[0]),
                                           name='zeropad_' + layer_name)(input)
    output = tf.keras.layers.DepthwiseConv2D(kernel_size=kernel, padding='SAME' if stride == 1 else 'VALID',
                                             strides=stride, kernel_initializer=self.kernel_initializer,
                                             kernel_regularizer=self.regularize, name=layer_name)(output)
else:
    output = tf.keras.layers.DepthwiseConv2D(kernel_size=kernel, padding='SAME' if stride == 1 else 'VALID',
                                             strides=stride, kernel_initializer=self.kernel_initializer,
                                             kernel_regularizer=self.regularize, name=layer_name)(input)
if use_batch_norm:
    output = tf.keras.layers.BatchNormalization(center=True, scale=True, trainable=True,
                                                name=layer_name + "batch_")(output)
output = tf.keras.layers.Activation('relu', name=layer_name + "relu_")(output)
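The correct_pad helper referenced above is not shown in the post; a minimal sketch in the spirit of the helper used by Keras' MobileNetV2, assuming it lives in the same class and a channels_last layout, could look like this (the author's actual implementation may differ):

def correct_pad(self, inputs, kernel_size):
    # Asymmetric ((top, bottom), (left, right)) padding so that a stride-2 'VALID'
    # convolution produces the same output size as Caffe-style padding.
    # Assumes channels_last; adapt the indexing if the model runs channels_first.
    input_size = tf.keras.backend.int_shape(inputs)[1:3]
    if isinstance(kernel_size, int):
        kernel_size = (kernel_size, kernel_size)
    if input_size[0] is None:
        adjust = (1, 1)
    else:
        adjust = (1 - input_size[0] % 2, 1 - input_size[1] % 2)
    correct = (kernel_size[0] // 2, kernel_size[1] // 2)
    return ((correct[0] - adjust[0], correct[0]),
            (correct[1] - adjust[1], correct[1]))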
layer_name = "upsample_" + name
output = tf.keras.layers.UpSampling2D(size=(input_shape[0], input_shape[1]), interpolation='bilinear',
name=layer_name)(input)
output = self._depthwise_conv_2d(output, filter=128, kernel=(3, 3), pad='SAME', stride=1, name=layer_name)
return output
Answer 0 (score: 0)
I think the problem is the size of the feature maps inside the network.
Try changing this part:
output = self._depthwise_conv_2d(output, filter=128, kernel=(3, 3), pad='SAME', stride=1, name=layer_name)
to this:
output = self._depthwise_conv_2d(output, filter=128, kernel=(2, 2), pad='SAME', stride=1, name=layer_name)
If you look at the error, one output has 7 elements with 128 filters of size 2x2, while the other branch of the network has 7 elements with 128 filters of size 3x3.
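For illustration, the same clash can be reproduced in isolation (a toy sketch with made-up Input layers, not the author's model; channels_first layout as in the error message):

import tensorflow as tf

# Toy reproduction of the clash at the FPN add (channels_first, as in the error):
a = tf.keras.layers.Input(shape=(128, 2, 2))
b = tf.keras.layers.Input(shape=(128, 3, 3))
try:
    tf.keras.layers.Add(name="add_fpn_0_")([a, b])
except ValueError as err:
    print(err)  # operands could not be broadcast together with shapes (128, 2, 2) (128, 3, 3)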
Let me know if this helps.
Answer 1 (score: 0)
I managed to solve the problem. The issue was in the upsampling layer. The model I based mine on uses bilinear x2 upsampling in Caffe, and the Caffe implementation differs from the one in tensorflow/keras. I built a custom test layer to check this hypothesis, and it fixed the problem. The upsampling layer I use now is this:
def UpSampling2DBilinear(self, stride, **kwargs):
    def layer(x):
        input_shape = tf.keras.backend.int_shape(x)
        # Caffe-style bilinear upsampling: out = stride * (in - 1) + 1
        output_shape = (stride * (input_shape[1] - 1) + 1, stride * (input_shape[2] - 1) + 1)
        # Hard-coded fix-ups for a (300, 300) input so the FPN branches line up
        if output_shape[0] == 9:
            output_shape = (10, 10)
        if output_shape[0] == 37:
            output_shape = (38, 38)
        return tf.image.resize_bilinear(x, output_shape, align_corners=True)
    return tf.keras.layers.Lambda(layer, **kwargs)
Obviously this isn't a final, general custom-layer solution, but for now it works for an input image size of (300, 300).
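A sketch of how this factory could then be wired into the model (the tensor name 'features' and the stride value are illustrative assumptions, not taken from the original code):

# Hypothetical use of the factory above; 'features' and stride=2 are assumptions.
upsample = self.UpSampling2DBilinear(stride=2, name="upsample_fpn_0_")
output = upsample(features)  # bilinear x2 upsampling with Caffe-like output sizes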
So, for anyone who runs into a similar problem in the future, this checklist proved very helpful for debugging:
Incompatible-shape errors in the predictions are usually tied to the model itself, meaning that at some step you built something wrong. Double/triple/quadruple-check every output of every layer of the model (Keras' model.summary() helps a lot here; see the short sketch after this list).
If the model you are implementing was originally built in Caffe (or any other framework different from the one you are using), check the implementation details of every layer. In my case I had to change the depthwise convolution, max pooling and upsampling to reproduce the expected behaviour.
Also make sure the loss function, batch generator, etc. are completely correct, to avoid further problems.
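As a concrete example of the first point, a quick shape audit can be run like this (a minimal sketch on a toy two-branch model; the real check would of course run on the full Tiny-DSOD model):

import tensorflow as tf

# Toy two-branch model, only to illustrate the per-layer shape audit.
inputs = tf.keras.layers.Input(shape=(300, 300, 3))
a = tf.keras.layers.Conv2D(128, 3, strides=2, padding='same', name='branch_a')(inputs)
b = tf.keras.layers.Conv2D(128, 3, strides=2, padding='same', name='branch_b')(inputs)
model = tf.keras.models.Model(inputs, tf.keras.layers.Add()([a, b]))

model.summary()  # one line per layer, including its output shape
for layer in model.layers:
    print(layer.name, layer.output_shape)  # spot-check the spatial dimensions branch by branch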
I hope this helps anyone who hits this kind of error in the future. Thanks to everyone who tried to help me!