I want to run a Keras implementation of the YOLO algorithm (object detection). The code I am using mostly comes from here.
I am trying to train my model with the Open Image Dataset V4 samples provided by Google. The problem is that when I try to train the model, I get the following warnings and exception:
W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 831.81MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 380.25MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 84.50MiB. Current allocation summary follows.
...
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8,64,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node conv2d_3/Conv2D}} = Conv2D[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/conv2d_3/Conv2D_grad/Conv2DBackpropInput"], data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](leaky_re_lu_2/LeakyRelu, conv2d_3/Conv2D/ReadVariableOp)]]
(I am using the tensorflow-gpu lib here, but I got a similar error with the non-GPU tensorflow as well.)
At first I thought it was because of the size of my dataset (200,000 images => ~60GB), but I get exactly the same error when running the code with a minimal sample (500 images => ~150MB). So I suppose there is something wrong with my code.
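A quick back-of-the-envelope check supports this: the 84.50MiB the allocator fails on is exactly the size of a single float32 tensor with the shape[8,64,208,208] from the exception, so what runs out is GPU memory for one batch of activations, not anything proportional to the dataset:

# Size of one float32 activation tensor of shape [8, 64, 208, 208] (from the exception above)
print(8 * 64 * 208 * 208 * 4 / 2**20)  # -> 84.5 (MiB), matching '84.50MiB' in the log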
Here is a minimal example of the problematic part (I think):
def _main():
    input_shape = [416, 416]
    model = ###  # Create YOLO model
    anchors = ###  # Collection of 9 anchors
    num_classes = 601
    train_data = ###  # A collection of the form [PathToImage, X1, X2, Y1, Y2, class], where the X, Y values define the bounding box
    valid_data = ###  # A collection of the form [PathToImage, X1, X2, Y1, Y2, class], where the X, Y values define the bounding box
    batch_size = 8

    model.fit_generator(data_generator(train_data, batch_size, input_shape, anchors, num_classes),
                        steps_per_epoch=max(1, len(train_data)//batch_size),
                        validation_data=data_generator(valid_data, batch_size, input_shape, anchors, num_classes),
                        validation_steps=max(1, len(valid_data)//batch_size),
                        epochs=50,
                        initial_epoch=0)

    # Unfreeze and continue training, to fine-tune.
    for i in range(len(model.layers)):
        model.layers[i].trainable = True
    model.compile(optimizer=Adam(lr=1e-4), loss={'yolo_loss': lambda y_true, y_pred: y_pred})  # recompile to apply the change
    print('Unfreeze all of the layers.')
    print('Train on {} samples, val on {} samples, with batch size {}.'.format(len(train_data), len(valid_data), batch_size))

    model.fit_generator(data_generator(train_data, batch_size, input_shape, anchors, num_classes),
                        steps_per_epoch=max(1, len(train_data)//batch_size),
                        validation_data=data_generator(valid_data, batch_size, input_shape, anchors, num_classes),
                        validation_steps=max(1, len(valid_data)//batch_size),
                        epochs=100,
                        initial_epoch=50)
def data_generator(lines, batch_size, input_shape, anchors, num_classes):
    '''data generator for fit_generator'''
    n = len(lines)
    i = 0
    while True:
        image_data = []
        box_data = []
        for b in range(batch_size):
            if i == 0:
                np.random.shuffle(lines)
            image, box = get_data(lines[i], input_shape)  # Retrieve the image from the path and return it with the bounding box (the object class is in the box object)
            image_data.append(image)
            box_data.append(box)
            i = (i + 1) % n
        image_data = np.array(image_data)
        box_data = np.array(box_data)
        y_true = preprocess_true_boxes(box_data, input_shape, anchors, num_classes)  # For each box, find the best anchor
        yield [image_data, *y_true], np.zeros(batch_size)
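For reference, one batch from the generator can be inspected like this; the expected shapes are my assumption based on input_shape = [416,416] and batch_size = 8, not verified output:

# Pull a single batch to check shapes
gen = data_generator(train_data, batch_size, input_shape, anchors, num_classes)
[image_batch, *y_true_batch], dummy_loss_target = next(gen)
print(image_batch.shape)        # expected (8, 416, 416, 3)
print(dummy_loss_target.shape)  # (8,), the dummy target for the custom yolo_loss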
The exception is raised during the second call to fit_generator().
Following an answer on a similar question, I added gpu_options.allow_growth to the TensorFlow session:
K.clear_session() # get a new session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
K.set_session(sess)
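(For completeness, the same ConfigProto also exposes a hard memory cap instead of growth; I have not tried it, and the 0.8 below is an arbitrary value:)

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.8  # arbitrary cap, untried
K.set_session(tf.Session(config=config))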
But it did not solve the problem.
So I am a bit stuck here. What am I doing wrong?
Note: