Keras: Exception when training a YOLO model: OOM when allocating tensor

Time: 2019-04-18 11:34:13

Tags: python tensorflow keras yolo

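I want to run an implementation of the YOLO algorithm (object detection) with Keras. The code I am using comes mainly from here

I am trying to train my model on the Open Image Dataset V4 samples provided by Google. The problem is that when I try to train the model, I get the following warnings and exception: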

W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 831.81MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 380.25MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 84.50MiB.  Current allocation summary follows.
...
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8,64,208,208] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node conv2d_3/Conv2D}} = Conv2D[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/conv2d_3/Conv2D_grad/Conv2DBackpropInput"], data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](leaky_re_lu_2/LeakyRelu, conv2d_3/Conv2D/ReadVariableOp)]]

(Here I am using the tensorflow-GPU lib, but I get a similar error when using the non-GPU tensorflow.)
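
The size in the last warning can be checked against the tensor shape in the error message. As a rough sketch, assuming 4 bytes per element for type float (float32), the hypothetical helper tensor_mib below computes the size of a single tensor of that shape:

```python
from functools import reduce
from operator import mul

def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of one dense tensor (hypothetical helper; 4 bytes assumes float32)."""
    n_elements = reduce(mul, shape, 1)
    return n_elements * bytes_per_element / 2**20

# Shape copied from the ResourceExhaustedError message above.
oom_shape = [8, 64, 208, 208]
size_mib = tensor_mib(oom_shape)
print("{:.2f} MiB".format(size_mib))  # 84.50 MiB
```

This agrees with the 84.50MiB in the third warning. It is the size of one tensor only, and since the first dimension equals the batch size of 8, the figure scales linearly with batch_size (for example, tensor_mib([2, 64, 208, 208]) gives 21.125).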

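At first I thought it was because of the size of my dataset (200,000 images => ~60GB), but when I run the code with a minimal sample (500 images => ~150MB), I get exactly the same error. So I guess something is wrong with my code.

Here is a minimal example of the problematic part (I think):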

import numpy as np
from keras.optimizers import Adam

def _main():

    input_shape = [416,416]
    model = ...       # Create YOLO model
    anchors = ...     # Collection of 9 anchors
    num_classes = 601
    train_data = ...  # A collection of the form [PathToImage, X1,X2,Y1,Y2, class], where the X,Y values define the bounding box
    valid_data = ...  # A collection of the form [PathToImage, X1,X2,Y1,Y2, class], where the X,Y values define the bounding box
    batch_size = 8

    model.fit_generator(data_generator(train_data, batch_size, input_shape, anchors, num_classes),
            steps_per_epoch=max(1, len(train_data)//batch_size),
            validation_data=data_generator(valid_data, batch_size, input_shape, anchors, num_classes),
            validation_steps=max(1, len(valid_data)//batch_size),
            epochs=50,
            initial_epoch=0)

    # Unfreeze and continue training, to fine-tune.
    for i in range(len(model.layers)):
        model.layers[i].trainable = True
    model.compile(optimizer=Adam(lr=1e-4), loss={'yolo_loss': lambda y_true, y_pred: y_pred}) # recompile to apply the change
    print('Unfreeze all of the layers.')

    print('Train on {} samples, val on {} samples, with batch size {}.'.format(len(train_data), len(valid_data), batch_size))
    model.fit_generator(data_generator(train_data, batch_size, input_shape, anchors, num_classes),
        steps_per_epoch=max(1, len(train_data)//batch_size),
        validation_data=data_generator(valid_data, batch_size, input_shape, anchors, num_classes),
        validation_steps=max(1, len(valid_data)//batch_size),
        epochs=100,
        initial_epoch=50)

def data_generator(lines, batch_size, input_shape, anchors, num_classes):
    '''data generator for fit_generator'''
    n = len(lines)
    i = 0
    while True:
        image_data = []
        box_data = []
        for b in range(batch_size):
            if i==0:
                np.random.shuffle(lines)
            image, box = get_data(lines[i], input_shape) # Retrieve the image from path and return it with the bounding box (the object class is in box object)
            image_data.append(image)
            box_data.append(box)
            i = (i+1) % n
        image_data = np.array(image_data)
        box_data = np.array(box_data)
        y_true = preprocess_true_boxes(box_data, input_shape, anchors, num_classes) # For each box, find the best anchor
        yield [image_data, *y_true], np.zeros(batch_size)
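
Because fit_generator pulls one batch at a time from data_generator, the number of images held by the generator per step is bounded by batch_size, not by the size of the dataset. The stripped-down sketch below illustrates that pattern with standard-library code only; load_sample is a hypothetical stand-in for get_data, and the 500 file paths are made up for illustration.

```python
import itertools
import random

def load_sample(line):
    """Hypothetical stand-in for get_data(): returns a fake (image, box) pair."""
    return "image<{}>".format(line), "box<{}>".format(line)

def batch_generator(lines, batch_size):
    """Yield batches forever, reshuffling at the start of every pass over the data."""
    n = len(lines)
    i = 0
    while True:
        images, boxes = [], []
        for _ in range(batch_size):
            if i == 0:
                random.shuffle(lines)
            image, box = load_sample(lines[i])
            images.append(image)
            boxes.append(box)
            i = (i + 1) % n
        yield images, boxes

lines = ["img_{}.jpg".format(k) for k in range(500)]  # made-up paths
batch_size = 8
steps_per_epoch = max(1, len(lines) // batch_size)  # 500 // 8 = 62

# Take only the first three batches; the generator itself never terminates.
first_batches = list(itertools.islice(batch_generator(lines, batch_size), 3))
print(steps_per_epoch, [len(images) for images, _ in first_batches])  # 62 [8, 8, 8]
```

With this structure, going from 500 to 200,000 images changes steps_per_epoch but not the size of any single batch, which is consistent with the error appearing for both dataset sizes.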

The OOM exception is raised on the second call to fit_generator().

Following an answer on similar question, I added the gpu_options allow_growth setting to the TensorFlow session:

K.clear_session() # get a new session

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

K.set_session(sess)

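But it did not solve the problem.

So I am a bit stuck here. What am I doing wrong?

Notes:

  • I have a Quadro P1000 GPU with 20GB of GPU memory (according to the Windows Task Manager)
  • I have 32GB of RAM
  • I did not change the model architecture; you can find it here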
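
Windows Task Manager can report dedicated and shared GPU memory as one total, so it may be worth checking how much dedicated memory the card has. As a hedged sketch, assuming an NVIDIA driver with nvidia-smi on the PATH, the dedicated total and current usage could be read as shown below. The function names are hypothetical, and the sample string is made up to show the CSV format, not a real measurement.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"]

def parse_memory_csv(text):
    """Parse 'total, used' lines (values in MiB) into a list of (total, used) tuples."""
    rows = []
    for line in text.strip().splitlines():
        total, used = (int(field.strip()) for field in line.split(","))
        rows.append((total, used))
    return rows

def query_gpu_memory():
    """Run nvidia-smi (assumed to be installed) and return one tuple per GPU."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_memory_csv(output)

# Made-up sample in the format requested above, not real measurements.
sample = "8192, 1024\n"
print(parse_memory_csv(sample))  # [(8192, 1024)]
```

Calling query_gpu_memory() on the training machine would return the actual values, one (total, used) pair in MiB per GPU.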

0 answers:

No answers