Question

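I am using tensorflow-gpu 2.0, CUDA 10.0 and Keras 2.1.6.

I am currently trying to train a Keras ResNet-56 model on the ImageNet dataset. I could do this quite easily on CIFAR-10, but ImageNet is proving much more troublesome.

So I have a question about training on ImageNet with Keras.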

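A reliable source in my lab told me that, with my GPUs, I should be able to train a ResNet-56 model in 6 to 10 days (keep in mind that this person did it a year ago), which means one epoch should take roughly 1 to 2 hours. Here is my problem: one epoch takes anywhere between 25 and 33 hours, and each training step takes 8 seconds (screenshot: 8 s per step).

When I check GPU usage with nvidia-smi, I notice that the GPUs tend to "spike" every 8 seconds: they run for 1 second out of every 8, and that's it. Three nvidia-smi screenshots taken one second apart show this: first none of the GPUs are in use, then all of them are in use, then none of them are in use again.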
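The relationship between step time and epoch time quoted above can be checked with a few lines of arithmetic. This is a minimal sketch under two assumptions that are not stated in the post itself: the training set is the standard ILSVRC-2012 split with 1,281,167 images, and the global batch size is 16 per replica times 8 GPUs, as in the script further down. The function names are made up for illustration.

```python
import math

# Assumption: standard ILSVRC-2012 training split (not stated in the post).
TRAIN_IMAGES = 1_281_167
# Assumption: 16 images per replica x 8 GPUs, taken from the script below.
GLOBAL_BATCH = 16 * 8


def steps_per_epoch(num_images=TRAIN_IMAGES, batch=GLOBAL_BATCH):
    """Number of optimizer steps needed to see every image once."""
    return math.ceil(num_images / batch)


def epoch_hours(seconds_per_step, steps=None):
    """Wall-clock hours for one epoch at a given step time."""
    steps = steps_per_epoch() if steps is None else steps
    return steps * seconds_per_step / 3600.0


def step_seconds_for(target_epoch_hours, steps=None):
    """Step time required to finish one epoch in the target time."""
    steps = steps_per_epoch() if steps is None else steps
    return target_epoch_hours * 3600.0 / steps


if __name__ == "__main__":
    print("steps per epoch:", steps_per_epoch())
    print("hours per epoch at 8 s/step:", round(epoch_hours(8.0), 1))
    print("step time for a 1 h epoch:", round(step_seconds_for(1.0), 2))
    print("step time for a 2 h epoch:", round(step_seconds_for(2.0), 2))
```

Under these assumptions, 8 s per step works out to roughly 22 hours per epoch, which is the same order as the 25 to 33 hours observed, while a 1 to 2 hour epoch would need a step time of roughly 0.36 to 0.72 s.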

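I made sure to follow the Keras guidelines on how to allocate GPUs, so I am not sure why this happens. After reading several websites, my current best guess is that the time is being wasted (hence the GPUs sitting idle) on loading data with the CPU. So for 7 s the script would use the CPU to load and possibly preprocess the data, then use the GPUs for 1 s to train the model, and so on.

That is my guess, so I tried to implement the following helper from TensorFlow: https://www.tensorflow.org/guide/data_performance#optimize_performance but I am not sure whether it will help. I have also included part of my script, since it may be the cause: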

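import tensorflow as tf
# Assumption: the tf.keras version of ImageDataGenerator is used.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# flow_from_directory expects a single path string, not a list.
training_dir = 'path/to/imagenet/training/set/'
validation_dir = 'path/to/imagenet/validation/set/'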

# Mirror the model across all 8 GPUs.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1", "/gpu:2", "/gpu:3",
             "/gpu:4", "/gpu:5", "/gpu:6", "/gpu:7"])
with strategy.scope():
    model = get_compiled_model()

train_datagen = ImageDataGenerator(rotation_range=80,
                                   width_shift_range=0.25,
                                   fill_mode='nearest',
                                   rescale = (1.0/255.0),
                                   horizontal_flip=True,
                                   data_format='channels_first')

valid_datagen = ImageDataGenerator(data_format='channels_first',
                                   rescale = (1.0/255.0))

# Wrap the Keras generator in a tf.data.Dataset; batch_size is the global batch.
ds = tf.data.Dataset.from_generator(
    lambda: train_datagen.flow_from_directory(
        training_dir,
        target_size=(224, 224),
        class_mode='categorical',
        batch_size=16 * strategy.num_replicas_in_sync,
        shuffle=True),
    output_types=(tf.float32, tf.float32),
    output_shapes=([None, 3, 224, 224], [None, 1000]))

# Validation uses valid_datagen (rescaling only, no augmentation).
val = tf.data.Dataset.from_generator(
    lambda: valid_datagen.flow_from_directory(
        validation_dir,
        target_size=(224, 224),
        class_mode='categorical',
        batch_size=16 * strategy.num_replicas_in_sync,
        shuffle=False),  # sample order does not matter for validation
    output_types=(tf.float32, tf.float32),
    output_shapes=([None, 3, 224, 224], [None, 1000]))

callbacks = [lr_reducer, lr_scheduler]
history = model.fit(ds,
                    epochs=200,
                    callbacks=callbacks)
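
One way to test the guess that the CPU-side input pipeline is the bottleneck is to time the generator on its own, with no model involved. This is a minimal sketch; `time_batches` is a hypothetical helper, not part of Keras or TensorFlow, and it works with any Python iterator that yields batches.

```python
import time


def time_batches(batches, num_batches=20):
    """Pull `num_batches` items from `batches` and return seconds per batch.

    No model is involved, so the result reflects only data loading
    and preprocessing.
    """
    it = iter(batches)
    next(it)  # warm-up: skip one-off start-up costs such as file indexing
    start = time.perf_counter()
    for _ in range(num_batches):
        next(it)
    return (time.perf_counter() - start) / num_batches
```

For the script above, the call would be something like `time_batches(train_datagen.flow_from_directory(training_dir, target_size=(224, 224), batch_size=128))`. If the result is close to the 8 s observed per training step, the input pipeline rather than the GPUs is most likely what limits throughput.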

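So here are my questions:

What could be causing this inefficient GPU usage? Is my guess correct?
How can I optimize my GPU usage? What is your opinion on this?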
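
The idea behind the prefetching described in the linked tf.data guide can be illustrated without TensorFlow: a background thread prepares the next batches while the consumer works on the current one. This is only a conceptual sketch using the standard library, not how tf.data is implemented; `prefetch`, `slow_batches` and `run` are hypothetical names, and the sleeps stand in for CPU loading and GPU training.

```python
import queue
import threading
import time


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, producing them in a background thread.

    Simplification: an exception raised inside `iterable` is not forwarded
    to the consumer.
    """
    q = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            q.put(item)
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


def slow_batches(n, load_time):
    """Stand-in for CPU-side loading and augmentation."""
    for i in range(n):
        time.sleep(load_time)
        yield i


def run(batches, train_time):
    """Stand-in for the training loop; returns (batches seen, wall time)."""
    start = time.perf_counter()
    seen = []
    for b in batches:
        time.sleep(train_time)  # pretend to train on the batch
        seen.append(b)
    return seen, time.perf_counter() - start


if __name__ == "__main__":
    _, sequential = run(slow_batches(10, 0.05), 0.05)
    _, overlapped = run(prefetch(slow_batches(10, 0.05)), 0.05)
    print("sequential:", round(sequential, 2), "s")
    print("overlapped:", round(overlapped, 2), "s")
```

With overlap, the time per step approaches the larger of the load time and the train time instead of their sum. If the 7 s / 1 s split guessed above is right, prefetching alone would only bring a step from about 8 s to about 7 s, so the loading itself would also need to be parallelized or made cheaper.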

How to optimize Keras GPU usage when training on ImageNet?

0 Answers: