Why does Keras training slow down after a while?

Asked: 2021-02-11 15:17:14

Tags: python tensorflow keras deep-learning

I've run into a problem where my model's training slows down significantly.

Here is what happens:


Epoch 00001: val_loss did not improve from 0.03340
Run 27 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 156us/step - loss: 0.0420 - binary_accuracy: 0.9459 - accuracy: 0.9848 - val_loss: 0.0362 - val_binary_accuracy: 0.9501 - val_accuracy: 0.9876

Epoch 00001: val_loss did not improve from 0.03340
Run 28 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 2s 150us/step - loss: 0.0422 - binary_accuracy: 0.9431 - accuracy: 0.9851 - val_loss: 0.0395 - val_binary_accuracy: 0.9418 - val_accuracy: 0.9863

Epoch 00001: val_loss did not improve from 0.03340
Run 29 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 6s 474us/step - loss: 0.0454 - binary_accuracy: 0.9479 - accuracy: 0.9833 - val_loss: 0.0395 - val_binary_accuracy: 0.9475 - val_accuracy: 0.9856

Epoch 00001: val_loss did not improve from 0.03340
Run 30 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 701us/step - loss: 0.0462 - binary_accuracy: 0.9406 - accuracy: 0.9830 - val_loss: 0.0339 - val_binary_accuracy: 0.9502 - val_accuracy: 0.9882

Epoch 00001: val_loss did not improve from 0.03340
Run 31 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 646us/step - loss: 0.0457 - binary_accuracy: 0.9462 - accuracy: 0.9836 - val_loss: 0.0375 - val_binary_accuracy: 0.9417 - val_accuracy: 0.9861

Epoch 00001: val_loss did not improve from 0.03340
Run 32 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 640us/step - loss: 0.0471 - binary_accuracy: 0.9313 - accuracy: 0.9827 - val_loss: 0.0373 - val_binary_accuracy: 0.9446 - val_accuracy: 0.9868

Epoch 00001: val_loss did not improve from 0.03340
Run 33 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 669us/step - loss: 0.0423 - binary_accuracy: 0.9458 - accuracy: 0.9852 - val_loss: 0.0356 - val_binary_accuracy: 0.9510 - val_accuracy: 0.9873

Epoch 00001: val_loss did not improve from 0.03340
Run 34 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 8s 648us/step - loss: 0.0441 - binary_accuracy: 0.9419 - accuracy: 0.9841 - val_loss: 0.0407 - val_binary_accuracy: 0.9357 - val_accuracy: 0.9849

Epoch 00001: val_loss did not improve from 0.03340
Run 35 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 9s 713us/step - loss: 0.0460 - binary_accuracy: 0.9473 - accuracy: 0.9829 - val_loss: 0.0423 - val_binary_accuracy: 0.9604 - val_accuracy: 0.9840

Epoch 00001: val_loss did not improve from 0.03340
Run 36 of 40 | Epoch 61 of 100
(15000, 4410) (15000, 12)
Train on 12000 samples, validate on 3000 samples
Epoch 1/1
12000/12000 [==============================] - 7s 557us/step - loss: 0.0508 - binary_accuracy: 0.9530 - accuracy: 0.9810 - val_loss: 0.0470 - val_binary_accuracy: 0.9323 - val_accuracy: 0.9820

My GPU usage did not decrease (it actually increased):

[image: GPU usage]

My CPU usage, CPU clock, and GPU clocks (core and memory) all stay the same. My RAM usage also stays roughly the same.

The only odd part is that my overall power draw dropped (percentages in the image):

[image: power draw, in percent]

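I read somewhere that this is caused by the beta_1 parameter of the Adam optimizer and that setting it to 0.99 should fix it, but the problem persists.

Is there anything else that could cause this? It looks like a computation-side issue, since there is no sign of a hardware or driver problem.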

1 Answer:

Answer 0 (score: 0)

In case anyone else runs into this problem, here is a list of things that may help:

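  1. Try setting beta_1 to 0.99 in the Adam optimizer.
  2. If you call model.fit() multiple times, adding K.clear_session() after fit() may also help (make sure you have from keras import backend as K).
  3. Put this after your imports (if using TensorFlow 2.0 or later):

     # Allocate GPU memory on demand instead of reserving it all up front
     config = tf.compat.v1.ConfigProto()
     config.gpu_options.allow_growth = True
     sess = tf.compat.v1.Session(config=config)

  4. If you have a file open (after calling open()), make sure you close it (or better yet, use a with block).
  5. Make sure nothing else that might use the GPU is running in the background (games, heavy websites, etc.).
  6. Check your page file usage. You may be running out of RAM, and the page file is much slower than RAM. Running del VARIABLE may help. In the worst case, you will have to load smaller chunks of data or reduce the model size.
  7. Try setting the GPU to maximum performance in the NVIDIA Control Panel.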
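
The beta_1 and K.clear_session() suggestions from the list can be combined roughly as follows. This is only a sketch: the network in build_model, the helper train_runs, and the layer sizes are made up for illustration (the question does not show the actual model), and it assumes tf.keras with a fresh model built for every run. If the same model is trained across several fit() calls, clearing the session between them may not be appropriate.

```python
import numpy as np
import tensorflow as tf


def build_model(n_features, n_outputs):
    # Hypothetical small network; the original post does not show its architecture.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_outputs, activation="sigmoid"),
    ])
    # Adam with beta_1 raised from its default of 0.9 to 0.99.
    optimizer = tf.keras.optimizers.Adam(beta_1=0.99)
    model.compile(optimizer=optimizer, loss="binary_crossentropy")
    return model


def train_runs(x, y, n_runs):
    """Train a fresh model n_runs times and return the final loss of each run."""
    losses = []
    for _ in range(n_runs):
        model = build_model(x.shape[1], y.shape[1])
        history = model.fit(x, y, epochs=1, batch_size=32, verbose=0)
        losses.append(history.history["loss"][0])
        # Drop the finished model and reset Keras' global state so that
        # objects from earlier runs do not accumulate.
        del model
        tf.keras.backend.clear_session()
    return losses
```

Calling train_runs(x, y, 40) would correspond to 40 independent runs; whether this removes the slowdown depends on what is actually accumulating between runs.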

If anyone has other ideas about what might fix this kind of problem, feel free to comment and I will edit this answer.
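
To check whether the slowdown lines up with growing memory use (the page-file point in the list), one option is to record the wall-clock time and Python heap size of every run. The helper below is a minimal sketch using only the standard library; profile_runs and run_once are hypothetical names, and tracemalloc only sees allocations made through Python's allocator, so GPU memory and native TensorFlow buffers are not counted.

```python
import gc
import time
import tracemalloc


def profile_runs(run_once, n_runs):
    """Call run_once() n_runs times; return (run index, seconds, traced bytes) per run."""
    tracemalloc.start()
    stats = []
    for i in range(n_runs):
        start = time.perf_counter()
        run_once()
        elapsed = time.perf_counter() - start
        # Collect garbage first so only memory that is still referenced is counted.
        gc.collect()
        current_bytes, _peak_bytes = tracemalloc.get_traced_memory()
        stats.append((i, elapsed, current_bytes))
    tracemalloc.stop()
    return stats
```

If the traced bytes climb steadily while the per-run time also climbs, something is being kept alive between runs (for example a list of results or an unclosed file); if memory stays flat, the cause is more likely elsewhere.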
