TensorFlow automatic mixed precision (FP16) slower than FP32 on the official ResNet model

Asked: 2019-06-04 16:36:37

Tags: tensorflow half-precision-float

I'm trying out the automatic mixed precision (AMP) support included in tensorflow-gpu==1.14.0rc0, using the official ResNet model benchmark from https://github.com/tensorflow/models/blob/master/official/resnet/estimator_benchmark.py#L191. I'm running on a 2080 Ti, driver 410.78, CUDA 10, Ubuntu.
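For context, this is how I understand AMP is typically turned on in TF 1.14 — a minimal sketch, assuming the NVIDIA-documented `TF_ENABLE_AUTO_MIXED_PRECISION` environment variable (the benchmark invocation in the comment is illustrative, not the exact command I ran):

```shell
# Enable the auto_mixed_precision grappler rewrite globally (TF >= 1.14).
# This is what produces the "Running auto_mixed_precision graph optimizer"
# lines in the logs below.
export TF_ENABLE_AUTO_MIXED_PRECISION=1

# Then launch the benchmark as usual, e.g.:
# python estimator_benchmark.py <flags...>

# Confirm the variable is set in the environment the benchmark will inherit.
echo "TF_ENABLE_AUTO_MIXED_PRECISION=$TF_ENABLE_AUTO_MIXED_PRECISION"
```

The same rewrite can also be enabled per-optimizer in code via `tf.train.experimental.enable_mixed_precision_graph_rewrite`, which additionally applies loss scaling.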

I made the following changes to keep the comparison fast and fair:

  • Reduced the number of epochs to 10.
  • Removed the 2× larger batch size for the tweaked run, so that everything trains on the same number of samples.
  • Set checkpointing to happen only once, after training completes.
  • Switched to training on CIFAR-10, since I had already downloaded it to local disk.

I see this in the logs, which suggests to me that AMP is active:

2019-06-03 16:08:40.976829: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-06-03 16:08:40.977057: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:40.985402: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:40.986858: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:40.987745: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:40.996781: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:41.001948: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:41.003208: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:41.004589: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:41.005981: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-06-03 16:08:41.511761: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-06-03 16:08:41.527751: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1723] Converted 529/2910 nodes to float16 precision using 3 cast(s) to float16 (excluding Const and Variable casts)

But the actual runs are slower:

[Plot: the fp32 (cyan) runtime is lower than all of the fp16 runs.]

What can I do to improve performance?

0 answers:

No answers yet.