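I am training a model on GCP using the available TensorFlow Debian image. As recommended, I have put my training data (about 50 files in TFRecord format, roughly 250 MB each) in a storage bucket in the same region as the trainer instance. I access the files through TensorFlow's native GCS support, passing the bucket path instead of a local folder (i.e. using tf.gfile.FastGFile, etc.).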
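I have noticed that training sometimes slows down severely, by a factor of 4 or more. Operations such as filling the shuffle buffer and writing/reading checkpoints (which also go to a GCS bucket) take a very long time: as the output below shows, filling the shuffle buffer with 20k items took nearly four minutes, whereas in other runs it takes a few seconds. The GPU is definitely not the bottleneck: during a "slow run", nvidia-smi reports the GPU as idle 75% of the time.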
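Oddly, this seems to be related to saving/loading checkpoints. If I start a run from scratch without loading any checkpoint from GCS, I get a "fast run", but as soon as the program reloads from a checkpoint (e.g. to run an eval), it becomes slow again. Starting a fresh run that loads from a checkpoint also results in a slow run.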
Here is an example of a fast run (from scratch):
2019-02-24 05:53:32.813397: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 13102 of 20000
2019-02-24 05:53:35.439741: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:136] Shuffle buffer filled.
INFO:tensorflow:loss = 7.331788, step = 0
INFO:tensorflow:global_step/sec: 0.188288
INFO:tensorflow:loss = 7.0379214, step = 20 (105.878 sec)
INFO:tensorflow:global_step/sec: 2.9655
INFO:tensorflow:loss = 7.1371107, step = 40 (6.744 sec)
INFO:tensorflow:global_step/sec: 3.10018
INFO:tensorflow:loss = 6.97763, step = 60 (6.451 sec)
INFO:tensorflow:global_step/sec: 3.06168
INFO:tensorflow:loss = 6.624346, step = 80 (6.833 sec)
INFO:tensorflow:global_step/sec: 2.89621
INFO:tensorflow:loss = 6.374439, step = 100 (6.605 sec)
INFO:tensorflow:global_step/sec: 3.07356
INFO:tensorflow:loss = 6.3277745, step = 120 (6.506 sec)
INFO:tensorflow:global_step/sec: 3.05334
INFO:tensorflow:loss = 5.9941797, step = 140 (6.550 sec)
Here is the same setup when loading from a checkpoint:
2019-02-24 01:24:17.863009: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 1475 of 20000
2019-02-24 01:24:27.410920: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 3036 of 20000
2019-02-24 01:24:37.407969: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 4567 of 20000
2019-02-24 01:24:47.512008: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 5966 of 20000
2019-02-24 01:24:57.390448: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 7442 of 20000
2019-02-24 01:25:07.596700: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 8931 of 20000
2019-02-24 01:25:17.642448: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 10145 of 20000
2019-02-24 01:25:27.360374: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 11292 of 20000
2019-02-24 01:25:37.341567: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 12754 of 20000
2019-02-24 01:25:47.510659: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 14295 of 20000
2019-02-24 01:25:57.387689: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 15723 of 20000
2019-02-24 01:26:07.198808: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 16919 of 20000
2019-02-24 01:26:17.202806: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 18120 of 20000
2019-02-24 01:26:27.385366: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 19506 of 20000
2019-02-24 01:26:31.774001: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:136] Shuffle buffer filled.
INFO:tensorflow:loss = 1.099082, step = 5000
INFO:tensorflow:global_step/sec: 0.160334
INFO:tensorflow:loss = 0.87060547, step = 5020 (125.102 sec)
INFO:tensorflow:global_step/sec: 0.877509
INFO:tensorflow:loss = 1.0365343, step = 5040 (22.429 sec)
INFO:tensorflow:global_step/sec: 0.890049
INFO:tensorflow:loss = 0.9461464, step = 5060 (22.471 sec)
INFO:tensorflow:global_step/sec: 0.880202
INFO:tensorflow:loss = 1.0375876, step = 5080 (22.722 sec)
INFO:tensorflow:global_step/sec: 0.948469
INFO:tensorflow:loss = 0.9984442, step = 5100 (21.087 sec)
INFO:tensorflow:global_step/sec: 0.953351
INFO:tensorflow:loss = 0.6918698, step = 5120 (20.978 sec)
INFO:tensorflow:global_step/sec: 0.819123
INFO:tensorflow:loss = 1.0685252, step = 5140 (24.416 sec)
INFO:tensorflow:global_step/sec: 0.905911
INFO:tensorflow:loss = 0.90230775, step = 5160 (22.077 sec)
INFO:tensorflow:global_step/sec: 0.822452
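
To put numbers on the difference between the two runs, the logs can be summarized with a small script. The following is only a sketch in plain Python (no TensorFlow needed): the helper names `steady_step_rate` and `shuffle_fill_rate` are made up for illustration, the sample strings are copied from the excerpts above, and skipping the first `global_step/sec` reading as warm-up is an assumption on my part.

```python
import re
from datetime import datetime
from statistics import mean

# Patterns for the two kinds of log lines shown in the excerpts.
STEP_RATE = re.compile(r"global_step/sec: ([0-9.]+)")
FILL = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r".*Filling up shuffle buffer.*?: (\d+) of (\d+)"
)


def steady_step_rate(log_text, skip=1):
    """Mean global_step/sec, ignoring the first `skip` readings (assumed warm-up)."""
    rates = [float(r) for r in STEP_RATE.findall(log_text)]
    return mean(rates[skip:])


def shuffle_fill_rate(log_text):
    """Items per second between the first and last 'Filling up' lines."""
    points = []
    for line in log_text.splitlines():
        m = FILL.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            points.append((ts, int(m.group(2))))
    if len(points) < 2:
        raise ValueError("need at least two 'Filling up shuffle buffer' lines")
    (t0, n0), (t1, n1) = points[0], points[-1]
    return (n1 - n0) / (t1 - t0).total_seconds()


# global_step/sec readings copied from the fast run (from scratch).
FAST_LOG = """\
INFO:tensorflow:global_step/sec: 0.188288
INFO:tensorflow:global_step/sec: 2.9655
INFO:tensorflow:global_step/sec: 3.10018
INFO:tensorflow:global_step/sec: 3.06168
INFO:tensorflow:global_step/sec: 2.89621
INFO:tensorflow:global_step/sec: 3.07356
INFO:tensorflow:global_step/sec: 3.05334
"""

# First and last fill lines plus the readings from the slow run (from checkpoint).
SLOW_LOG = """\
2019-02-24 01:24:17.863009: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 1475 of 20000
2019-02-24 01:26:27.385366: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 19506 of 20000
INFO:tensorflow:global_step/sec: 0.160334
INFO:tensorflow:global_step/sec: 0.877509
INFO:tensorflow:global_step/sec: 0.890049
INFO:tensorflow:global_step/sec: 0.880202
INFO:tensorflow:global_step/sec: 0.948469
INFO:tensorflow:global_step/sec: 0.953351
INFO:tensorflow:global_step/sec: 0.819123
INFO:tensorflow:global_step/sec: 0.905911
INFO:tensorflow:global_step/sec: 0.822452
"""

fast = steady_step_rate(FAST_LOG)
slow = steady_step_rate(SLOW_LOG)
slowdown = fast / slow
fill_rate = shuffle_fill_rate(SLOW_LOG)

print(f"fast run: {fast:.2f} steps/sec")
print(f"slow run: {slow:.2f} steps/sec")
print(f"slowdown: {slowdown:.2f}x")
print(f"slow-run shuffle fill: {fill_rate:.0f} items/sec")
```

On these two excerpts the steady-state slowdown comes out at roughly 3.4x, and the shuffle buffer in the slow run fills at about 140 items per second. Running the same script on full logs from several runs should show whether the slowdown is consistent whenever a checkpoint is loaded.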