gcloud ML Engine - Keras not running on GPU

Date: 2018-11-18 03:12:26

Tags: python tensorflow keras google-cloud-platform deep-learning

I am new to Google Cloud Machine Learning Engine, and I am trying to train a Keras-based image-classification deep-learning algorithm on gcloud. To configure a GPU on gcloud, I added 'tensorflow-gpu' to install_requires in my setup.py. My cloud-gpu.yaml is the following:

trainingInput:
  scaleTier: BASIC_GPU
  runtimeVersion: "1.0"

In my code I added:

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

at the beginning, and

with tf.device('/gpu:0'):

before any Keras code.

As the results show, gcloud recognizes the GPU but does not use it.

Log output from the actual cloud training run:

INFO    2018-11-18 12:19:59 -0600   master-replica-0        Epoch 1/20
INFO    2018-11-18 12:20:56 -0600   master-replica-0          1/219 [..............................] - ETA: 4:17:12 - loss: 0.8846 - acc: 0.5053 - f1_measure: 0.1043
INFO    2018-11-18 12:21:57 -0600   master-replica-0          2/219 [..............................] - ETA: 3:51:32 - loss: 0.8767 - acc: 0.5018 - f1_measure: 0.1013
INFO    2018-11-18 12:22:59 -0600   master-replica-0          3/219 [..............................] - ETA: 3:46:49 - loss: 0.8634 - acc: 0.5039 - f1_measure: 0.1010
INFO    2018-11-18 12:23:58 -0600   master-replica-0          4/219 [..............................] - ETA: 3:44:59 - loss: 0.8525 - acc: 0.5045 - f1_measure: 0.0991
INFO    2018-11-18 12:24:48 -0600   master-replica-0          5/219 [..............................] - ETA: 3:41:17 - loss: 0.8434 - acc: 0.5031 - f1_measure: 0.0992Sun Nov 18 18:24:48 2018       
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |-------------------------------+----------------------+----------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |===============================+======================+======================|
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | N/A   32C    P0    56W / 149W |  10955MiB / 11441MiB |      0%      Default |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-------------------------------+----------------------+----------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0                                                                                       
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | Processes:                                                       GPU Memory |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |  GPU       PID   Type   Process name                             Usage      |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |=============================================================================|
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+
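The nvidia-smi table interleaved in the log above can be produced from inside the training job by shelling out to `nvidia-smi` on a timer; a minimal sketch of one way to do that (the function name and the 60-second interval are assumptions, not from the question):

```python
# Sketch: periodically dump `nvidia-smi` to stdout so GPU utilization
# shows up in the Cloud ML Engine job logs.
import subprocess
import threading

def log_gpu_utilization(interval_seconds=60):
    """Print `nvidia-smi` output now, then again every `interval_seconds`."""
    def _poll():
        try:
            print(subprocess.check_output(['nvidia-smi']).decode())
        except (OSError, subprocess.CalledProcessError):
            print('nvidia-smi is not available on this machine')
        timer = threading.Timer(interval_seconds, _poll)
        timer.daemon = True  # do not keep the process alive just for polling
        timer.start()

    _poll()
```

Calling `log_gpu_utilization()` once at the start of training is enough; the daemon timer re-arms itself.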

GPU utilization stays at essentially 0% throughout training. How is this possible?

1 answer:

Answer 0 (score: 0)

I suggest using standard_gpu in your cloud-gpu.yaml; it should be equivalent to an n1-standard-8 machine with one K80 GPU:

trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.5"

Also, in cnn_with_keras.py, change this:

with tf.device('/gpu:0'):

to:

with tf.device('/device:GPU:0'):

For a better example, I suggest you take a look at this Hurken's paradox.
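As an extra sanity check (a sketch, not part of the original answer), you can list the devices TensorFlow actually sees. If '/device:GPU:0' is missing from the list, the CPU-only `tensorflow` package was installed on the worker instead of `tensorflow-gpu`:

```python
# Sketch: list the devices visible to TensorFlow (works with the TF 1.x
# runtime used here). On a correctly configured GPU worker the output
# should include '/device:GPU:0'.
from tensorflow.python.client import device_lib

device_names = [d.name for d in device_lib.list_local_devices()]
print(device_names)
print('GPU visible to TensorFlow:', any('GPU' in n for n in device_names))
```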