I am new to Google Cloud Machine Learning Engine,
and I am trying to train a Keras-based deep-learning image classifier in gcloud.
To provision a GPU on gcloud, I added 'tensorflow-gpu'
to the install_requires list in my setup.py.
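For context, the setup.py fragment in question might look like the sketch below; the package name and version are placeholders, and only the install_requires line matters here:

```python
from setuptools import find_packages, setup

# Hypothetical package metadata for illustration.
setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    # Pulls in the GPU build of TensorFlow on the training workers.
    install_requires=['tensorflow-gpu'],
)
```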
My cloud-gpu.yaml is the following:
trainingInput:
  scaleTier: BASIC_GPU
  runtimeVersion: "1.0"
In my code I added:
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
at the beginning, and
with tf.device('/gpu:0'):
before any Keras code.
As can be seen from the following output of the actual cloud training run, gcloud recognizes the GPU but does not use it:
INFO 2018-11-18 12:19:59 -0600 master-replica-0 Epoch 1/20
INFO 2018-11-18 12:20:56 -0600 master-replica-0 1/219 [..............................] - ETA: 4:17:12 - loss: 0.8846 - acc: 0.5053 - f1_measure: 0.1043
INFO 2018-11-18 12:21:57 -0600 master-replica-0 2/219 [..............................] - ETA: 3:51:32 - loss: 0.8767 - acc: 0.5018 - f1_measure: 0.1013
INFO 2018-11-18 12:22:59 -0600 master-replica-0 3/219 [..............................] - ETA: 3:46:49 - loss: 0.8634 - acc: 0.5039 - f1_measure: 0.1010
INFO 2018-11-18 12:23:58 -0600 master-replica-0 4/219 [..............................] - ETA: 3:44:59 - loss: 0.8525 - acc: 0.5045 - f1_measure: 0.0991
INFO 2018-11-18 12:24:48 -0600 master-replica-0 5/219 [..............................] - ETA: 3:41:17 - loss: 0.8434 - acc: 0.5031 - f1_measure: 0.0992
INFO 2018-11-18 12:24:48 -0600 master-replica-0 Sun Nov 18 18:24:48 2018
INFO 2018-11-18 12:24:48 -0600 master-replica-0 +-----------------------------------------------------------------------------+
INFO 2018-11-18 12:24:48 -0600 master-replica-0 | NVIDIA-SMI 396.26 Driver Version: 396.26 |
INFO 2018-11-18 12:24:48 -0600 master-replica-0 |-------------------------------+----------------------+----------------------+
INFO 2018-11-18 12:24:48 -0600 master-replica-0 | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
INFO 2018-11-18 12:24:48 -0600 master-replica-0 | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
INFO 2018-11-18 12:24:48 -0600 master-replica-0 |===============================+======================+======================|
INFO 2018-11-18 12:24:48 -0600 master-replica-0 | 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
INFO 2018-11-18 12:24:48 -0600 master-replica-0 | N/A 32C P0 56W / 149W | 10955MiB / 11441MiB | 0% Default |
INFO 2018-11-18 12:24:48 -0600 master-replica-0 +-------------------------------+----------------------+----------------------+
INFO 2018-11-18 12:24:48 -0600 master-replica-0
INFO 2018-11-18 12:24:48 -0600 master-replica-0 +-----------------------------------------------------------------------------+
INFO 2018-11-18 12:24:48 -0600 master-replica-0 | Processes: GPU Memory |
INFO 2018-11-18 12:24:48 -0600 master-replica-0 | GPU PID Type Process name Usage |
INFO 2018-11-18 12:24:48 -0600 master-replica-0 |=============================================================================|
INFO 2018-11-18 12:24:48 -0600 master-replica-0 +-----------------------------------------------------------------------------+
During training, the GPU utilization stays essentially at 0%. How is that possible?
Answer 0 (score: 0)
I suggest using standard_gpu,
which should be the same as an n1-standard-8
with one K80 GPU, in your cloud-gpu.yaml:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.5"
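For reference, a job using this config file could be submitted with something like the following; the job name, bucket, package path, and region are placeholders:

```shell
gcloud ml-engine jobs submit training my_job \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir gs://my-bucket/jobdir \
    --region us-central1 \
    --config cloud-gpu.yaml
```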
Also, change this:
with tf.device('/gpu:0'):
to:
with tf.device('/device:GPU:0'):
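The difference is only the spelling of the device name: '/device:GPU:0' is the canonical form that TensorFlow prints in its device-placement logs. As an illustration, here is a hypothetical helper (not part of TensorFlow's public API) that normalizes the short form to the canonical one:

```python
def canonical_device(name: str) -> str:
    """Normalize a short device string such as '/gpu:0' to the
    canonical '/device:GPU:0' form used in TensorFlow's
    device-placement logs. Hypothetical helper for illustration."""
    dev_type, _, index = name.strip('/').partition(':')
    return '/device:{}:{}'.format(dev_type.upper(), index or '0')

print(canonical_device('/gpu:0'))  # /device:GPU:0
print(canonical_device('/cpu:0'))  # /device:CPU:0
```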
I suggest you take a look at this Hurken's paradox for a better example.