I am trying to train some simple ML models on a g2.2xlarge EC2 instance, but I am not sure the GPU support is actually working. I suspect it is not, because training times are very similar to those on my sluggish laptop.
I installed TensorFlow with GPU support following these official guidelines; below is the output of a few commands.
Running nvidia-smi in a shell returns:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 On | 00000000:00:03.0 Off | N/A |
| N/A 29C P8 17W / 125W | 0MiB / 4037MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Running pip list gives:
...
jupyterlab (0.31.5)
jupyterlab-launcher (0.10.2)
Keras (2.2.2)
Keras-Applications (1.0.4)
Keras-Preprocessing (1.0.2)
kiwisolver (1.0.1)
...
tensorboard (1.10.0)
tensorflow (1.10.0)
tensorflow-gpu (1.10.0)
...
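Note that both tensorflow and tensorflow-gpu appear in this listing. Having both installed in the same environment is a known source of trouble, since the CPU-only package can shadow the GPU build. A minimal stdlib-only sketch of a check for this condition, assuming `pip list` lines shaped like the output above (the helper name is mine, not a real tool):

```python
def conflicting_tf_packages(pip_list_lines):
    """Return True when both the CPU-only and the GPU TensorFlow
    packages show up in lines of `pip list` output such as
    'tensorflow (1.10.0)'."""
    names = {line.split()[0].lower() for line in pip_list_lines if line.strip()}
    return "tensorflow" in names and "tensorflow-gpu" in names

listing = ["tensorboard (1.10.0)", "tensorflow (1.10.0)", "tensorflow-gpu (1.10.0)"]
print(conflicting_tf_packages(listing))  # True for the listing shown above
```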
Running conda list
gives very similar output. The Python version is Python 3.6.4 |Anaconda.
Some other hopefully useful output:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
2018-08-11 16:42:54.942052: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-11 16:42:54.943269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-08-11 16:42:54.943309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:42:54.943337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:42:54.943355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:42:54.943371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
[]
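The decisive line in the log above is the "Ignoring visible gpu device" message: the prebuilt TensorFlow 1.10 binaries require CUDA compute capability 3.5 or higher, while the GRID K520 in a g2.2xlarge only provides 3.0, so TensorFlow silently falls back to the CPU. A stdlib-only sketch of the comparison TensorFlow is effectively making (the parsing helper is illustrative, not TensorFlow's actual code):

```python
import re

# Minimum CUDA compute capability for the prebuilt TF 1.10 wheels,
# as stated in the log message above.
MIN_CUDA_CAPABILITY = (3, 5)

def capability_ok(log_line, minimum=MIN_CUDA_CAPABILITY):
    """Parse 'compute capability: X.Y' from a TF log line and compare
    it against the minimum required capability. Returns None when the
    line carries no capability information."""
    m = re.search(r"compute capability: (\d+)\.(\d+)", log_line)
    if m is None:
        return None
    return (int(m.group(1)), int(m.group(2))) >= minimum

line = ("Ignoring visible gpu device (device: 0, name: GRID K520, "
        "pci bus id: 0000:00:03.0, compute capability: 3.0)")
print(capability_ok(line))  # False: the K520's 3.0 is below the 3.5 minimum
```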
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
2018-08-11 16:44:03.560954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:44:03.561015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:44:03.561035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:44:03.561052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15704826459248001252
]
import tensorflow as tf
tf.test.is_gpu_available()
2018-08-11 16:45:22.049670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:45:22.049748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:45:22.049782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:45:22.049814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
False
Can you confirm that Keras is not running on the GPU? Do you have any suggestions on how to fix this once and for all?
Thanks
EDIT:
I tried a p2.xlarge EC2 instance instead, but the problem does not seem to be solved. Here are a few outputs:
>>> from keras import backend as K
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
2018-08-11 21:54:24.238022: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-08-11 21:54:24.247402: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-08-11 21:54:24.247430: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (ip-172-31-2-145): /proc/driver/nvidia/version does not exist
[]
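The failure mode on the p2.xlarge is different: the cuda_diagnostics line reports that /proc/driver/nvidia/version does not exist, meaning the NVIDIA kernel driver is not loaded (or not visible) in the environment this interpreter runs in. A tiny sketch of that presence check, assuming a POSIX system (the function name is mine):

```python
import os

def nvidia_driver_visible(proc_path="/proc/driver/nvidia/version"):
    """A loaded NVIDIA kernel driver exposes its version under /proc;
    TensorFlow's cuda_diagnostics complains when this file is missing,
    which is exactly the error shown above."""
    return os.path.exists(proc_path)

print(nvidia_driver_visible())
```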
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 80385595218229545
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6898783310276970136
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 4859092998934769352
physical_device_desc: "device: XLA_CPU device"
]
Answer 0 (score: 0):
I solved the problem by doing the following: I ran
conda env list
to see the list of available environments, and then used
source activate tensorflow_p36
to activate the environment I needed. That last step may be what I had never realized I needed to do in my previous tests.
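One way to see why activating the right environment matters: which TensorFlow build you import depends entirely on which environment's interpreter is on PATH. A stdlib-only sketch that reports the environment an interpreter prefix belongs to (the helper name is mine; the example path mirrors the anaconda3/envs/tensorflow_p36 prefix visible in the tracebacks):

```python
import sys

def active_env_name(prefix=None):
    """Infer the conda environment name from an interpreter prefix,
    i.e. the last path component of e.g. .../anaconda3/envs/tensorflow_p36.
    Defaults to the currently running interpreter's prefix."""
    prefix = prefix or sys.prefix
    return prefix.rstrip("/").rsplit("/", 1)[-1]

print(active_env_name("/home/ubuntu/anaconda3/envs/tensorflow_p36"))  # tensorflow_p36
```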
After that, everything worked as expected:
>>> from keras import backend as K
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
['/job:localhost/replica:0/task:0/device:GPU:0']
Also, running nvidia-smi
during model training shows the GPU being used, as does nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER.
In my case, training a single epoch went from 42 s to 13 s.