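I'm trying to train some sample code on a GCP VM instance with a K80. I was able to install tensorflow and all the CUDA dependencies (driver, cuDNN, etc.), and I also followed steps such as adding LD_LIBRARY_PATH to the end of .bashrc: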
LD_LIBRARY_PATH="LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/local/cuda/extras/CUPTI/lib64"
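One way to check whether this setting actually reaches the Python process is a small sketch like the one below. The helper name `cupti_dir_on_path` is hypothetical (it is not part of TensorFlow or CUDA), and the directory is simply the one from the line above. It only inspects the environment of the running process, so a variable that is set in the shell but not exported would show up as empty here.

```python
import os

# Directory taken from the .bashrc line above; adjust if CUDA lives elsewhere.
CUPTI_DIR = "/usr/local/cuda/extras/CUPTI/lib64"


def cupti_dir_on_path(env=None, cupti_dir=CUPTI_DIR):
    """Return True if cupti_dir is one of the entries in LD_LIBRARY_PATH.

    `env` defaults to the environment of the current process (os.environ).
    """
    env = os.environ if env is None else env
    # Split on ':' and drop empty entries left by leading/trailing colons.
    entries = [p for p in env.get("LD_LIBRARY_PATH", "").split(":") if p]
    wanted = os.path.normpath(cupti_dir)
    return wanted in [os.path.normpath(p) for p in entries]


if __name__ == "__main__":
    print("LD_LIBRARY_PATH =", repr(os.environ.get("LD_LIBRARY_PATH", "")))
    print("CUPTI dir on path:", cupti_dir_on_path())
```

Running this from the same shell that launches train.py should show what the training process inherits. An entry that begins with a literal `LD_LIBRARY_PATH=` prefix would not count as a match.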
I was able to verify the installation by running the following (output not included here, but it sees the GPU and everything looks fine):
$ python    (same result with python3)
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
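However, when I try to start training, I get the error below. The GPU starts spinning up, but a while later an LD_LIBRARY_PATH error pops up and the program aborts.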
python train.py --data_dir dataset/
WARNING:tensorflow:From /home/minsoo/source/e2e_train/model.py:169: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2018-08-09 20:44:46.979762: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-09 20:44:47.893303: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-09 20:44:47.893862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.08GiB
2018-08-09 20:44:47.894074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-09 20:45:05.025264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-09 20:45:05.025418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-09 20:45:05.025491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2018-08-09 20:45:05.041153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10740 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
('Restore model from', u'./saved_model/model.ckpt')
2018-08-09 20:45:20.876092: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
2018-08-09 20:45:22.492806: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
2018-08-09 20:45:22.627640: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
2018-08-09 20:45:22.760862: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
2018-08-09 20:45:22.894173: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
('Step', 0, 'train_loss: ', 0.12469736, 'validation_loss: ', 0.26866225515890224)
2018-08-09 20:48:04.227934: I tensorflow/stream_executor/dso_loader.cc:141] Couldn't open CUDA library libcupti.so.9.0. LD_LIBRARY_PATH:
2018-08-09 20:48:04.228138: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Failed precondition: could not dlopen DSO: libcupti.so.9.0; dlerror: libcupti.so.9.0: cannot open shared object file: No such file or directory
Aborted (core dumped)
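To narrow down whether libcupti.so.9.0 is missing from disk or just not on the loader's search path, a check along these lines might help. This is only a sketch: `find_cupti` and `can_dlopen` are hypothetical helper names, and the directories searched are assumptions based on a default CUDA install under /usr/local/cuda. `ctypes.CDLL` goes through dlopen on Linux, so it should roughly mirror the lookup that fails in the log above.

```python
import ctypes
import glob
import os


def find_cupti(search_dirs, pattern="libcupti.so*"):
    """Return every file matching `pattern` in the given directories."""
    found = []
    for directory in search_dirs:
        found.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return found


def can_dlopen(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed locations for a default CUDA install; adjust as needed.
    dirs = ["/usr/local/cuda/extras/CUPTI/lib64", "/usr/local/cuda/lib64"]
    print("CUPTI files on disk:", find_cupti(dirs))
    print("loader can open libcupti.so.9.0:", can_dlopen("libcupti.so.9.0"))
```

If the file shows up on disk but the loader cannot open it, that would point at the search path rather than at the CUDA install itself.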
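I'm a bit stumped, and any troubleshooting suggestions would be much appreciated. I've already had to wipe and reinstall once, and I'm still not sure why there isn't a preconfigured tensorflow instance for GCP VMs.
Thanks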