Question

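I'm trying to train some sample code on a GCP VM instance with a K80. I was able to install tensorflow and all the CUDA dependencies (driver, cuDNN, etc.), and I also followed steps such as adding LD_LIBRARY_PATH to the end of .bashrc: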

LD_LIBRARY_PATH="LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}/usr/local/cuda/extras/CUPTI/lib64"
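A line in .bashrc only helps if the value actually reaches the process that runs the training script. As written, the line above has no `export` and its value begins with the literal text `LD_LIBRARY_PATH=`, so it may be worth checking what Python really inherits. The following is a minimal diagnostic sketch, not a fix; the helper name `dirs_containing` is made up for illustration, and the library name is taken from the error log in this post.

```python
import os

def dirs_containing(lib_name, search_path):
    """Return the directories in a colon-separated search path that contain lib_name."""
    hits = []
    for entry in search_path.split(":"):
        # Skip empty entries and keep only directories where the file exists.
        if entry and os.path.isfile(os.path.join(entry, lib_name)):
            hits.append(entry)
    return hits

if __name__ == "__main__":
    # What this Python process inherited, which can differ from what .bashrc says.
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("LD_LIBRARY_PATH seen by Python: %r" % current)
    print("directories with libcupti.so.9.0: %s"
          % dirs_containing("libcupti.so.9.0", current))
```

Running this from the same shell that launches train.py shows whether the variable is empty in the child process, and whether any listed directory actually holds the file.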

I was able to verify the installation by running the following (results not included here, but it sees the GPU and everything looks fine):

python (& python3)
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

However, when I try to start training, I hit the error below. The GPU spins up, but a little later an LD_LIBRARY_PATH error appears and the program aborts.

python train.py --data_dir dataset/
WARNING:tensorflow:From /home/minsoo/source/e2e_train/model.py:169: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2018-08-09 20:44:46.979762: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-09 20:44:47.893303: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-09 20:44:47.893862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.08GiB
2018-08-09 20:44:47.894074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-09 20:45:05.025264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-09 20:45:05.025418: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-08-09 20:45:05.025491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
2018-08-09 20:45:05.041153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10740 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
('Restore model from', u'./saved_model/model.ckpt')
2018-08-09 20:45:20.876092: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
2018-08-09 20:45:22.492806: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
2018-08-09 20:45:22.627640: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
2018-08-09 20:45:22.760862: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
2018-08-09 20:45:22.894173: W tensorflow/core/framework/allocator.cc:108] Allocation of 79200000 exceeds 10% of system memory.
('Step', 0, 'train_loss: ', 0.12469736, 'validation_loss: ', 0.26866225515890224)
2018-08-09 20:48:04.227934: I tensorflow/stream_executor/dso_loader.cc:141] Couldn't open CUDA library libcupti.so.9.0. LD_LIBRARY_PATH: 
2018-08-09 20:48:04.228138: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Failed precondition: could not dlopen DSO: libcupti.so.9.0; dlerror: libcupti.so.9.0: cannot open shared object file: No such file or directory
Aborted (core dumped)
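The fatal line in the log is a failed dlopen of libcupti.so.9.0. On Linux, `ctypes.CDLL` also goes through dlopen, so a rough way to reproduce the lookup outside TensorFlow is the sketch below. It is only an approximation of what TensorFlow's loader does, and `can_dlopen` is a hypothetical helper name.

```python
import ctypes

def can_dlopen(lib_name):
    """Return True if the dynamic loader can open lib_name from this process."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the shared object cannot be found or loaded.
        return False

if __name__ == "__main__":
    print("libcupti.so.9.0 loadable: %s" % can_dlopen("libcupti.so.9.0"))
```

If this prints False in the same shell that runs train.py, the problem is likely in the loader's search path rather than in the training script itself.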

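I'm a bit stumped, and any troubleshooting suggestions would be greatly appreciated. I've already had to wipe and reinstall once, and I'm still not sure why there is no preconfigured TensorFlow instance for GCP VMs.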

Thanks

"LD_LIBRARY_PATH is empty" error persists even though .bashrc contains the path

0 answers: