Question

I am running the tutorial example for XLA with TensorFlow compiled from source. Running python mnist_softmax_xla.py results in the following error:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
I tensorflow/core/platform/default/cuda_libdevice_path.cc:35] TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): <undefined>, <undefined>
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform CUDA. Devices:
I tensorflow/compiler/xla/service/service.cc:187]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
W tensorflow/core/framework/op_kernel.cc:993] Not found: ./libdevice.compute_35.10.bc not found
Traceback (most recent call last):
  File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: ./libdevice.compute_35.10.bc not found
         [[Node: cluster_0/_0/_1 = _XlaLaunch[Targs=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], Tconstants=[DT_INT32], Tresults=[DT_FLOAT, DT_FLOAT], function=cluster_0[_XlaCompiledKernel=true, _XlaNumConstantArgs=1], _device="/job:localhost/replica:0/task:0/gpu:0"](Shape_2, _recv_Placeholder_0/_3, _recv_Placeholder_1_0/_1, Variable_1, Variable)]]

I have CUDA 8 installed with cuDNN 5.1. The file libdevice.compute_35.10.bc does exist on the machine:

$ find /usr/local/cuda/ -type f | grep libdevice.compute_35.10.bc
/usr/local/cuda/nvvm/libdevice/libdevice.compute_35.10.bc

My hunch is that this is related to the message "TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.", but I don't know what to do about it.

Answer 1

The key is this log message:

I tensorflow/core/platform/default/cuda_libdevice_path.cc:35] TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.

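(I only noticed it in the log afterwards; I actually found the file by digging through the source code, and only then noticed the message in the log.)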

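For reasons I don't currently understand, XLA does not look for libdevice in /usr/local/cuda (or whatever directory you supplied when running ./configure). According to cuda_libdevice_path.cc [1], it looks for a symlink created specifically to point it at libdevice.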

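I'll loop in the person who wrote this code to figure out what it is supposed to do. In the meantime, I was able to work around the problem myself as follows: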

$ mkdir local_config_cuda
$ ln -s /usr/local/cuda local_config_cuda/cuda
$ TEST_SRCDIR=$(pwd) python my_program.py

The important thing is to set TEST_SRCDIR to the parent of the local_config_cuda directory.
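If you would rather script the workaround than type it each time, the same three steps can be sketched in Python. This is only an illustration: the helper names (prepare_test_srcdir, libdevice_reachable, run_with_test_srcdir) are made up for this answer and are not part of TensorFlow, and the nvvm/libdevice subdirectory is assumed from the find output in the question rather than taken from the TensorFlow source.

```python
import os
import subprocess
import sys


def prepare_test_srcdir(parent_dir, cuda_root="/usr/local/cuda"):
    """Create parent_dir/local_config_cuda/cuda as a symlink to cuda_root.

    Returns the absolute path to use as TEST_SRCDIR, i.e. the parent of
    the local_config_cuda directory.
    """
    parent_dir = os.path.abspath(parent_dir)
    config_dir = os.path.join(parent_dir, "local_config_cuda")
    os.makedirs(config_dir, exist_ok=True)
    link = os.path.join(config_dir, "cuda")
    # Leave an existing link (or file) alone so the helper can be re-run.
    if not os.path.lexists(link):
        os.symlink(cuda_root, link)
    return parent_dir


def libdevice_reachable(test_srcdir, name="libdevice.compute_35.10.bc"):
    """Check that the libdevice file is visible through the symlink.

    Assumption: the file sits under nvvm/libdevice inside the CUDA root,
    as in the find output shown in the question.
    """
    path = os.path.join(test_srcdir, "local_config_cuda", "cuda",
                        "nvvm", "libdevice", name)
    return os.path.isfile(path)


def run_with_test_srcdir(script, test_srcdir):
    """Run a Python script with TEST_SRCDIR set in its environment."""
    env = dict(os.environ, TEST_SRCDIR=test_srcdir)
    return subprocess.call([sys.executable, script], env=env)
```

For example, calling prepare_test_srcdir(os.getcwd()), confirming that libdevice_reachable returns True for the result, and then passing that result to run_with_test_srcdir together with "my_program.py" mirrors the shell commands above.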

Sorry for the trouble, and sorry that I don't have a less hacky answer for you right now.

[1] https://github.com/tensorflow/tensorflow/blob/e1f44d8/tensorflow/core/platform/cuda_libdevice_path.cc#L23 https://github.com/tensorflow/tensorflow/blob/1084748efa3234c7daa824718aeb7df7b9252def/tensorflow/core/platform/default/cuda_libdevice_path.cc#L27

Answer 2

https://github.com/tensorflow/tensorflow/pull/7079 should fix this. Thanks for the bug report!
