Segmentation fault (core dumped) on tf.Session()

Date: 2018-05-15 10:25:14

Tags: tensorflow installation segmentation-fault

I am new to TensorFlow.

I just installed TensorFlow and, to test the installation, I tried the following code. As soon as I start a TF session, I get a Segmentation fault (core dumped) error.

bafhf@remote-server:~$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/bafhf/anaconda3/envs/ismll/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> tf.Session()
2018-05-15 12:04:15.461361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1349] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
Segmentation fault (core dumped)

My nvidia-smi output is:

Tue May 15 12:12:26 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:05:00.0 Off |                    2 |
| N/A   31C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc --version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

And gcc --version is:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Here is my PATH:

/home/bafhf/bin:/home/bafhf/.local/bin:/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib:/home/bafhf/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

and my LD_LIBRARY_PATH:

/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib


I am running this on a server where I do not have root access. I still installed everything following the instructions on the official website.

Edit: New observations:

It seems the GPU allocates memory for the process for a second, and then the segmentation fault (core dumped) error is thrown:

[screenshot: terminal output]

Edit 2: Changed the TensorFlow version

I downgraded my TensorFlow version from v1.8 to v1.5. The problem still persists.

Is there any way to solve or debug this issue?

7 Answers:

Answer 0 (score: 2)

This could happen because you are using multiple GPUs here. Try setting the CUDA visible devices to only one of the GPUs. See this link for instructions on how to do that. In my case, this solved the problem.
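For reference, a minimal sketch of that approach (assuming you want to expose only GPU 0; adjust the index for your setup). The environment variable must be set before TensorFlow initializes CUDA:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # hide the second K80

import tensorflow as tf

# If the crash was caused by the multi-GPU setup, the session should now start.
with tf.Session() as sess:
    print(sess.run(tf.constant("single-GPU session OK")))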

Answer 1 (score: 1)

Check that you are using the exact CUDA and cuDNN versions required by TensorFlow, and that your graphics driver version is the one that ships with that CUDA version.

I once had a similar problem with a driver that was too recent. Downgrading it to the version shipped with the CUDA release required by TensorFlow solved the issue for me.
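As a rough way to compare the driver and toolkit versions from Python, here is a hedged sketch using ctypes (it assumes libcudart.so from /usr/local/cuda is on LD_LIBRARY_PATH; the exact library file name may differ on your system):

import ctypes

cudart = ctypes.CDLL("libcudart.so")

driver_ver = ctypes.c_int()
runtime_ver = ctypes.c_int()
cudart.cudaDriverGetVersion(ctypes.byref(driver_ver))    # e.g. 9010 means the driver supports CUDA 9.1
cudart.cudaRuntimeGetVersion(ctypes.byref(runtime_ver))  # e.g. 9000 means the CUDA 9.0 toolkit

print("driver supports CUDA:", driver_ver.value)
print("runtime (toolkit)   :", runtime_ver.value)
# The driver value should be at least as high as the runtime value; a large
# mismatch suggests the driver and toolkit were not installed together.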

Answer 2 (score: 1)

As can be seen from your nvidia-smi output, the second GPU has an ECC code of 2. Regardless of any CUDA version or TF version error, this usually shows up as a segfault, sometimes with the CUDA_ERROR_ECC_UNCORRECTABLE flag in the stack trace.

I came to this conclusion from this post:

  "Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a while, but an "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.

  This could mean that you have a bad or marginal RAM cell in your GPU device memory.

  Marginal circuits of any kind may not fail 100%, but are more likely to fail under the stress of heavy use - and the associated rise in temperature.

A reboot usually clears ECC errors. If it doesn't, the only option seems to be replacing the hardware.
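Before replacing hardware, it can help to look at the ECC counters directly. A small sketch (assuming nvidia-smi is on PATH) that shells out to it from Python:

import subprocess

# "-q -d ECC" prints the detailed ECC error counters for every GPU.
report = subprocess.check_output(["nvidia-smi", "-q", "-d", "ECC"]).decode()
print(report)  # look for non-zero uncorrectable (double-bit) error counts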


So what did I do, and how did I finally solve the problem?

  1. I tested my code on a separate machine with an NVIDIA 1050 Ti, and there it ran perfectly fine.
  2. I made the code run only on the first card, whose ECC value was normal, just to narrow down the issue. I did this by following this post and setting the CUDA_VISIBLE_DEVICES environment variable; see the sketch below.
  3. I then requested a restart of the Tesla K80 server to check whether a reboot would fix the issue. It took them a while, but the server was then restarted.

    Now the problem no longer exists and I can run both cards for my TensorFlow implementations.
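The sketch mentioned in step 2 above (assuming TF 1.x and that GPU 0 is the card with the healthy ECC status):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # keep only the first K80 visible

from tensorflow.python.client import device_lib

# Should list the CPU plus exactly one GPU; the faulty card must not appear.
for dev in device_lib.list_local_devices():
    print(dev.name, dev.physical_device_desc)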

Answer 3 (score: 1)

In case anyone is still interested: I happened to have the same issue, with "Volatile Uncorr. ECC" in my output. In my case the problem was incompatible versions, as shown below:

  Loaded runtime CuDNN library: 7.1.1 but source was compiled with: 7.2.1. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration. Segmentation fault

After upgrading the cuDNN library to 7.3.1 (which is greater than 7.2.1), the segmentation fault disappeared. To upgrade, I did the following (as also documented here); a quick check of the version actually loaded at runtime is sketched after the steps.

  1. Download the cuDNN library from the NVIDIA website
  2. sudo tar -xzvf [TAR_FILE]
  3. sudo cp cuda/include/cudnn.h /usr/local/cuda/include
  4. sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
  5. sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
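To confirm that the upgraded library is the one actually picked up at runtime, here is a hedged sketch using ctypes (it assumes libcudnn.so is on LD_LIBRARY_PATH, e.g. in /usr/local/cuda/lib64):

import ctypes

cudnn = ctypes.CDLL("libcudnn.so")
cudnn.cudnnGetVersion.restype = ctypes.c_size_t

version = cudnn.cudnnGetVersion()  # e.g. 7301 for cuDNN 7.3.1
print("runtime cuDNN version:", version)
# This should match or exceed the version TensorFlow was compiled against
# (7.2.1 in the error message above), with the same major version.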

Answer 4 (score: 0)

I ran into this problem recently.

The reason was having multiple GPUs inside a docker container. The solution is fairly simple: you either

set CUDA_VISIBLE_DEVICES, see https://stackoverflow.com/a/50464695/2091555

or use --ipc=host to launch docker if you need multiple GPUs, e.g.:

docker run --runtime nvidia --ipc host \
  --rm -it \
  nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04

This issue is actually quite tricky: the segfault happened during the cuInit() call inside the docker container, while everything worked fine on the host. I will leave the logs here to make it easier for search engines to surface this answer for other people.

(base) root@e121c445c1eb:~# conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Collecting package metadata (current_repodata.json): / Segmentation fault (core dumped)

(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.conda.572.1569384636
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 572]
[New LWP 576]

warning: Unexpected size of section `.reg-xstate/572' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python /opt/conda/bin/conda upgrade conda'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/572' in core file.
#0  0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
[Current thread is 1 (Thread 0x7f82bbfd7700 (LWP 572))]
(gdb) bt
#0  0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#1  0x00007f829f06e3a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f829f07002c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f829f0e04f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f82b99a1ec0 in ffi_call_unix64 () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#5  0x00007f82b99a187d in ffi_call () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#6  0x00007f82b9bb7f7e in _call_function_pointer (argcount=1, resmem=0x7ffded858980, restype=<optimized out>, atypes=0x7ffded858940, avalues=0x7ffded858960, pProc=0x7f829f0e0380 <cuInit>, 
    flags=4353) at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:827
#7  _ctypes_callproc () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:1184
#8  0x00007f82b9bb89b4 in PyCFuncPtr_call () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/_ctypes.c:3969
#9  0x000055c05db9bd2b in _PyObject_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:199
#10 0x000055c05dbf7026 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4619
#11 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
#12 0x000055c05db9a79b in function_code_fastcall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:283
#13 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:408
#14 0x000055c05dbf2846 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4616
#15 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
... (stack omitted)
#46 0x000055c05db9aa27 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:433
---Type <return> to continue, or q <return> to quit---q
Quit

Another attempt was installing with pip:

(base) root@e121c445c1eb:~# pip install torch torchvision
(base) root@e121c445c1eb:~# python
Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
Segmentation fault (core dumped)

(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.python.28.1569385311 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 28]

warning: Unexpected size of section `.reg-xstate/28' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bt
Core was generated by `python'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/28' in core file.
#0  0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffaa1d623a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffaa1d6402c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffaa1dd44f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffaee75f724 in cudart::globalState::loadDriverInternal() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007ffaee760643 in cudart::__loadDriverInternalUtil() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6  0x00007ffafe2cda99 in __pthread_once_slow (once_control=0x7ffaeebe2cb0 <cudart::globalState::loadDriver()::loadDriverControl>, 
... (stack omitted)

Answer 5 (score: 0)

I was facing the same issue as well. You can also try the workaround below for it.

I followed these steps:

  1. Reinstall Python 3.5 or above.
  2. Reinstall CUDA and add the cuDNN libraries to it.
  3. Reinstall the TensorFlow 1.8.0 GPU version.

Answer 6 (score: -1)

I am using TensorFlow in a cloud environment with Paperspace.

Updating to cuDNN 7.3.1 did not work for me.

One approach is to build TensorFlow with proper GPU and CPU support.

This is not a proper solution, but it solved my problem temporarily (downgrading TensorFlow to 1.5.0):

pip uninstall tensorflow-gpu
pip install tensorflow==1.5.0
pip install numpy==1.14.0
pip install six==1.10.0
pip install joblib==0.12
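As a quick sanity check after the downgrade (a sketch, not part of the original answer), confirm the installed version and that a session now starts without a segfault:

import tensorflow as tf

print(tf.__version__)  # expected: 1.5.0

with tf.Session() as sess:
    print(sess.run(tf.constant("session started OK")))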

Hope this helps!