Python TensorFlow GPU error when fitting a model on Ubuntu 18.04

Asked: 2018-12-22 03:21:53

Tags: python ubuntu tensorflow deep-learning

I'm receiving the following error when fitting a model: Segmentation fault (core dumped). I'm on Ubuntu 18.04 with an Nvidia RTX 2070 (used for CUDA) and an AMD RX 570 (driving a 4k display). I don't believe the dual GPUs are the problem, since I could run code successfully on the RTX 2070 before installing the AMD card. I went through the tutorial Installing Tensorflow-GPU to set up the system for deep learning. Here is the code I'm trying to run, taken from Install Tensorflow with GPU support:

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Flatten, MaxPooling2D, Conv2D
from keras.callbacks import TensorBoard

# Load MNIST and add a channels dimension for the Conv2D layers
(X_train, y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(60000, 28, 28, 1).astype('float32')
X_test = X_test.reshape(10000, 28, 28, 1).astype('float32')

# Scale pixel values to [0, 1]
X_train /= 255
X_test /= 255

# One-hot encode the labels
n_classes = 10
y_train = keras.utils.to_categorical(y_train, n_classes)
y_test = keras.utils.to_categorical(y_test, n_classes)

# Simple LeNet-style CNN
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Log training to TensorBoard; the crash happens during fit()
tensor_board = TensorBoard('./logs/LeNet-MNIST-1')

model.fit(X_train, y_train, batch_size=16, epochs=15, verbose=1, validation_data=(X_test, y_test), callbacks=[tensor_board])

Here is the output from running the code above:

Using TensorFlow backend.
Train on 60000 samples, validate on 10000 samples
2018-12-21 21:28:32.425989: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-12-21 21:28:33.111624: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-21 21:28:33.112435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.65
pciBusID: 0000:09:00.0
totalMemory: 7.77GiB freeMemory: 7.65GiB
2018-12-21 21:28:33.112452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-21 21:28:33.380127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-21 21:28:33.380166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-12-21 21:28:33.380172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-12-21 21:28:33.380625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7359 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:09:00.0, compute capability: 7.5)
Epoch 1/15
Segmentation fault (core dumped)

Watching the nvidia-smi window, it shows usage for about one second, then drops to zero, and then the terminal gives the segmentation fault. I tried running it in Jupyter as well, and the kernel simply dies. The only thing I can think of is the versions of the programs I have installed. Here are those versions:

GCC:

gcc version 6.5.0 20181026 (Ubuntu 6.5.0-2ubuntu1~18.04) 

CUDA:

CUDA Version 9.0.176
CUDA Patch Version 9.0.176.4

Tensorflow:

1.12.0

CUDNN:

#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

My nvidia-smi output looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.23       Driver Version: 415.23       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   46C    P0     1W / 175W |      0MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

According to the blog mentioned above, this code should run easily on an RTX 2070, but it refuses to. Any suggestions?

1 Answer:

Answer 0 (score: 0)

I actually found the answer as a comment at the bottom of the blog (the one with the MNIST example mentioned above). Here is the comment:

Ok, I found out that this was probably related to different versions of cudnn. I created a new environment with conda, specifying python=3.6 (conda create --name tf-gpu python=3.6), and then installed tensorflow-gpu=1.8.0 (conda install tensorflow-gpu=1.8.0). I still wonder why this happens, but at least now all the code on this page runs smoothly.
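
For reference, those commands as a block (the activation step is my addition; conda versions from that era may need source activate instead):

conda create --name tf-gpu python=3.6
conda activate tf-gpu
conda install tensorflow-gpu=1.8.0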

I created a new conda environment with those specific installs, and the code now runs smoothly. I'll leave this posted in case anyone else runs into the problem, since my original issue was not with this MNIST code but with some other code.