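TensorFlow is very slow with 16 GPUs and gets stuck

Time: 2018-01-31 07:18:19

Tags: python tensorflow amazon-ec2 nvidia

I am using an Amazon EC2 instance with 16 GPUs for computation. After I had configured everything I needed and tested it in Python, something strange happened.

Here is the experiment: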

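import time

import tensorflow as tf

a = time.time()  # start time, for measuring how long session creation takes
hello = tf.constant('hello')
sess = tf.Session()  # TF 1.x API; creating the session initialises the GPUs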

After running the above, I got a long series of messages:

2018-01-31 07:10:27.922290: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-01-31 07:10:27.922347: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-01-31 07:10:27.922360: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-01-31 07:10:27.922371: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-01-31 07:10:27.922381: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-01-31 07:11:05.263488: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-31 07:11:05.265392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:0f.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-01-31 07:11:05.487461: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x56312fdf3970 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-01-31 07:11:05.488072: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-31 07:11:05.489826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:10.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-01-31 07:11:05.707955: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x56312fdf7e80 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-01-31 07:11:05.708452: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-31 07:11:05.709916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 2 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:11.0
Total memory: 11.17GiB
Free memory: 11.10GiB

and so on...
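To see where the time goes in the excerpt above, the gaps between log timestamps can be computed directly. This is only a minimal sketch: the timestamps are copied from the log lines shown, while the event labels and the helper name `gaps_in_seconds` are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the log excerpt above (only the lines used here).
# The labels are informal descriptions, not part of the log format.
LOG_EVENTS = [
    ("last CPU warning", "2018-01-31 07:10:27.922381"),
    ("Found device 0", "2018-01-31 07:11:05.265392"),
    ("Found device 1", "2018-01-31 07:11:05.489826"),
    ("Found device 2", "2018-01-31 07:11:05.709916"),
]


def gaps_in_seconds(events):
    """Return (label, seconds since the previous event) for each event after the first."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    times = [datetime.strptime(stamp, fmt) for _, stamp in events]
    return [
        (events[i][0], (times[i] - times[i - 1]).total_seconds())
        for i in range(1, len(events))
    ]


for label, seconds in gaps_in_seconds(LOG_EVENTS):
    print("%-16s +%.3f s" % (label, seconds))
```

In this excerpt the long pause (about 37 s) comes before the first device is reported; devices 0 to 2 then follow roughly 0.22 s apart.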

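It seems TensorFlow is scanning the GPU devices, but it is very slow. I waited 5 minutes to see the output above, and after that it stayed stuck until Amazon automatically disconnected me. I had previously done the same thing on my lab server, which has 4 Tesla K40s, and everything went smoothly.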
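One way to compare against the 4-GPU lab server might be to expose only a few GPUs to the process and time session creation again. The sketch below assumes the standard CUDA environment variable `CUDA_VISIBLE_DEVICES` and the TF 1.x `tf.Session` API used in the question; the helper names `restrict_visible_gpus` and `time_session_startup` are hypothetical, and this is a diagnostic idea rather than a known fix.

```python
import os


def restrict_visible_gpus(count):
    """Expose only the first `count` GPUs to CUDA via CUDA_VISIBLE_DEVICES.

    This has to happen before CUDA is initialised in the process, so it is
    safest to call it before importing TensorFlow.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    value = ",".join(str(i) for i in range(count))
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


def time_session_startup():
    """Time tf.Session() creation (TF 1.x API, as in the question)."""
    import time

    import tensorflow as tf  # imported late so the env var is already set

    start = time.time()
    sess = tf.Session()
    elapsed = time.time() - start
    sess.close()
    return elapsed


# 4 matches the number of GPUs on the lab server mentioned above.
restrict_visible_gpus(4)
```

Calling `time_session_startup()` afterwards on the EC2 instance, once with 4 visible GPUs and once with all 16, would show whether startup time grows with the number of devices or is slow regardless.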

Does anyone know why this happens?

1 answer:

Answer 0 (score: 0)

After a lot of trial and error, I finally solved the problem. I uninstalled everything and reinstalled the NVIDIA driver from its deb file, but installed it with a specific command. (The command is not preserved in this copy of the answer.)

Then I used Anaconda to install accelerate and tensorflow, and afterwards installed CUDA following the standard procedure.