Question

How do I interpret the output TensorFlow prints while building and executing a computation graph on a GPGPU?

Given the following command, which runs an arbitrary TensorFlow script using the Python API:

python3 tensorflow_test.py > out

The first part, stream_executor, looks like it is loading dependencies.

I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally

What is a NUMA node?

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I assume this is where it finds the available GPU:

I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:01:00.0
Total memory: 11.25GiB
Free memory: 11.15GiB

Some GPU initialization? What is DMA?

I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:01:00.0)

Why does it throw an error (E)?

E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 11.15G (11976531968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

There is a great answer on what the pool_allocator does: https://stackoverflow.com/a/35166985/4233809

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 3160 get requests, put_count=2958 evicted_count=1000 eviction_rate=0.338066 and unsatisfied allocation rate=0.412025
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1743 get requests, put_count=1970 evicted_count=1000 eviction_rate=0.507614 and unsatisfied allocation rate=0.456684
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 256 to 281
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1986 get requests, put_count=2519 evicted_count=1000 eviction_rate=0.396983 and unsatisfied allocation rate=0.264854
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 655 to 720
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 28728 get requests, put_count=28680 evicted_count=1000 eviction_rate=0.0348675 and unsatisfied allocation rate=0.0418407
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 1694 to 1863

Answer 1

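About NUMA: https://software.intel.com/en-us/articles/optimizing-applications-for-numa

Roughly speaking, if you have a dual-socket machine, each CPU has its own memory and has to reach the other processor's memory over a slower QPI link. So each CPU plus its memory is a NUMA node.

You could potentially treat two different NUMA nodes as two different devices and structure your network to optimize for the different intra-node and inter-node bandwidth.

However, I don't think there is enough wiring in TF right now to do this. The detection doesn't work either: I just tried on a machine with 2 NUMA nodes, and it still printed the same message and initialized to 1 NUMA node.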
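To see what the kernel itself reports, here is a minimal sketch that mimics the fallback described in the log message. It assumes the value comes from the standard Linux sysfs attribute numa_node under /sys/bus/pci/devices/<pci bus id>/; the function name read_numa_node and its sysfs_root parameter are hypothetical and are not part of TensorFlow.

```python
import os


def read_numa_node(pci_bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node the kernel reports for a PCI device.

    Falls back to node 0 when the file is missing, unreadable,
    or holds a negative value, as the log message describes.
    """
    # sysfs directory names use lowercase hex digits
    path = os.path.join(sysfs_root, pci_bus_id.lower(), "numa_node")
    try:
        with open(path) as f:
            value = int(f.read().strip())
    except (OSError, ValueError):
        return 0  # assumption: treat a missing or malformed file as node 0
    # The kernel writes -1 when it has no NUMA information for the device.
    return value if value >= 0 else 0


if __name__ == "__main__":
    # PCI bus id taken from the "Found device 0" log lines in the question.
    print(read_numa_node("0000:01:00.0"))
```

If this prints 0 on a machine that really has two nodes, checking the raw file contents would show whether the kernel wrote -1 or whether the fallback is elsewhere.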

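DMA = Direct Memory Access. You could potentially copy data from one GPU to another without involving the CPU (for example over NVLink). NVLink integration isn't there yet.

As for the error, TensorFlow tries to allocate close to the GPU's maximum memory, so it sounds like some of your GPU memory has already been allocated to something else and the allocation failed.

You can do something like the following to avoid allocating so much memory: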

import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3  # don't hog all vRAM
config.operation_timeout_in_ms = 15000  # terminate on long hangs
sess = tf.InteractiveSession("", config=config)
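To put numbers on this, the following sketch converts the byte count from the failed allocation into GiB and estimates what a 0.3 fraction would reserve. It assumes the fraction is applied to the card's total memory (11.25 GiB in the log); the helper names are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def fraction_budget_gib(total_gib, fraction):
    """Approximate memory reserved for a given per-process fraction."""
    return total_gib * fraction


requested_bytes = 11976531968  # from the CUDA_ERROR_OUT_OF_MEMORY log line
requested_gib = bytes_to_gib(requested_bytes)
budget_gib = fraction_budget_gib(11.25, 0.3)

print("requested: %.2f GiB" % requested_gib)
print("budget at 0.3: %.3f GiB" % budget_gib)
```

The failed request works out to about 11.15 GiB, which matches the "Free memory" figure in the device properties, so the process tried to take everything that was reported as free.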

Answer 2

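"successfully opened CUDA library xxx locally" means the library was loaded, but it does not mean it will be used.
"successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero" means your kernel does not have NUMA support. You can read about NUMA here and here.
"Found device 0 with properties:" means you have 1 GPU that can be used. It lists the properties of this GPU.
DMA is Direct Memory Access. See Wikipedia for more.
The "failed to allocate 11.15G" error explains clearly what happened, but it is hard to say why you need that much memory without seeing the code.
The pool allocator messages are explained in this answer.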
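As a rough aid for reading the pool allocator lines, the sketch below parses one of them and recomputes the eviction rate as evicted_count / put_count. That formula is an assumption inferred from the logged numbers, which it reproduces; the regular expression and the function name parse_pool_line are hypothetical.

```python
import re

# Matches the counters in a "PoolAllocator: After N get requests, ..." line.
POOL_RE = re.compile(
    r"After (?P<get>\d+) get requests, put_count=(?P<put>\d+) "
    r"evicted_count=(?P<evicted>\d+) eviction_rate=(?P<rate>[\d.]+)"
)


def parse_pool_line(line):
    """Extract the counters from a PoolAllocator log line, or return None."""
    m = POOL_RE.search(line)
    if m is None:
        return None
    put = int(m.group("put"))
    evicted = int(m.group("evicted"))
    return {
        "get_requests": int(m.group("get")),
        "put_count": put,
        "evicted_count": evicted,
        "logged_eviction_rate": float(m.group("rate")),
        # Assumption: eviction rate is evictions divided by puts.
        "recomputed_eviction_rate": evicted / put if put else 0.0,
    }


line = (
    "I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] "
    "PoolAllocator: After 3160 get requests, put_count=2958 "
    "evicted_count=1000 eviction_rate=0.338066 "
    "and unsatisfied allocation rate=0.412025"
)
stats = parse_pool_line(line)
print(stats)
```

In the question's log the eviction rate falls from about 0.34 to about 0.03 as pool_size_limit_ is raised, which suggests the pool is settling on a size that fits the workload.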
