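I am running TensorFlow code, and it always gets stuck at this point:
tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 256 to 281
I tried different memory configurations, but none of them made a difference.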
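The code never fails, but it never progresses past that point either, so I eventually cancelled the job. I am running it on a Tesla K40m, with 4 CPUs with 16G of memory each.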
Here is the full output:
2017-11-10 17:00:15.091618: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-10 17:00:15.091997: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-10 17:00:15.092010: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-10 17:00:17.609926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:0d:00.0
Total memory: 11.17GiB
Free memory: 11.09GiB
2017-11-10 17:00:17.609969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-11-10 17:00:17.609979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-11-10 17:00:17.609994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:0d:00.0)
2017-11-10 17:00:39.678955: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1787 get requests, put_count=1575 evicted_count=1000 eviction_rate=0.634921 and unsatisfied allocation rate=0.734191
2017-11-10 17:00:39.679428: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
2017-11-10 17:01:25.550744: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2416 get requests, put_count=2444 evicted_count=1000 eviction_rate=0.409165 and unsatisfied allocation rate=0.411838
2017-11-10 17:01:25.551299: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 256 to 281
slurmstepd: error: *** JOB 5538559 ON dgpu501-26-r CANCELLED AT 2017-11-10T17:47:10 ***
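To put a number on how long the job sat without output, here is a minimal sketch that compares the last TensorFlow log line with the Slurm cancellation time. It is not part of my job script. The two timestamps are copied from the output above, the helper name `idle_seconds` is hypothetical, and it assumes both timestamps are in the same time zone.

```python
from datetime import datetime

# Timestamps copied from the log output above.
LAST_TF_LOG = "2017-11-10 17:01:25.551299"  # final pool_allocator.cc:259 line
CANCELLED_AT = "2017-11-10T17:47:10"        # slurmstepd cancellation time


def idle_seconds(last_log: str, cancelled: str) -> float:
    """Return the seconds between the last log line and the cancellation.

    Assumes both timestamps use the same time zone (an assumption,
    the log itself does not state one).
    """
    start = datetime.strptime(last_log, "%Y-%m-%d %H:%M:%S.%f")
    end = datetime.strptime(cancelled, "%Y-%m-%dT%H:%M:%S")
    return (end - start).total_seconds()


if __name__ == "__main__":
    gap = idle_seconds(LAST_TF_LOG, CANCELLED_AT)
    print(f"No output for about {gap / 60:.1f} minutes before cancellation")
```

So the job produced no further log lines for roughly three quarters of an hour before I cancelled it.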
Any suggestions?