TensorFlow: CIFAR-10 multi-GPU example performs worse with more GPUs

Date: 2016-07-11 20:16:31

Tags: neural-network tensorflow distributed-computing

I want to test the distributed version of TensorFlow across multiple GPUs.

I am running the CIFAR-10 multi-GPU example on an AWS g2.8xlarge EC2 instance.

Running cifar10_multi_gpu_train.py (code here) for 2000 steps took 427 seconds with 1 GPU (flag --num_gpus=1). Afterwards the eval.py script reported precision @ 1 = 0.537.

Running the same example for the same number of steps (with each step executed in parallel across all GPUs), but with 4 GPUs (flag --num_gpus=4), took about 530 seconds, and the eval.py script reported only a slightly higher precision @ 1 of 0.552 (perhaps due to randomness in the computation?).
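As a rough sanity check on the timings above (this assumes each GPU tower consumes its own batch of 128 images per step, which matches the script's defaults; the numbers are taken directly from the question), per-example throughput actually improves with 4 GPUs even though wall-clock time per 2000 steps is longer:

```python
# Rough throughput comparison from the timings in the question.
# Assumption: each GPU tower processes its own batch of 128 images
# per step, so 4 GPUs consume 4x the examples per step.
steps = 2000
batch_per_tower = 128

examples_per_sec_1gpu = steps * 1 * batch_per_tower / 427.0  # 1 GPU, 427 s
examples_per_sec_4gpu = steps * 4 * batch_per_tower / 530.0  # 4 GPUs, 530 s

speedup = examples_per_sec_4gpu / examples_per_sec_1gpu
print(round(examples_per_sec_1gpu), round(examples_per_sec_4gpu), round(speedup, 2))
```

In other words, comparing a fixed number of steps compares different amounts of training data: with 4 GPUs each step processes four times as many examples, so examples per second roughly triple even though each step is slower.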

Why does the example perform worse with a higher number of GPUs? I used a very small number of steps for testing purposes and was expecting a much larger gain in precision with 4 GPUs. Did I miss something or make a basic mistake? Has anyone else tried this example?

Thank you very much.

1 Answer:

Answer 0 (score: 0)

The cifar10 example places its variables on the CPU by default, which is what the multi-GPU architecture requires. Compared to a single-GPU setup, you can expect roughly a 1.5x speedup with 2 GPUs.
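The variables-on-CPU pattern the answer refers to can be sketched as below (a minimal illustration in TF 1.x style via `tf.compat.v1`; the helper name `variable_on_cpu` is illustrative, though the official example uses a similar `_variable_on_cpu` helper). Because the shared copy of each variable lives on the CPU, every GPU tower pays a PCIe transfer to read it and to send back gradients, which is one source of the communication overhead discussed here:

```python
# Sketch of the variables-on-CPU pattern used for multi-GPU towers.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

def variable_on_cpu(name, shape):
    # Pin the variable to the CPU so all GPU towers share one copy;
    # each tower then transfers it over PCIe every step.
    with tf.device('/cpu:0'):
        return tf.get_variable(name, shape,
                               initializer=tf.zeros_initializer())

w = variable_on_cpu('weights', [3, 3])
print(w.device)  # the variable is placed on the CPU device
```

Each GPU tower would then compute its loss and gradients on its own `/gpu:N` device while reading `w` from the CPU, and the averaged gradients are applied back on the CPU.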

Your problem is related to the dual-GPU architecture of the boards in that instance (a g2.8xlarge carries GRID K520 cards, each of which is a dual-GPU board). Each board has an on-board PCIe switch through which its two GPUs communicate, and this introduces communication overhead.