I have to test the distributed version of TensorFlow across multiple GPUs.
I ran the CIFAR-10 multi-GPU example on an AWS g2.8xlarge EC2 instance.
The running time for 2000 steps of cifar10_multi_gpu_train.py (code here) was 427 seconds with 1 GPU (flag num_gpu=1). Afterwards, the eval.py script returned a precision @ 1 of 0.537.
Running the same example for the same number of steps (with each step executed in parallel across all GPUs), but with 4 GPUs (flag num_gpu=4), the running time was about 530 seconds, and the eval.py script returned only a slightly higher precision @ 1 of 0.552 (maybe due to randomness in the computation?).
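For reference, the invocations were roughly as follows (the exact flag names are from my setup and may differ slightly in the current version of the tutorial scripts):

```
# Train on 1 GPU for 2000 steps
python cifar10_multi_gpu_train.py --num_gpu=1 --max_steps=2000

# Train on 4 GPUs for the same number of steps
python cifar10_multi_gpu_train.py --num_gpu=4 --max_steps=2000

# Evaluate the checkpoint written by training
python eval.py
```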
Why is the example performing worse with a higher number of GPUs? I used a very small number of steps for testing purposes and was expecting a much larger gain in precision with 4 GPUs. Did I miss something or make some basic mistake? Has anyone else tried the above example?
Thank you very much.