I cloned the CNTK repository last week and built it with nvidia-docker on an AWS p2.8xlarge instance. Everything appears to work, except that I am not getting any speedup from running on multiple GPUs with 1-bit SGD enabled. I am running the CMUDict Sequence2Sequence_Distributed.py example. Here is my transcript when running on a single GPU:
root@cb3aab88d4e9:/cntk/Examples/SequenceToSequence/CMUDict/Python# python Sequence2Sequence_Distributed.py
Selected GPU[0] Tesla K80 as the process wide default device.
ping [requestnodes (before change)]: 1 nodes pinging each other
ping [requestnodes (after change)]: 1 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 1 out of 1 MPI nodes on a single host (1 requested); we (0) are in (participating)
ping [mpihelper]: 1 nodes pinging each other
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
Finished Epoch[1 of 160]: [Training] loss = 4.234002 * 64, metric = 98.44% * 64 3.014s ( 21.2 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.231473 * 71, metric = 85.92% * 71 1.013s ( 70.1 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.227827 * 61, metric = 81.97% * 61 0.953s ( 64.0 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.227088 * 68, metric = 86.76% * 68 0.970s ( 70.1 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.222957 * 62, metric = 88.71% * 62 0.922s ( 67.2 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.221479 * 63, metric = 84.13% * 63 0.950s ( 66.3 samples/s);
Here is the transcript when I run on two GPUs:
root@cb3aab88d4e9:/cntk/Examples/SequenceToSequence/CMUDict/Python# mpiexec --allow-run-as-root --npernode 2 python Sequence2Sequence_Distributed.py -q 1
Selected GPU[0] Tesla K80 as the process wide default device.
Selected CPU as the process wide default device.
ping [requestnodes (before change)]: 2 nodes pinging each other
ping [requestnodes (before change)]: 2 nodes pinging each other
ping [requestnodes (after change)]: 2 nodes pinging each other
ping [requestnodes (after change)]: 2 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 2 out of 2 MPI nodes on a single host (2 requested); we (0) are in (participating)
ping [mpihelper]: 2 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 2 out of 2 MPI nodes on a single host (2 requested); we (1) are in (participating)
ping [mpihelper]: 2 nodes pinging each other
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
Then there is this message - does it mean that the GPUs are not being used when I run the job as two MPI processes? How can I fix this?
NcclComm: disabled, at least one rank using CPU device
NcclComm: disabled, at least one rank using CPU device
As you can see, the samples/s rate is much lower:
Finished Epoch[1 of 160]: [Training] loss = 4.233786 * 73, metric = 97.26% * 73 5.377s ( 13.6 samples/s);
Finished Epoch[1 of 160]: [Training] loss = 4.233786 * 73, metric = 97.26% * 73 5.877s ( 12.4 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.232235 * 67, metric = 94.03% * 67 2.196s ( 30.5 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.232235 * 67, metric = 94.03% * 67 2.197s ( 30.5 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.229795 * 54, metric = 83.33% * 54 2.227s ( 24.2 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.229795 * 54, metric = 83.33% * 54 2.227s ( 24.2 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.229072 * 83, metric = 87.95% * 83 2.229s ( 37.2 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.229072 * 83, metric = 87.95% * 83 2.229s ( 37.2 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.227438 * 46, metric = 86.96% * 46 1.667s ( 27.6 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.227438 * 46, metric = 86.96% * 46 1.666s ( 27.6 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.225661 * 65, metric = 84.62% * 65 2.388s ( 27.2 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.225661 * 65, metric = 84.62% * 65 2.388s ( 27.2 samples/s);
Answer 0 (score: 0)
I was able to solve the original problem of getting nvidia-docker to use multiple GPUs by telling nvidia-docker which GPUs to expose. For example, to make 2 GPUs available, use:
NV_GPU=0,1 nvidia-docker run ....
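With both GPUs visible inside the container, the same two-process launch from the question (mpiexec --allow-run-as-root --npernode 2 python Sequence2Sequence_Distributed.py -q 1) should then give each rank its own GPU, rather than the second rank falling back to the CPU as the "Selected CPU as the process wide default device" and "NcclComm: disabled" lines indicate.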
Answer 1 (score: 0)
By default, Sequence2Sequence_Distributed.py runs data-parallel SGD with 32-bit quantization. Could you try a different distributed training algorithm, for example block momentum with a warm start?
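Below is a minimal sketch of what that might look like. It is not the example script's actual code: model, loss and metric stand in for the objects the script builds, the block_size/distributed_after values are illustrative rather than tuned, and exact API names may vary slightly between CNTK versions.

import cntk as C

# Assumes `model`, `loss`, `metric` are built as in Sequence2Sequence_Distributed.py.
local_learner = C.momentum_sgd(model.parameters,
                               lr=C.learning_rate_schedule(0.001, C.UnitType.sample),
                               momentum=C.momentum_as_time_constant_schedule(1100))

# Block-momentum SGD with a warm start: each worker runs plain local SGD until
# `distributed_after` samples have been seen, then switches to block-wise model
# aggregation across workers. Requires a 1bit-SGD-enabled build (as in the build info above).
dist_learner = C.train.distributed.block_momentum_distributed_learner(
    local_learner,
    block_size=3200,         # samples per aggregation block (illustrative value)
    distributed_after=6400)  # warm-start threshold in samples (illustrative value)

trainer = C.Trainer(model, (loss, metric), [dist_learner])
# ... run the training loop as the script already does ...
C.train.distributed.Communicator.finalize()  # every MPI rank must call this at the end

The job is still launched with mpiexec exactly as in the question.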
Also, if you want better parallelism across multiple GPUs, consider increasing your minibatch size (the default is 16). You can use nvidia-smi to check GPU utilization and memory usage.
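For example, while training is running you can watch utilization from another shell on the host with a standard nvidia-smi invocation (not CNTK-specific), e.g. nvidia-smi -l 1, which reprints per-GPU utilization and memory use every second so you can confirm that both K80s are actually busy.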