I cloned the CNTK repository last week and built it with nvidia-docker on an AWS p2.8xlarge instance. Everything appears to work, except that I am not getting any speedup from running on multiple GPUs with 1-bit SGD enabled. I am running the CMUDict Sequence2Sequence_Distributed.py example. Here is my transcript when running on a single GPU:
root@cb3aab88d4e9:/cntk/Examples/SequenceToSequence/CMUDict/Python# python Sequence2Sequence_Distributed.py
Selected GPU[0] Tesla K80 as the process wide default device.
ping [requestnodes (before change)]: 1 nodes pinging each other
ping [requestnodes (after change)]: 1 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 1 out of 1 MPI nodes on a single host (1 requested); we (0) are in (participating)
ping [mpihelper]: 1 nodes pinging each other
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
Finished Epoch[1 of 160]: [Training] loss = 4.234002 * 64, metric = 98.44% * 64 3.014s ( 21.2 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.231473 * 71, metric = 85.92% * 71 1.013s ( 70.1 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.227827 * 61, metric = 81.97% * 61 0.953s ( 64.0 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.227088 * 68, metric = 86.76% * 68 0.970s ( 70.1 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.222957 * 62, metric = 88.71% * 62 0.922s ( 67.2 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.221479 * 63, metric = 84.13% * 63 0.950s ( 66.3 samples/s);
Here is the transcript when I run on two GPUs:
root@cb3aab88d4e9:/cntk/Examples/SequenceToSequence/CMUDict/Python# mpiexec --allow-run-as-root --npernode 2 python Sequence2Sequence_Distributed.py -q 1
Selected GPU[0] Tesla K80 as the process wide default device.
Selected CPU as the process wide default device.
ping [requestnodes (before change)]: 2 nodes pinging each other
ping [requestnodes (before change)]: 2 nodes pinging each other
ping [requestnodes (after change)]: 2 nodes pinging each other
ping [requestnodes (after change)]: 2 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 2 out of 2 MPI nodes on a single host (2 requested); we (0) are in (participating)
ping [mpihelper]: 2 nodes pinging each other
requestnodes [MPIWrapperMpi]: using 2 out of 2 MPI nodes on a single host (2 requested); we (1) are in (participating)
ping [mpihelper]: 2 nodes pinging each other
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
-------------------------------------------------------------------
Build info:
Built time: Jun 2 2017 19:46:11
Last modified date: Fri Jun 2 19:21:14 2017
Build type: release
Build target: GPU
With 1bit-SGD: yes
With ASGD: yes
Math lib: mkl
CUDA_PATH: /usr/local/cuda
CUB_PATH: /usr/local/cub-1.4.1
CUDNN_PATH: /usr/local/cudnn
Build Branch: master
Build SHA1: 2bcdc9dff6dc6393813f6043d80e167fb31aed72
Built by Source/CNTK/buildinfo.h$$0 on 72cb11c66133
Build Path: /cntk
MPI distribution: Open MPI
MPI version: 1.10.3
-------------------------------------------------------------------
Then there is this message - does it mean that the GPUs are not being used when I run the job as two MPI processes? How can I fix this?
NcclComm: disabled, at least one rank using CPU device
NcclComm: disabled, at least one rank using CPU device
As you can see, the samples/s rate is much lower:
Finished Epoch[1 of 160]: [Training] loss = 4.233786 * 73, metric = 97.26% * 73 5.377s ( 13.6 samples/s);
Finished Epoch[1 of 160]: [Training] loss = 4.233786 * 73, metric = 97.26% * 73 5.877s ( 12.4 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.232235 * 67, metric = 94.03% * 67 2.196s ( 30.5 samples/s);
Finished Epoch[2 of 160]: [Training] loss = 4.232235 * 67, metric = 94.03% * 67 2.197s ( 30.5 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.229795 * 54, metric = 83.33% * 54 2.227s ( 24.2 samples/s);
Finished Epoch[3 of 160]: [Training] loss = 4.229795 * 54, metric = 83.33% * 54 2.227s ( 24.2 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.229072 * 83, metric = 87.95% * 83 2.229s ( 37.2 samples/s);
Finished Epoch[4 of 160]: [Training] loss = 4.229072 * 83, metric = 87.95% * 83 2.229s ( 37.2 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.227438 * 46, metric = 86.96% * 46 1.667s ( 27.6 samples/s);
Finished Epoch[5 of 160]: [Training] loss = 4.227438 * 46, metric = 86.96% * 46 1.666s ( 27.6 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.225661 * 65, metric = 84.62% * 65 2.388s ( 27.2 samples/s);
Finished Epoch[6 of 160]: [Training] loss = 4.225661 * 65, metric = 84.62% * 65 2.388s ( 27.2 samples/s);
Answer 0 (score: 0)
I was able to solve the original problem of getting nvidia-docker to use multiple GPUs by telling nvidia-docker which GPUs to expose. For example, to make 2 GPUs available, use:
NV_GPU=0,1 nvidia-docker run ....
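With both GPUs visible inside the container, the same two-process launch from the question (mpiexec --allow-run-as-root --npernode 2 python Sequence2Sequence_Distributed.py -q 1) should then give each rank its own GPU, rather than the second rank falling back to the CPU as the "Selected CPU as the process wide default device" and "NcclComm: disabled" lines indicate.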
Answer 1 (score: 0)
By default, Sequence2Sequence_Distributed.py runs data-parallel SGD with 32-bit quantization. Could you try a different distributed training algorithm, for example block momentum with a warm start?
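Below is a minimal sketch of what that might look like. It is not the example script's actual code: model, loss and metric stand in for the objects the script builds, the block_size/distributed_after values are illustrative rather than tuned, and exact API names may vary slightly between CNTK versions.

import cntk as C

# Assumes `model`, `loss`, `metric` are built as in Sequence2Sequence_Distributed.py.
local_learner = C.momentum_sgd(model.parameters,
                               lr=C.learning_rate_schedule(0.001, C.UnitType.sample),
                               momentum=C.momentum_as_time_constant_schedule(1100))

# Block-momentum SGD with a warm start: each worker runs plain local SGD until
# `distributed_after` samples have been seen, then switches to block-wise model
# aggregation across workers. Requires a 1bit-SGD-enabled build (as in the build info above).
dist_learner = C.train.distributed.block_momentum_distributed_learner(
    local_learner,
    block_size=3200,         # samples per aggregation block (illustrative value)
    distributed_after=6400)  # warm-start threshold in samples (illustrative value)

trainer = C.Trainer(model, (loss, metric), [dist_learner])
# ... run the training loop as the script already does ...
C.train.distributed.Communicator.finalize()  # every MPI rank must call this at the end

The job is still launched with mpiexec exactly as in the question.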
Also, if you want better parallelism across multiple GPUs, consider increasing your minibatch size (the default is 16). You can use nvidia-smi to check GPU utilization and memory usage.
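For example, while training is running you can watch utilization from another shell on the host with a standard nvidia-smi invocation (not CNTK-specific), e.g. nvidia-smi -l 1, which reprints per-GPU utilization and memory use every second so you can confirm that both K80s are actually busy.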