Is the GPU slower than the CPU for Theano prediction?

Time: 2017-04-17 22:34:56

Tags: keras theano prediction theano-cuda

I am using Keras + Theano to predict labels with a pretrained VGG model on an NVIDIA TK1.

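I get faster prediction times on the CPU than on the GPU. If my memory is correct, prediction also involves a lot of repetitive number crunching, so I don't understand why the GPU would be slower.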

Does anyone have a good explanation?

GPU details line: Using gpu device 0: GK20A (CNMeM is enabled with initial size: 75.0% of memory, cuDNN Version is too old. Update to v5, was 2000.)

The profiling results for the prediction are as follows:

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  39.5%    39.5%       0.019s       6.42e-03s     C        3       3   theano.sandbox.cuda.blas.GpuDot22
  24.8%    64.3%       0.012s       6.04e-03s     C        2       2   theano.sandbox.cuda.blas.GpuCorrMM
  16.4%    80.8%       0.008s       1.33e-03s     C        6       6   theano.sandbox.cuda.basic_ops.GpuElemwise
   7.8%    88.5%       0.004s       1.89e-03s     C        2       2   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   4.2%    92.7%       0.002s       2.03e-03s     C        1       1   theano.sandbox.rng_mrg.GPU_mrg_uniform
   3.8%    96.4%       0.002s       4.57e-04s     C        4       4   theano.sandbox.cuda.basic_ops.GpuContiguous
   2.3%    98.8%       0.001s       5.66e-04s     C        2       2   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.5%    99.3%       0.000s       2.51e-04s     C        1       1   theano.sandbox.cuda.nnet.GpuSoftmaxWithBias
   0.5%    99.8%       0.000s       2.39e-04s     C        1       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.8%       0.000s       1.37e-05s     C        3       3   theano.sandbox.cuda.basic_ops.GpuReshape
   0.0%    99.9%       0.000s       9.54e-06s     C        2       2   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%    99.9%       0.000s       4.35e-06s     C        4       4   theano.tensor.elemwise.Elemwise
   0.0%    99.9%       0.000s       5.01e-06s     C        2       2   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.000s       3.26e-06s     C        3       3   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       4.53e-06s     C        2       2   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.000s       5.96e-06s     C        1       1   theano.tensor.elemwise.Prod
   0.0%   100.0%       0.000s       3.10e-06s     C        1       1   theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  39.5%    39.5%       0.019s       6.42e-03s     C        3        3   GpuDot22
  24.8%    64.3%       0.012s       6.04e-03s     C        2        2   GpuCorrMM{valid, (1, 1)}
  11.2%    75.5%       0.005s       1.36e-03s     C        4        4   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)]
   7.8%    83.3%       0.004s       1.89e-03s     C        2        2   GpuDownsampleFactorMax{(2, 2),True}
   4.2%    87.4%       0.002s       2.03e-03s     C        1        1   GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}
   3.8%    91.2%       0.002s       4.57e-04s     C        4        4   GpuContiguous
   2.9%    94.1%       0.001s       1.43e-03s     C        1        1   GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)]
   2.3%    96.5%       0.001s       5.66e-04s     C        2        2   GpuFromHost
   2.3%    98.8%       0.001s       1.12e-03s     C        1        1   GpuElemwise{Composite{Switch(i0, (i1 * i2 * i3), i2)}}[(0, 2)]
   0.5%    99.3%       0.000s       2.51e-04s     C        1        1   GpuSoftmaxWithBias
   0.5%    99.8%       0.000s       2.39e-04s     C        1        1   HostFromGpu
   0.1%    99.8%       0.000s       1.60e-05s     C        2        2   GpuReshape{4}
   0.0%    99.9%       0.000s       9.54e-06s     C        2        2   GpuSubtensor{::, ::, ::int64, ::int64}
   0.0%    99.9%       0.000s       5.01e-06s     C        2        2   GpuDimShuffle{x,0}
   0.0%    99.9%       0.000s       4.53e-06s     C        2        2   MakeVector{dtype='int64'}
   0.0%    99.9%       0.000s       9.06e-06s     C        1        1   GpuReshape{2}
   0.0%    99.9%       0.000s       4.17e-06s     C        2        2   Elemwise{Composite{((i0 + ((i1 + i2) // i3)) // i3)}}[(0, 2)]
   0.0%   100.0%       0.000s       5.96e-06s     C        1        1   Prod{acc_dtype=int64}
   0.0%   100.0%       0.000s       5.96e-06s     C        1        1   Elemwise{Cast{float32}}
   0.0%   100.0%       0.000s       5.01e-06s     C        1        1   Shape_i{0}
   ... (remaining 4 Ops account for   0.02%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  36.4%    36.4%       0.018s       1.77e-02s      1    33   GpuDot22(GpuReshape{2}.0, dense_5_W)
  15.7%    52.1%       0.008s       7.64e-03s      1    18   GpuCorrMM{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
   9.1%    61.2%       0.004s       4.44e-03s      1    28   GpuCorrMM{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
   5.7%    66.9%       0.003s       2.76e-03s      1    25   GpuDownsampleFactorMax{(2, 2),True}(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0)
   4.2%    71.0%       0.002s       2.03e-03s      1    20   GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
   3.6%    74.6%       0.002s       1.74e-03s      1    34   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[ 0.5]]}, GpuDot22.0, GpuDimShuffle{x,0}.0)
   3.2%    77.8%       0.002s       1.54e-03s      1    22   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuCorrMM{valid, (1, 1)}.0, GpuReshape{4}.0)
   2.9%    80.7%       0.001s       1.43e-03s      1    23   GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)](GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}.1, CudaNdarrayConstant{[[[[ 0.80000001]]]]})
   2.7%    83.4%       0.001s       1.29e-03s      1    36   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[ 0.5]]}, GpuDot22.0, GpuDimShuffle{x,0}.0)
   2.3%    85.7%       0.001s       1.12e-03s      1    31   GpuElemwise{Composite{Switch(i0, (i1 * i2 * i3), i2)}}[(0, 2)](GpuFromHost.0, CudaNdarrayConstant{[[[[ 1.25]]]]}, GpuDownsampleFactorMax{(2, 2),True}.0, GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)].0)
   2.2%    87.8%       0.001s       1.06e-03s      1    14   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   2.2%    90.0%       0.001s       1.06e-03s      1    35   GpuDot22(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0, dense_6_W)
   2.1%    92.1%       0.001s       1.01e-03s      1    30   GpuDownsampleFactorMax{(2, 2),True}(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0)
   2.0%    94.1%       0.001s       9.61e-04s      1     3   GpuFromHost(convolution2d_input_1)
   1.8%    95.9%       0.001s       8.71e-04s      1    29   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuCorrMM{valid, (1, 1)}.0, GpuReshape{4}.0)
   1.6%    97.4%       0.001s       7.58e-04s      1    15   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   1.0%    98.4%       0.000s       4.72e-04s      1    37   GpuDot22(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0, dense_7_W)
   0.5%    98.9%       0.000s       2.51e-04s      1    38   GpuSoftmaxWithBias(GpuDot22.0, dense_7_b)
   0.5%    99.4%       0.000s       2.39e-04s      1    39   HostFromGpu(GpuSoftmaxWithBias.0)
   0.3%    99.8%       0.000s       1.70e-04s      1    19   GpuFromHost(Elemwise{Cast{float32}}.0)
   ... (remaining 20 Apply instances account for 0.25%(0.00s) of the runtime)

1 answer:

Answer 0 (score: 1)

It turns out that when Keras + Theano runs predictions, the first call is the slowest. After the model is loaded, it does not seem to be fully in memory yet, and the first prediction call takes care of the rest of the setup.

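After the first prediction, the remaining predictions become very fast. The first prediction with the VGG model takes about 3 seconds, but subsequent predictions take 0.2 to 0.5 seconds. So everything is fine.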
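
To make the warm-up effect visible when comparing devices, it can help to time the first call separately from the later ones. The following is only a minimal sketch: `time_predictions` and `LazyModel` are hypothetical names introduced here for illustration, and `LazyModel` merely simulates a one-time setup cost with `time.sleep`; it is not the Keras or Theano API.

```python
import time


def time_predictions(predict_fn, x, n_runs=5):
    """Call predict_fn(x) n_runs times.

    Returns (first_call_seconds, list_of_later_call_seconds).
    """
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(x)
        timings.append(time.perf_counter() - start)
    return timings[0], timings[1:]


class LazyModel:
    """Hypothetical stand-in for a model whose first call does extra setup."""

    def __init__(self):
        self.ready = False

    def predict(self, x):
        if not self.ready:
            time.sleep(0.2)  # simulated one-time setup cost (assumption)
            self.ready = True
        time.sleep(0.01)  # simulated steady-state prediction cost
        return [v * 2 for v in x]


if __name__ == "__main__":
    model = LazyModel()
    first, rest = time_predictions(model.predict, [1, 2, 3])
    print("first call: %.3fs" % first)
    print("later calls: %s" % ", ".join("%.3fs" % t for t in rest))
```

With a real Keras model, the model's `predict` method would be passed as `predict_fn` together with an input batch. Only the later timings should be compared between CPU and GPU, since the first one includes the setup described above.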