Theano GPU usage: trying to understand how to avoid transfers

Date: 2016-07-14 22:10:45

Tags: python performance gpu theano

I'm trying to understand how to use Theano efficiently with a GPU, and have been looking at some simple examples relevant to the actual problem I want to solve.

First, I have the following function:

x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
MakeXVec = theano.function([phase], x)

Here TFlatTimes is a float32 shared vector of ~300,000 times, TRP is a float32 shared scalar, and phase is a free parameter. Basically, given a phase, this should return a vector of times wrapped into the range -TRP/2 to +TRP/2.
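
For completeness, the setup looks roughly like this (the initial values below are placeholders standing in for my real data):

import numpy as np
import theano
import theano.tensor as tt
from theano import sandbox

# ~300,000 float32 times; with device=gpu this shared vector lives on the GPU
TFlatTimes = theano.shared(np.zeros(300000, dtype=np.float32), name='TFlatTimes')
# float32 shared scalars (Tg1width is used in the second example below)
TRP = theano.shared(np.float32(1.0), name='TRP')
Tg1width = theano.shared(np.float32(0.1), name='Tg1width')
# the free parameter
phase = tt.fscalar('phase')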

If I evaluate this function 20,000 times on the CPU, using the following flags:

setenv THEANO_FLAGS 'mode=FAST_RUN,device=cpu,floatX=float32'

it finishes in 20 seconds.
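
The timing harness is nothing fancy; it is essentially just the following (the phase value is an arbitrary placeholder):

import time

t0 = time.time()
for i in range(20000):
    xvec = MakeXVec(np.float32(0.1))  # placeholder phase value
print('elapsed: %.1f s' % (time.time() - t0))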

If I simply change those flags to:

setenv THEANO_FLAGS 'mode=FAST_RUN,device=gpu,floatX=float32'

it runs in 15 seconds, and if I change the function to:

x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
MakeXVec = theano.function([phase], sandbox.cuda.basic_ops.gpu_from_host(x))

it runs in 8 seconds.

This all makes sense. Simply putting the function on the GPU speeds it up, but if I specify that the x vector should not be transferred back to the host, it speeds up quite a bit more.

I then tried to make the example slightly more complicated:

x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
y = tt.exp(-0.5*(x)**2/Tg1width**2)

MakeXVec = theano.function([phase], x)
MakeYVec = theano.function([x], y)

and, similarly:

x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
y = tt.exp(-0.5*(x)**2/Tg1width**2)

MakeXVec = theano.function([phase], sandbox.cuda.basic_ops.gpu_from_host(x))
MakeYVec = theano.function([x], sandbox.cuda.basic_ops.gpu_from_host(y))

Then, comparing 20,000 evaluations of MakeXVec followed by MakeYVec: with or without the transfer suppression, both versions take 20 seconds!
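
To be explicit, the chained benchmark is essentially the following (placeholder phase value again); note that x, the input of MakeYVec, is the host-side symbolic vector from the graph above:

for i in range(20000):
    xvec = MakeXVec(np.float32(0.1))  # a CudaNdarray in the gpu_from_host version
    yvec = MakeYVec(xvec)             # fed back in as MakeYVec's input x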

I find this very confusing. Even if MakeYVec gained nothing at all from avoiding the transfer, the version using gpu_from_host should still be faster.

I figured this must mean there is still some data transfer going on, so I tried using the profiler. However, I don't understand the output at all:

Function profiling
==================
  Message: gpu.py:396
  Time in 20000 calls to Function.__call__: 4.140420e+00s
  Time in Function.fn.__call__: 3.110955e+00s (75.136%)
  Time in thunks: 2.404417e+00s (58.072%)
  Total compile time: 3.778410e-01s
    Number of Apply nodes: 6
    Theano Optimizer time: 2.160940e-01s
       Theano validate time: 4.644394e-04s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.131201e-02s
       Import time 3.909349e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.509s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  65.6%    65.6%       1.578s       3.94e-05s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuElemwise
  32.8%    98.4%       0.789s       1.97e-05s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuFromHost
   1.6%   100.0%       0.038s       9.44e-07s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  42.5%    42.5%       1.022s       5.11e-05s     C     20000        1   GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}
  32.8%    75.3%       0.789s       1.97e-05s     C     40000        2   GpuFromHost
  23.1%    98.4%       0.556s       2.78e-05s     C     20000        1   GpuElemwise{mul,no_inplace}
   1.6%   100.0%       0.038s       9.44e-07s     C     40000        2   GpuDimShuffle{x}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  42.5%    42.5%       1.022s       5.11e-05s   20000     5   GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}(<CudaNdarrayType(float32, vector)>, GpuDimShuffle{x}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{x}.0)
  23.1%    65.6%       0.556s       2.78e-05s   20000     4   GpuElemwise{mul,no_inplace}(CudaNdarrayConstant{[ 0.5]}, GpuDimShuffle{x}.0)
  20.3%    85.9%       0.487s       2.44e-05s   20000     1   GpuFromHost(phase)
  12.6%    98.4%       0.302s       1.51e-05s   20000     0   GpuFromHost(<TensorType(float32, scalar)>)
   1.0%    99.4%       0.024s       1.19e-06s   20000     3   GpuDimShuffle{x}(GpuFromHost.0)
   0.6%   100.0%       0.014s       6.96e-07s   20000     2   GpuDimShuffle{x}(GpuFromHost.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
==================
  Message: gpu.py:397
  Time in 20000 calls to Function.__call__: 1.929911e+01s
  Time in Function.fn.__call__: 6.424892e+00s (33.291%)
  Time in thunks: 6.052589e+00s (31.362%)
  Total compile time: 2.222779e-01s
    Number of Apply nodes: 5
    Theano Optimizer time: 6.165099e-02s
       Theano validate time: 1.147032e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.032209e-02s
       Import time 3.772259e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.511s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  84.1%    84.1%       5.088s       1.27e-04s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuFromHost
  15.6%    99.7%       0.943s       2.36e-05s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.3%   100.0%       0.021s       1.05e-06s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  84.1%    84.1%       5.088s       1.27e-04s     C     40000        2   GpuFromHost
   9.3%    93.3%       0.560s       2.80e-05s     C     20000        1   GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)]
   6.3%    99.7%       0.383s       1.92e-05s     C     20000        1   GpuElemwise{Sqr}[(0, 0)]
   0.3%   100.0%       0.021s       1.05e-06s     C     20000        1   GpuDimShuffle{x}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  68.5%    68.5%       4.145s       2.07e-04s   20000     1   GpuFromHost(<TensorType(float32, vector)>)
  15.6%    84.1%       0.944s       4.72e-05s   20000     0   GpuFromHost(<TensorType(float32, scalar)>)
   9.3%    93.3%       0.560s       2.80e-05s   20000     4   GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)](CudaNdarrayConstant{[-0.5]}, GpuFromHost.0, GpuElemwise{Sqr}[(0, 0)].0)
   6.3%    99.7%       0.383s       1.92e-05s   20000     3   GpuElemwise{Sqr}[(0, 0)](GpuDimShuffle{x}.0)
   0.3%   100.0%       0.021s       1.05e-06s   20000     2   GpuDimShuffle{x}(GpuFromHost.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
==================
  Message: gpu.py:403
  Time in 20000 calls to Function.__call__: 5.344449e+00s
  Time in Function.fn.__call__: 5.037848e+00s (94.263%)
  Time in thunks: 3.975644e+00s (74.388%)
  Total compile time: 2.805271e-01s
    Number of Apply nodes: 11
    Theano Optimizer time: 1.057930e-01s
       Theano validate time: 2.398491e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.601004e-02s
       Import time 1.127982e-02s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.512s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  43.9%    43.9%       1.747s       2.18e-05s     C    80000       4   theano.sandbox.cuda.basic_ops.GpuElemwise
  27.1%    71.0%       1.077s       5.38e-05s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuJoin
  16.5%    87.6%       0.657s       3.29e-05s     C    20000       1   theano.sandbox.rng_mrg.GPU_mrg_uniform
   8.9%    96.5%       0.353s       1.76e-05s     C    20000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   2.7%    99.1%       0.107s       1.78e-06s     C    60000       3   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.9%   100.0%       0.034s       1.71e-06s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuReshape
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  27.1%    27.1%       1.077s       5.38e-05s     C     20000        1   GpuJoin
  16.5%    43.6%       0.657s       3.29e-05s     C     20000        1   GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}
  13.9%    57.5%       0.551s       2.75e-05s     C     20000        1   GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}
  13.8%    71.3%       0.549s       2.75e-05s     C     20000        1   GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}
   8.9%    80.2%       0.353s       1.76e-05s     C     20000        1   HostFromGpu
   8.2%    88.4%       0.327s       1.64e-05s     C     20000        1   GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)]
   8.0%    96.5%       0.320s       1.60e-05s     C     20000        1   GpuElemwise{Mul}[(0, 1)]
   1.9%    98.4%       0.076s       1.89e-06s     C     40000        2   GpuSubtensor{:int64:}
   0.9%    99.2%       0.034s       1.71e-06s     C     20000        1   GpuReshape{1}
   0.8%   100.0%       0.031s       1.56e-06s     C     20000        1   GpuSubtensor{int64::}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  27.1%    27.1%       1.077s       5.38e-05s   20000     7   GpuJoin(TensorConstant{0}, GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}.0, GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)].0)
  16.5%    43.6%       0.657s       3.29e-05s   20000     0   GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}(<CudaNdarrayType(float32, vector)>, TensorConstant{(1,) of 2})
  13.9%    57.5%       0.551s       2.75e-05s   20000     5   GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}(GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
  13.8%    71.3%       0.549s       2.75e-05s   20000     3   GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}(CudaNdarrayConstant{[-2.]}, GpuSubtensor{:int64:}.0)
   8.9%    80.2%       0.353s       1.76e-05s   20000    10   HostFromGpu(GpuReshape{1}.0)
   8.2%    88.4%       0.327s       1.64e-05s   20000     6   GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)](GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
   8.0%    96.5%       0.320s       1.60e-05s   20000     4   GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[ 6.28318548]}, GpuSubtensor{int64::}.0)
   1.0%    97.4%       0.039s       1.97e-06s   20000     2   GpuSubtensor{:int64:}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
   0.9%    98.4%       0.036s       1.81e-06s   20000     8   GpuSubtensor{:int64:}(GpuJoin.0, Constant{-1})
   0.9%    99.2%       0.034s       1.71e-06s   20000     9   GpuReshape{1}(GpuSubtensor{:int64:}.0, TensorConstant{(1,) of 1})
   0.8%   100.0%       0.031s       1.56e-06s   20000     1   GpuSubtensor{int64::}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
==================
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 60000 calls to Function.__call__: 2.878398e+01s
  Time in Function.fn.__call__: 1.457369e+01s (50.631%)
  Time in thunks: 1.243265e+01s (43.193%)
  Total compile time: 8.806460e-01s
    Number of Apply nodes: 6
    Theano Optimizer time: 3.835380e-01s
       Theano validate time: 4.009962e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.764414e-02s
       Import time 1.896143e-02s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.516s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  47.3%    47.3%       5.877s       7.35e-05s     C    80000       4   theano.sandbox.cuda.basic_ops.GpuFromHost
  34.3%    81.6%       4.268s       2.67e-05s     C   160000       8   theano.sandbox.cuda.basic_ops.GpuElemwise
   8.7%    90.3%       1.077s       5.38e-05s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuJoin
   5.3%    95.6%       0.657s       3.29e-05s     C    20000       1   theano.sandbox.rng_mrg.GPU_mrg_uniform
   2.8%    98.4%       0.353s       1.76e-05s     C    20000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.9%    99.3%       0.107s       1.78e-06s     C    60000       3   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.5%    99.7%       0.059s       9.78e-07s     C    60000       3   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.3%   100.0%       0.034s       1.71e-06s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuReshape
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  47.3%    47.3%       5.877s       7.35e-05s     C     80000        4   GpuFromHost
   8.7%    55.9%       1.077s       5.38e-05s     C     20000        1   GpuJoin
   8.2%    64.2%       1.022s       5.11e-05s     C     20000        1   GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}
   5.3%    69.4%       0.657s       3.29e-05s     C     20000        1   GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}
   4.5%    74.0%       0.560s       2.80e-05s     C     20000        1   GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)]
   4.5%    78.4%       0.556s       2.78e-05s     C     20000        1   GpuElemwise{mul,no_inplace}
   4.4%    82.8%       0.551s       2.75e-05s     C     20000        1   GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}
   4.4%    87.3%       0.549s       2.75e-05s     C     20000        1   GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}
   3.1%    90.3%       0.383s       1.92e-05s     C     20000        1   GpuElemwise{Sqr}[(0, 0)]
   2.8%    93.2%       0.353s       1.76e-05s     C     20000        1   HostFromGpu
   2.6%    95.8%       0.327s       1.64e-05s     C     20000        1   GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)]
   2.6%    98.4%       0.320s       1.60e-05s     C     20000        1   GpuElemwise{Mul}[(0, 1)]
   0.6%    99.0%       0.076s       1.89e-06s     C     40000        2   GpuSubtensor{:int64:}
   0.5%    99.5%       0.059s       9.78e-07s     C     60000        3   GpuDimShuffle{x}
   0.3%    99.7%       0.034s       1.71e-06s     C     20000        1   GpuReshape{1}
   0.3%   100.0%       0.031s       1.56e-06s     C     20000        1   GpuSubtensor{int64::}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  33.3%    33.3%       4.145s       2.07e-04s   20000     1   GpuFromHost(<TensorType(float32, vector)>)
   8.7%    42.0%       1.077s       5.38e-05s   20000     7   GpuJoin(TensorConstant{0}, GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}.0, GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)].0)
   8.2%    50.2%       1.022s       5.11e-05s   20000     5   GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}(<CudaNdarrayType(float32, vector)>, GpuDimShuffle{x}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{x}.0)
   7.6%    57.8%       0.944s       4.72e-05s   20000     0   GpuFromHost(<TensorType(float32, scalar)>)
   5.3%    63.1%       0.657s       3.29e-05s   20000     0   GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}(<CudaNdarrayType(float32, vector)>, TensorConstant{(1,) of 2})
   4.5%    67.6%       0.560s       2.80e-05s   20000     4   GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)](CudaNdarrayConstant{[-0.5]}, GpuFromHost.0, GpuElemwise{Sqr}[(0, 0)].0)
   4.5%    72.1%       0.556s       2.78e-05s   20000     4   GpuElemwise{mul,no_inplace}(CudaNdarrayConstant{[ 0.5]}, GpuDimShuffle{x}.0)
   4.4%    76.5%       0.551s       2.75e-05s   20000     5   GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}(GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
   4.4%    80.9%       0.549s       2.75e-05s   20000     3   GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}(CudaNdarrayConstant{[-2.]}, GpuSubtensor{:int64:}.0)
   3.9%    84.8%       0.487s       2.44e-05s   20000     1   GpuFromHost(phase)
   3.1%    87.9%       0.383s       1.92e-05s   20000     3   GpuElemwise{Sqr}[(0, 0)](GpuDimShuffle{x}.0)
   2.8%    90.8%       0.353s       1.76e-05s   20000    10   HostFromGpu(GpuReshape{1}.0)
   2.6%    93.4%       0.327s       1.64e-05s   20000     6   GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)](GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
   2.6%    96.0%       0.320s       1.60e-05s   20000     4   GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[ 6.28318548]}, GpuSubtensor{int64::}.0)
   2.4%    98.4%       0.302s       1.51e-05s   20000     0   GpuFromHost(<TensorType(float32, scalar)>)
   0.3%    98.7%       0.039s       1.97e-06s   20000     2   GpuSubtensor{:int64:}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
   0.3%    99.0%       0.036s       1.81e-06s   20000     8   GpuSubtensor{:int64:}(GpuJoin.0, Constant{-1})
   0.3%    99.3%       0.034s       1.71e-06s   20000     9   GpuReshape{1}(GpuSubtensor{:int64:}.0, TensorConstant{(1,) of 1})
   0.3%    99.5%       0.031s       1.56e-06s   20000     1   GpuSubtensor{int64::}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
   0.2%    99.7%       0.024s       1.19e-06s   20000     3   GpuDimShuffle{x}(GpuFromHost.0)
   ... (remaining 2 Apply instances account for 0.28%(0.03s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
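
In case it's relevant, I turned the profiler on via Theano's profile flag (it can also be passed per function as theano.function(..., profile=True)), i.e. something like:

setenv THEANO_FLAGS 'mode=FAST_RUN,device=gpu,floatX=float32,profile=True'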

Can someone help me make sense of this? Why does specifying that the result should not be transferred give no gain when evaluating the two functions in sequence, while it clearly improves things when evaluating just one? And where in this enormous profiler output should I be looking to figure out what's going on?

Thanks very much for any help.

0 Answers:

There are no answers.