Theano GPU usage: trying to understand how to avoid transfers

Time: 2016-07-14 22:10:45

Tags: python performance gpu theano



x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
MakeXVec = theano.function([phase], x)

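Here TFlatTimes is a float32 shared vector of roughly 300,000 elements, TRP is a float32 shared scalar, and phase is a free parameter. Basically, given a phase, this should return a vector of times wrapped to lie between -TRP/2 and +TRP/2.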
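As a sanity check of the wrapping formula outside Theano, the same expression can be written in plain Python. This is only an illustrative sketch: the function name `wrap_times` and the sample values are made up for the example, and it runs on a short list rather than on the shared vector.

```python
def wrap_times(times, phase, trp):
    """Wrap each (time - phase) into the half-open interval [-trp/2, trp/2)."""
    half = trp / 2.0
    # Python's % with a positive modulus returns a value in [0, trp),
    # so subtracting half a period centres the result on zero.
    return [((t - phase + half) % trp) - half for t in times]


# Hypothetical sample values, chosen only for illustration.
sample = wrap_times([0.0, 0.3, 0.6, 1.2, 2.75], phase=0.25, trp=1.0)
print(sample)
```

Every returned value lies in [-trp/2, trp/2), so a time exactly half a period away from the phase maps to -trp/2 rather than +trp/2.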


setenv THEANO_FLAGS 'mode=FAST_RUN,device=cpu,floatX=float32'



setenv THEANO_FLAGS 'mode=FAST_RUN,device=gpu,floatX=float32'


x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
MakeXVec = theano.function([phase], sandbox.cuda.basic_ops.gpu_from_host(x))




x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
y = tt.exp(-0.5*(x)**2/Tg1width**2)

MakeXVec = theano.function([phase], x)
MakeYVec = theano.function([x], y)


x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
y = tt.exp(-0.5*(x)**2/Tg1width**2)

MakeXVec = theano.function([phase], sandbox.cuda.basic_ops.gpu_from_host(x))
MakeYVec = theano.function([x], sandbox.cuda.basic_ops.gpu_from_host(y))

I then compared evaluating MakeXVec followed by MakeYVec 20,000 times: with or without the data transfer, both versions take 20 seconds!



Function profiling
  Time in 20000 calls to Function.__call__: 4.140420e+00s
  Time in Function.fn.__call__: 3.110955e+00s (75.136%)
  Time in thunks: 2.404417e+00s (58.072%)
  Total compile time: 3.778410e-01s
    Number of Apply nodes: 6
    Theano Optimizer time: 2.160940e-01s
       Theano validate time: 4.644394e-04s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.131201e-02s
       Import time 3.909349e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.509s
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  65.6%    65.6%       1.578s       3.94e-05s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuElemwise
  32.8%    98.4%       0.789s       1.97e-05s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuFromHost
   1.6%   100.0%       0.038s       9.44e-07s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  42.5%    42.5%       1.022s       5.11e-05s     C     20000        1   GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}
  32.8%    75.3%       0.789s       1.97e-05s     C     40000        2   GpuFromHost
  23.1%    98.4%       0.556s       2.78e-05s     C     20000        1   GpuElemwise{mul,no_inplace}
   1.6%   100.0%       0.038s       9.44e-07s     C     40000        2   GpuDimShuffle{x}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  42.5%    42.5%       1.022s       5.11e-05s   20000     5   GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}(<CudaNdarrayType(float32, vector)>, GpuDimShuffle{x}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{x}.0)
  23.1%    65.6%       0.556s       2.78e-05s   20000     4   GpuElemwise{mul,no_inplace}(CudaNdarrayConstant{[ 0.5]}, GpuDimShuffle{x}.0)
  20.3%    85.9%       0.487s       2.44e-05s   20000     1   GpuFromHost(phase)
  12.6%    98.4%       0.302s       1.51e-05s   20000     0   GpuFromHost(<TensorType(float32, scalar)>)
   1.0%    99.4%       0.024s       1.19e-06s   20000     3   GpuDimShuffle{x}(GpuFromHost.0)
   0.6%   100.0%       0.014s       6.96e-07s   20000     2   GpuDimShuffle{x}(GpuFromHost.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
  Time in 20000 calls to Function.__call__: 1.929911e+01s
  Time in Function.fn.__call__: 6.424892e+00s (33.291%)
  Time in thunks: 6.052589e+00s (31.362%)
  Total compile time: 2.222779e-01s
    Number of Apply nodes: 5
    Theano Optimizer time: 6.165099e-02s
       Theano validate time: 1.147032e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.032209e-02s
       Import time 3.772259e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.511s
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  84.1%    84.1%       5.088s       1.27e-04s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuFromHost
  15.6%    99.7%       0.943s       2.36e-05s     C    40000       2   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.3%   100.0%       0.021s       1.05e-06s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  84.1%    84.1%       5.088s       1.27e-04s     C     40000        2   GpuFromHost
   9.3%    93.3%       0.560s       2.80e-05s     C     20000        1   GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)]
   6.3%    99.7%       0.383s       1.92e-05s     C     20000        1   GpuElemwise{Sqr}[(0, 0)]
   0.3%   100.0%       0.021s       1.05e-06s     C     20000        1   GpuDimShuffle{x}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  68.5%    68.5%       4.145s       2.07e-04s   20000     1   GpuFromHost(<TensorType(float32, vector)>)
  15.6%    84.1%       0.944s       4.72e-05s   20000     0   GpuFromHost(<TensorType(float32, scalar)>)
   9.3%    93.3%       0.560s       2.80e-05s   20000     4   GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)](CudaNdarrayConstant{[-0.5]}, GpuFromHost.0, GpuElemwise{Sqr}[(0, 0)].0)
   6.3%    99.7%       0.383s       1.92e-05s   20000     3   GpuElemwise{Sqr}[(0, 0)](GpuDimShuffle{x}.0)
   0.3%   100.0%       0.021s       1.05e-06s   20000     2   GpuDimShuffle{x}(GpuFromHost.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
  Time in 20000 calls to Function.__call__: 5.344449e+00s
  Time in Function.fn.__call__: 5.037848e+00s (94.263%)
  Time in thunks: 3.975644e+00s (74.388%)
  Total compile time: 2.805271e-01s
    Number of Apply nodes: 11
    Theano Optimizer time: 1.057930e-01s
       Theano validate time: 2.398491e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.601004e-02s
       Import time 1.127982e-02s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.512s
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  43.9%    43.9%       1.747s       2.18e-05s     C    80000       4   theano.sandbox.cuda.basic_ops.GpuElemwise
  27.1%    71.0%       1.077s       5.38e-05s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuJoin
  16.5%    87.6%       0.657s       3.29e-05s     C    20000       1   theano.sandbox.rng_mrg.GPU_mrg_uniform
   8.9%    96.5%       0.353s       1.76e-05s     C    20000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   2.7%    99.1%       0.107s       1.78e-06s     C    60000       3   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.9%   100.0%       0.034s       1.71e-06s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuReshape
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  27.1%    27.1%       1.077s       5.38e-05s     C     20000        1   GpuJoin
  16.5%    43.6%       0.657s       3.29e-05s     C     20000        1   GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}
  13.9%    57.5%       0.551s       2.75e-05s     C     20000        1   GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}
  13.8%    71.3%       0.549s       2.75e-05s     C     20000        1   GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}
   8.9%    80.2%       0.353s       1.76e-05s     C     20000        1   HostFromGpu
   8.2%    88.4%       0.327s       1.64e-05s     C     20000        1   GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)]
   8.0%    96.5%       0.320s       1.60e-05s     C     20000        1   GpuElemwise{Mul}[(0, 1)]
   1.9%    98.4%       0.076s       1.89e-06s     C     40000        2   GpuSubtensor{:int64:}
   0.9%    99.2%       0.034s       1.71e-06s     C     20000        1   GpuReshape{1}
   0.8%   100.0%       0.031s       1.56e-06s     C     20000        1   GpuSubtensor{int64::}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  27.1%    27.1%       1.077s       5.38e-05s   20000     7   GpuJoin(TensorConstant{0}, GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}.0, GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)].0)
  16.5%    43.6%       0.657s       3.29e-05s   20000     0   GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}(<CudaNdarrayType(float32, vector)>, TensorConstant{(1,) of 2})
  13.9%    57.5%       0.551s       2.75e-05s   20000     5   GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}(GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
  13.8%    71.3%       0.549s       2.75e-05s   20000     3   GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}(CudaNdarrayConstant{[-2.]}, GpuSubtensor{:int64:}.0)
   8.9%    80.2%       0.353s       1.76e-05s   20000    10   HostFromGpu(GpuReshape{1}.0)
   8.2%    88.4%       0.327s       1.64e-05s   20000     6   GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)](GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
   8.0%    96.5%       0.320s       1.60e-05s   20000     4   GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[ 6.28318548]}, GpuSubtensor{int64::}.0)
   1.0%    97.4%       0.039s       1.97e-06s   20000     2   GpuSubtensor{:int64:}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
   0.9%    98.4%       0.036s       1.81e-06s   20000     8   GpuSubtensor{:int64:}(GpuJoin.0, Constant{-1})
   0.9%    99.2%       0.034s       1.71e-06s   20000     9   GpuReshape{1}(GpuSubtensor{:int64:}.0, TensorConstant{(1,) of 1})
   0.8%   100.0%       0.031s       1.56e-06s   20000     1   GpuSubtensor{int64::}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
  Time in 60000 calls to Function.__call__: 2.878398e+01s
  Time in Function.fn.__call__: 1.457369e+01s (50.631%)
  Time in thunks: 1.243265e+01s (43.193%)
  Total compile time: 8.806460e-01s
    Number of Apply nodes: 6
    Theano Optimizer time: 3.835380e-01s
       Theano validate time: 4.009962e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.764414e-02s
       Import time 1.896143e-02s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.516s
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  47.3%    47.3%       5.877s       7.35e-05s     C    80000       4   theano.sandbox.cuda.basic_ops.GpuFromHost
  34.3%    81.6%       4.268s       2.67e-05s     C   160000       8   theano.sandbox.cuda.basic_ops.GpuElemwise
   8.7%    90.3%       1.077s       5.38e-05s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuJoin
   5.3%    95.6%       0.657s       3.29e-05s     C    20000       1   theano.sandbox.rng_mrg.GPU_mrg_uniform
   2.8%    98.4%       0.353s       1.76e-05s     C    20000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.9%    99.3%       0.107s       1.78e-06s     C    60000       3   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.5%    99.7%       0.059s       9.78e-07s     C    60000       3   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.3%   100.0%       0.034s       1.71e-06s     C    20000       1   theano.sandbox.cuda.basic_ops.GpuReshape
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  47.3%    47.3%       5.877s       7.35e-05s     C     80000        4   GpuFromHost
   8.7%    55.9%       1.077s       5.38e-05s     C     20000        1   GpuJoin
   8.2%    64.2%       1.022s       5.11e-05s     C     20000        1   GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}
   5.3%    69.4%       0.657s       3.29e-05s     C     20000        1   GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}
   4.5%    74.0%       0.560s       2.80e-05s     C     20000        1   GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)]
   4.5%    78.4%       0.556s       2.78e-05s     C     20000        1   GpuElemwise{mul,no_inplace}
   4.4%    82.8%       0.551s       2.75e-05s     C     20000        1   GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}
   4.4%    87.3%       0.549s       2.75e-05s     C     20000        1   GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}
   3.1%    90.3%       0.383s       1.92e-05s     C     20000        1   GpuElemwise{Sqr}[(0, 0)]
   2.8%    93.2%       0.353s       1.76e-05s     C     20000        1   HostFromGpu
   2.6%    95.8%       0.327s       1.64e-05s     C     20000        1   GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)]
   2.6%    98.4%       0.320s       1.60e-05s     C     20000        1   GpuElemwise{Mul}[(0, 1)]
   0.6%    99.0%       0.076s       1.89e-06s     C     40000        2   GpuSubtensor{:int64:}
   0.5%    99.5%       0.059s       9.78e-07s     C     60000        3   GpuDimShuffle{x}
   0.3%    99.7%       0.034s       1.71e-06s     C     20000        1   GpuReshape{1}
   0.3%   100.0%       0.031s       1.56e-06s     C     20000        1   GpuSubtensor{int64::}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  33.3%    33.3%       4.145s       2.07e-04s   20000     1   GpuFromHost(<TensorType(float32, vector)>)
   8.7%    42.0%       1.077s       5.38e-05s   20000     7   GpuJoin(TensorConstant{0}, GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}.0, GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)].0)
   8.2%    50.2%       1.022s       5.11e-05s   20000     5   GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}(<CudaNdarrayType(float32, vector)>, GpuDimShuffle{x}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{x}.0)
   7.6%    57.8%       0.944s       4.72e-05s   20000     0   GpuFromHost(<TensorType(float32, scalar)>)
   5.3%    63.1%       0.657s       3.29e-05s   20000     0   GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}(<CudaNdarrayType(float32, vector)>, TensorConstant{(1,) of 2})
   4.5%    67.6%       0.560s       2.80e-05s   20000     4   GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)](CudaNdarrayConstant{[-0.5]}, GpuFromHost.0, GpuElemwise{Sqr}[(0, 0)].0)
   4.5%    72.1%       0.556s       2.78e-05s   20000     4   GpuElemwise{mul,no_inplace}(CudaNdarrayConstant{[ 0.5]}, GpuDimShuffle{x}.0)
   4.4%    76.5%       0.551s       2.75e-05s   20000     5   GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}(GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
   4.4%    80.9%       0.549s       2.75e-05s   20000     3   GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}(CudaNdarrayConstant{[-2.]}, GpuSubtensor{:int64:}.0)
   3.9%    84.8%       0.487s       2.44e-05s   20000     1   GpuFromHost(phase)
   3.1%    87.9%       0.383s       1.92e-05s   20000     3   GpuElemwise{Sqr}[(0, 0)](GpuDimShuffle{x}.0)
   2.8%    90.8%       0.353s       1.76e-05s   20000    10   HostFromGpu(GpuReshape{1}.0)
   2.6%    93.4%       0.327s       1.64e-05s   20000     6   GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)](GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
   2.6%    96.0%       0.320s       1.60e-05s   20000     4   GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[ 6.28318548]}, GpuSubtensor{int64::}.0)
   2.4%    98.4%       0.302s       1.51e-05s   20000     0   GpuFromHost(<TensorType(float32, scalar)>)
   0.3%    98.7%       0.039s       1.97e-06s   20000     2   GpuSubtensor{:int64:}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
   0.3%    99.0%       0.036s       1.81e-06s   20000     8   GpuSubtensor{:int64:}(GpuJoin.0, Constant{-1})
   0.3%    99.3%       0.034s       1.71e-06s   20000     9   GpuReshape{1}(GpuSubtensor{:int64:}.0, TensorConstant{(1,) of 1})
   0.3%    99.5%       0.031s       1.56e-06s   20000     1   GpuSubtensor{int64::}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
   0.2%    99.7%       0.024s       1.19e-06s   20000     3   GpuDimShuffle{x}(GpuFromHost.0)
   ... (remaining 2 Apply instances account for 0.28%(0.03s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
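
The headline numbers in the profiles above can be compared with a few lines of arithmetic. The sketch below takes the totals from the second profile, the one whose graph contains the exp composite and therefore appears to belong to MakeYVec (an assumption, since the profiles are not labelled). It computes how much of the time in Function.__call__ falls outside Function.fn.__call__ and how much of the thunk time is spent in GpuFromHost. The variable names are illustrative.

```python
# Totals copied from the second "Function profiling" block above.
n_calls = 20000
t_call = 1.929911e+01      # Time in 20000 calls to Function.__call__ (s)
t_fn = 6.424892e+00        # Time in Function.fn.__call__ (s)
t_thunks = 6.052589e+00    # Time in thunks (s)
t_gpu_from_host = 5.088    # GpuFromHost, summed over both Apply nodes (s)

outside_fn = t_call - t_fn                   # time spent outside the compiled function
frac_outside = outside_fn / t_call           # fraction of the call time spent there
frac_transfer = t_gpu_from_host / t_thunks   # share of thunk time in GpuFromHost
per_call_us = t_call / n_calls * 1e6         # average wall time per call, microseconds

print("outside fn: %.2f s (%.1f%%)" % (outside_fn, 100 * frac_outside))
print("GpuFromHost share of thunks: %.1f%%" % (100 * frac_transfer))
print("per call: %.0f us" % per_call_us)
```

On these figures roughly two thirds of the 19.3 s is spent outside Function.fn.__call__, and GpuFromHost accounts for about 84% of the thunk time, which agrees with the 84.1% the profiler itself reports.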



0 answers:
