I'm trying to understand how to use Theano with a GPU effectively, and I've been looking for some simple examples that relate to the actual problem I want to solve.
First, I have the following function:
x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
MakeXVec = theano.function([phase], x)
Here TFlatTimes is a float32 shared vector with ~300,000 elements, TRP is a float32 shared scalar, and phase is a free parameter. Basically, given a phase, this should return a vector of times wrapped into the interval between -TRP/2 and +TRP/2.
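To make the intent concrete, here is the same wrapping arithmetic sketched in plain NumPy. The values below are made-up stand-ins (the real TFlatTimes is a ~300,000-element shared vector):

```python
import numpy as np

# Made-up stand-ins for the shared variables described above
TFlatTimes = np.array([0.0, 2.5, 7.0, 11.5], dtype=np.float32)  # time samples
TRP = np.float32(4.0)    # repetition period
phase = np.float32(1.0)  # free parameter

# Same expression as the Theano graph: fold each time into [-TRP/2, +TRP/2)
x = (TFlatTimes - phase + TRP / 2) % TRP - TRP / 2

print(x)  # every element lies in [-TRP/2, +TRP/2)
```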
If I evaluate this function 20,000 times on the CPU with the following flags:
setenv THEANO_FLAGS 'mode=FAST_RUN,device=cpu,floatX=float32'
it finishes in 20 seconds.
If I simply change the flags to:
setenv THEANO_FLAGS 'mode=FAST_RUN,device=gpu,floatX=float32'
it runs in 15 seconds, and if I change the function to:
x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
MakeXVec = theano.function([phase], sandbox.cuda.basic_ops.gpu_from_host(x))
it runs in 8 seconds.
This all makes sense: simply putting the function on the GPU speeds it up, and if I additionally specify that the x vector should not be transferred back to the host, it speeds up quite a bit more.
I then tried to make this example slightly more complicated:
x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
y = tt.exp(-0.5*(x)**2/Tg1width**2)
MakeXVec = theano.function([phase], x)
MakeYVec = theano.function([x], y)
and similarly:
x = ( TFlatTimes - phase + TRP/2) % (TRP ) - TRP/2
y = tt.exp(-0.5*(x)**2/Tg1width**2)
MakeXVec = theano.function([phase], sandbox.cuda.basic_ops.gpu_from_host(x))
MakeYVec = theano.function([x], sandbox.cuda.basic_ops.gpu_from_host(y))
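Conceptually, the two stages compute a wrapped time vector and then a Gaussian of it. Sketched in plain NumPy with made-up stand-in values for the shared variables, the pipeline is:

```python
import numpy as np

# Made-up stand-ins for the shared variables in the question
TFlatTimes = np.linspace(0.0, 100.0, 8).astype(np.float32)
TRP = np.float32(4.0)
Tg1width = np.float32(0.5)
phase = np.float32(1.0)

# Stage 1 (MakeXVec): fold the times into [-TRP/2, +TRP/2)
x = (TFlatTimes - phase + TRP / 2) % TRP - TRP / 2

# Stage 2 (MakeYVec): Gaussian envelope of width Tg1width in the wrapped time
y = np.exp(-0.5 * x**2 / Tg1width**2)
```

Splitting this into two compiled functions means x has to exist as a concrete array between the two calls, and the location of that intermediate (host or GPU) is exactly what the gpu_from_host variant is meant to control.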
Then I compared evaluating MakeXVec followed by MakeYVec 20,000 times, with and without the data transfers: both take 20 seconds!
I find this very confusing. Even if MakeYVec gained nothing at all from avoiding the data transfer, the version using gpu_from_host should still be faster.
I figured this must mean some data transfer is still happening, so I tried using the profiler. However, I don't understand the output at all:
Function profiling
==================
Message: gpu.py:396
Time in 20000 calls to Function.__call__: 4.140420e+00s
Time in Function.fn.__call__: 3.110955e+00s (75.136%)
Time in thunks: 2.404417e+00s (58.072%)
Total compile time: 3.778410e-01s
Number of Apply nodes: 6
Theano Optimizer time: 2.160940e-01s
Theano validate time: 4.644394e-04s
Theano Linker time (includes C, CUDA code generation/compiling): 1.131201e-02s
Import time 3.909349e-03s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.509s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
65.6% 65.6% 1.578s 3.94e-05s C 40000 2 theano.sandbox.cuda.basic_ops.GpuElemwise
32.8% 98.4% 0.789s 1.97e-05s C 40000 2 theano.sandbox.cuda.basic_ops.GpuFromHost
1.6% 100.0% 0.038s 9.44e-07s C 40000 2 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
42.5% 42.5% 1.022s 5.11e-05s C 20000 1 GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}
32.8% 75.3% 0.789s 1.97e-05s C 40000 2 GpuFromHost
23.1% 98.4% 0.556s 2.78e-05s C 20000 1 GpuElemwise{mul,no_inplace}
1.6% 100.0% 0.038s 9.44e-07s C 40000 2 GpuDimShuffle{x}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
42.5% 42.5% 1.022s 5.11e-05s 20000 5 GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}(<CudaNdarrayType(float32, vector)>, GpuDimShuffle{x}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{x}.0)
23.1% 65.6% 0.556s 2.78e-05s 20000 4 GpuElemwise{mul,no_inplace}(CudaNdarrayConstant{[ 0.5]}, GpuDimShuffle{x}.0)
20.3% 85.9% 0.487s 2.44e-05s 20000 1 GpuFromHost(phase)
12.6% 98.4% 0.302s 1.51e-05s 20000 0 GpuFromHost(<TensorType(float32, scalar)>)
1.0% 99.4% 0.024s 1.19e-06s 20000 3 GpuDimShuffle{x}(GpuFromHost.0)
0.6% 100.0% 0.014s 6.96e-07s 20000 2 GpuDimShuffle{x}(GpuFromHost.0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: gpu.py:397
Time in 20000 calls to Function.__call__: 1.929911e+01s
Time in Function.fn.__call__: 6.424892e+00s (33.291%)
Time in thunks: 6.052589e+00s (31.362%)
Total compile time: 2.222779e-01s
Number of Apply nodes: 5
Theano Optimizer time: 6.165099e-02s
Theano validate time: 1.147032e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.032209e-02s
Import time 3.772259e-03s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.511s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
84.1% 84.1% 5.088s 1.27e-04s C 40000 2 theano.sandbox.cuda.basic_ops.GpuFromHost
15.6% 99.7% 0.943s 2.36e-05s C 40000 2 theano.sandbox.cuda.basic_ops.GpuElemwise
0.3% 100.0% 0.021s 1.05e-06s C 20000 1 theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
84.1% 84.1% 5.088s 1.27e-04s C 40000 2 GpuFromHost
9.3% 93.3% 0.560s 2.80e-05s C 20000 1 GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)]
6.3% 99.7% 0.383s 1.92e-05s C 20000 1 GpuElemwise{Sqr}[(0, 0)]
0.3% 100.0% 0.021s 1.05e-06s C 20000 1 GpuDimShuffle{x}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
68.5% 68.5% 4.145s 2.07e-04s 20000 1 GpuFromHost(<TensorType(float32, vector)>)
15.6% 84.1% 0.944s 4.72e-05s 20000 0 GpuFromHost(<TensorType(float32, scalar)>)
9.3% 93.3% 0.560s 2.80e-05s 20000 4 GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)](CudaNdarrayConstant{[-0.5]}, GpuFromHost.0, GpuElemwise{Sqr}[(0, 0)].0)
6.3% 99.7% 0.383s 1.92e-05s 20000 3 GpuElemwise{Sqr}[(0, 0)](GpuDimShuffle{x}.0)
0.3% 100.0% 0.021s 1.05e-06s 20000 2 GpuDimShuffle{x}(GpuFromHost.0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: gpu.py:403
Time in 20000 calls to Function.__call__: 5.344449e+00s
Time in Function.fn.__call__: 5.037848e+00s (94.263%)
Time in thunks: 3.975644e+00s (74.388%)
Total compile time: 2.805271e-01s
Number of Apply nodes: 11
Theano Optimizer time: 1.057930e-01s
Theano validate time: 2.398491e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.601004e-02s
Import time 1.127982e-02s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.512s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
43.9% 43.9% 1.747s 2.18e-05s C 80000 4 theano.sandbox.cuda.basic_ops.GpuElemwise
27.1% 71.0% 1.077s 5.38e-05s C 20000 1 theano.sandbox.cuda.basic_ops.GpuJoin
16.5% 87.6% 0.657s 3.29e-05s C 20000 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
8.9% 96.5% 0.353s 1.76e-05s C 20000 1 theano.sandbox.cuda.basic_ops.HostFromGpu
2.7% 99.1% 0.107s 1.78e-06s C 60000 3 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.9% 100.0% 0.034s 1.71e-06s C 20000 1 theano.sandbox.cuda.basic_ops.GpuReshape
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
27.1% 27.1% 1.077s 5.38e-05s C 20000 1 GpuJoin
16.5% 43.6% 0.657s 3.29e-05s C 20000 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}
13.9% 57.5% 0.551s 2.75e-05s C 20000 1 GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}
13.8% 71.3% 0.549s 2.75e-05s C 20000 1 GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}
8.9% 80.2% 0.353s 1.76e-05s C 20000 1 HostFromGpu
8.2% 88.4% 0.327s 1.64e-05s C 20000 1 GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)]
8.0% 96.5% 0.320s 1.60e-05s C 20000 1 GpuElemwise{Mul}[(0, 1)]
1.9% 98.4% 0.076s 1.89e-06s C 40000 2 GpuSubtensor{:int64:}
0.9% 99.2% 0.034s 1.71e-06s C 20000 1 GpuReshape{1}
0.8% 100.0% 0.031s 1.56e-06s C 20000 1 GpuSubtensor{int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
27.1% 27.1% 1.077s 5.38e-05s 20000 7 GpuJoin(TensorConstant{0}, GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}.0, GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)].0)
16.5% 43.6% 0.657s 3.29e-05s 20000 0 GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}(<CudaNdarrayType(float32, vector)>, TensorConstant{(1,) of 2})
13.9% 57.5% 0.551s 2.75e-05s 20000 5 GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}(GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
13.8% 71.3% 0.549s 2.75e-05s 20000 3 GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}(CudaNdarrayConstant{[-2.]}, GpuSubtensor{:int64:}.0)
8.9% 80.2% 0.353s 1.76e-05s 20000 10 HostFromGpu(GpuReshape{1}.0)
8.2% 88.4% 0.327s 1.64e-05s 20000 6 GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)](GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
8.0% 96.5% 0.320s 1.60e-05s 20000 4 GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[ 6.28318548]}, GpuSubtensor{int64::}.0)
1.0% 97.4% 0.039s 1.97e-06s 20000 2 GpuSubtensor{:int64:}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
0.9% 98.4% 0.036s 1.81e-06s 20000 8 GpuSubtensor{:int64:}(GpuJoin.0, Constant{-1})
0.9% 99.2% 0.034s 1.71e-06s 20000 9 GpuReshape{1}(GpuSubtensor{:int64:}.0, TensorConstant{(1,) of 1})
0.8% 100.0% 0.031s 1.56e-06s 20000 1 GpuSubtensor{int64::}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Function profiling
==================
Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
Time in 60000 calls to Function.__call__: 2.878398e+01s
Time in Function.fn.__call__: 1.457369e+01s (50.631%)
Time in thunks: 1.243265e+01s (43.193%)
Total compile time: 8.806460e-01s
Number of Apply nodes: 6
Theano Optimizer time: 3.835380e-01s
Theano validate time: 4.009962e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 4.764414e-02s
Import time 1.896143e-02s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 46.516s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
47.3% 47.3% 5.877s 7.35e-05s C 80000 4 theano.sandbox.cuda.basic_ops.GpuFromHost
34.3% 81.6% 4.268s 2.67e-05s C 160000 8 theano.sandbox.cuda.basic_ops.GpuElemwise
8.7% 90.3% 1.077s 5.38e-05s C 20000 1 theano.sandbox.cuda.basic_ops.GpuJoin
5.3% 95.6% 0.657s 3.29e-05s C 20000 1 theano.sandbox.rng_mrg.GPU_mrg_uniform
2.8% 98.4% 0.353s 1.76e-05s C 20000 1 theano.sandbox.cuda.basic_ops.HostFromGpu
0.9% 99.3% 0.107s 1.78e-06s C 60000 3 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.5% 99.7% 0.059s 9.78e-07s C 60000 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.3% 100.0% 0.034s 1.71e-06s C 20000 1 theano.sandbox.cuda.basic_ops.GpuReshape
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
47.3% 47.3% 5.877s 7.35e-05s C 80000 4 GpuFromHost
8.7% 55.9% 1.077s 5.38e-05s C 20000 1 GpuJoin
8.2% 64.2% 1.022s 5.11e-05s C 20000 1 GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}
5.3% 69.4% 0.657s 3.29e-05s C 20000 1 GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}
4.5% 74.0% 0.560s 2.80e-05s C 20000 1 GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)]
4.5% 78.4% 0.556s 2.78e-05s C 20000 1 GpuElemwise{mul,no_inplace}
4.4% 82.8% 0.551s 2.75e-05s C 20000 1 GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}
4.4% 87.3% 0.549s 2.75e-05s C 20000 1 GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}
3.1% 90.3% 0.383s 1.92e-05s C 20000 1 GpuElemwise{Sqr}[(0, 0)]
2.8% 93.2% 0.353s 1.76e-05s C 20000 1 HostFromGpu
2.6% 95.8% 0.327s 1.64e-05s C 20000 1 GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)]
2.6% 98.4% 0.320s 1.60e-05s C 20000 1 GpuElemwise{Mul}[(0, 1)]
0.6% 99.0% 0.076s 1.89e-06s C 40000 2 GpuSubtensor{:int64:}
0.5% 99.5% 0.059s 9.78e-07s C 60000 3 GpuDimShuffle{x}
0.3% 99.7% 0.034s 1.71e-06s C 20000 1 GpuReshape{1}
0.3% 100.0% 0.031s 1.56e-06s C 20000 1 GpuSubtensor{int64::}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
33.3% 33.3% 4.145s 2.07e-04s 20000 1 GpuFromHost(<TensorType(float32, vector)>)
8.7% 42.0% 1.077s 5.38e-05s 20000 7 GpuJoin(TensorConstant{0}, GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}.0, GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)].0)
8.2% 50.2% 1.022s 5.11e-05s 20000 5 GpuElemwise{Composite{((((i0 - i1) + i2) % i3) - i2)},no_inplace}(<CudaNdarrayType(float32, vector)>, GpuDimShuffle{x}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{x}.0)
7.6% 57.8% 0.944s 4.72e-05s 20000 0 GpuFromHost(<TensorType(float32, scalar)>)
5.3% 63.1% 0.657s 3.29e-05s 20000 0 GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}(<CudaNdarrayType(float32, vector)>, TensorConstant{(1,) of 2})
4.5% 67.6% 0.560s 2.80e-05s 20000 4 GpuElemwise{Composite{exp(((i0 * sqr(i1)) / i2))}}[(0, 1)](CudaNdarrayConstant{[-0.5]}, GpuFromHost.0, GpuElemwise{Sqr}[(0, 0)].0)
4.5% 72.1% 0.556s 2.78e-05s 20000 4 GpuElemwise{mul,no_inplace}(CudaNdarrayConstant{[ 0.5]}, GpuDimShuffle{x}.0)
4.4% 76.5% 0.551s 2.75e-05s 20000 5 GpuElemwise{Composite{(i0 * cos(i1))},no_inplace}(GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
4.4% 80.9% 0.549s 2.75e-05s 20000 3 GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}(CudaNdarrayConstant{[-2.]}, GpuSubtensor{:int64:}.0)
3.9% 84.8% 0.487s 2.44e-05s 20000 1 GpuFromHost(phase)
3.1% 87.9% 0.383s 1.92e-05s 20000 3 GpuElemwise{Sqr}[(0, 0)](GpuDimShuffle{x}.0)
2.8% 90.8% 0.353s 1.76e-05s 20000 10 HostFromGpu(GpuReshape{1}.0)
2.6% 93.4% 0.327s 1.64e-05s 20000 6 GpuElemwise{Composite{(i0 * sin(i1))}}[(0, 0)](GpuElemwise{Composite{sqrt((i0 * log(i1)))},no_inplace}.0, GpuElemwise{Mul}[(0, 1)].0)
2.6% 96.0% 0.320s 1.60e-05s 20000 4 GpuElemwise{Mul}[(0, 1)](CudaNdarrayConstant{[ 6.28318548]}, GpuSubtensor{int64::}.0)
2.4% 98.4% 0.302s 1.51e-05s 20000 0 GpuFromHost(<TensorType(float32, scalar)>)
0.3% 98.7% 0.039s 1.97e-06s 20000 2 GpuSubtensor{:int64:}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
0.3% 99.0% 0.036s 1.81e-06s 20000 8 GpuSubtensor{:int64:}(GpuJoin.0, Constant{-1})
0.3% 99.3% 0.034s 1.71e-06s 20000 9 GpuReshape{1}(GpuSubtensor{:int64:}.0, TensorConstant{(1,) of 1})
0.3% 99.5% 0.031s 1.56e-06s 20000 1 GpuSubtensor{int64::}(GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1, Constant{1})
0.2% 99.7% 0.024s 1.19e-06s 20000 3 GpuDimShuffle{x}(GpuFromHost.0)
... (remaining 2 Apply instances account for 0.28%(0.03s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
Sorry, no tip for today.
Can someone help me make sense of this? Why does specifying that the result should not be transferred back improve things when evaluating a single function, but give no gain at all when evaluating the two functions? And where in this huge profiler output should I be looking to figure out what is going on?
Thanks a lot for any help.