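I wrote a simple Python script (using Theano) that performs linear regression and should run on the GPU. When the code starts it prints "Using gpu device", but according to the profiler all operations are CPU-specific (Elemwise rather than GpuElemwise, no GpuFromHost, and so on).
I checked the variables and THEANO_FLAGS and everything looks correct. I can't see the catch, especially since the Theano tutorials run correctly on the GPU with the same settings. :)
Here is the code: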
# linear regression
import numpy
import theano
import theano.tensor as T
input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])
TS = theano.shared(input_data, "training-set")
E = theano.shared(output_data, "expected")
W1 = theano.shared(numpy.zeros((1, 2)))
O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)
for i in range(1000):
    train()
- THEANO_FLAGS = cuda.root=/usr/local/cuda
- device=gpu
- floatX=float32
- lib.cnmem=0.5
- profile=True
- CUDA_LAUNCH_BLOCKING = 1
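For reference, Theano reads THEANO_FLAGS as one comma-separated list of key=value pairs. The sketch below assembles the settings listed above into a single string and exports it from Python, which has to happen before `import theano`. The names `flags` and `theano_flags` are illustrative, and this is only one possible way to apply the settings.

```python
import os

# The settings from the list above, in the same order.
flags = {
    "cuda.root": "/usr/local/cuda",
    "device": "gpu",
    "floatX": "float32",
    "lib.cnmem": "0.5",
    "profile": "True",
}

# THEANO_FLAGS is a single comma-separated string of key=value pairs.
theano_flags = ",".join("{}={}".format(key, value) for key, value in flags.items())

# Both variables must be set before Theano is imported for them to take effect.
os.environ["THEANO_FLAGS"] = theano_flags
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(theano_flags)
```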
Output:
Using gpu device 0: GeForce GT 650M (CNMeM is enabled)
Function profiling
==================
Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
Time in 1000 calls to Function.__call__: 3.348637e-02s
Time in Function.fn.__call__: 2.419019e-02s (72.239%)
Time in thunks: 1.839781e-02s (54.941%)
Total compile time: 1.350801e-01s
Number of Apply nodes: 18
Theano Optimizer time: 1.101730e-01s
Theano validate time: 2.029657e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
Import time 2.320528e-03s
Time in all call to theano.grad() 8.740902e-03s
Time since theano import 0.881s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
71.7% 71.7% 0.013s 6.59e-06s Py 2000 2 theano.tensor.basic.Dot
12.3% 83.9% 0.002s 3.22e-07s C 7000 7 theano.tensor.elemwise.Elemwise
5.7% 89.6% 0.001s 3.50e-07s C 3000 3 theano.tensor.elemwise.DimShuffle
4.0% 93.6% 0.001s 3.65e-07s C 2000 2 theano.tensor.subtensor.Subtensor
3.6% 97.2% 0.001s 3.31e-07s C 2000 2 theano.compile.ops.Shape_i
1.7% 98.9% 0.000s 3.06e-07s C 1000 1 theano.tensor.opt.MakeVector
1.1% 100.0% 0.000s 2.10e-07s C 1000 1 theano.tensor.elemwise.Sum
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
71.7% 71.7% 0.013s 6.59e-06s Py 2000 2 dot
4.0% 75.6% 0.001s 3.65e-07s C 2000 2 Subtensor{int64}
3.5% 79.1% 0.001s 6.35e-07s C 1000 1 InplaceDimShuffle{1,0}
3.3% 82.4% 0.001s 6.06e-07s C 1000 1 Elemwise{mul,no_inplace}
2.4% 84.8% 0.000s 4.38e-07s C 1000 1 Shape_i{0}
2.3% 87.1% 0.000s 4.29e-07s C 1000 1 Elemwise{Composite{((i0 * i1) / i2)}}
2.3% 89.3% 0.000s 2.08e-07s C 2000 2 InplaceDimShuffle{x,x}
1.8% 91.1% 0.000s 3.25e-07s C 1000 1 Elemwise{Cast{float64}}
1.7% 92.8% 0.000s 3.06e-07s C 1000 1 MakeVector{dtype='int64'}
1.5% 94.3% 0.000s 2.78e-07s C 1000 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
1.4% 95.7% 0.000s 2.53e-07s C 1000 1 Elemwise{Sub}[(0, 1)]
1.2% 96.9% 0.000s 2.24e-07s C 1000 1 Shape_i{1}
1.1% 98.0% 0.000s 2.10e-07s C 1000 1 Sum{acc_dtype=float64}
1.1% 99.1% 0.000s 1.98e-07s C 1000 1 Elemwise{Sqr}[(0, 0)]
0.9% 100.0% 0.000s 1.66e-07s C 1000 1 Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
37.8% 37.8% 0.007s 6.95e-06s 1000 3 dot(<TensorType(float64, matrix)>, training-set.T)
33.9% 71.7% 0.006s 6.24e-06s 1000 14 dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
3.5% 75.1% 0.001s 6.35e-07s 1000 0 InplaceDimShuffle{1,0}(training-set)
3.3% 78.4% 0.001s 6.06e-07s 1000 11 Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
3.0% 81.4% 0.001s 5.58e-07s 1000 8 Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
2.4% 83.8% 0.000s 4.38e-07s 1000 2 Shape_i{0}(expected)
2.3% 86.2% 0.000s 4.29e-07s 1000 12 Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{Sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
1.8% 87.9% 0.000s 3.25e-07s 1000 6 Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
1.7% 89.6% 0.000s 3.06e-07s 1000 4 MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
1.6% 91.2% 0.000s 3.03e-07s 1000 10 InplaceDimShuffle{x,x}(Subtensor{int64}.0)
1.5% 92.7% 0.000s 2.78e-07s 1000 16 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
1.4% 94.1% 0.000s 2.53e-07s 1000 5 Elemwise{Sub}[(0, 1)](expected, dot.0)
1.2% 95.3% 0.000s 2.24e-07s 1000 1 Shape_i{1}(expected)
1.1% 96.5% 0.000s 2.10e-07s 1000 15 Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
1.1% 97.6% 0.000s 1.98e-07s 1000 13 Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 1)].0)
0.9% 98.5% 0.000s 1.72e-07s 1000 7 Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
0.9% 99.4% 0.000s 1.66e-07s 1000 17 Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
0.6% 100.0% 0.000s 1.13e-07s 1000 9 InplaceDimShuffle{x,x}(Subtensor{int64}.0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
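Answer 0 (score: 2)
As mentioned in the comments, although you have set the allow_input_downcast parameter to True, you also need to make sure that all the data you assign to shared variables is float32. As of Jan. 06, 2016, Theano still cannot use any data type other than float32 for computation on the GPU, as described in more detail here. You therefore have to cast your data to 'float32'.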
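To see why the cast is needed, it can help to look at the dtypes NumPy picks by default. The sketch below uses plain `numpy.array` instead of `numpy.matrix` and a hypothetical constant `FLOAT_X` as a stand-in for `theano.config.floatX`; it only inspects dtypes and does not touch Theano.

```python
import numpy

# Same values as in the question, without an explicit dtype.
input_data = numpy.array([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
weights = numpy.zeros((1, 2))

print(input_data.dtype)  # an integer dtype, because all literals are ints
print(weights.dtype)     # float64, the default for numpy.zeros

# FLOAT_X is an illustrative stand-in for theano.config.floatX.
FLOAT_X = "float32"

# Cast existing data, or request the dtype when creating the array.
input_f32 = input_data.astype(FLOAT_X)
weights_f32 = numpy.zeros((1, 2), dtype=FLOAT_X)

print(input_f32.dtype, weights_f32.dtype)
```

This matches the question's profile, where the weight matrix shows up as `TensorType(float64, matrix)` and no Gpu op is selected.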
So this is the code you should use:
import numpy
import theano
import theano.tensor as T
input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])
TS = theano.shared(input_data.astype('float32'), "training-set")
E = theano.shared(output_data.astype('float32'), "expected")
W1 = theano.shared(numpy.zeros((1, 2), dtype = 'float32'))
O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True, profile = True)
for i in range(1000):
    train()
train.profile.print_summary()
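As a sanity check on what `train()` computes, here is a plain-Python sketch of a single update step, assuming the cost is the mean squared error used in the script. The helper names `predict`, `cost` and `gradient` are illustrative. The gradient is written as -2/N times the residuals multiplied by the training set, which appears to correspond to the `Composite{((i0 * i1) / i2)}` node with the constant -2.0 followed by a dot with `training-set` in the profile.

```python
# Same data as in the script, as plain Python lists.
X = [[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]]
y = [1600, 2100, 1400, 2500, 3200]
w = [0.0, 0.0]
learning_rate = 0.0001


def predict(w, row):
    # Linear model without a bias term, like T.dot(TS, W1.T).
    return sum(wi * xi for wi, xi in zip(w, row))


def cost(w):
    # Mean squared error, like T.mean(T.sqr(E - O.T)).
    return sum((yi - predict(w, row)) ** 2 for row, yi in zip(X, y)) / len(y)


def gradient(w):
    # d(cost)/dw_j = -2/N * sum_i (y_i - prediction_i) * x_ij
    n = len(y)
    residuals = [yi - predict(w, row) for row, yi in zip(X, y)]
    return [
        -2.0 / n * sum(r * row[j] for r, row in zip(residuals, X))
        for j in range(len(w))
    ]


cost_before = cost(w)
grad = gradient(w)
# One step of the update rule W1 <- W1 - gradient * 0.0001.
w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
cost_after = cost(w)

print(cost_before, grad, w, cost_after)
```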
Profiling the float32 script gives:
Message: LearnTheano.py:18
Time in 1000 calls to Function.__call__: 2.642968e-01s
Time in Function.fn.__call__: 2.460811e-01s (93.108%)
Time in thunks: 1.877530e-01s (71.039%)
Total compile time: 2.483290e+01s
Number of Apply nodes: 17
Theano Optimizer time: 2.818849e-01s
Theano validate time: 3.435850e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
Import time 1.241469e-02s
Time in all call to theano.grad() 1.206994e-02s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
34.8% 34.8% 0.065s 3.27e-05s C 2000 2 theano.sandbox.cuda.blas.GpuGemm
28.8% 63.5% 0.054s 1.80e-05s C 3000 3 theano.sandbox.cuda.basic_ops.GpuElemwise
12.9% 76.4% 0.024s 2.42e-05s C 1000 1 theano.sandbox.cuda.basic_ops.GpuCAReduce
10.3% 86.7% 0.019s 1.93e-05s C 1000 1 theano.sandbox.cuda.basic_ops.GpuFromHost
7.2% 93.9% 0.014s 1.36e-05s C 1000 1 theano.sandbox.cuda.basic_ops.HostFromGpu
1.8% 95.7% 0.003s 1.13e-06s C 3000 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.5% 97.2% 0.003s 2.81e-06s C 1000 1 theano.tensor.elemwise.Elemwise
1.1% 98.4% 0.002s 1.08e-06s C 2000 2 theano.compile.ops.Shape_i
1.1% 99.5% 0.002s 1.02e-06s C 2000 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.5% 100.0% 0.001s 9.96e-07s C 1000 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
25.3% 25.3% 0.047s 4.74e-05s C 1000 1 GpuGemm{no_inplace}
12.9% 38.1% 0.024s 2.42e-05s C 1000 1 GpuCAReduce{pre=sqr,red=add}{1,1}
12.8% 51.0% 0.024s 2.41e-05s C 1000 1 GpuElemwise{mul,no_inplace}
10.3% 61.3% 0.019s 1.93e-05s C 1000 1 GpuFromHost
9.5% 70.8% 0.018s 1.79e-05s C 1000 1 GpuGemm{inplace}
8.2% 79.0% 0.015s 1.55e-05s C 1000 1 GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
7.7% 86.7% 0.014s 1.44e-05s C 1000 1 GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
7.2% 93.9% 0.014s 1.36e-05s C 1000 1 HostFromGpu
1.5% 95.4% 0.003s 2.81e-06s C 1000 1 Elemwise{Cast{float32}}
1.1% 96.5% 0.002s 1.02e-06s C 2000 2 GpuSubtensor{int64}
1.0% 97.5% 0.002s 9.00e-07s C 2000 2 GpuDimShuffle{x,x}
0.8% 98.3% 0.002s 1.59e-06s C 1000 1 GpuDimShuffle{1,0}
0.7% 99.1% 0.001s 1.38e-06s C 1000 1 Shape_i{0}
0.5% 99.6% 0.001s 9.96e-07s C 1000 1 MakeVector
0.4% 100.0% 0.001s 7.76e-07s C 1000 1 Shape_i{1}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
25.3% 25.3% 0.047s 4.74e-05s 1000 3 GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
12.9% 38.1% 0.024s 2.42e-05s 1000 5 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
12.8% 51.0% 0.024s 2.41e-05s 1000 13 GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
10.3% 61.3% 0.019s 1.93e-05s 1000 7 GpuFromHost(Elemwise{Cast{float32}}.0)
9.5% 70.8% 0.018s 1.79e-05s 1000 16 GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
8.2% 79.0% 0.015s 1.55e-05s 1000 12 GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
7.7% 86.7% 0.014s 1.44e-05s 1000 15 GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
7.2% 93.9% 0.014s 1.36e-05s 1000 14 HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
1.5% 95.4% 0.003s 2.81e-06s 1000 6 Elemwise{Cast{float32}}(MakeVector.0)
0.8% 96.3% 0.002s 1.59e-06s 1000 0 GpuDimShuffle{1,0}(training-set)
0.7% 97.0% 0.001s 1.38e-06s 1000 2 Shape_i{0}(expected)
0.7% 97.7% 0.001s 1.30e-06s 1000 8 GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
0.6% 98.3% 0.001s 1.08e-06s 1000 11 GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
0.5% 98.8% 0.001s 9.96e-07s 1000 4 MakeVector(Shape_i{0}.0, Shape_i{1}.0)
0.4% 99.2% 0.001s 7.76e-07s 1000 1 Shape_i{1}(expected)
0.4% 99.6% 0.001s 7.40e-07s 1000 9 GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
0.4% 100.0% 0.001s 7.25e-07s 1000 10 GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
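A quick way to compare the two profiles is to scan the "Ops" section for op names containing "Gpu". The sketch below is a hypothetical helper, not part of Theano, and assumes the eight-column layout shown in the summaries above. The sample lines are copied from the two profiles.

```python
def gpu_ops(op_lines):
    """Return the op names containing 'Gpu' from profiler 'Ops' rows."""
    names = []
    for line in op_lines:
        # Columns: % time, sum %, apply time, time per call,
        # type, #call, #apply, op name (which may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        name = fields[7]
        if "Gpu" in name:
            names.append(name)
    return names


# Rows copied from the question's profile (float64 data).
question_rows = [
    "71.7% 71.7% 0.013s 6.59e-06s Py 2000 2 dot",
    "2.3% 87.1% 0.000s 4.29e-07s C 1000 1 Elemwise{Composite{((i0 * i1) / i2)}}",
]

# Rows copied from the answer's profile (float32 data).
answer_rows = [
    "25.3% 25.3% 0.047s 4.74e-05s C 1000 1 GpuGemm{no_inplace}",
    "10.3% 61.3% 0.019s 1.93e-05s C 1000 1 GpuFromHost",
    "1.5% 95.4% 0.003s 2.81e-06s C 1000 1 Elemwise{Cast{float32}}",
]

print(gpu_ops(question_rows))
print(gpu_ops(answer_rows))
```

An empty result for the first list and a non-empty one for the second reflects the difference between the two runs: only the float32 version was moved to the GPU.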