I have a subtensor operation which, for some reason, Theano cannot move to the GPU.
Some example code:
import numpy
import theano
import theano.printing
import theano.compile.io
import theano.compile.function_module
import theano.tensor as T
from theano.sandbox.cuda.basic_ops import as_cuda_ndarray_variable
n_copies, n_cells = 5, 10
P = T.constant(numpy.zeros((n_copies, n_cells), dtype="int32")) # (n_copies,n_cells) -> list of indices
meminkey = T.fmatrix() # (batch,n_cells)
meminkey = as_cuda_ndarray_variable(meminkey)
i_t = T.ones((meminkey.shape[0],))
batches = T.arange(0, i_t.shape[0]).dimshuffle(0, 'x', 'x') # (batch,n_copies,n_cells)
P_bc = P.dimshuffle('x', 0, 1) # (batch,n_copies,n_cells)
meminkeyP = meminkey[batches, P_bc] # (batch,n_copies,n_cells)
meminkeyP = as_cuda_ndarray_variable(meminkeyP)
func = theano.function(inputs=[meminkey], outputs=[meminkeyP])
theano.printing.debugprint(func)
I added some as_cuda_ndarray_variable calls to make the problem clearer, because in the output you can see the transfers GpuFromHost and HostFromGpu, which Theano would avoid if it could do the AdvancedSubtensor on the GPU. The output:
Using gpu device 0: GeForce GTX TITAN (CNMeM is disabled, CuDNN not available)
GpuFromHost [id A] '' 5
 |AdvancedSubtensor [id B] '' 4
   |HostFromGpu [id C] '' 1
   | |<CudaNdarrayType(float32, matrix)> [id D]
   |InplaceDimShuffle{0,x,x} [id E] '' 3
   | |ARange{dtype='int64'} [id F] '' 2
   |   |TensorConstant{0} [id G]
   |   |Shape_i{0} [id H] '' 0
   |   | |<CudaNdarrayType(float32, matrix)> [id D]
   |   |TensorConstant{1} [id I]
   |TensorConstant{[[[4 0 1 2..5 8 9 7]]]} [id J]
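As a side note, the same transfers can also be spotted programmatically by walking the optimized graph; a minimal sketch (not part of the original output), assuming func is the function compiled above:
# Sketch: list the ops in the optimized graph and flag host<->GPU transfers.
# `func` is the theano.function compiled in the snippet above.
for node in func.maker.fgraph.toposort():
    op_name = type(node.op).__name__
    marker = "  <-- transfer" if op_name in ("GpuFromHost", "HostFromGpu") else ""
    print(op_name + marker)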
So, why can't Theano convert this into a GPU op?
And how can I rewrite the code so that Theano will do the computation on the GPU?
Answer 0 (score: 2)
OK, so in the Google Groups thread I linked, it is explained quite well why this does not work. AdvancedSubtensor is the most generic form, which handles all the crazy kinds of indexing variants. Then there is AdvancedSubtensor1, which only handles a certain subset of them. Only a GPU version of AdvancedSubtensor1 exists, not of AdvancedSubtensor. I have not fully understood the reason, but there is an ongoing discussion about it.
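To make the distinction concrete, here is a small illustrative sketch (the variables are made up for this example) of which op each indexing pattern produces:
import theano
import theano.printing
import theano.tensor as T

x = T.fmatrix("x")

# A single integer vector indexing the first axis -> AdvancedSubtensor1,
# which has a GPU counterpart (GpuAdvancedSubtensor1).
idx = T.ivector("idx")
theano.printing.debugprint(x[idx])

# Several broadcasted integer index arrays at once -> the generic
# AdvancedSubtensor, which has no GPU version here.
rows = T.imatrix("rows")
cols = T.imatrix("cols")
theano.printing.debugprint(x[rows, cols])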
AdvancedSubtensor1 can be used when there is a single list of indices; in my example, however, that is not the case. The common workaround you see (in some other examples in that Google Groups thread) is to flatten the array first and to compute indices into the flattened array. Most examples use some kind of nonzero() or so, where you also flatten the base arguments and then take the indices of the flattened version.
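A rough sketch of that flatten-first pattern with nonzero(), assuming a hypothetical boolean mask of the same shape as the data:
import theano.tensor as T

x = T.fmatrix("x")
mask = T.bmatrix("mask")  # hypothetical boolean condition, same shape as x

# x[mask.nonzero()] would need two index vectors -> generic AdvancedSubtensor.
# Flattening both operands first leaves a single index vector,
# which maps to AdvancedSubtensor1 and can therefore stay on the GPU.
flat_idx = mask.flatten().nonzero()[0]
selected = x.flatten()[flat_idx]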
So, the question is: how do I apply this to my code?
Actually, there is a simpler solution which will use AdvancedSubtensor1, and which I did not realize at first:
meminkeyP = meminkey[:, P] # (batch,n_copies,n_cells)
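As a quick sanity check (plain numpy, mirroring the shapes above, not part of the Theano graph), both indexing forms select exactly the same elements:
import numpy

n_batch, n_copies, n_cells = 3, 5, 10
A = numpy.random.rand(n_batch, n_cells).astype("float32")
P = numpy.random.randint(0, n_cells, size=(n_copies, n_cells))

batches = numpy.arange(n_batch)[:, None, None]  # (batch,1,1)
P_bc = P[None, :, :]                            # (1,n_copies,n_cells)
# Both results have shape (batch,n_copies,n_cells).
assert numpy.array_equal(A[:, P], A[batches, P_bc])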
However, before I realized that, I came up with a generic solution which also works for other cases: I convert my index tuple (batches, P_bc) into indices into the flattened version. This is done with this function:
def indices_in_flatten_array(ndim, shape, *args):
    """
    We expect that all args can be broadcasted together.
    So, if we have some array A with ndim&shape as given,
    A[args] would give us a subtensor.
    We return the indices so that A[args].flatten()
    and A.flatten()[indices] are the same.
    """
    assert ndim > 0
    assert len(args) == ndim
    indices_per_axis = [args[i] for i in range(ndim)]
    for i in range(ndim):
        for j in range(i + 1, ndim):
            indices_per_axis[i] *= shape[j]
    indices = indices_per_axis[0]
    for i in range(1, ndim):
        indices += indices_per_axis[i]
    return indices
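The arithmetic in there is just row-major raveling; here is a small numpy check of the 2D case (illustrative only, the Theano graph itself does not call ravel_multi_index):
import numpy

shape = (3, 4)
A = numpy.arange(12).reshape(shape)
rows = numpy.array([0, 2])
cols = numpy.array([1, 3])

# For ndim == 2 the function above computes rows * shape[1] + cols.
flat = rows * shape[1] + cols
assert numpy.array_equal(flat, numpy.ravel_multi_index((rows, cols), shape))
assert numpy.array_equal(A[rows, cols], A.flatten()[flat])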
Then I use it like this:
meminkeyP = meminkey.flatten()[indices_in_flatten_array(meminkey.ndim, meminkey.shape, batches, P_bc)]
This seems to work. I get this output:
Using gpu device 0: GeForce GTX TITAN (CNMeM is disabled, CuDNN not available)
GpuReshape{3} [id A] '' 11
 |GpuAdvancedSubtensor1 [id B] '' 10
 | |GpuReshape{1} [id C] '' 2
 | | |<CudaNdarrayType(float32, matrix)> [id D]
 | | |TensorConstant{(1,) of -1} [id E]
 | |Reshape{1} [id F] '' 9
 |   |Elemwise{second,no_inplace} [id G] '' 8
 |   | |TensorConstant{(1, 5, 10) of 0} [id H]
 |   | |Elemwise{Mul}[(0, 0)] [id I] '' 7
 |   |   |InplaceDimShuffle{0,x,x} [id J] '' 6
 |   |   | |ARange{dtype='int64'} [id K] '' 4
 |   |   |   |TensorConstant{0} [id L]
 |   |   |   |Shape_i{0} [id M] '' 0
 |   |   |   | |<CudaNdarrayType(float32, matrix)> [id D]
 |   |   |   |TensorConstant{1} [id N]
 |   |   |InplaceDimShuffle{x,x,x} [id O] '' 5
 |   |     |Shape_i{1} [id P] '' 1
 |   |       |<CudaNdarrayType(float32, matrix)> [id D]
 |   |TensorConstant{(1,) of -1} [id E]
 |MakeVector{dtype='int64'} [id Q] '' 3
   |Shape_i{0} [id M] '' 0
   |TensorConstant{5} [id R]
   |TensorConstant{10} [id S]
A small test case:
def test_indices_in_flatten_array():
    n_copies, n_cells = 5, 4
    n_complex_cells = n_cells / 2
    n_batch = 3
    static_rng = numpy.random.RandomState(1234)
    def make_permut():
        p = numpy.zeros((n_copies, n_cells), dtype="int32")
        for i in range(n_copies):
            p[i, :n_complex_cells] = static_rng.permutation(n_complex_cells)
            # Same permutation for imaginary part.
            p[i, n_complex_cells:] = p[i, :n_complex_cells] + n_complex_cells
        return T.constant(p)
    P = make_permut()  # (n_copies,n_cells) -> list of indices
    meminkey = T.as_tensor_variable(static_rng.rand(n_batch, n_cells).astype("float32"))
    i_t = T.ones((meminkey.shape[0],))  # (batch,)
    n_batch = i_t.shape[0]
    batches = T.arange(0, n_batch).dimshuffle(0, 'x', 'x')  # (batch,n_copies,n_cells)
    P_bc = P.dimshuffle('x', 0, 1)  # (batch,n_copies,n_cells)
    meminkeyP1 = meminkey[batches, P_bc]  # (batch,n_copies,n_cells)
    meminkeyP2 = meminkey.flatten()[indices_in_flatten_array(meminkey.ndim, meminkey.shape, batches, P_bc)]
    numpy.testing.assert_allclose(meminkeyP1.eval(), meminkeyP2.eval())