How to get AdvancedSubtensor on the GPU

Date: 2016-03-10 14:10:57

Tags: gpgpu theano theano-cuda

I have some subtensor which, for some reason, Theano cannot transfer to the GPU.

Some example code:

import numpy
import theano
import theano.printing
import theano.compile.io
import theano.compile.function_module
import theano.tensor as T
from theano.sandbox.cuda.basic_ops import as_cuda_ndarray_variable

n_copies, n_cells = 5, 10
P = T.constant(numpy.zeros((n_copies, n_cells), dtype="int32"))  # (n_copies,n_cells) -> list of indices

meminkey = T.fmatrix()  # (batch,n_cells)
meminkey = as_cuda_ndarray_variable(meminkey)
i_t = T.ones((meminkey.shape[0],))
batches = T.arange(0, i_t.shape[0]).dimshuffle(0, 'x', 'x')  # (batch,n_copies,n_cells)
P_bc = P.dimshuffle('x', 0, 1)  # (batch,n_copies,n_cells)
meminkeyP = meminkey[batches, P_bc]  # (batch,n_copies,n_cells)
meminkeyP = as_cuda_ndarray_variable(meminkeyP)

func = theano.function(inputs=[meminkey], outputs=[meminkeyP])
theano.printing.debugprint(func)

I added some as_cuda_ndarray_variable calls to make the problem more obvious, because in the output you can see the transfers GpuFromHost and HostFromGpu, which it would avoid if it could do the AdvancedSubtensor on the GPU. Output:

Using gpu device 0: GeForce GTX TITAN (CNMeM is disabled, CuDNN not available)
GpuFromHost [id A] ''   5
 |AdvancedSubtensor [id B] ''   4
   |HostFromGpu [id C] ''   1
   | |<CudaNdarrayType(float32, matrix)> [id D]
   |InplaceDimShuffle{0,x,x} [id E] ''   3
   | |ARange{dtype='int64'} [id F] ''   2
   |   |TensorConstant{0} [id G]
   |   |Shape_i{0} [id H] ''   0
   |   | |<CudaNdarrayType(float32, matrix)> [id D]
   |   |TensorConstant{1} [id I]
   |TensorConstant{[[[4 0 1 2..5 8 9 7]]]} [id J]

So, why is Theano not able to transform this into a GPU op?

Also, how can I rewrite the code so that Theano will do the computation on the GPU?

Related questions on Google Groups: here, here and here

1 Answer:

Answer 0 (score: 2)

OK, so in the Google Groups threads that I linked, it is explained quite well why it does not work. AdvancedSubtensor is the most general form, which works with all kinds of crazy indexing variants. Then there is AdvancedSubtensor1, which only works for a certain subset of them. There only exists a GPU version of AdvancedSubtensor1, not of AdvancedSubtensor. I did not fully understand the reason for that, but there is an ongoing discussion about it.
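To see the distinction, here is a minimal sketch of the two indexing forms (my own illustration; the op names are what I observe with this Theano version in debugprint): a single integer vector along the first axis gives AdvancedSubtensor1, while a tuple of broadcasted index arrays gives the general AdvancedSubtensor.

x = T.fmatrix("x")
idx = T.ivector("idx")
y1 = x[idx]       # single index vector -> AdvancedSubtensor1 (GPU version exists)
y2 = x[idx, idx]  # tuple of index arrays -> AdvancedSubtensor (no GPU version)
theano.printing.debugprint(theano.function([x, idx], [y1]))
theano.printing.debugprint(theano.function([x, idx], [y2]))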

AdvancedSubtensor1 can be used when there is a single list of indices. However, that is not the case in my example. The common workaround you see, e.g. in some other examples from those Google Groups threads, is to flatten the array first and to calculate the indices into the flattened array.

Most examples use some kind of nonzero() or so, where you also flatten the base arguments and then take the indices for the flattened version.
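As a rough sketch of that pattern (the variable names here are just illustrative, not from the linked threads): select entries of a matrix by some condition, flatten both the data and the condition, and index with a single vector, so the graph stays in AdvancedSubtensor1 territory.

x = T.fmatrix("x")
mask = T.gt(x, 0)                         # some condition on x
flat_idx = T.nonzero(mask.flatten())[0]   # single index vector into the flattened view
selected = x.flatten()[flat_idx]          # indexing with one vector -> AdvancedSubtensor1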

So, the question is: how can I apply this to my code?

Actually, there is a simpler solution which will use AdvancedSubtensor1, which I did not realize at first:

meminkeyP = meminkey[:, P]  # (batch,n_copies,n_cells)
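A quick NumPy check (just a sanity check, not part of the Theano graph) that this simpler indexing is equivalent to the broadcasted index tuple from the question:

a = numpy.arange(3 * 10, dtype="float32").reshape(3, 10)          # (batch, n_cells)
P_np = numpy.random.RandomState(0).randint(0, 10, size=(5, 10))   # (n_copies, n_cells)
batches_np = numpy.arange(3).reshape(3, 1, 1)
ref = a[batches_np, P_np[None, :, :]]  # broadcasted index tuple, shape (3, 5, 10)
alt = a[:, P_np]                       # same result with a single advanced index
assert numpy.array_equal(ref, alt)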

However, before I realized that, I came up with a generic solution which also works for other cases. I convert my index tuple (batches, P_bc) into indices of the flattened version. This is done with this function:

def indices_in_flatten_array(ndim, shape, *args):
  """
  We expect that all args can be broadcasted together.
  So, if we have some array A with ndim&shape as given,
  A[args] would give us a subtensor.
  We return the indices so that A[args].flatten()
  and A.flatten()[indices] are the same.
  """
  assert ndim > 0
  assert len(args) == ndim
  indices_per_axis = [args[i] for i in range(ndim)]
  # Scale each axis index by the product of the trailing dimensions
  # (the row-major strides), then sum the contributions, much like
  # numpy.ravel_multi_index does.
  for i in range(ndim):
    for j in range(i + 1, ndim):
      indices_per_axis[i] *= shape[j]
  indices = indices_per_axis[0]
  for i in range(1, ndim):
    indices += indices_per_axis[i]
  return indices
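
For the 2-D case this is just the usual row-major arithmetic flat_index = i * n_cols + j. A small NumPy illustration (my own) of what the function computes for ndim=2:

a = numpy.arange(12).reshape(3, 4)
rows = numpy.array([[0, 2], [1, 0]])
cols = numpy.array([[1, 3], [0, 2]])
flat = rows * a.shape[1] + cols  # same formula as indices_in_flatten_array for ndim=2
assert numpy.array_equal(a.flatten()[flat], a[rows, cols])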

Then I use it like this:

meminkeyP = meminkey.flatten()[indices_in_flatten_array(meminkey.ndim, meminkey.shape, batches, P_bc)]

This seems to work.

And I get this output:

Using gpu device 0: GeForce GTX TITAN (CNMeM is disabled, CuDNN not available)
GpuReshape{3} [id A] ''   11
 |GpuAdvancedSubtensor1 [id B] ''   10
 | |GpuReshape{1} [id C] ''   2
 | | |<CudaNdarrayType(float32, matrix)> [id D]
 | | |TensorConstant{(1,) of -1} [id E]
 | |Reshape{1} [id F] ''   9
 |   |Elemwise{second,no_inplace} [id G] ''   8
 |   | |TensorConstant{(1, 5, 10) of 0} [id H]
 |   | |Elemwise{Mul}[(0, 0)] [id I] ''   7
 |   |   |InplaceDimShuffle{0,x,x} [id J] ''   6
 |   |   | |ARange{dtype='int64'} [id K] ''   4
 |   |   |   |TensorConstant{0} [id L]
 |   |   |   |Shape_i{0} [id M] ''   0
 |   |   |   | |<CudaNdarrayType(float32, matrix)> [id D]
 |   |   |   |TensorConstant{1} [id N]
 |   |   |InplaceDimShuffle{x,x,x} [id O] ''   5
 |   |     |Shape_i{1} [id P] ''   1
 |   |       |<CudaNdarrayType(float32, matrix)> [id D]
 |   |TensorConstant{(1,) of -1} [id E]
 |MakeVector{dtype='int64'} [id Q] ''   3
   |Shape_i{0} [id M] ''   0
   |TensorConstant{5} [id R]
   |TensorConstant{10} [id S]

Small test case:

def test_indices_in_flatten_array():
  n_copies, n_cells = 5, 4
  n_complex_cells = n_cells // 2  # integer division, so indexing also works under Python 3
  n_batch = 3
  static_rng = numpy.random.RandomState(1234)
  def make_permut():
    p = numpy.zeros((n_copies, n_cells), dtype="int32")
    for i in range(n_copies):
      p[i, :n_complex_cells] = static_rng.permutation(n_complex_cells)
      # Same permutation for imaginary part.
      p[i, n_complex_cells:] = p[i, :n_complex_cells] + n_complex_cells
    return T.constant(p)
  P = make_permut()  # (n_copies,n_cells) -> list of indices

  meminkey = T.as_tensor_variable(static_rng.rand(n_batch, n_cells).astype("float32"))
  i_t = T.ones((meminkey.shape[0],))  # (batch,)
  n_batch = i_t.shape[0]
  batches = T.arange(0, n_batch).dimshuffle(0, 'x', 'x')  # (batch,n_copies,n_cells)
  P_bc = P.dimshuffle('x', 0, 1)  # (batch,n_copies,n_cells)
  meminkeyP1 = meminkey[batches, P_bc]  # (batch,n_copies,n_cells)
  meminkeyP2 = meminkey.flatten()[indices_in_flatten_array(meminkey.ndim, meminkey.shape, batches, P_bc)]

  numpy.testing.assert_allclose(meminkeyP1.eval(), meminkeyP2.eval())