Question

I am looking into the possibility of using the GPU to speed up the computation of bincount.

Reference code in numpy:

x_new = numpy.random.randint(0, 1000, 1000000)
%timeit numpy.bincount(x_new)
100 loops, best of 3: 2.33 ms per loop

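I only want to measure the speed of the operation itself, not the time spent transferring the array, so I created a shared variable:

import numpy, theano
import theano.tensor as T
x = theano.shared(numpy.random.randint(0, 1000, 1000000))
theano_bincount = theano.function([], T.extra_ops.bincount(x))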

This operation is of course highly parallelizable, but in practice this code runs slower on the GPU than the CPU version:

%timeit theano_bincount()
10 loops, best of 3: 25.7 ms per loop

So my questions are:

What is the reason for such low performance?
Can I write a parallel version of bincount using theano?

Answer 1

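I don't think you can speed this operation up any further on the GPU unless you can somehow manually tell Theano to do it in parallel, which does not seem to be possible. On the GPU, computations that are not done in parallel run at the same speed as on the CPU, or slower.

Quote from Daniel Renshaw:

To an extent, Theano expects you to focus more on what you want computed than on how you want it computed. The idea is that the Theano optimizing compiler will automatically parallelize as much as possible (either on the GPU or on the CPU using OpenMP).

Another quote:

You need to be able to specify your computation in terms of Theano operations. If those operations can be parallelized on the GPU, they should be parallelized automatically.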
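To make the second quote concrete, here is a minimal NumPy sketch of bincount written only in terms of an elementwise comparison and a sum along one axis, which are the kinds of operations a GPU backend can parallelize. The function name, the chunk_size parameter and the test data are my own assumptions for illustration; the presumed Theano equivalent would use T.eq and .sum(axis=0), and I have not measured whether that is actually faster than T.extra_ops.bincount.

```python
import numpy as np

def bincount_by_comparison(values, num_bins, chunk_size=4096):
    """Count occurrences of 0..num_bins-1 using only compare and sum.

    Hypothetical helper for illustration; it is not part of NumPy or Theano.
    """
    bins = np.arange(num_bins)
    counts = np.zeros(num_bins, dtype=np.int64)
    # Work in chunks so the temporary boolean matrix stays small
    for start in range(0, values.shape[0], chunk_size):
        chunk = values[start:start + chunk_size]
        # (chunk_len, num_bins) matrix: True where a value equals a bin index
        matches = chunk[:, None] == bins[None, :]
        # Summing each column gives the number of hits per bin for this chunk
        counts += matches.sum(axis=0)
    return counts

rng = np.random.RandomState(0)
x = rng.randint(0, 1000, 100000)
# The result agrees with the built-in implementation
assert np.array_equal(bincount_by_comparison(x, 1000),
                      np.bincount(x, minlength=1000))
```

The trade-off is that this does num_bins comparisons per element instead of a single increment, so it only pays off if the backend really runs the comparison and the reduction in parallel.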

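Quotes from the Theano web page:

Indexing, dimension-shuffling and constant-time reshaping will be equally fast on the GPU as on the CPU.

Summation over rows/columns of tensors can be a little slower on the GPU than on the CPU.

I think the only thing you can do is set the openmp flag to True in your .theanorc file.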
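As a sketch of what that setting looks like, the snippet below uses Python's configparser to build the text of a .theanorc file with openmp enabled. I am assuming the option belongs in the [global] section, as the other top-level Theano flags do; the snippet only prints the text and does not touch your real configuration file.

```python
import configparser
import io

# Build the configuration in memory
config = configparser.ConfigParser()
config["global"] = {"openmp": "True"}  # assumption: openmp is a [global] option

buffer = io.StringIO()
config.write(buffer)
theanorc_text = buffer.getvalue()

# Copy this text into ~/.theanorc (or merge it with an existing [global] section)
print(theanorc_text)
```

This only affects ops that have an OpenMP implementation on the CPU, so it may not change the bincount timing at all.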

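Anyway, I tried an idea. It does not work for now, but hopefully someone can help us make it work. If it worked, you could parallelize the operation on the GPU. The code below tries to do everything on the GPU using the CUDA API. However, two obstacles prevent the operation from taking place: 1) currently (as of January 4, 2016) Theano and CUDA do not support operations on any data type other than float32, and 2) T.extra_ops.bincount() only works with int data types. So this may be what keeps Theano from fully parallelizing the operation.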

import theano.tensor as T
from theano import shared, Out, function
import numpy as np
import theano.sandbox.cuda.basic_ops as sbasic

shared_var = shared(np.random.randint(0, 1000, 1000000).astype(T.config.floatX), borrow = True)
x = T.vector('x')
computeFunc = T.extra_ops.bincount(sbasic.as_cuda_ndarray_variable(T.cast(x, 'int16')))
func = function([], Out(sbasic.gpu_from_host(computeFunc), borrow = True), givens = {x: shared_var})
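The dtype conflict described above can be seen in plain NumPy as well. This is only a small illustration with made-up values: numpy.bincount rejects a float32 array, while the same values cast to an integer type are accepted. Integers of this size are represented exactly in float32, so the cast loses nothing.

```python
import numpy as np

# Integer-valued data stored as float32, as a float32-only GPU path would hold it
values_f32 = np.array([0, 1, 1, 3, 999], dtype=np.float32)

try:
    np.bincount(values_f32)
    float_input_accepted = True
except TypeError:
    # NumPy will not implicitly cast floats to integers for bincount
    float_input_accepted = False

# Casting explicitly works; int16 is enough for values below 1000
counts = np.bincount(values_f32.astype(np.int16))
print(float_input_accepted, counts[1], len(counts))
```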

Sources

1- How do I set many elements in parallel in theano

2- http://deeplearning.net/software/theano/tutorial/using_gpu.html#what-can-be-accelerated-on-the-gpu

3- http://deeplearning.net/software/theano/tutorial/multi_cores.html
