How can I speed up sparse matrix by dense vector products (currently implemented via scipy.sparse.csc_matrix.dot) using CUDA?

Asked: 2018-02-27 22:45:46

Tags: python matrix cuda gpu sparse-matrix

My ultimate goal is to accelerate the computation of a matrix-vector product in Python using a CUDA-enabled GPU. The matrix A is about 15k x 15k and sparse (density ~0.05), and the vector x is 15k elements and dense; I am computing Ax. I have to perform this computation many times, so making it as fast as possible would be ideal.

My current non-GPU "optimization" is to represent A as a scipy.sparse.csc_matrix object and simply compute A.dot(x), but I was hoping to speed this up on a VM with a couple of NVIDIA GPUs attached, using only Python if possible (i.e., not writing out detailed kernel functions by hand). I've succeeded in accelerating dense matrix-vector products using the cudamat library, but not for the sparse case. There are a handful of suggestions for the sparse case online, such as using pycuda, scikit-cuda, or anaconda's accelerate package, but there's not a ton of information, so it's hard to know where to begin.

I don't need terribly detailed instructions, but if anyone has solved this problem before and could provide a "big picture" roadmap of the simplest way to do this, or has an idea of the sort of speedup a sparse GPU-based matrix-vector product would have over scipy's sparse algorithms, that would be very helpful.

4 Answers:

Answer 0 (score: 2):

As pointed out in the comments, NVIDIA ships the cuSPARSE library, which includes routines for sparse matrix products with dense vectors.

Numba now provides Python bindings for the cuSparse library via the pyculib package.

Answer 1 (score: 0):

Thanks for the suggestions.

I managed to get pyculib's csrmm (matrix multiplication for compressed sparse row formatted matrices) operation working using the following (using 2 NVIDIA K80 GPUs on Google Cloud Platform), but unfortunately wasn't able to achieve a speedup.

I assume this is because most of the time in the csrmm function is spent transferring data to/from the GPU, as opposed to actually doing the computations. Unfortunately, I couldn't find any straightforward pyculib way to get the arrays onto the GPU in the first place and keep them there over iterations. The code I used is:

import numpy as np
from scipy.sparse import csr_matrix
from pyculib.sparse import Sparse
from time import time


def spmv_cuda(a_sparse, b, sp, count):
    """Compute a_sparse x b."""

    # args to csrmm call
    trans_a = 'N'  # non-transpose, use 'T' for transpose or 'C' for conjugate transpose
    m = a_sparse.shape[0]  # num rows in a
    n = b.shape[1]  # num cols in b, c
    k = a_sparse.shape[1]  # num cols in a
    nnz = len(a_sparse.data)  # num nonzero in a
    alpha = 1  # no scaling
    descr_a = sp.matdescr(  # matrix descriptor
        indexbase=0,  # 0-based indexing
        matrixtype='G',  # 'general': no symmetry or triangular structure
    )
    csr_val_a = a_sparse.data  # csr data
    csr_row_ptr_a = a_sparse.indptr  # csr row pointers
    csr_col_ind_a = a_sparse.indices  # csr col idxs
    ldb = b.shape[0]
    beta = 0
    c = np.empty((m, n), dtype=a_sparse.dtype)
    ldc = c.shape[0]  # leading dimension of c (= m); equals b.shape[0] here only because a is square

    # call function
    tic = time()
    for ii in range(count):
        sp.csrmm(
            transA=trans_a,
            m=m,
            n=n,
            k=k,
            nnz=nnz,
            alpha=alpha,
            descrA=descr_a,
            csrValA=csr_val_a,
            csrRowPtrA=csr_row_ptr_a,
            csrColIndA=csr_col_ind_a,
            B=b,
            ldb=ldb,
            beta=beta,
            C=c,
            ldc=ldc)
    toc = time()

    return c, toc - tic

# run benchmark
COUNT = 20
N = 5000
P = 0.1

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(np.float32)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N, 1).astype(np.float32)

sp = Sparse()

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}'.format(c[:5, 0]))

# pyculib sparse
print('Testing pyculib sparse matrix multiplication...\n')
c, t = spmv_cuda(a_sparse, b, sp, COUNT)

print('pyculib sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}'.format(c[:5, 0]))

which produces the output:

Constructing objects...

scipy sparse matrix multiplication took 0.05158638954162598 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

Testing pyculib sparse matrix multiplication...

pyculib sparse matrix multiplication took 0.12598299980163574 seconds

c = [ 122.29483032  127.83659363  128.75003052  130.6912384   124.98326111]

As you can see, pyculib is more than twice as slow, even though the matrix multiplication runs on the GPU. Again, this is probably because of the overhead involved in transferring data to/from the GPU at each iteration.

An alternative solution I found, however, is to use Andreas Kloeckner's pycuda library, which yielded a 50x speedup!

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.sparse.packeted import PacketedSpMV
from pycuda.tools import DeviceMemoryPool
from scipy.sparse import csr_matrix
from time import time


def spmv_cuda(a_sparse, b, count):

    dtype = a_sparse.dtype
    m = a_sparse.shape[0]

    print('moving objects to GPU...')

    spmv = PacketedSpMV(a_sparse, is_symmetric=False, dtype=dtype)

    dev_pool = DeviceMemoryPool()
    d_b = gpuarray.to_gpu(b, dev_pool.allocate)
    d_c = gpuarray.zeros(m, dtype=dtype, allocator=d_b.allocator)

    print('executing spmv operation...\n')

    tic = time()
    for ii in range(count):
        d_c.fill(0)
        d_c = spmv(d_b, d_c)
    toc = time()

    return d_c.get(), toc - tic


# run benchmark
COUNT = 100
N = 5000
P = 0.1
DTYPE = np.float32

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(DTYPE)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N, 1).astype(DTYPE)

# numpy dense
tic = time()
for ii in range(COUNT):
    c = np.dot(a_dense, b)
toc = time()

print('numpy dense matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# pycuda sparse
c, t = spmv_cuda(a_sparse, b, COUNT)
print('pycuda sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}\n'.format(c[:5]))

which produces this output:

numpy dense matrix multiplication took 0.2290663719177246 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

scipy sparse matrix multiplication took 0.24468040466308594 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

moving objects to GPU...
executing spmv operation...

pycuda sparse matrix multiplication took 0.004545450210571289 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

Note 1: pycuda requires the following dependencies:

  • pymetis: install with: pip install pymetis
  • nvcc: install with: sudo apt install nvidia-cuda-toolkit

Note 2: for some reason pip install pycuda fails to install the file pkt_build_cython.pyx, so you'll need to download/copy it yourself from https://github.com/inducer/pycuda/blob/master/pycuda/sparse/pkt_build_cython.pyx.

Answer 2 (score: 0):

Another solution is to use tensorflow's matrix multiplication functions. Once GPU-enabled tensorflow is up and running, these work out of the box.

After installing CUDA and tensorflow-gpu (a couple of involved but straightforward tutorials are here and here), you can use tensorflow's SparseTensor class and the sparse_tensor_dense_matmul function as follows:

import numpy as np
import tensorflow as tf
from tensorflow.python.client import device_lib
from time import time

First make sure the GPU is detected:

gpus = [x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU']
print('GPU DEVICES:\n  {}'.format(gpus))

Output:

GPU DEVICES:
  ['/device:GPU:0']

Benchmark parameters:

from scipy.sparse import csr_matrix

ITERS = 30
N = 20000
P = 0.1  # matrix density

Using scipy:

np.random.seed(0)

a_dense = np.random.rand(N, N)
a_dense[a_dense > P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N)

tic = time()
for ii in range(ITERS):
    c = a_sparse.dot(b)
toc = time()

elapsed = toc - tic

print('Scipy spmv product took {} seconds per iteration.'.format(elapsed/ITERS))

Output:

Scipy spmv product took 0.06693172454833984 seconds per iteration.

Using GPU-enabled tensorflow:

with tf.device('/device:GPU:0'):

    np.random.seed(0)

    a_dense = np.random.rand(N, N)
    a_dense[a_dense > P] = 0

    indices = np.transpose(a_dense.nonzero())
    values = a_dense[indices[:, 0], indices[:, 1]]
    dense_shape = a_dense.shape

    a_sparse = tf.SparseTensor(indices, values, dense_shape)

    b = tf.constant(np.random.rand(N, 1))

    tic = time()
    for ii in range(ITERS):
        c = tf.sparse_tensor_dense_matmul(a_sparse, b)
    toc = time()

elapsed = toc - tic

print('GPU spmv product took {} seconds per iteration.'.format(elapsed/ITERS))

Output:

GPU spmv product took 0.0011811971664428711 seconds per iteration.

As it turns out, the speedup is quite substantial.

Answer 3 (score: 0):

One more alternative is to use the CuPy package. It has the same interface as numpy/scipy (which is very nice), and (for me at least) it was much easier to install than pycuda. The new code would look something like this:

import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu

A = some_sparse_matrix #(scipy.sparse.csr_matrix)
x = some_dense_vector  #(numpy.ndarray)

A_gpu = csr_gpu(A)  #moving A to the gpu
x_gpu = cp.array(x) #moving x to the gpu

for i in range(niter):
    x_gpu = A_gpu.dot(x_gpu)
x = cp.asnumpy(x_gpu) #back to numpy object for fast indexing