How can I speed up sparse matrix by dense vector products (currently implemented via scipy.sparse.csc_matrix.dot) using CUDA?

Asked: 2018-02-27 22:45:46

Tags: python matrix cuda gpu sparse-matrix

My ultimate goal is to accelerate the computation of a matrix-vector product in Python using a CUDA-enabled GPU. The matrix A is about 15k x 15k and sparse (density ~0.05), and the vector x is 15k elements and dense; I am computing Ax. I have to perform this computation many times, so making it as fast as possible would be ideal.

My current non-GPU "optimization" is to represent A as a scipy.sparse.csc_matrix object and simply compute A.dot(x), but I was hoping to speed this up on a VM with a couple of NVIDIA GPUs attached, using only Python if possible (i.e., not writing out detailed kernel functions by hand). I've succeeded in accelerating dense matrix-vector products using the cudamat library, but not for the sparse case. There are a handful of suggestions for the sparse case online, such as using pycuda, scikit-cuda, or anaconda's accelerate package, but there's not a ton of information, so it's hard to know where to begin.

I don't need terribly detailed instructions, but if anyone has solved this problem before and could provide a "big picture" roadmap of the simplest way to do this, or has an idea of the sort of speedup a sparse GPU-based matrix-vector product would have over scipy's sparse algorithms, that would be very helpful.

4 Answers:

Answer 0 (score: 2):

As pointed out in the comments, NVIDIA ships the cuSPARSE library, which includes routines for sparse matrix products with dense vectors.

Numba now provides Python bindings for the cuSparse library via the pyculib package.

Answer 1 (score: 0):

Thanks for the suggestions.

I managed to get pyculib's csrmm (matrix multiplication for compressed sparse row formatted matrices) operation working using the following (using 2 NVIDIA K80 GPUs on Google Cloud Platform), but unfortunately wasn't able to achieve a speedup.

I assume this is because most of the time in the csrmm function is spent transferring data to/from the GPU, as opposed to actually doing the computations. Unfortunately, I couldn't find any straightforward pyculib way to get the arrays onto the GPU in the first place and keep them there over iterations. The code I used is:

import numpy as np
from scipy.sparse import csr_matrix
from pyculib.sparse import Sparse
from time import time


def spmv_cuda(a_sparse, b, sp, count):
    """Compute a_sparse x b."""

    # args to csrmm call
    trans_a = 'N'  # non-transpose, use 'T' for transpose or 'C' for conjugate transpose
    m = a_sparse.shape[0]  # num rows in a
    n = b.shape[1]  # num cols in b, c
    k = a_sparse.shape[1]  # num cols in a
    nnz = len(a_sparse.data)  # num nonzero in a
    alpha = 1  # no scaling
    descr_a = sp.matdescr(  # matrix descriptor
        indexbase=0,  # 0-based indexing
        matrixtype='G',  # 'general': no symmetry or triangular structure
    )
    csr_val_a = a_sparse.data  # csr data
    csr_row_ptr_a = a_sparse.indptr  # csr row pointers
    csr_col_ind_a = a_sparse.indices  # csr col idxs
    ldb = b.shape[0]
    beta = 0
    c = np.empty((m, n), dtype=a_sparse.dtype)
    ldc = c.shape[0]  # leading dimension of c (= m); equals b.shape[0] here only because a is square

    # call function
    tic = time()
    for ii in range(count):
        sp.csrmm(
            transA=trans_a,
            m=m,
            n=n,
            k=k,
            nnz=nnz,
            alpha=alpha,
            descrA=descr_a,
            csrValA=csr_val_a,
            csrRowPtrA=csr_row_ptr_a,
            csrColIndA=csr_col_ind_a,
            B=b,
            ldb=ldb,
            beta=beta,
            C=c,
            ldc=ldc)
    toc = time()

    return c, toc - tic

# run benchmark
COUNT = 20
N = 5000
P = 0.1

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(np.float32)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N, 1).astype(np.float32)

sp = Sparse()

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}'.format(c[:5, 0]))

# pyculib sparse
print('Testing pyculib sparse matrix multiplication...\n')
c, t = spmv_cuda(a_sparse, b, sp, COUNT)

print('pyculib sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}'.format(c[:5, 0]))

which produces the output:

Constructing objects...

scipy sparse matrix multiplication took 0.05158638954162598 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

Testing pyculib sparse matrix multiplication...

pyculib sparse matrix multiplication took 0.12598299980163574 seconds

c = [ 122.29483032  127.83659363  128.75003052  130.6912384   124.98326111]

As you can see, pyculib is more than twice as slow, even though the matrix multiplication runs on the GPU. Again, this is probably because of the overhead involved in transferring data to/from the GPU at each iteration.

An alternative solution I found, however, is to use Andreas Kloeckner's pycuda library, which yielded a 50x speedup!

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.sparse.packeted import PacketedSpMV
from pycuda.tools import DeviceMemoryPool
from scipy.sparse import csr_matrix
from time import time


def spmv_cuda(a_sparse, b, count):

    dtype = a_sparse.dtype
    m = a_sparse.shape[0]

    print('moving objects to GPU...')

    spmv = PacketedSpMV(a_sparse, is_symmetric=False, dtype=dtype)

    dev_pool = DeviceMemoryPool()
    d_b = gpuarray.to_gpu(b, dev_pool.allocate)
    d_c = gpuarray.zeros(m, dtype=dtype, allocator=d_b.allocator)

    print('executing spmv operation...\n')

    tic = time()
    for ii in range(count):
        d_c.fill(0)
        d_c = spmv(d_b, d_c)
    toc = time()

    return d_c.get(), toc - tic


# run benchmark
COUNT = 100
N = 5000
P = 0.1
DTYPE = np.float32

print('Constructing objects...\n\n')
np.random.seed(0)
a_dense = np.random.rand(N, N).astype(DTYPE)
a_dense[np.random.rand(N, N) >= P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N, 1).astype(DTYPE)

# numpy dense
tic = time()
for ii in range(COUNT):
    c = np.dot(a_dense, b)
toc = time()

print('numpy dense matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# scipy sparse
tic = time()
for ii in range(COUNT):
    c = a_sparse.dot(b)
toc = time()

print('scipy sparse matrix multiplication took {} seconds\n'.format(toc - tic))
print('c = {}\n'.format(c[:5, 0]))

# pycuda sparse
c, t = spmv_cuda(a_sparse, b, COUNT)
print('pycuda sparse matrix multiplication took {} seconds\n'.format(t))
print('c = {}\n'.format(c[:5]))

which produces this output:

numpy dense matrix multiplication took 0.2290663719177246 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

scipy sparse matrix multiplication took 0.24468040466308594 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

moving objects to GPU...
executing spmv operation...

pycuda sparse matrix multiplication took 0.004545450210571289 seconds

c = [ 122.29484558  127.83656311  128.75004578  130.69120789  124.98323059]

Note 1: pycuda requires the following dependencies:

  • pymetis: install with: pip install pymetis
  • nvcc: install with: sudo apt install nvidia-cuda-toolkit

Note 2: for some reason pip install pycuda fails to install the file pkt_build_cython.pyx, so you'll need to download/copy it yourself from https://github.com/inducer/pycuda/blob/master/pycuda/sparse/pkt_build_cython.pyx.

Answer 2 (score: 0):

Another solution is to use tensorflow's matrix multiplication functions. Once GPU-enabled tensorflow is up and running, these work out of the box.

After installing CUDA and tensorflow-gpu (a couple of involved but straightforward tutorials are here and here), you can use tensorflow's SparseTensor class and the sparse_tensor_dense_matmul function as follows:

import numpy as np
import tensorflow as tf
from tensorflow.python.client import device_lib
from time import time

First make sure the GPU is detected:

gpus = [x.name for x in device_lib.list_local_devices() if x.device_type == 'GPU']
print('GPU DEVICES:\n  {}'.format(gpus))

Output:

GPU DEVICES:
  ['/device:GPU:0']

Benchmark parameters:

from scipy.sparse import csr_matrix

ITERS = 30
N = 20000
P = 0.1  # matrix density

Using scipy:

np.random.seed(0)

a_dense = np.random.rand(N, N)
a_dense[a_dense > P] = 0
a_sparse = csr_matrix(a_dense)

b = np.random.rand(N)

tic = time()
for ii in range(ITERS):
    c = a_sparse.dot(b)
toc = time()

elapsed = toc - tic

print('Scipy spmv product took {} seconds per iteration.'.format(elapsed/ITERS))

Output:

Scipy spmv product took 0.06693172454833984 seconds per iteration.

Using GPU-enabled tensorflow:

with tf.device('/device:GPU:0'):

    np.random.seed(0)

    a_dense = np.random.rand(N, N)
    a_dense[a_dense > P] = 0

    indices = np.transpose(a_dense.nonzero())
    values = a_dense[indices[:, 0], indices[:, 1]]
    dense_shape = a_dense.shape

    a_sparse = tf.SparseTensor(indices, values, dense_shape)

    b = tf.constant(np.random.rand(N, 1))

    tic = time()
    for ii in range(ITERS):
        c = tf.sparse_tensor_dense_matmul(a_sparse, b)
    toc = time()

elapsed = toc - tic

print('GPU spmv product took {} seconds per iteration.'.format(elapsed/ITERS))

Output:

GPU spmv product took 0.0011811971664428711 seconds per iteration.

As it turns out, the speedup is quite substantial.

Answer 3 (score: 0):

One more alternative is to use the CuPy package. It has the same interface as numpy/scipy (which is very nice), and (for me at least) it was much easier to install than pycuda. The new code would look something like this:

import cupy as cp
from cupyx.scipy.sparse import csr_matrix as csr_gpu

A = some_sparse_matrix #(scipy.sparse.csr_matrix)
x = some_dense_vector  #(numpy.ndarray)

A_gpu = csr_gpu(A)  #moving A to the gpu
x_gpu = cp.array(x) #moving x to the gpu

for i in range(niter):
    x_gpu = A_gpu.dot(x_gpu)
x = cp.asnumpy(x_gpu) #back to numpy object for fast indexing