Numpy: does matrix-vector multiplication not skip computation when some vector elements are equal to zero?

Time: 2016-02-09 01:01:35

Tags: python numpy matrix-multiplication blas

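I have recently been working on a project where most of my time is spent multiplying a dense matrix A by a sparse vector v (see here). While trying to reduce the computation, I noticed that the runtime of A.dot(v) is not affected by the number of zero entries in v.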

To explain why I would expect the runtime to improve in this case, let result = A.dot(v), so that result[i] = sum_j(A[i,j]*v[j]) for i = 1...A.shape[0]. If v[j] = 0, then clearly A[i,j]*v[j] = 0 regardless of the values in A[::,j]. In this case I would expect numpy to simply skip that term, but it seems to go ahead and compute the full sum_j(A[i,j]*v[j]) anyway.

I went ahead and wrote a short sample script to confirm this behavior below.

import time
import numpy as np

np.__config__.show()  # make sure BLAS/LAPACK is being used
np.random.seed(seed=0)
n_rows, n_cols = 100000, 1000  # must be integers for np.random.rand

# initialize matrix and vector
A = np.random.rand(n_rows, n_cols)
u = np.random.rand(n_cols)
u = np.require(u, dtype=A.dtype, requirements=['C'])

# time the product with the dense vector
# note: the count printed below is the number of zero entries in the vector
start_time = time.time()
A.dot(u)
print("time with %d non-zero entries: %1.5f seconds" % (sum(u == 0.0), (time.time() - start_time)))

# set all but one entry of u to zero (on a copy, so u itself stays dense)
v = u.copy()
set_to_zero = np.random.choice(np.array(range(0, u.shape[0])), size=(u.shape[0] - 1), replace=False)
v[set_to_zero] = 0.0

start_time = time.time()
A.dot(v)
print("time with %d non-zero entries: %1.5f seconds" % (sum(v == 0.0), (time.time() - start_time)))

# what I would really expect it to take
non_zero_index = np.squeeze(v != 0.0)
A_effective = A[::, non_zero_index]
v_effective = v[non_zero_index]

start_time = time.time()
A_effective.dot(v_effective)
print("expected time with %d non-zero entries: %1.5f seconds" % (sum(v == 0.0), (time.time() - start_time)))

Running this, I find that the runtime of the matrix-vector multiplication is the same regardless of whether I use the dense vector u or the sparse vector v:

time with 0 non-zero entries: 0.04279 seconds
time with 999 non-zero entries: 0.04050 seconds
expected time with 999 non-zero entries: 0.00466 seconds

I am wondering whether this is by design, or whether I am missing something in the way I run the matrix-vector multiplication. As sanity checks: I have made sure that numpy is linked to a BLAS library on my machine and that both arrays are C_CONTIGUOUS (since this is apparently required for numpy to call BLAS).

2 answers:

Answer 0: (score: 1)

How about trying a simple function like this?

def dot2(A,v):
    ind = np.where(v)[0]
    return np.dot(A[:,ind],v[ind])

In [352]: A=np.ones((100,100))

In [360]: timeit v=np.zeros((100,));v[::60]=1;dot2(A,v)
10000 loops, best of 3: 35.4 us per loop

In [362]: timeit v=np.zeros((100,));v[::40]=1;dot2(A,v)
10000 loops, best of 3: 40.1 us per loop

In [364]: timeit v=np.zeros((100,));v[::20]=1;dot2(A,v)
10000 loops, best of 3: 46.5 us per loop

In [365]: timeit v=np.zeros((100,));v[::60]=1;np.dot(A,v)
10000 loops, best of 3: 29.2 us per loop

In [366]: timeit v=np.zeros((100,));v[::20]=1;np.dot(A,v)
10000 loops, best of 3: 28.7 us per loop
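
Before comparing timings, it may be worth confirming that selecting the non-zero columns gives the same result as the full product. The following is a minimal sketch of such a check; the array sizes, the seed, and the variable names other than dot2 are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def dot2(A, v):
    # Same function as in the answer: multiply only the columns of A
    # that line up with non-zero entries of v.
    ind = np.where(v)[0]
    return np.dot(A[:, ind], v[ind])

rng = np.random.RandomState(0)   # fixed seed, chosen only for illustration
A = rng.rand(50, 100)
v = np.zeros(100)
v[::20] = rng.rand(5)            # fills indices 0, 20, 40, 60, 80

full = np.dot(A, v)              # ordinary dense product
reduced = dot2(A, v)             # product over the selected columns only
same = np.allclose(full, reduced)
print(same)
```

Note that A[:, ind] uses fancy indexing and therefore copies the selected columns, which is part of the overhead that dot2 pays on small arrays.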

A fully iterative Python implementation would be:

def dotit(A,v, test=False):
    n,m = A.shape  
    res = np.zeros(n)
    if test:
        for i in range(n):
            for j in range(m):
                if v[j]:
                    res[i] += A[i,j]*v[j]
    else:
        for i in range(n):
            for j in range(m):
                res[i] += A[i,j]*v[j]
    return res

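Obviously this won't be as fast as the compiled dot, but I expect the relative advantage of the test still applies. For further testing, you could implement it in cython.

Note that the v[j] test occurs deep in the iteration.

For a sparse v (3 of 100 elements), the test saves time: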

In [374]: timeit dotit(A,v,True)
100 loops, best of 3: 3.81 ms per loop

In [375]: timeit dotit(A,v,False)
10 loops, best of 3: 21.1 ms per loop

But if v is dense, the test costs time:

In [376]: timeit dotit(A,np.arange(100),False)
10 loops, best of 3: 22.7 ms per loop

In [377]: timeit dotit(A,np.arange(100),True)
10 loops, best of 3: 25.6 ms per loop
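
Since the v[j] test sits in the innermost loop of dotit, one possible variation is to loop over columns instead, so that the test runs once per entry of v rather than once per (i, j) pair. The sketch below is a hypothetical variant, not code from the answer; the name dot_by_column is made up for illustration, and no timings are claimed for it.

```python
import numpy as np

def dot_by_column(A, v):
    # Hypothetical variant (not from the answer): accumulate one column
    # of A at a time, skipping columns whose entry in v is zero.
    n, m = A.shape
    res = np.zeros(n)
    for j in range(m):
        if v[j]:                      # tested m times, not n*m times
            res += A[:, j] * v[j]     # vectorized update of all rows
    return res

A = np.ones((100, 100))
v = np.zeros(100)
v[::40] = 1                           # 3 non-zero entries: indices 0, 40, 80
out = dot_by_column(A, v)
```

With this loop order, a zero entry in v skips an entire column of work, which is the behavior the question hoped A.dot(v) would have.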

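Answer 1: (score: 0)

For plain arrays, Numpy does not perform this kind of optimization, but if you need it, you can use sparse matrices, which may improve the dot product timings. For more information on the topic, see: http://docs.scipy.org/doc/scipy/reference/sparse.html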
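
As a rough illustration of the sparse-matrix suggestion, the sketch below stores v as a 1-row scipy.sparse CSR matrix and computes the product through the identity (A v)^T = v^T A^T. The array sizes and values are made-up assumptions, and whether this is actually faster than A.dot(v) depends on the sizes involved and on conversion overhead; it has not been measured here.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)        # fixed seed, chosen only for illustration
A = rng.rand(200, 1000)               # dense matrix
v = np.zeros(1000)
v[[3, 500, 999]] = [1.5, -2.0, 0.25]  # only 3 non-zero entries

# A 1-D dense array becomes a sparse matrix of shape (1, 1000)
# that stores only the non-zero values.
v_row = sparse.csr_matrix(v)

# sparse (1 x 1000) times dense (1000 x 200) yields a dense (1 x 200) array;
# flattening it gives the same vector as A.dot(v).
result = np.asarray(v_row.dot(A.T)).ravel()
```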