Question

我正在尝试在多层贝叶斯模型中计算矩阵矩阵乘积。数据和模型变量的形状如下：

X_train: (n_samples, n_predictors)
alpha: (n_groups, )
beta: (n_predictors, n_groups)

X_train是密集的。 alpha是组截距，beta是斜率（同样，每个组的变量）。样本分为几组；数组group_index（大小为(n_samples,)）表示每个样本所属的组。简而言之，线性模型是

y[n] = alpha[group_index[n]] + < X_train[n, 1:K], beta[1:K, group_index[n]] >

对于所有n = 1 ... n_samples，其中< , >表示内部乘积。

这是在Python中实现的方式：

# an element-wise product between two matrices of size [n_samples, n_predictors]
y = alpha[group_index] + (X_train * beta[:, group_index].T).sum(axis=-1)

问题：能否使此实现的内存/ CPU效率更高？

这里n_samples最多可以运行一百万，而n_predictors和n_groups大约有100个左右。我正在寻找的是一种矢量化的公式，该公式（1）所需的存储空间较低，而（2）的运行速度与我上面提出的速度差不多。

Answer 1

如果将X_train重塑为(N_samples,1,N_predictors)，将beta重塑为(1,N_predictors,N_groups)，则可以计算它们之间的矩阵乘法，得到(N_samples,1,1)，可以在添加之前将其展平到alpha[group_index]。

X_train_r = np.expand_dims(X_train,1)
beta_r = np.expand_dims(beta,0)
y = alpha[group_index] + (X_train_r @ beta_r[:,:, group_index].T).ravel()

它的计算速度也应该快将近两倍。

Answer 2

另一种尝试是：

alpha[group_index] + np.einsum('ij,ji->i', X_train_r, beta[:, group_index])

通过一个小例子进行测试：

In [192]: A = np.arange(10)                                                              
In [193]: B = np.arange(50).reshape(5,10)                                                
In [194]: X = np.arange(50).reshape(10,5)                                                

In [195]: A + (X * B.T).sum(axis=-1)                                                     
Out[195]: array([ 300,  836, 1422, 2058, 2744, 3480, 4266, 5102, 5988, 6924])
In [196]: A + np.einsum('ij,ji->i',X,B)                                                  
Out[196]: array([ 300,  836, 1422, 2058, 2744, 3480, 4266, 5102, 5988, 6924])

和时间：

In [197]: timeit A + (X * B.T).sum(axis=-1)                                              
14.1 µs ± 68 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [198]: timeit A + np.einsum('ij,ji->i',X,B)                                           
9.66 µs ± 9.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

以及Mercury的解决方案：

In [208]: A + (X[:,None,:]@B.T[:,:,None]).ravel()                                        
Out[208]: array([ 300,  836, 1422, 2058, 2744, 3480, 4266, 5102, 5988, 6924])
In [209]: timeit A + (X[:,None,:]@B.T[:,:,None]).ravel()                                 
7.3 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

matmul针对矩阵产品的批处理进行了优化，将数组传递到快速数学库（BLAS等）。

高效的矩阵矩阵产品

2 个答案: