Matrix / Tensor Triple Product?

Date: 2015-05-13 05:20:35

Tags: matlab numpy matrix matrix-multiplication blas

An algorithm I'm working on needs to compute a kind of matrix triple product in several places.

The operation takes three square matrices of identical dimensions and produces a 3-index tensor. Labeling the operands A, B, C, the (i,j,k)-th element of the result is

X[i,j,k] = \sum_a A[i,a] B[a,j] C[k,a]

In numpy, you can compute this with einsum('ia,aj,ka->ijk', A, B, C).
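
To pin down the indexing, here is a naive reference implementation (a throwaway sketch; triple_prod_reference is just an illustrative name):

import numpy as np

def triple_prod_reference(A, B, C):
    # Naive O(n^4) reference for X[i,j,k] = \sum_a A[i,a] B[a,j] C[k,a]
    n = A.shape[0]
    X = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                X[i, j, k] = np.sum(A[i, :] * B[:, j] * C[k, :])
    return X

# Agrees with the einsum one-liner:
n = 4
A, B, C = (np.random.rand(n, n) for _ in range(3))
assert np.allclose(triple_prod_reference(A, B, C),
                   np.einsum('ia,aj,ka->ijk', A, B, C))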

Questions:

  • Does this operation have a standard name?
  • Can I compute it with a single BLAS call?
  • Are there other heavily-optimized numerical C/Fortran libraries that can compute this type of expression?

3 Answers:

Answer 0 (score: 6)

This can be done with a single matrix multiplication, at the cost of some reshaping and permuting. Let n denote the matrix size. In Matlab, you can:

  1. Group A and C into an n^2 x n matrix AC, such that the rows of AC correspond to all combinations of rows of A and C.
  2. Post-multiply AC by B. This gives the desired result, only in a different shape.
  3. Reshape and permute dimensions to obtain the result in the desired form.
Code:

    AC = reshape(bsxfun(@times, permute(A, [1 3 2]), permute(C, [3 1 2])), n^2, n); % // 1
    X = permute(reshape((AC*B).', n, n, n), [2 1 3]);                               %'// 2, 3

Check with a direct loop-based approach:

    %// Example data:
    n = 3;
    A = rand(n,n);
    B = rand(n,n);
    C = rand(n,n);

    %// Proposed approach:
    AC = reshape(bsxfun(@times, permute(A, [1 3 2]), permute(C, [3 1 2])), n^2, n);
    X = permute(reshape((AC*B).', n, n, n), [2 1 3]); %'

    %// Loop-based approach:
    Xloop = NaN(n,n,n); %// initialize
    for ii = 1:n
        for jj = 1:n
            for kk = 1:n
                Xloop(ii,jj,kk) = sum(A(ii,:).*B(:,jj).'.*C(kk,:)); %'
            end
        end
    end

    %// Compute maximum relative difference:
    max(max(max(abs(X./Xloop-1))))

    ans =
       2.2204e-16

The maximum relative difference is on the order of eps, so the result is correct to within numerical precision.
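
For numpy users, the same grouping trick carries over almost directly; here is a minimal sketch (assuming square n-by-n inputs as above; triple_prod_grouped is just an illustrative name):

import numpy as np

def triple_prod_grouped(A, B, C):
    # Step 1: AC[i,k,a] = A[i,a] * C[k,a], then fold (i,k) into one axis
    n = A.shape[0]
    AC = (A[:, None, :] * C[None, :, :]).reshape(n * n, n)
    # Step 2: a single matrix multiplication sums over a
    X = AC @ B                                     # shape (n*n, n): ((i,k), j)
    # Step 3: reshape and permute to (i, j, k)
    return X.reshape(n, n, n).transpose(0, 2, 1)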

Answer 1 (score: 5)

Introduction and solution code

np.einsum is really hard to beat, but in rare cases you can still beat it, if you can bring matrix-multiplication into the computation. After a few trials, it seems you can bring in matrix-multiplication with np.dot to surpass the performance of np.einsum('ia,aj,ka->ijk', A, B, C).

The basic idea is that we break the "all-einsum" operation into a combination of np.einsum and np.dot, as listed below:

  • The products A[i,a] * B[a,j] are computed with np.einsum (keeping the index a), so that we get a 3D array:[i,j,a].
  • This 3D array is then reshaped into a 2D array:[i*j,a], and the third array C[k,a] is transposed to [a,k], the intention being to perform matrix-multiplication between these two. That gives us [i*j,k] as the matrix product, since the index [a] is summed away there.
  • The product is reshaped into a 3D array:[i,j,k] for the final output.

Here's the implementation of the first version discussed so far -

import numpy as np

def tensor_prod_v1(A,B,C):   # First version of proposed method
    # Shape parameters
    m,d = A.shape
    n = B.shape[1]
    p = C.shape[0]

    # Compute products A[i,a]*B[a,j] (index a kept) as a 3D array with indices (i,j,a)
    AB = np.einsum('ia,aj->ija', A, B)

    # Sum over the a-th index via matrix multiplication & reshape to the desired shape
    return np.dot(AB.reshape(m*n,d),C.T).reshape(m,n,p)

Since the a-th index is summed over across all three input arrays, there are three different ways to pair up arrays for the first (einsum) stage. The code listed earlier used (A,B). We can likewise use (A,C) and (B,C), giving us two more variations, as listed next:

def tensor_prod_v2(A,B,C):
    # Shape parameters
    m,d = A.shape
    n = B.shape[1]
    p = C.shape[0]

    # Compute products A[i,a]*C[k,a] (index a kept) as a 3D array with indices (i,k,a)
    AC = np.einsum('ia,ja->ija', A, C)

    # Sum over the a-th index via matrix multiplication & reshape to the desired shape
    return np.dot(AC.reshape(m*p,d),B).reshape(m,p,n).transpose(0,2,1)

def tensor_prod_v3(A,B,C):
    # Shape parameters
    m,d = A.shape
    n = B.shape[1]
    p = C.shape[0]

    # Compute products B[a,j]*C[k,a] (index a kept) as a 3D array with indices (a,j,k)
    BC = np.einsum('ai,ja->aij', B, C)

    # Sum over the a-th index via matrix multiplication & reshape to the desired shape
    return np.dot(A,BC.reshape(d,n*p)).reshape(m,n,p)
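
As a quick sanity check, all three variants can be compared against the all-einsum baseline on small random inputs (a minimal sketch; the shapes are chosen so that m, n, p differ):

import numpy as np

A = np.random.rand(4, 6)   # indices (i, a)
B = np.random.rand(6, 5)   # indices (a, j)
C = np.random.rand(3, 6)   # indices (k, a)

X_ref = np.einsum('ia,aj,ka->ijk', A, B, C)
for f in (tensor_prod_v1, tensor_prod_v2, tensor_prod_v3):
    print(f.__name__, np.allclose(f(A, B, C), X_ref))   # expect True for all three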

Depending on the shapes of the input arrays, the different approaches yield different speedups relative to one another, but we expect all of them to outperform the all-einsum approach. The performance numbers are listed in the next section.

Runtime tests

This is probably the most important section, as we look into the speedup numbers of the three variations of the proposed approach against the all-einsum approach originally proposed in the question.

Dataset #1 (equal-shaped arrays):

In [494]: L1 = 200
     ...: L2 = 200
     ...: L3 = 200
     ...: al = 200
     ...: 
     ...: A = np.random.rand(L1,al)
     ...: B = np.random.rand(al,L2)
     ...: C = np.random.rand(L3,al)
     ...: 

In [495]: %timeit tensor_prod_v1(A,B,C)
     ...: %timeit tensor_prod_v2(A,B,C)
     ...: %timeit tensor_prod_v3(A,B,C)
     ...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
     ...: 
1 loops, best of 3: 470 ms per loop
1 loops, best of 3: 391 ms per loop
1 loops, best of 3: 446 ms per loop
1 loops, best of 3: 3.59 s per loop

Dataset #2 (bigger A):

In [497]: L1 = 1000
     ...: L2 = 100
     ...: L3 = 100
     ...: al = 100
     ...: 
     ...: A = np.random.rand(L1,al)
     ...: B = np.random.rand(al,L2)
     ...: C = np.random.rand(L3,al)
     ...: 

In [498]: %timeit tensor_prod_v1(A,B,C)
     ...: %timeit tensor_prod_v2(A,B,C)
     ...: %timeit tensor_prod_v3(A,B,C)
     ...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
     ...: 
1 loops, best of 3: 442 ms per loop
1 loops, best of 3: 355 ms per loop
1 loops, best of 3: 303 ms per loop
1 loops, best of 3: 2.42 s per loop

Dataset #3 (bigger B):

In [500]: L1 = 100
     ...: L2 = 1000
     ...: L3 = 100
     ...: al = 100
     ...: 
     ...: A = np.random.rand(L1,al)
     ...: B = np.random.rand(al,L2)
     ...: C = np.random.rand(L3,al)
     ...: 

In [501]: %timeit tensor_prod_v1(A,B,C)
     ...: %timeit tensor_prod_v2(A,B,C)
     ...: %timeit tensor_prod_v3(A,B,C)
     ...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
     ...: 
1 loops, best of 3: 474 ms per loop
1 loops, best of 3: 247 ms per loop
1 loops, best of 3: 439 ms per loop
1 loops, best of 3: 2.26 s per loop

Dataset #4 (bigger C):

In [503]: L1 = 100
     ...: L2 = 100
     ...: L3 = 1000
     ...: al = 100
     ...: 
     ...: A = np.random.rand(L1,al)
     ...: B = np.random.rand(al,L2)
     ...: C = np.random.rand(L3,al)

In [504]: %timeit tensor_prod_v1(A,B,C)
     ...: %timeit tensor_prod_v2(A,B,C)
     ...: %timeit tensor_prod_v3(A,B,C)
     ...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
     ...: 
1 loops, best of 3: 250 ms per loop
1 loops, best of 3: 358 ms per loop
1 loops, best of 3: 362 ms per loop
1 loops, best of 3: 2.46 s per loop

Dataset #5 (bigger a-dimension length):

In [506]: L1 = 100
     ...: L2 = 100
     ...: L3 = 100
     ...: al = 1000
     ...: 
     ...: A = np.random.rand(L1,al)
     ...: B = np.random.rand(al,L2)
     ...: C = np.random.rand(L3,al)
     ...: 

In [507]: %timeit tensor_prod_v1(A,B,C)
     ...: %timeit tensor_prod_v2(A,B,C)
     ...: %timeit tensor_prod_v3(A,B,C)
     ...: %timeit np.einsum('ia,aj,ka->ijk', A, B, C)
     ...: 
1 loops, best of 3: 373 ms per loop
1 loops, best of 3: 269 ms per loop
1 loops, best of 3: 299 ms per loop
1 loops, best of 3: 2.38 s per loop

Conclusions: We are seeing speedups of 8x-10x with the variations of the proposed approach over the all-einsum approach listed in the question.
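
As a closing note, numpy 1.12 and later (newer than this answer) added an optimize flag to np.einsum, which can rewrite the contraction to use BLAS-backed operations automatically; whether it matches the hand-rolled variants above is worth measuring on your own shapes:

# Assumes numpy >= 1.12, which introduced the `optimize` keyword
X_opt = np.einsum('ia,aj,ka->ijk', A, B, C, optimize=True)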

Answer 2 (score: 0)

I know this is a bit old now, but this topic comes up a lot. In Matlab it is hard to beat tprod, a MEX file written by Jason Farquhar, available here:

https://www.mathworks.com/matlabcentral/fileexchange/16275-tprod-arbitary-tensor-products-between-n-d-arrays

tprod works much like einsum, although it is limited to binary operations (two tensors). This is probably not a real limitation, since I suspect einsum simply performs a series of binary operations anyway. The order of these operations makes a big difference, and my understanding is that einsum simply performs them in the order the arrays are passed, without allowing multiple intermediate products.
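
On the numpy side, that pairwise breakdown can be inspected with np.einsum_path (available since numpy 1.12, newer than this post), which reports the sequence of binary contractions an optimized einsum would perform:

import numpy as np

A = np.random.rand(100, 100)
B = np.random.rand(100, 100)
C = np.random.rand(100, 100)

# Ask for an optimized pairwise contraction order plus a readable report
path, report = np.einsum_path('ia,aj,ka->ijk', A, B, C, optimize='optimal')
print(report)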

tprod is also limited to dense (full) arrays. Kolda's Tensor Toolbox (mentioned in an earlier post) does handle sparse tensors, but its functionality is more limited than tprod's (it does not allow repeated indices in the output). I am working on filling these gaps, but wouldn't it be nice if Mathworks did it?