If M is an n x m matrix and v and u are vectors, then in terms of indices, matrix-vector multiplication looks like u[i] = sum(M[i,j] * v[j], 1 <= j <= m). Since v is a vector, its elements are presumably stored in consecutive memory locations in numerical-computation-oriented languages. If M is stored in row-major order (as in C, Mathematica, and Pascal), then the successive M[i,j] in the sum are also stored in consecutive memory locations as j is incremented, making the iteration very efficient. If it is stored in column-major order (as in Fortran, MATLAB, R, and Julia), then incrementing j requires jumping over a number of memory locations equal to the matrix's leading stride, which in this case equals n. This naively seems less efficient for matrices with many rows. (For matrix-matrix multiplication the problem doesn't come up, because under either ordering convention, incrementing the summed index requires moving by the major stride in one matrix's memory or the other.)
Is the difference between moving over in memory by one unit and by many units appreciable or negligible in most computer architectures, compared to the multiplication and addition operations? (I'm guessing "negligible", since in practice Fortran is typically at least as fast as C, but can anyone elaborate why?)
Answer 0 (score: 1)
On most computer architectures the difference is expected to be significant, at least in principle.
Matrix-vector multiplication is a memory-bound computation, because the reuse of memory is low. All (n) components of v are reused when computing each element of u, but each of the (n^2) elements of the matrix is used only once. Considering that the latency of typical memory (see e.g. https://gist.github.com/hellerbarde/2843375) is on the order of 100 ns, versus less than 1 ns to perform a floating-point operation, we see that most of the time is spent loading and storing values from the arrays.
We can still aim for cache friendliness, i.e. for as much data locality as possible. Since memory is loaded into the cache in cache lines, we should use each loaded cache line as fully as possible. This is why accessing contiguous memory regions reduces the time spent loading data from memory.
To support this, let's try a very simple code:
program mv
  implicit none
  integer, parameter :: n = 10000
  real, allocatable :: M(:,:), v(:), u(:)
  real :: start, finish
  integer :: i, j

  allocate(M(n,n), v(n), u(n))
  call random_number(M)
  call random_number(v)
  u(:) = 0.

  call cpu_time(start)
  do i = 1, n
    do j = 1, n
      ! non-contiguous order
      u(i) = u(i) + M(i,j) * v(j)
      ! contiguous order
      ! u(i) = u(i) + M(j,i) * v(j)
    enddo
  enddo
  call cpu_time(finish)

  print *, 'elapsed time: ', finish - start
end program mv
Some results (times in seconds):
               non-contiguous order   contiguous order
gfortran -O0          1.                    0.5
gfortran -O3          0.3                   0.1
ifort -O0             1.5                   0.85
ifort -O3             0.037                 0.035
As you can see, the difference is significant when compiling without optimization. With optimization enabled, gfortran still shows a significant difference, whereas with ifort there is only a very small one. Looking at the compiler report, it appears that the compiler interchanged the loops, thus producing contiguous access in the inner loop.
However, can we say that languages with row-major ordering are more efficient for matrix-vector computation? No, I wouldn't say that, and not only because the compiler can compensate for the difference. The code itself does not know everything about M's rows and columns: it basically knows that M has two indices, one of which is, depending on the language, contiguous in memory. For matrix-vector products, the best approach for data locality is to map the "fast" (memory-contiguous) index to the index that runs along a matrix row. You can achieve this in both "row-major" and "column-major" languages; you just have to store the values of M accordingly. For example, if you have the "algebraic" matrix
[ M11 M12 ]
M = [ ]
[ M21 M22 ]
you store it as the "computational" matrix
C ==> M[1,1] = M11 ; M[1,2] = M12 ; M[2,1] = M21 ; M[2,2] = M22
Fortran ==> M[1,1] = M11 ; M[2,1] = M12 ; M[1,2] = M21 ; M[2,2] = M22
In this way the elements of each row of the "algebraic" matrix are arranged contiguously in memory in both cases. The computer knows nothing about the initial matrix, but we know that the computational matrix is the transposed version of the algebraic one. In both cases I have the inner loop iterating over the contiguous index, and the final result is the same vector.
In a complex code, if the matrix has already been allocated and filled with values and I am not free to store the transposed matrix instead, a "row-major" language may indeed give the best performance. However, loop interchange (see https://en.wikipedia.org/wiki/Loop_interchange), performed automatically by the Intel compiler and used in BLAS implementations (see http://www.netlib.org/lapack/explore-html/db/d58/sgemv_8f_source.html), reduces the difference to a very small value. Therefore, in Fortran you may prefer:
do j = 1, n
  do i = 1, n
    u(i) = u(i) + M(i,j) * v(j)
  enddo
enddo