Question

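I'd like to know how to perform matrix multiplication efficiently when working with large matrices.

I've run into a problem with matrices of the following dimensions.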

T = A * P * B
A: 2601 x 40000
P: 40000 x 40000
B: 40000 x 2601

P is a diagonal sparse matrix. All of them have dtype np.longcomplex.

Thanks

Edit:

This is the multiplication I'm having trouble with:

A.H1*P*A.H2 # takes forever
# A is a class I've defined, A.H1 and A.H2 are a np.matrix, dtype np.longcomplex
# P is a scipy.sparse.dia_matrix of dtype np.longcomplex
print A.H1.shape
    (2601, 40000)
print A.H2.shape
    (40000, 2601)
print A.H1.flags
    C_CONTIGUOUS : False
    F_CONTIGUOUS : True
    OWNDATA : False
    WRITEABLE : True
    ALIGNED : True
    UPDATEIFCOPY : False
print A.H2.flags
    C_CONTIGUOUS : True
    F_CONTIGUOUS : False
    OWNDATA : True
    WRITEABLE : True
    ALIGNED : True
    UPDATEIFCOPY : False
%timeit A.H1*A.H2[:,0]
    1 loops, best of 3: 922ms per loop

Answer 1

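This isn't Python, but I have done matrix diagonalization with PARPACK on matrices with dimensions up to 220K.

For dense matrices I couldn't go beyond 100K or so, because there seems to be a problem with the dsyevd routine (I think that's the one; strictly speaking it belongs to LAPACK rather than PARPACK) once you exceed 100K dimensions.

NumPy just uses libraries like these under the hood, so I don't see why it shouldn't work. You will need a lot of memory, though.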
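To put a number on "a lot of memory", here is a rough sketch that estimates the dense footprint of the three matrices from the question. The helper name `dense_gib` is made up for this illustration, and I'm assuming complex128 (16 bytes per element) for the headline figures; the size of the extended-precision type is platform-dependent, so the script only prints it.

```python
import numpy as np

def dense_gib(shape, dtype):
    """Memory needed by a dense 2-D array of this shape and dtype, in GiB."""
    rows, cols = shape
    return rows * cols * np.dtype(dtype).itemsize / 2**30

# Shapes taken from the question.
shapes = {"A": (2601, 40000), "P": (40000, 40000), "B": (40000, 2601)}

for name, shape in shapes.items():
    print(f"{name}: {dense_gib(shape, np.complex128):.2f} GiB as complex128")

# np.clongdouble is the extended-precision complex type; its size varies by platform.
print("clongdouble itemsize on this machine:", np.dtype(np.clongdouble).itemsize, "bytes")
```

The point is mainly that storing P densely costs far more than A and B combined, whereas its diagonal is just 40000 numbers.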

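The matrices I've been looking at have densities between 1% and 10%. For all of them, dense diagonalization is much faster but needs far more memory.

As for CPU resources, I've been using machines with up to 1TB of memory and up to 48 cores. Diagonalizing the largest matrices takes about a week; with iterative diagonalization this drops to a few days for 120K dimensions and 20K eigenvalues. Diagonalization involves a great many multiplications, which is why it takes so much longer than a single product.

I don't see why you would have trouble with a matrix of only 40K dimensions.

Here is a different comparison; at the very least it lets you work out a ballpark upper limit on the speed you could reach.

I don't have time right now to dig out the runs I did a while ago to find out how long a single multiplication should take. Not long ago I also went looking for benchmarks of this kind of thing, to estimate how long some larger jobs would take, and found that most of them don't go up to very large dimensions.

If you can spare the memory, I suggest you try treating the matrices as dense; sometimes that is faster.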
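For the specific product in the question, there is a shortcut I haven't timed at full size, so take it as a sketch: because P is diagonal, A * P * B is the same as scaling column j of A by the j-th diagonal entry and then doing one dense product. The function name `diag_sandwich`, the vector `d` and the shrunken sizes below are placeholders of mine. I'm also using complex128 rather than np.longcomplex, since as far as I know NumPy only hands matrix products to BLAS for single and double precision, and extended-precision arrays fall back to a much slower loop.

```python
import numpy as np

def diag_sandwich(A, d, B):
    """Compute A @ diag(d) @ B without ever forming the diagonal matrix."""
    # Broadcasting multiplies column j of A by d[j]; then one dense matmul.
    return (A * d) @ B

rng = np.random.default_rng(0)
m, n = 26, 400  # scaled-down stand-ins for 2601 and 40000

A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
B = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))
# Diagonal of P; for a scipy.sparse matrix, P.diagonal() returns this vector.
d = rng.standard_normal(n) + 1j * rng.standard_normal(n)

T_fast = diag_sandwich(A, d, B)
T_ref = A @ np.diag(d) @ B  # dense reference, only feasible at small n

print(np.allclose(T_fast, T_ref))
```

If A and B are np.matrix objects as in the question, converting them with np.asarray first keeps `*` as an elementwise operation instead of a matrix product.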

OK, so here are some numbers from a run I did recently:

20000 Eigenvectors to be computed for
128252 dimensional final Hamiltonian matrix
 ======================================================
  inside subroutine --> 'mkarp' --> begins. 

  1. nstat -->                  20000
  2. nhdim -->                 128252
  3. nelm  -->             1132432736
  4. row_index_array(10)    -->          325
  5. column_index_array(10) -->            1
  6. ham_elem_array(10)     -->  -4.394871986531061E-004
 ===========================================================================
  Inside subroutine: 'arpack_coo_openmp_auto_fortran'. 

  total_time_matrix_vector_products (seconds) -->    621072.255066099     
  total_time_dsaupd_subroutine (seconds) -->    4954412.38241189     
  total_time_dseupd -->    157470.167896000 


 ===========================================================================
 ===========================================================================
  Inside subroutine: 'arpack_coo_openmp_auto_fotran'. 

  _sdrv1 
  ====== 

  size of the matrix is -->                 128252
  the number of ritz values requested is -->                  20000
  the number of arnoldi vectors generated (ncv) is -->        40001
  the portion of the spectrum is --> SA
  the number of converged ritz values is -->        20000
  the number of implicit arnoldi update iterations taken is -->           6
  the number of op*x is -->        86080
  the convergence criterion is -->   1.110223024625157E-016

 ===========================================================================

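So what I think you can take from this is that it performed 86080 matrix-vector products, which took 621072 seconds in total (about a week), spread over many cores. That works out to 7.2 seconds per operation, which seems consistent with your ballpark figure of 2-3 seconds. I also think a large part of this sort of work is simply limited by memory bandwidth (benchmarks always mention it).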
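The arithmetic behind those two figures, using only values copied from the log output above (the variable names are mine):

```python
total_matvec_seconds = 621072.255066099  # total_time_matrix_vector_products
n_matvec = 86080                         # "the number of op*x"

seconds_per_matvec = total_matvec_seconds / n_matvec
days_total = total_matvec_seconds / 86400  # 86400 seconds in a day

print(f"{seconds_per_matvec:.1f} s per product, {days_total:.1f} days in total")
# prints: 7.2 s per product, 7.2 days in total
```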
