Question

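I wrote a very simple function using GSL that selects a submatrix from an existing matrix held in a struct.

EDIT: My timing was badly wrong; I had not noticed that the number of leading zeros had changed. Still, I hope this can be sped up.

For a 100x100 submatrix of a 10000x10000 matrix, the call takes 1.2E-5 seconds. Repeating it 1E4 times therefore takes 50 times longer than diagonalizing the 100x100 matrix. EDIT: I realized that this happens even if I comment out everything except return(0); so my theory is that it must have something to do with struct TOWER. This is what TOWER looks like: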

struct TOWER 
{
    int array_level[TOWERSIZE];
    int array_window[TOWERSIZE];
    gsl_matrix *matrix_ordered_covariance;
    gsl_matrix *matrix_peano_covariance;

    double array_angle_tw[XISTEP];
    double array_correl_tw[XISTEP]; 
    gsl_interp_accel *acc_correl;   // interpolating for correlation
    gsl_spline *spline_correl;

    double array_all_eigenvalues[TOWERSIZE]; //contains all eiv. of whole matrix

    std::vector< std::vector<double> > cropped_peano_covariance, peano_mask;

};

And here is my function:

/* --- --- */
int monolevelsubmatrix(int i, int j, struct TOWER *tower, gsl_matrix *result)  //relying on spline!! //must add auto vanishing
{
    int firstrow, firstcol,mu,nu,a,b;
    double aux, correl;

    firstrow = helix*i;
    firstcol = helix*j;

    gsl_matrix_view Xi = gsl_matrix_submatrix (tower ->matrix_ordered_covariance, firstrow, firstcol, helix, helix);
    gsl_matrix_memcpy (result, &(Xi.matrix));

    return(0);  
}
/* --- --- */

Answer 1

The problem is almost certainly gsl_matrix_memcpy. Its source is in copy_source.c, which contains:

    const size_t src_tda = src->tda ;
    const size_t dest_tda = dest->tda ;
    size_t i, j;

    for (i = 0; i < src_size1 ; i++)
      {
        for (j = 0; j < MULTIPLICITY * src_size2; j++)
          {
            dest->data[MULTIPLICITY * dest_tda * i + j] 
              = src->data[MULTIPLICITY * src_tda * i + j];
          }
      }

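This will be slow. Note that gsl_matrix_memcpy raises an error through the GSL_ERROR macro if the two matrices differ in size, so it is quite likely that the CRT memcpy could be applied directly to the data members of dest and src.

The loop is slow because every cell access goes through the dest and src structures to reach the data member, and only then indexes into it.

You could write a replacement for the library routine, or write your own version of this matrix copy, for example (untested, suggested code):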

size_t cellsize = sizeof( src->data[0] ); // just pseudocode here

// valid only when both matrices are stored contiguously (tda == size2)
memcpy( dest->data, src->data, cellsize * src_size1 * src_size2 * MULTIPLICITY );

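Note that MULTIPLICITY is a define, usually 1 or 2 (in GSL's templates it is 1 for real element types and 2 for complex ones). It may not matter for your use, if it is 1.

Now, an important caveat: if the source matrix is a subview, you have to copy row by row. That is, loop over the rows i and limit each CRT memcpy to a single row, rather than copying the whole matrix in one call as I showed above.

In other words, you have to account for the geometry of the source matrix from which the subview was taken. That may be why they index every cell (it keeps things simple).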
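To make that geometry concrete: in row-major storage, element (i, j) of a subview that starts at row k1, column k2 of its parent sits at offset (k1 + i) * tda + (k2 + j) in the parent's data, where tda is the parent's row stride. The sketch below only illustrates this arithmetic. The helper names view_offset and row_gap are hypothetical and are not part of GSL.

```c
#include <stddef.h>

/* Hypothetical helper (not a GSL function): offset of view element (i, j)
   inside the parent's row-major storage. The view starts at (k1, k2) and
   the parent's row stride is tda elements. */
size_t view_offset(size_t tda, size_t k1, size_t k2, size_t i, size_t j)
{
    return (k1 + i) * tda + (k2 + j);
}

/* Hypothetical helper: number of parent elements lying between the end of
   one view row and the start of the next. A nonzero gap means the view is
   not contiguous, so one memcpy over the whole view would copy wrong data. */
size_t row_gap(size_t tda, size_t view_cols)
{
    return tda - view_cols;
}
```

For the 100x100 view of a 10000x10000 matrix from the question, consecutive view rows are separated by 9900 elements of the parent, which is why a single flat copy cannot work on the view.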

However, if you know the geometry, you can very likely optimize this approach well beyond the performance you are seeing.
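A minimal sketch of such a geometry-aware copy, assuming row-major double data, might look like the following. The dense_view struct is a hypothetical stand-in that mirrors only the size1, size2, tda and data fields of gsl_matrix so that the example compiles without GSL. The function copy_rows is likewise an illustration, not a library routine, and it has not been benchmarked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the gsl_matrix fields used here:
   size1 rows, size2 columns, tda = physical row stride in elements. */
typedef struct {
    size_t size1;
    size_t size2;
    size_t tda;
    double *data;
} dense_view;

/* Copy src into dest. Returns 0 on success, -1 if the shapes differ. */
int copy_rows(dense_view *dest, const dense_view *src)
{
    assert(dest != NULL && src != NULL);
    if (dest->size1 != src->size1 || dest->size2 != src->size2)
        return -1;

    /* Fast path: both blocks are contiguous, so one memcpy covers them. */
    if (src->tda == src->size2 && dest->tda == dest->size2) {
        memcpy(dest->data, src->data,
               src->size1 * src->size2 * sizeof(double));
        return 0;
    }

    /* General path: a subview has tda > size2, so copy one row at a time. */
    for (size_t i = 0; i < src->size1; i++) {
        memcpy(dest->data + i * dest->tda,
               src->data + i * src->tda,
               src->size2 * sizeof(double));
    }
    return 0;
}
```

With a real gsl_matrix the same idea would apply to the matrix's own data and tda members. Whether it actually beats gsl_matrix_memcpy depends on the compiler and on the GSL build, so it is worth timing both.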

Even if all you do is hoist the src/dest dereference out of the loop, you should see some performance gain, as in:

        const size_t src_tda = src->tda ;
        const size_t dest_tda = dest->tda ;
        size_t i, j;

        double * dest_data = dest->data; // pseudocode here; double for gsl_matrix
        const double * src_data = src->data; // pseudocode here

        for (i = 0; i < src_size1 ; i++)
          {
            for (j = 0; j < MULTIPLICITY * src_size2; j++)
              {
                dest_data[MULTIPLICITY * dest_tda * i + j] 
                  = src_data[MULTIPLICITY * src_tda * i + j];
              }
          }

We would hope the compiler recognizes this on its own anyway, but... sometimes it does not.

How can I speed up this GSL code that selects a submatrix?
