Question

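I have about 3000 files. Each file has about 55000 rows/identifiers and about 100 columns. For each file I need to compute the row-wise correlation or weighted covariance (depending on the number of columns in the file). The number of rows is the same in all files. What is the most efficient way to compute the correlation matrix for each file? I have tried Perl and C++, but processing a file takes a very long time: Perl takes 6 days and C takes more than a day. Ideally, I don't want it to take more than 15-20 minutes per file.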

Now I am wondering whether there is some trick or other approach that would let me process the files faster. Here is my pseudocode:

while (reading from the file handle)
  read the file line by line
  store the column values in hash1, keyed by the identifier
  store the mean and ssxx (sum of squared deviations of x from its mean) in hash2 and hash3 respectively (a hash of hashes in Perl), by calling the mean and ssxx functions
end
close the file handle

nested for loop over the hash (nested because I need the values of 2 different identifiers to calculate a correlation coefficient)
  calculate ssxy by calling the ssxy function, i.e. the sum of products of the deviations of x and y from their means
  calculate the correlation coefficient
end

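Right now I compute the correlation coefficient only once per pair, and I do not compute it for an identifier with itself. The nested for loop takes care of that. Do you think there is a way to compute the correlation coefficients faster? Any hints/suggestions would be great. Thanks!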

EDIT1: My input file looks like this (the first 10 identifiers):

"Ident_01"  6453.07 8895.79 8145.31 6388.25 6779.12
"Ident_02"  449.803 367.757 302.633 318.037 331.55
"Ident_03"  16.4878 198.937 220.376 91.352  237.983
"Ident_04"  26.4878 398.937 130.376 92.352  177.983
"Ident_05"  36.4878 298.937 430.376 93.352  167.983
"Ident_06"  46.4878 498.937 560.376 94.352  157.983
"Ident_07"  56.4878 598.937 700.376 95.352  147.983
"Ident_08"  66.4878 698.937 990.376 96.352  137.983
"Ident_09"  76.4878 798.937 120.376 97.352  117.983
"Ident_10"  86.4878 898.937 450.376 98.352  127.983

EDIT2: Here are the snippets/subroutines I wrote in Perl:

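use strict;
use warnings;
use List::Util qw(sum);    # sum() is used by mean()

## Pearson correlation coefficient.
## Each argument is a hash ref holding the row values ({string}) plus the
## precomputed {mean} and {ssxx} for that row.
sub correlation {
    my ( $arr1, $arr2 ) = @_;
    my $ssxy = ssxy( $arr1->{string}, $arr2->{string}, $arr1->{mean}, $arr2->{mean} );
    my $cor  = $ssxy / sqrt( $arr1->{ssxx} * $arr2->{ssxx} );
    return $cor;
}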

## Mean
sub mean {
    my $arr1 = shift;
    my $mu_x = sum( @$arr1) /scalar(@$arr1);
    return($mu_x);
}

## Sum of Squared Deviations of x to the mean i.e. ssxx  
sub ssxx {
    my ( $arr1, $mean_x ) = @_;
    my $ssxx = 0;

    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxx = $ssxx + ( $arr1->[$i] - $mean_x )**2;
    }
    return($ssxx); 
}

## Sum of products of the deviations of x and y from their means, i.e. ssxy
sub ssxy {
    my( $arr1, $arr2, $mean_x, $mean_y ) = @_;
    my $ssxy = 0;

    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxy = $ssxy + ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $mean_y );
    }
    return ($ssxy);
}

Answer 1

Have you searched CPAN? The function gsl_stats_correlation computes the Pearson correlation. It lives in Math::GSL::Statistics, a module that binds to the GNU Scientific Library.

gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n.

r = \frac{cov(x, y)}{\hat\sigma_x \hat\sigma_y} = \frac{\frac{1}{n-1} \sum (x_i - \hat x)(y_i - \hat y)}{\sqrt{\frac{1}{n-1} \sum (x_i - \hat x)^2} \sqrt{\frac{1}{n-1} \sum (y_i - \hat y)^2}}

Answer 2

While small improvements to your current code may be possible, I would suggest investing in learning PDL. The documentation on matrix operations may be useful.
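
To spell out why matrix operations fit this problem (my own framing, not something taken from the PDL documentation): write x_{ik} for column k of row i, m for the number of columns, and \hat x_i for the row mean. If every row is centered and scaled to unit length once, each correlation coefficient is a plain dot product, and the whole correlation matrix is a single matrix product:

```latex
z_{ik} = \frac{x_{ik} - \hat x_i}{\sqrt{\sum_{k=1}^{m} (x_{ik} - \hat x_i)^2}},
\qquad
r_{ij} = \sum_{k=1}^{m} z_{ik} z_{jk},
\qquad
R = Z Z^{\top}
```

With 55000 rows that is roughly 1.5 billion distinct pairs per file, which is why it pays to hand the product to an optimized matrix routine rather than an interpreted nested loop.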

Answer 3

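@Sinan and @Praveen have the right idea about how to do this in Perl. However, I would suggest that the overhead inherent in Perl means you will never get the efficiency you are after. I recommend working on optimizing your C code instead.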

The first step would be to set the -O3 flag for maximum compiler optimization.

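From there, I would change your ssxx code so that it subtracts the mean from each data point in place: x[i] -= mean. That way you no longer need to subtract the mean in your ssxy code, so each subtraction is done once instead of 55001 times.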
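
A minimal C sketch of that idea, assuming each row is stored as an array of doubles. center_row and ssxy_centered are hypothetical names for illustration, not the asker's actual functions.

```c
#include <assert.h>
#include <stddef.h>

/* Subtract the row mean in place (x[i] -= mean) and return ssxx for the row. */
double center_row(double *x, size_t n)
{
    assert(n > 0);

    double mean = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += x[i];
    mean /= (double)n;

    double ssxx = 0.0;
    for (size_t i = 0; i < n; i++) {
        x[i] -= mean;          /* done once per data point */
        ssxx += x[i] * x[i];   /* explicit multiply, no pow() */
    }
    return ssxx;
}

/* For two rows that are already centered, ssxy is a plain dot product. */
double ssxy_centered(const double *x, const double *y, size_t n)
{
    double ssxy = 0.0;
    for (size_t i = 0; i < n; i++)
        ssxy += x[i] * y[i];
    return ssxy;
}
```

Run center_row over every row right after reading the file and keep the returned ssxx values. The pair loop then only calls ssxy_centered and divides by sqrt(ssxx_i * ssxx_j).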

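I would check the disassembly to make sure that (x-mean)**2 is compiled to a multiplication rather than to 2^(2 * log(x - mean)), or simply write it as an explicit multiplication yourself.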

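What data structure are you using to hold the data? A double** with memory allocated separately for each row leads to extra calls to malloc (a slow function). It is also more likely to cause memory thrashing, because the allocated blocks end up scattered across different locations. Ideally, you should make as few calls to malloc as possible, each for as large a block as possible, and use pointer arithmetic to traverse the data.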
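
Here is a sketch of what a single-allocation layout could look like in C. The matrix struct and the matrix_alloc / matrix_row / matrix_free helpers are hypothetical names made up for this example, and row-major storage is an assumption that suits row-wise correlation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container: all rows live in one contiguous block. */
typedef struct {
    size_t rows, cols;
    double *data;   /* rows * cols doubles, row-major */
} matrix;

/* One malloc for the whole table instead of one per row.
   Returns 0 on success, -1 if the allocation failed. */
int matrix_alloc(matrix *m, size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    m->rows = rows;
    m->cols = cols;
    m->data = malloc(rows * cols * sizeof *m->data);
    return m->data ? 0 : -1;
}

/* Pointer arithmetic gives the start of row r; no per-row pointer table. */
double *matrix_row(const matrix *m, size_t r)
{
    return m->data + r * m->cols;
}

void matrix_free(matrix *m)
{
    free(m->data);
    m->data = NULL;
}
```

For a 55000 x 100 file this is one block of about 44 MB of doubles, and consecutive rows sit next to each other in memory, which is friendlier to the cache than 55000 separate allocations.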

More optimizations should be possible. If you post your code, I can make some suggestions.

Efficient way to compute a row-wise correlation/covariance matrix
