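I have two vectors of floating-point numbers, x and y, and I want to compute their Pearson correlation coefficient. Because I have to do this on a lot of data (for example, 10 million different vectors x against 20 thousand different vectors y), I am using C++, and more specifically the gsl_stats_correlation function from GSL.
Here is my C++ code: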
#include <iostream>
#include <vector>
using namespace std;
#include <gsl/gsl_vector.h>
#include <gsl/gsl_statistics.h>
int main (int argc, char ** argv)
{
  vector<double> x, y;
  size_t n = 5;
  x.push_back(1.0); y.push_back(1.0);
  x.push_back(3.1); y.push_back(3.2);
  x.push_back(2.0); y.push_back(1.9);
  x.push_back(5.0); y.push_back(4.9);
  x.push_back(2.0); y.push_back(2.1);
  for(size_t i=0; i<n; ++i)
    printf ("x[%ld]=%.1f y[%ld]=%.1f\n", i, x[i], i, y[i]);
  gsl_vector_const_view gsl_x = gsl_vector_const_view_array( &x[0], x.size() );
  gsl_vector_const_view gsl_y = gsl_vector_const_view_array( &y[0], y.size() );
  double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, sizeof(double),
                                          (double*) gsl_y.vector.data, sizeof(double),
                                          n );
  printf ("Pearson correlation = %f\n", pearson);
  return 0;
}
It compiles successfully (gcc -Wall -g pearson.cpp -lstdc++ -lgsl -lgslcblas -o pearson), but when I run it, this is the output:
x[0]=1.0 y[0]=1.0
x[1]=3.1 y[1]=3.2
x[2]=2.0 y[2]=1.9
x[3]=5.0 y[3]=4.9
x[4]=2.0 y[4]=2.1
Pearson correlation = 1.000000
The result clearly should not be 1, as the following R code shows:
x <- c(1.0,3.1,2.0,5.0,2.0); y <-c(1.0,3.2,1.9,4.9,2.1)
cor(x, y, method="pearson") # 0.99798
What am I missing?
Answer 0 (score: 2)
Change the line:
double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, sizeof(double),
(double*) gsl_y.vector.data, sizeof(double),
n );
to:
double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, 1,
(double*) gsl_y.vector.data, 1,
n );
Or, if you want to avoid repeating the "magic number" 1:
const size_t stride = 1;
double pearson = gsl_stats_correlation( (double*) gsl_x.vector.data, stride,
(double*) gsl_y.vector.data, stride,
n );
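gsl_stats_correlation takes arrays of double, and its second and fourth arguments are strides measured in elements (doubles), not in bytes. Passing sizeof(double) therefore makes it step sizeof(double) elements, i.e. sizeof(double)*sizeof(double) bytes, between consecutive values. Every read after the first then lands beyond the end of your five-element arrays, so the value computed from that memory is meaningless.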