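I am using this function (see below) to compute both the Pearson correlation and the p-value between two dataframes, but I am not confident in the p-value results: too many negative correlations seem to come out as significant.
Is there a more elegant way (e.g. a one-liner) to compute the p-values together with the Pearson coefficients?
These two answers (pandas.DataFrame corrwith() method) and (correlation matrix of one dataframe with another) provide elegant solutions, but they lack the p-value calculation.
Here is the code: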
def longestCommonParent(s1: String, s2: String): String = {
val maxSize = scala.math.min(s1.length, s2.length)
var i: Int = 0;
while (i < maxSize && s1(i) == s2(i)) i += 1;
parentFolder(s1.take(i));
}
def parentFolder(path: String) = {
path.substring(0, path.lastIndexOf("/"))
}
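Thanks!
Answer 0 (score: 4)
You need a Pearson correlation test, not just a correlation calculation. Use scipy.stats.pearsonr, which returns the estimated Pearson coefficient and the two-tailed p-value.
Because the function takes a pair of one-dimensional series as input, consider iterating over every column of both dataframes to fill pre-allocated matrices. You can then convert the matrices to dataframes with the desired columns and index: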
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
df1 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
df2 = pd.DataFrame(np.random.rand(10, 5), columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
coeffmat = np.zeros((df1.shape[1], df2.shape[1]))
pvalmat = np.zeros((df1.shape[1], df2.shape[1]))
for i in range(df1.shape[1]):
    for j in range(df2.shape[1]):
        corrtest = pearsonr(df1[df1.columns[i]], df2[df2.columns[j]])
        coeffmat[i,j] = corrtest[0]
        pvalmat[i,j] = corrtest[1]
dfcoeff = pd.DataFrame(coeffmat, columns=df2.columns, index=df1.columns)
print(dfcoeff)
# Col1 Col2 Col3 Col4 Col5
# Col1 -0.791083 0.459101 -0.488463 -0.289265 0.494897
# Col2 0.059446 -0.395072 0.310900 0.297532 0.201669
# Col3 -0.062592 0.391469 -0.450600 -0.136554 0.299579
# Col4 -0.470203 0.797971 -0.193561 -0.338896 -0.244132
# Col5 -0.057848 -0.037053 0.042798 0.176966 -0.157344
dfpvals = pd.DataFrame(pvalmat, columns=df2.columns, index=df1.columns)
print(dfpvals)
# Col1 Col2 Col3 Col4 Col5
# Col1 0.006421 0.181967 0.152007 0.417574 0.145871
# Col2 0.870421 0.258506 0.381919 0.403770 0.576357
# Col3 0.863615 0.263268 0.191245 0.706796 0.400385
# Col4 0.170260 0.005666 0.592096 0.338101 0.496668
# Col5 0.873881 0.919058 0.906551 0.624783 0.664206
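Since the question asks for something closer to a one-liner, the same loop can be sketched with nested `DataFrame.apply` calls. This is only a sketch: the helper name `cross_corr_pvals` and its arguments `left` and `right` are made up for illustration, and it assumes both dataframes have the same number of rows and no missing values.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def cross_corr_pvals(left, right):
    """Hypothetical helper: Pearson r and two-tailed p-value for every
    (left column, right column) pair.

    Both results have left's columns as the index and right's columns
    as the columns, matching the layout of the loop-based version.
    """
    # The outer apply walks over left's columns and the inner apply over
    # right's columns, so the raw result is indexed by right's columns;
    # the transpose puts left's columns on the rows.
    coeffs = left.apply(lambda x: right.apply(lambda y: pearsonr(x, y)[0])).T
    pvals = left.apply(lambda x: right.apply(lambda y: pearsonr(x, y)[1])).T
    return coeffs, pvals


rng = np.random.default_rng(0)
left = pd.DataFrame(rng.random((10, 3)), columns=["a", "b", "c"])
right = pd.DataFrame(rng.random((10, 2)), columns=["x", "y"])
coeffs, pvals = cross_corr_pvals(left, right)
print(coeffs)
print(pvals)
```

This calls `pearsonr` twice per pair, which is wasteful for wide dataframes; the explicit loop above avoids that at the cost of a few more lines.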
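Answer 1 (score: 0)
You can compare it with a bootstrap significance (for example: if you randomly shuffle one of the series, what is the probability of getting the same or a higher correlation?). This differs from the Pearson p-value, because the latter is derived under the assumption that your data are normally distributed, so if that is not the case you may get somewhat different results.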
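import numpy as np

bootstrapLen = 1000
leng = 10000
X, Y = [np.random.randn(leng) for _ in [1, 2]]
# For zero-mean, unit-variance data, the dot product divided by the length approximates Pearson's r
correlation = np.correlate(X, Y)[0] / leng
# Null distribution: absolute correlation after randomly shuffling Y
bootstrap = [abs(np.correlate(X, Y[np.random.permutation(leng)])[0] / leng) for _ in range(bootstrapLen)]
bootstrap = np.sort(bootstrap)
# Fraction of shuffled correlations below the observed one; 1 - significance estimates the p-value
significance = np.searchsorted(bootstrap, abs(correlation)) / bootstrapLen
print("correlation is {} with significance {}".format(correlation, significance))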
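The snippet above relies on the data being standard normal, so that the scaled dot product approximates Pearson's r. For data that are not already centered and scaled, a rough sketch of the same shuffling idea could use `np.corrcoef` instead. The function name `permutation_pvalue` and its parameters are invented for illustration, and the add-one correction in the p-value is a choice made here, not something taken from the answer.

```python
import numpy as np


def permutation_pvalue(x, y, n_permutations=1000, seed=0):
    """Hypothetical helper: two-sided permutation p-value for Pearson's r."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Observed Pearson correlation (np.corrcoef centers and scales the data)
    observed = np.corrcoef(x, y)[0, 1]
    # Correlations obtained after breaking the pairing by shuffling y
    permuted = np.array(
        [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_permutations)]
    )
    # Share of shuffles at least as extreme as the observed value;
    # the +1 terms keep the estimate strictly above zero
    exceed = np.sum(np.abs(permuted) >= abs(observed))
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
r, p = permutation_pvalue(x, y)
print("correlation is {:.3f} with permutation p-value {:.4f}".format(r, p))
```

Unlike the `significance` value above, this returns a p-value directly, so smaller means stronger evidence against zero correlation.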