Correlation between a pandas Series and a whole DataFrame

Time: 2017-01-23 12:41:38

Tags: python pandas correlation

I have a series of values, and I want to compute its Pearson correlation with each row of a given table.

How can I do that?

Example:

import pandas as pd

v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]

s = pd.Series(v)
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

# Here I expect to do df.corrwith(s) - but it won't work

Computed with Series.corr(), the expected output is

-0.1666666666666666  # correlation with the first row
0.83914639167827343  # correlation with the second row
-0.35355339059327379 # correlation with the third row

2 answers:

Answer 0 (score: 4)

The Series needs the same index as the columns of the DataFrame so that the Series aligns with the DataFrame. Then add axis=1 to corrwith for row-wise correlation:

s1 = pd.Series(s.values, index=df.columns)
print (s1)
a    -1
b     5
c     0
d     0
e    10
f     0
g    -7
dtype: int64

print (df.corrwith(s1, axis=1))
0   -0.166667
1    0.839146
2   -0.353553
dtype: float64

print (df.corrwith(pd.Series(v, index=df.columns), axis=1))
0   -0.166667
1    0.839146
2   -0.353553
dtype: float64

Edit:

You can specify columns and use a subset:

cols = ['a','b','e']

print (df[cols])
   a  b  e
0  1  0  0
1  0  1  1
2  1  1  0

print (df[cols].corrwith(pd.Series(v, index=df.columns), axis=1))
0   -0.891042
1    0.891042
2   -0.838628
dtype: float64
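As a sanity check on the alignment approach, the sketch below recomputes the row-wise result and compares it against a purely positional calculation with np.corrcoef. It is only an illustration built from the data in the question. The names s_aligned, row_corr and positional are made up for this example and are not part of the original answer.

```python
import numpy as np
import pandas as pd

v = [-1, 5, 0, 0, 10, 0, -7]
df = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0],
     [1, 1, 0, 0, 0, 0, 1]],
    columns=list("abcdefg"),
)

# Give the Series the DataFrame's column labels so that corrwith
# can align each row with it label by label.
s_aligned = pd.Series(v, index=df.columns)
row_corr = df.corrwith(s_aligned, axis=1)

# Positional cross-check: correlate each row's raw values with v,
# ignoring labels entirely.
positional = np.array(
    [np.corrcoef(df.iloc[i].to_numpy(), v)[0, 1] for i in range(len(df))]
)

print(row_corr)
print(np.allclose(row_corr.to_numpy(), positional))
```

If the two results agree, the labels were matched as intended. If the Series keeps its default integer index, none of its labels match the columns a to g, so there is nothing to align on.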

Answer 1 (score: 0)

This may be useful for those who care about performance. I found that it cut the run time in half compared with pandas corrwith.

Your data:

import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]    
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

Solution (note that v is not converted to a Series):

from scipy.stats import pearsonr
s_corrs = df.apply(lambda x: pearsonr(x.values, v)[0], axis=1)
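If performance is the main concern, one further option is to skip the per-row apply and compute all the correlations in a single NumPy pass. The following is a minimal sketch, not something benchmarked here. The helper name rowwise_pearson is hypothetical, and it assumes the data has no missing values and no constant rows (a constant row would cause a division by zero).

```python
import numpy as np
import pandas as pd

def rowwise_pearson(frame, values):
    """Pearson correlation of every row of `frame` with `values` (matched by position)."""
    x = frame.to_numpy(dtype=float)
    y = np.asarray(values, dtype=float)
    # Center each row and the reference vector around their means.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    # Covariance numerator and product of norms, one entry per row.
    num = xc @ yc
    den = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return pd.Series(num / den, index=frame.index)

v = [-1, 5, 0, 0, 10, 0, -7]
df = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0],
     [1, 1, 0, 0, 0, 0, 1]],
    columns=list("abcdefg"),
)

fast_corrs = rowwise_pearson(df, v)
print(fast_corrs)
```

Like the pearsonr version, this matches values by position rather than by label, so v must be in the same order as the DataFrame's columns.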