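I have a sequence of values, and I want to compute the Pearson correlation between it and each row of a given table.
How can I do that?
Example: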
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
s = pd.Series(v)
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
# Here I expected to do df.corrwith(s), but that won't work
Computing each row's correlation with Series.corr(), the expected output is:
-0.1666666666666666 # correlation with the first row
0.83914639167827343 # correlation with the second row
-0.35355339059327379 # correlation with the third row
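For reference, the expected values above can be reproduced without pandas by applying the Pearson formula directly (covariance of the centered values divided by the product of their norms). This is only a minimal sketch: the helper name pearson is hypothetical, and it assumes equal-length inputs with non-zero variance.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Center both sequences around their means
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # Assumption: neither input is constant, otherwise norm is 0
    return cov / norm

v = [-1, 5, 0, 0, 10, 0, -7]
rows = [
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 1],
]
row_corrs = [pearson(row, v) for row in rows]
print(row_corrs)
```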
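Answer 0 (score: 4)
The Series needs the same index as the columns of the DataFrame, so that the Series aligns with the DataFrame. Then add axis=1 to corrwith for row-wise correlation: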
s1 = pd.Series(s.values, index=df.columns)
print (s1)
a -1
b 5
c 0
d 0
e 10
f 0
g -7
dtype: int64
print (df.corrwith(s1, axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
print (df.corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.166667
1 0.839146
2 -0.353553
dtype: float64
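To illustrate why the index matters, the sketch below compares a Series with the default integer index against one relabelled with the column names. It assumes a recent pandas version, where corrwith aligns on labels; with no labels in common I would expect every row to come back as NaN, though older versions may behave differently. The variable names unaligned and aligned are only illustrative.

```python
import pandas as pd

v = [-1, 5, 0, 0, 10, 0, -7]
df = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 0],
     [1, 1, 0, 0, 0, 0, 1]],
    columns=list("abcdefg"),
)

# Default index 0..6 shares no labels with the columns 'a'..'g'
unaligned = pd.Series(v)
# Same values, relabelled with the DataFrame's column names
aligned = pd.Series(v, index=df.columns)

unaligned_result = df.corrwith(unaligned, axis=1)
aligned_result = df.corrwith(aligned, axis=1)
print(unaligned_result)
print(aligned_result)
```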
The same corrwith call with axis=1 gives row-wise correlation on a subset of columns:
cols = ['a','b','e']
print (df[cols])
a b e
0 1 0 0
1 0 1 1
2 1 1 0
print (df[cols].corrwith(pd.Series(v, index=df.columns), axis=1))
0 -0.891042
1 0.891042
2 -0.838628
dtype: float64
Edit:
You can also specify columns and run corrwith on that subset, as in the cols example above.
Answer 1 (score: 0)
This may be useful for anyone who cares about performance. I found that it halves the running time compared with pandas corrwith.
Your data:
import pandas as pd
v = [-1, 5, 0, 0, 10, 0, -7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame([v1, v2, v3], columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
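Solution (note that v is not converted to a Series):
# scipy.stats.stats is a deprecated namespace; import from scipy.stats instead
from scipy.stats import pearsonr
# pearsonr returns (correlation, p-value); keep only the correlation for each row
s_corrs = df.apply(lambda x: pearsonr(x.values, v)[0], axis=1)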
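If the per-row apply is still too slow, one alternative (not part of the answer above, and not timed here) is to compute all row correlations in a single vectorized NumPy pass. The function name rowwise_pearson is hypothetical, and the sketch assumes a fully numeric table with no NaN values and no constant rows.

```python
import numpy as np

def rowwise_pearson(matrix, vector):
    """Pearson correlation of each row of `matrix` with `vector` (hypothetical helper)."""
    m = np.asarray(matrix, dtype=float)
    y = np.asarray(vector, dtype=float)
    # Center each row and the vector around their own means
    m_centered = m - m.mean(axis=1, keepdims=True)
    y_centered = y - y.mean()
    # Numerator: one dot product per row
    cov = m_centered @ y_centered
    # Denominator: product of the norms of the centered row and centered vector
    norms = np.sqrt((m_centered ** 2).sum(axis=1) * (y_centered ** 2).sum())
    return cov / norms

v = [-1, 5, 0, 0, 10, 0, -7]
table = [
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 1],
]
vectorized_corrs = rowwise_pearson(table, v)
print(vectorized_corrs)
```

With a DataFrame, the same function could be called as rowwise_pearson(df.to_numpy(), v). Unlike corrwith, this ignores labels entirely, so the values in v must already be in column order.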