Is there a simple way to do all pairwise statistical comparisons of distributions in a pandas DataFrame?

Asked: 2016-06-06 16:06:29

Tags: python python-3.x pandas scipy statistics

I have a pandas DataFrame containing 5 distributions. I can slice it and run pairwise comparisons with ranksums or an equivalent, like this:

from scipy import stats

# Combine boolean masks with & and parentheses; `and` raises an error on a pandas Series.
case1 = df[(df['Symmetric Division Rate']=='1') & (df['test']=='sackin')]['value']
case2 = df[(df['Symmetric Division Rate']=='0.8') & (df['test']=='sackin')]['value']
case3 = df[(df['Symmetric Division Rate']=='0.6') & (df['test']=='sackin')]['value']
case4 = df[(df['Symmetric Division Rate']=='0.4') & (df['test']=='sackin')]['value']
case5 = df[(df['Symmetric Division Rate']=='0.2') & (df['test']=='sackin')]['value']

z_stat_12, p_val_12 = stats.ranksums(case1, case2)
z_stat_13, p_val_13 = stats.ranksums(case1, case3)
z_stat_14, p_val_14 = stats.ranksums(case1, case4)
z_stat_15, p_val_15 = stats.ranksums(case1, case5)
z_stat_23, p_val_23 = stats.ranksums(case2, case3)
z_stat_24, p_val_24 = stats.ranksums(case2, case4)
z_stat_25, p_val_25 = stats.ranksums(case2, case5)
z_stat_34, p_val_34 = stats.ranksums(case3, case4)
z_stat_35, p_val_35 = stats.ranksums(case3, case5)
z_stat_45, p_val_45 = stats.ranksums(case4, case5)

I get the numbers I want, but this looks very unpythonic, and I'm sure there is a simpler way to do it with pandas.

Here is a sample dataset (I have never posted data before, so apologies if this is clunky).

    SymmetricDivisionRate   iteration   test    value
0   1   1   B1  205.0345238
1   1   1   Nbar    3.24545051
2   1   1   sackin  7312
3   1   1   sackin_yule -11.34946052
4   1   1   sackin_pda  0.068374536
5   1   2   B1  216.1595238
6   1   2   Nbar    3.182567216
7   1   2   sackin  7339
8   1   2   sackin_yule -11.45883714
9   1   2   sackin_pda  0.066274725
10  1   3   B1  209.1
11  1   3   Nbar    3.110472824
12  1   3   sackin  7039
13  1   3   sackin_yule -11.49329366
14  1   3   sackin_pda  0.065385904
15  1   4   B1  209.5678571
16  1   4   Nbar    3.215731371
17  1   4   sackin  6991
18  1   4   sackin_yule -11.30780804
19  1   4   sackin_pda  0.068968375
20  1   5   B1  218.1789683
21  1   5   Nbar    3.248949089
22  1   5   sackin  6956
23  1   5   sackin_yule -11.24400585
24  1   5   sackin_pda  0.070215755
25  0.8 1   B1  109.5333333
26  0.8 1   Nbar    2.789264414
27  0.8 1   sackin  4209
28  0.8 1   sackin_yule -11.00423445
29  0.8 1   sackin_pda  0.071803409
30  0.8 2   B1  137.5761905
31  0.8 2   Nbar    3.071715818
32  0.8 2   sackin  4583
33  0.8 2   sackin_yule -10.69913124
34  0.8 2   sackin_pda  0.079523708
35  0.8 3   B1  125.0428571
36  0.8 3   Nbar    3.630173565
37  0.8 3   sackin  5438
38  0.8 3   sackin_yule -10.14869758
39  0.8 3   sackin_pda  0.093793228
40  0.8 4   B1  119.45
41  0.8 4   Nbar    3.045751634
42  0.8 4   sackin  4660
43  0.8 4   sackin_yule -10.77537925
44  0.8 4   sackin_pda  0.077866162
45  0.8 5   B1  134.9511905
46  0.8 5   Nbar    3.507385999
47  0.8 5   sackin  5461
48  0.8 5   sackin_yule -10.34871987
49  0.8 5   sackin_pda  0.088887207
50  0.6 1   B1  113.6456349
51  0.6 1   Nbar    3.610369207
52  0.6 1   sackin  4596
53  0.6 1   sackin_yule -9.843110763
54  0.6 1   sackin_pda  0.101189958
55  0.6 2   B1  112.5384921
56  0.6 2   Nbar    4.176514032
57  0.6 2   sackin  5655
58  0.6 2   sackin_yule -9.400292666
59  0.6 2   sackin_pda  0.113502287
60  0.6 3   B1  109.9595238
61  0.6 3   Nbar    3.630434783
62  0.6 3   sackin  4843
63  0.6 3   sackin_yule -9.916620532
64  0.6 3   sackin_pda  0.099398705
65  0.6 4   B1  104.0289683
66  0.6 4   Nbar    4.133131619
67  0.6 4   sackin  5464
68  0.6 4   sackin_yule -9.395858086
69  0.6 4   sackin_pda  0.113674619
70  0.6 5   B1  98.8
71  0.6 5   Nbar    3.447641886
72  0.6 5   sackin  4313
73  0.6 5   sackin_yule -9.970985718
74  0.6 5   sackin_pda  0.097475056
75  0.4 1   B1  107.3107143
76  0.4 1   Nbar    3.649173955
77  0.4 1   sackin  3755
78  0.4 1   sackin_yule -9.378914506
79  0.4 1   sackin_pda  0.113759292
80  0.4 2   B1  105.1011905
81  0.4 2   Nbar    3.51625239
82  0.4 2   sackin  3678
83  0.4 2   sackin_yule -9.5445921
84  0.4 2   sackin_pda  0.10872119
85  0.4 3   B1  97.53452381
86  0.4 3   Nbar    3.655306719
87  0.4 3   sackin  3754
88  0.4 3   sackin_yule -9.368892583
89  0.4 3   sackin_pda  0.114061375
90  0.4 4   B1  98.34285714
91  0.4 4   Nbar    3.333010649
92  0.4 4   sackin  3443
93  0.4 4   sackin_yule -9.702833517
94  0.4 4   sackin_pda  0.103701859
95  0.4 5   B1  115.8261905
96  0.4 5   Nbar    3.275482094
97  0.4 5   sackin  3567
98  0.4 5   sackin_yule -9.865897615
99  0.4 5   sackin_pda  0.099257033
100 0.2 1   B1  90.50119048
101 0.2 1   Nbar    3.901939655
102 0.2 1   sackin  3621
103 0.2 1   sackin_yule -8.919632533
104 0.2 1   sackin_pda  0.128087444
105 0.2 2   B1  87.61666667
106 0.2 2   Nbar    3.126728111
107 0.2 2   sackin  2714
108 0.2 2   sackin_yule -9.561238501
109 0.2 2   sackin_pda  0.106128067
110 0.2 3   B1  87.70952381
111 0.2 3   Nbar    3.72
112 0.2 3   sackin  3162
113 0.2 3   sackin_yule -8.926080269
114 0.2 3   sackin_pda  0.127594947
115 0.2 4   B1  88.03333333
116 0.2 4   Nbar    3.089449541
117 0.2 4   sackin  2694
118 0.2 4   sackin_yule -9.607707206
119 0.2 4   sackin_pda  0.104621963
120 0.2 5   B1  89.45
121 0.2 5   Nbar    3.711306257
122 0.2 5   sackin  3381
123 0.2 5   sackin_yule -9.073308361
124 0.2 5   sackin_pda  0.122961062

1 answer:

Answer 0 (score: 3)

You could do something along these lines:

from itertools import combinations

import pandas as pd
from scipy.stats import ranksums

Create the selectors:

df.SymmetricDivisionRate = df.SymmetricDivisionRate.astype(str)
selectors = df.SymmetricDivisionRate.unique()

Create all combinations of the selectors:

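# combinations() returns a one-shot iterator; wrap it in list() so it can be printed and reused.
cases = list(combinations(selectors, 2))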

[('1.0', '0.8'), ('1.0', '0.6'), ('1.0', '0.4'), ('1.0', '0.2'), ('0.8', '0.6'), ('0.8', '0.4'), ('0.8', '0.2'), ('0.6', '0.4'), ('0.6', '0.2'), ('0.4', '0.2')]
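As a quick sanity check on this step, the sketch below counts the pairs for five groups. The selector labels are hard-coded here as an assumption so the snippet runs without the original DataFrame.

```python
from itertools import combinations

# Hypothetical selector labels, mirroring the five rates in the question.
selectors = ["1.0", "0.8", "0.6", "0.4", "0.2"]

# Materialize the iterator so it can be inspected and reused later.
cases = list(combinations(selectors, 2))

# n groups give n * (n - 1) / 2 unordered pairs.
n = len(selectors)
expected_pairs = n * (n - 1) // 2

print(len(cases), expected_pairs)
```

Each pair appears once and order within a pair follows the order of `selectors`, so no comparison is run twice.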

Store the relevant data in a dictionary (optional; you could select the data on the fly in the next step, but that becomes less readable):

means = {s: df.loc[(df['SymmetricDivisionRate']==s) & (df.test=='sackin'), 'value'] for s in selectors}
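One possible alternative for building the same kind of dictionary is `groupby`, which avoids repeating the boolean mask for every selector. This is only a sketch: the tiny DataFrame below is a hypothetical miniature of the question's data, and `by_rate` is an illustrative name.

```python
import pandas as pd

# Hypothetical miniature of the question's data: two rates, two tests.
df = pd.DataFrame({
    "SymmetricDivisionRate": ["1", "1", "1", "0.8", "0.8", "0.8"],
    "test": ["sackin", "Nbar", "sackin", "sackin", "Nbar", "sackin"],
    "value": [7312.0, 3.245, 7339.0, 4209.0, 2.789, 4583.0],
})

# Filter to one test first, then split the 'value' column by rate.
sackin = df[df["test"] == "sackin"]
by_rate = {
    rate: values
    for rate, values in sackin.groupby("SymmetricDivisionRate", sort=False)["value"]
}

print(by_rate["0.8"].tolist())
```

With `sort=False` the keys keep the order in which the rates first appear in the data.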

Compute ranksums in a dictionary comprehension (which you can convert to a pd.DataFrame):

results = pd.DataFrame({c: ranksums(means[c[0]], means[c[1]]) for c in cases}).T
results.columns = ['z_stat', 'p_val']

which gives:

           z_stat     p_val
0.4 0.2  2.193378  0.028280
0.6 0.2  2.611165  0.009023
    0.4  2.611165  0.009023
0.8 0.2  2.611165  0.009023
    0.4  2.611165  0.009023
    0.6 -0.731126  0.464702
1.0 0.2  2.611165  0.009023
    0.4  2.611165  0.009023
    0.6  2.611165  0.009023
    0.8  2.611165  0.009023
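The steps above can be wrapped in a small helper. The following is a minimal, self-contained sketch: `pairwise_ranksums` and the index level names `group_a` and `group_b` are hypothetical names chosen for illustration, and the 'sackin' values are copied by hand from the sample data in the question.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import ranksums

# 'sackin' values copied from the question's sample data, keyed by rate.
sackin = {
    "1": [7312, 7339, 7039, 6991, 6956],
    "0.8": [4209, 4583, 5438, 4660, 5461],
    "0.6": [4596, 5655, 4843, 5464, 4313],
    "0.4": [3755, 3678, 3754, 3443, 3567],
    "0.2": [3621, 2714, 3162, 2694, 3381],
}


def pairwise_ranksums(groups):
    """Run scipy.stats.ranksums on every unordered pair of groups.

    `groups` maps a label to a sequence of observations.
    """
    pairs = list(combinations(groups, 2))
    records = []
    for a, b in pairs:
        z_stat, p_val = ranksums(groups[a], groups[b])
        records.append((z_stat, p_val))
    index = pd.MultiIndex.from_tuples(pairs, names=["group_a", "group_b"])
    return pd.DataFrame(records, index=index, columns=["z_stat", "p_val"])


results = pairwise_ranksums(sackin)
print(results)
```

A single row can then be looked up by its pair of labels, for example `results.loc[("1", "0.8")]`. The helper takes a plain mapping rather than a DataFrame so that it does not depend on any particular column names.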