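Given the sample data below, I am trying to compute conditional probabilities. The columns represent events A-E occurring in a sequence (S1, S2, ...).
The first two examples work as expected and compute P(S2 | S1) and P(S2, S3 | S1). The approach breaks when the condition spans more than one column, as in the third example, where I expected to get P(S3 | S1, S2).
I would appreciate any insight into why this does not work and which alternative approach would give the desired P(S3 | S1, S2). For example, I expect the output to contain the rows A,D,B,0.25 and A,D,C,0.75.
Thanks!
MWE code: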
import pandas as pd
data = {'S1': ['A', 'A', 'A', 'B', 'B', 'A', 'A'],
        'S2': ['B', 'D', 'D', 'A', 'D', 'D', 'D'],
        'S3': ['C', 'C', 'C', 'D', 'C', 'B', 'C'],
        'S4': ['D', 'B', 'E', 'C', 'A', 'C', 'E']}
df = pd.DataFrame(data)
print(df)
print((df.groupby(['S1', 'S2']).agg({'S4': 'count'}) /  # example 1: P(S2 | S1)
       df.groupby('S1').agg({'S4': 'count'})).rename(columns={'S4': 'Freq'}))
print((df.groupby(['S1', 'S2', 'S3']).agg({'S4': 'count'}) /  # example 2: P(S2, S3 | S1)
       df.groupby('S1').agg({'S4': 'count'})).rename(columns={'S4': 'Freq'}))
print((df.groupby(['S1', 'S2', 'S3']).agg({'S4': 'count'}) /  # example 3: P(S3 | S1, S2), raises
       df.groupby(['S1', 'S2']).agg({'S4': 'count'})).rename(columns={'S4': 'Freq'}))
Output:
  S1 S2 S3 S4
0  A  B  C  D
1  A  D  C  B
2  A  D  C  E
3  B  A  D  C
4  B  D  C  A
5  A  D  B  C
6  A  D  C  E
       Freq
S1 S2
A  B    0.2
   D    0.8
B  A    0.5
   D    0.5
          Freq
S1 S2 S3
A  B  C    0.2
   D  B    0.2
      C    0.6
B  A  D    0.5
   D  C    0.5
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    print ((df.groupby(['S1','S2','S3']).agg({'S4':'count'}) / df.groupby(['S1','S2']).agg({'S4':'count'})).rename(columns={'S4':'Freq'}))
NotImplementedError: merging with more than one level overlap on a multi-index is not implemented
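
One possible way to get P(S3 | S1, S2) without dividing two differently indexed frames is sketched below. The error message suggests that the division tries to align the three-level index with the two-level one on both shared levels (S1 and S2), which the pandas version used here does not implement; other versions may behave differently. The sketch sidesteps that alignment: it counts each (S1, S2, S3) combination, then uses groupby(...).transform('sum') to attach the (S1, S2) total to every row, so numerator and denominator share the same index. The names joint, cond_total and cond_prob are illustrative and not part of the original code.

```python
import pandas as pd

data = {'S1': ['A', 'A', 'A', 'B', 'B', 'A', 'A'],
        'S2': ['B', 'D', 'D', 'A', 'D', 'D', 'D'],
        'S3': ['C', 'C', 'C', 'D', 'C', 'B', 'C'],
        'S4': ['D', 'B', 'E', 'C', 'A', 'C', 'E']}
df = pd.DataFrame(data)

# Number of rows for each observed (S1, S2, S3) combination.
joint = df.groupby(['S1', 'S2', 'S3']).size()

# Total rows per (S1, S2), repeated for every (S1, S2, S3) entry,
# so the result has exactly the same index as `joint`.
cond_total = joint.groupby(level=['S1', 'S2']).transform('sum')

# P(S3 | S1, S2) = count(S1, S2, S3) / count(S1, S2)
cond_prob = (joint / cond_total).rename('Freq')
print(cond_prob.to_frame())
```

With the sample data, the (A, D) group has four rows, one with S3 = B and three with S3 = C, so this should give 0.25 for A,D,B and 0.75 for A,D,C.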