How to compute a summary for each variable and save it as a separate data frame per variable in Python
I have a pandas DataFrame:
Age_Bin Cat_Bin Outcome
Age1 Cat2 0
Age1 Cat1 1
Age2 Cat2 1
Age1 Cat1 1
Age2 Cat1 0
Age3 Cat1 0
Age3 Cat2 0
Age1 Cat1 1
Age3 Cat2 1
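For anyone who wants to reproduce this, the table above can be rebuilt as a DataFrame roughly like this. It is only a minimal sketch: the column names and values are copied from the table, and the default integer index is an assumption.

```python
import pandas as pd

# Sample data copied row by row from the table above.
df = pd.DataFrame({
    "Age_Bin": ["Age1", "Age1", "Age2", "Age1", "Age2", "Age3", "Age3", "Age1", "Age3"],
    "Cat_Bin": ["Cat2", "Cat1", "Cat2", "Cat1", "Cat1", "Cat1", "Cat2", "Cat1", "Cat2"],
    "Outcome": [0, 1, 1, 1, 0, 0, 0, 1, 1],
})
print(df)
```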
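Using the code given below, I compute a summary of the Outcome distribution for each variable, as shown here.
Example for the Age_Bin variable:
Age_Bin  Outcome_0_cnt  Outcome_1_cnt  Total_cnt  Outcome_0_cnt%  Outcome_1_cnt%
Age1     1              3              4          1/4             3/5
Age2     1              1              2          1/4             1/5
Age3     2              1              3          2/4             1/5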
I achieved this with the following code:
df1 = (
    df.groupby(['Age_Bin', 'Outcome'])['Cat_Bin']
      .size()
      .unstack(fill_value=0)
      .add_prefix('Outcome_')
)
df = df1.assign(Total_cnt=lambda x: x.sum(1)).join(df1.div(df1.sum()).add_suffix('%'))
print(df)
Outcome  Outcome_0  Outcome_1  Total_cnt  Outcome_0%  Outcome_1%
Age_Bin
Age1             1          3          4        0.25         0.6
Age2             1          1          2        0.25         0.2
Age3             2          1          3        0.50         0.2
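The same summary can probably also be built with `pd.crosstab`, which may be easier to read than the groupby/unstack chain. This is a sketch rather than the code used in the question; the names `counts`, `shares` and `summary` are made up for illustration, and the sample data is repeated so that the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Age_Bin": ["Age1", "Age1", "Age2", "Age1", "Age2", "Age3", "Age3", "Age1", "Age3"],
    "Cat_Bin": ["Cat2", "Cat1", "Cat2", "Cat1", "Cat1", "Cat1", "Cat2", "Cat1", "Cat2"],
    "Outcome": [0, 1, 1, 1, 0, 0, 0, 1, 1],
})

# Number of rows per (Age_Bin, Outcome) combination.
counts = pd.crosstab(df["Age_Bin"], df["Outcome"]).add_prefix("Outcome_")

# Share of each Outcome column that falls into each Age_Bin (each column sums to 1).
shares = (
    pd.crosstab(df["Age_Bin"], df["Outcome"], normalize="columns")
    .add_prefix("Outcome_")
    .add_suffix("%")
)

summary = counts.assign(Total_cnt=counts.sum(axis=1)).join(shares)
print(summary)
```

Here `normalize="columns"` divides every column by its own total, which is the same operation as `df1.div(df1.sum())` in the code above.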
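In addition to the output above, I need one more column, Z, next to Outcome_1%:
Z_Age = log(Outcome_1% / Outcome_0%)
Then I want to map each variable's Z value back onto the original df according to the category in each row, giving the following layout (Z_Age and Z_Cat still to be filled in):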
Age_Bin Cat_Bin Outcome Z_Age Z_Cat
Age1 Cat2 0
Age1 Cat1 1
Age2 Cat2 1
Age1 Cat1 1
Age2 Cat1 0
Age3 Cat1 0
Age3 Cat2 0
Age1 Cat1 1
Age3 Cat2 1
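To make the target values concrete, here is the Z calculation done by hand for the three Age_Bin categories, using the shares from the summary table in the question. The natural logarithm is assumed, since the question does not state the base.

```python
import math

# (Outcome_0%, Outcome_1%) per Age_Bin, taken from the summary table in the question.
shares = {"Age1": (0.25, 0.6), "Age2": (0.25, 0.2), "Age3": (0.50, 0.2)}

# Z = log(Outcome_1% / Outcome_0%), natural logarithm assumed.
z_age = {age: math.log(p1 / p0) for age, (p0, p1) in shares.items()}

for age, z in z_age.items():
    print(age, round(z, 6))
# Age1 0.875469
# Age2 -0.223144
# Age3 -0.916291
```

A positive Z means the category holds a larger share of the Outcome 1 rows than of the Outcome 0 rows, and a negative Z means the opposite.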
Answer 0 (score: 0)
Use:
df1 = (
    df.groupby(['Age_Bin', 'Outcome'])['Cat_Bin']
      .size()
      .unstack(fill_value=0)
      .add_prefix('Outcome_')
)
df2 = df1.assign(Total_cnt=lambda x: x.sum(1)).join(df1.div(df1.sum()).add_suffix('%'))
print(df2)
Outcome  Outcome_0  Outcome_1  Total_cnt  Outcome_0%  Outcome_1%
Age_Bin
Age1             1          3          4        0.25         0.6
Age2             1          1          2        0.25         0.2
Age3             2          1          3        0.50         0.2
Then add the Z column (NumPy is needed for the logarithm):
import numpy as np

# Z as defined in the question: log(Outcome_1% / Outcome_0%)
df2 = df2.assign(Z_age=np.log(df2['Outcome_1%'] / df2['Outcome_0%']))
print(df2)
Outcome  Outcome_0  Outcome_1  Total_cnt  Outcome_0%  Outcome_1%     Z_age
Age_Bin
Age1             1          3          4        0.25         0.6  0.875469
Age2             1          1          2        0.25         0.2 -0.223144
Age3             2          1          3        0.50         0.2 -0.916291
# Map the new column back by Age_Bin. Z_Cat cannot be mapped from df2,
# because df2 is grouped by Age_Bin only and holds no Cat_Bin information.
df['Z_Age'] = df['Age_Bin'].map(df2['Z_age'])
print(df)
  Age_Bin Cat_Bin  Outcome     Z_Age
0    Age1    Cat2        0  0.875469
1    Age1    Cat1        1  0.875469
2    Age2    Cat2        1 -0.223144
3    Age1    Cat1        1  0.875469
4    Age2    Cat1        0 -0.223144
5    Age3    Cat1        0 -0.916291
6    Age3    Cat2        0 -0.916291
7    Age1    Cat1        1  0.875469
8    Age3    Cat2        1 -0.916291
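To also get Z_Cat, and to keep one summary data frame per variable as the question title asks, the same steps could be wrapped in a function and run once per column. The sketch below is one possible way to do it, not part of the original answer: `outcome_summary` and `summaries` are hypothetical names, Z follows the question's definition log(Outcome_1% / Outcome_0%), and it assumes a binary Outcome in which both 0 and 1 occur.

```python
import numpy as np
import pandas as pd


def outcome_summary(df, var, outcome="Outcome"):
    """Hypothetical helper: counts, column shares and Z for one variable."""
    counts = (
        df.groupby([var, outcome])
        .size()
        .unstack(fill_value=0)
        .add_prefix("Outcome_")
    )
    # Share of each outcome column that falls into each category.
    shares = counts.div(counts.sum()).add_suffix("%")
    out = counts.assign(Total_cnt=counts.sum(axis=1)).join(shares)
    # Z as defined in the question; a category with a zero count gives +/-inf.
    out["Z"] = np.log(out["Outcome_1%"] / out["Outcome_0%"])
    return out


df = pd.DataFrame({
    "Age_Bin": ["Age1", "Age1", "Age2", "Age1", "Age2", "Age3", "Age3", "Age1", "Age3"],
    "Cat_Bin": ["Cat2", "Cat1", "Cat2", "Cat1", "Cat1", "Cat1", "Cat2", "Cat1", "Cat2"],
    "Outcome": [0, 1, 1, 1, 0, 0, 0, 1, 1],
})

# One summary data frame per variable, keyed by column name.
summaries = {var: outcome_summary(df, var) for var in ["Age_Bin", "Cat_Bin"]}

# Map each variable's Z back onto the original rows.
df["Z_Age"] = df["Age_Bin"].map(summaries["Age_Bin"]["Z"])
df["Z_Cat"] = df["Cat_Bin"].map(summaries["Cat_Bin"]["Z"])
print(df)
```

Keeping the summaries in a dict avoids overwriting `df`, so the original rows stay available for the final `map` step. If some category never has Outcome 0 or never has Outcome 1, Z becomes infinite, and smoothing the counts before taking the ratio may be worth considering.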