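How to calculate a summary of a variable and save it as a DataFrame in Python

Time: 2018-01-24 12:19:01

Tags: python pandas numpy

How to calculate a summary for each variable and save it as a separate DataFrame per variable in Python

I have a pandas DataFrame: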

Age_Bin Cat_Bin Outcome
  Age1    Cat2     0
  Age1    Cat1     1
  Age2    Cat2     1
  Age1    Cat1     1
  Age2    Cat1     0
  Age3    Cat1     0
  Age3    Cat2     0
  Age1    Cat1     1
  Age3    Cat2     1

Using the code given below, I compute a summary of the outcome distribution for each variable, as shown here.

Example for the Age_Bin variable:

Age_Bin Outcome_0_cnt Outcome_1_cnt Total_cnt Outcome_0_cnt% Outcome_1_cnt%
  Age1         1         3           4           1/4            3/5
  Age2         1         1           2           1/4            1/5
  Age3         2         1           3           2/4            1/5

I achieved this with the following code:

    df1 = (
        df.groupby(['Age_Bin','Outcome'])['Cat_Bin']
          .size()
          .unstack(fill_value=0)
          .add_prefix('Outcome_')
    )
    df = df1.assign(Total_cnt=lambda x: x.sum(1)).join(df1.div(df1.sum()).add_suffix('%'))

    print (df) 

    Outcome  Outcome_0  Outcome_1  Total_cnt  Outcome_0%  Outcome_1%
    Age_Bin
    Age1             1          3          4        0.25         0.6
    Age2             1          1          2        0.25         0.2
    Age3             2          1          3        0.50         0.2

In addition to the output above, I need to add one more column, Z, next to Outcome_1%:

Z_Age= log(Outcome_1%/Outcome_0%).
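
As a quick sanity check of this formula, the Z value for Age1 can be worked out by hand from the summary table above (Outcome_1% = 3/5, Outcome_0% = 1/4). This is a minimal sketch that assumes the natural logarithm is intended; the variable names are illustrative only.

```python
import math

# Column shares for Age1, taken from the summary table above.
outcome_0_pct = 1 / 4  # 1 of the 4 rows with Outcome == 0
outcome_1_pct = 3 / 5  # 3 of the 5 rows with Outcome == 1

# Z_Age = log(Outcome_1% / Outcome_0%), natural log assumed.
z_age1 = math.log(outcome_1_pct / outcome_0_pct)
print(round(z_age1, 6))
```

A positive value means the category holds a larger share of the Outcome 1 rows than of the Outcome 0 rows; a negative value means the opposite.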

Then I want to map each variable's Z value back onto the original df, according to the category in each row:

     Age_Bin Cat_Bin Outcome Z_Age Z_Cat
      Age1    Cat2     0
      Age1    Cat1     1
      Age2    Cat2     1
      Age1    Cat1     1
      Age2    Cat1     0
      Age3    Cat1     0
      Age3    Cat2     0
      Age1    Cat1     1
      Age3    Cat2     1

1 answer:

Answer 0: (score: 0)

Use:

df1 = (
       df.groupby(['Age_Bin','Outcome'])['Cat_Bin']
         .size()
         .unstack(fill_value=0)
         .add_prefix('Outcome_')
      )

df2 = df1.assign(Total_cnt=lambda x: x.sum(1)).join(df1.div(df1.sum()).add_suffix('%'))
print (df2)
Outcome  Outcome_0  Outcome_1  Total_cnt  Outcome_0%  Outcome_1%
Age_Bin                                                         
Age1             1          3          4        0.25         0.6
Age2             1          1          2        0.25         0.2
Age3             2          1          3        0.50         0.2
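
The same summary table can probably also be built with `pd.crosstab`, which counts the combinations directly and can normalize each column with `normalize='columns'`. The sketch below is an alternative, not the answer's method; it assumes the sample data from the question, and the names `counts`, `shares`, and `summary` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Age_Bin': ['Age1', 'Age1', 'Age2', 'Age1', 'Age2', 'Age3', 'Age3', 'Age1', 'Age3'],
    'Cat_Bin': ['Cat2', 'Cat1', 'Cat2', 'Cat1', 'Cat1', 'Cat1', 'Cat2', 'Cat1', 'Cat2'],
    'Outcome': [0, 1, 1, 1, 0, 0, 0, 1, 1],
})

# Count of each Outcome value per Age_Bin.
counts = pd.crosstab(df['Age_Bin'], df['Outcome']).add_prefix('Outcome_')

# Share of each Outcome column that falls in each Age_Bin (each column sums to 1).
shares = pd.crosstab(df['Age_Bin'], df['Outcome'], normalize='columns')
shares = shares.add_prefix('Outcome_').add_suffix('%')

# Row totals plus the column shares, joined on the Age_Bin index.
summary = counts.assign(Total_cnt=counts.sum(axis=1)).join(shares)
print(summary)
```
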

Then add:

import numpy as np

df2 = df2.assign(Z_age=np.log(df2['Outcome_1%'] / df2['Outcome_0%']))
print (df2)
Outcome  Outcome_0  Outcome_1  Total_cnt  Outcome_0%  Outcome_1%     Z_age
Age_Bin                                                                   
Age1             1          3          4        0.25         0.6  0.875469
Age2             1          1          2        0.25         0.2 -0.223144
Age3             2          1          3        0.50         0.2 -0.916291

# Map the new column by Age_Bin; Z_Cat cannot be mapped here because df2 holds no Cat_Bin information
df['Z_Age'] = df['Age_Bin'].map(df2['Z_age'])
print (df)
  Age_Bin Cat_Bin  Outcome     Z_Age
0    Age1    Cat2        0  0.875469
1    Age1    Cat1        1  0.875469
2    Age2    Cat2        1 -0.223144
3    Age1    Cat1        1  0.875469
4    Age2    Cat1        0 -0.223144
5    Age3    Cat1        0 -0.916291
6    Age3    Cat2        0 -0.916291
7    Age1    Cat1        1  0.875469
8    Age3    Cat2        1 -0.916291
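
To fill Z_Cat as well, the same steps can be repeated with Cat_Bin as the grouping column. The sketch below wraps them in a helper so that both columns are handled in one loop. It follows the question's definition, Z = log(Outcome_1% / Outcome_0%), and assumes the sample data from the question; `z_by_category` is a hypothetical name, not a pandas function.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Age_Bin': ['Age1', 'Age1', 'Age2', 'Age1', 'Age2', 'Age3', 'Age3', 'Age1', 'Age3'],
    'Cat_Bin': ['Cat2', 'Cat1', 'Cat2', 'Cat1', 'Cat1', 'Cat1', 'Cat2', 'Cat1', 'Cat2'],
    'Outcome': [0, 1, 1, 1, 0, 0, 0, 1, 1],
})

def z_by_category(data, col, target='Outcome'):
    """Hypothetical helper: log(Outcome_1% / Outcome_0%) per category of `col`."""
    counts = data.groupby([col, target]).size().unstack(fill_value=0)
    shares = counts.div(counts.sum())  # share of each Outcome column per category
    # A category with a zero count in either column yields -inf or +inf here.
    return np.log(shares[1] / shares[0])

# Map each variable's Z values back onto the original rows.
for col, z_name in [('Age_Bin', 'Z_Age'), ('Cat_Bin', 'Z_Cat')]:
    df[z_name] = df[col].map(z_by_category(df, col))

print(df)
```

Because the helper returns a Series indexed by category, `Series.map` can look up each row's value directly, and no merge is needed.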