Creating a summary table with pandas

Time: 2016-11-05 20:49:20

Tags: python pandas dataframe group-by

How can I use pandas to build a summary table from the following data?

ID  Condition   Confirmed
D0119   Bad Yes
D0119   Good    No
D0117   Bad Yes
D0110   Bad Undefined
D1011   Bad Yes
D1011   Good    Yes
D1001   Bad Yes
D1001   Bad Yes

Required output:

ID      Condition   Confirmed   %Bad
D0119   Bad,Good    Yes, No     50
D0117   Bad         Yes         100
D0110   Bad         Undefined   0
D1011   Bad,Good    Yes, Yes
D1001   Bad,Bad     Yes, Yes    100

Can anyone help? Thanks.

2 answers:

Answer 0 (score: 1)

You can do it like this:

In [123]: (df.assign(Bad=df.Condition=='Bad')
     ...:    .groupby('ID')
     ...:    .agg({'Condition':pd.Series.tolist,
     ...:          'Confirmed':pd.Series.tolist,
     ...:          'Bad':'mean'})
     ...: )
     ...:
Out[123]:
       Bad    Condition    Confirmed
ID
D0110  1.0        [Bad]  [Undefined]
D0117  1.0        [Bad]        [Yes]
D0119  0.5  [Bad, Good]    [Yes, No]
D1001  1.0   [Bad, Bad]   [Yes, Yes]
D1011  0.5  [Bad, Good]   [Yes, Yes]
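If the comma-joined strings and a 0-100 percentage from the question are wanted instead of lists and a fraction, the same idea can be sketched with named aggregation. This is only a sketch: the column name `pct_bad` is my own choice, and it assumes pandas 0.25 or later (where named aggregation is available) and that %Bad means the share of 'Bad' rows per ID.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['D0119', 'D0119', 'D0117', 'D0110', 'D1011', 'D1011', 'D1001', 'D1001'],
    'Condition': ['Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Good', 'Bad', 'Bad'],
    'Confirmed': ['Yes', 'No', 'Yes', 'Undefined', 'Yes', 'Yes', 'Yes', 'Yes'],
})

summary = (
    df.assign(Bad=df['Condition'].eq('Bad'))   # boolean helper column
      .groupby('ID', sort=False)               # sort=False keeps first-appearance order
      .agg(Condition=('Condition', ','.join),  # e.g. 'Bad,Good'
           Confirmed=('Confirmed', ', '.join), # e.g. 'Yes, No'
           pct_bad=('Bad', 'mean'))            # fraction of 'Bad' rows per ID
)
summary['pct_bad'] = summary['pct_bad'] * 100  # fraction -> percentage
print(summary)
```

Note that this counts 'Undefined' rows like any other row, so D0110 comes out as 100 here rather than the 0 shown in the question.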

Vertical variant, which keeps one row per record and adds the percentage as a column:

In [113]: df
Out[113]:
      ID Condition  Confirmed
0  D0119       Bad        Yes
1  D0119      Good         No
2  D0117       Bad        Yes
3  D0110       Bad  Undefined
4  D1011       Bad        Yes
5  D1011      Good        Yes
6  D1001       Bad        Yes
7  D1001       Bad        Yes

In [114]: g = df.assign(Bad=df.Condition=='Bad').groupby('ID')

In [115]: df['Bad'] = df['ID'].map(g['Bad'].mean() * 100)

In [116]: df
Out[116]:
      ID Condition  Confirmed    Bad
0  D0119       Bad        Yes   50.0
1  D0119      Good         No   50.0
2  D0117       Bad        Yes  100.0
3  D0110       Bad  Undefined  100.0
4  D1011       Bad        Yes   50.0
5  D1011      Good        Yes   50.0
6  D1001       Bad        Yes  100.0
7  D1001       Bad        Yes  100.0
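The per-row column can also be computed in one step with `transform`, which broadcasts each group's result back to the original rows. A minimal sketch, assuming the same `df` as above; the intermediate name `is_bad` is mine:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['D0119', 'D0119', 'D0117', 'D0110', 'D1011', 'D1011', 'D1001', 'D1001'],
    'Condition': ['Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Good', 'Bad', 'Bad'],
    'Confirmed': ['Yes', 'No', 'Yes', 'Undefined', 'Yes', 'Yes', 'Yes', 'Yes'],
})

# 1.0 where the condition is 'Bad', 0.0 otherwise
is_bad = df['Condition'].eq('Bad').astype(float)

# transform('mean') returns one value per original row, aligned on df's index
df['Bad'] = is_bad.groupby(df['ID']).transform('mean') * 100
print(df)
```

This avoids building the grouped frame and mapping it back by ID.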

Answer 1 (score: 1)

Consider the following.

import pandas as pd

df = pd.DataFrame({'ID':['D0119', 'D0119', 'D0117', 'D0110', 'D1011', 'D1011', 'D1001', 'D1001'],
                   'Condition':['Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Good', 'Bad', 'Bad'],
                   'Confirmed':['Yes', 'No', 'Yes', 'Undefined', 'Yes', 'Yes', 'Yes', 'Yes']})

# Drop records whose Confirmed value is 'Undefined', then group the rest by ID
df_grp = df.loc[df['Confirmed'] != 'Undefined'].groupby('ID')
summary = pd.DataFrame({
    # Comma-separated conditions per ID, e.g. 'Bad,Good'
    'Condition': df_grp['Condition'].agg(','.join),
    # Fraction of 'Bad' records per ID (multiply by 100 for a percentage)
    'pnt_bad': df_grp['Condition'].apply(lambda x: (x == 'Bad').mean())})

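Note that this approach drops any ID whose records all have the 'Undefined' status (here, D0110), so those IDs do not appear in the summary.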
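If IDs that only have 'Undefined' records should still show up, one option is to reindex the result against every ID in the original frame. The sketch below is an assumption on my part rather than something the question spells out: it fills such IDs with 0, which is what the required output lists for D0110. The names `defined`, `all_ids` and `pct_bad` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['D0119', 'D0119', 'D0117', 'D0110', 'D1011', 'D1011', 'D1001', 'D1001'],
    'Condition': ['Bad', 'Good', 'Bad', 'Bad', 'Bad', 'Good', 'Bad', 'Bad'],
    'Confirmed': ['Yes', 'No', 'Yes', 'Undefined', 'Yes', 'Yes', 'Yes', 'Yes'],
})

# Keep only records with a defined confirmation status
defined = df[df['Confirmed'] != 'Undefined']

# Percentage of 'Bad' records per ID among the defined records
pct_bad = (defined['Condition'].eq('Bad').astype(float)
           .groupby(defined['ID']).mean() * 100)

# Bring back every ID from the original data; IDs with no defined
# records get 0.0 (assumed from the question's expected output)
all_ids = pd.Index(df['ID'].unique(), name='ID')
pct_bad = pct_bad.reindex(all_ids, fill_value=0.0)
print(pct_bad)
```

Using `fill_value=float('nan')` instead would mark those IDs as missing rather than as 0, which may be the safer choice if 0 could be mistaken for a measured value.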