Pandas MultiIndex with multiple aggregate functions

Time: 2014-09-02 10:52:45

Tags: python pandas

Given this sample DataFrame:

+------+--------+------+-------+------+--------+
| NAME |  JOB   | YEAR | MONTH | DAYS | SALARY |
+------+--------+------+-------+------+--------+
| Bob  | Worker | 2013 |    12 |    3 |     17 |
| Mary | Employ | 2013 |    12 |    5 |     23 |
| Bob  | Worker | 2014 |     1 |   10 |    100 |
| Bob  | Worker | 2014 |     1 |   11 |    110 |
| Mary | Employ | 2014 |     1 |   15 |    200 |
| Bob  | Worker | 2014 |     2 |    8 |     80 |
| Mary | Employ | 2014 |     2 |    5 |    190 |
+------+--------+------+-------+------+--------+

Is there a simple way to get output like the following without building all the pivot pieces by hand?

index=JOB,MAX(YEAR),NAME,SUM(DAYS)  
columns=MONTH  
values=SUM(SALARY)

                                +-----------+-------------+-------------+
                                |     MONTH |           1 |           2 |
    +--------+-----------+------+-----------+-------------+-------------+
    |  JOB   | MAX(YEAR) | NAME | SUM(DAYS) | SUM(SALARY) | SUM(SALARY) |
    +--------+-----------+------+-----------+-------------+-------------+
    | Employ |      2014 | Mary |        20 |         200 |         190 |
    | Worker |      2014 | Bob  |        29 |         210 |          80 |
    +--------+-----------+------+-----------+-------------+-------------+

1 answer:

Answer 0 (score: 1)

Starting from:

In [179]: df
Out[179]: 
   NAME     JOB  YEAR  MONTH  DAYS  SALARY
0   Bob  Worker  2013     12     3      17
1  Mary  Employ  2013     12     5      23
2   Bob  Worker  2014      1    10     100
3   Bob  Worker  2014      1    11     110
4  Mary  Employ  2014      1    15     200
5   Bob  Worker  2014      2     8      80
6  Mary  Employ  2014      2     5     190
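
For readers who want to follow along, here is a minimal sketch that rebuilds the sample frame. The values are copied from the question's table; the constructor call itself is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})
```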

We can get most of the data we want using:

result = df.groupby(['JOB', 'NAME', 'MONTH', 'YEAR']).sum().reset_index(['MONTH'])

#                   MONTH  DAYS  SALARY
# JOB    NAME YEAR                     
# Employ Mary 2014      1    15     200
#             2014      2     5     190
#             2013     12     5      23
# Worker Bob  2014      1    21     210
#             2014      2     8      80
#             2013     12     3      17

To this we add the total DAYS per (JOB, NAME, YEAR):

total_days = df.groupby(['JOB', 'NAME', 'YEAR'])[['DAYS']].sum()
total_days.columns = ['SUM(DAYS)']

#                   SUM(DAYS)
# JOB    NAME YEAR           
# Employ Mary 2013          5
#             2014         20
# Worker Bob  2013          3
#             2014         29

result = result.join(total_days)
del result['DAYS']
#                   MONTH  SALARY  SUM(DAYS)
# JOB    NAME YEAR                          
# Employ Mary 2013     12      23          5
#             2014      1     200         20
#             2014      2     190         20
# Worker Bob  2013     12      17          3
#             2014      1     210         29
#             2014      2      80         29
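
As an aside, the two steps so far could probably also be written without a join, by broadcasting the yearly total back onto each monthly row with `transform('sum')`. This is only a sketch: the name `monthly` is illustrative and not from the answer, and the frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})

# Sum DAYS and SALARY per (JOB, NAME, YEAR, MONTH); keys stay ordinary columns.
monthly = df.groupby(['JOB', 'NAME', 'YEAR', 'MONTH'],
                     as_index=False)[['DAYS', 'SALARY']].sum()

# Broadcast the per-(JOB, NAME, YEAR) total of DAYS onto every monthly row.
monthly['SUM(DAYS)'] = (
    monthly.groupby(['JOB', 'NAME', 'YEAR'])['DAYS'].transform('sum')
)

# The per-month DAYS column is no longer needed.
monthly = monthly.drop(columns='DAYS')
```

Unlike the answer's `result`, `monthly` keeps JOB, NAME and YEAR as columns, so it would need `set_index` before it could be used in the index-based join that follows.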

To select the rows associated with max(YEAR), we first compute:

max_year = df.groupby(['JOB', 'NAME'])[['YEAR']].max()
max_year = max_year.set_index('YEAR', drop=False, append=True)

#                   YEAR
# JOB    NAME YEAR      
# Employ Mary 2014  2014
# Worker Bob  2014  2014

The selection can then be expressed as a left join:

result = max_year.join(result)
del result['YEAR']

#                   MONTH  SALARY  SUM(DAYS)
# JOB    NAME YEAR                          
# Employ Mary 2014      1     200         20
#             2014      2     190         20
# Worker Bob  2014      1     210         29
#             2014      2      80         29
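
The same selection can also be sketched as a boolean mask on the raw frame, comparing each row's YEAR with the maximum YEAR of its (JOB, NAME) group. This is an alternative to the join, not what the answer does; `latest` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'NAME':   ['Bob', 'Mary', 'Bob', 'Bob', 'Mary', 'Bob', 'Mary'],
    'JOB':    ['Worker', 'Employ', 'Worker', 'Worker', 'Employ', 'Worker', 'Employ'],
    'YEAR':   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    'MONTH':  [12, 12, 1, 1, 1, 2, 2],
    'DAYS':   [3, 5, 10, 11, 15, 8, 5],
    'SALARY': [17, 23, 100, 110, 200, 80, 190],
})

# Per-row maximum YEAR within each (JOB, NAME) group, aligned to df's index.
max_year_per_row = df.groupby(['JOB', 'NAME'])['YEAR'].transform('max')

# Keep only the rows that belong to each group's most recent year.
latest = df[df['YEAR'] == max_year_per_row]
```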

Now we can move MONTH into a hierarchical column level like this:

result = result.set_index(['SUM(DAYS)', 'MONTH'], append=True)
result = result.unstack('MONTH')
result = result.reset_index(['SUM(DAYS)'])

This yields:

                  SUM(DAYS)  SALARY     
MONTH                             1    2
JOB    NAME YEAR                        
Employ Mary 2014         20     200  190
Worker Bob  2014         29     210   80
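
If SUM(DAYS) should sit in the row index, as in the layout the question asks for, the final reshaping step could instead be sketched with `pivot_table`. The frame `rows` below is an assumption for illustration: it holds the four rows left after the max(YEAR) selection, typed in as flat columns rather than taken from `result`.

```python
import pandas as pd

# The four rows remaining after the max(YEAR) selection, as flat columns.
rows = pd.DataFrame({
    'JOB':       ['Employ', 'Employ', 'Worker', 'Worker'],
    'NAME':      ['Mary', 'Mary', 'Bob', 'Bob'],
    'YEAR':      [2014, 2014, 2014, 2014],
    'MONTH':     [1, 2, 1, 2],
    'SALARY':    [200, 190, 210, 80],
    'SUM(DAYS)': [20, 20, 29, 29],
})

# MONTH becomes the column axis; SALARY is summed within each cell.
table = rows.pivot_table(
    index=['JOB', 'YEAR', 'NAME', 'SUM(DAYS)'],
    columns='MONTH',
    values='SALARY',
    aggfunc='sum',
)
```

Here the columns are a single level (the MONTH values 1 and 2) rather than the two-level header produced by `unstack`, which may or may not be preferable depending on how the table is consumed.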