Given this sample DataFrame:
+------+--------+------+-------+------+--------+
| NAME | JOB    | YEAR | MONTH | DAYS | SALARY |
+------+--------+------+-------+------+--------+
| Bob  | Worker | 2013 | 12    | 3    | 17     |
| Mary | Employ | 2013 | 12    | 5    | 23     |
| Bob  | Worker | 2014 | 1     | 10   | 100    |
| Bob  | Worker | 2014 | 1     | 11   | 110    |
| Mary | Employ | 2014 | 1     | 15   | 200    |
| Bob  | Worker | 2014 | 2     | 8    | 80     |
| Mary | Employ | 2014 | 2     | 5    | 190    |
+------+--------+------+-------+------+--------+
Is there a simple way to get output like the following without building all the pivot pieces by hand?
index=JOB,MAX(YEAR),NAME,SUM(DAYS)
columns=MONTH
values=SUM(SALARY)
                            +-----------+-------------+-------------+
                            | MONTH     | 1           | 2           |
+--------+-----------+------+-----------+-------------+-------------+
| JOB    | MAX(YEAR) | NAME | SUM(DAYS) | SUM(SALARY) | SUM(SALARY) |
+--------+-----------+------+-----------+-------------+-------------+
| Employ | 2014      | Mary | 20        | 200         | 190         |
| Worker | 2014      | Bob  | 29        | 210         | 80          |
+--------+-----------+------+-----------+-------------+-------------+
Answer 0 (score: 1)
Starting with:
In [179]: df
Out[179]:
   NAME     JOB  YEAR  MONTH  DAYS  SALARY
0   Bob  Worker  2013     12     3      17
1  Mary  Employ  2013     12     5      23
2   Bob  Worker  2014      1    10     100
3   Bob  Worker  2014      1    11     110
4  Mary  Employ  2014      1    15     200
5   Bob  Worker  2014      2     8      80
6  Mary  Employ  2014      2     5     190
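To reproduce the steps that follow, the frame can be rebuilt from the question's table. This is a minimal sketch: the original post does not show how `df` was created, so the constructor call and the integer dtypes are assumptions.

```python
import pandas as pd

# Sample data copied from the table in the question.
# Assumption: the numeric columns are plain integers.
df = pd.DataFrame({
    "NAME":   ["Bob", "Mary", "Bob", "Bob", "Mary", "Bob", "Mary"],
    "JOB":    ["Worker", "Employ", "Worker", "Worker", "Employ", "Worker", "Employ"],
    "YEAR":   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    "MONTH":  [12, 12, 1, 1, 1, 2, 2],
    "DAYS":   [3, 5, 10, 11, 15, 8, 5],
    "SALARY": [17, 23, 100, 110, 200, 80, 190],
})
```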
We can get most of the data we want with:
result = df.groupby(['JOB', 'NAME', 'MONTH', 'YEAR']).sum().reset_index(['MONTH'])
#                   MONTH  DAYS  SALARY
# JOB    NAME YEAR
# Employ Mary 2014      1    15     200
#             2014      2     5     190
#             2013     12     5      23
# Worker Bob  2014      1    21     210
#             2014      2     8      80
#             2013     12     3      17
To this we add the sum of days per (JOB, NAME, YEAR):
total_days = df.groupby(['JOB', 'NAME', 'YEAR'])[['DAYS']].sum()
total_days.columns = ['SUM(DAYS)']
#                   SUM(DAYS)
# JOB    NAME YEAR
# Employ Mary 2013          5
#             2014         20
# Worker Bob  2013          3
#             2014         29
result = result.join(total_days)
del result['DAYS']
#                   MONTH  SALARY  SUM(DAYS)
# JOB    NAME YEAR
# Employ Mary 2013     12      23          5
#             2014      1     200         20
#             2014      2     190         20
# Worker Bob  2013     12      17          3
#             2014      1     210         29
#             2014      2      80         29
To select the rows associated with max(YEAR), we compute
max_year = df.groupby(['JOB', 'NAME'])[['YEAR']].max()
max_year = max_year.set_index('YEAR', drop=False, append=True)
#                   YEAR
# JOB    NAME YEAR
# Employ Mary 2014  2014
# Worker Bob  2014  2014
so that the selection can be expressed as a left join:
result = max_year.join(result)
del result['YEAR']
#                   MONTH  SALARY  SUM(DAYS)
# JOB    NAME YEAR
# Employ Mary 2014      1     200         20
#             2014      2     190         20
# Worker Bob  2014      1     210         29
#             2014      2      80         29
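The reason the join acts as a filter is that `DataFrame.join` defaults to `how='left'` and aligns on the shared index levels, so only keys present in the left frame survive. A toy sketch of that behaviour follows; the frames `left` and `right` and the yearly salary totals are illustrative values derived from the sample data, not part of the answer's code.

```python
import pandas as pd

names = ["JOB", "NAME", "YEAR"]

# Right side: one row per (JOB, NAME, YEAR), holding yearly salary totals.
right = pd.DataFrame(
    {"SALARY": [23, 390, 17, 290]},
    index=pd.MultiIndex.from_tuples(
        [("Employ", "Mary", 2013), ("Employ", "Mary", 2014),
         ("Worker", "Bob", 2013), ("Worker", "Bob", 2014)],
        names=names,
    ),
)

# Left side: only the keys we want to keep (the latest year per person).
left = pd.DataFrame(
    {"YEAR_COPY": [2014, 2014]},
    index=pd.MultiIndex.from_tuples(
        [("Employ", "Mary", 2014), ("Worker", "Bob", 2014)],
        names=names,
    ),
)

# how="left" is the default: rows of `right` whose key is absent from
# `left` (the 2013 rows) are dropped.
selected = left.join(right)
```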
Now we can move MONTH into a hierarchical column level like this:
result = result.set_index(['SUM(DAYS)', 'MONTH'], append=True)
result = result.unstack('MONTH')
result = result.reset_index(['SUM(DAYS)'])
which yields
                  SUM(DAYS) SALARY
MONTH                            1    2
JOB    NAME YEAR
Employ Mary 2014         20    200  190
Worker Bob  2014         29    210   80
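For comparison, the same table can probably be reached more compactly. The sketch below is not the method used in this answer: it swaps the joins for `groupby(...).transform` to filter each (JOB, NAME) group to its latest year, then calls `pivot_table`. The names `keys`, `latest`, and `out` are illustrative, and the frame is rebuilt here only so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "NAME":   ["Bob", "Mary", "Bob", "Bob", "Mary", "Bob", "Mary"],
    "JOB":    ["Worker", "Employ", "Worker", "Worker", "Employ", "Worker", "Employ"],
    "YEAR":   [2013, 2013, 2014, 2014, 2014, 2014, 2014],
    "MONTH":  [12, 12, 1, 1, 1, 2, 2],
    "DAYS":   [3, 5, 10, 11, 15, 8, 5],
    "SALARY": [17, 23, 100, 110, 200, 80, 190],
})

keys = ["JOB", "NAME"]

# Keep only the rows that fall in each (JOB, NAME) group's latest year.
latest = df[df["YEAR"] == df.groupby(keys)["YEAR"].transform("max")].copy()

# Total days per group within that year, broadcast back onto every row.
latest["SUM(DAYS)"] = latest.groupby(keys)["DAYS"].transform("sum")

# Sum SALARY per month; the index order follows the question's request.
out = latest.pivot_table(
    index=["JOB", "YEAR", "NAME", "SUM(DAYS)"],
    columns="MONTH",
    values="SALARY",
    aggfunc="sum",
)
print(out)
```

One difference from the step-by-step version is that SUM(DAYS) ends up as an index level rather than a regular column; `out.reset_index('SUM(DAYS)')` would move it back if that layout is preferred.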