I have the following DataFrame:
duid start_date end_date
0 b2919f1eb 2019-08-26 2019-09-05
1 e372dedd4 2019-08-26 NaT
2 ba8147ce9 2019-09-09 2019-11-05
3 902c56036 2019-09-13 2019-10-01
4 16ec096a7 2019-09-17 2019-10-02
5 1faac1a15 2019-09-17 NaT
6 319fb59f5 2019-09-24 2020-01-20
7 2a3f1dac5 2019-10-01 NaT
8 aecbcf0c5 2019-10-01 2019-11-05
9 0ee088b63 2019-10-08 2019-10-03
10 c0c02fa4c 2019-10-31 2019-10-31
12 aac5fbc7d 2019-11-05 2019-11-05
11 c76bc248a 2019-11-05 2019-11-29
13 20dcef410 2019-11-12 NaT
14 bc7ea631d 2019-11-12 NaT
15 786af275b 2019-11-12 2019-11-12
16 005ec00c8 2019-11-15 NaT
17 482462695 2019-11-19 NaT
18 ecba54e5d 2019-11-26 NaT
19 28490c52f 2019-12-17 NaT
20 02f2f7f4b 2020-01-15 NaT
21 0ea659d1a 2020-01-29 NaT
22 0b78caca1 2020-01-29 NaT
23 368cc8744 2020-01-29 2020-01-29
This table records each employee's hire date and departure date. So far I have managed to compute the monthly counts:
df.groupby(df['start_date'].dt.strftime('%Y %B')) \
.agg(hired=('start_date', 'size'), left=('end_date', 'count')) \
.reset_index()
start_date hired left
0 2019 August 2 1
1 2019 December 1 0
2 2019 November 8 3
3 2019 October 4 3
4 2019 September 5 4
5 2020 January 4 1
I also tried to compute the cumulative sum for each date, but the result looks strange:
ds = df.groupby(df['start_date'].dt.strftime('%Y %B'))
ds.size().cumsum()
start_date
2019 August 2
2019 December 3
2019 November 11
2019 October 15
2019 September 20
2020 January 24
dtype: int64
And the cumulative count of those who left:
de = df.groupby(df['end_date'].dt.strftime('%Y %B'))
de.size().cumsum()
end_date
2019 November 5
2019 October 9
2019 September 10
2020 January 12
dtype: int64
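There is also a sorting issue: I don't know why the table is not sorted by start_date, but that matters less to me than computing the difference between the two values: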
df = df.sort_values('start_date')
How can I combine the cumulative values of the two columns start_date and end_date to get the following result?
start_date hired left roster
0 2019 August 2 1 1
1 2019 September 5 4 2
2 2019 October 4 3 3
3 2019 November 8 3 8
4 2019 December 1 0 9
5 2020 January 4 1 12
Answer 0 (score: 2)
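You will probably find it easier to keep the grouping key as a datetime-like object and reformat it only at the end, so that sorting works correctly (for example, pd.Grouper with a frequency, or .to_period(...)).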
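To illustrate the sorting point, here is a minimal sketch on four made-up dates (an illustrative sample, not the question's data). It assumes an English locale for `%B`. Keys formatted as text sort alphabetically, while monthly `Period` keys sort chronologically.

```python
import pandas as pd

# Hypothetical sample: four dates in different months, crossing a year boundary.
dates = pd.Series(pd.to_datetime(
    ["2019-08-26", "2019-09-09", "2019-12-17", "2020-01-15"]
))

# Text keys compare character by character, so "December" sorts before "September".
as_text = sorted(dates.dt.strftime("%Y %B"))

# Period keys compare by the time span they represent.
as_period = sorted(dates.dt.to_period("M"))
as_period_text = [p.strftime("%Y %B") for p in as_period]

print(as_text)
print(as_period_text)
```

This is why the grouped output in the question lists December before November: groupby sorts its keys, and the keys there are strings.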
First compute your initial aggregates, then sort by the group index to make sure the rows are in chronological order:
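import pandas as pd

# Group start_date into month-end bins; on pandas >= 2.2 use freq='ME', since 'M' is deprecated there.
agg = (
    df.groupby(pd.Grouper(key='start_date', freq='M'))['end_date']
    # 'size' counts every row (hires); 'count' skips NaT (rows with an end date)
    .agg(hired='size', left='count')
    .sort_index()
)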
Then assign a new column holding the running total of the roster:
agg['roster'] = agg['hired'].cumsum() - agg['left'].cumsum()
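As a small worked check of that arithmetic, the sketch below uses only the first three monthly counts from the question, on a hand-built frame (the variable names `roster` and `roster_alt` are illustrative). It also shows that the difference of the two running totals equals the running total of the net monthly change.

```python
import pandas as pd

# First three months of hired/left counts from the question's table.
agg = pd.DataFrame(
    {"hired": [2, 5, 4], "left": [1, 4, 3]},
    index=pd.period_range("2019-08", periods=3, freq="M"),
)

# Difference of the two running totals: [2, 7, 11] - [1, 5, 8].
roster = agg["hired"].cumsum() - agg["left"].cumsum()

# Equivalent form: running total of the net change per month.
roster_alt = (agg["hired"] - agg["left"]).cumsum()

print(roster.tolist())  # [1, 2, 3]
```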
Then reformat the index and reset it, for example:
agg = agg.set_index(agg.index.strftime('%Y %B')).reset_index()
which gives you:
start_date hired left roster
0 2019 August 2 1 1
1 2019 September 5 4 2
2 2019 October 4 3 3
3 2019 November 8 3 8
4 2019 December 1 0 9
5 2020 January 4 1 12
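For anyone who wants to reproduce this end to end, here is a self-contained sketch built from the rows in the question. It is a variant rather than the exact code above: it groups on `.dt.to_period('M')`, the other option mentioned earlier, instead of `pd.Grouper`, and it assumes an English locale for `%B`. The `rows` list and the `month` variable are names introduced only for this example.

```python
import pandas as pd

# Rows copied from the question: (duid, start_date, end_date); None means no end date.
rows = [
    ("b2919f1eb", "2019-08-26", "2019-09-05"), ("e372dedd4", "2019-08-26", None),
    ("ba8147ce9", "2019-09-09", "2019-11-05"), ("902c56036", "2019-09-13", "2019-10-01"),
    ("16ec096a7", "2019-09-17", "2019-10-02"), ("1faac1a15", "2019-09-17", None),
    ("319fb59f5", "2019-09-24", "2020-01-20"), ("2a3f1dac5", "2019-10-01", None),
    ("aecbcf0c5", "2019-10-01", "2019-11-05"), ("0ee088b63", "2019-10-08", "2019-10-03"),
    ("c0c02fa4c", "2019-10-31", "2019-10-31"), ("aac5fbc7d", "2019-11-05", "2019-11-05"),
    ("c76bc248a", "2019-11-05", "2019-11-29"), ("20dcef410", "2019-11-12", None),
    ("bc7ea631d", "2019-11-12", None), ("786af275b", "2019-11-12", "2019-11-12"),
    ("005ec00c8", "2019-11-15", None), ("482462695", "2019-11-19", None),
    ("ecba54e5d", "2019-11-26", None), ("28490c52f", "2019-12-17", None),
    ("02f2f7f4b", "2020-01-15", None), ("0ea659d1a", "2020-01-29", None),
    ("0b78caca1", "2020-01-29", None), ("368cc8744", "2020-01-29", "2020-01-29"),
]
df = pd.DataFrame(rows, columns=["duid", "start_date", "end_date"])
df["start_date"] = pd.to_datetime(df["start_date"])
df["end_date"] = pd.to_datetime(df["end_date"])  # None becomes NaT

# Monthly Period keys keep the groups in chronological order.
month = df["start_date"].dt.to_period("M")
agg = df.groupby(month).agg(
    hired=("start_date", "size"),  # every row in the month
    left=("end_date", "count"),    # only rows with a non-null end_date
)
agg["roster"] = agg["hired"].cumsum() - agg["left"].cumsum()

# Format the month label only after all the sorting and arithmetic is done.
agg = agg.reset_index()
agg["start_date"] = agg["start_date"].dt.strftime("%Y %B")
print(agg)
```

Note that `left` is grouped by hire month here, as in the question's expected output: it counts how many of that month's hires have an end date, not how many departures occurred in that month.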