How to calculate the difference between two cumsum columns using pandas

Time: 2020-02-17 19:24:06

Tags: python pandas

I have the following dataframe:

         duid start_date   end_date
0   b2919f1eb 2019-08-26 2019-09-05
1   e372dedd4 2019-08-26        NaT
2   ba8147ce9 2019-09-09 2019-11-05
3   902c56036 2019-09-13 2019-10-01
4   16ec096a7 2019-09-17 2019-10-02
5   1faac1a15 2019-09-17        NaT
6   319fb59f5 2019-09-24 2020-01-20
7   2a3f1dac5 2019-10-01        NaT
8   aecbcf0c5 2019-10-01 2019-11-05
9   0ee088b63 2019-10-08 2019-10-03
10  c0c02fa4c 2019-10-31 2019-10-31
12  aac5fbc7d 2019-11-05 2019-11-05
11  c76bc248a 2019-11-05 2019-11-29
13  20dcef410 2019-11-12        NaT
14  bc7ea631d 2019-11-12        NaT
15  786af275b 2019-11-12 2019-11-12
16  005ec00c8 2019-11-15        NaT
17  482462695 2019-11-19        NaT
18  ecba54e5d 2019-11-26        NaT
19  28490c52f 2019-12-17        NaT
20  02f2f7f4b 2020-01-15        NaT
21  0ea659d1a 2020-01-29        NaT
22  0b78caca1 2020-01-29        NaT
23  368cc8744 2020-01-29 2020-01-29

This table describes the hire and departure dates of employees. So far I have managed to compute the counts per month:

df.groupby(df['start_date'].dt.strftime('%Y %B')) \
   .agg(hired=('start_date', 'size'), left=('end_date', 'count')) \
   .reset_index()
       start_date  hired  left
0     2019 August      2     1
1   2019 December      1     0
2   2019 November      8     3
3    2019 October      4     3
4  2019 September      5     4
5    2020 January      4     1

I also tried to compute the cumulative sum for each date, but it returns odd results:

ds = df.groupby(df['start_date'].dt.strftime('%Y %B'))
ds.size().cumsum()
start_date
2019 August        2
2019 December      3
2019 November     11
2019 October      15
2019 September    20
2020 January      24
dtype: int64

And the cumulative number of employees who left:

de = df.groupby(df['end_date'].dt.strftime('%Y %B'))
de.size().cumsum()
end_date
2019 November      5
2019 October       9
2019 September    10
2020 January      12
dtype: int64

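There is also a sorting issue: I don't know why the table is not sorted by start_date, but that problem matters less than computing the difference between the two values: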

df = df.sort_values('start_date')

How can I take the difference between the cumulative values of the two columns start_date and end_date to get the following result?

       start_date  hired  left  roster
0     2019 August      2     1       1
1  2019 September      5     4       2
2    2019 October      4     3       3
3   2019 November      8     3       8
4   2019 December      1     0       9
5    2020 January      4     1      12

1 Answer:

Answer 0 (score: 2)

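You may find it easier to keep the grouping key as a datetime-like object and only reformat it at the end, so that sorting works correctly (for example, pd.Grouper with a frequency, or .to_period(...)).

First get your initial aggregated data, then sort by the grouping index to make sure the rows are in chronological order: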
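To illustrate why the key type matters, here is a minimal sketch on a made-up three-row sample (the dates are picked from the question's table, the variable names are hypothetical). Month labels produced by strftime are plain strings and sort alphabetically, while monthly periods sort chronologically:

```python
import pandas as pd

# Hypothetical mini-sample: three start dates in three different months.
dates = pd.Series(pd.to_datetime(["2019-08-26", "2019-09-09", "2019-12-17"]))

# String labels sort alphabetically, so "December" lands before "September".
string_keys = sorted(dates.dt.strftime("%Y %B").unique())

# Monthly periods keep their calendar order when sorted.
period_keys = [str(p) for p in sorted(dates.dt.to_period("M").unique())]

print(string_keys)
print(period_keys)
```

This is the same effect seen in the question's output, where "2019 December" appears ahead of "2019 November".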

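import pandas as pd

# Group on month-end bins of start_date; newer pandas releases spell this frequency 'ME'.
agg = (
    df.groupby(pd.Grouper(key='start_date', freq='M'))['end_date']
    # 'size' counts every row (hires); 'count' counts non-null end_date values (leavers)
    .agg(hired='size', left='count')
    .sort_index()
)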

Then assign a new column holding the running total of the roster:

agg['roster'] = agg['hired'].cumsum() - agg['left'].cumsum()
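As a quick check of the arithmetic, the sketch below rebuilds only the hired and left columns from the monthly table (typed in by hand, already in chronological order) and computes the roster two ways. Because cumsum is a running sum, subtracting the two cumulative columns should match the cumulative sum of the monthly net change:

```python
import pandas as pd

# Monthly counts copied from the aggregated table, August 2019 to January 2020.
agg = pd.DataFrame({"hired": [2, 5, 4, 8, 1, 4], "left": [1, 4, 3, 3, 0, 1]})

# Difference of the two running totals, as in the answer.
roster = agg["hired"].cumsum() - agg["left"].cumsum()

# Equivalent form: running total of the monthly net change.
roster_alt = (agg["hired"] - agg["left"]).cumsum()

print(roster.tolist())
```

Either form relies on the rows already being in chronological order, which is why the sort step comes first.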

Then reformat the index and reset it, for example:

agg = agg.set_index(agg.index.strftime('%Y %B')).reset_index()

which gives you:

       start_date  hired  left  roster
0     2019 August      2     1       1
1  2019 September      5     4       2
2    2019 October      4     3       3
3   2019 November      8     3       8
4   2019 December      1     0       9
5    2020 January      4     1      12
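For anyone who wants to reproduce this from scratch, the following is a self-contained sketch under a few assumptions: the date pairs are retyped from the question's table (the duid column is omitted), and the grouping uses .dt.to_period("M") rather than pd.Grouper, which is one of the alternatives mentioned above and is a swap made here for illustration, not the exact code from the answer.

```python
import pandas as pd

# (start_date, end_date) pairs retyped from the question; None stands for NaT.
rows = [
    ("2019-08-26", "2019-09-05"), ("2019-08-26", None),
    ("2019-09-09", "2019-11-05"), ("2019-09-13", "2019-10-01"),
    ("2019-09-17", "2019-10-02"), ("2019-09-17", None),
    ("2019-09-24", "2020-01-20"), ("2019-10-01", None),
    ("2019-10-01", "2019-11-05"), ("2019-10-08", "2019-10-03"),
    ("2019-10-31", "2019-10-31"), ("2019-11-05", "2019-11-05"),
    ("2019-11-05", "2019-11-29"), ("2019-11-12", None),
    ("2019-11-12", None), ("2019-11-12", "2019-11-12"),
    ("2019-11-15", None), ("2019-11-19", None),
    ("2019-11-26", None), ("2019-12-17", None),
    ("2020-01-15", None), ("2020-01-29", None),
    ("2020-01-29", None), ("2020-01-29", "2020-01-29"),
]
df = pd.DataFrame(rows, columns=["start_date", "end_date"]).apply(pd.to_datetime)

# Group by the calendar month of start_date, kept as a sortable Period.
month = df["start_date"].dt.to_period("M")
agg = df.groupby(month)["end_date"].agg(hired="size", left="count").sort_index()

# Running headcount: cumulative hires minus cumulative departures.
agg["roster"] = agg["hired"].cumsum() - agg["left"].cumsum()

# Only now turn the index into display labels, then make it a column.
agg.index = agg.index.strftime("%Y %B")
agg = agg.rename_axis("start_date").reset_index()

print(agg)
```

Note that, as in the answer, the left column counts departures by the month in which the employee was hired, which is what the requested output shows.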