Is there a way to group and sum a pandas DataFrame sequentially?

Time: 2020-01-27 19:08:31

Tags: python pandas

I have a DataFrame that looks like this:

      emp    job phase   cat  hours equipnum equipcode  equiphours   equipdate
0  OO003  19713   95L  9512      1     None      None         0.0  2020-01-24
1  OO003  19713   95L  9512      1     None      None         0.0  2020-01-24
2  OO003  19713   95L  9512      1     None      None         0.0  2020-01-24
3  OO003  19713   95L  9512      1     None      None         0.0  2020-01-24
4  OO003  19526   OH   MAT       1   AIR012     E-REV         1.0  2020-01-24
5  OO003  19526   OH   MAT       1   AIR012     E-REV         1.0  2020-01-24
6  OO003  19526   OH   MAT       1   AIR012     E-REV         1.0  2020-01-24
7  OO003  19486   52L  5212      1     None      None         0.0  2020-01-24
8  OO003  19486   52L  5212      1     None      None         0.0  2020-01-24
9  OO003  19486   52L  5212      1     None      None         0.0  2020-01-24
10 UR003  19713   95L  9512      1     None      None         0.0  2020-01-24
11 UR003  19713   95L  9512      1     None      None         0.0  2020-01-24
12 UR003  19713   95L  9512      1     None      None         0.0  2020-01-24
13 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
14 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
15 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
16 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
17 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
18 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24
19 UR003  19526   OH   MAT       1     None      None         0.0  2020-01-24

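Is there a way to group and sum the hours column over the first 8 rows, and then over the last 2 rows, for each unique employee number (emp)?

The final DataFrame should look like this: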


     emp    job phase   cat  hours equipnum equipcode  equiphours   equipdate
0   OO003  19713   95L  9512      4     None      None         0.0  2020-01-24
1   OO003  19526   OH   MAT       3   AIR012     E-REV         1.0  2020-01-24
2   OO003  19486   52L  5212      1     None      None         0.0  2020-01-24
3   OO003  19486   52L  5212      2     None      None         0.0  2020-01-24
4   UR003  19713   95L  9512      3     None      None         0.0  2020-01-24
5   UR003  19526   OH   MAT       5     None      None         0.0  2020-01-24
6   UR003  19526   OH   MAT       2     None      None         0.0  2020-01-24

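Thanks for your help!

1 answer:

Answer 0 (score: 0)

You need two groupby operations. The first computes the cumulative hours worked within each employee. The second groups by employee, job, and whether the cumulative hours are <= 8, and aggregates the remaining columns accordingly.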
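To make the intermediate step concrete, here is a minimal sketch of the running total and the resulting boolean flag. The frame `toy` is a hypothetical cut-down version of the question's data (one employee, ten one-hour rows), and the names `running` and `within_first_8` are illustrative, not from the original answer.

```python
import pandas as pd

# Hypothetical toy frame: one employee, ten one-hour rows across three jobs.
toy = pd.DataFrame({
    "emp": ["OO003"] * 10,
    "job": [19713] * 4 + [19526] * 3 + [19486] * 3,
    "hours": [1] * 10,
})

# Running total of hours within each employee.
running = toy.groupby("emp")["hours"].cumsum()

# True while the running total is still within the first 8 hours.
within_first_8 = running.le(8)

print(pd.DataFrame({
    "job": toy["job"],
    "running": running,
    "within_first_8": within_first_8,
}))
```

The flag flips to False at the ninth row, which is why job 19486 ends up split into two groups once the flag is used as a third grouping key.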

# Running total of hours within each employee
s = df.groupby('emp').hours.cumsum()
# s = df.groupby('emp').cumcount() + 1  # use this instead to count rows rather than hours

# `first` for every column except hours and the group keys; `sum` for hours
agg_d = {x: 'first' for x in df.columns.difference(['hours', 'job', 'emp'])}
agg_d['hours'] = 'sum'

res = (df.groupby(['job', 'emp', s.le(8).rename('drop')], sort=False)
         .agg(agg_d)
         .reset_index()
         .drop(columns='drop'))

print(res)
     job    emp   cat equipcode   equipdate  equiphours equipnum phase  hours
0  19713  OO003  9512      None  2020-01-24         0.0     None   95L      4
1  19526  OO003   MAT     E-REV  2020-01-24         1.0   AIR012    OH      3
2  19486  OO003  5212      None  2020-01-24         0.0     None   52L      1
3  19486  OO003  5212      None  2020-01-24         0.0     None   52L      2
4  19713  UR003  9512      None  2020-01-24         0.0     None   95L      3
5  19526  UR003   MAT      None  2020-01-24         0.0     None    OH      5
6  19526  UR003   MAT      None  2020-01-24         0.0     None    OH      2
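
For anyone who wants to run this end to end, the following is a self-contained sketch that rebuilds the question's sample frame and applies the same two-step approach. The column dtypes are assumptions (for example, `cat` and `equipdate` are kept as plain strings), and the flag name `within_first_8` is illustrative. Because `columns.difference` returns the labels sorted, the sketch also selects the original column order at the end.

```python
import pandas as pd

cols = ["emp", "job", "phase", "cat", "hours",
        "equipnum", "equipcode", "equiphours", "equipdate"]
day = "2020-01-24"  # kept as a plain string here (assumption)

# (row template, repeat count), following the question's table top to bottom.
blocks = [
    (("OO003", 19713, "95L", "9512", 1, None, None, 0.0, day), 4),
    (("OO003", 19526, "OH", "MAT", 1, "AIR012", "E-REV", 1.0, day), 3),
    (("OO003", 19486, "52L", "5212", 1, None, None, 0.0, day), 3),
    (("UR003", 19713, "95L", "9512", 1, None, None, 0.0, day), 3),
    (("UR003", 19526, "OH", "MAT", 1, None, None, 0.0, day), 7),
]
df = pd.DataFrame([row for row, n in blocks for _ in range(n)], columns=cols)

# Step 1: running total of hours per employee, turned into a boolean flag.
running = df.groupby("emp")["hours"].cumsum()
within_first_8 = running.le(8).rename("within_first_8")

# Step 2: group by job, employee and the flag; sum hours, take `first` elsewhere.
agg_d = {c: "first" for c in df.columns.difference(["hours", "job", "emp"])}
agg_d["hours"] = "sum"

res = (df.groupby(["job", "emp", within_first_8], sort=False)
         .agg(agg_d)
         .reset_index()
         .drop(columns="within_first_8"))

# Restore the original column order.
res = res[cols]
print(res[["emp", "job", "hours"]])
```

Passing `sort=False` keeps the groups in order of first appearance, which is what makes the result line up with the row order requested in the question.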