Pandas: group by consecutive dates and multiple conditions

Time: 2020-11-11 17:48:16

Tags: python pandas dataframe

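I have this table, and I want to merge consecutive rows where the previous row's end date equals the current row's start date and both rows have the same trained value:
end_date[-1] = start_date[0] AND trained[-1] = trained[0]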

Optional: I would also like to keep the values of re, id and name from the row with the largest diff.

use trained start       end          diff   re  id  name
a   FALSE   01/12/2010  03/01/2018   2,590  0   4   25
a   TRUE    03/01/2018  08/02/2019   401    0   4   25
a   TRUE    08/02/2019  09/02/2019   1      0   4   25
a   TRUE    09/02/2019  31/12/2019   325    1   4   25
b   FALSE   01/08/2016  15/05/2018   652    0   5   8
c   FALSE   01/07/2019  06/08/2019   36     0   4   4
c   TRUE    06/08/2019  18/05/2020   286    0   4   4
c   TRUE    18/05/2020  19/05/2020   1      0   4   4
c   TRUE    19/05/2020  01/09/2020   105    0   4   4
c   TRUE    01/09/2020  31/12/2019   (245)  1   4   15

Goal:

use trained start       end          diff   re  id  name
a   FALSE   01/12/2010  03/01/2018   2,590  0   4   25
a   TRUE    03/01/2018  31/12/2019   727    0   4   25
b   FALSE   01/08/2016  15/05/2018   652    0   5   8
c   FALSE   01/07/2019  06/08/2019   36     0   4   4
c   TRUE    06/08/2019  31/12/2019   147    0   4   4
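
To try the answers out, the input table can be loaded into a dataframe roughly as follows. This is only a sketch: the variable names raw and df are assumptions, and diff is deliberately kept as text because values such as "2,590" and "(245)" are not plain numbers.

```python
import io
import pandas as pd

# The input table from the question, whitespace-separated.
raw = """\
use trained start end diff re id name
a FALSE 01/12/2010 03/01/2018 2,590 0 4 25
a TRUE 03/01/2018 08/02/2019 401 0 4 25
a TRUE 08/02/2019 09/02/2019 1 0 4 25
a TRUE 09/02/2019 31/12/2019 325 1 4 25
b FALSE 01/08/2016 15/05/2018 652 0 5 8
c FALSE 01/07/2019 06/08/2019 36 0 4 4
c TRUE 06/08/2019 18/05/2020 286 0 4 4
c TRUE 18/05/2020 19/05/2020 1 0 4 4
c TRUE 19/05/2020 01/09/2020 105 0 4 4
c TRUE 01/09/2020 31/12/2019 (245) 1 4 15
"""

# Keep `diff` as strings; start and end also stay as dd/mm/yyyy text here.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", dtype={"diff": str})
print(df.dtypes)
```

With this setup, trained is read as booleans, while start, end and diff stay as strings until they are converted explicitly.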

2 answers:

Answer 0 (score: 1)

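Based on your logic, we can take a cumsum() over the negated condition to identify blocks of rows that belong together, and then use groupby. Note that in the output below diff was still stored as strings, so 'sum' concatenated the values (401, 1 and 325 became 4011325); convert diff to numbers first if you want a real total.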

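# A new block starts whenever `trained` changes or the row's start
# does not match the previous row's end; cumsum() numbers the blocks.
blocks = (df['trained'].ne(df['trained'].shift())
          | df['start'].ne(df['end'].shift())
         ).cumsum()
# Grouping by `use` as well keeps different users in separate groups.
df.groupby([blocks, 'use']).agg({   # change the functions to fit your needs
    'trained': 'first',
    'start': 'first',
    'end': 'last',
    'diff': 'sum',
    're': 'min',
    'id': 'first',
    'name': 'first'
}).reset_index('use')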

  use  trained       start         end          diff  re  id  name
1   a    False  01/12/2010  03/01/2018         2,590   0   4    25
2   a     True  03/01/2018  31/12/2019       4011325   0   4    25
3   b    False  01/08/2016  15/05/2018           652   0   5     8
4   c    False  01/07/2019  06/08/2019            36   0   4     4
5   c     True  06/08/2019  31/12/2019  2861105(245)   0   4     4
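
The diff column in the output above is a string concatenation rather than a total. The following is a minimal sketch of one way to get numeric sums and to take re, id and name from the row with the largest diff, as the question's optional requirement asks. It assumes that "(245)" is accounting notation for -245 and that commas are thousands separators; the names rows, block, merged, top_rows and result are illustrative and not part of the original answer.

```python
import pandas as pd

rows = [
    ("a", False, "01/12/2010", "03/01/2018", "2,590", 0, 4, 25),
    ("a", True,  "03/01/2018", "08/02/2019", "401",   0, 4, 25),
    ("a", True,  "08/02/2019", "09/02/2019", "1",     0, 4, 25),
    ("a", True,  "09/02/2019", "31/12/2019", "325",   1, 4, 25),
    ("b", False, "01/08/2016", "15/05/2018", "652",   0, 5, 8),
    ("c", False, "01/07/2019", "06/08/2019", "36",    0, 4, 4),
    ("c", True,  "06/08/2019", "18/05/2020", "286",   0, 4, 4),
    ("c", True,  "18/05/2020", "19/05/2020", "1",     0, 4, 4),
    ("c", True,  "19/05/2020", "01/09/2020", "105",   0, 4, 4),
    ("c", True,  "01/09/2020", "31/12/2019", "(245)", 1, 4, 15),
]
cols = ["use", "trained", "start", "end", "diff", "re", "id", "name"]
df = pd.DataFrame(rows, columns=cols)

# Assumption: "2,590" means 2590 and the parenthesised "(245)" means -245.
df["diff"] = (
    df["diff"]
    .str.replace(",", "", regex=False)
    .str.replace(r"^\((.*)\)$", r"-\1", regex=True)
    .astype(int)
)

# A new block starts when `use` or `trained` changes, or when the row's
# start differs from the previous row's end.
new_block = (
    df["use"].ne(df["use"].shift())
    | df["trained"].ne(df["trained"].shift())
    | df["start"].ne(df["end"].shift())
)
df["block"] = new_block.cumsum()

# One row per block: first start, last end, summed diff.
merged = df.groupby("block").agg(
    use=("use", "first"),
    trained=("trained", "first"),
    start=("start", "first"),
    end=("end", "last"),
    diff=("diff", "sum"),
)

# re, id and name come from the row with the largest diff in each block.
top_idx = df.groupby("block")["diff"].idxmax()
top_rows = df.loc[top_idx, ["block", "re", "id", "name"]].set_index("block")

result = merged.join(top_rows).reset_index(drop=True)
print(result)
```

On the sample data this yields diff totals of 727 for the merged a/True rows and 147 for the merged c/True rows, which matches the goal table in the question.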

Answer 1 (score: 1)

It sounds like you just want the first value of 'start' and the last value of 'end' within each group of a groupby:

Assuming your dataframe is called df:

grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first', 
    'end': 'last'})

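You can use groupby again to get the index of the row with the largest 'diff' within each 'use' and 'trained' group ('diff' needs to be numeric for idxmax to pick the true maximum).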

max_idx_values = df.groupby(['use', 'trained'])['diff'].idxmax().values

Now you can fetch the values of the fields 're', 'id' and 'name':

re_id_name_df = df.loc[df.index.isin(max_idx_values), 
                       ['use', 'trained', 're',  'id',  'name']]

Finally, you can merge the two results to get everything in a single dataframe:

final = grouped.merge(re_id_name_df, on=['use', 'trained'])

All the code in one block:

grouped = df.groupby(['use', 'trained'], as_index=False).agg({
    'start': 'first', 
    'end': 'last'})
max_idx_values = df.groupby(['use', 'trained'])['diff'].idxmax().values
re_id_name_df = df.loc[df.index.isin(max_idx_values), 
                       ['use', 'trained', 're',  'id',  'name']]
final = grouped.merge(re_id_name_df, on=['use', 'trained'])
print(final)

  use  trained      start        end  re  id  name
0   a    False 2010-01-12 2018-03-01   0   4    25
1   a     True 2018-03-01 2019-12-31   0   4    25
2   b    False 2016-01-08 2018-05-15   0   5     8
3   c    False 2019-01-07 2019-06-08   0   4     4
4   c     True 2019-06-08 2019-12-31   0   4     4
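
Some of the dates in the output above appear to have been read month-first: 01/12/2010 shows up as 2010-01-12, although 31/12/2019 in the same data can only be day/month/year. Assuming the source dates are day-first throughout (an assumption drawn from the sample, not something the question states), passing an explicit format avoids the ambiguity. A small sketch with illustrative names:

```python
import pandas as pd

dates = pd.Series(["01/12/2010", "03/01/2018", "31/12/2019"])

# An explicit format removes any guessing about day/month order.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2010-12-01', '2018-01-03', '2019-12-31']
```

The same call can be applied to the 'start' and 'end' columns before grouping, so that 'first' and 'last' operate on correctly parsed dates.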