Aggregate a pandas DataFrame and drop unwanted rows

Time: 2020-06-25 18:05:21

Tags: python-3.x pandas max aggregate pandas-groupby

I have a DataFrame that I want to aggregate, dropping the rows that are not needed under certain conditions:

ID  Type                     Band      Event            Date        Function       Title                Country 
1   Lead  Jr                   L       Hire             07/06/2016  PM          Lead Product Specialist India 
1   Lead  Jr                   L       Job Change       01/03/2019  PM          Lead Product Specialist India
1   Lead  Jr                   L       Job Change       01/03/2019  PM          Lead Product Specialist India
1   Lead  Sr                   S       Promotion        25/07/2019  PM          Lead Project Manager    India
2   Trainee                    P       Job Change       25/07/2016  AM          Trainee                 Australia
2   SW Developer               L       Promotion        25/07/2017  AM          Developer Lead          Australia
2   SW Developer               L       Job Change       25/07/2018  AM          Developer Lead          Australia
2   Lead  Specialist           S       Promotion        25/07/2019  AM          Lead Project Manager    Australia
3   Lead  Specialist           S       Promotion        25/10/2019  AM          Lead Project Manager    Australia
4   Sr  Specialist             S       Promotion        25/11/2019  AM          Lead Project Manager    Australia

I would like to get the following output from this data:

ID  Type                Band       Event            Date        Function       Title               Country 
1   Lead  Jr             L         Job Change    01/03/2019     PM       Lead Product Specialist     India
1   Lead  Sr             S         Promotion     25/07/2019     PM       Lead Project Manager        India
2   Trainee              P         Job Change    25/07/2016     AM       Trainee                   Australia
2   SW Developer         L         Job Change    25/07/2018     AM       Developer Lead            Australia
2   Lead  Specialist     S         Promotion     25/07/2019     AM       Lead Project Manager      Australia
3   Lead  Specialist     S         Promotion     25/10/2019     AM       Lead Project Manager      Australia
4   Sr  Specialist       S         Promotion     25/11/2019     AM       Lead Project Manager      Australia 

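So essentially, I need one unique record per Type and Band group, and it should be the record with the latest date (i.e. the most recent one). For example, if there are three records with Band = "L" and Type = "Lead Jr" and three different dates, I need to keep the most recent of those three, and so on.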

2 Answers:

Answer 0: (score: 2)

import pandas as pd

# parse the day-first dates (dd/mm/yyyy) as datetime
df.Date = pd.to_datetime(df.Date, format='%d/%m/%Y')

# depending on the data, optionally sort
df.sort_values(['ID', 'Type', 'Date'], inplace=True)

# drop_duplicates with keep='last'
df.drop_duplicates(['ID', 'Type', 'Band'], keep='last')  # optionally add .reset_index(drop=True)

Sort and drop_duplicates as a one-liner

df.sort_values(['ID', 'Type', 'Date']).drop_duplicates(['ID', 'Type', 'Band'], keep='last')

Result

   ID              Type Band       Event       Date Function                    Title   Country 
2   1          Lead  Jr    L  Job Change 2019-03-01       PM  Lead Product Specialist      India
3   1          Lead  Sr    S   Promotion 2019-07-25       PM     Lead Project Manager      India
7   2  Lead  Specialist    S   Promotion 2019-07-25       AM     Lead Project Manager  Australia
6   2      SW Developer    L  Job Change 2018-07-25       AM           Developer Lead  Australia
4   2           Trainee    P  Job Change 2016-07-25       AM                  Trainee  Australia
8   3  Lead  Specialist    S   Promotion 2019-10-25       AM     Lead Project Manager  Australia
9   4    Sr  Specialist    S   Promotion 2019-11-25       AM     Lead Project Manager  Australia
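
For anyone who wants to run the sort-then-drop_duplicates approach end to end, here is a minimal sketch. It assumes the dates are day-first strings as in the question. The frame is a cut-down copy of the sample data (Function, Title and Country are left out for brevity), and the variable name latest is only illustrative.

```python
import pandas as pd

# Cut-down copy of the question's sample data (only the columns needed here).
df = pd.DataFrame({
    "ID":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    "Type": ["Lead  Jr", "Lead  Jr", "Lead  Jr", "Lead  Sr", "Trainee",
             "SW Developer", "SW Developer", "Lead  Specialist",
             "Lead  Specialist", "Sr  Specialist"],
    "Band": ["L", "L", "L", "S", "P", "L", "L", "S", "S", "S"],
    "Event": ["Hire", "Job Change", "Job Change", "Promotion", "Job Change",
              "Promotion", "Job Change", "Promotion", "Promotion", "Promotion"],
    "Date": ["07/06/2016", "01/03/2019", "01/03/2019", "25/07/2019",
             "25/07/2016", "25/07/2017", "25/07/2018", "25/07/2019",
             "25/10/2019", "25/11/2019"],
})

# Explicit day-first format, so 01/03/2019 is read as 1 March 2019.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Sort so the newest row of each (ID, Type, Band) group comes last,
# then keep only that last row per group.
latest = (
    df.sort_values(["ID", "Type", "Band", "Date"])
      .drop_duplicates(["ID", "Type", "Band"], keep="last")
      .reset_index(drop=True)
)
print(latest)
```

Passing an explicit format (or dayfirst=True) matters here: a date such as 01/03/2019 is ambiguous, and a month-first reading would turn it into 3 January 2019.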

Answer 1: (score: 0)

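If you sort the DataFrame by date in descending order (with Date already converted to datetime), the rows within each group keep that order, so you can safely take the first row of each group. ID is part of the grouping keys so that identical Type and Band values under different IDs stay separate, as in the expected output.

df.sort_values("Date", ascending=False).groupby(["ID", "Type", "Band"]).first()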
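
One thing to be aware of is that groupby(...).first() returns the first non-null value of each column and moves the grouping keys into the index. If whole rows are wanted, a possible alternative (not from the answer above) is to look up the row label of the latest date per group with idxmax. The sketch below uses a hypothetical four-row frame with Date already parsed as datetime.

```python
import pandas as pd

# Hypothetical miniature frame; Date is already a datetime column here.
df = pd.DataFrame({
    "ID":   [1, 1, 2, 3],
    "Type": ["Lead  Jr", "Lead  Jr", "Lead  Specialist", "Lead  Specialist"],
    "Band": ["L", "L", "S", "S"],
    "Event": ["Hire", "Job Change", "Promotion", "Promotion"],
    "Date": pd.to_datetime(
        ["07/06/2016", "01/03/2019", "25/07/2019", "25/10/2019"],
        format="%d/%m/%Y",
    ),
})

# Row label of the latest Date within each (ID, Type, Band) group.
idx = df.groupby(["ID", "Type", "Band"])["Date"].idxmax()

# Select those complete rows; sort_index restores the original row order.
latest = df.loc[idx].sort_index()
print(latest)
```

This keeps the original columns and index labels, at the cost of assuming the index labels are unique.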