I have a data frame that I want to aggregate, dropping the rows I don't need under certain conditions:
ID  Type             Band  Event       Date        Function  Title                    Country
1   Lead Jr          L     Hire        07/06/2016  PM        Lead Product Specialist  India
1   Lead Jr          L     Job Change  01/03/2019  PM        Lead Product Specialist  India
1   Lead Jr          L     Job Change  01/03/2019  PM        Lead Product Specialist  India
1   Lead Sr          S     Promotion   25/07/2019  PM        Lead Project Manager     India
2   Trainee          P     Job Change  25/07/2016  AM        Trainee                  Australia
2   SW Developer     L     Promotion   25/07/2017  AM        Developer Lead           Australia
2   SW Developer     L     Job Change  25/07/2018  AM        Developer Lead           Australia
2   Lead Specialist  S     Promotion   25/07/2019  AM        Lead Project Manager     Australia
3   Lead Specialist  S     Promotion   25/10/2019  AM        Lead Project Manager     Australia
4   Sr Specialist    S     Promotion   25/11/2019  AM        Lead Project Manager     Australia
From this data I would like to get the following output:
ID  Type             Band  Event       Date        Function  Title                    Country
1   Lead Jr          L     Job Change  01/03/2019  PM        Lead Product Specialist  India
1   Lead Sr          S     Promotion   25/07/2019  PM        Lead Project Manager     India
2   Trainee          P     Job Change  25/07/2016  AM        Trainee                  Australia
2   SW Developer     L     Job Change  25/07/2018  AM        Developer Lead           Australia
2   Lead Specialist  S     Promotion   25/07/2019  AM        Lead Project Manager     Australia
3   Lead Specialist  S     Promotion   25/10/2019  AM        Lead Project Manager     Australia
4   Sr Specialist    S     Promotion   25/11/2019  AM        Lead Project Manager     Australia
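So essentially, the logic is that I need one unique record per Type and Band group, and it should be the one with the latest date (i.e. the most recent record). For example, if there are three records with Band = "L" and Type = "Lead Jr" and three different dates, I need the record with the latest of those three dates, and so on.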
Answer 0 (score: 2)
import pandas as pd

# convert Date to datetime; the sample dates are day-first (dd/mm/yyyy)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
# depending on the data, optionally sort
df.sort_values(['ID', 'Type', 'Date'], inplace=True)
# drop_duplicates with keep='last'
df.drop_duplicates(['ID', 'Type', 'Band'], keep='last')  # optionally add .reset_index(drop=True)
# equivalently, as one chained expression
df.sort_values(['ID', 'Type', 'Date']).drop_duplicates(['ID', 'Type', 'Band'], keep='last')
   ID  Type             Band  Event       Date        Function  Title                    Country
2  1   Lead Jr          L     Job Change  2019-03-01  PM        Lead Product Specialist  India
3  1   Lead Sr          S     Promotion   2019-07-25  PM        Lead Project Manager     India
7  2   Lead Specialist  S     Promotion   2019-07-25  AM        Lead Project Manager     Australia
6  2   SW Developer     L     Job Change  2018-07-25  AM        Developer Lead           Australia
4  2   Trainee          P     Job Change  2016-07-25  AM        Trainee                  Australia
8  3   Lead Specialist  S     Promotion   2019-10-25  AM        Lead Project Manager     Australia
9  4   Sr Specialist    S     Promotion   2019-11-25  AM        Lead Project Manager     Australia
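To try the sort-then-drop_duplicates approach end to end, here is a minimal, self-contained sketch. The literal `rows` list is retyped by hand from the question's sample (it is not part of the original answer), the dates are assumed to be day-first, and the name `latest` is only illustrative.

```python
import pandas as pd

# Sample rows retyped from the question (hand-built for illustration).
rows = [
    (1, "Lead Jr", "L", "Hire", "07/06/2016", "PM", "Lead Product Specialist", "India"),
    (1, "Lead Jr", "L", "Job Change", "01/03/2019", "PM", "Lead Product Specialist", "India"),
    (1, "Lead Jr", "L", "Job Change", "01/03/2019", "PM", "Lead Product Specialist", "India"),
    (1, "Lead Sr", "S", "Promotion", "25/07/2019", "PM", "Lead Project Manager", "India"),
    (2, "Trainee", "P", "Job Change", "25/07/2016", "AM", "Trainee", "Australia"),
    (2, "SW Developer", "L", "Promotion", "25/07/2017", "AM", "Developer Lead", "Australia"),
    (2, "SW Developer", "L", "Job Change", "25/07/2018", "AM", "Developer Lead", "Australia"),
    (2, "Lead Specialist", "S", "Promotion", "25/07/2019", "AM", "Lead Project Manager", "Australia"),
    (3, "Lead Specialist", "S", "Promotion", "25/10/2019", "AM", "Lead Project Manager", "Australia"),
    (4, "Sr Specialist", "S", "Promotion", "25/11/2019", "AM", "Lead Project Manager", "Australia"),
]
cols = ["ID", "Type", "Band", "Event", "Date", "Function", "Title", "Country"]
df = pd.DataFrame(rows, columns=cols)

# Assumption: dates are day-first (dd/mm/yyyy), so state the format explicitly.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

latest = (
    df.sort_values(["ID", "Type", "Date"])                     # oldest to newest within each ID/Type
      .drop_duplicates(["ID", "Type", "Band"], keep="last")    # keep the newest row per group
      .sort_index()                                            # back to the original row order
      .reset_index(drop=True)
)
print(latest)
```

Sorting by index before resetting it is optional; it only puts the surviving rows back in the order they had in the input, which makes the result easier to compare with the desired output in the question.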
Answer 1 (score: 0)
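If you sort the data frame by date in descending order, the rows within each group are ordered the same way, so you can safely take the first row of each group. (This assumes Date has already been converted to datetime; sorting the raw dd/mm/yyyy strings would not be chronological.)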
df.sort_values("Date", ascending=False).groupby(["Type", "Band"]).first()
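One caveat worth checking against the desired output: grouping by Type and Band alone merges rows that belong to different IDs, and `first()` takes the first non-null value column by column rather than a whole row. The sketch below uses a trimmed, hand-typed subset of the question's data (the variable names `newest_first`, `by_type_band` and `per_id` are illustrative) to compare that grouping with one that also includes ID and uses `head(1)`.

```python
import pandas as pd

# A trimmed copy of the question's data (only the columns needed here).
df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 2, 2, 3],
        "Type": ["Lead Jr", "Lead Jr", "SW Developer", "SW Developer",
                 "Lead Specialist", "Lead Specialist"],
        "Band": ["L", "L", "L", "L", "S", "S"],
        "Event": ["Hire", "Job Change", "Promotion", "Job Change",
                  "Promotion", "Promotion"],
        "Date": ["07/06/2016", "01/03/2019", "25/07/2017", "25/07/2018",
                 "25/07/2019", "25/10/2019"],
    }
)
# Sorting is only chronological once Date is a real datetime (day-first assumed).
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

newest_first = df.sort_values("Date", ascending=False)

# Type and Band only: ID 2 and ID 3 share ("Lead Specialist", "S") and collapse into one row.
by_type_band = newest_first.groupby(["Type", "Band"]).first()

# Adding ID keeps one row per ID/Type/Band; head(1) returns the whole newest row.
per_id = newest_first.groupby(["ID", "Type", "Band"]).head(1).sort_index()

print(len(by_type_band), len(per_id))
```

On this subset the first grouping yields three rows and the second yields four, because the "Lead Specialist" rows for ID 2 and ID 3 stay separate only when ID is part of the key.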