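I want to drop duplicate rows and keep the one with the latest timestamp. Rows count as duplicates when they share the same customer_id and var_name. Here is my data: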
customer_id value var_name timestamp
1 1 apple 2018-03-22 00:00:00.000
2 3 apple 2018-03-23 08:00:00.000
2 4 apple 2018-03-24 08:00:00.000
1 1 orange 2018-03-22 08:00:00.000
2 3 orange 2018-03-24 08:00:00.000
2 5 orange 2018-03-23 08:00:00.000
So the result would be:
customer_id value var_name timestamp
1 1 apple 2018-03-22 00:00:00.000
2 4 apple 2018-03-24 08:00:00.000
1 1 orange 2018-03-22 08:00:00.000
2 3 orange 2018-03-24 08:00:00.000
Answer 0 (score: 5)
I think you need sort_values followed by drop_duplicates:
df = df.sort_values('timestamp').drop_duplicates(['customer_id','var_name'], keep='last')
print (df)
customer_id value var_name timestamp
0 1 1 apple 2018-03-22 00:00:00.000
3 1 1 orange 2018-03-22 08:00:00.000
2 2 4 apple 2018-03-24 08:00:00.000
4 2 3 orange 2018-03-24 08:00:00.000
If you do not want to sort because the original row order matters:
df = df.loc[df.groupby(['customer_id','var_name'], sort=False)['timestamp'].idxmax()]
print (df)
customer_id value var_name timestamp
0 1 1 apple 2018-03-22 00:00:00
2 2 4 apple 2018-03-24 08:00:00
3 1 1 orange 2018-03-22 08:00:00
4 2 3 orange 2018-03-24 08:00:00
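For anyone who wants to try both approaches on the question's data, here is a minimal self-contained sketch. It assumes the timestamp column is first converted with pd.to_datetime, which the question does not state. The sort_index() call and the names keys, by_sort and by_idxmax are additions for illustration, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": [
        "2018-03-22 00:00:00.000",
        "2018-03-23 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-22 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-23 08:00:00.000",
    ],
})

# Assumption: parse the strings so comparisons are done on real datetimes.
df["timestamp"] = pd.to_datetime(df["timestamp"])

keys = ["customer_id", "var_name"]

# Approach 1: sort by timestamp, keep the last row of each key pair,
# then restore the original row order (sort_index is an optional extra).
by_sort = (
    df.sort_values("timestamp")
      .drop_duplicates(keys, keep="last")
      .sort_index()
)

# Approach 2: select the row holding the maximum timestamp in each group,
# with groups kept in order of first appearance.
by_idxmax = df.loc[df.groupby(keys, sort=False)["timestamp"].idxmax()]

print(by_sort)
print(by_idxmax)
```

On this data both approaches keep the rows with index 0, 2, 3 and 4. If two rows in the same group share the same latest timestamp, the two approaches may keep different rows, so it is worth checking for ties in real data.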