Question

I want to drop duplicates and keep the row with the last timestamp. Duplicates are rows that share the same customer_id and var_name. Here is my data:

    customer_id  value   var_name     timestamp
    1            1       apple        2018-03-22 00:00:00.000        
    2            3       apple        2018-03-23 08:00:00.000
    2            4       apple        2018-03-24 08:00:00.000
    1            1       orange       2018-03-22 08:00:00.000
    2            3       orange       2018-03-24 08:00:00.000
    2            5       orange       2018-03-23 08:00:00.000

So the result would be:

    customer_id  value   var_name     timestamp
    1            1       apple        2018-03-22 00:00:00.000        
    2            4       apple        2018-03-24 08:00:00.000
    1            1       orange       2018-03-22 08:00:00.000
    2            3       orange       2018-03-24 08:00:00.000

Answer 1

I think you need sort_values followed by drop_duplicates:

    df = df.sort_values('timestamp').drop_duplicates(['customer_id','var_name'], keep='last')
    print(df)
       customer_id  value var_name                timestamp
    0            1      1    apple  2018-03-22 00:00:00.000
    3            1      1   orange  2018-03-22 08:00:00.000
    2            2      4    apple  2018-03-24 08:00:00.000
    4            2      3   orange  2018-03-24 08:00:00.000
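
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample data from the question. The helper name `keep_latest` is hypothetical, and two things are assumptions that are not in the answer above: the timestamps are parsed with `pd.to_datetime`, and a stable sort (`kind="mergesort"`) is used so that rows with tied timestamps keep their original relative order.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": [
        "2018-03-22 00:00:00.000",
        "2018-03-23 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-22 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-23 08:00:00.000",
    ],
})
# Parse the strings so sorting compares real datetimes, not text.
df["timestamp"] = pd.to_datetime(df["timestamp"])


def keep_latest(frame, keys, ts_col="timestamp"):
    """Hypothetical helper: keep the row with the latest ts_col per key group."""
    return (
        frame.sort_values(ts_col, kind="mergesort")  # stable: ties keep input order
        .drop_duplicates(subset=keys, keep="last")
    )


result = keep_latest(df, ["customer_id", "var_name"])
print(result)
```

The result is ordered by timestamp, not by the original row order; append `.sort_index()` if the original order should be restored.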

If you do not want to sort because the original row order matters (this assumes timestamp is a datetime column):

    df = df.loc[df.groupby(['customer_id','var_name'], sort=False)['timestamp'].idxmax()]
    print(df)
       customer_id  value var_name           timestamp
    0            1      1    apple 2018-03-22 00:00:00
    2            2      4    apple 2018-03-24 08:00:00
    3            1      1   orange 2018-03-22 08:00:00
    4            2      3   orange 2018-03-24 08:00:00
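
The groupby variant can be sketched in the same self-contained way. This assumes the `timestamp` column has been converted with `pd.to_datetime` (the output above shows a datetime column) and that the index labels are unique, because `idxmax` returns index labels that `loc` then looks up. The variable name `latest_idx` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 1, 2, 2],
    "value": [1, 3, 4, 1, 3, 5],
    "var_name": ["apple", "apple", "apple", "orange", "orange", "orange"],
    "timestamp": pd.to_datetime([
        "2018-03-22 00:00:00.000",
        "2018-03-23 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-22 08:00:00.000",
        "2018-03-24 08:00:00.000",
        "2018-03-23 08:00:00.000",
    ]),
})

# Index label of the latest timestamp within each (customer_id, var_name) group.
# sort=False keeps the groups in order of first appearance.
latest_idx = df.groupby(["customer_id", "var_name"], sort=False)["timestamp"].idxmax()

# Select those rows; the index labels of the original frame are preserved.
result = df.loc[latest_idx]
print(result)
```

One difference worth keeping in mind: when two rows in a group share the same maximum timestamp, `idxmax` returns the first of them, whereas sorting and then `drop_duplicates(keep='last')` keeps the last one in sorted order.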
