Question

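I have a program that runs fine, but because of how the underlying data is structured, it unfortunately returns duplicates. The result looks like this: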

   Date      Amount   Source   Type
  7/16/2019  10        A       B
  7/17/2019  10        A       B
  7/15/2019  10        A       B
  7/15/2019  10        B       B

I'd like to return:
   Date      Amount   Source   Type
  7/17/2019   10        A       B
  7/15/2019   10        B       B

7/17/2019 is selected because it is the latest date on which we received 10 from Source A and Type B.

I tried:

df.drop_duplicates(subset='a','b','date', keep="last")

but it did not work. Is there a better way?

This works:

df[df.Date.eq(df.groupby(['Source','Type'])['Date'].transform('max'))]
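For anyone who wants to try the line above, here is a minimal self-contained sketch. The sample frame is rebuilt from the table in the question, and parsing Date with pd.to_datetime (format month/day/year) is my assumption, since the question does not say how the column is stored.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Date": ["7/16/2019", "7/17/2019", "7/15/2019", "7/15/2019"],
    "Amount": [10, 10, 10, 10],
    "Source": ["A", "A", "A", "B"],
    "Type": ["B", "B", "B", "B"],
})

# Assumption: dates are month/day/year strings; parse them so "max" is chronological.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# For every row, the latest Date within its (Source, Type) group.
latest = df.groupby(["Source", "Type"])["Date"].transform("max")

# Keep only the rows whose Date equals that group maximum.
result = df[df["Date"].eq(latest)]
print(result)
```

Note that if two rows in the same group share the latest date, this filter keeps both of them.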

Answer 1

As described in this post:

non_duplicate_index = ~df.index.duplicated(keep='first')
result = df.loc[non_duplicate_index]

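df.index.duplicated(keep='first') returns a boolean array: True where an index label repeats one that appeared earlier, False otherwise. ~df.index.duplicated(keep='first') inverts it, so it is True for the first occurrence of each label.

Finally, df.loc[non_duplicate_index] uses pandas boolean indexing to return the rows of df where non_duplicate_index is True.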
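One caveat: this checks index labels, not column values, so on a frame with the default 0..n-1 index nothing would be dropped. A rough sketch of how it could be applied to the question's data follows. Moving Source and Type into the index and sorting by Date descending first are my assumptions, not part of the original snippet.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["7/16/2019", "7/17/2019", "7/15/2019", "7/15/2019"], format="%m/%d/%Y"
    ),
    "Amount": [10, 10, 10, 10],
    "Source": ["A", "A", "A", "B"],
    "Type": ["B", "B", "B", "B"],
})

# Assumption: put the latest dates first and use (Source, Type) as the index,
# so that "first occurrence of an index label" means "latest row of the group".
indexed = df.sort_values("Date", ascending=False).set_index(["Source", "Type"])

# True for the first occurrence of each (Source, Type) label.
keep_mask = ~indexed.index.duplicated(keep="first")

result = indexed.loc[keep_mask].reset_index()
print(result)
```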

Answer 2

drop_duplicates would also work well:

df.sort_values('Date').drop_duplicates(subset=['Source','Type'], keep="last") 
Out[566]: 
        Date  Amount Source Type
3 2019-07-15      10      B    B
1 2019-07-17      10      A    B
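One thing to watch with this approach: it relies on Date sorting chronologically. If the column still holds month/day/year strings, the sort is lexicographic and can keep the wrong row. The small sketch below illustrates this with made-up dates (7/9/2019 and 7/17/2019 are my own example values, not from the question).

```python
import pandas as pd

# Hypothetical two-row frame chosen to expose the string-sorting problem.
df = pd.DataFrame({
    "Date": ["7/9/2019", "7/17/2019"],
    "Amount": [10, 10],
    "Source": ["A", "A"],
    "Type": ["B", "B"],
})

# As text, "7/9/2019" sorts after "7/17/2019", so keep="last" keeps July 9.
as_text = df.sort_values("Date").drop_duplicates(
    subset=["Source", "Type"], keep="last"
)

# After parsing, the sort is chronological and keep="last" keeps July 17.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
as_dates = df.sort_values("Date").drop_duplicates(
    subset=["Source", "Type"], keep="last"
)

print(as_text)
print(as_dates)
```

So it is probably safest to convert Date with pd.to_datetime before sorting.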

Python DataFrame: removing duplicate values?
