I created a DataFrame with duplicate rows, as follows:
import numpy as np
import pandas as pd

# Rows 2, 6 and 7 are identical in every column
df = pd.DataFrame({"Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017", "June 23, 2017", "December 12, 2017", None, "April 20, 2017", "April 20, 2017"],
                   "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
                   "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
                   "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99]})
If I select the duplicates, it correctly shows two duplicated rows.
df[df.duplicated()]
Then I call drop_duplicates to remove the second copy and keep the first.
df.drop_duplicates()
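However, it looks like it removes two rows instead of keeping the first one. Am I missing something about the drop_duplicates method? The docstring says the keep parameter defaults to 'first', and this still happens even when I pass that parameter explicitly.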
Answer 0 (score: 1)
Your example contains three identical rows, not two. Use keep=False to see all of them:
df[df.duplicated(keep=False)]
Out[661]:
   Item Price  Items Sold      Order Date Sales Person
2        9.99         1.0  April 20, 2017         Rick
6        9.99         1.0  April 20, 2017         Rick
7        9.99         1.0  April 20, 2017         Rick
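To make the difference between the two masks concrete, here is a minimal sketch that rebuilds the DataFrame from the question and compares duplicated() under its default keep='first' with keep=False. The variable names first_mask and all_mask are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question: rows 2, 6 and 7 are identical.
df = pd.DataFrame({
    "Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017",
                   "June 23, 2017", "December 12, 2017", None,
                   "April 20, 2017", "April 20, 2017"],
    "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
    "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
    "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99],
})

# Default keep='first': only the later copies are flagged,
# so the first occurrence (index 2) is not marked as a duplicate.
first_mask = df.duplicated()

# keep=False: every member of a duplicated group is flagged.
all_mask = df.duplicated(keep=False)

print(df.index[first_mask].tolist())  # [6, 7]
print(df.index[all_mask].tolist())    # [2, 6, 7]
```

So the two rows shown by df[df.duplicated()] are the second and third copies. The first copy at index 2 is simply not flagged.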
drop_duplicates then keeps only the first of those three rows, the one at index 2, and removes the other two:
df.drop_duplicates()
Out[659]:
   Item Price  Items Sold         Order Date Sales Person
0        4.99         4.0    January 1, 2017         John
1        1.99      -999.0     March 15, 2017         John
2        9.99         1.0     April 20, 2017         Rick
3       19.99         NaN      June 23, 2017         Mary
4        0.99         7.0  December 12, 2017         Mary
5        2.99         3.0               None         Rick
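As a quick check of the behaviour described above, the following sketch compares which index labels survive drop_duplicates() for each value of keep. It rebuilds the question's data so it can run on its own, and the names kept_first, kept_last and kept_none are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question: rows 2, 6 and 7 are identical.
df = pd.DataFrame({
    "Order Date": ["January 1, 2017", "March 15, 2017", "April 20, 2017",
                   "June 23, 2017", "December 12, 2017", None,
                   "April 20, 2017", "April 20, 2017"],
    "Sales Person": ["John", "John", "Rick", "Mary", "Mary", "Rick", "Rick", "Rick"],
    "Items Sold": [4, -999, 1, np.nan, 7, 3, 1, 1],
    "Item Price": [4.99, 1.99, 9.99, 19.99, 0.99, 2.99, 9.99, 9.99],
})

# Index labels that remain for each keep option.
kept_first = df.drop_duplicates().index.tolist()             # default keep='first'
kept_last = df.drop_duplicates(keep="last").index.tolist()
kept_none = df.drop_duplicates(keep=False).index.tolist()

print(kept_first)  # [0, 1, 2, 3, 4, 5]
print(kept_last)   # [0, 1, 3, 4, 5, 7]
print(kept_none)   # [0, 1, 3, 4, 5]

# drop_duplicates() with the default keeps the same rows as
# filtering with the inverted duplicated() mask.
same_rows = kept_first == df[~df.duplicated()].index.tolist()
print(same_rows)   # True
```

Two rows disappear because the group has three copies. With the default keep='first', exactly one of them (index 2) stays.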