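I have a pandas DataFrame (the result of joining two dataframes) in which some rows are duplicates of each other, except in two columns that hold row identifiers. E.g.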
A B C D E F
Peter 1 c d e f
Paula 2 g h i j
Frank 3 c d e f
Robert 4 k l m n
Sarah 5 g h i j
For testing:
import pandas as pd

df = pd.DataFrame({"A": ["Peter", "Paula", "Frank", "Robert", "Sarah"],
                   "B": [1, 2, 3, 4, 5],
                   "C": ["c", "g", "c", "k", "g"],
                   "D": ["d", "h", "d", "l", "h"],
                   "E": ["e", "i", "e", "m", "i"],
                   "F": ["f", "j", "f", "n", "j"]})
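I want to keep only the first occurrence of each set of rows that are duplicated in columns C through F, while keeping that row's name and number (columns "A" and "B"). The result should be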
A B C D E F
Peter 1 c d e f
Paula 2 g h i j
Robert 4 k l m n
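I tried a few things with df.drop_duplicates, but could not get it to ignore columns "A" and "B". I also tried splitting the data into two dataframes (A and B in one, C to F in the other), calling drop_duplicates on the second, and merging the two back together by index. That did not work either because, as far as I could tell, drop_duplicates resets the index. So how can this be done? Thanks.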
Answer 0 (score: 1)
df2 = df.drop_duplicates(subset=["C", "D", "E", "F"])
Output:
        A  B  C  D  E  F
0   Peter  1  c  d  e  f
1   Paula  2  g  h  i  j
3  Robert  4  k  l  m  n
See here.
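For anyone who wants to check this end to end, here is a minimal, self-contained sketch built on the test data from the question. The names df2 and df3 are only illustrative. It assumes the goal is to compare rows on columns C to F alone, and it also shows that drop_duplicates keeps the original index labels rather than resetting them, so the surviving rows are labelled 0, 1 and 3.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["Peter", "Paula", "Frank", "Robert", "Sarah"],
    "B": [1, 2, 3, 4, 5],
    "C": ["c", "g", "c", "k", "g"],
    "D": ["d", "h", "d", "l", "h"],
    "E": ["e", "i", "e", "m", "i"],
    "F": ["f", "j", "f", "n", "j"],
})

# Compare rows on C..F only; A and B are carried along untouched.
# keep="first" is the default and is spelled out here for clarity.
df2 = df.drop_duplicates(subset=["C", "D", "E", "F"], keep="first")

# The original index labels survive, so the dropped rows (2 and 4) leave gaps.
print(df2.index.tolist())  # [0, 1, 3]

# Optional: renumber the rows 0..n-1 if the gaps are unwanted.
df3 = df2.reset_index(drop=True)
print(df3)
```

Because the index labels are preserved, the split-and-merge-by-index approach described in the question should not be necessary; passing subset to drop_duplicates does the same job in a single call.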