Drop duplicates, but keep the row with the highest value within each group of given columns

Date: 2018-12-19 04:10:12

Tags: python pandas dataframe group-by

I have a DataFrame like this:

    Name        Gender         Age      Level
  Pikachu        Male           4         8
 Charmander     Female          5         7
 Charmander     Female          5         7
 Squirtle        Male           3         6
 Squirtle        Male           3         9
 Squirtle       Female          4         9

I want it to look like this:

   Name        Gender         Age      Level
  Pikachu        Male           4         8
 Charmander     Female          5         7
 Squirtle        Male           3         9
 Squirtle       Female          4         9

I don't know how to explain this in plain English, so I'll write it as pseudocode.

Basically:

If Name, Gender and Age are the same:
      If there is a difference in levels:
            Keep the row with higher level
      If there is a tie:
            Keep a random one

Any ideas are appreciated!

2 answers:

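Answer 0 (score: 3)

Use sort_values + drop_duplicates: sort by Level, then keep the last row (the one with the highest Level) for each Name/Gender/Age combination.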

df=df.sort_values('Level').drop_duplicates(['Name','Gender','Age'],keep='last')
df
         Name  Gender  Age  Level
2  Charmander  Female    5      7
0     Pikachu    Male    4      8
4    Squirtle    Male    3      9
5    Squirtle  Female    4      9
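
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same sort_values + drop_duplicates idea. The DataFrame is rebuilt from the question's table; the `keys` and `result` names, the stable `kind="mergesort"`, and the final `sort_index()` (to restore the original row order) are my own additions, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Pikachu", "Charmander", "Charmander", "Squirtle", "Squirtle", "Squirtle"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Age": [4, 5, 5, 3, 3, 4],
    "Level": [8, 7, 7, 6, 9, 9],
})

keys = ["Name", "Gender", "Age"]  # columns that define a duplicate

result = (
    df.sort_values("Level", kind="mergesort")    # stable sort, lowest Level first
      .drop_duplicates(subset=keys, keep="last")  # last row per group has the highest Level
      .sort_index()                               # optional: back to the original row order
)
print(result)
```

With a stable sort, a tie is resolved in favour of the row that appears later in the original frame rather than a random one. Since tied rows match on all four columns here, the resulting data is the same either way.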

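Answer 1 (score: 2)

Using argsort and duplicated: order the rows by descending Level, then flag repeated Name/Gender/Age combinations.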

import numpy as np
df[~df.iloc[np.argsort(-df.Level)].drop(columns='Level').duplicated()]

         Name  Gender  Age  Level
0     Pikachu    Male    4      8
1  Charmander  Female    5      7
4    Squirtle    Male    3      9
5    Squirtle  Female    4      9
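
The one-liner above is dense, so here is a step-by-step sketch of what I understand it to be doing. The intermediate names (`order`, `ranked`, `is_dup`) are hypothetical, and the explicit `reindex` is my addition so that the boolean mask lines up with the original index before filtering.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Pikachu", "Charmander", "Charmander", "Squirtle", "Squirtle", "Squirtle"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Age": [4, 5, 5, 3, 3, 4],
    "Level": [8, 7, 7, 6, 9, 9],
})

# Positions that would sort the rows by descending Level (stable on ties).
order = np.argsort(-df["Level"].to_numpy(), kind="stable")

# Rows in that order; the highest Level of each group now comes first.
ranked = df.iloc[order]

# Ignoring Level, every repeat after the first occurrence is a duplicate.
is_dup = ranked.drop(columns="Level").duplicated()

# Align the mask with the original index, then keep the non-duplicates.
result = df[~is_dup.reindex(df.index)]
print(result)
```

Because the mask is applied to the original frame, the surviving rows keep their original order.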

A groupby + idxmax solution (slower, though). idxmax returns index labels, so select with loc:

df.loc[df.groupby(['Name','Gender', 'Age']).Level.idxmax()]

         Name  Gender  Age  Level
1  Charmander  Female    5      7
0     Pikachu    Male    4      8
5    Squirtle  Female    4      9
4    Squirtle    Male    3      9
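
One detail worth illustrating for the groupby approach: idxmax returns index labels, not positions, so label-based selection with `.loc` is the safer choice whenever the index is not the default 0..n-1. The sketch below assumes a made-up string index (`a` to `f`) purely to show this, and passes `sort=False` (my choice) so the groups come back in order of first appearance.

```python
import pandas as pd

# Sample data from the question, with a hypothetical non-default index.
df = pd.DataFrame(
    {
        "Name": ["Pikachu", "Charmander", "Charmander", "Squirtle", "Squirtle", "Squirtle"],
        "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
        "Age": [4, 5, 5, 3, 3, 4],
        "Level": [8, 7, 7, 6, 9, 9],
    },
    index=list("abcdef"),
)

keys = ["Name", "Gender", "Age"]

# Index label of the highest-Level row in each group (first one wins on ties).
best = df.groupby(keys, sort=False)["Level"].idxmax()

# Labels, not positions, so select with .loc.
result = df.loc[best]
print(result)
```

On ties idxmax returns the first occurrence, so this keeps the earlier of two identical rows rather than a random one.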