Question

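I am trying to drop duplicate rows, keeping for each name only the row with the maximum amount, so that the result contains no duplicated names.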

Input:

df
   name  amount
0     a    1000
1     a    2000
2     a    5000
3     b    1000
4     b    2000
5     c    3000
6     d    4000
7     e    5000
8     f    6000
9     g    7000
10    h    8000
11    h   10000

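My groupby attempt gives me a Series whose index values no longer correspond to df. How can I get the expected output with the original index preserved?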

Answer 1

You can avoid groupby altogether: sort_values followed by drop_duplicates keeps the original index:

df.sort_values('amount', ascending=False).drop_duplicates('name').sort_index()


   name  amount
2     a    5000
4     b    2000
5     c    3000
6     d    4000
7     e    5000
8     f    6000
9     g    7000
11    h   10000
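
For anyone who wants to reproduce this, here is a minimal sketch of the same chain, written out step by step. The DataFrame constructor below is an assumption rebuilt from the table in the question; only the column names and values are taken from it.

```python
import pandas as pd

# Sample frame rebuilt from the question's table (the construction itself is assumed).
df = pd.DataFrame({
    "name": list("aaabbcdefghh"),
    "amount": [1000, 2000, 5000, 1000, 2000, 3000,
               4000, 5000, 6000, 7000, 8000, 10000],
})

# Sort so each name's largest amount comes first, keep that first row per name,
# then restore the original row order using the preserved index labels.
result = (
    df.sort_values("amount", ascending=False)
      .drop_duplicates("name")
      .sort_index()
)
print(result)
```

Because neither sort_values nor drop_duplicates resets the index, the surviving rows still carry their labels from df, which is what makes the final sort_index possible.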

You can avoid the final sort_index call with:

df[~df.sort_values('amount', ascending=False).name.duplicated()]

   name  amount
2     a    5000
4     b    2000
5     c    3000
6     d    4000
7     e    5000
8     f    6000
9     g    7000
11    h   10000

This works because boolean indexing reindexes the mask to match the DataFrame's index. You do have to live with a UserWarning, though:

UserWarning: Boolean Series key will be reindexed to match DataFrame index.
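
If the warning is a concern, one option is to align the mask to df's index yourself before indexing. This is a sketch of that idea, not part of the original answer; the names mask and aligned are illustrative, and the frame is rebuilt from the question's table.

```python
import pandas as pd

# Sample frame rebuilt from the question's table (the construction itself is assumed).
df = pd.DataFrame({
    "name": list("aaabbcdefghh"),
    "amount": [1000, 2000, 5000, 1000, 2000, 3000,
               4000, 5000, 6000, 7000, 8000, 10000],
})

# Boolean mask computed on the sorted frame; its index order differs from df's.
mask = ~df.sort_values("amount", ascending=False)["name"].duplicated()

# Reorder the mask to df's index so pandas has nothing left to reindex implicitly.
aligned = mask.reindex(df.index)

result = df[aligned]
print(result)
```

Since every label in df.index is present in the mask, reindex only reorders the values and introduces no missing entries.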

Special case
Since your data already appears to be sorted by amount within each name, you can use:

df[~df.duplicated('name', keep='last')]

   name  amount
2     a    5000
4     b    2000
5     c    3000
6     d    4000
7     e    5000
8     f    6000
9     g    7000
11    h   10000

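However, this does not work in general: it only gives the right answer when the row with the maximum amount is the last row for each name.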
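
To see why the keep='last' shortcut depends on row order, consider a small hypothetical frame (not from the question) in which the largest amount for "a" is not its last row. The sketch compares the positional shortcut with the sort-based approach.

```python
import pandas as pd

# Hypothetical unsorted data: the maximum for "a" sits in the first row, not the last.
unsorted = pd.DataFrame({
    "name": ["a", "a", "a", "b", "b"],
    "amount": [5000, 1000, 2000, 2000, 1000],
})

# keep='last' retains the last row per name, whatever its amount is.
by_position = unsorted[~unsorted.duplicated("name", keep="last")]

# Sorting by amount first retains the row with the true maximum per name.
by_value = (
    unsorted.sort_values("amount", ascending=False)
            .drop_duplicates("name")
            .sort_index()
)

print(by_position)
print(by_value)
```

Here by_position keeps rows 2 and 4, which are not the maxima, while by_value keeps rows 0 and 3.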

Answer 2

You can use idxmax:

df.loc[df.groupby('name').amount.idxmax()]
   name  amount
2     a    5000
4     b    2000
5     c    3000
6     d    4000
7     e    5000
8     f    6000
9     g    7000
11    h   10000
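
The idxmax approach can be broken into two steps to make the mechanism visible. This is a sketch on a frame rebuilt from the question's table; max_labels is an illustrative name.

```python
import pandas as pd

# Sample frame rebuilt from the question's table (the construction itself is assumed).
df = pd.DataFrame({
    "name": list("aaabbcdefghh"),
    "amount": [1000, 2000, 5000, 1000, 2000, 3000,
               4000, 5000, 6000, 7000, 8000, 10000],
})

# For each name, idxmax gives the index label of the row holding the largest amount.
max_labels = df.groupby("name")["amount"].idxmax()

# Select those rows from the original frame by label, so the original index is kept.
result = df.loc[max_labels]
print(result)
```

Note that the rows come back in group-key order rather than in the original row order. The two coincide here because df is already ordered by name; on other data, appending .sort_index() should restore the original order.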

Pandas loses index after GroupBy while dropping duplicates