Question

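I have a dataframe with repeated values in column A. I want to drop the duplicates, keeping the row with the highest value in column B.

So this:

Should become this:

Wes has added some nice functionality to drop duplicates: http://wesmckinney.com/blog/?p=340. But AFAICT, it is designed for exact duplicates, so there is no mention of criteria for selecting which rows are kept.

I'm guessing there is probably an easy way to do this (maybe as easy as sorting the dataframe before dropping duplicates), but I don't know groupby's internal logic well enough to figure it out. Any suggestions?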

Answer 1

This keeps the last row for each value of A, which is not necessarily the maximum:

In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]: 
   A   B
1  1  20
3  2  40
4  3  10

You can also do something like this:

In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]: 
   A   B
A       
1  1  20
2  2  40
3  3  10
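
The same row selection can probably be written without apply by passing the labels returned by idxmax straight to .loc. This is only a sketch: the sample frame is an assumption chosen to be consistent with the outputs shown above (the B values in rows 0 and 2 are guesses).

```python
import pandas as pd

# Assumed sample data: rows 1, 3 and 4 match the outputs above; rows 0 and 2 are guesses.
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})

# For each group in A, idxmax returns the index label of the row with the largest B.
best_labels = df.groupby("A")["B"].idxmax()

# Selecting those labels keeps whole rows and preserves the original index.
result = df.loc[best_labels]
print(result)
```

Unlike the apply version, the result keeps the original row labels rather than using A as the index.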

Answer 2

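The top answer does too much work and looks very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.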

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

   A   B
1  1  20
3  2  40
4  3  10

Or simply group by all the other columns and take the max of the column you need:

df.groupby('A', as_index=False).max()
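
One thing worth checking before choosing between the two approaches: with more than two columns, groupby(...).max() takes the maximum of each column independently, so the values in one output row may come from different input rows. The sketch below uses a hypothetical extra column C to show the difference.

```python
import pandas as pd

# Hypothetical frame with an extra column C (not part of the original question).
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [10, 20, 40, 30],
    "C": [9, 1, 2, 8],
})

# Column-wise maximum per group: B and C may come from different rows.
per_column_max = df.groupby("A", as_index=False).max()

# Row-wise selection: the whole row that holds the largest B in each group.
best_rows = df.sort_values("B", ascending=False).drop_duplicates("A").sort_index()

print(per_column_max)
print(best_rows)
```

With only columns A and B, as in the question, both approaches give the same values.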

Answer 3

I would first sort the dataframe by column B in descending order, then drop duplicates on column A and keep the first occurrence:

df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")

No groupby needed.

Answer 4

Try this:

df.groupby(['A']).max()

Answer 5

You can also try this:

df.drop_duplicates(subset='A', keep='last')

I took this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
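
Note that keep='last' keeps the last occurrence of each value in A, which is the row with the highest B only if the frame happens to be sorted by B in ascending order. A minimal sketch, using assumed sample data in which the largest B for A == 1 is not the last row:

```python
import pandas as pd

# Assumed sample data: for A == 1 the largest B (20) comes before the smaller one (10).
df = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [20, 10, 30, 40, 10]})

# Without sorting, keep="last" keeps the last occurrence, whatever its B value.
unsorted_last = df.drop_duplicates(subset="A", keep="last")

# Sorting by B ascending first makes the last occurrence the one with the largest B.
sorted_last = (
    df.sort_values("B")
    .drop_duplicates(subset="A", keep="last")
    .sort_index()
)

print(unsorted_last)
print(sorted_last)
```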

Answer 6

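I think in your case you don't really need a groupby. I would sort column B in descending order, then drop duplicates on column A. If you want, you can also get a nice, clean new index like this: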

df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)

Answer 7

Simplest solution:

To drop duplicates based on one column:

df = df.drop_duplicates('column_name', keep='last')

To drop duplicates based on multiple columns, pass a list of column names in place of the single name.
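
For the multi-column case, a sketch could look like the following. The column names name, year and score are hypothetical and chosen only for illustration; rows are treated as duplicates only when every listed column matches.

```python
import pandas as pd

# Hypothetical columns: "name" and "year" together identify a duplicate.
df = pd.DataFrame({
    "name": ["a", "a", "a", "b"],
    "year": [2020, 2020, 2021, 2020],
    "score": [1, 2, 3, 4],
})

# Passing a list to subset compares rows on all of the listed columns.
deduped = df.drop_duplicates(subset=["name", "year"], keep="last")
print(deduped)
```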

Answer 8

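Here's a variation I had to solve that's worth sharing: for each unique string in columnA, I wanted to find the most common associated string in columnB.

df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()

If there is a tie for the mode, .any() picks one. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)

For the original question, the corresponding approach simplifies to aggregating columnB with 'max' in place of the mode lambda.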
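
The two aggregations described in this answer can be sketched as follows. The data is hypothetical, and the sketch swaps in .iloc[0] for .any() to pick the first of any tied modes, since that choice is explicit and does not depend on how .any() treats strings.

```python
import pandas as pd

# Hypothetical data: columnA holds keys, columnB the associated strings.
df = pd.DataFrame({
    "columnA": ["x", "x", "x", "y", "y"],
    "columnB": ["p", "q", "q", "r", "s"],
})

# Series.mode() returns all tied modes in sorted order; .iloc[0] picks the first.
# Assumption: this replaces the .any() used in the answer.
most_common = (
    df.groupby("columnA")["columnB"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)

# The same pattern for the original question: the largest B per value of A.
# (Assumed numeric sample data.)
nums = pd.DataFrame({"A": [1, 1, 2, 2, 3], "B": [10, 20, 30, 40, 10]})
highest = nums.groupby("A")["B"].agg("max").reset_index()

print(most_common)
print(highest)
```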

Answer 9

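While the existing posts already answer the question, I made a small change by adding the name of the column that max() is applied to, which makes the code more readable.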

df.groupby('A', as_index=False)['B'].max()

Answer 10

The simplest way to do this:

# First you need to sort this DF as Column A as ascending and column B as descending 
# Then you can drop the duplicate values in A column 
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step. 

import pandas as pd

d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df

    A   B
0   1   30
1   1   40
2   2   50
3   3   42
4   1   38
5   2   30
6   3   25
7   1   32


df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)

df

    A   B
0   1   40
1   2   50
2   3   42

Answer 11

This also works:

a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})

Answer 12

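I'm not going to give you the whole answer (I don't think you are looking for the parsing and file-writing part anyway), but one key hint should be enough: use Python's set() function, then sorted() or .sort() combined with .reverse():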

>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
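
The two steps in the session above can also be combined by passing reverse=True to sorted(). A short sketch with the same input list:

```python
# Same input as the interactive session above.
values = [10, 60, 30, 10, 50, 20, 60, 50, 60, 10, 30]

# set() removes duplicates; sorted(..., reverse=True) orders them from largest to smallest.
unique_desc = sorted(set(values), reverse=True)
print(unique_desc)
```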

python pandas: Remove duplicates by column A, keeping the row with the highest value in column B

12 answers: