Removing entries based on group by

Time: 2017-06-09 11:35:16

Tags: python pandas

I have a dataset that looks like this:

venue_id,latitude,longitude,venue_category,country_code,user_id,uct_time,time_offset
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,4337,Tue Apr 03 20:35:48 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,101773,Tue Apr 03 20:46:53 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,105093,Tue Apr 03 22:39:56 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,58835,Tue Apr 03 22:54:52 +0000 2012,420
....

I need to remove every venue_id that occurs fewer than 100 times.

I tried the following code:

joined = joined[joined.groupby("venue_id").venue_id.transform(len) >= 100]

It was inspired by the answer to question ID 13446480.

The problem is that it gives me the following error:

AttributeError: 'DataFrameGroupBy' object has no attribute 'venue_id'

Keep in mind that I am new to pandas and want to learn, so I would appreciate it if you could give some explanation.

Cheers,

1 answer:

Answer 0: (score: 1)

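It seems the first column is the index, so reset_index should help. When venue_id is the index rather than a regular column, the groupby object has no venue_id attribute to select, which is what the AttributeError is reporting.

So you need: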

joined = joined.reset_index()
joined = joined[joined.groupby("venue_id")['venue_id'].transform(len) >= 100]
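To make the steps visible, here is a minimal sketch on toy data. The values, the `counts` and `filtered` names, and the threshold of 2 (instead of 100) are assumptions chosen only to keep the example small.

```python
import pandas as pd

# Toy data (hypothetical values); venue_id starts out as the index,
# which is the situation suspected above.
joined = pd.DataFrame(
    {
        "venue_id": ["a", "a", "a", "b", "c", "c"],
        "user_id": [1, 2, 3, 4, 5, 6],
    }
).set_index("venue_id")

# Move venue_id back into the columns so it can be selected after groupby.
joined = joined.reset_index()

# transform(len) returns one value per row: the size of that row's group.
counts = joined.groupby("venue_id")["venue_id"].transform(len)

# Keep only rows whose venue appears at least min_count times
# (2 here, 100 in the question).
min_count = 2
filtered = joined[counts >= min_count]
```

Because transform returns a result aligned with the original rows, it can be used directly as a boolean mask. Venue "b" appears only once here, so its row is dropped.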

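If the first column is the index and you do not want to call reset_index, this also works for me (select any regular column before calling transform, otherwise the mask is a whole DataFrame rather than one value per row):

joined = joined[joined.groupby("venue_id")['latitude'].transform(len) >= 100]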

If you are not using the latest version of pandas (0.20.1), you need to group by the index level with level= and select some column:

joined = joined[joined.groupby(level="venue_id")['latitude'].transform(len) >= 100]
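As a rough sketch of the index-based variant, assuming venue_id is the index and latitude is an ordinary column (the toy values and the threshold of 2 are made up for illustration):

```python
import pandas as pd

# Toy data (hypothetical values) with venue_id as the index.
joined = pd.DataFrame(
    {
        "venue_id": ["a", "b", "a", "c", "a", "c"],
        "latitude": [13.1, 13.2, 13.3, 13.4, 13.5, 13.6],
    }
).set_index("venue_id")

# Group on the index level by name, then pick any regular column so that
# transform produces a single value per row.
counts = joined.groupby(level="venue_id")["latitude"].transform(len)

# The mask shares the index of joined, so it filters rows directly.
min_count = 2  # 100 in the question
filtered = joined[counts >= min_count]
```

The column chosen after groupby only serves as something to count over; any column that exists in the frame would do.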

EDIT1:

It is faster to use 'size' instead of len:

joined = joined[joined.groupby("venue_id")['latitude'].transform('size') >= 100]
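A small check, on made-up data, that both spellings give the same per-row group sizes. The `by_len`, `by_size`, and `same` names are only for illustration, and no timing is claimed here.

```python
import pandas as pd

# Toy data (hypothetical values) with venue_id as a regular column.
joined = pd.DataFrame(
    {
        "venue_id": ["a", "a", "a", "b", "c", "c"],
        "latitude": [13.1, 13.2, 13.3, 13.4, 13.5, 13.6],
    }
)

grouped = joined.groupby("venue_id")["latitude"]

# Passing the Python function len calls it once per group.
by_len = grouped.transform(len)

# Passing the string 'size' lets pandas use its built-in group size.
by_size = grouped.transform("size")

# Both give the number of rows in each row's group.
same = bool((by_len == by_size).all())
```

Note that 'size' counts every row in the group, including rows where the selected column is NaN, which matches what len does.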