I have a dataset that looks like this:
venue_id,latitude,longitude,venue_category,country_code,user_id,uct_time,time_offset
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,4337,Tue Apr 03 20:35:48 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,101773,Tue Apr 03 20:46:53 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,105093,Tue Apr 03 22:39:56 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,58835,Tue Apr 03 22:54:52 +0000 2012,420
....
I need to remove every venue_id that occurs fewer than 100 times.
I tried the following code:
joined = joined[joined.groupby("venue_id").venue_id.transform(len) >= 100]
It was inspired by the answer to question ID 13446480.
The problem is that it gives me the following error:
AttributeError: 'DataFrameGroupBy' object has no attribute 'venue_id'
Please keep in mind that I am new to pandas and want to learn, so I would appreciate some explanation.
Cheers,
Dan
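Answer 0: (score: 1)
It seems the first column is the index, not a regular column, which is why venue_id cannot be selected from the groupby object. Calling reset_index turns it back into a column.
So you need: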
joined = joined.reset_index()
joined = joined[joined.groupby("venue_id")['venue_id'].transform(len) >= 100]
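Or, if the first column is the index and you do not want to call reset_index, group by the index name and select any ordinary column (here latitude), so that transform returns a single Series that can be used as a row mask:
joined = joined[joined.groupby("venue_id")['latitude'].transform(len) >= 100]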
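If you are not using the latest version of pandas (0.20.1), the index name cannot be passed to groupby as a key, so group by the index level explicitly and again select some column: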
joined = joined[joined.groupby(level="venue_id")['latitude'].transform(len) >= 100]
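To see why the error appears, here is a minimal sketch on made-up data. The three-column frame, the values, and the lowered threshold of 3 are assumptions for illustration only; the point is that venue_id lives in the index, so the groupby object has no such column to select.

```python
import pandas as pd

# Hypothetical miniature of the data, with venue_id as the index
joined = pd.DataFrame(
    {
        "venue_id": ["a", "a", "a", "b"],
        "latitude": [13.69, 13.69, 13.69, 40.0],
        "user_id": [4337, 101773, 105093, 58835],
    }
).set_index("venue_id")

# venue_id is an index level, not a column, so attribute access fails
try:
    joined.groupby(level="venue_id").venue_id
    raised = False
except AttributeError:
    raised = True

# Selecting a real column works; threshold lowered to 3 for the toy data
counts = joined.groupby(level="venue_id")["latitude"].transform(len)
filtered = joined[counts >= 3]
```

Here counts has one entry per row holding the size of that row's group, so comparing it with the threshold gives a row-wise boolean mask.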
EDIT1:
It is faster to use 'size' instead of len.
joined = joined[joined.groupby("venue_id")['latitude'].transform('size') >= 100]
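For completeness, a small end-to-end sketch of the size-based filter when venue_id is an ordinary column. The sample rows and the min_count name are assumptions for the example (the question uses 100). The groupby(...).filter(...) variant gives the same rows and is sometimes easier to read, though it is usually slower on large data.

```python
import pandas as pd

# Hypothetical sample; venue_id is a regular column here
joined = pd.DataFrame(
    {
        "venue_id": ["a", "a", "a", "b", "c", "c"],
        "latitude": [13.69, 13.69, 13.69, 40.0, 51.5, 51.5],
    }
)

min_count = 2  # the question uses 100

# Per-row group size, then keep rows whose group is large enough
sizes = joined.groupby("venue_id")["latitude"].transform("size")
kept = joined[sizes >= min_count]

# Alternative: let groupby drop whole groups that fail the condition
kept_filter = joined.groupby("venue_id").filter(lambda g: len(g) >= min_count)
```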