Question

I have a groupby object:

grouped = df.groupby('name')
for k, group in grouped:
    print(group)

There are 3 groups: bar, foo and foobar.

  name  time  
2  bar     5  
3  bar     6  


  name  time  
0  foo     5  
1  foo     2  

  name      time  
4  foobar     20  
5  foobar     1

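I need to filter these groups and drop every group that has no time greater than 5. In my example, the group foo should be dropped. I am trying to do this with the filter() function: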

grouped.filter(lambda x: (x.max()['time']>5))

But x is obviously not just a group in DataFrame format.

Answer 1

Assuming your last line of code should indeed be >5 rather than >20, you can do the following:

grouped.filter(lambda x: (x.time > 5).any())

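As you correctly spotted, x is actually a DataFrame of all the rows whose name column matches the key k in your for loop.

So, to filter on whether the time column contains any value greater than 5, you use (x.time > 5).any() as above to test for it.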
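For a self-contained check, here is a minimal sketch of the same filter. The sample frame is rebuilt from the groups printed in the question, and the variable name kept is only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the groups printed in the question.
df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

# The lambda receives one sub-DataFrame per group and must return a single
# boolean: True keeps every row of that group, False drops them all.
kept = df.groupby("name").filter(lambda x: (x["time"] > 5).any())
print(kept)
```

The rows of foo are dropped because neither of its times (5 and 2) is greater than 5. The surviving rows keep their original index labels.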

Answer 2

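I'm not yet used to Python, numpy or pandas. But I have been working on a solution to a similar problem, so let me report my answer using this question as an example.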

import pandas as pd

df = pd.DataFrame()
df['name'] = ['foo', 'foo', 'bar', 'bar', 'foobar', 'foobar']
df['time'] = [5, 2, 5, 6, 20, 1]

grouped = df.groupby('name')
for k, group in grouped:
    print(group)

My answer 1:

indexes_should_drop = grouped.filter(lambda x: (x['time'].max() <= 5)).index
result1 = df.drop(index=indexes_should_drop)

My answer 2:

filter_time_max = grouped['time'].max() > 5
groups_should_keep = filter_time_max.loc[filter_time_max].index
result2 = df.loc[df['name'].isin(groups_should_keep)]

My answer 3:

filter_time_max = grouped['time'].max() <= 5
groups_should_drop = filter_time_max.loc[filter_time_max].index
result3 = df.drop(df[df['name'].isin(groups_should_drop)].index)

Result

    name    time
2   bar     5
3   bar     6
4   foobar  20
5   foobar  1

Notes

My answer 1 does not use the group names to drop groups. If you need the group names, you can get them with: df.loc[indexes_should_drop].name.unique().

grouped['time'].max() <= 5 and grouped.apply(lambda x: (x['time'].max() <= 5)) return the same boolean values, indexed by group name.

The index of filter_time_max holds the group names. It cannot be used directly as an index or label for dropping rows from df.

name
bar       False
foo        True
foobar    False
Name: time, dtype: bool
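Because the mask is indexed by group name rather than by row label, it has to be translated to rows before it can select anything. One way to do that, not used in the answers above and offered only as a sketch, is to look up each row's name in the group-level mask with Series.map. The names keep_group, row_mask and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["foo", "foo", "bar", "bar", "foobar", "foobar"],
    "time": [5, 2, 5, 6, 20, 1],
})

# Group-level mask, indexed by group name:
# True for groups whose maximum time is greater than 5.
keep_group = df.groupby("name")["time"].max() > 5

# Map each row's group name to its group's flag, giving a row-level mask
# that shares df's index and can be used for boolean selection.
row_mask = df["name"].map(keep_group)
result = df[row_mask]
print(result)
```

This selects the rows to keep rather than dropping by label, so no intermediate list of row indexes is needed.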

Answer 3

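Filter a GroupBy by a condition, returning a list or dict of the groups that pass. For example, return a list or dict of the groups whose length is >= 5.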

Return a list of tuples:

[(name,gdf) for name,gdf in df.groupby('Declarer') if len(gdf) >= 5]

Return a dict:

{name:gdf for name,gdf in df.groupby('Declarer') if len(gdf) >= 5}
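To try the dict version end to end, here is a small sketch. Only the column name 'Declarer' comes from the snippets above. The data, the score column and the min_rows threshold are invented for illustration, and the threshold is lowered from 5 to 2 so that the tiny sample has matching groups.

```python
import pandas as pd

# Hypothetical data: only the column name 'Declarer' comes from the answer.
df = pd.DataFrame({
    "Declarer": ["N", "N", "N", "S", "S", "E"],
    "score": [1, 2, 3, 4, 5, 6],
})
min_rows = 2  # the answer uses 5; lowered here for the small sample

# Iterating over a GroupBy yields (group name, sub-DataFrame) pairs,
# so an ordinary comprehension can filter on any per-group condition.
groups = {name: gdf for name, gdf in df.groupby("Declarer") if len(gdf) >= min_rows}

# If a single DataFrame is wanted again, concatenate the surviving groups
# and restore the original row order.
filtered = pd.concat(list(groups.values())).sort_index()
print(filtered)
```

Unlike filter(), this keeps each surviving group as a separate DataFrame, which is convenient when the groups are processed one by one afterwards.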
