Question

Sorry, the title is confusing.

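Say I have a data frame with an id column, where each id appears multiple times. I also have another column; let's call it cumulativeOccurrences.

How do I select all unique ids for which the other column satisfies a given condition on every row with that id, for example cumulativeOccurrences > 20 for each and every instance of the id?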

The code would probably start something like this:

dataframe.groupby('id')

But I can't figure out the rest.

Here is a small sample dataset that should return zero values (no id qualifies):

id            cumulativeOccurrences
5494178       136
5494178        71
5494178        18
5494178        83
5494178        57
5494178       181
5494178        13
5494178        10
5494178        90
5494178      4484
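
For anyone who wants to reproduce this, the sample above could be built as a DataFrame roughly as follows. This is only a sketch; the variable name df is an assumption, chosen to match the later snippets.

```python
import pandas as pd

# Sample data copied from the table above; the name `df` is an assumption.
df = pd.DataFrame({
    'id': [5494178] * 10,
    'cumulativeOccurrences': [136, 71, 18, 83, 57, 181, 13, 10, 90, 4484],
})

# Rows that break the condition cumulativeOccurrences > 20
violations = df[df['cumulativeOccurrences'] <= 20]
print(violations)
```

Three rows (18, 13 and 10) are not greater than 20, which is why id 5494178 should not be selected.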

OK, here is what I came up with after fiddling around some more:

res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
ids = res[res.cumulativeOccurrences['<lambda>']==True].index

That gives me a list of the ids that satisfy the condition. However, there is probably a better way to do this than a list-comprehension lambda inside agg. Any ideas?

Answer 1

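First evaluate the condition to get a boolean mask, then group that mask by id and use GroupBy.all (SeriesGroupBy.all in this call, since the grouped object is a boolean Series):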

res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
ids = res.index[res]
print (ids)
Int64Index([5494172], dtype='int64', name='id')
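
To see the behaviour on a case where one id passes and one fails, here is a minimal sketch with two ids. The second id (1111111) and all values in this frame are made up for illustration. It also cross-checks the result through the per-id minimum: every value in a group exceeds 20 exactly when the group's minimum does, assuming there are no missing values.

```python
import pandas as pd

# Hypothetical data: id 5494178 has a value <= 20, id 1111111 does not.
df = pd.DataFrame({
    'id': [5494178, 5494178, 5494178, 1111111, 1111111],
    'cumulativeOccurrences': [136, 18, 71, 25, 300],
})

# Boolean mask first, then "all" per id.
res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
ids = res.index[res]

# Equivalent check via the per-id minimum (assumes no NaN values).
min_ok = df.groupby('id')['cumulativeOccurrences'].min() > 20
ids_min = min_ok.index[min_ok]

print(ids.tolist())      # [1111111]
print(ids_min.tolist())  # [1111111]
```

With NaN values the two checks can disagree, because NaN > 20 is False while min skips NaN by default.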

EDIT1:

The first set of timings is for unsorted ids, the second for sorted ids.

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000000

df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences':np.random.randint(19,5000,size=N)}, 
                   columns=['id','cumulativeOccurrences'])
print (df.head())

In [125]: %%timeit
     ...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
     ...: ids = res.index[res]
     ...: 
1 loop, best of 3: 1.22 s per loop

In [126]: %%timeit
     ...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
     ...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
     ...: 
1 loop, best of 3: 3.69 s per loop

In [128]: %%timeit
     ...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x])) 
     ...: ids = res.index[res]
     ...: 
1 loop, best of 3: 3.63 s per loop

np.random.seed(123)
N = 10000000

df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences':np.random.randint(19,5000,size=N)}, 
                   columns=['id','cumulativeOccurrences']).sort_values('id').reset_index(drop=True)
print (df.head())

In [130]: %%timeit
     ...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
     ...: ids = res.index[res]
     ...: 
1 loop, best of 3: 795 ms per loop

In [131]: %%timeit
     ...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
     ...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
     ...: 
1 loop, best of 3: 3.23 s per loop

In [132]: %%timeit
     ...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x])) 
     ...: ids = res.index[res]
     ...: 
1 loop, best of 3: 3.15 s per loop

Conclusion: sorted ids and a unique index improve performance. The timings were measured with pandas 0.20.3 under Python 3.
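
The %%timeit cells above need IPython. As a rough sketch for re-running the comparison in a plain script, the standard timeit module can be used instead. The function names mask_then_all and lambda_agg are made up here, and N is reduced (an assumption, to keep the run short), so absolute numbers will differ from those above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100_000  # assumption: far fewer rows than the 10,000,000 used above

df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences': np.random.randint(19, 5000, size=N)})
df_sorted = df.sort_values('id').reset_index(drop=True)

def mask_then_all(frame):
    # Vectorized comparison, then a per-id "all"
    res = (frame['cumulativeOccurrences'] > 20).groupby(frame['id']).all()
    return res.index[res]

def lambda_agg(frame):
    # Python-level loop over the values of each group
    res = frame['cumulativeOccurrences'].groupby(frame['id']).agg(
        lambda x: all([e > 20 for e in x]))
    return res.index[res]

for name, func in [('mask_then_all', mask_then_all), ('lambda_agg', lambda_agg)]:
    for label, frame in [('unsorted', df), ('sorted', df_sorted)]:
        # Best of 3 single runs, in seconds
        seconds = min(timeit.repeat(lambda: func(frame), number=1, repeat=3))
        print(f'{name:>14} {label:>9}: {seconds:.4f} s')
```

Both functions should return the same ids; only the time taken is expected to differ.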
