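The title is confusing.
So, suppose I have a dataframe with a column id whose values occur multiple times. Then I have another column, let's call it cumulativeOccurrences.
How do I select all unique ids for which the other column meets a specific condition, say cumulativeOccurrences > 20 for every instance of that id?
The beginning of the code might look like this: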
dataframe.groupby('id')
But I can't figure out the rest.
Here is a small sample dataset that should return zero values:
id cumulativeOccurrences
5494178 136
5494178 71
5494178 18
5494178 83
5494178 57
5494178 181
5494178 13
5494178 10
5494178 90
5494178 4484
OK, here is what I came up with after some more muddling around:
res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
ids = res[res.cumulativeOccurrences['<lambda>']==True].index
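This gives me a list of the ids that satisfy the condition. However, there is probably a better way to write the agg function than a lambda with a list comprehension. Any ideas?
Answer 0 (score: 2)
First apply the condition, then use all on the grouped result (DataFrameGroupBy.all in the pandas docs; the SeriesGroupBy variant applies here because a boolean Series is grouped):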
res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
ids = res.index[res]
print (ids)
Int64Index([5494172], dtype='int64', name='id')
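For a self-contained check of the approach above, here is a minimal sketch that rebuilds the sample rows from the question and adds one extra id so that the result is not empty. The id 1111111 and its three values are hypothetical and only serve the illustration; `res[res].index` is used as an equivalent spelling of `res.index[res]`.

```python
import pandas as pd

# Sample rows from the question, plus a hypothetical id (1111111)
# whose values are all above 20.
df = pd.DataFrame({
    "id": [5494178] * 10 + [1111111] * 3,
    "cumulativeOccurrences": [136, 71, 18, 83, 57, 181, 13, 10, 90, 4484,
                              25, 300, 21],
})

# Step 1: element-wise comparison gives a boolean Series.
mask = df["cumulativeOccurrences"] > 20

# Step 2: group the mask by id; all() is True only if every row passes.
res = mask.groupby(df["id"]).all()

# Step 3: keep the ids whose group result is True.
ids = res[res].index.tolist()
print(ids)  # [1111111]
```

The id 5494178 is dropped because three of its rows (18, 13 and 10) are not above 20, which matches the "zero values" expectation stated in the question for that sample.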
EDIT1:
The first set of timings is for unsorted id, the second for sorted id.
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000000
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences': np.random.randint(19, 5000, size=N)},
                  columns=['id', 'cumulativeOccurrences'])
print (df.head())
In [125]: %%timeit
...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
...: ids = res.index[res]
...:
1 loop, best of 3: 1.22 s per loop
In [126]: %%timeit
...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
...:
1 loop, best of 3: 3.69 s per loop
In [128]: %%timeit
...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x]))
...: ids = res.index[res]
...:
1 loop, best of 3: 3.63 s per loop
np.random.seed(123)
N = 10000000
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences': np.random.randint(19, 5000, size=N)},
                  columns=['id', 'cumulativeOccurrences']).sort_values('id').reset_index(drop=True)
print (df.head())
In [130]: %%timeit
...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
...: ids = res.index[res]
...:
1 loop, best of 3: 795 ms per loop
In [131]: %%timeit
...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
...:
1 loop, best of 3: 3.23 s per loop
In [132]: %%timeit
...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x]))
...: ids = res.index[res]
...:
1 loop, best of 3: 3.15 s per loop
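The comparison above can also be run outside IPython. The following is a rough sketch using the standard `timeit` module; the helper names `vectorized` and `with_lambda` are made up for this example, and N is reduced to 100,000 rows (an assumption, to keep the run short), so the absolute numbers will not match the timings above.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100_000  # assumption: far fewer rows than the 10,000,000 used above
df = pd.DataFrame({
    "id": np.random.randint(1000, size=N),
    "cumulativeOccurrences": np.random.randint(19, 5000, size=N),
})


def vectorized(frame):
    """Compare first, then reduce each group with all()."""
    res = (frame["cumulativeOccurrences"] > 20).groupby(frame["id"]).all()
    return res[res].index


def with_lambda(frame):
    """Reduce each group with a Python-level lambda."""
    res = frame["cumulativeOccurrences"].groupby(frame["id"]).agg(
        lambda x: all(e > 20 for e in x)
    )
    # astype(bool) guards against the aggregate coming back non-boolean.
    res = res.astype(bool)
    return res[res].index


# Both variants should select exactly the same ids.
assert vectorized(df).equals(with_lambda(df))

t_vec = timeit.timeit(lambda: vectorized(df), number=3)
t_lam = timeit.timeit(lambda: with_lambda(df), number=3)
print(f"vectorized: {t_vec:.3f}s  lambda: {t_lam:.3f}s")
```

Passing a sorted copy, for example `df.sort_values("id").reset_index(drop=True)`, to the same helpers gives the sorted-id variant of the measurement.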
Conclusion: sorting by id and having a unique index can improve performance. The timings were measured with version 0.20.3 under Python 3.
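As a further alternative to the approaches timed above (not part of the original answer, and not benchmarked here): every value in a group is above 20 exactly when the group's minimum is above 20, so the per-id minimum can be compared instead. This sketch uses a small made-up dataset and assumes the column has no missing values, since `min()` skips NaN while `NaN > 20` evaluates to False.

```python
import pandas as pd

# Made-up example data: id 2 has one value (5) that fails the condition.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "cumulativeOccurrences": [30, 25, 50, 5, 21],
})

# Per-id minimum; a group passes only if its smallest value is above 20.
mins = df.groupby("id")["cumulativeOccurrences"].min()
ids_min = mins[mins > 20].index.tolist()

# The compare-then-all() approach for reference.
res = (df["cumulativeOccurrences"] > 20).groupby(df["id"]).all()
ids_all = res[res].index.tolist()

print(ids_min)  # [1, 3]
print(ids_all)  # [1, 3]
```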