I have a pandas DataFrame like:
    a   b  id
1  10   6   1
2   6  -3   1
3  -3  12   1   # First time id 1 has a b value over 10
4   4  23   2   # First time id 2 has a b value over 10
5  12  11   2
6   3  -5   2
How can I create a new DataFrame that, for each id, takes the first row where column b goes over 10, so that the result looks like this:
    a   b  id
1  -3  12   1
2   4  23   2
My DataFrame has 2,000,000 rows and roughly 10,000 distinct id values, so a for loop is very slow.
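For reference, the kind of per-id loop being avoided looks roughly like this (an illustrative sketch, not code from the question; df is the frame above):

import pandas as pd

# Slow approach: filter the whole frame once per id and take the first hit.
# With ~10,000 ids over 2,000,000 rows this repeated scan is what makes
# the loop so slow.
rows = []
for i in df['id'].unique():
    hit = df[(df['id'] == i) & (df['b'] > 10)]
    if not hit.empty:
        rows.append(hit.iloc[0])
result = pd.DataFrame(rows)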
Answer 0 (score: 4)
First filter with fast boolean indexing, then use groupby + first:
df = df[df['b'] > 10].groupby('id', as_index=False).first()
print (df)
   id  a   b
0   1 -3  12
1   2  4  23
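A minimal reproducible sketch of this first solution, rebuilding the small example frame from the question (the variable name out is just illustrative):

import pandas as pd

# Rebuild the small example from the question.
df = pd.DataFrame({'a':  [10, 6, -3, 4, 12, 3],
                   'b':  [6, -3, 12, 23, 11, -5],
                   'id': [1, 1, 1, 2, 2, 2]})

# Keep only rows with b > 10, then take the first such row per id.
out = df[df['b'] > 10].groupby('id', as_index=False).first()
print(out)
#    id  a   b
# 0   1 -3  12
# 1   2  4  23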
If some groups have no b value above 10, the solution is a bit more complicated - the mask needs to be extended with duplicated:
print (df)
    a   b  id
1   7   6   3   <- no value b>10 for id=3
1  10   6   1
2   6  -3   1
3  -3  12   1
4   4  23   2
5  12  11   2
6   3  -5   2
mask = ~df['id'].duplicated(keep=False) | (df['b'] > 10)
df = df[mask].groupby('id', as_index=False).first()
print (df)
   id  a   b
0   1 -3  12
1   2  4  23
2   3  7   6
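To see what the extended mask does, here is a hedged sketch of the intermediate steps on the seven-row example above (the names keep_single and over_thresh are illustrative):

import pandas as pd

# Example with a group (id=3) that never has b > 10.
df = pd.DataFrame({'a':  [7, 10, 6, -3, 4, 12, 3],
                   'b':  [6, 6, -3, 12, 23, 11, -5],
                   'id': [3, 1, 1, 1, 2, 2, 2]})

# Rows whose id occurs only once are kept unconditionally, so a group
# with no b > 10 still contributes its single row to the result.
keep_single = ~df['id'].duplicated(keep=False)
over_thresh = df['b'] > 10

out = df[keep_single | over_thresh].groupby('id', as_index=False).first()
print(out)
#    id  a   b
# 0   1 -3  12
# 1   2  4  23
# 2   3  7   6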
Timings:
#[2000000 rows x 3 columns]
import numpy as np
import pandas as pd

np.random.seed(123)
N = 2000000
df = pd.DataFrame({'id': np.random.randint(10000, size=N),
                   'a': np.random.randint(10, size=N),
                   'b': np.random.randint(15, size=N)})
#print (df)
In [284]: %timeit (df[df['b'] > 10].groupby('id', as_index=False).first())
10 loops, best of 3: 67.6 ms per loop
In [285]: %timeit (df.query("b > 10").groupby('id').head(1))
10 loops, best of 3: 107 ms per loop
In [286]: %timeit (df[df['b'] > 10].groupby('id').head(1))
10 loops, best of 3: 90 ms per loop
In [287]: %timeit df.query("b > 10").groupby('id', as_index=False).first()
10 loops, best of 3: 83.3 ms per loop
#without sorting a bit faster
In [288]: %timeit (df[df['b'] > 10].groupby('id', as_index=False, sort=False).first())
10 loops, best of 3: 62.9 ms per loop
Answer 1 (score: 4)
In [146]: df.query("b > 10").groupby('id').head(1)
Out[146]:
    a   b  id
3  -3  12   1
4   4  23   2
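As an aside (not part of the original answer), head(1) and first() differ slightly in what they return: head(1) keeps the matching rows with their original index, while first() aggregates to one row per id. A quick sketch of both calls:

# head(1): the first matching row of each group, original index preserved.
df.query("b > 10").groupby('id').head(1)

# first(): one aggregated row per id (first non-null value per column).
df.query("b > 10").groupby('id').first()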
Answer 2 (score: 1)
For the case where the last column (id) is already sorted, here is a NumPy solution using np.searchsorted -
import numpy as np
import pandas as pd

def numpy_searchsorted(df, thresh=10):
    # Work on the raw NumPy array; columns are assumed ordered a, b, id.
    a = df.values
    # Rows where b exceeds the threshold.
    m = a[:, 1] > thresh
    mask_idx = np.flatnonzero(m)
    # id values of the filtered rows (the id column is assumed sorted).
    b = a[mask_idx, 2]
    # First occurrence of each distinct id among the filtered rows.
    unq_ids = b[np.concatenate(([True], b[1:] != b[:-1]))]
    idx = np.searchsorted(b, unq_ids)
    out = a[mask_idx[idx]]
    return pd.DataFrame(out, columns=df.columns)
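A quick usage sketch for the function above, on the small example with id already sorted (as this solution assumes):

import pandas as pd

# Small example with the id column sorted, matching the assumption above.
df = pd.DataFrame({'a':  [10, 6, -3, 4, 12, 3],
                   'b':  [6, -3, 12, 23, 11, -5],
                   'id': [1, 1, 1, 2, 2, 2]})

print(numpy_searchsorted(df))
#    a   b  id
# 0 -3  12   1
# 1  4  23   2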
Runtime test -
In [2]: np.random.seed(123)
...: N = 2000000
...: df = pd.DataFrame({'id': np.sort(np.random.randint(10000, size=N)),
...: 'a':np.random.randint(10, size=N),
...: 'b':np.random.randint(15, size=N)})
...:
# @MaxU's soln
In [3]: %timeit df.query("b > 10").groupby('id').head(1)
10 loops, best of 3: 44.8 ms per loop
# @jezrael's best soln that assumes last col as sorted too
In [4]: %timeit (df[df['b'] > 10].groupby('id', as_index=False, sort=False).first())
10 loops, best of 3: 30.1 ms per loop
# Proposed in this post
In [5]: %timeit numpy_searchsorted(df)
100 loops, best of 3: 12.6 ms per loop