I have the following DataFrame:
id value dt first_val_expected
1 null 2018-01-01 2
1 0 2019-01-01 2
1 2 2020-01-01 2
1 8 2021-01-01 2
2 1 2018-01-01 1
2 null 2019-01-01 1
2 2 2020-01-01 1
2 8 2021-01-01 1
df['first_val'] = df[df['value'] > 0].groupby('id')['value'].transform('first')
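The problem with this query is that it only fills the rows where value > 0, and I need the value filled in for every row.
There is a similar thread, but it does not create an additional column: pandas: How to get first positive number after grouping by a column?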
Answer 0 (score: 2)
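The first idea is to use Series.where to replace the non-matching values with missing values; after that, the solution with transform and first works nicely, because first skips missing values within each group: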
print (df['value'].where(df['value'] > 0))
0 NaN
1 NaN
2 2.0
3 8.0
4 1.0
5 NaN
6 2.0
7 8.0
Name: value, dtype: float64
df['first_val'] = df['value'].where(df['value'] > 0).groupby(df['id']).transform('first')
print (df)
id value dt first_val_expected first_val
0 1 NaN 2018-01-01 2 2.0
1 1 0.0 2019-01-01 2 2.0
2 1 2.0 2020-01-01 2 2.0
3 1 8.0 2021-01-01 2 2.0
4 2 1.0 2018-01-01 1 1.0
5 2 NaN 2019-01-01 1 1.0
6 2 2.0 2020-01-01 1 1.0
7 2 8.0 2021-01-01 1 1.0
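For anyone who wants to reproduce the output above, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the intermediate name `positive` is only illustrative; null is assumed to mean NaN.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (null is assumed to be NaN).
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "value": [np.nan, 0, 2, 8, 1, np.nan, 2, 8],
    "dt": pd.to_datetime(["2018-01-01", "2019-01-01",
                          "2020-01-01", "2021-01-01"] * 2),
    "first_val_expected": [2, 2, 2, 2, 1, 1, 1, 1],
})

# Keep positive values and turn everything else into NaN.
# The index is unchanged, so every row is still present.
positive = df["value"].where(df["value"] > 0)

# 'first' returns the first non-missing value per group,
# and transform broadcasts it back to every row of that group.
df["first_val"] = positive.groupby(df["id"]).transform("first")

# The result should match the expected column from the question.
assert (df["first_val"] == df["first_val_expected"]).all()
```

The difference from the attempt in the question is that where keeps all rows, while boolean filtering drops them, so the assignment no longer leaves gaps for rows with value <= 0 or NaN.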
Or use Series.map instead of transform:
df['first_val'] = df['id'].map(df[df['value'] > 0].groupby('id')['value'].first())
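A small sketch of how the map variant behaves, including a group that has no positive value at all. The data here is made up for illustration (id 3 is not in the question), and `first_positive` is just a helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical data: id 3 has no positive value.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "value": [0, 5, np.nan, 7, 0, np.nan],
})

# One row per id that has at least one positive value.
# id 3 is missing from this lookup Series entirely.
first_positive = df[df["value"] > 0].groupby("id")["value"].first()

# map looks up each id in the index of first_positive;
# ids that are not found become NaN.
df["first_val"] = df["id"].map(first_positive)
print(df)
```

So groups without any positive value end up as NaN here too, the same as with the where + transform version.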
Performance (the best test is always on your real data):
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = [1,2,3,4,np.nan]
df = pd.DataFrame({'id':np.random.randint(N // 10,size=N),
'value': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489))
})
df = df.sort_values(['id']).reset_index(drop=True)
df['first_val1'] = df['value'].where(df['value'] > 0).groupby(df['id']).transform('first')
df['first_val2'] = df['id'].map(df[df['value'] > 0].groupby('id')['value'].first())
df['first_val3'] = df.groupby('id')['value'].transform(lambda s: s[s.gt(0).idxmax()] if s.gt(0).any() else np.nan)
In [205]: %timeit df['first_val1'] = df['value'].where(df['value'] > 0).groupby(df['id']).transform('first')
4.13 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [206]: %timeit df['first_val2'] = df['id'].map(df[df['value'] > 0].groupby('id')['value'].first())
3.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [207]: %timeit df['first_val3'] = df.groupby('id')['value'].transform(lambda s: s[s.gt(0).idxmax()] if s.gt(0).any() else np.nan)
752 ms ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
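Before comparing timings it is worth confirming that the three variants really return the same thing. A rough sketch of such a check, on smaller random data than the benchmark; the sizes, the seed and the value distribution below are my own assumptions, not the benchmark setup.

```python
import numpy as np
import pandas as pd

# Assumed test data: 500 rows, 50 ids, values drawn from 0, 1, 2 and NaN.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": rng.integers(0, 50, size=500),
    "value": rng.choice([0.0, 1.0, 2.0, np.nan], size=500),
})
df = df.sort_values("id").reset_index(drop=True)

# Variant 1: where + transform('first')
v1 = df["value"].where(df["value"] > 0).groupby(df["id"]).transform("first")

# Variant 2: filter + first + map
v2 = df["id"].map(df[df["value"] > 0].groupby("id")["value"].first())

# Variant 3: custom function per group (the slow one)
v3 = df.groupby("id")["value"].transform(
    lambda s: s[s.gt(0).idxmax()] if s.gt(0).any() else np.nan
)

# NaN in the same position counts as equal; names and dtypes are ignored.
pd.testing.assert_series_equal(v1, v2, check_names=False, check_dtype=False)
pd.testing.assert_series_equal(v1, v3, check_names=False, check_dtype=False)
```

If both assertions pass, the choice between the variants comes down to speed and readability only.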