How to create a column with the first positive value per group in Pandas?

Date: 2020-06-16 09:34:47

Tags: pandas

I have the following DataFrame:

id value   dt             first_val_expected
1  null    2018-01-01     2
1  0       2019-01-01     2
1  2       2020-01-01     2
1  8       2021-01-01     2
2  1       2018-01-01     1
2  null    2019-01-01     1
2  2       2020-01-01     1
2  8       2021-01-01     1

df['first_val'] = df[df['value'] > 0].groupby('id')['value'].transform('first')

The problem with this query is that it only fills the rows where value > 0, and I need the value filled in for every row.

There is a similar thread, but it does not create an additional column: pandas: How to get first positive number after grouping by a column?

1 Answer:

Answer 0: (score: 2)

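The first idea is to replace the non-matching values with missing values using Series.where. After that, the solution with transform('first') works nicely, because first skips missing values: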

print (df['value'].where(df['value'] > 0))
0    NaN
1    NaN
2    2.0
3    8.0
4    1.0
5    NaN
6    2.0
7    8.0
Name: value, dtype: float64

df['first_val'] = df['value'].where(df['value'] > 0).groupby(df['id']).transform('first')
print (df)
   id  value          dt  first_val_expected  first_val
0   1    NaN  2018-01-01                   2        2.0
1   1    0.0  2019-01-01                   2        2.0
2   1    2.0  2020-01-01                   2        2.0
3   1    8.0  2021-01-01                   2        2.0
4   2    1.0  2018-01-01                   1        1.0
5   2    NaN  2019-01-01                   1        1.0
6   2    2.0  2020-01-01                   1        1.0
7   2    8.0  2021-01-01                   1        1.0

Or use Series.map instead of transform:

df['first_val'] = df['id'].map(df[df['value'] > 0].groupby('id')['value'].first())
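As a rough sketch, both variants can be compared side by side on the sample data. The extra group with id 3, which has no positive value at all, is an assumption added here for illustration and is not part of the question; the variable names are illustrative too. Both approaches should leave that group as NaN.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus a hypothetical id 3 with no positive value.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "value": [np.nan, 0, 2, 8, 1, np.nan, 2, 8, 0, np.nan],
})

# Variant 1: mask non-positive values, then broadcast the first non-missing value per group.
positive = df["value"].where(df["value"] > 0)
via_transform = positive.groupby(df["id"]).transform("first")

# Variant 2: build one value per id from the filtered rows, then map it back by id.
first_per_id = df[df["value"] > 0].groupby("id")["value"].first()
via_map = df["id"].map(first_per_id)

print(pd.DataFrame({"id": df["id"], "via_transform": via_transform, "via_map": via_map}))
```

With Series.map, an id that is missing from the lookup Series simply maps to NaN, which is why the filtered groupby is safe to use there.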

Performance (the best test is always on your real data):

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = [1,2,3,4,np.nan] 
df = pd.DataFrame({'id':np.random.randint(N // 10,size=N),
                   'value': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489))
                   })
df = df.sort_values(['id']).reset_index(drop=True)

df['first_val1'] = df['value'].where(df['value'] > 0).groupby(df['id']).transform('first')

df['first_val2'] = df['id'].map(df[df['value'] > 0].groupby('id')['value'].first())
df['first_val3'] = df.groupby('id')['value'].transform(lambda s: s[s.gt(0).idxmax()] if s.gt(0).any() else np.nan)


In [205]: %timeit df['first_val1'] = df['value'].where(df['value'] > 0).groupby(df['id']).transform('first')
4.13 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [206]: %timeit df['first_val2'] = df['id'].map(df[df['value'] > 0].groupby('id')['value'].first())
3.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [207]: %timeit df['first_val3'] = df.groupby('id')['value'].transform(lambda s: s[s.gt(0).idxmax()] if s.gt(0).any() else np.nan)
752 ms ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
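The %timeit lines above only work in IPython. Outside of it, a comparable measurement could be sketched with the standard timeit module, as below. The wrapper function names are illustrative, the number of repetitions is an arbitrary choice, and the timings will differ by machine and pandas version. The sketch also checks that all three variants return the same values before timing them.

```python
import timeit

import numpy as np
import pandas as pd

# Same benchmark data as above.
np.random.seed(123)
N = 10000
L = [1, 2, 3, 4, np.nan]
df = pd.DataFrame({
    "id": np.random.randint(N // 10, size=N),
    "value": np.random.choice(L, N, p=(0.75, 0.0001, 0.0005, 0.0005, 0.2489)),
})
df = df.sort_values(["id"]).reset_index(drop=True)


def with_where(df):
    return df["value"].where(df["value"] > 0).groupby(df["id"]).transform("first")


def with_map(df):
    return df["id"].map(df[df["value"] > 0].groupby("id")["value"].first())


def with_lambda(df):
    return df.groupby("id")["value"].transform(
        lambda s: s[s.gt(0).idxmax()] if s.gt(0).any() else np.nan
    )


# Check that the three variants agree (names and dtypes are not compared).
expected = with_where(df)
for func in (with_map, with_lambda):
    pd.testing.assert_series_equal(
        expected, func(df), check_names=False, check_dtype=False
    )

# Average over a few calls; 3 repetitions is an arbitrary choice.
for func in (with_where, with_map, with_lambda):
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```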