在按功能运行分组之后,我想从每个分组返回utm_source列的第一个非空值。
这是我编写的代码:
file[file['steps'] == 'Sign-ups'].sort_values(by=['ts']).groupby('anonymous_id')['utm_source'].apply(lambda x: x.first_valid_index())
这似乎返回了这个:
anonymous_id
00003df1-be12-47b8-b3b8-d01c84a22fdf NaN
00009cc0-279f-4ccf-aea4-f6af1f2bb75a NaN
0000a6a0-00bc-475f-a9e5-9dcbb4309e78 NaN
0000c906-7060-4521-8090-9cd600b08974 638.0
0000c924-5959-4e2d-8757-0d10f96ca462 NaN
0000dc27-292c-4676-8a1b-4977f2ad1577 275.0
0000df7e-2579-4071-8aa5-814ab294bf9a 419.0
我不太确定与anon_id关联的值是什么。
以下是我的数据示例:
{'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000'},
'utm_source': {0: nan, 1: 'facebook', 2: 'facebook', 3: nan, 4: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2},
'steps': {0: 'Sign-ups', 1: nan, 2: nan, 3: nan, 4: nan}}
因此,对于每个anonymous_id,我将返回与anon_id相关联的第一个(按时间顺序,按ts列排序)utm_source
答案 0 :(得分:0)
因此,对于每个匿名ID,我都会返回第一个(按时间顺序, tsm列排序)与anon_id关联的utm_source
IIUC,您可以先删除空值,然后再删除groupby first:
df.sort_values('ts').dropna(subset=['utm_source']).groupby('anonymous_id')['utm_source'].first()
输出示例数据:
anonymous_id
00015d49-2cd8-41b1-bbe7-6aedbefdb098 facebook
0002226e-26a4-4f55-9578-2eff2999de7e facebook