Select the first non-null value and create a column label based on the selection

Time: 2019-08-12 21:54:48

Tags: python pandas

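I am trying to find the first non-null value in the utm_source column for each anonymous_id, and to create a new column called first that holds that first non-null value.

I asked a similar question earlier and learned that I can use .first() to get the first non-null value. However, I am having trouble assigning this value to a new column.

Here is my code: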

first_two = pd.DataFrame(file[file['steps'] == 'Sign-ups'].sort_values(by=['ts']).groupby(['anonymous_id','year']).transform(lambda x: x['first'] == x['utm_source'].first()))

When I try to run it, I get the following error message:

KeyError: ('first', 'occurred at index Unnamed: 0')

Here is a sample of the data I am working with:

 {'steps': {0: 'Sign-ups',
  1: nan,
  2: nan,
  3: nan,
  4: nan,
  5: nan,
  6: nan,
  7: nan,
  8: nan,
  9: nan},
 'utm_source': {0: nan,
  1: 'facebook',
  2: 'facebook',
  3: nan,
  4: nan,
  5: nan,
  6: nan,
  7: nan,
  8: nan,
  9: nan},
 'ts': {0: Timestamp('2018-04-11 06:59:20.206000'),
  1: Timestamp('2019-05-18 05:59:11.874000'),
  2: Timestamp('2018-09-10 18:19:25.260000'),
  3: Timestamp('2017-10-11 08:20:18.092000'),
  4: Timestamp('2017-10-11 08:20:31.466000'),
  5: Timestamp('2017-10-11 08:20:37.345000'),
  6: Timestamp('2017-10-11 08:21:01.322000'),
  7: Timestamp('2017-10-11 08:21:14.145000'),
  8: Timestamp('2017-10-11 08:23:47.526000'),
  9: Timestamp('2019-06-12 10:42:50.401000')},
 'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
  1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
  2: '0002226e-26a4-4f55-9578-2eff2999de7e',
  3: '00022b83-240e-4ef9-aaad-ac84064bb902',
  4: '00022b83-240e-4ef9-aaad-ac84064bb902',
  5: '00022b83-240e-4ef9-aaad-ac84064bb902',
  6: '00022b83-240e-4ef9-aaad-ac84064bb902',
  7: '00022b83-240e-4ef9-aaad-ac84064bb902',
  8: '00022b83-240e-4ef9-aaad-ac84064bb902',
  9: '0002ed69-4aff-434d-a626-fc9b20ef1b02'},
 'year': {0: 2018,
  1: 2019,
  2: 2018,
  3: 2017,
  4: 2017,
  5: 2017,
  6: 2017,
  7: 2017,
  8: 2017,
  9: 2019}}

Note: I converted the DataFrame to a dictionary so that everyone can easily view and interact with the data.

An example of my expected output is:

anonymous_id      utm_source          first             year
  1111              Facebook         Facebook           2017
  1234                NaN              NaN              2017 
  1243              Google           Google             2018

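To reiterate, the first column should be labeled with the first non-null value found in utm_source (the first ad that the anonymous_id clicked on).

1 Answer:

Answer 0: (score: 0)

If I understand you correctly, we can use groupby together with first_valid_index: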

df.loc[df.groupby('anonymous_id')['utm_source'].apply(lambda x: x.first_valid_index())]\
  .dropna(subset=['utm_source'])

Output

    steps utm_source                      ts                          anonymous_id    year
1.0   NaN   facebook 2019-05-18 05:59:11.874  00015d49-2cd8-41b1-bbe7-6aedbefdb098  2019.0
2.0   NaN   facebook 2018-09-10 18:19:25.260  0002226e-26a4-4f55-9578-2eff2999de7e  2018.0
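
The snippet above returns the rows where the first valid utm_source occurs, but it does not create the first column the question asks for. One possible way to do that, offered as a sketch and not as the answerer's method, is groupby(...).transform('first'): the 'first' aggregation skips nulls within each group, and transform broadcasts the result back to every row of that group. The small frame below is made up for illustration (shortened IDs, hypothetical timestamps) and is not the asker's data; grouping by anonymous_id only is also an assumption, and year could be added to the grouping keys if needed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the question's data (IDs shortened).
df = pd.DataFrame({
    "anonymous_id": ["a", "a", "a", "b", "b", "c"],
    "utm_source": [np.nan, "facebook", "google", np.nan, np.nan, "google"],
    "ts": pd.to_datetime([
        "2017-10-11 08:20:18", "2017-10-11 08:20:31", "2017-10-11 08:21:01",
        "2018-04-11 06:59:20", "2018-04-12 06:59:20", "2018-09-10 18:19:25",
    ]),
})

# Sort by timestamp so that "first" means "earliest" within each group.
df = df.sort_values("ts")

# GroupBy 'first' ignores nulls; transform returns one value per original row,
# so the result can be assigned directly as a new column.
df["first"] = df.groupby("anonymous_id")["utm_source"].transform("first")

print(df.sort_values(["anonymous_id", "ts"]))
```

Groups with no non-null utm_source at all (ID "b" here) end up with a missing value in first, which matches the NaN row in the expected output.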