I have a DataFrame sorted by "customer_id" and "point_in_time", like this:
import pandas as pd
import numpy as np
testing = pd.DataFrame({"customer_id": (1,1,1,2,2,2,2,2,3,3,3,3,4,4),
                        "point_in_time": (4,5,6,1,2,3,7,9,5,6,8,10,2,5),
                        "x": ("d", "a", "c", "ba", "cd", "d", "o", "a", "g", "f", "h", "d", "df", "b"),
                        "revenue": (np.nan, np.nan, 40, np.nan, np.nan, 23, np.nan, 10, np.nan, np.nan, np.nan, 40, np.nan, 100)})
testing
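Now I want to group the DataFrame by "customer_id" and "revenue". For "revenue", however, a group should start right after the previous non-missing revenue and end with the next non-missing revenue (inclusive). So the groups should look like this:
If I had these groups, I could simply do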
testing.groupby(["customer_id", "groups"])
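I first tried to create these groups by grouping by "customer_id" and applying a function that back-fills the missing values of "revenue":
def my_func(sub_df):
    # Label each row with the next observed revenue of this customer.
    sub_df["groups"] = sub_df["revenue"].bfill()
    # next_function is a placeholder for the per-group processing; it is not defined in this post.
    return sub_df.groupby("groups").apply(next_function)

testing.groupby(["customer_id"]).apply(my_func)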
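Unfortunately, this does not work if a customer has two identical revenue values. In that case, after back-filling, the rows of both groups carry the same value in that customer's "groups" column, so they can no longer be told apart by a further groupby.
So how can this be done? What is the most efficient way to accomplish this task? Thanks in advance!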
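To make the failure mode concrete, here is a minimal sketch. The `demo` frame is a hypothetical example (not part of the data above) with one customer whose two revenue values are both 40:

```python
import numpy as np
import pandas as pd

# Hypothetical single customer whose two revenue values are identical.
demo = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    "point_in_time": [1, 2, 3, 4],
    "revenue": [np.nan, 40, np.nan, 40],
})

# Back-fill: every row receives the next observed revenue as its label.
demo["groups"] = demo["revenue"].bfill()

# Rows 0-1 and rows 2-3 were meant to be two groups, but both are labelled 40.0.
n_groups = demo.groupby(["customer_id", "groups"]).ngroups
print(demo)
print(n_groups)
```

The back-filled label encodes only the revenue value, not its position, so equal values collapse into a single group.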
Answer 0: (score: 1)
Use Series.shift together with Series.notna and Series.cumsum, and add 1 at the end if necessary:
testing["groups"] = testing['revenue'].shift().notna().cumsum() + 1
print(testing)
    customer_id  point_in_time   x  revenue  groups
0             1              4   d      NaN       1
1             1              5   a      NaN       1
2             1              6   c     40.0       1
3             2              1  ba      NaN       2
4             2              2  cd      NaN       2
5             2              3   d     23.0       2
6             2              7   o      NaN       3
7             2              9   a     10.0       3
8             3              5   g      NaN       4
9             3              6   f      NaN       4
10            3              8   h      NaN       4
11            3             10   d     40.0       4
12            4              2  df      NaN       5
13            4              5   b    100.0       5
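The one-liner can be read step by step: a row opens a new group exactly when the row before it has a revenue. Because the shift above runs over the whole frame, the labels line up with customers only when each customer's last row has a revenue, as is the case in the sample data. The sketch below spells out the intermediate steps and adds a per-customer variant. That variant is an assumption of mine rather than part of the answer, and names such as `prev_in_customer` and `summary` are purely illustrative:

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    "customer_id": (1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4),
    "point_in_time": (4, 5, 6, 1, 2, 3, 7, 9, 5, 6, 8, 10, 2, 5),
    "revenue": (np.nan, np.nan, 40, np.nan, np.nan, 23, np.nan, 10,
                np.nan, np.nan, np.nan, 40, np.nan, 100),
})

# Step by step: True where the previous row (frame-wide) has a revenue.
prev_has_revenue = testing["revenue"].shift().notna()
# Running count of those boundaries gives the global group number.
global_groups = prev_has_revenue.cumsum() + 1

# Per-customer variant (assumption): shift and count within each customer,
# so a trailing row without revenue cannot merge into the next customer's group.
prev_in_customer = testing.groupby("customer_id")["revenue"].shift().notna()
testing["groups"] = (
    prev_in_customer.astype(int).groupby(testing["customer_id"]).cumsum() + 1
)

# Example use: row count and closing revenue of every (customer_id, groups) block.
summary = (
    testing.groupby(["customer_id", "groups"])
    .agg(rows=("revenue", "size"), revenue=("revenue", "last"))
    .reset_index()
)
print(summary)
```

In the per-customer variant the numbering restarts at 1 for every customer, so the labels are unique only in combination with "customer_id", which is how the question's `groupby(["customer_id", "groups"])` uses them anyway.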