Question

I have a sorted DataFrame (sorted by "customer_id" and "point_in_time") that looks like this:

import pandas as pd
import numpy as np

testing = pd.DataFrame({"customer_id": (1,1,1,2,2,2,2,2,3,3,3,3,4,4), 
                        "point_in_time": (4,5,6,1,2,3,7,9,5,6,8,10,2,5),
                        "x": ("d", "a", "c", "ba", "cd", "d", "o", "a", "g", "f", "h", "d", "df", "b"),
                        "revenue": (np.nan, np.nan, 40, np.nan, np.nan, 23, np.nan, 10, np.nan, np.nan, np.nan, 40, np.nan, 100)})
testing

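Now I want to group the DataFrame by "customer_id" and "revenue". For "revenue", however, a group should start right after the previous existing revenue and end with the next revenue that occurs. So the groups should look like this:

If I had these groups, I could simply do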

testing.groupby(["customer_id", "groups"])

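I first tried to create these groups by grouping by "customer_id" and applying a function that fills the missing values of "revenue":

def my_func(sub_df):
    # Back-fill so every row carries the next observed revenue
    # (Series.bfill() replaces the deprecated fillna(method="bfill"))
    sub_df["groups"] = sub_df["revenue"].bfill()
    # next_function stands for whatever should be done per group
    return sub_df.groupby("groups").apply(next_function)

testing.groupby(["customer_id"]).apply(my_func)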

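Unfortunately, this does not work if a customer has two revenues with exactly the same value. In that case, after back-filling, the "groups" column for this customer contains only one value, so no further grouping is possible.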
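To make the problem concrete, here is a minimal sketch with a made-up single customer (the `demo` frame is hypothetical, not part of my real data) who has two separate revenue events of the same amount:

```python
import numpy as np
import pandas as pd

# Hypothetical customer with two distinct revenue events, both 40
demo = pd.DataFrame({
    "customer_id": [1, 1, 1, 1],
    "point_in_time": [1, 2, 3, 4],
    "revenue": [np.nan, 40, np.nan, 40],
})

# Back-filling labels every row with the next observed revenue value
demo["groups"] = demo["revenue"].bfill()

# Both events end up with the label 40.0, so they merge into one group
n_groups = demo.groupby(["customer_id", "groups"]).ngroups
print(demo)
print(n_groups)
```

The rows at point_in_time 1-2 and 3-4 should form two groups, but since the label is the revenue value itself, they cannot be told apart.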

So how can this be done? What is the most efficient way to accomplish this task? Thanks in advance!

Answer 1

Use Series.shift together with Series.notna and Series.cumsum, and add 1 at the end if you want the group labels to start at 1:

testing["groups"] = testing['revenue'].shift().notna().cumsum() + 1
print(testing)
    customer_id  point_in_time   x  revenue  groups
0             1              4   d      NaN       1
1             1              5   a      NaN       1
2             1              6   c     40.0       1
3             2              1  ba      NaN       2
4             2              2  cd      NaN       2
5             2              3   d     23.0       2
6             2              7   o      NaN       3
7             2              9   a     10.0       3
8             3              5   g      NaN       4
9             3              6   f      NaN       4
10            3              8   h      NaN       4
11            3             10   d     40.0       4
12            4              2  df      NaN       5
13            4              5   b    100.0       5
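For anyone who wants to see why this works, the one-liner can be split into its three steps. The sketch below reuses the data from the question (without the "x" column); the intermediate names `prev_revenue` and `starts_new_group` and the aggregation at the end are only illustrative assumptions about what might be done with the groups.

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({
    "customer_id": (1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4),
    "point_in_time": (4, 5, 6, 1, 2, 3, 7, 9, 5, 6, 8, 10, 2, 5),
    "revenue": (np.nan, np.nan, 40, np.nan, np.nan, 23, np.nan, 10,
                np.nan, np.nan, np.nan, 40, np.nan, 100),
})

# Step 1: move each revenue down one row, so a row "sees" the previous row's revenue
prev_revenue = testing["revenue"].shift()
# Step 2: True on the row directly after an observed revenue, i.e. where a new group starts
starts_new_group = prev_revenue.notna()
# Step 3: running count of group starts; + 1 makes the labels begin at 1
testing["groups"] = starts_new_group.cumsum() + 1

# Illustrative aggregation: number of rows and closing revenue per group
summary = testing.groupby(["customer_id", "groups"]).agg(
    n_rows=("point_in_time", "count"),
    revenue=("revenue", "last"),
)
print(summary)
```

Because the label is a running counter rather than the revenue value, two identical revenues for the same customer still land in different groups. The counter runs over the whole frame, so it relies on the rows being sorted by "customer_id" and "point_in_time" as in the question. If a customer's last row has no revenue, that row shares a label with the first rows of the next customer, but grouping by both "customer_id" and "groups" still keeps them apart.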
