Question

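I'm trying to flag new vs. existing customers in a dataframe, where 'existing' means the customer already appears in the dataframe within the 90 days before the day of the order. I'm looking for the best pandas way to do this. Currently I mask by date and then look in the series: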

from datetime import datetime, timedelta


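def is_existing(row):
    # Cutoff: midnight of the day before this order was placed
    cutoff = (row['placed_at'] - timedelta(days=1)).normalize()
    mask = df_only_90_days['placed_at'] <= cutoff
    # `in` on a Series tests the index, so compare against the values instead
    return row['customer_id'] in df_only_90_days.loc[mask, 'customer_id'].values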


df_only_90_days.apply(is_existing, axis=1)

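This works fine for a few thousand records, but it is far too slow once I get into millions of records. Apologies, I'm also new to pandas. Any ideas?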

Answer 1

You can use the pandas groupby functionality on customer_id, and then look at each group individually.

Suppose your dataframe looks like this:

   customer_id                  placed_at
0            1 2016-11-17 19:16:35.635774
1            2 2016-11-17 19:16:35.635774
2            3 2016-11-17 19:16:35.635774
3            4 2016-11-17 19:16:35.635774
4            5 2016-11-17 19:16:35.635774
5            5 2016-07-07 00:00:00.000000
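
To follow along, a frame like the one above could be built roughly as below. The values are copied from the table; the variable name df is the one the rest of this answer uses.

```python
import pandas as pd

# Sample data copied from the table above.
ts = pd.Timestamp("2016-11-17 19:16:35.635774")
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 5],
    "placed_at": [ts, ts, ts, ts, ts, pd.Timestamp("2016-07-07")],
})

print(df)
```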

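Customer 5 already existed more than 90 days earlier, but none of the other customers did. Using groupby we can create a groupby object in which each group contains all rows with a particular customer_id. We get one group for each unique customer_id in your dataframe. When we apply a function to this groupby object, it is applied to each group.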

 groups = df.groupby("customer_id")

Then we can define a function that inspects a given group to see whether that customer existed 90 days earlier.

import datetime

def existedBefore(g):
    # If the difference between the max and min placed_at values is less
    # than 90 days, return False. Otherwise, return True.
    # If the group only has 1 row, then max and min are the same,
    # so this check still works.
    if g.placed_at.max() - g.placed_at.min() >= datetime.timedelta(90):
        return True

    return False

Now, if we run:

groups.apply(existedBefore)

We get:

customer_id
1    False
2    False
3    False
4    False
5     True

So we can see that customer 5 existed before.

The performance of this solution depends on how many unique customers you have. For details on the performance of groupby apply, see this link: Pandas groupby apply performing slow
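
If the per-group Python function turns out to be the bottleneck, one possible workaround is to compute the same max-minus-min check with groupby's built-in min and max aggregations instead of apply. This is only a sketch on the sample data from above, not a benchmark; the names span and existed_before are made up for illustration.

```python
import pandas as pd

# Sample data from the table earlier in this answer.
ts = pd.Timestamp("2016-11-17 19:16:35.635774")
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 5],
    "placed_at": [ts, ts, ts, ts, ts, pd.Timestamp("2016-07-07")],
})

# First and last order per customer, using built-in aggregations
# rather than a Python function called once per group.
span = df.groupby("customer_id")["placed_at"].agg(["min", "max"])

# Same test as existedBefore: do the orders span at least 90 days?
existed_before = (span["max"] - span["min"]) >= pd.Timedelta(days=90)

print(existed_before)
```

The result is a boolean Series indexed by customer_id, the same shape as the output of groups.apply(existedBefore).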

Vectorized solution

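If you are just looking for all users who signed up at least 90 days before today, you can take a vectorized approach rather than relying on apply:

import datetime

priors = df[datetime.datetime.now() - df.placed_at >= datetime.timedelta(90)]

priors will look as follows:

   customer_id  placed_at
5            5 2016-07-07

So we see that customer 5 existed 90 days before today. Your original solution was very close to this; the problem is that apply is slow for large data frames. There are ways to improve that performance, but this vectorized approach can give you what you need.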
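
For the per-order flag the question asks about, something along these lines might work without a row-wise apply. It is a sketch under a few assumptions: the order data is invented, "existing" is taken to mean the same customer has an earlier order no more than 90 days before this one, and an earlier order on the same day counts too (the question's code only counts orders from before the order day). The trick is that if any earlier order falls inside the 90-day window, the most recent earlier order does as well, so checking the gap to the previous order is enough.

```python
import pandas as pd

# Hypothetical order history; column names follow the question.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "placed_at": pd.to_datetime([
        "2016-01-01", "2016-02-15", "2016-03-01", "2016-08-01", "2016-03-20",
    ]),
})

# Sort so each customer's orders are contiguous and in time order.
orders = orders.sort_values(["customer_id", "placed_at"])

# Time since the same customer's previous order (NaT for a first order).
gap = orders.groupby("customer_id")["placed_at"].diff()

# Existing = the previous order is at most 90 days earlier.
# Comparisons against NaT are False, so first orders count as new.
orders["is_existing"] = gap <= pd.Timedelta(days=90)

# Restore the original row order for display.
print(orders.sort_index())
```

Note that two orders with identical timestamps give a gap of zero, so the second one is flagged as existing; deduplicate first if that is not what you want.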

Finding existing customers 90 days prior with pandas
