Pandas - Counting the number of customers between a start date and an end date

Time: 2020-09-22 09:09:09

Tags: python pandas

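I want to count how many customers I had in each month over the past years. My DataFrame contains a customer ID, a start date (when the customer became a customer) and an end date (when the customer stopped being a customer):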

Customer_ID     StartDate    EndDate
1               01/01/2019    NAT
2               25/10/2017    01/06/2020
2               13/06/2012    15/07/2015
2               20/12/2015    03/01/2016
2               25/03/2016    14/06/2017
3               05/06/2018    05/06/2019
3               12/12/2019    NAT

The result I want is a count of the customers that were "active" in each month/year combination:

MONTH YEAR  NUMB_CUSTOMERS
01    2013  1
02    2013  1
03    2013  1
04    2013  1
...
01    2019  2
...
09    2020  2

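I would like to avoid for loops, since they would take too long (my table has more than 100,000 rows).

Does anyone have an idea for doing this neatly and quickly?

Thanks!

1 Answer:

Answer 0: (score: 0)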

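First, read the data and put it into a form the program can work with:

import pandas as pd
import datetime

df = pd.read_csv("table.csv")
# Drop the day: "25/10/2017" -> "10/2017"
func = lambda x: x.split('/', maxsplit=1)[1]

df["StartDate"] = df["StartDate"].apply(func)

# Open-ended rows hold the literal string "NAT"; leave those untouched
mask = df["EndDate"] != "NAT"
df.loc[mask, "EndDate"] = df.loc[mask, "EndDate"].apply(func)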
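
The string split above keeps only the "mm/YYYY" part of each date. As an alternative sketch (not part of the original answer), the dates could instead be parsed with pd.to_datetime and reduced to monthly periods, which sort chronologically. The helper name load_months and the inline sample rows are assumptions made for illustration.

```python
import io
import pandas as pd

# Two rows in the same layout as table.csv, inlined so the sketch runs on its own.
csv_text = """Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
"""

def load_months(source):
    """Read the CSV and convert both date columns to monthly periods."""
    df = pd.read_csv(source, dtype={"StartDate": str, "EndDate": str})
    for col in ("StartDate", "EndDate"):
        # errors="coerce" turns the "NAT" placeholder into a real missing value
        dates = pd.to_datetime(df[col], format="%d/%m/%Y", errors="coerce")
        df[col] = dates.dt.to_period("M")
    return df

months = load_months(io.StringIO(csv_text))
print(months)
```

With real missing values in EndDate, open-ended customers can be filtered with isna() or dropna() instead of comparing against a string.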

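Then compute the changes in the number of customers (you are essentially taking the derivative of the customer count):

# Rows starting in each month (+) and rows ending in each month (-)
customers_gained = df[["Customer_ID", "StartDate"]].groupby("StartDate").agg("count")
customers_lost = df[["Customer_ID", "EndDate"]].groupby("EndDate").agg("count")
# "NAT" means the customer never left; errors="ignore" avoids a KeyError if no such row exists
customers_lost.drop("NAT", inplace=True, errors="ignore")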
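
To make the "derivative" idea concrete, here is a minimal sketch on a handful of made-up month labels (the values are illustrative, not taken from the question). It nets gains against losses per month; Series.value_counts is used here instead of the groupby above, which is a substitution on my part.

```python
import pandas as pd

# Illustrative month labels only; "NAT" marks customers that are still active.
starts = pd.Series(["06/2012", "12/2015", "03/2016", "03/2016"])
ends = pd.Series(["07/2015", "01/2016", "NAT", "NAT"])

gained = starts.value_counts()             # customers gained per month
lost = ends[ends != "NAT"].value_counts()  # customers lost per month

# Net change per month; a month present on only one side counts as 0 on the other.
net_change = gained.sub(lost, fill_value=0).astype(int)
print(net_change)
```

Summing the net changes in chronological order gives the number of active customers at any point, which is what the cumulative sum in the answer does.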

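Build a time table with one entry per month to collect all the changes in the number of customers.

def make_time_table(start, end):
    # Series of zeros with one "mm/YYYY" label per month between start and end
    start_date = datetime.datetime.strptime(start, "%d/%m/%Y")
    end_date = datetime.datetime.strptime(end, "%d/%m/%Y")
    # freq="M" generates month-end dates, so the month of `end` is only included
    # when `end` is its last day (pandas 2.2+ spells this alias "ME")
    data_range = pd.date_range(start_date, end_date, freq="M")
    string_range = [el.strftime("%m/%Y") for el in data_range]
    ser = pd.Series([0]*data_range.size, index=string_range)
    return ser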
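
A possible variant of this helper, sketched under the assumption that a PeriodIndex is acceptable in place of "mm/YYYY" strings: pd.period_range includes both the first and the last month, so no month is lost at the end of the range. The name make_period_table is hypothetical.

```python
import pandas as pd

def make_period_table(first_month, last_month):
    """Return a Series of zeros indexed by every month in the closed range."""
    months = pd.period_range(start=first_month, end=last_month, freq="M")
    return pd.Series(0, index=months)

table = make_period_table("2012-01", "2020-12")
print(table.head())
print(len(table))
```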

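Next, write the changes into time_table and "integrate" them with a cumulative sum:

# Months 01/2012 to 11/2020; every month that occurs in the data must fall inside this range
time_table = make_time_table("01/01/2012", "01/12/2020")
time_table[customers_gained.index] = customers_gained["Customer_ID"]
time_table[customers_lost.index] -= customers_lost["Customer_ID"]
# Running total of the monthly net changes = number of active customers
result = time_table.cumsum()
print(result)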
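
For reference, the whole pipeline can also be sketched with monthly periods instead of string labels. This is an alternative formulation, not the answer's own code. The function name monthly_active and the count_end_month flag are hypothetical; the flag addresses one thing worth knowing about the approach above: a customer is subtracted in the month the contract ends, so that month does not count them as active.

```python
import io
import pandas as pd

# Same rows as table.csv, inlined so the sketch is self-contained.
csv_text = """Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
2,13/06/2012,15/07/2015
2,20/12/2015,03/01/2016
2,25/03/2016,14/06/2017
3,25/03/2016,05/06/2019
3,12/12/2019,NAT
"""

def monthly_active(df, first_month, last_month, count_end_month=False):
    """Active rows per month; the range must cover every date in df."""
    start = pd.to_datetime(df["StartDate"], format="%d/%m/%Y").dt.to_period("M")
    end = pd.to_datetime(df["EndDate"], format="%d/%m/%Y",
                         errors="coerce").dt.to_period("M")
    months = pd.period_range(first_month, last_month, freq="M")
    gained = start.value_counts().reindex(months, fill_value=0)
    lost = end.dropna().value_counts().reindex(months, fill_value=0)
    if count_end_month:
        # Apply each loss one month later, so the end month still counts
        lost = lost.shift(1, fill_value=0)
    return (gained - lost).cumsum()

df = pd.read_csv(io.StringIO(csv_text))
result = monthly_active(df, "2012-01", "2020-12")
result_inclusive = monthly_active(df, "2012-01", "2020-12", count_end_month=True)
print(result.tail(8))
```

Like the original, this counts rows rather than distinct customers, which is equivalent as long as no customer has overlapping periods.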

Output:

01/2012    0
02/2012    0
03/2012    0
04/2012    0
05/2012    0
06/2012    1
07/2012    1
...
10/2019    2
11/2019    2
12/2019    3
01/2020    3
02/2020    3
03/2020    3
04/2020    3
05/2020    3
06/2020    2
07/2020    2
08/2020    2
09/2020    2
10/2020    2
11/2020    2
dtype: int64
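
The question asked for separate MONTH and YEAR columns. A minimal sketch of that reshaping, assuming a Series with "mm/YYYY" labels like the one printed above (the three sample values are copied from that output; the helper name to_month_year_table is hypothetical):

```python
import pandas as pd

# Three entries copied from the output above, standing in for the full result.
result = pd.Series([2, 2, 3], index=["10/2019", "11/2019", "12/2019"])

def to_month_year_table(counts):
    """Split "mm/YYYY" labels into MONTH and YEAR columns."""
    parts = counts.index.to_series().str.split("/", expand=True)
    return pd.DataFrame({
        "MONTH": parts[0].to_numpy(),
        "YEAR": parts[1].to_numpy(),
        "NUMB_CUSTOMERS": counts.to_numpy(),
    })

table = to_month_year_table(result)
print(table.to_string(index=False))
```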

table.csv

Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
2,13/06/2012,15/07/2015
2,20/12/2015,03/01/2016
2,25/03/2016,14/06/2017
3,25/03/2016,05/06/2019
3,12/12/2019,NAT