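I want to count how many customers I had in each month of each past year. My DataFrame contains the customer ID, a start date (when the customer became a customer) and an end date (when the customer stopped being a customer):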
Customer_ID StartDate EndDate
1 01/01/2019 NAT
2 25/10/2017 01/06/2020
2 13/06/2012 15/07/2015
2 20/12/2015 03/01/2016
2 25/03/2016 14/06/2017
3 05/06/2018 05/06/2019
3 12/12/2019 NAT
The result I want is the number of customers that were "active" in each month/year combination:
MONTH YEAR NUMB_CUSTOMERS
01 2013 1
02 2013 1
03 2013 1
04 2013 1
...
01 2019 2
...
09 2020 2
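I would like to avoid for loops because they would take too long (my table has more than 100,000 rows).
Does anyone have an idea for doing this neatly and quickly?
Thanks!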
Answer 0 (score: 0)
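First, read the data and bring it into a form that is easy to process by reducing every date to its month/year part: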
import pandas as pd
import datetime
df = pd.read_csv("table.csv")
# Drop the day: "25/10/2017" becomes "10/2017"
func = lambda x: x.split('/', maxsplit=1)[1]
df["StartDate"] = df["StartDate"].apply(func)
# Open-ended rows carry the literal string "NAT" and are left untouched
mask = df["EndDate"] != "NAT"
df.loc[mask, "EndDate"] = df.loc[mask, "EndDate"].apply(func)
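As an alternative to the string splitting above, the dates could be parsed into real datetimes and reduced to monthly periods. This is only a sketch: the inline `csv_text` is a hypothetical two-row excerpt of table.csv so that the snippet runs on its own, and `start_month` / `end_month` are illustrative names that the answer does not use.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt of table.csv so the sketch is self-contained
csv_text = """Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse day/month/year strings; values that are not dates (such as "NAT") become NaT
start = pd.to_datetime(df["StartDate"], format="%d/%m/%Y")
end = pd.to_datetime(df["EndDate"], format="%d/%m/%Y", errors="coerce")

# Reduce each date to its calendar month
start_month = start.dt.to_period("M")
end_month = end.dt.to_period("M")

print(start_month.tolist())
print(end_month.tolist())
```

Working with periods rather than strings keeps the months sortable and lets missing end dates be handled with `isna()` / `dropna()` instead of comparing against the text "NAT".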
Next, count the changes in the number of customers per month (this is essentially the derivative of the customer count):
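# Number of contracts starting in each month
customers_gained = df[["Customer_ID", "StartDate"]].groupby("StartDate").agg("count")
# Number of contracts ending in each month
customers_lost = df[["Customer_ID", "EndDate"]].groupby("EndDate").agg("count")
# Open-ended contracts ("NAT") are not losses; errors="ignore" avoids a KeyError if none exist
customers_lost.drop("NAT", inplace=True, errors="ignore")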
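The "derivative" idea can be illustrated on a toy example: count starts and ends per month, subtract them to get the net change, and take a running sum to recover the number of active customers. The month labels below are made-up data, and `net_change` and `active` are illustrative names, not part of the answer's code.

```python
import pandas as pd

# Hypothetical toy data: the months in which contracts start and end
starts = pd.Series(["2020-01", "2020-01", "2020-03"])
ends = pd.Series(["2020-02", "2020-04"])

# Every month of interest, as strings, so months without events get a 0
months = pd.period_range("2020-01", "2020-04", freq="M").astype(str)

gained = starts.value_counts().reindex(months, fill_value=0)
lost = ends.value_counts().reindex(months, fill_value=0)

# Net change per month, then the running total of those changes
net_change = gained - lost
active = net_change.cumsum()
print(active)
```

Note that in this convention a contract is already subtracted in the month in which it ends, which is also how the answer's code behaves.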
Then build a table with one entry per month to collect all changes in the number of customers.
def make_time_table(start, end):
    # Return a zero-filled Series indexed by "MM/YYYY" for every month end in [start, end]
    start_date = datetime.datetime.strptime(start, "%d/%m/%Y")
    end_date = datetime.datetime.strptime(end, "%d/%m/%Y")
    # freq="M" means month end; newer pandas versions name this alias "ME"
    data_range = pd.date_range(start_date, end_date, freq="M")
    string_range = [el.strftime("%m/%Y") for el in data_range]
    ser = pd.Series([0]*data_range.size, index=string_range)
    return ser
Finally, write the changes into time_table and "integrate" them with a cumulative sum:
time_table = make_time_table("01/01/2012", "01/12/2020")
time_table[customers_gained.index] = customers_gained["Customer_ID"]
time_table[customers_lost.index] -= customers_lost["Customer_ID"]
result = time_table.cumsum()
print(result)
Output:
01/2012 0
02/2012 0
03/2012 0
04/2012 0
05/2012 0
06/2012 1
07/2012 1
...
10/2019 2
11/2019 2
12/2019 3
01/2020 3
02/2020 3
03/2020 3
04/2020 3
05/2020 3
06/2020 2
07/2020 2
08/2020 2
09/2020 2
10/2020 2
11/2020 2
dtype: int64
table.csv
Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
2,13/06/2012,15/07/2015
2,20/12/2015,03/01/2016
2,25/03/2016,14/06/2017
3,25/03/2016,05/06/2019
3,12/12/2019,NAT
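For comparison, here is a self-contained sketch of the same cumulative-sum approach using pandas periods, ending with a table in the MONTH / YEAR / NUMB_CUSTOMERS layout asked for in the question. It rests on two assumptions that differ from or go beyond the answer: a customer is still counted as active in the month the contract ends (so the loss is booked one month later), and the periods of a single customer never overlap, so counting rows equals counting customers. The data is inlined from table.csv above.

```python
import io
import pandas as pd

csv_text = """Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
2,13/06/2012,15/07/2015
2,20/12/2015,03/01/2016
2,25/03/2016,14/06/2017
3,25/03/2016,05/06/2019
3,12/12/2019,NAT
"""
df = pd.read_csv(io.StringIO(csv_text))

# Reduce start and end dates to monthly periods; "NAT" becomes a missing value
start = pd.to_datetime(df["StartDate"], format="%d/%m/%Y").dt.to_period("M")
end = pd.to_datetime(df["EndDate"], format="%d/%m/%Y", errors="coerce").dt.to_period("M")

# Same month range as the answer's time table
months = pd.period_range("2012-01", "2020-11", freq="M")

gained = start.value_counts().reindex(months, fill_value=0)
# Assumption: the end month still counts as active, so book the loss one month later
lost = (end.dropna() + 1).value_counts().reindex(months, fill_value=0)

# Running total of the monthly net change
active = (gained - lost).cumsum()

result = pd.DataFrame({
    "MONTH": months.month,
    "YEAR": months.year,
    "NUMB_CUSTOMERS": active.to_numpy(),
})
print(result.tail())
```

No Python-level loop over rows is involved, so this should scale to tables with well over 100,000 rows. If the end month should not count as active, dropping the `+ 1` gives the same convention as the answer.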