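I have a pandas.DataFrame containing multiple invoices for multiple customers.
I am looking for an elegant way to compute the time between two consecutive invoices for each customer.
My DataFrame looks like this (the index is the invoice number, and the last column is the result I expect):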
        CustomerID  InvoiceDate          time between 2 orders
index
536365  17850.0     2010-12-01 08:26:00  0 minutes (or np.nat)
536366  17850.0     2010-12-01 08:28:00  2 minutes
536367  13047.0     2010-12-01 08:34:00  0 minutes (It's a new customer)
536369  13047.0     2010-12-01 08:35:00  1 minute
536371  13748.0     2010-12-01 09:00:00  0 minute (new customer)
536372  17850.0     2010-12-01 09:01:00  33 minutes (see line #2)
536373  17850.0     2010-12-01 09:02:00  1 minute
536374  15100.0     2010-12-01 09:09:00  0 minute
Here is what I have come up with so far (but it obviously doesn't work!):
df = df.sort_values(['CustomerID', 'InvoiceDate']) #To order first according
df = df.set_index('index', drop = True)
for CustomerID in df['CustomerID'].unique():
    index = df.set_index('CustomerID').index.get_loc(CustomerID)
    df['Ordersep'].iloc[index] = df['InvoiceDate'].iloc[index].diff()
Any ideas that could help me?
Answer 0 (score: 2)
You can use groupby() together with diff():
import pandas as pd

df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
# diff() within each customer group; each customer's first invoice gets NaT
df["timedelta"] = df.groupby("CustomerID")["InvoiceDate"].diff()
df
    index  CustomerID         InvoiceDate timedelta
0  536365     17850.0 2010-12-01 08:26:00       NaT
1  536366     17850.0 2010-12-01 08:28:00  00:02:00
2  536367     13047.0 2010-12-01 08:34:00       NaT
3  536369     13047.0 2010-12-01 08:35:00  00:01:00
4  536371     13748.0 2010-12-01 09:00:00       NaT
5  536372     17850.0 2010-12-01 09:01:00  00:33:00
6  536373     17850.0 2010-12-01 09:02:00  00:01:00
7  536374     15100.0 2010-12-01 09:09:00       NaT
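For anyone who wants to reproduce this, here is a minimal self-contained sketch on the sample data from the question. It assumes the rows are already in chronological order (otherwise sort by InvoiceDate first). The "minutes" column is a hypothetical extra, not something the question's DataFrame contains; it only shows one way to get the "0 minutes for a new customer" form the question asks for.

```python
import pandas as pd

# Sample data copied from the question; "index" holds the invoice number.
df = pd.DataFrame(
    {
        "index": [536365, 536366, 536367, 536369,
                  536371, 536372, 536373, 536374],
        "CustomerID": [17850.0, 17850.0, 13047.0, 13047.0,
                       13748.0, 17850.0, 17850.0, 15100.0],
        "InvoiceDate": [
            "2010-12-01 08:26:00", "2010-12-01 08:28:00",
            "2010-12-01 08:34:00", "2010-12-01 08:35:00",
            "2010-12-01 09:00:00", "2010-12-01 09:01:00",
            "2010-12-01 09:02:00", "2010-12-01 09:09:00",
        ],
    }
)
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

# Time since the same customer's previous invoice (NaT for the first one).
df["timedelta"] = df.groupby("CustomerID")["InvoiceDate"].diff()

# Hypothetical extra column: whole minutes, with 0 instead of NaT
# for a customer's first invoice.
df["minutes"] = (
    df["timedelta"]
    .fillna(pd.Timedelta(0))
    .dt.total_seconds()
    .div(60)
    .astype(int)
)

print(df[["index", "CustomerID", "minutes"]])
```

Keeping NaT is arguably the safer default, since a 0 cannot be told apart from two invoices issued in the same minute; the fillna step is only there to match the layout in the question.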
Answer 1 (score: 0)
This should work, assuming you have already sorted by customer ID and invoice date (it may need a little tweaking):
for customer_id in df.CustomerID.unique():
    matching_customer_mask = df.CustomerID == customer_id
    customer_df = df[matching_customer_mask]
    order_times = customer_df.InvoiceDate
    # Previous invoice time for the same customer (NaT for the first one)
    prev_order_times = customer_df.InvoiceDate.shift(1)
    df.loc[matching_customer_mask, 'Ordersep'] = order_times - prev_order_times
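What this does is shift the invoice date column down by one step within each customer's rows and then compute the difference you want.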
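The same shift-and-subtract idea can probably be written without the explicit loop by letting groupby() do the per-customer shift. The sketch below is an assumption about how that might look, not the answerer's code, and the four-row sample is a made-up subset of the question's data.

```python
import pandas as pd

# Small illustrative subset of the question's data (hypothetical sample).
df = pd.DataFrame(
    {
        "CustomerID": [17850.0, 17850.0, 13047.0, 17850.0],
        "InvoiceDate": pd.to_datetime(
            [
                "2010-12-01 08:26:00",
                "2010-12-01 08:28:00",
                "2010-12-01 08:34:00",
                "2010-12-01 09:01:00",
            ]
        ),
    }
)

# Sort so that each customer's invoices are in chronological order.
df = df.sort_values(["CustomerID", "InvoiceDate"])

# Shift the dates down one step within each customer, then subtract.
prev_order_times = df.groupby("CustomerID")["InvoiceDate"].shift(1)
df["Ordersep"] = df["InvoiceDate"] - prev_order_times

# Restore the original row order for display.
print(df.sort_index())
```

Because the subtraction aligns on the row index, the result lands on the right invoice even after sorting.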