I have the following DataFrame:
import pandas as pd
data = pd.DataFrame({
'ID': ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25',
'2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
'Payment_Term': [7,8,3,6,4,7,8,5,3,6],
'Payment_Date': ['2020-07-05', '2020-07-05','2020-07-03', '2020-07-21', '2020-07-31',
'2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
})
df = pd.DataFrame(data, columns = ['ID', 'Invoice_Date', 'Payment_Term', 'Payment_Date'])
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'].astype(str), format='%Y-%m-%d')
df['Payment_Date'] = pd.to_datetime(df['Payment_Date'].astype(str), format='%Y-%m-%d')
df['Due_Date'] = df['Invoice_Date'] + pd.to_timedelta(df['Payment_Term'], unit = 'd')
df['Delay'] = df['Payment_Date'] - df['Due_Date']
df['Delay'] = df['Delay'].dt.days
print(df)
Out [1]:
      ID Invoice_Date  Payment_Term Payment_Date   Due_Date  Delay
0  27459   2020-06-26             7   2020-07-05 2020-07-03      2
1  27459   2020-06-29             8   2020-07-05 2020-07-07     -2
2  27459   2020-06-30             3   2020-07-03 2020-07-03      0
3  27459   2020-07-14             6   2020-07-21 2020-07-20      1
4  27459   2020-07-25             4   2020-07-31 2020-07-29      2
5  27459   2020-07-30             7   2020-08-15 2020-08-06      9
6  27459   2020-08-02             8   2020-08-22 2020-08-10     12
7  48002   2020-05-13             5   2020-06-16 2020-05-18     29
8  48002   2020-06-20             3   2020-06-23 2020-06-23      0
9  48002   2020-06-28             6   2020-07-05 2020-07-04      1
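Now I want to create a new column named Average_Delay based on the following assumptions:
The invoices of ID 27459 will be split into two 30-day groups: 2020-06-26 to 2020-07-25 and 2020-07-30 to 2020-08-02.
ID 48002 will also have two 30-day groups: 2020-05-13, and 2020-06-20 to 2020-06-28.
Average_Delay is recorded on the last date of each ID's 30-day period.
Average_Delay is calculated as the sum of Delay divided by the number of invoices in the 30-day period.
The expected output should look roughly like this: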
Answer 0 (score: 0)
You can compute Average_Delay with .groupby and .resample, as follows:
df.groupby("ID").get_group("27459").resample("30D", on="Invoice_Date")["Delay"].mean()
which yields
Invoice_Date
2020-06-26     0.6
2020-07-26    10.5
However, I don't know how to put the results back into the right rows. Maybe someone else has an idea.
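One possible way to place the averages back on the rows, sketched under stated assumptions: each ID's 30-day windows are anchored at that ID's first invoice date (which reproduces the groups listed in the question), and "the last date of the 30-day period" is read as the last invoice row inside each window. The helper column Window and the mask is_last are names introduced here for illustration only.

```python
import pandas as pd

# Minimal copy of the question's data (only the columns needed here).
df = pd.DataFrame({
    "ID": ["27459"] * 7 + ["48002"] * 3,
    "Invoice_Date": pd.to_datetime([
        "2020-06-26", "2020-06-29", "2020-06-30", "2020-07-14", "2020-07-25",
        "2020-07-30", "2020-08-02", "2020-05-13", "2020-06-20", "2020-06-28",
    ]),
    "Delay": [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

# Make sure rows are in date order within each ID (assumption: the last row
# of a window is its latest invoice).
df = df.sort_values(["ID", "Invoice_Date"])

# Number the 30-day windows, counted from each ID's first invoice date.
first_invoice = df.groupby("ID")["Invoice_Date"].transform("min")
df["Window"] = (df["Invoice_Date"] - first_invoice).dt.days // 30

# Mean delay of each (ID, Window) group, broadcast back to every row.
window_mean = df.groupby(["ID", "Window"])["Delay"].transform("mean")

# Keep the mean only on the last invoice row of each window; NaN elsewhere.
is_last = ~df.duplicated(subset=["ID", "Window"], keep="last")
df["Average_Delay"] = window_mean.where(is_last)

print(df)
```

Because transform returns a result aligned with the original index, no matching on dates is needed; the assignment stays row-for-row with df.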
Answer 1 (score: 0)
You can build on Andre S.'s answer:
delays = df.groupby("ID").resample("30D", on="Invoice_Date")["Delay"].mean()
and put it into df with the following:
import numpy as np
df['Average_Delay'] = np.nan
# copy each bin's mean onto the rows whose Invoice_Date equals the bin label
for (cust_id, invoice_date), avg in delays.items():
    df.loc[(df['ID'] == cust_id) & (df['Invoice_Date'] == invoice_date), 'Average_Delay'] = avg
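However, I am concerned that some of the resulting dates may not match any Invoice_Date. You could instead do this monthly with a resampling frequency of "1M". Another approach would be to use ID and Invoice_Date together as the index, but I have not covered it because it changes the structure of df.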
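The concern about bin labels not matching Invoice_Date can be checked directly. The following is a rough sketch, assuming 30-day bins resampled per ID; the column has_matching_invoice is a name made up for this illustration.

```python
import pandas as pd

# Minimal copy of the question's data (only the columns needed here).
df = pd.DataFrame({
    "ID": ["27459"] * 7 + ["48002"] * 3,
    "Invoice_Date": pd.to_datetime([
        "2020-06-26", "2020-06-29", "2020-06-30", "2020-07-14", "2020-07-25",
        "2020-07-30", "2020-08-02", "2020-05-13", "2020-06-20", "2020-06-28",
    ]),
    "Delay": [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

# Per-ID 30-day bins; each bin is labelled with its start date.
delays = df.groupby("ID").resample("30D", on="Invoice_Date")["Delay"].mean()

# Flag the bins whose label coincides with an actual invoice date of that ID.
invoice_keys = set(zip(df["ID"], df["Invoice_Date"]))
bins = delays.reset_index()
bins["has_matching_invoice"] = [
    (cust_id, label) in invoice_keys
    for cust_id, label in zip(bins["ID"], bins["Invoice_Date"])
]

print(bins)
```

Any bin flagged False has no row with an equal Invoice_Date, so an equality-based assignment would silently skip it.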