Here is my sample data:
Customer Document Date Clearing Date Invoice_Amount
0 A 09/13/2016 11/04/2016 2,007,324
1 A 04/18/2016 07/11/2016 631,714
2 A 09/13/2016 09/16/2016 4,000,000
3 A 07/11/2017 09/23/2017 5,000,000
4 A 05/03/2016 06/17/2016 2,000,000
--- --- --- --- ---
1158 H 04/21/2017 06/28/2017 3,000,000
1159 H 04/25/2017 05/19/2017 1,000,000
1160 H 11/03/2017 12/11/2017 4,500,000
1161 H 03/15/2018 05/27/2018 3,500,000
1162 H 02/21/2018 05/03/2018 1,500,000
I want to create a new variable, No_Paid (a new column added after Invoice_Amount), that counts the number of the customer's paid invoices prior to the Document Date of that customer's new invoice.
The expected output is as follows...
Customer Document Date Clearing Date Invoice_Amount No_Paid*
0 A 09/13/2016 11/04/2016 2,007,324 8
1 A 04/18/2016 07/11/2016 631,714 1
2 A 09/13/2016 09/16/2016 4,000,000 8
3 A 07/11/2017 09/23/2017 5,000,000 6
4 A 05/03/2016 06/17/2016 2,000,000 1
--- --- --- --- --- ---
1158 H 04/21/2017 06/28/2017 3,000,000 5
1159 H 04/25/2017 05/19/2017 1,000,000 3
1160 H 11/03/2017 12/11/2017 4,500,000 7
1161 H 03/15/2018 05/27/2018 3,500,000 37
1162 H 02/21/2018 05/03/2018 1,500,000 37
Currently I use a for loop to produce the expected output:
import pandas as pd

df = pd.read_csv(r'E:\data.csv')
df['Document Date'] = pd.to_datetime(df['Document Date'], format="%m/%d/%Y")
df['Clearing Date'] = pd.to_datetime(df['Clearing Date'], format="%m/%d/%Y")

df["No_Paid"] = ""
for i in df.index:
    customer = df.loc[i, "Customer"]
    doc_date = df.loc[i, "Document Date"]
    six_month = doc_date - pd.Timedelta(days=180)  # start of the 6-month look-back window
    # count this customer's invoices that were already cleared before the current
    # document date and whose own document date falls within the 6-month window
    df.loc[i, "No_Paid"] = df.loc[
        (df["Customer"] == customer)
        & (df["Clearing Date"] < doc_date)
        & (df["Document Date"] >= six_month),
        "Invoice_Amount",
    ].count()
In my real data there are more than 100,000 invoices, so this takes much longer. I tried df.apply... but could not reproduce the same output...
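For reference, an apply-based version of the same row-wise logic could look roughly like this, continuing from the snippet above (a sketch assuming the column names shown there; note it still scans the whole frame once per row, so it is not inherently faster than the loop):

def count_paid(row, data):
    # start of the 6-month look-back window for this invoice
    six_month = row["Document Date"] - pd.Timedelta(days=180)
    # same-customer invoices cleared before this document date, issued within the window
    mask = (
        (data["Customer"] == row["Customer"])
        & (data["Clearing Date"] < row["Document Date"])
        & (data["Document Date"] >= six_month)
    )
    return mask.sum()

df["No_Paid"] = df.apply(count_paid, axis=1, args=(df,))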
Answer 0: (score: 0)
Taking your example:
import pandas as pd

# read in the csv (save as csv, or read an Excel file with pd.read_excel)
df = pd.read_csv('file.csv')

# convert to datetime just in case
df['Doc_Date'] = pd.to_datetime(df['Doc_Date'])
df['Exp_Date'] = pd.to_datetime(df['Exp_Date'])
df['Overdue'] = df['Doc_Date'] - df['Exp_Date']

# 180 days for 6 months
df['6M_Age'] = df['Doc_Date'] - pd.Timedelta(days=180)

# Hard to tell what the line in the middle of the data means;
# you can group by two columns if you need to.
# cumsum needs a specific column to sum; 'Invoice_Amount' here is taken from the question's data
df['Sum_of_paid'] = df.groupby('ID')['Invoice_Amount'].cumsum()
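If the goal is the No_Paid count itself rather than a running sum, a per-customer vectorized count might look roughly like this (a sketch assuming the question's column names and a df whose date columns are already parsed; the pairwise comparison is quadratic within each customer group, but it avoids the row-by-row Python loop):

import numpy as np
import pandas as pd

def no_paid(group):
    doc = group["Document Date"].values
    clr = group["Clearing Date"].values
    window_start = doc - np.timedelta64(180, "D")
    # rows: invoices being scored; columns: candidate invoices of the same customer
    paid_before = clr[None, :] < doc[:, None]           # already cleared before this document date
    in_window = doc[None, :] >= window_start[:, None]   # issued within the last 180 days
    return pd.Series((paid_before & in_window).sum(axis=1), index=group.index)

df["No_Paid"] = (
    df.groupby("Customer", group_keys=False)[["Document Date", "Clearing Date"]]
      .apply(no_paid)
)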