Python Pandas有多个条件的总和

时间:2019-01-09 05:14:15

标签: python pandas

以下是我的示例数据:

        Customer   Document Date   Clearing Date   Invoice_Amount
0       A          09/13/2016      11/04/2016      2,007,324
1       A          04/18/2016      07/11/2016      631,714
2       A          09/13/2016      09/16/2016      4,000,000
3       A          07/11/2017      09/23/2017      5,000,000
4       A          05/03/2016      06/17/2016      2,000,000
---     ---        ---             ---             ---
1158    H          04/21/2017      06/28/2017      3,000,000
1159    H          04/25/2017      05/19/2017      1,000,000
1160    H          11/03/2017      12/11/2017      4,500,000
1161    H          03/15/2018      05/27/2018      3,500,000
1162    H          02/21/2018      05/03/2018      1,500,000

我想创建一个新变量(在Invoice_Amount之后添加新列) No_Paid ,该变量计算“客户新发票的文档日期之前的已支付发票数。”

预期输出如下...

        Customer   Document Date   Clearing Date   Invoice_Amount No_Paid*
0       A          09/13/2016      11/04/2016      2,007,324          8 
1       A          04/18/2016      07/11/2016      631,714            1
2       A          09/13/2016      09/16/2016      4,000,000          8
3       A          07/11/2017      09/23/2017      5,000,000          6
4       A          05/03/2016      06/17/2016      2,000,000          1
---     ---        ---             ---             ---              ---
1158    H          04/21/2017      06/28/2017      3,000,000          5 
1159    H          04/25/2017      05/19/2017      1,000,000          3
1160    H          11/03/2017      12/11/2017      4,500,000          7
1161    H          03/15/2018      05/27/2018      3,500,000         37
1162    H          02/21/2018      05/03/2018      1,500,000         37

当前,我使用for循环来实现预期的输出

import pandas as pd
df = pd.read_csv('E:\data.csv')
df['Document Date'] = pd.to_datetime(df['Document Date'],format="%m/%d/%Y")
df['Clearing Date'] = pd.to_datetime(df['Clearing Date'],format="%m/%d/%Y")
df["No_Paid"] = ""
for i in df.index: 
     Vendor= df.loc[i,"Vendor"]
     Doc_Date= df.loc[i,"Document Date"]
     Six_Month = Doc_Date - pd.Timedelta(days=180)
     df.loc[i,"No_Paid"] = df.loc[(df["Vendor"] == Vendor) & (df["Clearing Date"] < Doc_Date) & (df["Document Date"] >= Six_Month),"Invoice_Amount"].count()

在实际情况下,我有100,000多张发票数据,这需要更长的时间 我尝试使用df.apply ...但是无法达到相同的输出...

1 个答案:

答案 0 :(得分:0)

以您的示例为例:

import pandas as pd
# read in csv (save as csv or read in using pd.read_excel)
df = pd.read_csv('file.csv')
# to datetime just in case
df['Doc_Date'] = pd.to_datetime(df['Doc_Date'])
df['Exp_Date'] = pd.to_datetime(df['Exp_Date'])
df['Overdue'] = df['Doc_Date'] - df['Exp_Date']
# 180 days for 6 months
df['6M_Age'] = df['Doc_Date'] - pd.Timedelta(days=180)
# Hard to tell what the line in the middle of the data means
# you can group by two columns if you need too
df['Sum_of_paid'] = df.groupby('ID').cumsum()