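I am trying to implement the following paper: Melville invoice to cash
Page 3 of the paper lists the features that are used.
The dataset has one entry per invoice, and each entry contains the following fields: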
creation_date payment_date customer_id
2016-01-01 2016-01-03 0
2016-01-02 2016-01-02 1
2016-01-02 2016-01-02 1
2016-01-04 2016-01-05 0
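Now, for each invoice, I need to count how many invoices that customer had already paid before the creation date of the current invoice. So the result would be: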
creation_date payment_date customer_id no_invoice_paid
2016-01-01 2016-01-03 0 0
2016-01-02 2016-01-02 1 0
2016-01-02 2016-01-02 1 0
2016-01-04 2016-01-05 0 1
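To make the example reproducible, here is a minimal sketch that builds the sample table as a DataFrame. The variable names `data` and `expected` are only illustrative, and parsing the dates with `pd.to_datetime` is an assumption (the question does not say how the columns are stored).

```python
import pandas as pd

# Sample invoices copied from the table above. Dates are parsed so that
# comparisons are chronological rather than string-based.
data = pd.DataFrame({
    "creation_date": pd.to_datetime(
        ["2016-01-01", "2016-01-02", "2016-01-02", "2016-01-04"]),
    "payment_date": pd.to_datetime(
        ["2016-01-03", "2016-01-02", "2016-01-02", "2016-01-05"]),
    "customer_id": [0, 1, 1, 0],
})

# Expected values of no_invoice_paid, taken from the result table above.
expected = [0, 0, 0, 1]
```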
I came up with a naive solution:
data_customer = data.groupby(by='customer_id')
final_df = pd.DataFrame()
for group, group_data in data_customer:
    group_data = group_data.assign(no_invoice_before=count_paid_invoice)
    final_df = final_df.append(group_data)
The count_paid_invoice function is as follows:
def count_paid_invoice(group_data):
    for index, row in group_data.iterrows():
        group_data.iloc[index, 13] = group_data[(group_data['creation_date'] < row['creation_date']) & (group_data['payment_date'] < row['creation_date'])].shape[0]
    return group_data.iloc[:, 13]
But this is very slow. Is there a more efficient way to do this?
Answer 0: (score: 0)
You can use a combination of cumsum() and pd.concat.
Try the following:
data['is_invoice_paid'] = 1 # Creating a dummy variable
count_invoice = data.groupby('customer_id')['is_invoice_paid'].cumsum()
count_invoice.name = 'no_invoice_paid'
final_df = pd.concat([data,count_invoice],axis=1)
final_df['no_invoice_paid'] = final_df['no_invoice_paid'] - 1 # to set the count correct
final_df = final_df.drop('is_invoice_paid',axis=1)
I am assuming two things here:
Answer 1: (score: 0)
Assuming your DataFrame is called df, this should give you the desired result.
df['no_invoice_paid'] = df.apply(lambda row:
    df[(df.customer_id == row['customer_id']) &
       (df.payment_date < row['creation_date'])].shape[0], axis=1)
Or even shorter:
df['no_invoice_paid'] = df.apply(lambda row:
    ((df.customer_id == row['customer_id']) &
     (df.payment_date < row['creation_date'])).sum(), axis=1)
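As a quick check, here is the same row-wise count as a self-contained sketch run on the sample data from the question. The helper name `paid_before` is hypothetical, and the dates are assumed to be parsed as datetimes. Note that each row scans the whole frame, so the cost grows quadratically with the number of invoices.

```python
import pandas as pd

# Sample data from the question, with dates parsed as datetimes (assumption).
df = pd.DataFrame({
    "creation_date": pd.to_datetime(
        ["2016-01-01", "2016-01-02", "2016-01-02", "2016-01-04"]),
    "payment_date": pd.to_datetime(
        ["2016-01-03", "2016-01-02", "2016-01-02", "2016-01-05"]),
    "customer_id": [0, 1, 1, 0],
})

def paid_before(row):
    """Count invoices of the same customer paid strictly before this row's creation date."""
    same_customer = df["customer_id"] == row["customer_id"]
    paid_earlier = df["payment_date"] < row["creation_date"]
    return int((same_customer & paid_earlier).sum())

df["no_invoice_paid"] = df.apply(paid_before, axis=1)
print(df)
```

On the four sample invoices this reproduces the expected column from the question (0, 0, 0, 1).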
Another approach (similar to Spandan's):
from functools import lru_cache

df = df.sort_values(['customer_id', 'payment_date'])
# Cumulative number of payments per customer, indexed by (customer_id, payment_date)
payment_lookup = df.groupby(['customer_id', 'payment_date']).count().groupby(level=0).cumsum()

@lru_cache(maxsize=1024)
def get_customer_payments(customer_id):
    return payment_lookup.loc[customer_id]

@lru_cache(maxsize=1024)
def lookup_payments(customer_id, payment_date):
    customer_payments = get_customer_payments(customer_id)
    # Keep only the payments made strictly before the given date
    payments_before_current = customer_payments[customer_payments.index < payment_date]
    try:
        return payments_before_current.values[-1][0]
    except IndexError:
        return 0

df['no_invoice_paid'] = df.apply(lambda row:
    lookup_payments(row['customer_id'], row['creation_date']), axis=1)
I have not tested the performance. Let me know if it works.
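A different technique from the ones above, sketched here as one possible way to avoid the row-wise `apply`: sort each customer's payment dates once and use a binary search (`np.searchsorted`) to count how many fall strictly before each creation date. The function name `count_paid_before` is hypothetical, and the sketch assumes the dates are datetimes, the DataFrame index is unique, and unpaid invoices (if any) have a missing `payment_date`.

```python
import numpy as np
import pandas as pd

def count_paid_before(df):
    """For each invoice, count same-customer invoices with payment_date < creation_date."""
    counts = pd.Series(0, index=df.index, dtype="int64")
    for _, group in df.groupby("customer_id"):
        # Sorted payment dates of this customer; missing dates (unpaid) are dropped.
        paid = np.sort(group["payment_date"].dropna().to_numpy())
        created = group["creation_date"].to_numpy()
        # side="left" returns the number of payments strictly earlier than each date.
        counts.loc[group.index] = np.searchsorted(paid, created, side="left")
    return counts

# Sample data from the question, with dates parsed as datetimes (assumption).
df = pd.DataFrame({
    "creation_date": pd.to_datetime(
        ["2016-01-01", "2016-01-02", "2016-01-02", "2016-01-04"]),
    "payment_date": pd.to_datetime(
        ["2016-01-03", "2016-01-02", "2016-01-02", "2016-01-05"]),
    "customer_id": [0, 1, 1, 0],
})
df["no_invoice_paid"] = count_paid_before(df)
print(df)
```

Each customer's work is a sort followed by binary searches, so this should scale better than comparing every row against the whole frame, though it has not been benchmarked here.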