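In this problem I have two data frames, and I want to add a column to loan_df that aggregates across recharge_df. For each loan, I want the borrower's mean recharge amount over the period before the loan date (here, the 90 days prior). I then add this new column to loan_df. My code below works, but it is very slow. Any ideas on how to make it much more efficient?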
import datetime
import pandas as pd

def mean_rec_func(msisdn, date, advance_id, window, name):
    """Returns mean recharges within a specified number of days prior to loan being taken

    Keyword Arguments:
    msisdn -- APF_MSISDN for loan (this is like customer ID)
    date -- APF_DATE on which loan taken
    advance_id -- APF_ADVANCE_ID for loan
    window -- number of days to look back (int)
    name -- name of the newly computed stat
    """
    # Filter recharge_df (a global) to this customer's recharges in [date - window, date)
    mean_rec = recharge_df.loc[(recharge_df['APF_MSISDN'] == msisdn) &
                               (recharge_df['APF_DATE'] < date) &
                               (recharge_df['APF_DATE'] >= date - datetime.timedelta(days=window))
                               ]['APF_AMOUNT'].mean()
    return pd.Series([advance_id, msisdn, mean_rec],
                     index=['APF_ADVANCE_ID', 'APF_MSISDN', name])

# Mean recharge over last 90 days
mean_recharge_90 = loan_df.apply(lambda row: mean_rec_func(row['APF_MSISDN'], row['APF_DATE'],
                                                           row['APF_ADVANCE_ID'],
                                                           window=90,
                                                           name="MEAN_RECHARGE_90"), axis=1)
Answer 0 (score: 0)
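Consider an SQL solution, since your logic translates to the following query using a correlated aggregate subquery. Admittedly this is also an expensive type of query, because the aggregate runs once for every row of the outer query, much like the pandas apply loop.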
SELECT l.*,
       (SELECT AVG([APF_AMOUNT]) FROM recharge_df r
        WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
          AND r.[APF_DATE] < l.[APF_DATE]
          AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
FROM loan_df l
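To see what this query returns, here is a minimal sketch that runs it on a few made-up rows with Python's built-in sqlite3 module. The sample values and the index name idx_recharge are my own assumptions, not part of the question. The index on (APF_MSISDN, APF_DATE) is a guess at what might help: it should typically let SQLite resolve each subquery with an index lookup instead of a full table scan, but I have not measured it on your data.

```python
import sqlite3

# Toy rows, invented for illustration only (not from the question)
recharges = [
    ("A", "2020-01-10", 10.0),
    ("A", "2020-03-01", 30.0),
    ("A", "2019-11-01", 99.0),  # more than 90 days before A's loan
    ("B", "2020-03-15", 50.0),  # same day as B's loan
]
loans = [
    (1, "A", "2020-03-20"),
    (2, "B", "2020-03-15"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE recharge_df (APF_MSISDN TEXT, APF_DATE TEXT, APF_AMOUNT REAL)")
con.execute("CREATE TABLE loan_df (APF_ADVANCE_ID INTEGER, APF_MSISDN TEXT, APF_DATE TEXT)")
con.executemany("INSERT INTO recharge_df VALUES (?, ?, ?)", recharges)
con.executemany("INSERT INTO loan_df VALUES (?, ?, ?)", loans)

# Assumption: index the columns the subquery filters on (hypothetical name)
con.execute("CREATE INDEX idx_recharge ON recharge_df (APF_MSISDN, APF_DATE)")

sql = """SELECT l.*,
                (SELECT AVG([APF_AMOUNT]) FROM recharge_df r
                 WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
                   AND r.[APF_DATE] < l.[APF_DATE]
                   AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
         FROM loan_df l"""

rows = con.execute(sql).fetchall()
con.close()

# Map each loan id to its computed mean (last column of each row)
result = {row[0]: row[-1] for row in rows}
print(result)
```

Loans with no recharge in the window come back as NULL (None in Python), which mirrors the NaN that .mean() gives on an empty selection in the pandas version. Note also that the upper bound is a strict "<", so a recharge made on the loan date itself is not counted.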
In pandas, you can use the pandasql module, which runs an in-memory instance of SQLite:
from pandasql import sqldf

# Helper that resolves table names against DataFrames in the global namespace
pysqldf = lambda q: sqldf(q, globals())

sql = """SELECT l.*,
                (SELECT AVG([APF_AMOUNT]) FROM recharge_df r
                 WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
                   AND r.[APF_DATE] < l.[APF_DATE]
                   AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
         FROM loan_df l"""

output_df = pysqldf(sql)
Below is an expanded version of what pandasql runs under the hood, using SQLAlchemy together with the pandas import/export calls read_sql and to_sql.
import pandas as pd
from sqlalchemy import create_engine

# IN-MEMORY DATABASE (NO PATH SPECIFIED)
engine = create_engine('sqlite://')

# EXPORT DATAFRAMES
recharge_df.to_sql("recharge_tbl", con=engine, if_exists='replace')
loan_df.to_sql("loan_tbl", con=engine, if_exists='replace')

sql = """SELECT l.*,
                (SELECT AVG([APF_AMOUNT]) FROM recharge_tbl r
                 WHERE r.[APF_DATE] >= date(l.[APF_DATE], '-90 day')
                   AND r.[APF_DATE] < l.[APF_DATE]
                   AND r.[APF_MSISDN] = l.[APF_MSISDN]) AS mean_recharge_90
         FROM loan_tbl l"""

# IMPORT QUERY RESULT
output_df = pd.read_sql(sql, engine)

# IN-MEMORY DATABASE DESTROYED
engine.dispose()
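If you would rather stay in pandas, the same window logic can be written as a join followed by a filter and a group-by, which avoids calling a Python function once per loan. This is only a sketch under some assumptions of mine: mean_recharge_window is a hypothetical helper name, APF_ADVANCE_ID is taken to be unique per loan, and APF_DATE is taken to be a datetime64 column in both frames. The intermediate frame pairs every loan with every recharge by the same customer, so it can get large, and I have not benchmarked it against the SQL versions.

```python
import pandas as pd

def mean_recharge_window(loan_df, recharge_df, window=90, name="MEAN_RECHARGE_90"):
    """Hypothetical helper: join-then-filter equivalent of the correlated subquery."""
    # Pair each loan with every recharge made by the same customer
    pairs = loan_df[["APF_ADVANCE_ID", "APF_MSISDN", "APF_DATE"]].merge(
        recharge_df[["APF_MSISDN", "APF_DATE", "APF_AMOUNT"]],
        on="APF_MSISDN",
        how="left",
        suffixes=("", "_RECHARGE"),
    )

    # Keep recharges dated in [loan date - window, loan date)
    start = pairs["APF_DATE"] - pd.Timedelta(days=window)
    in_window = (pairs["APF_DATE_RECHARGE"] >= start) & (
        pairs["APF_DATE_RECHARGE"] < pairs["APF_DATE"]
    )

    # One mean per loan (assumes APF_ADVANCE_ID is unique per loan)
    means = (
        pairs[in_window]
        .groupby("APF_ADVANCE_ID")["APF_AMOUNT"]
        .mean()
        .rename(name)
    )

    # Left join back so loans with no recharge in the window get NaN
    return loan_df.join(means, on="APF_ADVANCE_ID")
```

It would be called as loan_df = mean_recharge_window(loan_df, recharge_df, window=90), which returns loan_df with the new column attached rather than a separate frame.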