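I need to create new columns from existing data held in several different pandas DataFrames. All of the DataFrames come from reading CSV files.
I'm new to this, so I'm aiming for something that works first and then improving it.
I wrote two functions like this: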
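def create_bureau_data_cols(row):
    # `bureau` is the DataFrame read from CSV; keep only this applicant's credit records.
    bureau_data = bureau.loc[bureau['SK_ID_CURR'] == row['SK_ID_CURR']]
    # Count rows with len(); DataFrame.count() would return per-column non-null counts.
    active_credits = len(bureau_data.loc[bureau_data['CREDIT_ACTIVE'] == 'Active'])
    # Sums AMT_CREDIT_SUM over all of this applicant's credits, active or not.
    total_active_credits = bureau_data['AMT_CREDIT_SUM'].sum()
    # Build the mask from bureau_data (not bureau) so it matches the filtered rows.
    overdue_loans = len(bureau_data.loc[bureau_data['CREDIT_DAY_OVERDUE'] != 0])
    return active_credits, total_active_credits, overdue_loans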
def create_prev_app_data_cols(row):
    # `previous_application` is the DataFrame read from CSV; keep only this applicant's rows.
    prev_app_data = previous_application.loc[previous_application['SK_ID_CURR'] == row['SK_ID_CURR']]
    # Number of previous applications (a row count, not the per-column DataFrame.count()).
    no_prev_apps = len(prev_app_data)
    status = prev_app_data['NAME_CONTRACT_STATUS']
    # The mean of a boolean mask is the share of matching rows (NaN when there are no rows).
    perc_approved = (status == 'Approved').mean()
    perc_refused = (status == 'Refused').mean()
    perc_canceled = (status == 'Canceled').mean()
    perc_unused = (status == 'Unused offer').mean()
    # Order matches the columns assigned from this function's result.
    return no_prev_apps, perc_approved, perc_refused, perc_canceled, perc_unused
I then added these calls, which seem to run through each row of the data to create the values for the new columns.
# result_type='expand' turns each returned tuple into separate columns, one per value.
application_train[['NO_ACTIVE_CREDITS', 'TOTAL_ACTIVE_CREDITS', 'NO_CREDITS_OVERDUE']] = application_train.apply(create_bureau_data_cols, axis=1, result_type='expand')
application_train[['NO_PREV_APPLICATIONS', 'PERCENT_APPROVED', 'PERCENT_REFUSED', 'PERCENT_CANCELLED', 'PERCENT_UNUSED']] = application_train.apply(create_prev_app_data_cols, axis=1, result_type='expand')
This takes hours to complete. Any suggestions for a more efficient way to generate the new column data?
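
For reference, the row-by-row calls re-filter the whole lookup table once per row of application_train, which is probably where the time goes. One direction that may be much faster is to aggregate each lookup table once per SK_ID_CURR with groupby and then left-join the result. The following is only a minimal sketch under assumptions: the function names add_bureau_features and add_prev_app_features are hypothetical, the column names are the ones used above, and applicants with no matching rows get a count of 0 and NaN shares.

```python
import pandas as pd


def add_bureau_features(application_train, bureau):
    """Aggregate bureau once per SK_ID_CURR, then left-join onto application_train."""
    flags = pd.DataFrame({
        "SK_ID_CURR": bureau["SK_ID_CURR"],
        # 1 for an active credit, 0 otherwise, so the group sum is a count.
        "NO_ACTIVE_CREDITS": bureau["CREDIT_ACTIVE"].eq("Active").astype(int),
        # Summed over all credits of the applicant, as in the row-wise version.
        "TOTAL_ACTIVE_CREDITS": bureau["AMT_CREDIT_SUM"],
        # 1 when CREDIT_DAY_OVERDUE is non-zero.
        "NO_CREDITS_OVERDUE": bureau["CREDIT_DAY_OVERDUE"].ne(0).astype(int),
    })
    agg = flags.groupby("SK_ID_CURR").sum().reset_index()
    out = application_train.merge(agg, on="SK_ID_CURR", how="left")
    # Applicants without any bureau rows get zeros instead of NaN.
    new_cols = [c for c in agg.columns if c != "SK_ID_CURR"]
    out[new_cols] = out[new_cols].fillna(0)
    return out


def add_prev_app_features(application_train, previous_application):
    """Compute the number of previous applications and the share per status."""
    status = previous_application["NAME_CONTRACT_STATUS"]
    flags = pd.DataFrame({
        "SK_ID_CURR": previous_application["SK_ID_CURR"],
        "NO_PREV_APPLICATIONS": 1,
        # The group mean of a 0/1 flag is the share of rows with that status.
        "PERCENT_APPROVED": status.eq("Approved").astype(float),
        "PERCENT_REFUSED": status.eq("Refused").astype(float),
        "PERCENT_CANCELLED": status.eq("Canceled").astype(float),
        "PERCENT_UNUSED": status.eq("Unused offer").astype(float),
    })
    grouped = flags.groupby("SK_ID_CURR")
    agg = grouped.mean()
    # Replace the mean of the helper column of ones with the actual row count.
    agg["NO_PREV_APPLICATIONS"] = grouped["NO_PREV_APPLICATIONS"].sum()
    out = application_train.merge(agg.reset_index(), on="SK_ID_CURR", how="left")
    # No previous applications: the count is 0 and the shares stay NaN (undefined).
    out["NO_PREV_APPLICATIONS"] = out["NO_PREV_APPLICATIONS"].fillna(0).astype(int)
    return out
```

Usage would look like application_train = add_bureau_features(application_train, bureau), followed by the same pattern for previous_application. Each lookup table is scanned once instead of once per applicant, though I have not benchmarked this on the real data.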