I have a DataFrame with the following structure:
Debtor ID | AccountRating | AccountRatingDate | AmountOutstanding | AmountPastDue
John Snow | Closed | 2017-03-01 | 0 | 0
John Snow | Delayed | 2017-04-22 | 2000 | 500
John Snow | Closed | 2017-05-23 | 0 | 0
John Snow | Delayed | 2017-07-15 | 6000 | 300
Sarah Parker | Closed | 2017-02-01 | 0 | 0
Edward Hall | Closed | 2017-05-01 | 0 | 0
Douglas Core | Delayed | 2017-01-01 | 1000 | 200
Douglas Core | Delayed | 2017-06-01 | 1000 | 400
What I want to achieve is:
Debtor ID | Incidents of delay | TheMostRecentOutstanding | TheMostRecentPastDue
John Snow | 2 | 6000 | 300
Sarah Parker | 0 | 0 | 0
Edward Hall | 0 | 0 | 0
Douglas Core | 2 | 1000 | 400
Counting the incidents of delay is straightforward:
df_account["pastDuebool"] = df_account['AmountPastDue'] > 0
new_df = pd.DataFrame(index = df_account.groupby("Debtor ID").groups.keys())
new_df['Incidents of delay'] = df_account.groupby("Debtor ID")["pastDuebool"].sum()
I have been struggling to extract the most recent amounts outstanding and past due. My code looks like this:
new_df["TheMostRecentOutstanding"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountOutstanding"]
new_df["TheMostRecentPastDue"] = df_account.loc[df_account[df_account["AccountRating"]=='Delayed'].groupby('Debtor ID')["AccountRatingDate"].idxmax(),"AmountPastDue"]
But both return a Series containing only NaN values. Please help: what am I doing wrong here?
Answer 0: (score: 4)
You can try the following:
df.sort_values('AccountRatingDate')\
.query('AccountRating == "Delayed"')\
.groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
.agg({'AccountRating':'count',
'AmountOutstanding':'last',
'AmountPastDue':'last'})\
.reindex(df['Debtor ID'].unique(), fill_value=0)\
.reset_index()
Output:
Debtor ID AccountRating AmountOutstanding AmountPastDue
0 John Snow 2 6000 300
1 Sarah Parker 0 0 0
2 Edward Hall 0 0 0
3 Douglas Core 2 1000 400
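To make the chain above easy to run and check, here is a minimal self-contained sketch that builds the question's sample data in memory and applies the same steps. The variable name result is an assumption for illustration, and the backslash continuations are replaced by parentheses purely as a style choice.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Debtor ID': ['John Snow'] * 4
                 + ['Sarah Parker', 'Edward Hall', 'Douglas Core', 'Douglas Core'],
    'AccountRating': ['Closed', 'Delayed', 'Closed', 'Delayed',
                      'Closed', 'Closed', 'Delayed', 'Delayed'],
    'AccountRatingDate': pd.to_datetime([
        '2017-03-01', '2017-04-22', '2017-05-23', '2017-07-15',
        '2017-02-01', '2017-05-01', '2017-01-01', '2017-06-01']),
    'AmountOutstanding': [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    'AmountPastDue': [0, 500, 0, 300, 0, 0, 200, 400],
})

result = (
    df.sort_values('AccountRatingDate')           # oldest first, so 'last' is the most recent
      .query('AccountRating == "Delayed"')        # keep only the delayed records
      .groupby('Debtor ID')[['AccountRating', 'AmountOutstanding', 'AmountPastDue']]
      .agg({'AccountRating': 'count',             # number of delayed records per debtor
            'AmountOutstanding': 'last',
            'AmountPastDue': 'last'})
      .reindex(df['Debtor ID'].unique(), fill_value=0)  # restore debtors with no delays
      .reset_index()
)
print(result)
```

The reindex step matters because the query drops debtors who were never delayed; without it Sarah Parker and Edward Hall would be missing from the result.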
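Details: sort by date so that 'last' picks the most recent row, keep only the Delayed rows, aggregate per debtor (count the ratings, take the last amounts), then reindex over all debtors so that those with no delays get 0.
Also, you can use rename with a dictionary to rename the columns: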
df.sort_values('AccountRatingDate')\
.query('AccountRating == "Delayed"')\
.groupby('Debtor ID')[['AccountRating','AmountOutstanding','AmountPastDue']]\
.agg({'AccountRating':'count',
'AmountOutstanding':'last',
'AmountPastDue':'last'})\
.reindex(df['Debtor ID'].unique(), fill_value=0)\
.rename(columns={'AccountRating':'Incidents of delay',
'AmountOutstanding':'TheMostRecentOutstanding',
'AmountPastDue':'TheMostRecentPastDue'})\
.reset_index()
Output:
Debtor ID Incidents of delay TheMostRecentOutstanding TheMostRecentPastDue
0 John Snow 2 6000 300
1 Sarah Parker 0 0 0
2 Edward Hall 0 0 0
3 Douglas Core 2 1000 400
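As for why the attempt in the question yields only NaN: df_account.loc[..., "AmountOutstanding"] returns a Series labelled by the original row labels, while new_df is indexed by debtor name. Column assignment aligns on index labels, and here none of them match. The sketch below reproduces this on the question's sample data and shows one possible fix, which is to index the selected rows by Debtor ID before assigning. The names delayed, idx, picked, misaligned and latest are illustrative and not from the original post.

```python
import pandas as pd

# Sample data copied from the question
df_account = pd.DataFrame({
    'Debtor ID': ['John Snow'] * 4
                 + ['Sarah Parker', 'Edward Hall', 'Douglas Core', 'Douglas Core'],
    'AccountRating': ['Closed', 'Delayed', 'Closed', 'Delayed',
                      'Closed', 'Closed', 'Delayed', 'Delayed'],
    'AccountRatingDate': pd.to_datetime([
        '2017-03-01', '2017-04-22', '2017-05-23', '2017-07-15',
        '2017-02-01', '2017-05-01', '2017-01-01', '2017-06-01']),
    'AmountOutstanding': [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    'AmountPastDue': [0, 500, 0, 300, 0, 0, 200, 400],
})

# One row per debtor, indexed by debtor name
new_df = pd.DataFrame(index=df_account['Debtor ID'].unique())

delayed = df_account[df_account['AccountRating'] == 'Delayed']
# Row label of the most recent delayed record for each debtor
idx = delayed.groupby('Debtor ID')['AccountRatingDate'].idxmax()

# This Series is labelled by row labels, not by debtor name...
picked = df_account.loc[idx, 'AmountOutstanding']
misaligned = new_df.copy()
# ...so nothing lines up with new_df's index and every value becomes NaN
misaligned['TheMostRecentOutstanding'] = picked

# Possible fix: label the selected rows by debtor before assigning
latest = df_account.loc[idx].set_index('Debtor ID')
new_df['TheMostRecentOutstanding'] = latest['AmountOutstanding']
new_df['TheMostRecentPastDue'] = latest['AmountPastDue']
# Debtors with no delayed record have no match; treat them as 0
new_df = new_df.fillna(0).astype(int)
print(new_df)
```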
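Answer 1: (score: 2)
This sorts the values by date (so that the last value is the one kept), sums the boolean True values, and then drops the duplicates in the index. It then removes the columns you don't need and gives you new_df: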
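import pandas as pd

# 'fun.txt' holds the question's data; this answer assumes its ID column is named 'DebtorID'
df = pd.read_table('fun.txt')
df['pastDuebool'] = df['AmountPastDue'] > 0
# Sort by date so that the last row per debtor is the most recent one
df = df.set_index('DebtorID').sort_values('AccountRatingDate')
# Per-debtor count of past-due rows, aligned back onto the (repeated) index
df['Incidents of Delay'] = df.groupby('DebtorID')['pastDuebool'].sum()
# Keep only the most recent row for each debtor
df = df[~df.index.duplicated(keep='last')]
df = df.drop(['pastDuebool', 'AccountRatingDate', 'AccountRating'], axis=1)
new_df = df.rename(columns={'AmountOutstanding':'TheMostRecentOutstanding',
'AmountPastDue':'TheMostRecentPastDue'})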
print(new_df)
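For readers without the input file, the same idea can be sketched on in-memory data. This is a variant under stated assumptions rather than the answerer's exact code: it uses the question's column name 'Debtor ID', swaps in transform('sum') to broadcast the per-debtor count onto every row, and uses drop_duplicates in place of the index-based filter. Note that this approach keeps each debtor's latest row regardless of its rating, which happens to match the desired output for this sample.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Debtor ID': ['John Snow'] * 4
                 + ['Sarah Parker', 'Edward Hall', 'Douglas Core', 'Douglas Core'],
    'AccountRating': ['Closed', 'Delayed', 'Closed', 'Delayed',
                      'Closed', 'Closed', 'Delayed', 'Delayed'],
    'AccountRatingDate': pd.to_datetime([
        '2017-03-01', '2017-04-22', '2017-05-23', '2017-07-15',
        '2017-02-01', '2017-05-01', '2017-01-01', '2017-06-01']),
    'AmountOutstanding': [0, 2000, 0, 6000, 0, 0, 1000, 1000],
    'AmountPastDue': [0, 500, 0, 300, 0, 0, 200, 400],
})

# 1 where something is past due, 0 otherwise
df['pastDuebool'] = (df['AmountPastDue'] > 0).astype(int)
df = df.sort_values('AccountRatingDate')
# transform('sum') returns one value per row: the debtor's total count
df['Incidents of delay'] = df.groupby('Debtor ID')['pastDuebool'].transform('sum')

new_df = (
    df.drop_duplicates('Debtor ID', keep='last')   # most recent row per debtor
      .set_index('Debtor ID')
      .drop(columns=['pastDuebool', 'AccountRatingDate', 'AccountRating'])
      .rename(columns={'AmountOutstanding': 'TheMostRecentOutstanding',
                       'AmountPastDue': 'TheMostRecentPastDue'})
)
print(new_df)
```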