My pandas data is stored in the following format:
Cus No Purchase_date Branch_code Amount
111 6-Jun-18 AAA 100
111 6-Jun-18 AAA 50
111 8-Jun-18 BBB 125
111 8-Aug-18 CCC 130
111 12-Dec-18 BBB 200
111 15-Feb-17 AAA 10
111 18-Jan-18 AAA 20
222 6-Jun-18 DDD 100
222 6-Jun-18 AAA 50
222 8-Jun-18 AAA 125
222 8-Aug-18 DDD 130
222 12-Dec-18 AAA 200
222 15-Feb-17 CCC 10
222 18-Jan-18 CCC 20
Expected output format in pandas:
Cus_No Tot_Amount Tot_Freq Top_1_Branch Top1_Tot_Sum Top1_Tot_Freq Top1_Avg_mon_sum Top1_Avg_mon_freq Top_2_Branch Top2_Tot_Sum Top2_Tot_Freq Top2_Avg_mon_sum Top2_Avg_mon_freq
111 635 7 BBB 325 2 162.5 1 AAA 180 4 60 1.3
222 635 7 AAA 375 3 187.5 1.5 DDD 230 2 115 1
Explanation of the columns:
Group by customer number ("Cus No") and derive the following columns:
1. Tot_Amount: sum of "Amount" per "Cus No"
2. Tot_Freq: count of records per "Cus No"
3. Top_1_Branch: for each "Cus No", the "Branch_code" with the largest sum of "Amount". E.g. for "Cus No" 111, "Branch_code" BBB has the maximum sum of "Amount".
4. Top1_Tot_Sum: sum of "Amount", grouped by "Top_1_Branch" and that "Cus No"
5. Top1_Tot_Freq: count of records, grouped by "Top_1_Branch" and that "Cus No"
6. Top1_Avg_mon_sum: based on "Purchase_date", count the unique months in which that branch appears for the customer; Top1_Tot_Sum / number of unique months
7. Top1_Avg_mon_freq: based on "Purchase_date", count the unique months in which that branch appears for the customer; Top1_Tot_Freq / number of unique months
Similarly, get all of these columns for the top-2 branch code.
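As a sanity check on these definitions, the Top1 figures for customer 111 can be reproduced by hand. The snippet below is only a sketch: it rebuilds the sample rows for that one customer, assumes the column is named `Cus_No` and that dates follow the `d-Mon-yy` pattern shown above, and uses variable names invented for illustration.

```python
import pandas as pd

# Sample rows for customer 111 only, copied from the input table
data = pd.DataFrame({
    "Cus_No": [111] * 7,
    "Purchase_date": ["6-Jun-18", "6-Jun-18", "8-Jun-18", "8-Aug-18",
                      "12-Dec-18", "15-Feb-17", "18-Jan-18"],
    "Branch_code": ["AAA", "AAA", "BBB", "CCC", "BBB", "AAA", "AAA"],
    "Amount": [100, 50, 125, 130, 200, 10, 20],
})
data["Purchase_date"] = pd.to_datetime(data["Purchase_date"], format="%d-%b-%y")

tot_amount = data["Amount"].sum()  # Tot_Amount
tot_freq = len(data)               # Tot_Freq

# Sum of Amount per branch; the branch with the largest sum is Top_1_Branch
branch_sums = data.groupby("Branch_code")["Amount"].sum()
top1_branch = branch_sums.idxmax()
top1_rows = data[data["Branch_code"] == top1_branch]

top1_tot_sum = top1_rows["Amount"].sum()
top1_tot_freq = len(top1_rows)
# Number of distinct year-month labels in which the top branch appears
unique_months = top1_rows["Purchase_date"].dt.strftime("%Y-%m").nunique()
top1_avg_mon_sum = top1_tot_sum / unique_months
top1_avg_mon_freq = top1_tot_freq / unique_months

print(tot_amount, tot_freq, top1_branch, top1_tot_sum, top1_tot_freq,
      top1_avg_mon_sum, top1_avg_mon_freq)
```

The printed values line up with the expected row for customer 111 (branch BBB, two purchases spread over two distinct months).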
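Answer 0 (score: 0)
I'll get you started on the top-1 columns, and from there you should be able to work out how to do the top-2 columns yourself: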
import numpy as np
# Assumes the "Cus No" column has been renamed to "Cus_No"
# The first two columns only need to be grouped by customer number
grouped_df = data.groupby("Cus_No")
out_df = grouped_df.Amount.agg(Tot_Amount="sum")
out_df["Tot_Freq"] = grouped_df.Amount.count().values
# Assuming Purchase_date has a datetime dtype (e.g. via pd.to_datetime); needed later
data["month_year"] = data.Purchase_date.apply(lambda d: (d.month, d.year))
# Next we group by Cus_No and then Branch_code
branch_group = data.groupby(["Cus_No", "Branch_code"])
top_sums = branch_group.Amount.sum().groupby(level=0, group_keys=False).nlargest(1)
out_df["Top_1_Branch"] = top_sums.index.get_level_values(1).values
out_df["Top1_Tot_Sum"] = top_sums.values
# Now we have to retrieve the per-branch statistics for the top-1
# (Cus_No, Branch_code) pairs held in top_sums.index. A GroupBy object has no
# .loc, so aggregate first and then index the result. The only way I can think
# of doing this is iterative indexing
branch_sizes = branch_group.size()
out_df["Top1_Tot_Freq"] = [branch_sizes.loc[key] for key in top_sums.index]
branch_months = branch_group["month_year"].nunique()
months_per_top1 = np.array([branch_months.loc[key] for key in top_sums.index])
out_df["Top1_Avg_mon_sum"] = out_df.Top1_Tot_Sum / months_per_top1
out_df["Top1_Avg_mon_freq"] = out_df.Top1_Tot_Freq / months_per_top1
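The list comprehensions may not be the most performant code, but this should roughly do the job. Pay attention to the order in which you set values in out_df, since assigning raw values is positional. You may want to join on the customer number to make sure the right values end up in the right rows of out_df.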
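One possible way to make that alignment explicit is sketched below. It is not the code above, just an alternative under the same assumptions (columns named `Cus_No`, `Purchase_date`, `Branch_code`, `Amount`; dates in `d-Mon-yy` form). It computes the per-branch statistics once, picks the largest-sum row per customer with `idxmax`, and lets `join` align on the `Cus_No` index. The helper column `months` and the year-month string label are my own choices.

```python
import pandas as pd

# Sample data from the question
data = pd.DataFrame({
    "Cus_No": [111] * 7 + [222] * 7,
    "Purchase_date": ["6-Jun-18", "6-Jun-18", "8-Jun-18", "8-Aug-18",
                      "12-Dec-18", "15-Feb-17", "18-Jan-18"] * 2,
    "Branch_code": ["AAA", "AAA", "BBB", "CCC", "BBB", "AAA", "AAA",
                    "DDD", "AAA", "AAA", "DDD", "AAA", "CCC", "CCC"],
    "Amount": [100, 50, 125, 130, 200, 10, 20] * 2,
})
data["Purchase_date"] = pd.to_datetime(data["Purchase_date"], format="%d-%b-%y")
data["month_year"] = data["Purchase_date"].dt.strftime("%Y-%m")

# Per-customer totals, indexed by Cus_No
out_df = data.groupby("Cus_No")["Amount"].agg(Tot_Amount="sum", Tot_Freq="count")

# One row per (Cus_No, Branch_code) with everything the Top1 columns need
branch_stats = data.groupby(["Cus_No", "Branch_code"]).agg(
    Top1_Tot_Sum=("Amount", "sum"),
    Top1_Tot_Freq=("Amount", "count"),
    months=("month_year", "nunique"),
)

# Index label of the largest-sum row per customer, then select those rows
top1_idx = branch_stats.groupby(level="Cus_No")["Top1_Tot_Sum"].idxmax()
top1 = branch_stats.loc[list(top1_idx)].reset_index(level="Branch_code")
top1 = top1.rename(columns={"Branch_code": "Top_1_Branch"})
top1["Top1_Avg_mon_sum"] = top1["Top1_Tot_Sum"] / top1["months"]
top1["Top1_Avg_mon_freq"] = top1["Top1_Tot_Freq"] / top1["months"]

# join aligns on the Cus_No index, so row order no longer matters
out_df = out_df.join(top1.drop(columns="months"))
print(out_df)
```

Because every intermediate frame carries `Cus_No` in its index, shuffling the input rows should not change which values land in which row.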
Edit: a hint to get you started on the top-2 branches:
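# nlargest(2) keeps the two largest sums per customer, in descending order
top2_sums = branch_group.Amount.sum().groupby(level=0, group_keys=False).nlargest(2)
# The last row of each customer's pair is the second-largest branch
second_sums = top2_sums.groupby(level=0).tail(1)
out_df["Top_2_Branch"] = second_sums.index.get_level_values(1).values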
The rest is almost the same.
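If you would rather not repeat the steps for each rank, one option is to rank the branches within each customer and loop over the ranks. This is a rough sketch, not the approach above: the column names are assumed from the question, and `stats`, `rank`, and `months` are names made up for illustration. Ties are broken by row order (`method="first"`), which is an arbitrary choice.

```python
import pandas as pd

# Sample data from the question
data = pd.DataFrame({
    "Cus_No": [111] * 7 + [222] * 7,
    "Purchase_date": ["6-Jun-18", "6-Jun-18", "8-Jun-18", "8-Aug-18",
                      "12-Dec-18", "15-Feb-17", "18-Jan-18"] * 2,
    "Branch_code": ["AAA", "AAA", "BBB", "CCC", "BBB", "AAA", "AAA",
                    "DDD", "AAA", "AAA", "DDD", "AAA", "CCC", "CCC"],
    "Amount": [100, 50, 125, 130, 200, 10, 20] * 2,
})
data["Purchase_date"] = pd.to_datetime(data["Purchase_date"], format="%d-%b-%y")
data["month_year"] = data["Purchase_date"].dt.strftime("%Y-%m")

# Per-(customer, branch) statistics, one row per pair
stats = data.groupby(["Cus_No", "Branch_code"]).agg(
    Tot_Sum=("Amount", "sum"),
    Tot_Freq=("Amount", "count"),
    months=("month_year", "nunique"),
).reset_index()
stats["Avg_mon_sum"] = stats["Tot_Sum"] / stats["months"]
stats["Avg_mon_freq"] = stats["Tot_Freq"] / stats["months"]

# Rank branches within each customer by total amount (1 = largest)
stats["rank"] = (stats.groupby("Cus_No")["Tot_Sum"]
                 .rank(method="first", ascending=False).astype(int))

out_df = data.groupby("Cus_No")["Amount"].agg(Tot_Amount="sum", Tot_Freq="count")
for k in (1, 2):
    top_k = stats[stats["rank"] == k].set_index("Cus_No")
    top_k = top_k.drop(columns=["months", "rank"])
    top_k = top_k.rename(columns={"Branch_code": f"Top_{k}_Branch"})
    # Prefix the remaining statistic columns, e.g. Tot_Sum -> Top2_Tot_Sum
    top_k = top_k.rename(
        columns=lambda c: c if c.startswith("Top_") else f"Top{k}_{c}")
    out_df = out_df.join(top_k)  # aligned on the Cus_No index

print(out_df)
```

A customer with fewer than two branches would simply get missing values in the `Top2_*` columns, since the join finds no rank-2 row for them.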