Get the top N groups by value in a pandas DataFrame

Time: 2019-07-07 12:57:58

Tags: python-3.x pandas-groupby

My pandas data is stored in the following format:

Cus No  Purchase_date   Branch_code Amount
111     6-Jun-18        AAA         100
111     6-Jun-18        AAA         50
111     8-Jun-18        BBB         125
111     8-Aug-18        CCC         130
111     12-Dec-18       BBB         200
111     15-Feb-17       AAA         10
111     18-Jan-18       AAA         20
222     6-Jun-18        DDD         100
222     6-Jun-18        AAA         50
222     8-Jun-18        AAA         125
222     8-Aug-18        DDD         130
222     12-Dec-18       AAA         200
222     15-Feb-17       CCC         10
222     18-Jan-18       CCC         20

Expected output format in pandas:

Cus_No  Tot_Amount  Tot_Freq  Top_1_Branch  Top1_Tot_Sum  Top1_Tot_Freq  Top1_Avg_mon_sum  Top1_Avg_mon_freq  Top_2_Branch  Top2_Tot_Sum  Top2_Tot_Freq  Top2_Avg_mon_sum  Top2_Avg_mon_freq
111     635         7         BBB           325           2              162.5             1                  AAA           180           4              60                1.3
222     635         7         AAA           375           3              187.5             1.5                DDD           230           2              115               1

Column descriptions:

Group by customer number and get the following columns:

1. Tot_Amount: sum of "Amount" per Cus No
2. Tot_Freq: count of records per Cus No
3. Top_1_Branch: for each Cus No, the "Branch_code" with the largest sum of "Amount". For example, for Cus No 111, "Branch_code" BBB has the largest sum of "Amount".
4. Top1_Tot_Sum: sum of "Amount" for that Cus No at its Top_1_Branch
5. Top1_Tot_Freq: count of records for that Cus No at its Top_1_Branch
6. Top1_Avg_mon_sum: Top1_Tot_Sum divided by the number of unique months in "Purchase_date" for that Cus No and branch (for Cus No 111 and BBB: 325 / 2 months = 162.5)
7. Top1_Avg_mon_freq: Top1_Tot_Freq divided by the same number of unique months (for Cus No 111 and BBB: 2 / 2 months = 1)

Similarly, get all of these columns for the top 2 branch code.
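The column rules can be checked against the sample data with a short sketch. It assumes the columns are named Cus_No, Purchase_date, Branch_code and Amount (the sample header shows "Cus No" with a space), that the dates follow the pattern %d-%b-%y, and that "unique months" means the distinct calendar months in which that customer bought at that branch, which is the reading that reproduces the expected output. The helper column `month` is made up for illustration, and only customer 111 is included to keep it short.

```python
import pandas as pd

# Customer 111 rows from the sample; column names are assumed
rows = [
    (111, "6-Jun-18", "AAA", 100),
    (111, "6-Jun-18", "AAA", 50),
    (111, "8-Jun-18", "BBB", 125),
    (111, "8-Aug-18", "CCC", 130),
    (111, "12-Dec-18", "BBB", 200),
    (111, "15-Feb-17", "AAA", 10),
    (111, "18-Jan-18", "AAA", 20),
]
data = pd.DataFrame(rows, columns=["Cus_No", "Purchase_date", "Branch_code", "Amount"])
data["Purchase_date"] = pd.to_datetime(data["Purchase_date"], format="%d-%b-%y")

# Year-month key, so that Jun-18 and Jun-17 would count as different months
data["month"] = data["Purchase_date"].dt.strftime("%Y-%m")

# One row per (customer, branch): total, record count, distinct purchase months
stats = data.groupby(["Cus_No", "Branch_code"]).agg(
    Tot_Sum=("Amount", "sum"),
    Tot_Freq=("Amount", "count"),
    Months=("month", "nunique"),
)
stats["Avg_mon_sum"] = stats["Tot_Sum"] / stats["Months"]
stats["Avg_mon_freq"] = stats["Tot_Freq"] / stats["Months"]

# Order the branches inside each customer by total amount, largest first
stats = stats.sort_values(["Cus_No", "Tot_Sum"], ascending=[True, False])
print(stats)
```

The first row of `stats` for a customer is then its top-1 branch and the second row its top-2 branch.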

1 answer:

Answer 0: (score: 0)

I'll get you started with the top-1 columns; from there you should be able to work out the top-2 columns yourself:

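import numpy as np  # used further down for months_per_top1

# Assumes the customer column is named "Cus_No" (the sample header shows "Cus No").
# The first two columns only need to be grouped by customer number
grouped_df = data.groupby("Cus_No")
out_df = grouped_df.Amount.sum().to_frame("Tot_Amount")
out_df["Tot_Freq"] = grouped_df.Amount.count()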

# Assuming Purchase_date is already a datetime64 column (e.g. converted with
# pd.to_datetime); the month/year key is needed later to count unique months
data["month_year"] = data.Purchase_date.dt.to_period("M")

# Next we group by cus_no and then branch_code
branch_group = data.groupby(["Cus_No", "Branch_code"])
top_sums = branch_group.Amount.sum().groupby(level=0, group_keys=False).nlargest(1)
out_df["Top_1_Branch"] = top_sums.index.get_level_values(1).values
out_df["Top1_Tot_Sum"] = top_sums.values

# Now we have to retrieve information from the branch_group groups based on the
# top-1 information we have in out_df. A GroupBy object has no .loc, so each
# (Cus_No, Branch_code) pair is looked up with get_group. The only way I can
# think of doing this is iterative indexing. Cus_No is the index of out_df.
top_pairs = list(zip(out_df.index, out_df["Top_1_Branch"]))
out_df["Top1_Tot_Freq"] = [branch_group.get_group(pair).shape[0]
                           for pair in top_pairs]

months_per_top1 = np.array([branch_group.get_group(pair)["month_year"].nunique()
                            for pair in top_pairs])

out_df["Top1_Avg_mon_sum"] = out_df.Top1_Tot_Sum / months_per_top1
out_df["Top1_Avg_mon_freq"] = out_df.Top1_Tot_Freq / months_per_top1

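The list comprehensions may not be the most efficient code, but this should roughly get the job done. Pay attention to the order in which you set values in out_df. You may want to join on the customer number to make sure the right values end up in the right rows of out_df.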
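One way to act on that advice is to avoid positional assignment altogether: compute the per-branch statistics once, pick each customer's top branch, and join the result onto the per-customer totals by Cus_No. The sketch below is one possible arrangement rather than the answer's own code. The small frame is an illustrative subset of the question's sample, and the helper names `branch`, `top1` and `month` are made up here.

```python
import pandas as pd

# Illustrative data: a subset of the question's sample with two customers
data = pd.DataFrame({
    "Cus_No": [111, 111, 111, 222, 222, 222],
    "Purchase_date": pd.to_datetime(
        ["2018-06-06", "2018-06-08", "2018-12-12",
         "2018-06-06", "2018-06-08", "2018-12-12"]),
    "Branch_code": ["AAA", "BBB", "BBB", "DDD", "AAA", "AAA"],
    "Amount": [100, 125, 200, 100, 125, 200],
})
data["month"] = data["Purchase_date"].dt.strftime("%Y-%m")

# Per-customer totals, indexed by Cus_No
out_df = data.groupby("Cus_No")["Amount"].agg(Tot_Amount="sum", Tot_Freq="count")

# Per-(customer, branch) statistics as ordinary columns
branch = (data.groupby(["Cus_No", "Branch_code"])
              .agg(Top1_Tot_Sum=("Amount", "sum"),
                   Top1_Tot_Freq=("Amount", "count"),
                   months=("month", "nunique"))
              .reset_index())

# Keep only the branch with the largest sum for each customer
top1 = (branch.sort_values("Top1_Tot_Sum", ascending=False)
              .drop_duplicates("Cus_No")
              .rename(columns={"Branch_code": "Top_1_Branch"})
              .set_index("Cus_No"))
top1["Top1_Avg_mon_sum"] = top1["Top1_Tot_Sum"] / top1["months"]
top1["Top1_Avg_mon_freq"] = top1["Top1_Tot_Freq"] / top1["months"]

# join aligns on the Cus_No index, so row order no longer matters
out_df = out_df.join(top1.drop(columns="months"))
print(out_df)
```

Because the join matches rows by Cus_No, the result stays correct even if the two intermediate frames are sorted differently.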

Edit: a hint to get started on the top-2 branches:

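grouped = branch_group.Amount.sum().groupby(level=0, group_keys=False)
# nlargest(2) keeps each customer's two largest branch sums in descending order,
# so the last of them is the second-ranked branch (or the only branch, if a
# customer has just one)
second_sums = grouped.nlargest(2).groupby(level=0).tail(1)
out_df["Top_2_Branch"] = second_sums.index.get_level_values(1).values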

The rest is almost the same.
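Putting the pieces together, the top-1 and top-2 columns can come from a single ranking step instead of two separate passes. The following is a rough sketch under the same assumptions as the question's sample (columns Cus_No, Purchase_date, Branch_code, Amount; dates in the pattern %d-%b-%y). The function name `top_n_branches` and the helper columns `month`, `months` and `rank` are hypothetical, and ties between branch sums are broken arbitrarily.

```python
import pandas as pd

def top_n_branches(data, n=2):
    """Wide per-customer summary with one column block per top-n branch."""
    data = data.assign(month=data["Purchase_date"].dt.strftime("%Y-%m"))
    totals = data.groupby("Cus_No")["Amount"].agg(Tot_Amount="sum", Tot_Freq="count")

    stats = (data.groupby(["Cus_No", "Branch_code"])
                 .agg(Tot_Sum=("Amount", "sum"),
                      Tot_Freq=("Amount", "count"),
                      months=("month", "nunique"))
                 .reset_index())
    stats["Avg_mon_sum"] = stats["Tot_Sum"] / stats["months"]
    stats["Avg_mon_freq"] = stats["Tot_Freq"] / stats["months"]

    # Rank branches within each customer: 1 = largest total amount
    stats = stats.sort_values(["Cus_No", "Tot_Sum"], ascending=[True, False])
    stats["rank"] = stats.groupby("Cus_No").cumcount() + 1

    pieces = [totals]
    for k in range(1, n + 1):
        part = stats[stats["rank"] == k].set_index("Cus_No")
        part = part.drop(columns=["months", "rank"])
        # Column names follow the question's header (Top_1_Branch, Top1_Tot_Sum, ...)
        part = part.rename(columns=lambda c, k=k: (
            f"Top_{k}_Branch" if c == "Branch_code" else f"Top{k}_{c}"))
        pieces.append(part)
    # Customers with fewer than n branches get NaN in the missing blocks
    return pd.concat(pieces, axis=1)

rows = [
    (111, "6-Jun-18", "AAA", 100), (111, "6-Jun-18", "AAA", 50),
    (111, "8-Jun-18", "BBB", 125), (111, "8-Aug-18", "CCC", 130),
    (111, "12-Dec-18", "BBB", 200), (111, "15-Feb-17", "AAA", 10),
    (111, "18-Jan-18", "AAA", 20),
    (222, "6-Jun-18", "DDD", 100), (222, "6-Jun-18", "AAA", 50),
    (222, "8-Jun-18", "AAA", 125), (222, "8-Aug-18", "DDD", 130),
    (222, "12-Dec-18", "AAA", 200), (222, "15-Feb-17", "CCC", 10),
    (222, "18-Jan-18", "CCC", 20),
]
data = pd.DataFrame(rows, columns=["Cus_No", "Purchase_date", "Branch_code", "Amount"])
data["Purchase_date"] = pd.to_datetime(data["Purchase_date"], format="%d-%b-%y")

result = top_n_branches(data, n=2)
print(result.round(1).to_string())
```

Passing a larger `n` adds further Top_k blocks without changing the rest of the code.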