My DataFrame (DF) looks like this:
Customer_number Store_number year month last_buying_date1 amount
1 20 2014 10 2015-10-07 100
1 20 2014 10 2015-10-09 200
2 20 2014 10 2015-10-20 100
2 10 2014 10 2015-10-13 500
I would like to get output like this:
year month sum_purchase count_purchases distinct_customers
2014 10 900 4 3
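How can I get this output using agg and groupby? I am currently using a two-step groupby, but I am struggling to get the distinct customers. Here is my approach: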
#### Step 1 - Aggregating everything at customer_number, store_number level
aggregations = {
'amount': 'sum',
'last_buying_date1': 'count',
}
grouped_at_Cust = DF.groupby(['Customer_number','Store_number','month','year']).agg(aggregations).reset_index()
grouped_at_Cust.columns = ['customer_number','store_number','month','year','total_purchase','num_purchase']
#### Step2 - Aggregating at year month level
aggregations = {
'total_purchase': 'sum',
'num_purchase': 'sum',
# size  <- how can this be included here?
}
Monthly_customers = grouped_at_Cust.groupby(['year','month']).agg(aggregations).reset_index()
Monthly_customers.columns = ['year','month','sum_purchase','count_purchase','distinct_customers']
My struggle is with the second step. How do I include size in the second aggregation step?
Answer 0: (score: 1)
You can use groupby.agg and supply the function nunique to return the number of unique customer IDs in each group.
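# 'amount' is the purchase column shown in the question's data
df_grp = df.groupby(['year', 'month'], as_index=False) \
    .agg({'amount':['sum','count'], 'Customer_number':['nunique']})
# Flatten the two-level column names; strip the trailing '_' left on the group keys
df_grp.columns = ['_'.join(col).rstrip('_') for col in df_grp.columns]
df_grp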
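On pandas 0.25 or later, named aggregation can produce the final column names in a single step, which avoids flattening the columns afterwards. The following is a minimal sketch, assuming the sample rows from the question; the output names are taken from the desired output, and `monthly` is just an illustrative variable name.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    'Customer_number': [1, 1, 2, 2],
    'Store_number': [20, 20, 20, 10],
    'year': [2014, 2014, 2014, 2014],
    'month': [10, 10, 10, 10],
    'last_buying_date1': pd.to_datetime(
        ['2015-10-07', '2015-10-09', '2015-10-20', '2015-10-13']),
    'amount': [100, 200, 100, 500],
})

# Named aggregation: output_name=(input_column, function)
monthly = (
    df.groupby(['year', 'month'], as_index=False)
      .agg(sum_purchase=('amount', 'sum'),
           count_purchases=('amount', 'count'),
           distinct_customers=('Customer_number', 'nunique'))
)
print(monthly)
```

On this sample, nunique returns 2, because customers 1 and 2 are the only distinct IDs in the month.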
Note that you appear to be grouping them differently (omitting certain columns) when performing the groupby operations:
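# Month-level totals ('amount' is the purchase column in the question's data)
df_grp_1 = df.groupby(['year', 'month']).agg({'amount':['sum','count']})
# Unique customers per store and month; name the result 'nunique' so it can be renamed later
df_grp_2 = df.groupby(['Store_number', 'month', 'year'])['Customer_number'].agg('nunique').rename('nunique')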
Take level 1 of the multi-index columns, which holds the names of the agg operations performed:
df_grp_1.columns = df_grp_1.columns.get_level_values(1)
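To see what this step does, here is a small sketch on a two-row frame (made up for illustration, not the question's data) showing the column labels before and after taking level 1.

```python
import pandas as pd

# Tiny illustrative frame; values are assumptions, not from the question
demo = pd.DataFrame({'year': [2014, 2014], 'month': [10, 10], 'amount': [100, 200]})

grp = demo.groupby(['year', 'month']).agg({'amount': ['sum', 'count']})

# Before: two-level labels of the form (input column, aggregation name)
before = grp.columns.tolist()

# After: keep only level 1, the aggregation names
grp.columns = grp.columns.get_level_values(1)
after = grp.columns.tolist()

print(before)
print(after)
```

Level 0 holds the input column name ('amount') and level 1 holds the aggregation names, so keeping level 1 leaves 'sum' and 'count' as plain column labels.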
Merge them back on the intersection of the columns used to group them:
df_grp = df_grp_1.reset_index().merge(df_grp_2.reset_index().drop(['Store_number'],
axis=1), on=['year', 'month'], how='outer')
Rename the columns to the new names:
d = {'sum': 'sum_purchase', 'count': 'count_purchase', 'nunique': 'distinct_customers'}
df_grp.columns = [d.get(x, x) for x in df_grp.columns]
df_grp
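The value 3 in the question's desired output matches the number of distinct (customer, store) pairs rather than the number of distinct customer IDs. If that is the intended meaning, the two-step approach from the question can use 'size' in the second step. The following is a minimal sketch under that assumption, using named aggregation (pandas 0.25 or later) and the sample rows from the question; `per_customer` and `monthly_customers` are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    'Customer_number': [1, 1, 2, 2],
    'Store_number': [20, 20, 20, 10],
    'year': [2014, 2014, 2014, 2014],
    'month': [10, 10, 10, 10],
    'last_buying_date1': pd.to_datetime(
        ['2015-10-07', '2015-10-09', '2015-10-20', '2015-10-13']),
    'amount': [100, 200, 100, 500],
})

# Step 1: one row per (customer, store, year, month)
per_customer = (
    df.groupby(['Customer_number', 'Store_number', 'year', 'month'], as_index=False)
      .agg(total_purchase=('amount', 'sum'),
           num_purchase=('last_buying_date1', 'count'))
)

# Step 2: roll up to month level; 'size' counts the step-1 rows,
# i.e. the distinct (customer, store) pairs in each month
monthly_customers = (
    per_customer.groupby(['year', 'month'], as_index=False)
      .agg(sum_purchase=('total_purchase', 'sum'),
           count_purchase=('num_purchase', 'sum'),
           distinct_customers=('Customer_number', 'size'))
)
print(monthly_customers)
```

Because step 1 already leaves exactly one row per customer and store, counting rows in step 2 is enough; no separate nunique call is needed.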