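I am using k-means clustering as a way to segment customers and products. I found a function on Stack Overflow that takes the clustering results and reorders them based on the mean of a target value in the dataframe. This seems to work well, but in order to plot the results I first create a string column in the dataframe based on the ordered clusters, to prevent seaborn from binning the values in the hue labels. The first problem I ran into was that, although the plot and labels were generated as intended, the legend was out of order. I added a hue order, but the legend is then fixed to that order, so changing the value of K messes up the legend. I also added a function to deal with this, and everything seems to work as expected, but I would like to know whether there is a better way to accomplish this. The relevant code blocks are below.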
#imports used by the code below
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans

#function for ordering cluster numbers
def order_cluster(cluster_field_name, target_field_name, df, ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={"index": cluster_field_name})
    return df_final
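As an aside, I suspect the same reordering could be done without the merge, by mapping each old cluster number to its position in the sorted group means. This is only a minimal sketch: `order_cluster_by_mean` is a name I made up for illustration, not the function I found, and unlike the merge version it keeps the original row order and column position.

```python
import pandas as pd


def order_cluster_by_mean(df, cluster_col, target_col, ascending=True):
    """Renumber clusters 0..k-1 by the mean of target_col within each cluster."""
    # Mean of the target per cluster, sorted in the requested direction.
    means = df.groupby(cluster_col)[target_col].mean()
    ordered = means.sort_values(ascending=ascending).index
    # Old cluster number -> new position in the sorted order.
    mapping = {old: new for new, old in enumerate(ordered)}
    out = df.copy()
    out[cluster_col] = out[cluster_col].map(mapping)
    return out
```

With `ascending=True`, the cluster with the largest mean ends up with the highest number, which matches how `top` is used further down.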
#adding column to dataframe based on clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(data[['ORDERS_PLACED','UNITS_SOLD','AVG_UNIT_PRICE','TOTAL_SALES']])
data['Rank'] = kmeans.predict(data[['ORDERS_PLACED','UNITS_SOLD','AVG_UNIT_PRICE','TOTAL_SALES']])
#ordering the results
data = order_cluster('Rank','TOTAL_SALES',data,True)
top = data['Rank'].max()
#adding string column to dataframe
data['Rank_ID'] = [
    ('Group_A' if x == top else
    ('Group_B' if x == top - 1 else
    ('Group_C' if x == top - 2 else
    ('Group_D' if x == top - 3 else
    ('Group_E' if x == top - 4 else
    ('Group_F' if x == top - 5 else
    ('Group_G' if x == top - 6 else
    ('Group_H' if x == top - 7 else
    ('Group_I' if x == top - 8 else
    ('Group_J' if x == top - 9 else 'Group_Z')))))))))
    ) for x in data['Rank']
]
#function to build the plot legend values
def build_legend(k_value):
    if k_value == 0:
        legend = ['Group_A']
    elif k_value == 1:
        legend = ['Group_A','Group_B']
    elif k_value == 2:
        legend = ['Group_A','Group_B','Group_C']
    elif k_value == 3:
        legend = ['Group_A','Group_B','Group_C','Group_D']
    elif k_value == 4:
        legend = ['Group_A','Group_B','Group_C','Group_D','Group_E']
    elif k_value == 5:
        legend = ['Group_A','Group_B','Group_C','Group_D','Group_E','Group_F']
    elif k_value == 6:
        legend = ['Group_A','Group_B','Group_C','Group_D','Group_E','Group_F','Group_G']
    elif k_value == 7:
        legend = ['Group_A','Group_B','Group_C','Group_D','Group_E','Group_F','Group_G','Group_H']
    elif k_value == 8:
        legend = ['Group_A','Group_B','Group_C','Group_D','Group_E','Group_F','Group_G','Group_H','Group_I']
    elif k_value == 9:
        legend = ['Group_A','Group_B','Group_C','Group_D','Group_E','Group_F','Group_G','Group_H','Group_I','Group_J']
    else:
        legend = ['Group_A','Group_B','Group_C','Group_D','Group_E','Group_F','Group_G','Group_H','Group_I','Group_J','Group_Z']
    return legend
#plotting the results
orderHue = build_legend(top)
fig, ax = plt.subplots(figsize=(12,5))
plot = sns.scatterplot(x='ORDERS_PLACED', y='TOTAL_SALES', hue='Rank_ID', size='Rank_ID',
                       hue_order=orderHue, size_order=orderHue, data=data, ax=ax)
ytick = plot.get_yticks()
plot.set_yticklabels(['{:,.0f}'.format(x) for x in ytick])
plot.set_title('80/20 Customer Segmentation Using K-Means Clustering, Plot on Orders Placed & Total Sales',fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
plt.show()
This seems like a lot of code for something that is probably quite simple.
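For example, I imagine both the label column and the legend order could be generated from the highest rank instead of being spelled out case by case. The following is only a sketch of that idea under my own assumptions: `group_labels` and `add_rank_labels` are hypothetical helper names, ranks are assumed to be the integers 0..k-1, and it supports up to 26 groups rather than falling back to 'Group_Z' after ten.

```python
import string

import pandas as pd


def group_labels(k):
    """Return ['Group_A', 'Group_B', ...] for k clusters (1 <= k <= 26)."""
    if not 1 <= k <= 26:
        raise ValueError("k must be between 1 and 26")
    return [f"Group_{letter}" for letter in string.ascii_uppercase[:k]]


def add_rank_labels(df, rank_col="Rank", label_col="Rank_ID"):
    """Label the highest rank Group_A, the next one Group_B, and so on."""
    top = int(df[rank_col].max())
    labels = group_labels(top + 1)
    out = df.copy()
    # Rank `top` maps to labels[0] (Group_A); rank 0 maps to the last label.
    out[label_col] = [labels[top - int(r)] for r in out[rank_col]]
    return out, labels
```

The returned `labels` list could then be passed as both `hue_order` and `size_order` in `sns.scatterplot`, so the legend follows K without a separate lookup function.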
Here is a quick data sample, as requested:
CUSTOMER_ID  ORDERS_PLACED  UNITS_SOLD  AVG_UNIT_PRICE  TOTAL_SALES
A                        2          59         21553.9      1271680
B                      106         184          6295.9    1158445.7
C                       13          78           14290      1114620
D                       43        2034          245.38       499102
E                       53         582          760.92       442856
F                        1           6           15000        90000
G                        3          60             967        58020
H                        1           1            1807         1807
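To make the sample easier to experiment with, here are the same eight rows as a DataFrame. The variable name `sample` is just a placeholder; in my code above the dataframe is called `data`.

```python
import pandas as pd

# The eight sample rows from the table above, copied value for value.
sample = pd.DataFrame({
    "CUSTOMER_ID": list("ABCDEFGH"),
    "ORDERS_PLACED": [2, 106, 13, 43, 53, 1, 3, 1],
    "UNITS_SOLD": [59, 184, 78, 2034, 582, 6, 60, 1],
    "AVG_UNIT_PRICE": [21553.9, 6295.9, 14290, 245.38, 760.92, 15000, 967, 1807],
    "TOTAL_SALES": [1271680, 1158445.7, 1114620, 499102, 442856, 90000, 58020, 1807],
})
```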