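I need to cluster customer data that contains both categorical and numerical features. The numerical features are not on the same scale (age, income, ...). After scaling with StandardScaler, I tried Mclust on the numerical data, but it gave me overlapping groups.
1 - Should I normalize instead if the results with StandardScaler are unsatisfactory?
2 - What is the best way to cluster using K-Prototypes?
3 - Should the choice of clustering method depend on the data distribution?
I am using pandas. This is what I used: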
# K-means clustering: search for K with the elbow method
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance as sci_distance
from sklearn import cluster as sk_cluster
cdata = data
K = range(1, 10)
# Fit one K-means model per candidate number of clusters
KM = [sk_cluster.KMeans(n_clusters=k).fit(cdata) for k in K]
centroids = [km.cluster_centers_ for km in KM]
# Distance from every point to every centroid, then to its nearest centroid
D_k = [sci_distance.cdist(cdata, cent, 'euclidean') for cent in centroids]
dist = [np.min(D, axis=1) for D in D_k]
# Square the distances so the value matches the y-axis label below
avgWithinSS = [np.sum(d ** 2) / cdata.shape[0] for d in dist]
plt.plot(K, avgWithinSS, 'b*-')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')
plt.show()
# K-means clustering
import pandas as pd
from sklearn.cluster import (KMeans, AgglomerativeClustering,
                             AffinityPropagation)  # For clustering
from sklearn.mixture import GaussianMixture  # For GMM clustering
import matplotlib.pyplot as plt  # For graphics
import seaborn as sns
# Clustering
def doKmeans(X, nclust=3):
    # Fit K-means and return the label of each row plus the centroids
    model = KMeans(n_clusters=nclust)
    model.fit(X)
    clust_labels = model.predict(X)
    cent = model.cluster_centers_
    return (clust_labels, cent)

clust_labels, cent = doKmeans(data, 3)
# Reuse the index of data so the labels stay aligned with its rows
kmeans = pd.DataFrame(clust_labels, index=data.index)
# Append the labels as a new last column
data.insert(data.shape[1], 'kmeans', kmeans[0])
# Plot the clusters obtained using k means
fig = plt.figure()
ax = fig.add_subplot(111)
scatter = ax.scatter(data['var1'], data['var2'],
                     c=kmeans[0], s=50)
ax.set_title('K-Means Clustering')
ax.set_xlabel('var1')
ax.set_ylabel('var2')
plt.colorbar(scatter)
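As a side note on question 1, the difference between standardizing and normalizing can be seen on a small example. This is only a sketch: the `toy` frame and its age and income values are made up for illustration and are not the customer data from the question.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values (assumption), chosen only to show two very different ranges.
toy = pd.DataFrame({
    "age": [22, 35, 47, 58, 63],
    "income": [18000, 42000, 55000, 61000, 250000],
})

# StandardScaler: each column gets zero mean and unit (population) variance.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns
)

# MinMaxScaler: each column is mapped linearly onto the interval [0, 1].
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns
)

print(standardized.agg(["mean", "min", "max"]))
print(normalized.agg(["mean", "min", "max"]))
```

Both transforms are linear per column, so an outlier such as the largest income still dominates its column after either one; neither changes the shape of the distribution.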
Answer 0 (score: 0)
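You are approaching this in a very wrong way.
Don't choose a method just because you managed to get the code to run. That will never get you good results.
First figure out what you need. What is a cluster? What is a clustering (must every point belong to a cluster? Probably not. And so on)? What is a good cluster, and how can I measure that? Only then choose an algorithm, based on how well it fits those requirements.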
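One way to make "how can I measure that?" concrete is an internal validity index. The sketch below uses the silhouette coefficient from scikit-learn as one possible choice; that choice is an assumption for illustration, not something the question or this answer prescribes. The data is synthetic (`make_blobs`), and `silhouette_by_k` is a hypothetical helper name.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled numeric features (assumption): three
# well-separated blobs, so the expected structure is known in advance.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.6,
    random_state=0,
)
X = StandardScaler().fit_transform(X)


def silhouette_by_k(X, k_values, seed=0):
    """Return {k: mean silhouette coefficient} for K-means run with each k."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k by silhouette:", best_k)
```

The silhouette coefficient rewards compact, well-separated groups under Euclidean distance. If that is not what "a good cluster" means for your customers, a high score here says little, which is exactly why the requirement has to come first.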
Otherwise, you will be solving the wrong problem.