How do I use the silhouette score with k-means clustering from the sklearn library?

Asked: 2018-07-02 14:40:41

Tags: python-2.7 machine-learning scikit-learn k-means silhouette

I want to use the silhouette score in my script to automatically determine the number of clusters for sklearn's k-means clustering.

import numpy as np
import pandas as pd
import csv
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

filename = "CSV_BIG.csv"

# Read the CSV file with the Pandas lib.
path_dir = ".\\"
dataframe = pd.read_csv(path_dir + filename, encoding = "utf-8", sep = ';' ) # "ISO-8859-1")
df = dataframe.copy(deep=True)

#Use silhouette score
range_n_clusters = list (range(2,10))
print ("Number of clusters from 2 to 9: \n", range_n_clusters)

for n_clusters in range_n_clusters:
    clusterer = KMeans (n_clusters=n_clusters).fit(?)
    preds = clusterer.predict(?)
    centers = clusterer.cluster_centers_

    score = silhouette_score (?, preds, metric='euclidean')
    print ("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

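Can someone help me with the question marks? I don't know what to put in their place. I took the code from an example. The commented part is from the previous version, in which I ran k-means clustering with the number of clusters fixed at 4. That approach works, but in my project I need to choose the number of clusters automatically.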

1 answer:

Answer 0: (score: 4)

I assume you want to use the silhouette score to find the optimal number of clusters.

First declare a separate KMeans object, then call its fit_predict function on your data df, like this:

for n_clusters in range_n_clusters:
    clusterer = KMeans (n_clusters=n_clusters)
    preds = clusterer.fit_predict(df)
    centers = clusterer.cluster_centers_

    score = silhouette_score (df, preds, metric='euclidean')
    print ("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
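
If you also want the script to pick the number of clusters on its own rather than only print the scores, one option is to store the score for each k and take the k with the highest one. The following is only a minimal sketch under some assumptions: the synthetic data from make_blobs stands in for df (which I assume contains only numeric columns with no missing values), and the helper name best_n_clusters as well as the fixed n_init and random_state values are my own choices for illustration, not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df (assumption: the real data is numeric only).
# Four well-separated blobs, so the expected best k is easy to reason about.
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=1.0,
                  random_state=0)


def best_n_clusters(data, candidates):
    """Hypothetical helper: return (best_k, scores), where scores maps
    each candidate k to its silhouette score on `data`."""
    scores = {}
    for k in candidates:
        # fit_predict fits the model and returns the cluster label of each row
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(data)
        scores[k] = silhouette_score(data, labels, metric="euclidean")
    # A higher silhouette score means tighter, better separated clusters
    best_k = max(scores, key=scores.get)
    return best_k, scores


best_k, scores = best_n_clusters(X, range(2, 10))
for k in sorted(scores):
    print("For n_clusters = {}, silhouette score is {:.3f}".format(k, scores[k]))
print("Best number of clusters:", best_k)
```

With your own data you would pass df (or df.values) instead of X. Note that the silhouette score is only defined for at least 2 clusters, which is why the range starts at 2.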

See this official example for more information.