How do I automatically determine the number of clusters?

Date: 2019-03-01 01:01:43

Tags: python machine-learning scikit-learn cluster-analysis

I have been using the following script:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import textract
import os

folder_to_scan = '/media/sf_Documents/clustering'
dict_of_docs = {}

# Gets all the files to scan with textract
for root, sub, files in os.walk(folder_to_scan):
    for file in files:
        full_path = os.path.join(root, file)
        print(f'Processing {file}')
        try:
            text = textract.process(full_path)
            dict_of_docs[file] = text
        except Exception as e:
            print(e)


vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())

true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i,)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind],)

It scans a folder of images of scanned documents, extracts the text, and then clusters the text. I know for a fact that there are 3 different types of documents, so I set true_k to 3. But what if I had a folder of unknown documents, where there could be anywhere from 1 to 100+ different types of documents?

1 Answer:

Answer 0 (score: 2):

This is slippery ground, because it is hard to measure how well your clustering algorithm works without any ground-truth labels. To make the selection automatic, you need a metric that compares how KMeans performs for different values of n_clusters.

A popular choice is the silhouette score. You can find more details about it here. Quoting the scikit-learn documentation:

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1.

As a result, you can only compute the silhouette score for n_clusters >= 2 (which, unfortunately, may be a limitation for you given your problem description).
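To make the quoted formula concrete, here is a minimal sketch (not part of the original answer) using sklearn.metrics.silhouette_samples: it computes the per-sample coefficient (b - a) / max(a, b), checks that the silhouette score is just the mean of those coefficients, and shows that a single label violates the 2 <= n_labels constraint:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=1, random_state=0).fit_predict(X)

# Per-sample coefficients, i.e. (b - a) / max(a, b) for every sample
per_sample = silhouette_samples(X, labels)

# The silhouette score is simply the mean of the per-sample coefficients
print(np.isclose(per_sample.mean(), silhouette_score(X, labels)))  # True

# With a single label the coefficient is undefined, so scikit-learn raises an error
try:
    silhouette_score(X, np.zeros(len(X), dtype=int))
except ValueError as err:
    print(err)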

This is how you would use it on a dummy dataset (you can adapt it to your code; it is just a reproducible example):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = load_iris()
X = iris.data

sil_score_max = -1  # this is the minimum possible score

for n_clusters in range(2, 10):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    sil_score = silhouette_score(X, labels)
    print("The average silhouette score for %i clusters is %0.2f" % (n_clusters, sil_score))
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters

This returns:

The average silhouette score for 2 clusters is 0.68
The average silhouette score for 3 clusters is 0.55
The average silhouette score for 4 clusters is 0.50
The average silhouette score for 5 clusters is 0.49
The average silhouette score for 6 clusters is 0.36
The average silhouette score for 7 clusters is 0.46
The average silhouette score for 8 clusters is 0.34
The average silhouette score for 9 clusters is 0.31

So you will end up with best_n_clusters = 2 (note: in reality, the iris dataset has 3 classes...)
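If you want to plug this back into your original script, a minimal sketch (my adaptation, not the answerer's code) could look like the one below. It assumes X is the TF-IDF matrix produced by vectorizer.fit_transform(dict_of_docs.values()), and the upper bound of 10 candidate cluster counts is an arbitrary assumption you may want to raise for folders with many document types:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is the TF-IDF matrix from vectorizer.fit_transform(dict_of_docs.values())
best_n_clusters = 2
sil_score_max = -1

# n_clusters must stay below the number of documents for the silhouette score to be defined
for n_clusters in range(2, min(10, X.shape[0])):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
    labels = model.fit_predict(X)
    sil_score = silhouette_score(X, labels)
    if sil_score > sil_score_max:
        sil_score_max = sil_score
        best_n_clusters = n_clusters

# Refit with the selected number of clusters and reuse the top-terms loop from the question
model = KMeans(n_clusters=best_n_clusters, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

After this, true_k in the original printing loop can simply be replaced by best_n_clusters.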