Question

Is it possible for terms to appear in more than one cluster when using KMeans with a TF-IDF vectorizer?

Here is the example dataset, from which I extract features with a TF-IDF vectorizer:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

I vectorize the documents and cluster them with KMeans from scikit-learn:

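from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
# For each centroid, term indices sorted by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# Vocabulary terms indexed by column (get_feature_names() on scikit-learn older than 1.0)
terms = vectorizer.get_feature_names_out()
print("Top terms per cluster:")
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end='')
    print()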

The result is:

Top terms per cluster:
Cluster 0: user, eps, interface, human, response, time, computer, management, engineering, testing,
Cluster 1: trees, intersection, paths, random, generation, unordered, binary, graph, interface, human,
Cluster 2: minors, graph, survey, widths, ordering, quasi, iv, trees, engineering, eps,

We can see that some terms appear in more than one cluster (for example, graph in clusters 1 and 2, and eps in clusters 0 and 2).

Is the clustering result wrong? Or is it acceptable because the TF-IDF scores of the terms above differ from document to document?

Answer 1

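I think you are a bit confused about what you are trying to do. The code you are using gives you a clustering of the documents, not of the terms. The terms are the dimensions along which you are clustering.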
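To make this concrete, here is a minimal sketch that inspects the centroids directly. Each centroid is a vector with one weight per vocabulary term, so the same term can carry weight in several centroids, and with a corpus this small a top-10 list may even include terms whose weight is very small or zero. The `random_state=0` argument and the choice of the term `graph` are my own additions for illustration and are not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)

# random_state=0 is an assumption, added only to make the run repeatable
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1,
            random_state=0)
km.fit(feature)

# One row per cluster, one column per vocabulary term
print("Centroid matrix shape:", km.cluster_centers_.shape)

# Weight of a single term in every centroid
col = vectorizer.vocabulary_['graph']
graph_weights = km.cluster_centers_[:, col]
for i, w in enumerate(graph_weights):
    print("Cluster %d: weight of 'graph' = %.3f" % (i, w))

# How many terms actually have a positive weight in each centroid
nonzero_per_cluster = (km.cluster_centers_ > 0).sum(axis=1)
print("Terms with positive weight per cluster:", nonzero_per_cluster)
```

If a cluster has fewer than ten terms with positive weight, the rest of its top-10 list is filled with zero-weight terms, which is one way a term can show up in a cluster where it plays no real role.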

If you want to find the cluster each document belongs to, just use the predict or fit_predict method, like this:

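from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
for n in range(len(documents)):
    # predict returns a one-element array for a single row; take its only entry
    print("Doc %d belongs to cluster %d. " % (n, km.predict(feature[n])[0]))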

You get:

Doc 0 belongs to cluster 2. 
Doc 1 belongs to cluster 1. 
Doc 2 belongs to cluster 2. 
Doc 3 belongs to cluster 2. 
Doc 4 belongs to cluster 1. 
Doc 5 belongs to cluster 0. 
Doc 6 belongs to cluster 0. 
Doc 7 belongs to cluster 0. 
Doc 8 belongs to cluster 1.
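
The same assignment can be obtained in one call with fit_predict, which fits the model and returns one label per document. The sketch below groups the document indices by label. The `random_state=0` argument and the `clusters` dictionary are my own additions for illustration.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

feature = TfidfVectorizer(stop_words='english').fit_transform(documents)

# random_state=0 is an assumption, added only to make the run repeatable
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1,
            random_state=0)

# fit_predict fits the model and returns the cluster label of every document
labels = km.fit_predict(feature)

# Group document indices by their cluster label
clusters = defaultdict(list)
for doc_id, label in enumerate(labels):
    clusters[int(label)].append(doc_id)

for label in sorted(clusters):
    print("Cluster %d: docs %s" % (label, clusters[label]))
```

Cluster numbers are arbitrary labels. Without a fixed random_state they can change from one run to the next, so the numbering here need not match the output above.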

See the User Guide of scikit-learn.
