Clustering text documents using scikit-learn kmeans in Python

Time: 2015-01-11 17:20:16

Tags: python python-2.7 scikit-learn cluster-analysis k-means

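I need to use scikit-learn's kMeans to cluster text documents. The example code works fine as it is, but it takes the 20 newsgroups data as input. I want to use the same code to cluster a list of documents, as shown below: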

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

What changes do I need to make to the kMeans example code to use this list as input? (Simply setting 'dataset = documents' doesn't work.)

2 answers:

Answer 0: (score: 59)

This is a simpler example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Vectorize the text, i.e. convert the strings to numeric features

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
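To see what the vectorization step produces, here is a small self-contained sketch that inspects the resulting matrix. It assumes a reasonably recent scikit-learn and the default TfidfVectorizer settings (lowercasing, L2 row normalization); the variable names row_norms and n_terms are only illustrative.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# X is a sparse matrix with one row per document and one column per term
n_docs, n_terms = X.shape
print("documents:", n_docs, "terms:", n_terms)
print("sparse:", issparse(X))

# vocabulary_ maps each (lowercased) term to its column index in X
print("column for 'trees':", vectorizer.vocabulary_["trees"])

# With the default norm='l2', every non-empty row has unit Euclidean length
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print("row norms:", row_norms)
```

Because KMeans accepts sparse input, X can be passed to fit() directly without converting it to a dense array.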

Cluster the documents

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

Print the top terms per cluster

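print("Top terms per cluster:")
# Sort each centroid's term weights in descending order; the result holds column indices
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# scikit-learn >= 1.0; older releases use vectorizer.get_feature_names()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    # The ten terms with the largest weight in this cluster's centroid
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()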
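Besides the top terms, it is often useful to see which cluster each document landed in and to assign previously unseen text to a cluster. The following is a minimal sketch of that, not part of the original answer: the new_docs strings are made up for illustration, and random_state=0 and n_init=10 are assumptions added only to make runs repeatable and less sensitive to initialization.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# random_state and n_init are assumptions for reproducibility, not from the answer
model = KMeans(n_clusters=2, init='k-means++', max_iter=100,
               n_init=10, random_state=0)
model.fit(X)

# labels_ holds the cluster index assigned to each training document
for label, doc in zip(model.labels_, documents):
    print(label, doc)

# Hypothetical new documents: reuse the fitted vectorizer (transform, not
# fit_transform) so they are mapped onto the same vocabulary and columns
new_docs = ["A survey of graph trees", "User interface response time"]
Y = vectorizer.transform(new_docs)
predictions = model.predict(Y)
print(predictions)
```

The cluster numbers themselves are arbitrary, so cluster 0 in one run may correspond to cluster 1 in another unless the random seed is fixed.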

If you want a more visual idea of how this looks, see this answer.

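Answer 1: (score: 3)

I found this article very useful for document clustering with K-Means: http://brandonrose.org/clustering

To understand the algorithm, you can also have a look at this article: https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/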
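For a rough feel of what the algorithm does internally, here is a minimal NumPy sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not scikit-learn's implementation and is not taken from either linked article: the function name kmeans_sketch, the plain random initialization (instead of k-means++), and the toy points are all assumptions made for illustration. It expects a dense array, so a TF-IDF matrix would need X.toarray() first.

```python
import numpy as np

def assign_labels(X, centroids):
    # Squared Euclidean distance from every point to every centroid,
    # then the index of the nearest centroid for each point
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize with k distinct data points picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        labels = assign_labels(X, centroids)
        # Update step: move each centroid to the mean of its members
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final labels consistent with the returned centroids
    return assign_labels(X, centroids), centroids

# Toy data: two well-separated groups of 2-D points
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

In practice the scikit-learn KMeans class is the better choice, since it adds k-means++ initialization, multiple restarts, and sparse-matrix support.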