How to use KMeans to find the documents that belong to the same cluster

Asked: 2014-09-14 01:45:42

Tags: python artificial-intelligence scikit-learn k-means

I am clustering a variety of articles with scikit-learn. Here are the top 15 words in each cluster:

Cluster 0: whales islands seaworld hurricane whale odile storm tropical kph mph pacific mexico orca coast cabos
Cluster 1: ebola outbreak vaccine africa usaid foundation virus cdc gates disease health vaccines experimental centers obama
Cluster 2: jones bobo sanford children carolina mississippi alabama lexington bodies crumpton mccarty county hyder tennessee sheriff
Cluster 3: isis obama iraq syria president isil airstrikes islamic li strategy terror military war threat al
Cluster 4: yosemite wildfire park evacuation dome firefighters blaze hikers cobb helicopter backcountry trails homes california evacuate

I created the "bag of words" matrix like this:

hasher = TfidfVectorizer(max_df=0.5,
                             min_df=2, stop_words='english',
                             use_idf=1)
vectorizer = make_pipeline(hasher, TfidfTransformer())
# document_text_list is a list of all text in a given article
X_train_tfidf = vectorizer.fit_transform(document_text_list)

Then I ran KMeans like this:

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                verbose=0, n_clusters=25)
km.fit(X_train_tfidf)

I print out the clusters like this:

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = hasher.get_feature_names()
for i in range(25):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :15]:
        print(' %s' % terms[ind], end='')
    print()

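However, I would like to know how to determine which documents belong to the same cluster and, ideally, the distance of each one from its centroid (cluster center).

I know that each row of the resulting matrix (X_train_tfidf) corresponds to one document, but there is no obvious way to get this information after running the KMeans algorithm. How can I do this with scikit-learn?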

X_train_tfidf looks like:

X_train_tfidf:   (0, 4661)  0.0405014425985
  (0, 19271)    0.0914545222775
  (0, 20393)    0.287636818634
  (0, 56027)    0.116893929188
  (0, 30872)    0.137815327338
  (0, 35256)    0.0343461345507
  (0, 31291)    0.209804679792
  (0, 66008)    0.0643776635222
  (0, 3806) 0.0967713285061
  (0, 66338)    0.0532881852791
  (0, 65023)    0.0702918299573
  (0, 41785)    0.197672720592
  (0, 29774)    0.120772893833
  (0, 61409)    0.0268609667042
  (0, 55527)    0.134102682463
  (0, 40011)    0.0582437010271
  (0, 19667)    0.0234843097048
  (0, 51667)    0.128270976476
  (0, 52791)    0.57198926651
  (0, 15014)    0.149195054799
  (0, 18805)    0.0277497826525
  (0, 35939)    0.170775938672
  (0, 5808) 0.0473913910636
  (0, 24922)    0.0126531527875
  (0, 10346)    0.0200098997901
  : :
  (23945, 56927)    0.0595132327966
  (23945, 23259)    0.0100977769025
  (23945, 12515)    0.0482102583442
  (23945, 49709)    0.210139450446
  (23945, 28742)    0.0190221880312
  (23945, 16628)    0.137692798005
  (23945, 53424)    0.157029848335
  (23945, 30647)    0.104485375827
  (23945, 57512)    0.0569754813269
  (23945, 39389)    0.0158180459761
  (23945, 26093)    0.0153713768922
  (23945, 9787) 0.0963777149738
  (23945, 23260)    0.158336452835
  (23945, 50595)    0.0527243936945
  (23945, 42447)    0.0527515904547
  (23945, 2829) 0.0351677269698
  (23945, 2832) 0.0175929392039
  (23945, 52079)    0.0849796887889
  (23945, 13523)    0.0878730969786
  (23945, 57849)    0.133869666381
  (23945, 25064)    0.128424780903
  (23945, 31129)    0.0919760384953
  (23945, 65601)    0.0388718258746
  (23945, 1428) 0.391477289626
  (23945, 2152) 0.655211469073
  X_train_tfidf shape: (23946, 67816)

In response to ttttthomasssss's answer:

When I try to run the following:

X_cluster_0 = X_train_tfidf[cluster_0]

I get this error:

File "cluster.py", line 52, in main
    X_cluster_0 = X_train_tfidf[cluster_0]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/csr.py", line 226, in __getitem__
    col = key[1]
IndexError: tuple index out of range

Looking at the structure of cluster_0:

(array([  858,  2012,  2256,  2762,  2920,  3770,  6052,  6174,  8296,
9494,  9966, 10085, 11914, 12117, 12633, 12727, 12993, 13527,
13754, 14186, 14669, 14713, 14973, 15071, 15157, 15208, 15926,
16300, 16301, 17138, 17556, 17775, 18236, 19057, 20106, 21014, 21080]),)

It is a tuple whose contents sit at position 0, so I changed the line to the following:

X_cluster_0 = X_train_tfidf[cluster_0[0]]

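I am pulling the "documents" from a database from which I can easily get the index (iterating over the provided array until I find the corresponding document, assuming of course that scikit-learn does not change the order of the documents in the matrix). So I don't understand exactly what X_cluster_0 represents. X_cluster_0 has the following structure: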

  X_cluster_0:   (0, 42726) 0.741747456202
  (0, 13535)    0.115880661286
  (0, 17447)    0.117608794277
  (0, 44849)    0.414829246262
  (0, 14574)    0.10214258736
  (0, 17317)    0.0634383214735
  (0, 17935)    0.0591234431875
  : :
  (17, 33867)   0.0174155914371
  (17, 48916)   0.0227046046275
  (17, 59132)   0.0168864861723
  (17, 40860)   0.0485813219503
  (17, 63725)   0.0271415763987
  (18, 45019)   0.490135684209
  (18, 36168)   0.14595160766
  (18, 52304)   0.139590524213
  (18, 63586)   0.16501953796
  (18, 28709)   0.15075416279
  (18, 11495)   0.0926490431993
  (18, 40860)   0.124236878928

Calculating the distance from the centroid

Currently, running the suggested code (distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])) results in the following error:

File "cluster.py", line 68, in main
    distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/spatial/distance.py", line 211, in euclidean
    dist = norm(u - v)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/sparse/compressed.py", line 197, in __sub__
    raise NotImplementedError('adding a nonzero scalar to a '
NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported

Here is what km.cluster_centers_ looks like:

km.cluster_centers: [  9.47080802e-05   2.53907413e-03   0.00000000e+00 ...,   0.00000000e+00
   0.00000000e+00   0.00000000e+00]

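I guess the problem I have now is how to extract the i-th item of the matrix (assuming the matrix is traversed from left to right). Whatever level of index nesting I specify makes no difference: X_cluster_0[0], X_cluster_0[0][0], and X_cluster_0[0][0][0] all give the same printed matrix structure shown above.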

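1 Answer:

Answer 0: (score: 13)

You can use the fit_predict() function to perform the clustering and obtain the cluster index of each document.

Obtaining the cluster index of each document

You could try the following: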

import numpy as np
import sklearn.cluster

km = sklearn.cluster.KMeans(init='k-means++', max_iter=10000, n_init=1,
                verbose=0, n_clusters=25)
clusters = km.fit_predict(X_train_tfidf)

# The input has shape (m, n); clusters has shape (m,) and holds the
# cluster index of every document
print(X_train_tfidf.shape)
print(clusters.shape)

# Example: row indices of all documents in cluster 0.
# np.where returns a tuple of arrays, so take its first element.
cluster_0 = np.where(clusters == 0)[0]

# cluster_0 now holds the indices of the documents in this cluster;
# index the matrix with it to get their rows:
X_cluster_0 = X_train_tfidf[cluster_0]
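
To list the members of every cluster rather than just cluster 0, the same idea can be applied to each label in turn. The following is a minimal sketch, not part of the original answer: `group_by_cluster` is a hypothetical helper name, and the toy `labels` array stands in for the output of km.fit_predict(X_train_tfidf).

```python
import numpy as np


def group_by_cluster(labels):
    """Map each cluster id to the array of row indices assigned to it."""
    labels = np.asarray(labels)
    return {int(c): np.flatnonzero(labels == c) for c in np.unique(labels)}


# Toy labels standing in for the array returned by km.fit_predict(...)
labels = np.array([0, 2, 1, 0, 2, 2])
groups = group_by_cluster(labels)

for cluster_id, rows in groups.items():
    print("Cluster %d: rows %s" % (cluster_id, rows.tolist()))
```

Because the rows of the matrix keep the order of the input list, each index in `groups[c]` is also the position of that document in the original corpus.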

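Finding the distance from each document to each centroid

You can get the centroids with centroids = km.cluster_centers_, which in your case should have dimensions 25 (number of clusters) x n (number of features). To calculate the Euclidean distance from a document to a centroid, you can use SciPy (see the scipy.spatial.distance documentation for the available distance metrics):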

# Example: distance from 1 document to 1 cluster centroid
# (euclidean() needs dense 1-D inputs; for a sparse X_cluster_0 see the update below)
from scipy.spatial.distance import euclidean

distance = euclidean(X_cluster_0[0], km.cluster_centers_[0])
print(distance)

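Update: Distances for sparse and dense matrices

The distance metrics in scipy.spatial.distance require dense input, so if X_cluster_0 is a sparse matrix, you can convert the row you need to a dense array: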

# .toarray() densifies only the selected row (older code often wrote .A for this)
d = euclidean(X_cluster_0[0].toarray().ravel(), km.cluster_centers_[0])
print(d)

Alternatively, you can use scikit-learn's euclidean_distances() function, which also works with sparse matrices:

from sklearn.metrics.pairwise import euclidean_distances

# Equivalent to the SciPy example above, except that euclidean_distances
# takes 2-D inputs and returns a matrix (here 1 x 1), not a scalar
D = euclidean_distances(X_cluster_0.getrow(0), km.cluster_centers_[:1])
print(D)

Note that with the scikit-learn approach you can also compute the whole distance matrix at once:

D = euclidean_distances(X_cluster_0, km.cluster_centers_)
print(D)
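
If only the distance from each document to its own centroid is needed, one option is KMeans.transform(), which returns the distance from every row to every cluster center and accepts sparse input. The sketch below is an illustration under assumptions, not the original answer's code: the four-sentence `docs` list is a made-up stand-in for document_text_list, and two clusters are used only to keep the toy example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny hypothetical corpus standing in for document_text_list
docs = [
    "whale orca coast storm",
    "orca whale pacific coast",
    "ebola virus vaccine outbreak",
    "vaccine ebola outbreak health",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse, shape (n_docs, n_terms)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = km.predict(X)    # index of the nearest centroid for each row
D = km.transform(X)       # shape (n_docs, n_clusters): distance to every centroid
own = D[np.arange(X.shape[0]), labels]   # distance to the assigned centroid

for i, (c, d) in enumerate(zip(labels, own)):
    print("doc %d -> cluster %d, distance %.3f" % (i, c, d))
```

Sorting the rows of one cluster by `own` then gives its documents ordered from most to least central.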

Update: Structure and type of X_cluster_0

X_cluster_0 and X_train_tfidf are both sparse matrices (see the documentation for scipy.sparse.csr.csr_matrix).

An entry in the dump such as

(0, 13535)    0.115880661286
(0, 17447)    0.117608794277
(0, 44849)    0.414829246262
(0, 14574)    0.10214258736
.             .
.             .

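is read as follows: (0, 13535) refers to document 0 and feature 13535, i.e. row number 0 and column number 13535 in your bag-of-words matrix. The floating-point number that follows, 0.115880661286, is the tf-idf score of that feature in the given document.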

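To find out the exact term, you can try hasher.get_feature_names()[13535] (check len(hasher.get_feature_names()) first to see how many features you have). In scikit-learn 1.0 and later the method is called get_feature_names_out().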
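
The same lookup can be done for a whole row at once by reading the stored column indices and values of a CSR row. This is a minimal sketch with made-up data: `top_terms` is a hypothetical helper, and the four-word `terms` list stands in for the vectorizer's feature names.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_terms(row, terms, k=3):
    """Return the k highest-weighted (term, score) pairs of one sparse row."""
    row = csr_matrix(row)                       # ensure CSR so .indices/.data exist
    order = np.argsort(row.data)[::-1][:k]      # positions of the largest stored values
    return [(terms[row.indices[i]], float(row.data[i])) for i in order]


# Hypothetical feature names and a 2-document tf-idf-like matrix
terms = ["coast", "ebola", "storm", "whale"]
X = csr_matrix(np.array([[0.0, 0.1, 0.7, 0.4],
                         [0.9, 0.0, 0.0, 0.2]]))

print(top_terms(X[0], terms, k=2))
```

Only the non-zero entries of the row are touched, so this stays cheap even with tens of thousands of features.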

If your corpus variable document_text_list is a list of lists, then the corresponding document would simply be document_text_list[0].
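
One caveat applies to a sub-matrix such as X_cluster_0: its rows are renumbered from 0, so row i there corresponds to the i-th entry of the index array used to select it, not to document i of the corpus. The sketch below illustrates the mapping with made-up data; `documents`, `X`, and `labels` are hypothetical stand-ins for the corpus, the tf-idf matrix, and the cluster labels.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical corpus, feature matrix, and cluster labels
documents = ["doc a", "doc b", "doc c", "doc d"]
X = csr_matrix(np.array([[1.0, 0.0],
                         [0.0, 2.0],
                         [3.0, 0.0],
                         [0.0, 4.0]]))
labels = np.array([0, 1, 0, 1])

idx = np.flatnonzero(labels == 0)   # original row numbers of cluster 0
X_sub = X[idx]                      # rows renumbered 0..len(idx)-1

# Row i of X_sub is row idx[i] of X, i.e. documents[idx[i]]
for sub_row, original_row in enumerate(idx):
    print("X_sub row %d -> original row %d (%s)"
          % (sub_row, original_row, documents[original_row]))

docs_in_cluster = [documents[i] for i in idx]
```

Keeping the index array alongside the sub-matrix is what allows results computed on the sub-matrix to be traced back to database records.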