我将一大堆PDF文档转换为文本,然后将它们编译为字典,我知道我有3种不同的文档类型,并且我想使用“聚类”将它们自动分组:
dict_of_docs = {'document_1':'contents of document', 'document_2':'contents of document', 'document_3':'contents of document',...'document_100':'contents of document'}
然后,我将字典的值向量化了:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())
我的X输出是这样的:
(0, 768) 0.05895270500636258
(0, 121) 0.11790541001272516
(0, 1080) 0.05895270500636258
(0, 87) 0.2114378682212116
(0, 1458) 0.1195944498355368
(0, 683) 0.0797296332236912
(0, 1321) 0.12603709835806634
(0, 630) 0.12603709835806634
(0, 49) 0.12603709835806634
(0, 750) 0.12603709835806634
(0, 1749) 0.10626171032944469
(0, 478) 0.12603709835806634
(0, 1632) 0.14983692373373858
(0, 177) 0.12603709835806634
(0, 653) 0.0497440271723707
(0, 1268) 0.13342186854440274
(0, 1489) 0.07052056544031632
(0, 72) 0.12603709835806634
...etc etc
然后,我将它们转换为数组X = X.toarray()
我现在正处于尝试使用我的实际数据分散使用matplotlib绘制群集的阶段。然后,从那里开始,我要使用在聚类中学到的知识对文档进行排序。我所遵循的所有指南都使用了数据数组,但是它们没有显示如何从现实世界的数据转换成可以按照其演示方式使用的数据。
如何将我的矢量化数据数组放入散点图中?
答案 0 :(得分:2)
如何将我的矢量化数据数组放入散点图中?
分几个步骤:聚类,降维,绘制和调试。
我们使用K均值拟合X
(我们的 TF-IDF 矢量化数据集)。
from sklearn.cluster import KMeans
NUMBER_OF_CLUSTERS = 3
km = KMeans(
n_clusters=NUMBER_OF_CLUSTERS,
init='k-means++',
max_iter=500)
km.fit(X)
from sklearn.decomposition import PCA
# First: for every document we get its corresponding cluster
clusters = km.predict(X)
# We train the PCA on the dense version of the tf-idf.
pca = PCA(n_components=2)
two_dim = pca.fit_transform(X.todense())
scatter_x = two_dim[:, 0] # first principle component
scatter_y = two_dim[:, 1] # second principle component
我们用预先指定的颜色绘制每个群集。
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
fig, ax = plt.subplots()
fig.set_size_inches(20,10)
# color map for NUMBER_OF_CLUSTERS we have
cmap = {0: 'green', 1: 'blue', 2: 'red'}
# group by clusters and scatter plot every cluster
# with a colour and a label
for group in np.unique(clusters):
ix = np.where(clusters == group)
ax.scatter(scatter_x[ix], scatter_y[ix], c=cmap[group], label=group)
ax.legend()
plt.xlabel("PCA 0")
plt.ylabel("PCA 1")
plt.show()
在每个群集中打印前10个字。
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(3):
print("Cluster %d:" % i, end='')
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind], end='')
print()
# Cluster 0: com edu medical yeast know cancer does doctor subject lines
# Cluster 1: edu game games team baseball com year don pitcher writes
# Cluster 2: edu car com subject organization lines university writes article