如何绘制文本簇?

时间:2019-08-23 12:23:30

标签: python matplotlib scikit-learn

我已经开始学习使用Python和sklearn库的集群。我写了一个简单的代码来聚类文本数据。 我的目标是找到相似句子的组/类。 我试图绘制它们,但失败了。

问题是文本数据,我总是会收到此错误:

ValueError: setting an array element with a sequence.

相同的方法适用于数字数据,但不适用于文本数据。 有没有办法绘制相似句子的组/群? 另外,是否有办法查看这些组是什么,这些组代表什么,如何识别它们? 我打印了labels = kmeans.predict(x),但这些只是数字列表,它们代表什么?

import pandas as pd
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt


x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
    'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
     'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
     'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')  
x = cv.fit_transform(x)
#x_test = cv.transform(x_test)


my_list = []

for i in range(1,11):

    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)
    labels = kmeans.predict(x) #this prints the array of numbers
    print(labels)

plt.plot(range(1,11),my_list)
plt.show()



kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(x)

plt.scatter(x[y_kmeans == 0,0], x[y_kmeans==0,1], s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0], x[y_kmeans==1,1], s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0], x[y_kmeans==2,1], s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0], x[y_kmeans==3,1], s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0], x[y_kmeans==4,1], s = 15, c= 'magenta', label = 'Cluster_5')

plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s = 100, c = 'black', label = 'Centroids')
plt.show()

2 个答案:

答案 0 :(得分:7)

这个问题有几个动人之处:

  1. 如何将文本向量化为kmeans聚类可以理解的数据
  2. 如何在二维空间中绘制聚类
  3. 如何通过源句子标记剧情

我的解决方案遵循一种非常普遍的方法,即使用kmeans标签作为散点图的颜色。 (拟合后的kmeans值分别为0、1、2、3和4,指示每个句子被分配到哪个任意组。输出的顺序与原始样本的顺序相同。)关于如何将这些点分为两个在三维空间中,我使用主成分分析(PCA)。请注意,我对完整数据而不是降维后的输出执行kmeans聚类。然后,我使用matplotlib的ax.annotate()用原始句子装饰我的情节。 (我也使图形更大,因此点之间有间距。)我可以根据要求进一步对此进行评论。

import pandas as pd
import re
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

x = ['this is very good show' , 'i had a great time on my school trip', 'such a boring movie', 'Springbreak was amazing', 'You are wrong', 'This food is so tasty', 'I had so much fun last night', 'This is crap', 'I had a bad time last month',
    'i love this product' , 'this is an amazing item', 'this food is delicious', 'I had a great time last night', 'thats right',
     'this is my favourite restaurant' , 'i love this food, its so good', 'skiing is the best sport', 'what is this', 'this product has a lot of bugs',
     'I love basketball, its very dynamic' , 'its a shame that you missed the trip', 'game last night was amazing', 'Party last night was so boring',
     'such a nice song' , 'this is the best movie ever', 'hawaii is the best place for trip','how that happened','This is my favourite band',
     'I cant believe that you did that', 'Why are you doing that, I do not gete it', 'this is tasty', 'this song is amazing']

cv = CountVectorizer(analyzer = 'word', max_features = 5000, lowercase=True, preprocessor=None, tokenizer=None, stop_words = 'english')  
vectors = cv.fit_transform(x)
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 0)
kmean_indices = kmeans.fit_predict(vectors)

pca = PCA(n_components=2)
scatter_plot_points = pca.fit_transform(vectors.toarray())

colors = ["r", "b", "c", "y", "m" ]

x_axis = [o[0] for o in scatter_plot_points]
y_axis = [o[1] for o in scatter_plot_points]
fig, ax = plt.subplots(figsize=(20,10))

ax.scatter(x_axis, y_axis, c=[colors[d] for d in kmean_indices])

for i, txt in enumerate(x):
    ax.annotate(txt, (x_axis[i], y_axis[i]))

enter image description here

答案 1 :(得分:2)

根据matplotlib.pyplot.scatter的{​​{3}},采用输入数组,但 对于您的情况x[y_kmeans == a,b],您将输入稀疏矩阵,因此需要使用.toarray()方法将其转换为numpy数组。我已经在下面修改了您的代码:

修改

plt.scatter(x[y_kmeans == 0,0].toarray(), x[y_kmeans==0,1].toarray(), s = 15, c= 'red', label = 'Cluster_1')
plt.scatter(x[y_kmeans == 1,0].toarray(), x[y_kmeans==1,1].toarray(), s = 15, c= 'blue', label = 'Cluster_2')
plt.scatter(x[y_kmeans == 2,0].toarray(), x[y_kmeans==2,1].toarray(), s = 15, c= 'green', label = 'Cluster_3')
plt.scatter(x[y_kmeans == 3,0].toarray(), x[y_kmeans==3,1].toarray(), s = 15, c= 'cyan', label = 'Cluster_4')
plt.scatter(x[y_kmeans == 4,0].toarray(), x[y_kmeans==4,1].toarray(), s = 15, c= 'magenta', label = 'Cluster_5')

输出

documentation

希望这会有所帮助!