How to find which texts are closest to the centers of k-means clusters

Time: 2019-07-18 15:20:40

Tags: python-3.x scikit-learn k-means

I have a list of texts that I have clustered with k-means on their TF-IDF vectors. How can I access the texts that are nearest to each k-means cluster center?

Code so far (the expected output is the texts nearest to each cluster center):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['this is text one', 'this is text two', 'this is text three',
        'thats are next', 'that are four', 'that are three',
        'lionel messi is footbal player', 'kobe bryant is basket ball player',
        'rossi is motogp racer']

# Build the TF-IDF matrix, one row per text
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(text)
cluster_text = Tfidf_vect.transform(text)

# Cluster the TF-IDF rows into 3 groups
kmeans = KMeans(n_clusters=3, random_state=0, max_iter=600, n_init=10)
kmeans.fit(cluster_text)
labels = kmeans.labels_
center = kmeans.cluster_centers_

Thanks for your help.

1 Answer:

Answer 0: (score: 1)

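You can compute the cosine distance (1 minus the cosine similarity) between each text's TF-IDF representation and the cluster centers, then pick the texts with the smallest distance to each center. Give it a try!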

import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Cosine distance from every text (rows) to every cluster center (columns)
distances = pairwise_distances(cluster_text, kmeans.cluster_centers_,
                               metric='cosine')

# argsort of argsort turns each column into ranks: 0 = closest text to that center
ranking = np.argsort(np.argsort(distances, axis=0), axis=0)

df = pd.DataFrame({'text': text})
for i in range(kmeans.n_clusters):
    df['cluster_{}'.format(i)] = ranking[:, i]

top_n = 2

for i in range(kmeans.n_clusters):
    print('top_{} closest text to the cluster {} :'.format(top_n, i))
    print(df.nsmallest(top_n, 'cluster_{}'.format(i))[['text']].values)

Output as posted with the answer:

top_2 closest text to the cluster 0 :
[['that are four']
 ['that are three']]
top_2 closest text to the cluster 1 :
[['thats are next']
 ['that are four']]
top_2 closest text to the cluster 2 :
[['this is text three']
 ['this is text two']]
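
The ranking above considers every text for every center, so a text can be listed under a cluster it was not assigned to. A minimal sketch of a variant that only ranks the members of each cluster is shown here. It assumes Euclidean distances from `KMeans.transform` are acceptable: `TfidfVectorizer` L2-normalizes rows by default, so for a fixed center the Euclidean and cosine orderings should agree. The helper name `closest_texts` and its parameters are hypothetical, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def closest_texts(texts, n_clusters=3, top_n=2, random_state=0):
    """Hypothetical helper: map each cluster id to its closest member texts."""
    X = TfidfVectorizer().fit_transform(texts)
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state,
                    n_init=10).fit(X)

    # transform() returns the Euclidean distance from every text to every
    # center, as an array of shape (n_texts, n_clusters).
    distances = kmeans.transform(X)

    result = {}
    for c in range(n_clusters):
        # Row indices of the texts assigned to cluster c
        members = np.flatnonzero(kmeans.labels_ == c)
        # Order those members from closest to farthest from center c
        order = members[np.argsort(distances[members, c])]
        result[c] = [texts[j] for j in order[:top_n]]
    return result


text = ['this is text one', 'this is text two', 'this is text three',
        'thats are next', 'that are four', 'that are three',
        'lionel messi is footbal player', 'kobe bryant is basket ball player',
        'rossi is motogp racer']

for cluster, closest in closest_texts(text).items():
    print(cluster, closest)
```

Restricting to `kmeans.labels_ == c` keeps each text under exactly one cluster; drop that filter if texts from neighbouring clusters should also be listed.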