I have a list of texts on which I have already run tf-idf and k-means clustering. How can I access the texts closest to each k-means cluster center?
Expected output: the texts closest to each cluster center. My code so far:
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['this is text one', 'this is text two', 'this is text three',
        'thats are next', 'that are four', 'that are three',
        'lionel messi is footbal player', 'kobe bryant is basket ball player',
        'rossi is motogp racer']

# Build tf-idf vectors for the texts
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(text)
cluster_text = Tfidf_vect.transform(text)

# Cluster the vectors into three groups
kmeans = KMeans(n_clusters=3, random_state=0, max_iter=600, n_init=10)
kmeans.fit(cluster_text)
labels = kmeans.labels_
center = kmeans.cluster_centers_
Thanks for your help.
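Answer 0 (score: 1)
You can compare each text's tf-idf representation with the cluster centers using cosine similarity (the code below uses the equivalent cosine distance, 1 - similarity). Try it: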
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Cosine distance from every text (rows) to every cluster center (columns)
distances = pairwise_distances(cluster_text, kmeans.cluster_centers_,
                               metric='cosine')
# Argsort each column of the distance matrix
ranking = np.argsort(distances, axis=0)

df = pd.DataFrame({'text': text})
for i in range(kmeans.n_clusters):
    df['cluster_{}'.format(i)] = ranking[:, i]

top_n = 2
for i in range(kmeans.n_clusters):
    print('top_{} closest text to the cluster {} :'.format(top_n, i))
    print(df.nsmallest(top_n, 'cluster_{}'.format(i))[['text']].values)
Output:
top_2 closest text to the cluster 0 :
[['that are four']
 ['that are three']]
top_2 closest text to the cluster 1 :
[['thats are next']
 ['that are four']]
top_2 closest text to the cluster 2 :
[['this is text three']
 ['this is text two']]
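One thing worth checking in the snippet above: `np.argsort(distances, axis=0)` returns, for each cluster column, text indices ordered from nearest to farthest, so calling `nsmallest` on that column sorts by index value rather than by distance. A minimal sketch, assuming the goal is the `top_n` texts with the smallest cosine distance to each center, reads the first `top_n` entries of each column instead. The helper name `closest_texts` is hypothetical and not part of the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances


def closest_texts(texts, vectors, centers, top_n=2):
    """Hypothetical helper: map each cluster index to its top_n nearest texts."""
    # distances[j, i] is the cosine distance from text j to center i.
    distances = pairwise_distances(vectors, centers, metric='cosine')
    # Sorting each column yields text indices ordered nearest-first.
    order = np.argsort(distances, axis=0)
    return {i: [texts[j] for j in order[:top_n, i]]
            for i in range(centers.shape[0])}


text = ['this is text one', 'this is text two', 'this is text three',
        'thats are next', 'that are four', 'that are three',
        'lionel messi is footbal player', 'kobe bryant is basket ball player',
        'rossi is motogp racer']

# Same pipeline as in the question: tf-idf vectors, then k-means.
vectors = TfidfVectorizer(max_features=5000).fit_transform(text)
kmeans = KMeans(n_clusters=3, random_state=0, max_iter=600, n_init=10)
kmeans.fit(vectors)

result = closest_texts(text, vectors, kmeans.cluster_centers_, top_n=2)
for i, nearest in result.items():
    print('top_2 closest text to the cluster {} :'.format(i), nearest)
```

Because `TfidfVectorizer` L2-normalizes each row by default, ordering texts by cosine distance to a given center should match ordering them by the Euclidean distance that `KMeans` itself minimizes. The helper ranks all texts against each center; to restrict the result to a cluster's own members, filter on `kmeans.labels_` first.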