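I have a simple K-means program that extracts 2 clusters and then tries to predict the cluster of new sentences. I would like to find the best "fit" for each cluster.
In my example predictions, predict_c0 clearly fits cluster 0, whereas predict_bad_fit covers both cluster 0 and cluster 1.
I think I need to compute, for each predicted sentence, the average difference from the cluster centroid. How do I do that?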
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
c0_sents = ['cats and dogs','i like cats','cats not like dogs','cats and dogs animals',]
c1_sents = ['computer is for typing','i play games on my computer','programs run on computer','computer has screen']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(c0_sents+c1_sents)
k_means = KMeans(n_clusters=2)
k_means.fit(X)
order_centroids = k_means.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
for i in range(2):
    for ind in order_centroids[i, :4]:
        print(i, terms[ind])
#0 computer
#0 on
#0 screen
#0 has
#1 cats
#1 dogs
#1 like
#1 and
predict_c0 = ['cats are not dogs']
predict_c1 = ['typing on computers']
predict_bad_fit = ['cats on computers dogs on screen']
for sent in predict_c0 + predict_c1 + predict_bad_fit:
    X = vectorizer.transform([sent])
    predicted = k_means.predict(X)
    print(sent, predicted)
#cats are not dogs [0]
#typing on computers [1]
#cats on computers dogs on screen [1]
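One possible way to get at the "distance from the centroid" idea is sketched below. It is not necessarily the best measure of fit: it uses `KMeans.transform`, which returns the Euclidean distance from each sample to every cluster centre, rather than an "average difference". The `margin` score (distance to the second-nearest centroid minus distance to the nearest) is a hypothetical heuristic introduced here for illustration, and `random_state=0` / `n_init=10` are assumptions added only so that repeated runs agree. Cluster numbering is arbitrary and may differ from the output shown above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

c0_sents = ['cats and dogs', 'i like cats', 'cats not like dogs', 'cats and dogs animals']
c1_sents = ['computer is for typing', 'i play games on my computer',
            'programs run on computer', 'computer has screen']

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(c0_sents + c1_sents)

# Assumption: fixed seed and n_init so the clustering is reproducible.
k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

new_sents = ['cats are not dogs',
             'typing on computers',
             'cats on computers dogs on screen']
X_new = vectorizer.transform(new_sents)

# transform() gives an (n_samples, n_clusters) array of Euclidean
# distances from each sentence to each cluster centre.
distances = k_means.transform(X_new)
labels = k_means.predict(X_new)

# Hypothetical fit score: gap between the two closest centroids.
# A small margin suggests the sentence sits between the clusters.
sorted_d = np.sort(distances, axis=1)
margin = sorted_d[:, 1] - sorted_d[:, 0]

for sent, label, d, m in zip(new_sents, labels, distances, margin):
    print(sent, label, np.round(d, 3), round(float(m), 3))
```

A smaller distance to the assigned centroid means a closer fit, and a margin near zero flags a sentence that is about equally far from both clusters. Words the vectorizer never saw during fitting (for example "computers" versus "computer") are ignored by `transform`, which also affects these distances.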