I have run K-means clustering on text data:
# K-means clustering
from sklearn.cluster import KMeans
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
%time km.fit(features)
clusters = km.labels_.tolist()
where the features are tf-idf vectors:
#preprocessing text - converting to a tf-idf vector form
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=0.01, max_df=0.75, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.keywrds).toarray()
labels = df.CD
Then I added the cluster labels to the original dataset:
df['clusters'] = clusters
and indexed the data frame by cluster:
# use the 'clusters' column as the row index (passing index=[clusters] to pd.DataFrame reindexes instead)
df.set_index('clusters')
How can I get the top words for each cluster?