Feature weights in k-means

Date: 2017-07-24 17:45:30

Tags: python scikit-learn nlp k-means

I have a set of Wikipedia texts that I want to cluster.

The code is as follows:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

#parameters
maximum_features = 1000000
max_intera = 300

#load text file
wiki = pd.read_csv('people_wiki.csv')

#TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=maximum_features, norm = 'l2', stop_words='english')
tfidf = vectorizer.fit_transform(wiki['text'])

#clustering
kmeans = KMeans(n_clusters=3, random_state=0, init='k-means++', max_iter = max_intera).fit(tfidf)

I want to know the weight of each feature, in the form shown in this example (she: 0.025, her: 0.017, ...).


In summary: I want the weight of each feature (word), and to get the 5 most relevant ones.

The 'people_wiki.csv' file is here:

https://ufile.io/udg1y

1 answer:

Answer 0 (score: 1)

Try this solution:

# idf_ belongs to the fitted vectorizer, not to the sparse tfidf matrix
print(vectorizer.idf_)