Scikit-learn K-means clustering and TfidfVectorizer: how to pass the top n terms with the highest tf-idf score to k-means

Time: 2019-09-09 14:31:35

Tags: python scikit-learn k-means text-mining tfidfvectorizer

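I am clustering text data based on the output of a TF-IDF vectorizer. The code works fine: it takes the entire TfidfVectorizer output as input to K-Means clustering and produces a scatter plot. Instead, I would like to send only the top n terms, ranked by TF-IDF score, as input to k-means. Is there a way to do that?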

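import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# df_doc_wholetext, num_clusters, max_iterations, pca_num_components and
# labels_color_map are defined elsewhere in the script.
vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english')

tfidf_matrix = vect.fit_transform(df_doc_wholetext['csv_text'])


# Create the k-means model with a custom configuration.
# Note: precompute_distances and n_jobs were removed from KMeans in
# scikit-learn 1.0; drop both arguments on newer versions.
clustering_model = KMeans(
    n_clusters=num_clusters,
    max_iter=max_iterations,
    precompute_distances="auto",
    n_jobs=-1
)

labels = clustering_model.fit_predict(tfidf_matrix)

# PCA needs a dense array; toarray() returns an ndarray rather than np.matrix.
x = tfidf_matrix.toarray()

reduced_data = PCA(n_components=pca_num_components).fit_transform(x)


# Plot each document at its first two principal components,
# colored by cluster label (assumes pca_num_components == 2).
fig, ax = plt.subplots()
for index, (pca_comp_1, pca_comp_2) in enumerate(reduced_data):
    color = labels_color_map[labels[index]]
    ax.scatter(pca_comp_1, pca_comp_2, c=color)
plt.show()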

1 answer:

Answer 0 (score: 2)

Use max_features in TfidfVectorizer to consider only the top n features:

vect = TfidfVectorizer(ngram_range=(1,3),stop_words='english', max_features=n)

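According to the scikit-learn documentation, max_features takes an int or None (the default is None). If it is not None, TfidfVectorizer builds a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.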
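To see what max_features does in practice, here is a minimal sketch on a made-up corpus. The `docs` list and the value `n = 5` are illustrative assumptions and do not come from the question. It shows that the resulting matrix has exactly n columns when the corpus contains more than n candidate terms, and that a term's corpus-wide frequency decides whether it survives.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
]

n = 5  # hypothetical number of features to keep

vect = TfidfVectorizer(ngram_range=(1, 3), stop_words="english", max_features=n)
tfidf_matrix = vect.fit_transform(docs)

# One row per document, n columns (the corpus has more than n candidate terms).
print(tfidf_matrix.shape)  # (4, 5)

# The surviving vocabulary. "cat" is the only term that occurs twice in the
# corpus, so it is kept; ties among the remaining terms are broken internally.
print(sorted(vect.vocabulary_))
```

The reduced `tfidf_matrix` can then be passed to `KMeans.fit_predict` exactly as in the question.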

Here is the link.
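Because max_features ranks terms by corpus-wide term frequency, it is not quite the same as picking the n terms with the highest tf-idf scores. If ranking by tf-idf itself is what is wanted, one possible alternative (not part of the answer above) is to fit the full vectorizer, score each column, and keep only the best n columns before clustering. The sketch below assumes that summing each term's tf-idf weight over all documents is an acceptable score; the mean or the maximum would be equally reasonable choices. The corpus, `n`, and `num_clusters` are made-up values for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "cats and dogs are popular pets",
    "my cat chased the neighbour dog",
    "dogs and cats need regular vet visits",
    "stock markets fell as investors sold shares",
    "investors bought shares when markets rallied",
    "markets and shares moved after the earnings report",
]

n = 5             # hypothetical number of terms to keep
num_clusters = 2  # hypothetical number of clusters

# Fit on the full vocabulary first (no max_features).
vect = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
tfidf_matrix = vect.fit_transform(docs)  # sparse, shape (n_docs, n_terms)

# Score each term by its summed tf-idf weight over all documents
# (an assumption of this sketch, see the text above).
term_scores = np.asarray(tfidf_matrix.sum(axis=0)).ravel()

# Column indices of the n highest-scoring terms, best first.
top_idx = np.argsort(term_scores)[::-1][:n]

# Keep only those columns; the result is still a sparse matrix.
tfidf_top = tfidf_matrix[:, top_idx]

# Cluster on the reduced matrix; KMeans accepts sparse input directly.
clustering_model = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
labels = clustering_model.fit_predict(tfidf_top)

print(tfidf_top.shape)
print(labels)
```

Documents that contain none of the selected terms end up as all-zero rows, so with a small n it is worth checking how many rows are empty before trusting the clusters.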