Question

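I am trying to build my feature vectors from my CSV file, which contains about 1000 comments. One of my feature vectors is tf-idf, computed with scikit-learn's TfidfVectorizer. Does it make sense to also use counts as a feature vector, or is there a better feature vector I should be using?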

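If I end up using both CountVectorizer and TfidfVectorizer as my features, how should I fit both of them into my KMeans model (specifically the km.fit() part)? At the moment I can only fit the tf-idf feature vectors into the model.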

Here is my code:

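from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# sentence_list holds the comments read from the CSV; num_clusters is the chosen k
vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)

#count_vectorizer = CountVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
#count_vectorized = count_vectorizer.fit_transform(sentence_list)

km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)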

Answer 1

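Basically, what you are doing is finding a numeric representation of your text documents (feature engineering). In some problems counts work better, and in others the tf-idf representation is the best choice. You should really try both. Although the two representations are very similar and therefore carry roughly the same information, you may get better results by using the full feature set (tf-idf + counts). By searching in this larger feature space, you may get closer to the true model.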
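
One way to "try both" could look like the sketch below. It is only an illustration: the toy `sentence_list`, the choice of two clusters, and the use of the silhouette score as a comparison criterion are all assumptions that are not part of the question, and scores computed in different feature spaces are only a rough guide.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus standing in for the comments from the CSV.
sentence_list = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "my kitten chased the cat",
    "stock prices fell sharply today",
    "the market and stock prices rallied",
    "investors watched the stock market",
]

vectorizers = {
    "tfidf": TfidfVectorizer(stop_words="english"),
    "count": CountVectorizer(stop_words="english"),
}

scores = {}
for name, vec in vectorizers.items():
    X = vec.fit_transform(sentence_list)
    # Assumed k=2 for the toy data; random_state fixes the initialization.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Silhouette score is one possible (assumed) way to compare clusterings.
    scores[name] = silhouette_score(X, labels)

for name, score in scores.items():
    print(name, round(float(score), 3))
```

With real data you would swap in your own `sentence_list` and `num_clusters`, and ideally also inspect the clusters by hand rather than rely on a single number.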

Here is how you can stack the features horizontally:

import scipy.sparse

X = scipy.sparse.hstack([vectorized, count_vectorized])

Then you can do:

model.fit(X, y)  # y is optional in some models
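
Put together for the KMeans case from the question, the steps above might look like this minimal sketch. The toy `sentence_list`, `n_clusters=2`, and `random_state=0` are assumptions for illustration only; KMeans is unsupervised, so no `y` is passed.

```python
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus standing in for the comments from the CSV.
sentence_list = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "stock prices fell sharply today",
    "the market and stock prices rallied",
]

vectorizer = TfidfVectorizer(stop_words="english")
count_vectorizer = CountVectorizer(stop_words="english")

vectorized = vectorizer.fit_transform(sentence_list)
count_vectorized = count_vectorizer.fit_transform(sentence_list)

# Stack column-wise: one row per document, columns from both vectorizers.
# format="csr" asks for a CSR result instead of the default COO.
X = hstack([vectorized, count_vectorized], format="csr")

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(X.shape, labels)
```

One thing to keep in mind: TfidfVectorizer L2-normalizes each row by default while raw counts are not normalized, so the count columns may dominate the Euclidean distances KMeans uses. Whether to rescale one of the blocks before stacking is a judgment call for your data.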
