Question

I'm trying to run LDA (Latent Dirichlet Allocation) on a non-English text dataset.

In the sklearn tutorial, this is the part where you compute the term frequencies of the words to feed into LDA:

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                            max_features=n_features,
                            stop_words='english')

This uses the built-in stop word list, which I believe is available only for English. How can I use my own stop word list instead?

Answer 1

You can assign a frozenset of your own words to the stop_words argument, e.g.:

stop_words = frozenset(["word1", "word2", "word3"])
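As a minimal sketch of how a custom stop word collection plugs into the vectorizer from the question: the German words and the three toy documents below are hypothetical and only there for illustration. The sketch passes a plain list, because the scikit-learn documentation describes stop_words as accepting 'english', a list, or None, and recent versions may reject other container types.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stop words and documents, for illustration only.
custom_stop_words = ["der", "die", "und"]
docs = [
    "der Hund und die Katze",
    "die Katze und der Vogel",
    "der Hund und der Vogel",
]

# Same settings as in the question, with the custom list
# replacing stop_words='english'.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                stop_words=custom_stop_words)
tf = tf_vectorizer.fit_transform(docs)

# The learned vocabulary should no longer contain the stop words.
vocabulary = sorted(tf_vectorizer.vocabulary_)
print(vocabulary)
```

Since CountVectorizer lowercases text by default, the custom stop words should be given in lowercase so that they match the tokens.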
