Question

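I want to use scikit-learn to classify two texts, but I want to extract the features myself. Passing stop_words='english' to CountVectorizer filters out a built-in list of English words. How can I set my own word list for CountVectorizer to count?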

Answer 1

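You can pass your own list of stop words through the stop_words parameter of CountVectorizer; scikit-learn will then leave those words out when it counts tokens in the input text. For example, if I don't want "cat", "dog", and "elephant" to be used as tokens, I would instantiate CountVectorizer as follows: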

from sklearn.feature_extraction.text import CountVectorizer

CountVectorizer(stop_words=['cat', 'dog', 'elephant'])

Hope that helps.
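To see what the custom stop word list actually does, here is a minimal sketch run on two made-up sentences (the sample text and variable names are assumptions for illustration, not from the question). It also shows the vocabulary parameter of CountVectorizer, which the answer above does not mention but which may be closer to what was asked if the goal is to count only a chosen set of words rather than to exclude some.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents, used only for illustration.
docs = [
    "the cat chased the dog",
    "the elephant ignored the cat",
]

# Option 1: custom stop words. These tokens are dropped before counting,
# so only the remaining words become features.
stop_vectorizer = CountVectorizer(stop_words=["cat", "dog", "elephant"])
stop_counts = stop_vectorizer.fit_transform(docs).toarray()
stop_features = sorted(stop_vectorizer.vocabulary_)  # learned feature names

# Option 2: fixed vocabulary. Only these words are counted, and the
# columns of the result follow the order of the list given here.
vocab_vectorizer = CountVectorizer(vocabulary=["cat", "dog", "elephant"])
vocab_counts = vocab_vectorizer.fit_transform(docs).toarray()

print(stop_features)
print(stop_counts)
print(vocab_counts)
```

With stop_words the listed words are excluded and everything else is counted; with vocabulary the listed words are the only ones counted. Which one fits depends on whether your word list names the features you want or the ones you want to ignore.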

How to classify text using scikit-learn
