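I am building a text classification model with nltk and sklearn, training it on sklearn's 20newsgroups dataset (each document is roughly 130 words). My preprocessing consists of removing stopwords and lemmatizing the tokens. Next, in my pipeline, I pass the result to TfidfVectorizer(), and I would like to tune some of the vectorizer's input parameters to improve accuracy. I have read that n-grams (generally with small n) improve accuracy, but when I classify the vectorizer output with a MultinomialNB() classifier, using ngram_range=(1,2) or ngram_range=(1,3) in the TF-IDF step makes accuracy worse. Can someone help explain why?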
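To make the effect of ngram_range concrete, here is a minimal sketch that counts how many features TfidfVectorizer produces for each setting. The three-sentence corpus is a toy assumption for illustration, not the 20newsgroups data, so the absolute numbers say nothing about my real feature space; it only shows that the vocabulary grows as higher-order n-grams are added.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration; not the 20newsgroups data).
docs = [
    "the pens beat the devils",
    "the devils lost to the pens",
    "jagr is fun to watch",
]

# Count how many features each ngram_range produces on the same documents.
feature_counts = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    matrix = vectorizer.fit_transform(docs)
    feature_counts[ngram_range] = matrix.shape[1]
    print(ngram_range, matrix.shape[1])
```

The number of training documents stays the same while the number of columns grows, so each added bigram or trigram feature is estimated from fewer occurrences.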
Edit: Here is the sample data that was requested, along with the code I use to fetch it and strip the headers:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all', remove=('headers',))  # remove expects a tuple
#example of data text (no header)
print(news.data[0])
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final regular season game. PENS RULE!!!
Here is my pipeline, together with the code that trains the model and prints the accuracy:
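from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# clean() and train() are my own helpers, defined elsewhere (not shown here)
test1_pipeline = Pipeline([('clean', clean()),
                           ('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
                           ('classifier', MultinomialNB())])
train(test1_pipeline, news_group_train.data, news_group_train.target)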
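For anyone who wants to reproduce the comparison without my helpers, here is a minimal, self-contained sketch. It leaves out my clean() step, and compare_ngram_ranges is a hypothetical name I made up for this example, not my actual train() function. The texts and labels are toy stand-ins, so the printed accuracies are not my real results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def compare_ngram_ranges(texts, labels, ngram_ranges, seed=0):
    """Hypothetical helper: held-out accuracy for each ngram_range."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed)
    scores = {}
    for ngram_range in ngram_ranges:
        # Same vectorizer and classifier as my pipeline, minus the clean() step.
        pipeline = Pipeline([
            ('vectorizer', TfidfVectorizer(ngram_range=ngram_range)),
            ('classifier', MultinomialNB()),
        ])
        pipeline.fit(X_train, y_train)
        scores[ngram_range] = accuracy_score(y_test, pipeline.predict(X_test))
    return scores


# Toy stand-in data (an assumption; swap in news.data and news.target).
texts = [
    "the pens beat the devils in the playoffs",
    "jagr scored twice in the hockey game",
    "the islanders lost the final season game",
    "bowman coached the team to a hockey win",
    "the new graphics card renders images fast",
    "install the driver before running the program",
    "the software update fixed the display bug",
    "my computer monitor shows a blank screen",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

scores = compare_ngram_ranges(texts, labels, [(1, 1), (1, 2), (1, 3)])
for ngram_range, accuracy in scores.items():
    print(ngram_range, accuracy)
```

Every setting is scored on the same train/test split, so any difference between the numbers comes from ngram_range alone.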