How do I use the imbalanced-learn library with an sklearn pipeline?

Time: 2019-01-24 11:07:52

Tags: python machine-learning scikit-learn sampling

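I am trying to solve a text classification problem, and I want to use MultinomialNB to create a baseline model.

My data is highly imbalanced, with a few minority classes, so I decided to use the imbalanced-learn library together with an sklearn pipeline, following the tutorial.

After adding the two sampling stages to the pipeline as the documentation suggests, the model fails with an error.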

from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()

pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,\
                                         ngram_range= (1, 2),tokenizer=tokenize_and_stem)),\
                         ('tfidf', TfidfTransformer(use_idf= True)),\
                          ('enn', EditedNearestNeighbours()),\
                          ('renn', RepeatedEditedNearestNeighbours()),\
                          ('clf-gnb',  MultinomialNB()),])

Error:

TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',

Can anyone help here? I am also open to other approaches (Boosting / SMOTE).

1 Answer:

Answer 0: (score: 1)

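It seems that imblearn's make_pipeline does not accept named steps the way sklearn's Pipeline does: it takes the estimators as positional arguments and names them automatically. From the imblearn documentation: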

  

*steps: list of estimators.

You should change your code to:

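from imblearn.pipeline import make_pipeline as make_pipeline_imb

# Pass the estimators directly; step names are generated automatically.
# tokenize_and_stem is the asker's own tokenizer, defined elsewhere.
pipe = make_pipeline_imb(CountVectorizer(max_features=100000,
                                         ngram_range=(1, 2),
                                         tokenizer=tokenize_and_stem),
                         TfidfTransformer(use_idf=True),
                         EditedNearestNeighbours(),
                         RepeatedEditedNearestNeighbours(),
                         MultinomialNB())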