Including feature extraction in an sklearn pipeline

Time: 2017-07-18 17:20:29

Tags: python machine-learning scikit-learn pipeline feature-extraction

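For a text classification project, I created a pipeline for feature selection and the classifier. My question is whether the feature extraction module can also be included in the pipeline, and if so, how. I have looked at some material on this, but it does not seem to fit my current code.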

This is what I have now:

# feature_extraction module.  
from sklearn.preprocessing import LabelEncoder, StandardScaler 
from sklearn.feature_extraction import DictVectorizer  
import numpy as np

vec = DictVectorizer() 
X = vec.fit_transform(instances)
scaler = StandardScaler(with_mean=False) # we use cross validation, no train/test set 
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale

enc = LabelEncoder()
y = enc.fit_transform(labels)

# Feature selection and classification pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.pipeline import Pipeline

feat_sel = SelectKBest(mutual_info_classif, k=200)  
clf = linear_model.LogisticRegression() 
pipe = Pipeline([('mutual_info', feat_sel), ('logistregress', clf)])
y_pred = model_selection.cross_val_predict(pipe, X_scaled, y, cv=10)

How can I put the steps from the DictVectorizer through the LabelEncoder into the pipeline?

1 answer:

Answer 0: (score: 1)

Here is how you could do it. Assuming instances is a list of dict-like objects, which is the input DictVectorizer expects, build the pipeline like this:

pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])

To predict, call cross_val_predict and pass instances as X:

y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
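
For reference, here is a minimal end-to-end sketch of the same idea that can be run on its own. The toy instances and labels are made up for illustration, and k=2 and cv=5 are chosen only to suit the tiny data set (the question uses k=200 and cv=10). Note that the LabelEncoder stays outside the pipeline: pipeline steps transform X only, so the labels are encoded separately, as in the question.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: one dict of feature counts per document.
instances = []
labels = []
for i in range(10):
    instances.append({"good": 1 + i % 2, "great": 1, "film": 1})
    labels.append("pos")
    instances.append({"bad": 1 + i % 2, "awful": 1, "film": 1})
    labels.append("neg")

# The target is encoded outside the pipeline, because pipeline
# steps only transform X, never y.
enc = LabelEncoder()
y = enc.fit_transform(labels)

# Feature extraction, scaling, selection and classification in one object.
# k=2 is an assumption for this toy data (only 5 distinct features exist).
pipe = Pipeline([
    ("vectorizer", DictVectorizer()),
    ("scaler", StandardScaler(with_mean=False)),  # with_mean=False keeps X sparse
    ("mutual_info", SelectKBest(mutual_info_classif, k=2)),
    ("logistregress", LogisticRegression()),
])

# The raw dicts are passed as X; every step is refit on each training fold.
y_pred = cross_val_predict(pipe, instances, y, cv=5)

# Map the integer predictions back to the original label strings.
predicted_labels = enc.inverse_transform(y_pred)
```

Because the vectorizer and scaler are now part of the pipeline, they are fitted on the training portion of each fold only, rather than on the full data set before cross-validation.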