Sklearn: transforming different columns differently in a pipeline - e.g. X['col1'] gets tfidf, X['col2'] gets label encoding?

Time: 2016-07-30 09:48:54

Tags: python scikit-learn pipeline

I am trying to understand how to apply different transformations to different columns. I know I need a Pipeline, but I think I also need a FeatureUnion.

My dataframe:

               text  labels    pred
0  this is a phrase   green  0.0134
1        so is this    blue  0.0231
2       this is too   green  0.0321
3     and i am done  yellow  0.0123

My sample code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import TransformerMixin

df = pd.DataFrame({'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
           'labels': ['green', 'blue', 'green', 'yellow'],
           'pred': [0.0134, 0.0231, 0.0321, 0.0123]},
          columns=['text', 'labels', 'pred'])

X = df[['text', 'labels']]
y = df['pred']

pipeline = Pipeline(steps=[
  ('union', FeatureUnion(
    transformer_list=[
      ('bagofwords', Pipeline([
        # X['text'] processed here
        ('tfidf', TfidfVectorizer()),
        ])),
      ('encoder', Pipeline([
        # X['labels'] processed here
        ('le', LabelEncoder()), 
        ]))
      ])
   ),
  # join above steps back into single X and pass to LinearRegression??
  ('lr', LinearRegression()),
  ])

pipeline.fit(X, y)

If FeatureUnion is the solution, how do I tell the pipeline to use tfidf for X['text'] and LabelEncoder for X['labels'], then merge the results and send them to LinearRegression?
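
For what it's worth, here is a minimal sketch of one way the per-column routing could look. It assumes a scikit-learn release that ships `ColumnTransformer` (0.20 or later, so newer than this question), and it swaps `LabelEncoder` for `OneHotEncoder`: as far as I can tell `LabelEncoder` is meant for the target `y` and its `fit`/`transform` accept a single argument, so it does not fit as a pipeline step. Both choices are assumptions on my part, not something from the code above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
                   'labels': ['green', 'blue', 'green', 'yellow'],
                   'pred': [0.0134, 0.0231, 0.0321, 0.0123]},
                  columns=['text', 'labels', 'pred'])

X = df[['text', 'labels']]
y = df['pred']

preprocess = ColumnTransformer(transformers=[
    # A plain string selects a 1-D column, which is what TfidfVectorizer expects.
    ('bagofwords', TfidfVectorizer(), 'text'),
    # A list selects a 2-D slice, which is what OneHotEncoder expects.
    # OneHotEncoder is an assumed replacement for LabelEncoder here.
    ('encoder', OneHotEncoder(handle_unknown='ignore'), ['labels']),
])

pipeline = Pipeline(steps=[
    # The transformed column blocks are stacked side by side into one feature matrix.
    ('union', preprocess),
    ('lr', LinearRegression()),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
features = pipeline.named_steps['union'].transform(X)
```

The step names `bagofwords`, `encoder`, `union` and `lr` are kept from the sample code; `preprocess`, `predictions` and `features` are just illustrative names.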

Do I need a custom transformer? If so, how would it work in this case?
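
A custom transformer would be one way to keep the FeatureUnion layout. The sketch below is only an illustration: `ColumnSelector` is a hypothetical helper (not part of scikit-learn) that picks a column out of the DataFrame so each branch of the union sees only its own data. It also assumes `OneHotEncoder` in place of `LabelEncoder`, since `LabelEncoder` is designed for the target `y` and takes a single argument, so it would likely fail inside a pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: return the requested column(s) of a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; selection is stateless.
        return self

    def transform(self, X):
        # A string gives a 1-D Series, a list gives a 2-D DataFrame.
        return X[self.columns]


df = pd.DataFrame({'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
                   'labels': ['green', 'blue', 'green', 'yellow'],
                   'pred': [0.0134, 0.0231, 0.0321, 0.0123]},
                  columns=['text', 'labels', 'pred'])

X = df[['text', 'labels']]
y = df['pred']

pipeline = Pipeline(steps=[
    ('union', FeatureUnion(transformer_list=[
        ('bagofwords', Pipeline([
            ('select', ColumnSelector('text')),      # X['text'] processed here
            ('tfidf', TfidfVectorizer()),
        ])),
        ('encoder', Pipeline([
            ('select', ColumnSelector(['labels'])),  # X[['labels']] processed here
            ('onehot', OneHotEncoder(handle_unknown='ignore')),  # assumed swap for LabelEncoder
        ])),
    ])),
    # FeatureUnion concatenates both branches column-wise before the regressor.
    ('lr', LinearRegression()),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
features = pipeline.named_steps['union'].transform(X)
```

Each branch starts with its own selector, so the union itself never needs to know about column names; it only stacks whatever the branches return.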

0 answers:

No answers