I'm trying to understand how to apply different transformations to different columns. I know I need Pipeline, but I think I also need FeatureUnion.
My dataframe:
               text  labels    pred
0  this is a phrase   green  0.0134
1        so is this    blue  0.0231
2       this is too   green  0.0321
3     and i am done  yellow  0.0123
My sample code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import TransformerMixin

df = pd.DataFrame({'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
                   'labels': ['green', 'blue', 'green', 'yellow'],
                   'pred': [0.0134, 0.0231, 0.0321, 0.0123]},
                  columns=['text', 'labels', 'pred'])

X = df[['text', 'labels']]
y = df['pred']

pipeline = Pipeline(steps=[
    ('union', FeatureUnion(
        transformer_list=[
            ('bagofwords', Pipeline([
                # X['text'] processed here
                ('tfidf', TfidfVectorizer()),
            ])),
            ('encoder', Pipeline([
                # X['labels'] processed here
                ('le', LabelEncoder()),
            ]))
        ])
    ),
    # join above steps back into single X and pass to LinearRegression??
    ('lr', LinearRegression()),
])

pipeline.fit(X, y)
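If FeatureUnion is the solution, how do I tell the pipeline to use tfidf for X['text'] and the label encoder for X['labels'], then merge the results and send them to LinearRegression?
Do I need a custom transformer? If so, how would it work in this case?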
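
For reference, here is a minimal sketch of one way the per-column routing could look. It makes two assumptions that are not in the code above. First, it assumes scikit-learn 0.20 or later and uses ColumnTransformer in place of FeatureUnion, since ColumnTransformer takes a column selector per transformer and so avoids a custom transformer. Second, it swaps LabelEncoder for OneHotEncoder: LabelEncoder is meant for the target y and its fit and transform methods take a single argument, so it does not fit the (X, y) interface that pipeline steps use. The name `preprocess` is only illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'text': ['this is a phrase', 'so is this', 'this is too', 'and i am done'],
                   'labels': ['green', 'blue', 'green', 'yellow'],
                   'pred': [0.0134, 0.0231, 0.0321, 0.0123]},
                  columns=['text', 'labels', 'pred'])

X = df[['text', 'labels']]
y = df['pred']

# Assumption: ColumnTransformer replaces the FeatureUnion from the question.
preprocess = ColumnTransformer(transformers=[
    # A plain string selects the column as 1-D data, which TfidfVectorizer expects.
    ('bagofwords', TfidfVectorizer(), 'text'),
    # A list selects the column as 2-D data, which OneHotEncoder expects.
    # Assumption: OneHotEncoder stands in for LabelEncoder.
    ('encoder', OneHotEncoder(handle_unknown='ignore'), ['labels']),
])

pipeline = Pipeline(steps=[
    # The transformed columns are concatenated side by side into one matrix ...
    ('preprocess', preprocess),
    # ... which is then passed to the regressor.
    ('lr', LinearRegression()),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
features = pipeline.named_steps['preprocess'].transform(X)
```

The difference between 'text' and ['labels'] in the selectors matters: the text vectorizer needs a one-dimensional sequence of strings, while the encoder needs a two-dimensional input.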