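I am trying to modify my DataFrame by replacing all categorical attributes with sparse matrices. I combined 3 pipelines using FeatureUnion. It works fine when I call fit_transform, but it gives me an error when I call just fit. I want to train this pipeline so that I can apply it to the test dataset later, which is why I need the fit step on its own. I am using Python 3.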
import pandas as pd
import numpy as np
data = [[3,4,'WN','DEN','SNA',2],[6,1,'WN','FLL','DAL',1],[6,1,'WN','FLL','DAL',1],[6,1,'WN','FLL','DAL',1],[6,1,'WN','FLL','DAL',1],[6,1,'WN','FLL','DAL',1]]
df = pd.DataFrame(data, columns = ['MONTH','DAY_OF_WEEK','AIRLINE','ORIGIN_AIRPORT','DESTINATION_AIRPORT','SCHEDULED_DEPARTURE'])
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
MONTH_pipeline = Pipeline([
    ('selector', DataFrameSelector(['MONTH'])),
    ('label_binarizer', LabelBinarizer()),
])
DAY_OF_WEEK_pipeline = Pipeline([
    ('selector', DataFrameSelector(['DAY_OF_WEEK'])),
    ('label_binarizer', LabelBinarizer()),
])
AIRLINE_pipeline = Pipeline([
    ('selector', DataFrameSelector(['AIRLINE'])),
    ('label_binarizer', LabelBinarizer()),
])
full_pipeline = FeatureUnion(transformer_list = [
    ('MONTH_pipeline', MONTH_pipeline),
    ('DAY_OF_WEEK_pipeline', DAY_OF_WEEK_pipeline),
    ('AIRLINE_pipeline', AIRLINE_pipeline),
])
train_set_prepared = full_pipeline.fit_transform(df)
full_pipeline.fit(df)
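The first command, which uses fit_transform, works fine and gives the result I want, but the second command, which uses only fit, raises an error. I would appreciate it if someone could help me understand why.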
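A possible explanation, offered as a sketch rather than a confirmed diagnosis: LabelBinarizer is designed for target labels, so its fit method accepts a single argument (y), while Pipeline.fit calls fit(X, y) on its last step. That extra argument would be enough to raise a TypeError. Whether fit_transform happens to succeed may depend on the installed scikit-learn version. The snippet below first reproduces the signature mismatch in isolation, then shows one alternative for encoding feature columns. It assumes scikit-learn 0.20 or later (for ColumnTransformer and string support in OneHotEncoder); the names `categorical`, `encoder`, and `fit_error` are illustrative and not part of the original code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

data = [[3, 4, 'WN', 'DEN', 'SNA', 2]] + [[6, 1, 'WN', 'FLL', 'DAL', 1]] * 5
df = pd.DataFrame(data, columns=['MONTH', 'DAY_OF_WEEK', 'AIRLINE',
                                 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
                                 'SCHEDULED_DEPARTURE'])

# Reproduce the mismatch: Pipeline.fit passes (X, y) to its final step,
# but LabelBinarizer.fit only takes y.
fit_error = None
try:
    LabelBinarizer().fit(df['MONTH'].values, None)
except TypeError as exc:
    fit_error = exc
    print('LabelBinarizer.fit rejected the extra argument:', exc)

# Alternative (an assumption about what is wanted): one-hot encode the
# categorical feature columns with a transformer meant for features.
categorical = ['MONTH', 'DAY_OF_WEEK', 'AIRLINE']  # illustrative column list
encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical)],
    sparse_threshold=1.0,  # prefer sparse output when the encoder returns sparse
)

encoder.fit(df)                            # fit on its own works here
train_set_prepared = encoder.transform(df)  # reuse later on the test set
print(train_set_prepared.shape)
```

With this approach the fitted `encoder` can be applied to the test set through `encoder.transform(test_df)`, and `handle_unknown='ignore'` encodes categories unseen during fit as all zeros instead of raising an error.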