Does sklearn FeatureUnion not apply its steps sequentially like a Pipeline?

Time: 2020-07-24 14:24:25

Tags: scikit-learn pipeline preprocessor feature-engineering

I have a FeatureUnion constructed like this:

from sklearn import datasets
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.feature_selection import VarianceThreshold
import pandas as pd
import numpy as np

# Scale every non-categorical column
numeric_transformer = Pipeline(steps=[('StandardScaler', StandardScaler())])
# One-hot encode categorical columns (scikit-learn >= 1.2 names this argument sparse_output)
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder( sparse = False, handle_unknown='ignore' ))])
preprocessor = ColumnTransformer(
                    transformers=[
                        ('num', numeric_transformer, selector(dtype_exclude="category")),
                        ('cat', categorical_transformer, selector(dtype_include="category"))])

union_short = FeatureUnion([("preprocessor", preprocessor)
                    #, ('variance drop', VarianceThreshold()) ### Final step which causes issue
                                       ])

The union works fine until I add the final VarianceThreshold step, which raises a "could not convert string to float" error.

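What confuses me is that I thought a FeatureUnion processed its steps sequentially, with the output of one step becoming the input of the next (as in a Pipeline). If that were the case, the first step would already have encoded the categorical column as numbers, so the string values should not matter.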
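For comparison, as far as the scikit-learn documentation describes it, a FeatureUnion fits each transformer on the same original input and concatenates the outputs side by side, whereas a Pipeline chains them. The sketch below illustrates the difference on a toy array; `X_demo`, `add_one` and `times_ten` are hypothetical names used only for illustration and are not part of the code above.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy input: a single numeric column (illustrative only)
X_demo = np.array([[1.0], [2.0], [3.0]])

# Two trivial, stateless transformers
add_one = FunctionTransformer(lambda a: a + 1)
times_ten = FunctionTransformer(lambda a: a * 10)

# FeatureUnion: both transformers receive X_demo itself,
# and their outputs are stacked as separate columns
union_out = FeatureUnion(
    [("add_one", add_one), ("times_ten", times_ten)]
).fit_transform(X_demo)

# Pipeline: times_ten receives the output of add_one
pipe_out = Pipeline(
    [("add_one", add_one), ("times_ten", times_ten)]
).fit_transform(X_demo)

print(union_out)  # two columns: x + 1 and x * 10
print(pipe_out)   # one column: (x + 1) * 10
```

If this reading is right, a VarianceThreshold placed inside the union would be handed the raw DataFrame, string column included, which would be consistent with the error shown below.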

Code to reproduce the error:

    data = datasets.make_classification(n_features = 10, n_informative = 8, n_redundant = 2,n_samples= 1000, random_state = 3)
    X = pd.DataFrame(data[0] )
    X.columns = np.array(X.columns).astype(str)
    
    # Add categorical columns to be transformed
    X['cat'] = np.random.choice(['a','b','c'], X.shape[0])
    X['cat'] = X['cat'].astype('category')
    y = data[1]
    
    # Same union as above, with the final VarianceThreshold step enabled
    union = FeatureUnion([("preprocessor", preprocessor),
                          ('variance drop', VarianceThreshold())])
    
    t= pd.DataFrame(union.fit_transform(X))

>>> could not convert string to float: 'a'

Now, if I comment out the final VarianceThreshold step from the union and apply it separately afterwards, it runs fine:

numeric_transformer = Pipeline(steps=[('StandardScaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder( sparse = False, handle_unknown='ignore' ))])
preprocessor = ColumnTransformer(
                    transformers=[
                        ('num', numeric_transformer, selector(dtype_exclude="category")),
                        ('cat', categorical_transformer, selector(dtype_include="category"))])

union_short = FeatureUnion([("preprocessor", preprocessor)
                    #, ('variance drop', VarianceThreshold()) ### Final step which causes issue
                                       ])

data = datasets.make_classification(n_features = 10, n_informative = 8, n_redundant = 2,n_samples= 1000, random_state = 3)
X = pd.DataFrame(data[0] )
X.columns = np.array(X.columns).astype(str)

# Add categorical columns to be transformed
X['cat'] = np.random.choice(['a','b','c'], X.shape[0])
X['cat'] = X['cat'].astype('category')
y = data[1]

t= pd.DataFrame(union_short.fit_transform(X))
tt = VarianceThreshold().fit_transform(t)
tt.shape

>>> (1000, 13)

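What am I missing here? How can I get the FeatureUnion to apply these steps sequentially? I would prefer to bundle some of the preprocessing steps into a feature union, because I can then easily call fit_transform to inspect the output up to that point...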
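One possible way to get sequential behaviour is to chain the same two steps in a Pipeline instead of a FeatureUnion. This is a minimal sketch under a few assumptions, not a confirmed fix: the name `sequential` is hypothetical, the transformers are passed to the ColumnTransformer directly rather than wrapped in one-step pipelines, the category column is drawn from a seeded `RandomState` for reproducibility, and `sparse_threshold=0` is used to force dense output so that the version-dependent `sparse` argument of OneHotEncoder can be left out.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same synthetic data as in the question, plus one categorical column
data = datasets.make_classification(n_features=10, n_informative=8,
                                    n_redundant=2, n_samples=1000,
                                    random_state=3)
X = pd.DataFrame(data[0])
X.columns = np.array(X.columns).astype(str)
rng = np.random.RandomState(0)  # seeded here only for reproducibility
X["cat"] = pd.Series(rng.choice(["a", "b", "c"], X.shape[0])).astype("category")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), selector(dtype_exclude="category")),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         selector(dtype_include="category")),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Pipeline: VarianceThreshold receives the encoded output of the preprocessor
sequential = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("variance drop", VarianceThreshold()),
])

t = pd.DataFrame(sequential.fit_transform(X))
print(t.shape)

# Inspect the output up to (but excluding) the last step of the fitted pipeline
before_drop = sequential[:-1].transform(X)
print(before_drop.shape)
```

Since a Pipeline whose last step is a transformer also exposes fit_transform, and slicing such as `sequential[:-1]` returns the earlier steps as a sub-pipeline, the intermediate output can still be checked in much the same way as with the union.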

0 answers:

No answers yet