I have a feature union constructed like this:
from sklearn import datasets
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.feature_selection import VarianceThreshold
import pandas as pd
import numpy as np
numeric_transformer = Pipeline(steps=[('StandardScaler', StandardScaler())])
# Note: on scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_exclude="category")),
        ('cat', categorical_transformer, selector(dtype_include="category"))])
union_short = FeatureUnion([("preprocessor", preprocessor)
    #, ('variance drop', VarianceThreshold()) ### Final step which causes issue
])
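The union works fine until I add the final VarianceThreshold step, which raises a "could not convert string to float" error.
What confuses me is that I thought a feature union processed its steps sequentially, with the output of one step becoming the input of the next (as in a pipeline). If that were the case, the first step would already have encoded the categorical column as numbers, so the strings should not matter.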
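To make the distinction concrete, here is a minimal sketch on toy numeric data comparing the two classes. The names `X_demo`, `union_demo`, and `pipe_demo` are invented for this illustration and are not part of my real code. As far as I understand the scikit-learn documentation, `FeatureUnion` fits every transformer on the same original input and concatenates the outputs column-wise, while `Pipeline` chains them.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (hypothetical): 3 samples, 2 numeric features
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# FeatureUnion: each transformer sees X_demo itself; outputs are stacked side by side
union_demo = FeatureUnion([("std", StandardScaler()), ("minmax", MinMaxScaler())])

# Pipeline: the output of "std" is fed into "minmax"
pipe_demo = Pipeline([("std", StandardScaler()), ("minmax", MinMaxScaler())])

union_out = union_demo.fit_transform(X_demo)
pipe_out = pipe_demo.fit_transform(X_demo)

print(union_out.shape)  # (3, 4): 2 columns from each transformer
print(pipe_out.shape)   # (3, 2): the same 2 columns, transformed twice
```

If this reading is right, a transformer placed second in a union would still receive the raw input, strings included.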
Code to reproduce the error:
data = datasets.make_classification(n_features = 10, n_informative = 8, n_redundant = 2,n_samples= 1000, random_state = 3)
X = pd.DataFrame(data[0] )
X.columns = np.array(X.columns).astype(str)
# Add categorical columns to be transformed
X['cat'] = np.random.choice(['a','b','c'], X.shape[0])
X['cat'] = X['cat'].astype('category')
y = data[1]
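# `union` is the same FeatureUnion as above, with the final VarianceThreshold step enabled
union = FeatureUnion([("preprocessor", preprocessor), ('variance drop', VarianceThreshold())])
t = pd.DataFrame(union.fit_transform(X))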
>>> could not convert string to float: 'a'
Now, if I comment out the final variance threshold step from the union and apply it on its own afterwards, it runs fine:
numeric_transformer = Pipeline(steps=[('StandardScaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, selector(dtype_exclude="category")),
        ('cat', categorical_transformer, selector(dtype_include="category"))])
union_short = FeatureUnion([("preprocessor", preprocessor)
    #, ('variance drop', VarianceThreshold()) ### Final step which causes issue
])
data = datasets.make_classification(n_features = 10, n_informative = 8, n_redundant = 2,n_samples= 1000, random_state = 3)
X = pd.DataFrame(data[0] )
X.columns = np.array(X.columns).astype(str)
# Add categorical columns to be transformed
X['cat'] = np.random.choice(['a','b','c'], X.shape[0])
X['cat'] = X['cat'].astype('category')
y = data[1]
t= pd.DataFrame(union_short.fit_transform(X))
tt = VarianceThreshold().fit_transform(t)
tt.shape
>>> (1000, 13)
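What am I missing here? How can I get the feature union to apply these steps sequentially? I would prefer to bundle some of the preprocessing steps into a feature union, because then I can easily call fit_transform to inspect the output up to that point...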
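For comparison, this is a minimal sketch of the sequential behaviour I am after, written with `Pipeline` instead of `FeatureUnion`. It is only an illustration under assumptions: the small frame `X_small` and the names `preprocessor_demo` and `sequential` are made up here, and I pass `sparse_threshold=0` to `ColumnTransformer` so the result is always a dense array regardless of the encoder's default output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical small frame: 2 numeric columns plus 1 categorical column
rng = np.random.default_rng(0)
X_small = pd.DataFrame(rng.normal(size=(6, 2)), columns=["0", "1"])
X_small["cat"] = pd.Series(["a", "b", "c", "a", "b", "c"]).astype("category")

preprocessor_demo = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), selector(dtype_exclude="category")),
        ("cat", OneHotEncoder(handle_unknown="ignore"), selector(dtype_include="category")),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Pipeline feeds the encoded output of the preprocessor into VarianceThreshold
sequential = Pipeline([
    ("preprocessor", preprocessor_demo),
    ("variance_drop", VarianceThreshold()),
])

out = sequential.fit_transform(X_small)
print(out.shape)  # (6, 5): 2 scaled columns + 3 one-hot columns

# The fitted intermediate step can still be inspected on its own
intermediate = sequential.named_steps["preprocessor"].transform(X_small)
print(intermediate.shape)  # (6, 5)
```

Accessing the fitted step through `named_steps` seems to cover my wish to check the output up to a given point, though I do not know whether this is the recommended pattern.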