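Is it possible, in pandas or sklearn, to convert categorical values to unique numeric/float indices and include that step in a pipeline? I have to stick with scikit-learn 0.18.1, because that is the version available on the server.
Example: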
c1 | c2
--------
cat1 | 1.0
cat2 | 2.0
cat2 | 2.0
cat3 | 3.0
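For reference, the mapping in the table above can be sketched on the pandas side with `pd.factorize`. This is only an illustration: the DataFrame `df` and its columns are hypothetical stand-ins for the real data, and the codes depend on the order in which categories first appear, so the same mapping is not guaranteed on new data unless `uniques` is saved and reused.

```python
import pandas as pd

# Hypothetical data mirroring the table above.
df = pd.DataFrame({"c1": ["cat1", "cat2", "cat2", "cat3"]})

# factorize returns zero-based integer codes plus the unique values,
# in order of first appearance.
codes, uniques = pd.factorize(df["c1"])

# Shift to one-based floats to match the c2 column in the example.
df["c2"] = (codes + 1).astype(float)
print(df)
```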
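I'm running into problems because LabelEncoder and OneHotEncoder cannot run together in a pipeline, and OneHotEncoder cannot take string data as input...
When applying predictions, some categorical values are not always present, and the model fails because I can't use handle_unknown="ignore" in the pipeline.
So I have had to fall back on pandas.get_dummies() for the one-hot encoding, which I would rather not use because of the missing categories.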
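# tried
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

cat_pipeline = Pipeline([
    ("select", DataFrameSelector(cat_features)),
    # LabelEncoder is meant for 1-D target labels; its fit_transform takes only y
    ("index", LabelEncoder()),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
test = cat_pipeline.fit_transform(X_train)
test.shape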
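One possible way around the LabelEncoder step is a small custom transformer that learns a category-to-index mapping per column and sends unseen values to a reserved index. The sketch below is not part of scikit-learn: `CategoryIndexer` is a hypothetical name, the sample arrays are made up, and it assumes each column holds plain strings that can be sorted. Index 0 is reserved for unknown values, so OneHotEncoder(handle_unknown="ignore") turns them into all-zero rows.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


class CategoryIndexer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: map each column's categories to 1-based indices."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=object)
        # One dict per column; known categories get indices 1..n, 0 is reserved.
        self.mappings_ = [
            {cat: i + 1 for i, cat in enumerate(sorted(set(col)))}
            for col in X.T
        ]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        out = np.zeros(X.shape, dtype=float)
        for j, mapping in enumerate(self.mappings_):
            # Values not seen during fit fall back to the reserved index 0.
            out[:, j] = [mapping.get(value, 0) for value in X[:, j]]
        return out


# Made-up data: one categorical column, with an unseen category at predict time.
X_train = np.array([["cat1"], ["cat2"], ["cat2"], ["cat3"]], dtype=object)
X_new = np.array([["cat2"], ["cat4"]], dtype=object)

pipe = Pipeline([
    ("index", CategoryIndexer()),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])
train_ohe = pipe.fit_transform(X_train).toarray()
new_ohe = pipe.transform(X_new).toarray()
print(train_ohe.shape)
print(new_ohe)
```

In the pipeline above, this transformer would take the place of the "index" step, after the DataFrameSelector.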
# currently using, but this fails when applying predictions to new data that
# lacks some of the categories that were one-hot encoded via pandas.get_dummies()
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

("cat", Pipeline([
    ("select", DataFrameSelector(cat_ohe)),
])),
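If pandas.get_dummies() has to stay, one common workaround for the missing-category failure is to remember the dummy columns from the training data and reindex new data to them. The following is a minimal sketch under assumed data: the frames `train` and `new` and the column name `c1` are hypothetical.

```python
import pandas as pd

# Hypothetical training and prediction-time data for one categorical column.
train = pd.DataFrame({"c1": ["cat1", "cat2", "cat2", "cat3"]})
new = pd.DataFrame({"c1": ["cat2", "cat4"]})  # cat1/cat3 absent, cat4 unseen

# Remember the dummy columns produced on the training data.
train_dummies = pd.get_dummies(train["c1"])
train_columns = train_dummies.columns

# Align new data to the training columns: categories missing from the new
# data become all-zero columns, and categories never seen in training are dropped.
new_dummies = (
    pd.get_dummies(new["c1"])
    .reindex(columns=train_columns, fill_value=0)
    .astype(int)
)
print(new_dummies)
```

The column list has to be stored alongside the fitted model so the same alignment can be applied at prediction time.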