I want to build a clean Pipeline so that a RandomForestRegressor can be fit on both numerical and categorical columns. I know I have to include one-hot encoding to handle the categorical data, but I can't manage to put all the pieces together correctly.
# Since I have either string or double column types, I can split my
# columns into two lists
cat_cols = [c for c, dtype in data.dtypes if dtype == 'string']
num_cols = [c for c in data.columns if c not in cat_cols]
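Just to make the split concrete, here is a throwaway example (the column names are invented and have nothing to do with my real schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# data.dtypes returns (name, dtype) pairs, e.g. [('color', 'string'), ('size', 'double')]
toy = spark.createDataFrame([("red", 1.0), ("blue", 2.0)], ["color", "size"])
# applying the two comprehensions above to this toy frame would give:
# cat_cols == ['color'] and num_cols == ['size']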
To apply one-hot encoding to the categorical columns, I first need to apply StringIndexing:
from pyspark.ml.feature import StringIndexer
indexers = [StringIndexer(inputCol=cat_var, outputCol=cat_var+'_indexed', handleInvalid='keep') for cat_var in cat_cols]
# These two lists are here to lighten the code
cat_indexed = [cat_var+"_indexed" for cat_var in cat_cols]
cat_encoded = [cat_var+"_encoded" for cat_var in cat_cols]
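To check my understanding of the indexing step, this is what a single indexer does on the toy DataFrame from above (again, only a sketch):

# Each StringIndexer maps a string column to a numeric index column;
# handleInvalid='keep' reserves an extra index for labels unseen at fit time
toy_indexed = StringIndexer(inputCol='color', outputCol='color_indexed',
                            handleInvalid='keep').fit(toy).transform(toy)
# toy_indexed now has a 'color_indexed' column holding values like 0.0, 1.0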
Starting with Spark 2.3, OneHotEncoderEstimator is the recommended way to do this. The problem is that I can't find any tutorial showing how to combine it with the list of StringIndexers and wrap everything in a Pipeline.
from pyspark.ml.feature import OneHotEncoderEstimator
encoder = OneHotEncoderEstimator(inputCols=cat_indexed, outputCols=cat_encoded)
# with version < 2.3 it would have been:
# encoders = [OneHotEncoder(inputCol=x, outputCol=y) for x,y in zip(cat_indexed, cat_encoded)]
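If I read the 2.3 API correctly, the single OneHotEncoderEstimator is fit on a DataFrame that already contains every *_indexed column, so on the toy example it would look like this (just a sketch, not my real pipeline):

# One estimator handles all indexed columns at once and outputs one sparse vector column per input
toy_encoded = OneHotEncoderEstimator(inputCols=['color_indexed'],
                                     outputCols=['color_encoded']).fit(toy_indexed).transform(toy_indexed)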
Wrapping everything in a Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
# I have to get my numerical columns back:
cols_now = cat_encoded + num_cols
assembler = VectorAssembler(inputCols=cols_now, outputCol='features')
# If my target variable had been a string, I would have applied StringIndexing to it as well:
labelIndexer = StringIndexer(inputCol='target_var', outputCol='label')
# I need to chain the list of indexers, the encoders, the vector assembler and the label indexer.
# This is what I would do if I still had a list of encoders built with the deprecated OneHotEncoder:
tmp = [[i, j] for i, j in zip(indexers, encoders)]      # interleave each indexer with its encoder
tmp = [stage for sublist in tmp for stage in sublist]   # flatten into a single list of stages
tmp += [assembler, labelIndexer]
pipeline = Pipeline(stages=tmp)
But: