Pyspark ml: create a pipeline with categorical and numerical columns

Asked: 2019-04-16 09:34:26

Tags: python pyspark apache-spark-ml

I want to build a clean pipeline so that a RandomForestRegressor can be fit on both numerical and categorical columns. I know I have to include OneHotEncoding to handle the categorical data, but I can't get everything to fit together properly.

# Since I have either string or double column types, I can split my
# columns in two lists
cat_cols = [c for c, dtype in data.dtypes if dtype == 'string']
num_cols = [c for c in data.columns if c not in cat_cols]
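For context, `data.dtypes` in pyspark returns a list of `(column, dtype)` tuples, so the split can be illustrated on a plain list (the column names below are made up for illustration):

```python
# Toy stand-in for data.dtypes, which yields (name, dtype) tuples;
# these column names are hypothetical
dtypes = [('city', 'string'), ('age', 'double'),
          ('job', 'string'), ('income', 'double')]

# String columns are treated as categorical, everything else as numerical
cat_cols = [c for c, dtype in dtypes if dtype == 'string']
num_cols = [c for c, _ in dtypes if c not in cat_cols]
# cat_cols == ['city', 'job'], num_cols == ['age', 'income']
```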

To apply OneHotEncoding to the categorical data, I first need to apply StringIndexing:

from pyspark.ml.feature import StringIndexer
indexers = [StringIndexer(inputCol=cat_var, outputCol=cat_var+'_indexed', handleInvalid='keep') for cat_var in cat_cols]

# These two lists are here to lighten the code
cat_indexed = [cat_var+"_indexed" for cat_var in cat_cols]
cat_encoded = [cat_var+"_encoded" for cat_var in cat_cols]

Since 2.3, OneHotEncoderEstimator is the recommended approach. The problem is that no tutorial shows how to combine it with a list of StringIndexers and wrap everything in a Pipeline.

from pyspark.ml.feature import OneHotEncoderEstimator
encoder = OneHotEncoderEstimator(inputCols=cat_indexed, outputCols=cat_encoded)

# with version < 2.3 it would have been:
# encoders = [OneHotEncoder(inputCol=x, outputCol=y) for x,y in zip(cat_indexed, cat_encoded)]

A Pipeline to wrap everything:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# I have to get my numerical columns back:
cols_now = cat_encoded + num_cols
assembler = VectorAssembler(inputCols=cols_now, outputCol='features')

# If my target variable had been a string, I would have applied StringIndexing to it as well:
labelIndexer = StringIndexer(inputCol='target_var', outputCol='label')

# I need to concatenate a list of indexers, encoders, the vector assembler and my target_var
# What I would do if I had a list of encoders with the deprecated function OneHotEncoder:
tmp = [[i,j] for i,j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
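For reference, the zip-and-flatten above just interleaves the two lists, which can be checked with plain stand-in values:

```python
# Plain strings standing in for the indexer and encoder stage objects
indexers = ['idx_a', 'idx_b']
encoders = ['enc_a', 'enc_b']

# Pair each indexer with its encoder, then flatten the pairs
tmp = [[i, j] for i, j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
# tmp is now ['idx_a', 'enc_a', 'idx_b', 'enc_b']
```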

tmp += [assembler, labelIndexer]
pipeline = Pipeline(stages=tmp)

But:

  • I have a single encoder object, not an iterable, so the code above doesn't work
  • My target variable is numerical, so should I add it as-is instead of using labelIndexer?
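A plausible fix (a sketch of the stage list's shape, not a verified answer): since OneHotEncoderEstimator is one estimator covering all columns, the interleaving trick isn't needed at all, and the stages can simply be the indexers followed by the single encoder and the assembler. Plain strings stand in for the pyspark stage objects here, since only the structure of the list matters:

```python
# Stand-ins for the pipeline stage objects; with a single
# OneHotEncoderEstimator there is nothing to interleave
indexers = ['idx_city', 'idx_job']   # one StringIndexer per categorical column
encoder = 'encoder'                  # one OneHotEncoderEstimator for all of them
assembler = 'assembler'              # the VectorAssembler

stages = indexers + [encoder, assembler]
# A numerical target can be left as-is, so no labelIndexer stage is appended
```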

0 Answers:

No answers yet