Pyspark ml: create a pipeline with categorical and numerical columns

Asked: 2019-04-16 09:34:26

Tags: python pyspark apache-spark-ml

I want to build a clean pipeline so that a RandomForestRegressor can be fit on both numerical and categorical columns. I know I have to include OneHotEncoding to handle the categorical data, but I can't get everything to fit together properly.

# Since I have either string or double column types, I can split my
# columns in two lists
cat_cols = [c for c, dtype in data.dtypes if dtype == 'string']
num_cols = [c for c in data.columns if c not in cat_cols]
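For context, `data.dtypes` in pyspark returns a list of `(column, dtype)` tuples, so the split can be illustrated on a plain list (the column names below are made up for illustration):

```python
# Toy stand-in for data.dtypes, which yields (name, dtype) tuples;
# these column names are hypothetical
dtypes = [('city', 'string'), ('age', 'double'),
          ('job', 'string'), ('income', 'double')]

# String columns are treated as categorical, everything else as numerical
cat_cols = [c for c, dtype in dtypes if dtype == 'string']
num_cols = [c for c, _ in dtypes if c not in cat_cols]
# cat_cols == ['city', 'job'], num_cols == ['age', 'income']
```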

To apply OneHotEncoding to the categorical data, I first need to apply StringIndexing:

from pyspark.ml.feature import StringIndexer
indexers = [StringIndexer(inputCol=cat_var, outputCol=cat_var+'_indexed', handleInvalid='keep') for cat_var in cat_cols]

# These two lists are here to lighten the code
cat_indexed = [cat_var+"_indexed" for cat_var in cat_cols]
cat_encoded = [cat_var+"_encoded" for cat_var in cat_cols]

Since 2.3, OneHotEncoderEstimator is the recommended approach. The problem is that no tutorial shows how to combine it with a list of StringIndexers and wrap everything in a Pipeline.

from pyspark.ml.feature import OneHotEncoderEstimator
encoder = OneHotEncoderEstimator(inputCols=cat_indexed, outputCols=cat_encoded)

# with version < 2.3 it would have been:
# encoders = [OneHotEncoder(inputCol=x, outputCol=y) for x,y in zip(cat_indexed, cat_encoded)]

A Pipeline to wrap everything:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# I have to get my numerical columns back:
cols_now = cat_encoded + num_cols
assembler = VectorAssembler(inputCols=cols_now, outputCol='features')

# If my target variable had been a string, I would have applied StringIndexing to it as well:
labelIndexer = StringIndexer(inputCol='target_var', outputCol='label')

# I need to concatenate a list of indexers, encoders, the vector assembler and my target_var
# What I would do if I had a list of encoders with the deprecated function OneHotEncoder:
tmp = [[i,j] for i,j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
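For reference, the zip-and-flatten above just interleaves the two lists, which can be checked with plain stand-in values:

```python
# Plain strings standing in for the indexer and encoder stage objects
indexers = ['idx_a', 'idx_b']
encoders = ['enc_a', 'enc_b']

# Pair each indexer with its encoder, then flatten the pairs
tmp = [[i, j] for i, j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
# tmp is now ['idx_a', 'enc_a', 'idx_b', 'enc_b']
```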

tmp += [assembler, labelIndexer]
pipeline = Pipeline(stages=tmp)

But:

  • I have a single encoder object, not an iterable, so the code above doesn't work
  • My target variable is numerical, so should I add it as-is instead of using labelIndexer?
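A plausible fix (a sketch of the stage list's shape, not a verified answer): since OneHotEncoderEstimator is one estimator covering all columns, the interleaving trick isn't needed at all, and the stages can simply be the indexers followed by the single encoder and the assembler. Plain strings stand in for the pyspark stage objects here, since only the structure of the list matters:

```python
# Stand-ins for the pipeline stage objects; with a single
# OneHotEncoderEstimator there is nothing to interleave
indexers = ['idx_city', 'idx_job']   # one StringIndexer per categorical column
encoder = 'encoder'                  # one OneHotEncoderEstimator for all of them
assembler = 'assembler'              # the VectorAssembler

stages = indexers + [encoder, assembler]
# A numerical target can be left as-is, so no labelIndexer stage is appended
```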

0 Answers:

No answers yet