PySpark PipelineModel.transform error 'Field "cut_catVec" does not exist.\nAvailable fields'

Time: 2019-11-09 16:49:35

Tags: pyspark

I am trying to run MLlib in PySpark to predict prices, and I am using a DataFrame with the following schema:

[('cut', 'string'),
 ('color', 'string'),
 ('clarity', 'string'),
 ('carat', 'double'),
 ('table', 'int'),
 ('x', 'double'),
 ('y', 'double'),
 ('z', 'double'),
 ('price', 'int')]
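
For reproducibility, a toy DataFrame with the same columns could be built roughly like this (a minimal sketch with made-up values, assuming an active SparkSession called spark; it is not my actual data load):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy rows with the same columns as the schema above; the values are invented
    # and only make the schema concrete (table/price are inferred as bigint here
    # rather than int, which should not matter for the question).
    mllipdf = spark.createDataFrame(
        [
            ('Ideal',   'E', 'SI2', 0.23, 55, 3.95, 3.98, 2.43, 326),
            ('Premium', 'E', 'SI1', 0.21, 61, 3.89, 3.84, 2.31, 326),
            ('Good',    'J', 'VS1', 0.23, 65, 4.05, 4.07, 2.31, 327),
        ],
        ['cut', 'color', 'clarity', 'carat', 'table', 'x', 'y', 'z', 'price'],
    )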

So, after identifying the categorical and numeric columns:

`categ_col = ['cut', 'color', 'clarity']
num_col = ['carat', 'table', 'x', 'y', 'z']`

In the script below I:

  • first convert the string/text values into numerical values using StringIndexer, and then
  • use OneHotEncoderEstimator from Spark MLlib to turn each string-indexed value into a one-hot encoded value (see the standalone sketch after the code below);
  • use VectorAssembler to combine all the features from the multiple double-typed columns into a single feature vector;
  • also append each step of the process to a stages array.

    `from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
    stages = []
    for catcol in categ_col:
        stringIndexer = StringIndexer(inputCol = catcol, outputCol = catcol + 'Index')
        OHencoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[catcol + "_catVec"])
    stages += [stringIndexer, OHencoder]
    assemblerInputs = [c + "_catVec" for c in categ_col] + num_col
    Vectassembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
    stages += [Vectassembler]`
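
For context, this is how I understand the two categorical stages to behave on a single column in isolation (an illustrative sketch on the 'cut' column only, assuming Spark 2.3/2.4 where OneHotEncoderEstimator exists; it is not part of my actual pipeline):

    from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

    # Index the 'cut' strings to numeric labels, then one-hot encode that index.
    cut_indexer = StringIndexer(inputCol='cut', outputCol='cutIndex')
    indexed_df = cut_indexer.fit(mllipdf).transform(mllipdf)          # adds 'cutIndex'

    cut_encoder = OneHotEncoderEstimator(inputCols=['cutIndex'],
                                         outputCols=['cut_catVec'])
    encoded_df = cut_encoder.fit(indexed_df).transform(indexed_df)    # adds 'cut_catVec'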

When I move on to the next step:

    `import pandas as pd
    from pyspark.ml import Pipeline
    cols = mllipdf.columns
    pipeline = Pipeline(stages = stages)
    pipelineModel = pipeline.fit(mllipdf)
    mllipdf = pipelineModel.transform(mllipdf)
    selectedCols = ['features']+cols
    mllipdf = mllipdf.select(selectedCols)
    pd.DataFrame(mllipdf.take(5), columns=mllipdf.columns)`

I get an error at the "mllipdf = pipelineModel.transform(mllipdf)" line saying "IllegalArgumentException: 'Field "cut_catVec" does not exist.\nAvailable fields: cut, color, clarity, carat, table, x, y, z, price, clarityIndex, clarity_catVec'"

Not sure what is going on here.
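
In case it helps, this is the kind of check I would run to see what actually ended up in stages and which columns exist before the VectorAssembler stage is applied (a debugging sketch only, reusing the same stages list and the original, untransformed mllipdf):

    from pyspark.ml import Pipeline

    # Which stage objects are actually in the list?
    print([type(s).__name__ for s in stages])

    # Run everything except the final VectorAssembler and list the columns it
    # would see, i.e. which *_catVec columns really exist at that point.
    encoded = Pipeline(stages=stages[:-1]).fit(mllipdf).transform(mllipdf)
    print(encoded.columns)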

0 Answers:

No answers yet