VectorAssembler fails with java.util.NoSuchElementException: Param handleInvalid does not exist

Date: 2019-11-28 09:57:06

Tags: apache-spark pyspark apache-spark-mllib apache-spark-ml apache-spark-2.0

When transforming an ML pipeline that uses VectorAssembler, I get a "Param handleInvalid does not exist" error. Why does this happen? Am I missing something? I am new to PySpark.

I am using the following code to combine a given list of columns into a single vector column:

from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler

stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index').setHandleInvalid("keep")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

label_stringIdx = StringIndexer(inputCol='response', outputCol='label')
stages += [label_stringIdx]

numericCols = ['date_acct_', 'date_loan_', 'amount', 'duration', 'payments', 'birth_number_', 'min1', 'max1', 'mean1', 'min2', 'max2', 'mean2', 'min3', 'max3', 'mean3', 'min4', 'max4', 'mean4', 'min5', 'max5', 'mean5', 'min6', 'max6', 'mean6', 'gen', 'has_card']
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="feature")
print(assembler)
stages += [assembler]
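As a quick sanity check on the naming convention used above, this is how the assembler's input list comes together (plain Python, no Spark needed; the column names are abbreviated from the question):

```python
# Each categorical column contributes its one-hot-encoded "<name>classVec"
# column, followed by the numeric columns as-is.
categoricalColumns = ['frequency', 'type_disp', 'type_card']
numericCols = ['amount', 'duration']  # abbreviated for illustration
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
print(assemblerInputs)
# ['frequencyclassVec', 'type_dispclassVec', 'type_cardclassVec', 'amount', 'duration']
```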

df_features is the main dataframe holding all the columns. I tried both handleInvalid='keep' and handleInvalid='skip' there, but unfortunately ran into the same error.

I get the following error:

Traceback (most recent call last):
  File "spark_model_exp_.py", line 275, in <module>
    feature_df = assembler.transform(features)
  File "/usr/local/lib/python3.6/site-packages/pyspark/ml/base.py", line 173, in transform
    return self._transform(dataset)
  File "/usr/local/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 311, in _transform
    self._transfer_params_to_java()
  File "/usr/local/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 124, in _transfer_params_to_java
    pair = self._make_java_param_pair(param, self._paramMap[param])
  File "/usr/local/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 113, in _make_java_param_pair
    java_param = self._java_obj.getParam(param.name)
  File "/usr/local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o1072.getParam.
: java.util.NoSuchElementException: Param handleInvalid does not exist.
        at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
        at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.ml.param.Params$class.getParam(params.scala:728)
        at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:43)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:745)

What have I tried so far?

categoricalColumns = ['frequency', 'type_disp', 'type_card']
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Index').setHandleInvalid("keep")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

label_stringIdx = StringIndexer(inputCol='response', outputCol='label')
stages += [label_stringIdx]
numericCols = ['date_acct_', 'date_loan_', 'amount', 'duration', 'payments', 'birth_number_', 'min1', 'max1', 'mean1', 'min2', 'max2', 'mean2', 'min3', 'max3', 'mean3', 'min4', 'max4', 'mean4', 'gen', 'has_card']
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="feature")
stages += [assembler]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(features)
features = pipelineModel.transform(features)
features.show(n=2)
selectedCols = ['label', 'feature'] + cols
features = features.select(selectedCols)
print(features.dtypes)

With the code above, which runs everything through a Pipeline, the error moved: the VectorAssembler transform itself no longer failed, but the Pipeline's transform function raised the same error (Param handleInvalid does not exist).

Please let me know if more details are needed. Is there an alternative way to achieve this?

Edit: I have a partial answer for why this happens: the code works fine on my local Spark (version 2.4), but the cluster runs Spark 2.3, and since handleInvalid was only introduced in version 2.4, I get this error.
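One way to catch this kind of skew early is to compare the client and cluster versions before relying on a 2.4-only param. A minimal sketch (the helper name is hypothetical; in a real session the two versions would come from `pyspark.__version__` and `spark.version`):

```python
def supports_assembler_handle_invalid(client_version, cluster_version):
    """handleInvalid was added to VectorAssembler in Spark 2.4, so both the
    PySpark client and the cluster JVM must be at least 2.4."""
    def major_minor(v):
        parts = v.split(".")
        return (int(parts[0]), int(parts[1]))
    return major_minor(client_version) >= (2, 4) and major_minor(cluster_version) >= (2, 4)

print(supports_assembler_handle_invalid("2.4.0", "2.3.1"))  # False: the cluster is too old
print(supports_assembler_handle_invalid("2.4.4", "2.4.0"))  # True
```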

But I would still like to understand it: I checked that the dataframe has no NULL/NaN values, so why does VectorAssembler invoke the handleInvalid param at all? Is there a way to bypass this implicit use of handleInvalid so I don't hit the error, or some other workaround short of upgrading the cluster from Spark 2.3 to 2.4?
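The failure has nothing to do with the data: per the traceback, the PySpark 2.4 wrapper transfers every Python-side param (including the 2.4-only `handleInvalid` entry) to the JVM object, and the 2.3 JVM VectorAssembler rejects the unknown name before any row is touched. The sketch below reproduces that mechanism with stand-in classes so it runs without Spark; the "workaround" of deleting the client-only entry from the private `_paramMap` is shown only to illustrate the idea and would be fragile against real PySpark internals.

```python
class FakeJavaAssembler:
    """Stand-in for the Spark 2.3 JVM VectorAssembler: no 'handleInvalid' param."""
    _params = {"inputCols", "outputCol"}

    def getParam(self, name):
        if name not in self._params:
            # mirrors java.util.NoSuchElementException: Param ... does not exist.
            raise KeyError("Param %s does not exist." % name)
        return name


class FakeAssembler:
    """Stand-in for the PySpark 2.4 wrapper, which carries a 2.4-only param."""
    def __init__(self):
        self._java_obj = FakeJavaAssembler()
        self._paramMap = {"inputCols": ["amount", "duration"],
                          "outputCol": "feature",
                          "handleInvalid": "error"}  # 2.4-only entry

    def _transfer_params_to_java(self):
        # like pyspark.ml.wrapper: look up each Python param on the JVM side
        for name in self._paramMap:
            self._java_obj.getParam(name)


assembler = FakeAssembler()
try:
    assembler._transfer_params_to_java()
    failed = False
except KeyError:
    failed = True  # the version-skew failure from the question

# Fragile workaround sketch: drop the client-only param before the transfer.
assembler._paramMap.pop("handleInvalid", None)
assembler._transfer_params_to_java()  # now succeeds
print(failed)  # True
```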

Can anyone suggest something?

1 answer:

Answer 0 (score: 0)

I finally solved this with RFormula, which removes the need for StringIndexer, VectorAssembler, and Pipeline: RFormula does all of that under the hood. https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/ml/feature/RFormula.html

from pyspark.ml.feature import RFormula

formula = RFormula(formula='response ~ .', featuresCol='features', labelCol='label')
label_df = formula.fit(df_features).transform(df_features)

Here response is the label, and df_features is your entire feature set.
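For intuition, `'response ~ .'` just means "the column left of `~` is the label, and `.` stands for every other column as a feature". A tiny plain-Python illustration of that expansion (the helper is hypothetical, not part of Spark):

```python
def expand_formula(formula, columns):
    """Split an R-style formula: left of '~' is the label; '.' means all
    remaining columns, otherwise features are the '+'-separated terms."""
    label, rhs = [side.strip() for side in formula.split("~")]
    if rhs == ".":
        features = [c for c in columns if c != label]
    else:
        features = [term.strip() for term in rhs.split("+")]
    return label, features

label, feats = expand_formula("response ~ .", ["response", "amount", "duration"])
print(label, feats)  # response ['amount', 'duration']
```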