Question

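I am training an XGBoostRegressor model with Spark (Scala), and I noticed that the number of predictions returned by model.transform(df) is smaller than the number of rows I pass to the model.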

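The cause is NULL values in the data (which, for my use case, are expected to be there). I already call setHandleInvalid at every stage (specifically StringIndexer, OneHotEncoder, and VectorAssembler) to deal with them.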
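To confirm that the missing predictions really correspond to rows with nulls, it can help to count those rows up front. The following is a minimal sketch, assuming a local SparkSession and toy data; the object name `NullRowCount`, the helper `rowsWithAnyNull`, and the column names are hypothetical and not part of the original pipeline.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NullRowCount {
  // Number of rows in which at least one of the given columns is null.
  def rowsWithAnyNull(df: DataFrame, cols: Seq[String]): Long = {
    require(cols.nonEmpty, "at least one column is required")
    df.filter(cols.map(c => col(c).isNull).reduce(_ || _)).count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-row-count")
      .getOrCreate()
    import spark.implicits._

    // Toy data (hypothetical): two of the three rows contain a null field.
    val df = Seq[(Option[Double], Option[String])](
      (Some(1.0), Some("US")),
      (None, Some("DE")),
      (Some(3.0), None)
    ).toDF("x", "country_code")

    val total = df.count()
    val withNulls = rowsWithAnyNull(df, df.columns.toSeq)
    println(s"total rows: $total, rows with at least one null: $withNulls")

    spark.stop()
  }
}
```

If the difference between the input row count and the prediction count matches the number reported here, the dropped records are the ones containing nulls.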

Even so, if I use "keep", the model fails to train, whereas if I use "skip" (only on the VectorAssembler, by the way), the model trains but simply drops every record that has even a single null field.
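The difference between the two modes can be seen on the VectorAssembler alone, without XGBoost. Below is a minimal sketch, assuming Spark 2.4 or later (where VectorAssembler supports setHandleInvalid) and a local SparkSession; the object name `HandleInvalidDemo` and the columns "a" and "b" are hypothetical.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object HandleInvalidDemo {
  // Assemble the hypothetical numeric columns "a" and "b" into "features"
  // using the given handleInvalid mode ("keep", "skip" or "error").
  def assemble(df: DataFrame, mode: String): DataFrame =
    new VectorAssembler()
      .setInputCols(Array("a", "b"))
      .setOutputCol("features")
      .setHandleInvalid(mode)
      .transform(df)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("handle-invalid-demo")
      .getOrCreate()
    import spark.implicits._

    // Toy data: one complete row and two rows with a null field.
    val df = Seq[(Option[Double], Option[Double])](
      (Some(1.0), Some(2.0)),
      (None, Some(4.0)),
      (Some(5.0), None)
    ).toDF("a", "b")

    val inputRows = df.count()
    val keptRows = assemble(df, "keep").count()    // nulls become NaN entries in the vector
    val skippedRows = assemble(df, "skip").count() // rows containing a null are filtered out
    println(s"input: $inputRows, keep: $keptRows, skip: $skippedRows")

    // Inspect the assembled vectors produced in "keep" mode.
    assemble(df, "keep").show(false)

    spark.stop()
  }
}
```

With "keep", every row survives and the null fields show up as NaN inside the feature vector; with "skip", only the fully populated row remains, which matches the dropped records described above.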

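I searched Google extensively but did not really find a solution.

I would appreciate anyone's input.

Thanks.

I went through the Spark, Scala, and XGBoost docs, looked at some PRs that turned out not to help, and tried several strategies for handling null values, but none of them worked.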

For the "keep" case (where training fails) ->

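import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

// The constructor line was missing from the post; the variable name is assumed.
val stringIndexer = new StringIndexer()
  .setInputCol("country_code")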
  .setOutputCol("country_code_indexed")
  .setHandleInvalid("keep")

// OneHotEncoderEstimator takes arrays of input and output columns.
val oneHotEncoder = new OneHotEncoderEstimator()
  .setInputCols(Array("user_country_code_indexed"))
  .setOutputCols(Array("user_country_oneHotEncoded"))
  .setHandleInvalid("keep")

val assembler =  new VectorAssembler()
  .setInputCols(trainUpdated.drop("label",
                               "someCol1",
                               "someCol2", 
                               "country_code", 
                               "country_code_indexed").columns)
  .setOutputCol("features")
  .setHandleInvalid("keep")

val xgboostRegressor = new XGBoostRegressor(Map[String, Any](
  "num_round" -> 100,
  "num_workers" -> 10,  //num of instances * num of cores is the max.
  "objective" -> "reg:linear",
  "eta" -> 0.1,
  "gamma" -> 0.5,
  "max_depth" -> 6, 
  "early_stopping_rounds" -> 9,
  "seed" -> 1234,
  "lambda" -> 0.4,
  "alpha" -> 0.3,
  "colsample_bytree" -> 0.6,
  "subsample" -> 0.3
  ))

Then I get -> ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed

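Expected result: the model trains on data that contains null values (since that is XGBoost's default behavior...) and returns exactly the same number of records it was given, for both training and testing (fit/transform, with the same strategy for both).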

Answer 1

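I'd like to report that I discussed this issue with the XGBoost creators and contributed back to the community by updating the documentation accordingly. The new documentation is here (see the missing values section): https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html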
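As a rough illustration of one strategy in the spirit of that section (as I understand it, not a quote from the docs): replace nulls in the numeric feature columns with a sentinel value and tell XGBoost to treat that sentinel as missing through the "missing" parameter. This is only a sketch under assumptions: the object name `MissingValueSketch`, the helper `fillNulls`, and the sentinel -999 are hypothetical, and depending on the XGBoost4J-Spark version, sparse feature vectors may need extra care, so check the linked page for your version.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
import org.apache.spark.sql.DataFrame

object MissingValueSketch {
  // Hypothetical sentinel; it must not collide with any real feature value.
  val Sentinel: Float = -999.0f

  // Replace nulls in the given numeric feature columns with the sentinel,
  // so that no row has to be skipped before the VectorAssembler runs.
  def fillNulls(df: DataFrame, featureCols: Seq[String]): DataFrame =
    df.na.fill(Sentinel.toDouble, featureCols)

  // Same style of parameter map as in the question, plus "missing"
  // (assumption: the sentinel above is what marks a missing value).
  val params: Map[String, Any] = Map(
    "num_round" -> 100,
    "num_workers" -> 10,
    "objective" -> "reg:linear",
    "eta" -> 0.1,
    "max_depth" -> 6,
    "missing" -> Sentinel
  )

  // Regressor configured to treat the sentinel as a missing value.
  def regressor: XGBoostRegressor =
    new XGBoostRegressor(params)
      .setFeaturesCol("features")
      .setLabelCol("label")
}
```

The idea is that `fillNulls` is applied to both the training and the test DataFrame before assembling features, so fit and transform see the same number of records they were given.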

XGBoost training fails if null values are present (with setHandleInvalid "keep" throughout the pipeline)
