Question

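I have been trying to understand why the Spark ML OneHotEncoder transformer throws an empty-string error when I am not passing it any empty strings.

For reproducibility, here is a sample DataFrame: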

val df = sparkSession.createDataFrame(Seq(
  (0, "apple"),
  (1, "banana"),
  (2, ""),
  (1, "banana"),
  (2, null)
)).toDF("id", "fruit")

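I want to use the fruit column in some ML algorithms, so I need to encode it as a vector. To do that, I first apply a StringIndexer and then run a OneHotEncoder on the indexer's output, with both stages wrapped in a Pipeline: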

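import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer()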
  .setInputCol("fruit")
  .setOutputCol("fruit_category")
  .setHandleInvalid("keep")

val encoder = new OneHotEncoder()
  .setInputCol("fruit_category")
  .setOutputCol("fruit_vec")

// create pipeline
val transform_pipleline = new Pipeline().setStages(Array(indexer, encoder))

// run pipeline on DF to create model
val index_model = transform_pipleline.fit(df)

// use the model to actually transform a DF
val df2 = index_model.transform(df)

However, when I do this, I get:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.

However, the column I am passing to OneHotEncoder is the output of StringIndexer ("fruit_category"), and it contains no empty strings:

+---+------+--------------+
| id| fruit|fruit_category|
+---+------+--------------+
|  0| apple|           2.0|
|  1|banana|           0.0|
|  2|      |           1.0|
|  1|banana|           0.0|
|  2|  null|           3.0|
+---+------+--------------+

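What is going on here? Is OneHotEncoder somehow using the original labels retained by StringIndexer? I thought I would not need to remove any empty strings, since the indexer maps those strings to doubles.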
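One way to check whether the indexed column still carries the original labels is to look at the ML attribute metadata on its schema field. The following is a minimal diagnostic sketch, not a confirmed explanation of the error. It assumes a local SparkSession and Spark 2.2 or later (for setHandleInvalid("keep")). The names InspectIndexerMetadata, labelsOf, and indexedSample are hypothetical helpers introduced only for illustration.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, introduced only for this illustration.
object InspectIndexerMetadata {

  // Returns the labels stored in the ML attribute metadata of `column`,
  // or an empty Seq if the column carries no nominal attribute.
  def labelsOf(df: DataFrame, column: String): Seq[String] =
    Attribute.fromStructField(df.schema(column)) match {
      case nominal: NominalAttribute => nominal.values.map(_.toSeq).getOrElse(Seq.empty)
      case _                         => Seq.empty
    }

  // Rebuilds the sample DataFrame from the question and applies only the StringIndexer.
  def indexedSample(spark: SparkSession): DataFrame = {
    val df = spark.createDataFrame(Seq(
      (0, "apple"),
      (1, "banana"),
      (2, ""),
      (1, "banana"),
      (2, null)
    )).toDF("id", "fruit")

    new StringIndexer()
      .setInputCol("fruit")
      .setOutputCol("fruit_category")
      .setHandleInvalid("keep")
      .fit(df)
      .transform(df)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this check.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("inspect-indexer-metadata")
      .getOrCreate()

    // Print each stored label in quotes so that an empty string is visible.
    labelsOf(indexedSample(spark), "fruit_category").zipWithIndex.foreach {
      case (label, index) => println(s"$index -> '$label'")
    }

    spark.stop()
  }
}
```

If the printed labels include '', then the numeric column is not label-free after all: the empty string is still present in its metadata, which a downstream stage could read when naming its output attributes.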

Spark ML StringIndexer and OneHotEncoder - empty string error
