Spark mllib Interaction throws an exception on hand-built Vectors

Asked: 2018-08-27 07:13:24

Tags: apache-spark-mllib

You can copy the following code and paste it into spark-shell to reproduce the problem.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.feature.Interaction

val df = spark.createDataFrame(Seq(
    (1, Vectors.dense(1, 2), Vectors.dense(3, 4)),
    (2, Vectors.dense(5, 6), Vectors.dense(7, 8))
)).toDF("id", "v1", "v2")

df.show

val interaction = new Interaction().setInputCols(Array("v1", "v2")).setOutputCol("interacted")

interaction.transform(df).show


The input DataFrame df is displayed as:

+---+---------+---------+
| id|       v1|       v2|
+---+---------+---------+
|  1|[1.0,2.0]|[3.0,4.0]|
|  2|[5.0,6.0]|[7.0,8.0]|
+---+---------+---------+

The expected result is:

+---+---------+---------+---------------------+
| id|       v1|       v2|           interacted|
+---+---------+---------+---------------------+
|  1|[1.0,2.0]|[3.0,4.0]|    [3.0,4.0,6.0,8.0]|
|  2|[5.0,6.0]|[7.0,8.0]|[35.0,40.0,42.0,48.0]|
+---+---------+---------+---------------------+

Instead, it throws an exception:

org.apache.spark.SparkException: Vector attributes must be defined for interaction.
  at org.apache.spark.ml.feature.Interaction$$anonfun$getFeatureEncoders$1$$anonfun$4.apply(Interaction.scala:138)
  at org.apache.spark.ml.feature.Interaction$$anonfun$getFeatureEncoders$1$$anonfun$4.apply(Interaction.scala:138)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.ml.feature.Interaction$$anonfun$getFeatureEncoders$1.apply(Interaction.scala:137)
  at org.apache.spark.ml.feature.Interaction$$anonfun$getFeatureEncoders$1.apply(Interaction.scala:132)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.apache.spark.ml.feature.Interaction.getFeatureEncoders(Interaction.scala:132)
  at org.apache.spark.ml.feature.Interaction.transform(Interaction.scala:73)
  ... 49 elided

I tested on Spark 2.3.1, Spark 2.3.0, and Spark 2.2.0, and all of them show the same problem.

However, if the vectors are produced by VectorAssembler, as in the example, it runs correctly.

+---+---+---+---+---------+---------+--------------------+
|  a|  b|  c|  d|       ab|       cd|          interacted|
+---+---+---+---+---------+---------+--------------------+
|1.0|2.0|3.0|4.0|[1.0,2.0]|[3.0,4.0]|   [3.0,4.0,6.0,8.0]|
|5.0|6.0|7.0|8.0|[5.0,6.0]|[7.0,8.0]|[35.0,40.0,42.0,4...|
+---+---+---+---+---------+---------+--------------------+
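The exception message mentions "vector attributes", so one way to compare the two cases is to look at the ML attribute metadata attached to each vector column. The following is a minimal sketch for spark-shell (it assumes the predefined `spark` session); the names `handBuilt` and `assembled` are illustrative and not part of the original code. If the hand-built column really lacks attributes, the first check should report that none are defined, while the VectorAssembler output should report that they are.

import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Vector column built by hand, as in the failing reproduction.
val handBuilt = spark.createDataFrame(Seq(
    (1, Vectors.dense(1, 2))
)).toDF("id", "v1")

// Vector column built by VectorAssembler, as in the working case.
val assembled = new VectorAssembler()
    .setInputCols(Array("a", "b"))
    .setOutputCol("ab")
    .transform(spark.createDataFrame(Seq((1.0, 2.0))).toDF("a", "b"))

// Raw column metadata as stored in the schema.
println(handBuilt.schema("v1").metadata)
println(assembled.schema("ab").metadata)

// attributes is an Option; Interaction fails when it is empty.
println(AttributeGroup.fromStructField(handBuilt.schema("v1")).attributes.isDefined)
println(AttributeGroup.fromStructField(assembled.schema("ab")).attributes.isDefined)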

I also tested the transformers that can produce vectors. Some work, but others throw the exception.

createDataFrame().toDF  exception
FeatureHasher           exception
HashingTF               exception
Interaction             OK
OneHotEncoder           OK
OneHotEncoderEstimator  OK
VectorAssembler         OK
VectorSlicer            OK
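The pattern above would be consistent with Interaction needing per-element attributes in the column metadata, which only some sources attach. As a possible workaround, not a confirmed fix, the sketch below attaches default numeric attributes to the hand-built columns before calling Interaction. The helper `withNumericAttrs`, the generated attribute names, and the variable `tagged` are hypothetical; the vector size is passed in by hand and is assumed to match the data.

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.Interaction
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val df = spark.createDataFrame(Seq(
    (1, Vectors.dense(1, 2), Vectors.dense(3, 4)),
    (2, Vectors.dense(5, 6), Vectors.dense(7, 8))
)).toDF("id", "v1", "v2")

// Hypothetical helper: re-alias a vector column with an AttributeGroup
// holding one unnamed-by-default numeric attribute per vector element.
def withNumericAttrs(data: DataFrame, colName: String, size: Int): DataFrame = {
    val attrs = Array.tabulate[Attribute](size) { i =>
        NumericAttribute.defaultAttr.withName(s"${colName}_$i")
    }
    val meta = new AttributeGroup(colName, attrs).toMetadata()
    data.withColumn(colName, col(colName).as(colName, meta))
}

// Assumption: both vectors have exactly 2 elements.
val tagged = withNumericAttrs(withNumericAttrs(df, "v1", 2), "v2", 2)

val interaction = new Interaction()
    .setInputCols(Array("v1", "v2"))
    .setOutputCol("interacted")

interaction.transform(tagged).show(false)

Passing the vector columns through VectorAssembler first is another route that should attach attributes when the inputs carry them, since its output is already known to work with Interaction.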

Is this a mistake on my part, or am I misunderstanding something?

0 answers:

No answers