You can copy the following code and paste it into spark-shell to reproduce the issue.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.feature.Interaction
val df = spark.createDataFrame(Seq(
(1, Vectors.dense(1, 2), Vectors.dense(3, 4)),
(2, Vectors.dense(5, 6), Vectors.dense(7, 8))
)).toDF("id", "v1", "v2")
df.show
val interaction = new Interaction().setInputCols(Array("v1", "v2")).setOutputCol("interacted")
interaction.transform(df).show
The input data df is displayed as:
+---+---------+---------+
| id|       v1|       v2|
+---+---------+---------+
| 1|[1.0,2.0]|[3.0,4.0]|
| 2|[5.0,6.0]|[7.0,8.0]|
+---+---------+---------+
The expected result is:
+---+---------+---------+---------------------+
| id|       v1|       v2|           interacted|
+---+---------+---------+---------------------+
| 1|[1.0,2.0]|[3.0,4.0]| [3.0,4.0,6.0,8.0]|
| 2|[5.0,6.0]|[7.0,8.0]|[35.0,40.0,42.0,48.0]|
+---+---------+---------+---------------------+
Instead, it throws an exception:
org.apache.spark.SparkException: Vector attributes must be defined for interaction.
at org.apache.spark.ml.feature.Interaction$$anonfun$getFeatureEncoders$1$$anonfun$4.apply(Interaction.scala:138)
at org.apache.spark.ml.feature.Interaction$$anonfun$getFeatureEncoders$1$$anonfun$4.apply(Interaction.scala:138)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.ml.feature.Interaction$$anonfun$getFeatureEncoders$1.apply(Interaction.scala:137)
at org.apache.spark.ml.feature.Interaction$$anonfun$getFeatureEncoders$1.apply(Interaction.scala:132)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.ml.feature.Interaction.getFeatureEncoders(Interaction.scala:132)
at org.apache.spark.ml.feature.Interaction.transform(Interaction.scala:73)
... 49 elided
I tested this on Spark 2.3.1, Spark 2.3.0, and Spark 2.2.0, and all of them show the same problem.
However, if the vectors are produced by VectorAssembler, as in the example, it runs correctly.
+---+---+---+---+---------+---------+--------------------+
|  a|  b|  c|  d|       ab|       cd|          interacted|
+---+---+---+---+---------+---------+--------------------+
|1.0|2.0|3.0|4.0|[1.0,2.0]|[3.0,4.0]| [3.0,4.0,6.0,8.0]|
|5.0|6.0|7.0|8.0|[5.0,6.0]|[7.0,8.0]|[35.0,40.0,42.0,4...|
+---+---+---+---+---------+---------+--------------------+
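For reference, the working VectorAssembler case can be reproduced with something like the sketch below. The column names a, b, c, d, ab, cd, and interacted come from the output above; the variable names (raw, assemblerAB, assemblerCD, assembled, interactionAB, result) are only illustrative, and the code assumes a spark-shell session where spark is already defined.

```scala
import org.apache.spark.ml.feature.{Interaction, VectorAssembler}

// Four plain Double columns, matching the a/b/c/d values shown above.
val raw = spark.createDataFrame(Seq(
  (1.0, 2.0, 3.0, 4.0),
  (5.0, 6.0, 7.0, 8.0)
)).toDF("a", "b", "c", "d")

// Build the two vector columns with VectorAssembler instead of Vectors.dense.
val assemblerAB = new VectorAssembler().setInputCols(Array("a", "b")).setOutputCol("ab")
val assemblerCD = new VectorAssembler().setInputCols(Array("c", "d")).setOutputCol("cd")
val assembled = assemblerCD.transform(assemblerAB.transform(raw))

// Interaction over the assembled vector columns.
val interactionAB = new Interaction().setInputCols(Array("ab", "cd")).setOutputCol("interacted")
val result = interactionAB.transform(assembled)
result.show
```

The only difference from the failing snippet is how the vector columns are created.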
I also tested the transformers that can produce vectors. Some of them work, but others throw the same exception.
createDataFrame().toDF exception
FeatureHasher exception
HashingTF exception
Interaction OK
OneHotEncoder OK
OneHotEncoderEstimator OK
VectorAssembler OK
VectorSlicer OK
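Since the exception message mentions vector attributes, one way to compare the failing and working cases is to look at the ML attribute metadata on the vector column. This is only a diagnostic sketch under my own assumptions: it uses AttributeGroup.fromStructField from org.apache.spark.ml.attribute, the names plain, assembledAB, plainAttrs, and assembledAttrs are illustrative, and spark is the spark-shell session.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Vector column created directly, as in the failing snippet.
val plain = spark.createDataFrame(Seq(
  (1, Vectors.dense(1, 2))
)).toDF("id", "v1")

// Vector column created by VectorAssembler, as in the working case.
val assembledAB = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("ab")
  .transform(spark.createDataFrame(Seq((1.0, 2.0))).toDF("a", "b"))

// attributes is an Option; the comparison of interest is whether it is defined.
val plainAttrs = AttributeGroup.fromStructField(plain.schema("v1")).attributes
val assembledAttrs = AttributeGroup.fromStructField(assembledAB.schema("ab")).attributes

println(s"v1 has attributes: ${plainAttrs.isDefined}")
println(s"ab has attributes: ${assembledAttrs.isDefined}")
```

If the two columns differ here, that would be consistent with the split between the "exception" and "OK" rows in the list above.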
Is this a mistake on my part, or am I misunderstanding something?