Is there a way to use a complex type as the input column of a Spark ML Transformer?

Asked: 2019-07-16 10:00:36

Tags: apache-spark apache-spark-mllib

I'm trying to use a Vector nested inside a struct as the input column of a Spark MLlib transformer, like this...

import org.apache.spark.ml.linalg._
import spark.implicits._ // assumes a SparkSession in scope as `spark`, as in spark-shell

case class State(id: String, features: Vector)

// An empty Dataset of State pairs is enough to inspect the schema
val ds = Seq[(State, State)]().toDS
ds.printSchema()
root
|-- _1: struct (nullable = true)
|    |-- id: string (nullable = true)
|    |-- features: vector (nullable = true)
|-- _2: struct (nullable = true)
|    |-- id: string (nullable = true)
|    |-- features: vector (nullable = true)

But I can't pass _1.features as the input column to a transformer...

import org.apache.spark.ml.feature.PCA

val pca = new PCA().
  setInputCol("_1.features").
  setOutputCol("output").
  setK(3).
  fit(ds)
java.lang.IllegalArgumentException: Field "_1.features" does not exist.
Available fields: _1, _2
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
  at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:59)
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:41)
  at org.apache.spark.ml.feature.PCAParams$class.validateAndTransformSchema(PCA.scala:56)
  at org.apache.spark.ml.feature.PCA.validateAndTransformSchema(PCA.scala:70)
  at org.apache.spark.ml.feature.PCA.transformSchema(PCA.scala:105)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.feature.PCA.fit(PCA.scala:94)
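
The stack trace suggests why: SchemaUtils.checkColumnType resolves the name via StructType.apply, which only knows about the top-level fields (_1 and _2), so the dotted path is never followed into the struct. A quick check on the schema shows the same thing:

ds.schema.fieldNames     // Array(_1, _2) -- no "_1.features" at this level
ds.schema("_1").dataType // the nested struct where features actually lives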

Short of renaming the field, does anyone know a way around this?

1 answer:

Answer 0: (score: 0)

A simple workaround is to select only the parts of the dataset you need, since Spark ML seems to have trouble with nested columns.

For example, the following should work:

// Select only what you need, so "features" becomes a top-level column
val dsFeatures = ds.select("_1.id", "_1.features")
val pca = new PCA().setInputCol("features").setOutputCol("output").setK(3).fit(dsFeatures)

Or at least fail for a different reason! :)
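
If you need to keep both structs around, another workaround in the same spirit (a sketch along the same lines, not from the original answer) is to hoist the nested vector into a top-level column with withColumn and point the transformer at that:

import org.apache.spark.sql.functions.col

// Copy the nested vector up to a flat column the transformer can resolve;
// the rest of _1 and _2 stays intact alongside it
val dsFlat = ds.withColumn("features", col("_1.features"))
val pca2 = new PCA().
  setInputCol("features").
  setOutputCol("output").
  setK(3).
  fit(dsFlat)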