I'm trying to use a Vector nested inside a struct as the input column for a Spark ML transformer. Like this...
import org.apache.spark.ml.linalg._
case class State(id: String, features: Vector)
val ds = Seq[(State,State)]().toDS
ds.printSchema()
root
|-- _1: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- features: vector (nullable = true)
|-- _2: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- features: vector (nullable = true)
But I can't pass _1.features as the input column to a transformer...
val pca = new PCA().
setInputCol("_1.features").
setOutputCol("output").
setK(3).
fit(ds)
java.lang.IllegalArgumentException: Field "_1.features" does not exist.
Available fields: _1, _2
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:274)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:273)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:41)
at org.apache.spark.ml.feature.PCAParams$class.validateAndTransformSchema(PCA.scala:56)
at org.apache.spark.ml.feature.PCA.validateAndTransformSchema(PCA.scala:70)
at org.apache.spark.ml.feature.PCA.transformSchema(PCA.scala:105)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.PCA.fit(PCA.scala:94)
Short of renaming the field, does anyone know a way around this?
Answer (score: 0)
A simple workaround is to select only the parts of the Dataset you actually need, since Spark ML appears to have trouble with nested columns. For example, the following should work:
val dsFeatures = ds.select("_1.id", "_1.features")
val pca = new PCA().setInputCol("features").setOutputCol("output").setK(3).fit(dsFeatures)
Or at least fail for a different reason! :)
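If you need to keep the rest of the row, a variation on the same idea is to promote the nested field to a top-level column with withColumn instead of select. This is a minimal sketch assuming the same ds as above; the column name "features" and the use of col("_1.features") are just illustrative choices:

import org.apache.spark.ml.feature.PCA
import org.apache.spark.sql.functions.col

// Copy the nested field up to a top-level column so PCA can find it;
// the original _1 and _2 structs remain in the Dataset.
val flat = ds.withColumn("features", col("_1.features"))

val pca = new PCA().
  setInputCol("features").
  setOutputCol("output").
  setK(3).
  fit(flat)

The underlying issue in both workarounds is the same: setInputCol takes a plain top-level column name, not a dotted path into a struct, so the nested field has to be lifted out before fitting.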