I have a Spark DataFrame with the following schema:
root
|-- mapkey: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- bt: string (nullable = true)
| | | |-- bp: double (nullable = true)
| | | |-- z: struct (nullable = true)
| | | | |-- w: integer (nullable = true)
| | | | |-- h: integer (nullable = true)
|-- uid: string (nullable = true)
I want to write a UDF that filters mapkey so that only the entry whose key equals uid is kept, returning just the values that pass the filter. I'm trying the following:
val filterMap = udf((m: Map[String, Seq[Row]], uid: String) => {
  val s = Set(uid)
  m.filterKeys { s.contains(_) == true }
})
But I get the following error:
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:762)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
  at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:722)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
  at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:726)
  at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
  at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703)
  at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:700)
  at org.apache.spark.sql.functions$.udf(functions.scala:3200)
Can anyone point out what's wrong with the UDF?
Answer (score: 1)
Spark cannot derive a schema for org.apache.spark.sql.Row through reflection, so a Row cannot appear in a UDF's input (or output) type. It looks like your only option is to use case classes that match the inner structure of this Row:
case class MyStruct(w: Int, h: Int)
case class Element(id: String, bt: String, bp: Double, z: MyStruct)
Then you can use these in your UDF (surprisingly enough):
// imports needed for toDF, $ and udf (assuming `spark` is your SparkSession):
import org.apache.spark.sql.functions.udf
import spark.implicits._

// sample data:
val df = Seq(
  (Map(
    "key1" -> Array(Element("1", "bt1", 0.1, MyStruct(1, 2)), Element("11", "bt11", 0.2, MyStruct(1, 3))),
    "key2" -> Array(Element("2", "bt2", 0.2, MyStruct(12, 22)))
  ), "key1")
).toDF("mapkey", "uid")

df.printSchema() // prints the right schema, as expected in the question
// define UDF:
val filterMap = udf((m: Map[String, Seq[Element]], uid: String) => {
  m.filterKeys(_ == uid)
})
// use UDF:
df.withColumn("result", filterMap($"mapkey", $"uid")).show(false)
// prints:
// +-----------------------------------------------------------------+
// |result                                                           |
// +-----------------------------------------------------------------+
// |Map(key1 -> WrappedArray([1,bt1,0.1,[1,2]], [11,bt11,0.2,[1,3]]))|
// +-----------------------------------------------------------------+
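Note that if all you actually need is the values stored under the matching key, and you are on a recent Spark version, you could probably skip the UDF (and the case classes) altogether. A rough sketch, assuming Spark 2.4+ for element_at and Spark 3.0+ for map_filter (whose lambda can, to my knowledge, also reference other columns of the row); the column names values/result and the vals onlyValues/filtered are just illustrative:
import org.apache.spark.sql.functions.{element_at, map_filter}

// Spark 2.4+: look up the array of structs stored under the key equal to uid.
// Yields null for rows whose uid is not a key of the map.
val onlyValues = df.withColumn("values", element_at($"mapkey", $"uid"))

// Spark 3.0+: keep the map shape but drop entries whose key differs from uid.
val filtered = df.withColumn("result", map_filter($"mapkey", (k, v) => k === $"uid"))

onlyValues.show(false)
filtered.show(false)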