I have the following situation:
// using Spark 2.2 (spark-shell, so spark.implicits._ is already in scope)
import org.apache.spark.sql.functions.udf

case class myAttribute(name: String, value: String)
case class person(id: String, attr: List[myAttribute])

// build a name -> value map from the attribute structs
val myFunc: (Seq[myAttribute] => Map[String, String]) = { array =>
  array.map {
    case myAttribute(name: String, value: String) => (name, value)
  }.toMap
}

val myUDF = udf(myFunc)
and use myUDF as:
val df = List(person("1", List(myAttribute("s", "s"))),
              person("2", List(myAttribute("b", "b")))).toDF()

df.withColumn("tst", myUDF($"attr")).show
I get the following error:
org.apache.spark.SparkException:
Failed to execute user defined function($anonfun$1: (array<struct<name:string,value:string>>) => map<string,string>)
.......
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to myAttribute
at $anonfun$1$$anonfun$apply$1.apply(<pastie>:30)
But when I change myAttribute to Row inside the UDF, it works fine. Why? The column attr is typed as Seq[myAttribute]. Am I missing something?
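For reference, the working variant looks roughly like this (a minimal sketch; the names myFuncRow and myUDFRow are just for illustration, and the positional Row match assumes the name/value field order of the struct above):

import org.apache.spark.sql.Row

// Spark passes each struct element to the UDF as a generic Row,
// so we pattern-match on Row instead of on myAttribute
val myFuncRow: (Seq[Row] => Map[String, String]) = { array =>
  array.map {
    case Row(name: String, value: String) => (name, value)
  }.toMap
}

val myUDFRow = udf(myFuncRow)
df.withColumn("tst", myUDFRow($"attr")).show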
EDIT: I have gone through Defining a UDF that accepts an Array of objects in a Spark DataFrame?, which says that Row should be used, and I understand that. What I am asking is: why is my column typed as Seq[myAttribute], yet inside the UDF I have to use the generic Row type instead of myAttribute?
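For context on what I mean by the column being typed: the typed Dataset API does hand back myAttribute instances (a small contrast sketch, again assuming spark.implicits._ is in scope as in the shell):

val ds = df.as[person]
// in the typed API each element of attr really is a myAttribute,
// so field access compiles and runs fine
ds.map(p => p.attr.map(a => s"${a.name}=${a.value}").mkString(",")).show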