How to create an outer join with a UDF
For example, my columns have the following types:
ColA: String
ColB: Seq[Row]
DF1:
ColA ColB
1 [(1,2),(1,3)]
2 [(2,3),(3,4)]
DF2:
ColA ColB
1 [(1,2),(1,4)]
3 [(2,5),(3,4)]
Result:
ColA newCol
1 [(1,2),(1,3)]
2 [(2,3),(3,4)]
3 [(2,5),(3,4)]
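The desired merge semantics (keep DF1's sequence when the key exists on both sides, otherwise take whichever side is present) can be sketched in plain Scala on ordinary Maps, without Spark. The variable names here are illustrative, not part of the original code:

```scala
// Plain-Scala sketch of the desired outer-join merge (no Spark needed):
// for each key, keep DF1's sequence if present, otherwise DF2's.
val df1 = Map(1 -> Seq((1, 2), (1, 3)), 2 -> Seq((2, 3), (3, 4)))
val df2 = Map(1 -> Seq((1, 2), (1, 4)), 3 -> Seq((2, 5), (3, 4)))

val merged: Map[Int, Seq[(Int, Int)]] =
  (df1.keySet ++ df2.keySet).map { k =>
    k -> df1.getOrElse(k, df2(k)) // DF1 wins when the key exists on both sides
  }.toMap
```

This reproduces the Result table above: keys 1 and 2 come from DF1, key 3 from DF2.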
Code example:
val joinDf = DF1.join(DF2, DF1("ColA") === DF2("ColA"), "outer")
  .withColumn("newCol", when(DF1("ColB").isNull, DF2("ColB"))
    .otherwise(when(DF2("ColB").isNull, DF1("ColB")).otherwise(DF1("ColB"))))
  .select(col("ColA"), col("newCol"))
val joinUdf = udf((a: Seq[Row], b: Seq[Row]) => (a, b) match {
  case (null, b) => b
  case (a, null) => a
  case (a, b)    => a
})
This raises an error:

java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
Answer 0 (score: 1)
Given that the schema of the first dataframe DF1 is

root
 |-- ColA: integer (nullable = false)
 |-- ColB: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: integer (nullable = false)
 |    |    |-- _2: integer (nullable = false)

you have to rename the ColB of DF2 (to ColC, for example), so that DF2 becomes

+----+--------------+
|ColA|ColC          |
+----+--------------+
|1   |[[1,2], [1,4]]|
|3   |[[2,5], [3,4]]|
+----+--------------+

with schema

root
 |-- ColA: integer (nullable = false)
 |-- ColC: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: integer (nullable = false)
 |    |    |-- _2: integer (nullable = false)

That way you don't even need a udf function at all, since the built-in when function can be used, and you should get your desired output.
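The answer's code snippet did not survive extraction. A minimal sketch of what such a udf-free join could look like, using only Spark's built-in when and coalesce functions and assuming DF2's value column has already been renamed to ColC as described above (this is a hypothetical reconstruction, not the answerer's original code):

```scala
import org.apache.spark.sql.functions.{coalesce, when}

// Outer join on ColA, then prefer DF1's ColB,
// falling back to DF2's ColC when DF1 has no row for that key.
val result = DF1.join(DF2, DF1("ColA") === DF2("ColA"), "outer")
  .select(
    coalesce(DF1("ColA"), DF2("ColA")).as("ColA"),
    when(DF1("ColB").isNull, DF2("ColC")).otherwise(DF1("ColB")).as("newCol")
  )
```

The newCol expression could equally be written with coalesce alone, as coalesce(DF1("ColB"), DF2("ColC")); the when form is shown because the answer names the when function.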