User-defined function on a join

Date: 2017-07-30 02:03:10

Tags: scala apache-spark dataframe

How can I do an outer join with a UDF? Say my columns have the following types:

ColA: String
ColB: Seq[Row]

DF1:

ColA ColB
1    [(1,2),(1,3)]
2    [(2,3),(3,4)]

DF2:

ColA ColB
1    [(1,2),(1,4)]
3    [(2,5),(3,4)]

Result:

 ColA    newCol
    1    [(1,2),(1,3)]
    2    [(2,3),(3,4)]
    3    [(2,5),(3,4)]

Code sample:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

val joinDf = DF1.join(DF2, DF1("ColA") === DF2("ColA"), "outer")
    .withColumn("newCol", when(DF1("ColB").isNull, DF2("ColB"))
        .otherwise(when(DF2("ColB").isNull, DF1("ColB")).otherwise(DF1("ColB"))))
    .select(col("ColA"), col("newCol"))

val joinUdf = udf((a: Seq[Row], b: Seq[Row]) => (a, b) match {
    case (null, b) => b   // DF1 side missing: take DF2's value
    case (a, null) => a   // DF2 side missing: take DF1's value
    case (a, b)    => a   // both present: prefer DF1's value
})

This throws an error:

java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
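This error occurs because Spark's reflection-based udf helper cannot derive a schema for the generic org.apache.spark.sql.Row type. As a rough sketch (not part of the original post), one common workaround in Spark 2.x is the untyped udf overload that takes an explicit return DataType, so no schema needs to be reflected from the function's type parameters:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

// Element type of ColB: a struct of two integer fields, matching the data above.
val elemType = StructType(Seq(
  StructField("_1", IntegerType, nullable = false),
  StructField("_2", IntegerType, nullable = false)))

// Untyped variant: the return schema is supplied explicitly,
// so Seq[Row] arguments and results are acceptable.
val joinUdf = udf(
  (a: Seq[Row], b: Seq[Row]) => if (a == null) b else a,
  ArrayType(elemType))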


1 Answer:

Answer 0 (score: 1)

Given that the schema of the first dataframe DF1 is

root
 |-- ColA: integer (nullable = false)
 |-- ColB: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: integer (nullable = false)
 |    |    |-- _2: integer (nullable = false)
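
For reference (this construction is not in the original answer), a dataframe with exactly this schema can be built from Scala tuples, assuming a SparkSession named spark is in scope:

import spark.implicits._

// Tuples nested in the Seq become array elements of struct type (_1, _2).
val DF1 = Seq(
  (1, Seq((1, 2), (1, 3))),
  (2, Seq((2, 3), (3, 4)))
).toDF("ColA", "ColB")

DF1.printSchema()  // prints the schema shown above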

you would have to rename ColB of the second dataframe DF2 to ColC, so that DF2 looks like

+----+--------------+
|ColA|ColC          |
+----+--------------+
|1   |[[1,2], [1,3]]|
|3   |[[2,5], [3,4]]|
+----+--------------+
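
The rename itself is a single call (origDF2 is a hypothetical name for the dataframe before renaming):

val DF2 = origDF2.withColumnRenamed("ColB", "ColC")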

Then, using the code below, you don't even need a udf function at all, since the built-in when function does the job:

DF1.join(DF2, DF1("ColA") === DF2("ColA"), "outer")
    .select(
      when(DF1("ColA").isNull, DF2("ColA")).otherwise(DF1("ColA")).as("ColA"),
      when(DF1("ColB").isNull, DF2("ColC")).otherwise(DF1("ColB")).as("newCol"))
    .show(false)

and you should get your desired output:

+----+--------------+
|ColA|newCol        |
+----+--------------+
|1   |[[1,2], [1,3]]|
|2   |[[2,3], [3,4]]|
|3   |[[2,5], [3,4]]|
+----+--------------+
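
As a side note beyond the original answer, the same null-preferring merge can also be expressed with the built-in coalesce function, which returns its first non-null argument:

import org.apache.spark.sql.functions.coalesce

// Equivalent formulation: take DF1's value when present, else DF2's.
val joinDf = DF1.join(DF2, DF1("ColA") === DF2("ColA"), "outer")
    .select(
      coalesce(DF1("ColA"), DF2("ColA")).as("ColA"),
      coalesce(DF1("ColB"), DF2("ColC")).as("newCol"))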