Join multiple Spark DataFrames based on a condition

Time: 2017-09-30 08:26:39

Tags: scala apache-spark apache-spark-sql spark-dataframe

Based on the "SC" code, I need to join SRCTable with either RefTable-1 or RefTable-2.

Conditions: If SC is "D", join SRCTable with RefTable-1 on KEY = KEY1 to get the Value. If SC is "U", join SRCTable with RefTable-2 on KEY = KEY2 and FK = KEY3 to get the Value.

Here are the input Spark DataFrames.

SRCTable:
    -------------
    KEY |SC  |FK 
    -------------
    66  |D   | a
    67  |U   | b
    70  |D   | y
    71  |U   | q
    -------------
 RefTable-1:
    --------------
    KEY1 |Value  | 
    --------------
    66   |xyz1   | 
    67   |abc1   | 
    68   |fgr1   |
    69   |yte1   |
    70   |erx1   |
    71   |ter1   |
    --------------
 RefTable-2:
    --------------------
    KEY2 |KEY3  |Value  | 
    --------------------
    66   | a    |xyz2   | 
    67   | c    |abc2   | 
    67   | b    |fgr2   |
    69   | g    |yte2   |
    70   | y    |erx2   |
    71   | q    |ter2   |
    --------------------

Expected output:

    --------------------
    KEY |SC  |FK |Value |
    -------------------- 
    66  |D   | a |xyz1  |
    67  |U   | b |fgr2  |
    70  |D   | y |erx1  |
    71  |U   | q |ter2  |
    ---------------------

Note: the input tables will contain millions of records, so the solution needs to be optimized.

1 Answer:

Answer 0: (score: 2)

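Here is code you can test. It uses only the join function on DataFrames: filter SRCTable by SC, join each subset with its reference table, then union the two results.

    // toDF on a Seq needs the session implicits (already in scope in spark-shell)
    import spark.implicits._

    val SRCTable = Seq((66, "D", "a"), (67, "U", "b"), (70, "D", "y"), (71, "U", "q")).toDF("KEY", "SC", "FK")
    val RefTable1 = Seq((66, "xyz1"),(67, "abc1"),(68, "fgr1"),(69, "yte1"),(70, "erx1"),(71, "ter1")).toDF("KEY1", "Value")
    val RefTable2 = Seq((66, "a", "xyz2"), (67, "c", "abc2"), (67, "b", "fgr2"), (69, "g", "yte2"), (70, "y", "erx2"), (71, "q", "ter2")).toDF("KEY2", "KEY3", "Value")

    // SC = "D": look up Value in RefTable-1 on KEY = KEY1
    val join1 = SRCTable.where(SRCTable.col("SC").equalTo("D")).join(RefTable1, SRCTable.col("KEY") === RefTable1.col("KEY1")).select("KEY", "SC", "FK", "Value")
    // SC = "U": look up Value in RefTable-2 on KEY = KEY2 and FK = KEY3
    val join2 = SRCTable.where(SRCTable.col("SC").equalTo("U")).join(RefTable2, SRCTable.col("KEY") === RefTable2.col("KEY2") && SRCTable.col("FK") === RefTable2.col("KEY3")).select("KEY", "SC", "FK", "Value")

    // Stack the two disjoint results (unionAll behaves like union in Spark 2.0+)
    join1.unionAll(join2).show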
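As an alternative to filtering SRCTable twice, the same rule can be sketched as a single pass: left-join both reference tables with the SC check folded into each join condition, then pick the matching column with `when`. This is only a sketch under a few assumptions: Spark 2.x or later (for `SparkSession`), a local session created here for illustration, and the Value columns renamed to the hypothetical names `Value1` and `Value2` to avoid ambiguity. Unlike the inner joins above, a source row with no match is kept with a null Value, and whether this is actually faster depends on your data, so compare the plans before adopting it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("conditional-join-sketch").getOrCreate()
import spark.implicits._

val src = Seq((66, "D", "a"), (67, "U", "b"), (70, "D", "y"), (71, "U", "q")).toDF("KEY", "SC", "FK")
// Value1 / Value2 are illustrative names, not part of the original tables.
val ref1 = Seq((66, "xyz1"), (67, "abc1"), (68, "fgr1"), (69, "yte1"), (70, "erx1"), (71, "ter1"))
  .toDF("KEY1", "Value1")
val ref2 = Seq((66, "a", "xyz2"), (67, "c", "abc2"), (67, "b", "fgr2"), (69, "g", "yte2"), (70, "y", "erx2"), (71, "q", "ter2"))
  .toDF("KEY2", "KEY3", "Value2")

val result = src
  // Only rows with SC = "D" can match RefTable-1.
  .join(ref1, col("SC") === "D" && col("KEY") === col("KEY1"), "left")
  // Only rows with SC = "U" can match RefTable-2.
  .join(ref2, col("SC") === "U" && col("KEY") === col("KEY2") && col("FK") === col("KEY3"), "left")
  .select(
    col("KEY"), col("SC"), col("FK"),
    // Take the value from whichever reference table applies to this row.
    when(col("SC") === "D", col("Value1")).otherwise(col("Value2")).as("Value")
  )

result.orderBy("KEY").show()
```

If rows without a match should be dropped, as in the union version, adding `.where(col("Value").isNotNull)` after the `select` would restore that behavior.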

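If you run into performance problems, I suggest looking at how your data is partitioned and, if one of your DataFrames is small, at broadcasting it for the join.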
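To illustrate the broadcast suggestion, here is a minimal sketch that assumes RefTable-1 is small enough to fit in executor memory and that you are on Spark 2.x or later. The `broadcast` function from `org.apache.spark.sql.functions` is only a hint to the planner; the local session and app name are placeholders for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-join-sketch").getOrCreate()
import spark.implicits._

val SRCTable = Seq((66, "D", "a"), (67, "U", "b"), (70, "D", "y"), (71, "U", "q")).toDF("KEY", "SC", "FK")
val RefTable1 = Seq((66, "xyz1"), (67, "abc1"), (68, "fgr1"), (69, "yte1"), (70, "erx1"), (71, "ter1"))
  .toDF("KEY1", "Value")

// Hint that RefTable1 should be shipped to every executor,
// so the large SRCTable does not have to be shuffled for this join.
val join1 = SRCTable
  .where($"SC" === "D")
  .join(broadcast(RefTable1), $"KEY" === $"KEY1")
  .select("KEY", "SC", "FK", "Value")

// Print the physical plan to check which join strategy was chosen.
join1.explain()
join1.show()
```

Without the hint, Spark broadcasts a table automatically only when its estimated size is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), so the explicit hint mainly matters for tables above that threshold or with poor size estimates.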