Generic join between two dataframes in Spark / Scala

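Time: 2018-08-21 00:09:37

Tags: scala apache-spark

I have two dataframes, left and right. I already have a working solution to my problem, but I need a way to make it generic. My question is at the end.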

leftDF:

+------+---------+-------+-------+
|leftId|leftAltId|leftCur|leftAmt|
+------+---------+-------+-------+
|1     |100      |USD    |20     |
|2     |200      |INR    |100    |
|4     |500      |MXN    |100    |
+------+---------+-------+-------+

rightDF:

+-------+----------+--------+--------+
|rightId|rightAltId|rightCur|rightAmt|
+-------+----------+--------+--------+
|1      |300       |USD     |20      |
|3      |400       |MXN     |100     |
|4      |600       |MXN     |200     |
+-------+----------+--------+--------+

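I want to perform a join between these two dataframes, and I expect four dataframes as output:

  1. Transactions present in leftDF but not in rightDF

  2. Transactions present in rightDF but not in leftDF

  3. Transactions that share at least one of the ids between the two dataframes

    3.a Strict match: same currency and amount in both dataframes. Example: the transaction with id 1.

    3.b Relaxed match: same id, but a different currency/amount combination. Example: the transaction with id 4.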

Here is the desired output:

  1. Transactions present in leftDF but not in rightDF

    +------+---------+-------+-------+-------+----------+--------+--------+
    |leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
    +------+---------+-------+-------+-------+----------+--------+--------+
    |2     |200      |INR    |100    |null   |null      |null    |null    |
    +------+---------+-------+-------+-------+----------+--------+--------+
    
  2. Transactions present in rightDF but not in leftDF

    +------+---------+-------+-------+-------+----------+--------+--------+
    |leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
    +------+---------+-------+-------+-------+----------+--------+--------+
    |null  |null     |null   |null   |3      |400       |MXN     |100     |
    +------+---------+-------+-------+-------+----------+--------+--------+
    
  3. Transactions that share at least one of the ids between the two dataframes

    +------+---------+-------+-------+-------+----------+--------+--------+
    |leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
    +------+---------+-------+-------+-------+----------+--------+--------+
    |1     |100      |USD    |20     |1      |300       |USD     |20      |
    |4     |500      |MXN    |100    |4      |600       |MXN     |200     |
    +------+---------+-------+-------+-------+----------+--------+--------+
    

    3.a Strict match: same currency and amount in both dataframes. Example: the transaction with id 1.

    +------+---------+-------+-------+-------+----------+--------+--------+
    |leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
    +------+---------+-------+-------+-------+----------+--------+--------+
    |1     |100      |USD    |20     |1      |300       |USD     |20      |
    +------+---------+-------+-------+-------+----------+--------+--------+
    

    3.b Relaxed match: same id, but a different currency/amount combination. Example: the transaction with id 4.

    +------+---------+-------+-------+-------+----------+--------+--------+
    |leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
    +------+---------+-------+-------+-------+----------+--------+--------+
    |4     |500      |MXN    |100    |4      |600       |MXN     |200     |
    +------+---------+-------+-------+-------+----------+--------+--------+
    

Here is my working code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import sparkSession.implicits._

val leftDF: DataFrame = Seq((1, 100, "USD", 20), (2, 200, "INR", 100), (4, 500, "MXN", 100)).toDF("leftId", "leftAltId", "leftCur", "leftAmt")
val rightDF: DataFrame = Seq((1, 300, "USD", 20), (3, 400, "MXN", 100), (4, 600, "MXN", 200)).toDF("rightId", "rightAltId", "rightCur", "rightAmt")

leftDF.show(false)
rightDF.show(false)
// Relaxed match: rows join when either id pair is equal.
val idMatchQuery = leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
// Strict match: currency and amount must both be equal.
val currencyMatchQuery = leftDF("leftCur") === rightDF("rightCur") && leftDF("leftAmt") === rightDF("rightAmt")
// After the full outer join, the side that found no partner is null.
val leftOnlyQuery = (col("leftId").isNotNull && col("rightId").isNull) || (col("leftAltId").isNotNull && col("rightAltId").isNull)
val rightOnlyQuery = (col("rightId").isNotNull && col("leftId").isNull) || (col("rightAltId").isNotNull && col("leftAltId").isNull)
val matchQuery = (col("rightId").isNotNull && col("leftId").isNotNull) || (col("rightAltId").isNotNull && col("leftAltId").isNotNull)

val result = leftDF.join(rightDF, idMatchQuery, "fullouter")

val leftOnlyDF = result.filter(leftOnlyQuery)
val rightOnlyDF = result.filter(rightOnlyQuery)

val matchDF = result.filter(matchQuery)
val strictMatchDF = matchDF.filter(currencyMatchQuery.equalTo(true))
val relaxedMatchDF = matchDF.filter(currencyMatchQuery.equalTo(false))

leftOnlyDF.show(false)
rightOnlyDF.show(false)
matchDF.show(false)
strictMatchDF.show(false)
relaxedMatchDF.show(false)

What I'm looking for:

I want to be able to pass the column names to join on as a list and make the code generic.

For example:

    val relaxedJoinList = Array(("leftId", "rightId"), ("leftAltId", "rightAltId"))
    val strictJoinList = Array(("leftCur", "rightCur"), ("leftAmt", "rightAmt"))

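1 Answer:

Answer 0 (score: 0)

  I want to be able to pass the column names to join on as a list and make the code generic.

This is not a perfect suggestion, but it should help you generalize the code. I suggest using foldLeft:
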
val relaxedJoinList = Array(("leftId", "rightId"), ("leftAltId", "rightAltId"))
val rHead = relaxedJoinList.head

val strictJoinList = Array(("leftCur", "rightCur"), ("leftAmt", "rightAmt"))
val sHead = strictJoinList.head

val idMatchQuery = relaxedJoinList.tail.foldLeft(leftDF(rHead._1) === rightDF(rHead._2)){(x, y) => x || leftDF(y._1) === rightDF(y._2)}
val currencyMatchQuery = strictJoinList.tail.foldLeft(leftDF(sHead._1) === rightDF(sHead._2)){(x, y) => x && leftDF(y._1) === rightDF(y._2)}
val leftOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNull){(x, y) => x || col(y._1).isNotNull && col(y._2).isNull}
val rightOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNull && col(rHead._2).isNotNull){(x, y) => x || col(y._1).isNull && col(y._2).isNotNull}
val matchQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNotNull){(x, y) => x || col(y._1).isNotNull && col(y._2).isNotNull}

The rest of the code stays the same as yours.
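
As a rough sketch of how this could be wrapped into one reusable function: the names `JoinSplit` and `GenericJoin.split` below are made up for illustration, and the code assumes both pair lists are non-empty and that no column name appears in both dataframes. `map` followed by `reduce` builds the same predicates as the foldLeft calls, without pulling the head out by hand.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical container for the five outputs; not a Spark type.
case class JoinSplit(leftOnly: DataFrame, rightOnly: DataFrame, matched: DataFrame,
                     strict: DataFrame, relaxed: DataFrame)

object GenericJoin {
  type Pair = (String, String) // (left column name, right column name)

  // Assumption: both lists are non-empty (reduce throws on an empty Seq)
  // and left/right column names are distinct, as in the question.
  def split(left: DataFrame, right: DataFrame,
            relaxedPairs: Seq[Pair], strictPairs: Seq[Pair]): JoinSplit = {
    // OR together one predicate per relaxed pair.
    def anyRelaxed(f: Pair => Column): Column = relaxedPairs.map(f).reduce(_ || _)

    // Join condition: any relaxed pair is equal.
    val idMatch = anyRelaxed { case (l, r) => left(l) === right(r) }
    // Strict condition: every strict pair is equal.
    val strictMatch = strictPairs.map { case (l, r) => left(l) === right(r) }.reduce(_ && _)
    // Null checks on the joined rows, same shape as in the question.
    val isLeftOnly  = anyRelaxed { case (l, r) => col(l).isNotNull && col(r).isNull }
    val isRightOnly = anyRelaxed { case (l, r) => col(l).isNull && col(r).isNotNull }
    val isMatched   = anyRelaxed { case (l, r) => col(l).isNotNull && col(r).isNotNull }

    val joined  = left.join(right, idMatch, "fullouter")
    val matched = joined.filter(isMatched)
    JoinSplit(
      leftOnly  = joined.filter(isLeftOnly),
      rightOnly = joined.filter(isRightOnly),
      matched   = matched,
      strict    = matched.filter(strictMatch),
      relaxed   = matched.filter(!strictMatch))
  }
}
```

Calling `GenericJoin.split(leftDF, rightDF, relaxedJoinList.toSeq, strictJoinList.toSeq)` should then give the same five dataframes as your code. Keep in mind that `===` evaluates to null when either side is null, so a matched row with a null currency or amount ends up in neither the strict nor the relaxed output; `<=>` is Spark's null-safe equality if that matters for your data.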

I hope this answer is helpful.