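I have two DataFrames, left and right. I have a solution that works for my problem, but I need a way to make it generic. My question is at the end of this post.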
+------+---------+-------+-------+
|leftId|leftAltId|leftCur|leftAmt|
+------+---------+-------+-------+
|1 |100 |USD |20 |
|2 |200 |INR |100 |
|4 |500 |MXN |100 |
+------+---------+-------+-------+
+-------+----------+--------+--------+
|rightId|rightAltId|rightCur|rightAmt|
+-------+----------+--------+--------+
|1 |300 |USD |20 |
|3 |400 |MXN |100 |
|4 |600 |MXN |200 |
+-------+----------+--------+--------+
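I want to perform a join between these two DataFrames and get four DataFrames as output:
1. Transactions present in leftDF but not in rightDF
2. Transactions present in rightDF but not in leftDF
3. Transactions that have at least one of the ids in common between the two DataFrames
3.a Strict match: same currency and amount in both DataFrames. Example: the transaction with id 1.
3.b Relaxed match: same id, but a different currency/amount combination. Example: the transaction with id 4.
Here is the desired output:
1. Transactions present in leftDF but not in rightDF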
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|2 |200 |INR |100 |null |null |null |null |
+------+---------+-------+-------+-------+----------+--------+--------+
2. Transactions present in rightDF but not in leftDF
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|null |null |null |null |3 |400 |MXN |100 |
+------+---------+-------+-------+-------+----------+--------+--------+
3. Transactions that have at least one of the ids in common between the two DataFrames
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|1 |100 |USD |20 |1 |300 |USD |20 |
|4 |500 |MXN |100 |4 |600 |MXN |200 |
+------+---------+-------+-------+-------+----------+--------+--------+
3.a Strict match: same currency and amount in both DataFrames. Example: the transaction with id 1.
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|1 |100 |USD |20 |1 |300 |USD |20 |
+------+---------+-------+-------+-------+----------+--------+--------+
3.b Relaxed match: same id, but a different currency/amount combination. Example: the transaction with id 4.
+------+---------+-------+-------+-------+----------+--------+--------+
|leftId|leftAltId|leftCur|leftAmt|rightId|rightAltId|rightCur|rightAmt|
+------+---------+-------+-------+-------+----------+--------+--------+
|4 |500 |MXN |100 |4 |600 |MXN |200 |
+------+---------+-------+-------+-------+----------+--------+--------+
Here is my working code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import sparkSession.implicits._
val leftDF: DataFrame = Seq((1, 100, "USD", 20), (2, 200, "INR", 100), (4, 500, "MXN", 100)).toDF("leftId", "leftAltId", "leftCur", "leftAmt")
val rightDF: DataFrame = Seq((1, 300, "USD", 20), (3, 400, "MXN", 100), (4, 600, "MXN", 200)).toDF("rightId", "rightAltId", "rightCur", "rightAmt")
leftDF.show(false)
rightDF.show(false)
val idMatchQuery = leftDF("leftId") === rightDF("rightId") || leftDF("leftAltId") === rightDF("rightAltId")
val currencyMatchQuery = leftDF("leftCur") === rightDF("rightCur") && leftDF("leftAmt") === rightDF("rightAmt")
val leftOnlyQuery = (col("leftId").isNotNull && col("rightId").isNull) || (col("leftAltId").isNotNull && col("rightAltId").isNull)
val rightOnlyQuery = (col("rightId").isNotNull && col("leftId").isNull) || (col("rightAltId").isNotNull && col("leftAltId").isNull)
val matchQuery = (col("rightId").isNotNull && col("leftId").isNotNull) || (col("rightAltId").isNotNull && col("leftAltId").isNotNull)
val result = leftDF.join(rightDF, idMatchQuery, "fullouter")
val leftOnlyDF = result.filter(leftOnlyQuery)
val rightOnlyDF = result.filter(rightOnlyQuery)
val matchDF = result.filter(matchQuery)
val strictMatchDF = matchDF.filter(currencyMatchQuery.equalTo(true))
val relaxedMatchDF = matchDF.filter(currencyMatchQuery.equalTo(false))
leftOnlyDF.show(false)
rightOnlyDF.show(false)
matchDF.show(false)
strictMatchDF.show(false)
relaxedMatchDF.show(false)
I would like to be able to pass the column names to join on as lists and make the code generic.
For example:
val relaxedJoinList = Array(("leftId", "rightId"), ("leftAltId", "rightAltId"))
val strictJoinList = Array(("leftCur", "rightCur"), ("leftAmt", "rightAmt"))
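Answer 0 (score: 0)
"I would like to be able to pass the column names to join on as lists and make the code generic."
This is not a perfect suggestion, but it should help you generalize the code. The idea is to build each condition with foldLeft, seeding the fold with the first column pair and combining the remaining pairs with || or &&: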
val relaxedJoinList = Array(("leftId", "rightId"), ("leftAltId", "rightAltId"))
val rHead = relaxedJoinList.head
val strictJoinList = Array(("leftCur", "rightCur"), ("leftAmt", "rightAmt"))
val sHead = strictJoinList.head
val idMatchQuery = relaxedJoinList.tail.foldLeft(leftDF(rHead._1) === rightDF(rHead._2)){(x, y) => x || leftDF(y._1) === rightDF(y._2)}
val currencyMatchQuery = strictJoinList.tail.foldLeft(leftDF(sHead._1) === rightDF(sHead._2)){(x, y) => x && leftDF(y._1) === rightDF(y._2)}
val leftOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNull){(x, y) => x || col(y._1).isNotNull && col(y._2).isNull}
val rightOnlyQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNull && col(rHead._2).isNotNull){(x, y) => x || col(y._1).isNull && col(y._2).isNotNull}
val matchQuery = relaxedJoinList.tail.foldLeft(col(rHead._1).isNotNull && col(rHead._2).isNotNull){(x, y) => x || col(y._1).isNotNull && col(y._2).isNotNull}
The rest of the code stays the same as yours.
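As a minimal sketch of how the pieces could be wrapped into one reusable function: the conditions can also be built with map and reduce, which avoids handling the head of the list separately. The names GenericReconcile, Reconciled, anyPair, allPairs and reconcile are hypothetical and not part of Spark. The sketch assumes both pair lists are non-empty and that column names are distinct across the two DataFrames, as they are in the question.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, not}

object GenericReconcile {
  // Hypothetical container for the result DataFrames.
  case class Reconciled(leftOnly: DataFrame, rightOnly: DataFrame,
                        matched: DataFrame, strict: DataFrame, relaxed: DataFrame)

  // OR together one condition per (left, right) column pair.
  // Assumes `pairs` is non-empty; reduce throws on an empty collection.
  def anyPair(pairs: Seq[(String, String)])(f: (String, String) => Column): Column =
    pairs.map { case (l, r) => f(l, r) }.reduce(_ || _)

  // AND together one condition per (left, right) column pair.
  def allPairs(pairs: Seq[(String, String)])(f: (String, String) => Column): Column =
    pairs.map { case (l, r) => f(l, r) }.reduce(_ && _)

  def reconcile(left: DataFrame, right: DataFrame,
                relaxedPairs: Seq[(String, String)],
                strictPairs: Seq[(String, String)]): Reconciled = {
    // Full outer join on "any of the id pairs is equal".
    val joinCond = anyPair(relaxedPairs)((l, r) => left(l) === right(r))
    val joined = left.join(right, joinCond, "fullouter")

    // After the join, columns are referenced by name (names assumed distinct).
    val strictCond = allPairs(strictPairs)((l, r) => col(l) === col(r))
    val matched = joined.filter(
      anyPair(relaxedPairs)((l, r) => col(l).isNotNull && col(r).isNotNull))

    Reconciled(
      leftOnly = joined.filter(
        anyPair(relaxedPairs)((l, r) => col(l).isNotNull && col(r).isNull)),
      rightOnly = joined.filter(
        anyPair(relaxedPairs)((l, r) => col(l).isNull && col(r).isNotNull)),
      matched = matched,
      strict = matched.filter(strictCond),
      relaxed = matched.filter(not(strictCond))
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("reconcile").getOrCreate()
    import spark.implicits._

    val leftDF = Seq((1, 100, "USD", 20), (2, 200, "INR", 100), (4, 500, "MXN", 100))
      .toDF("leftId", "leftAltId", "leftCur", "leftAmt")
    val rightDF = Seq((1, 300, "USD", 20), (3, 400, "MXN", 100), (4, 600, "MXN", 200))
      .toDF("rightId", "rightAltId", "rightCur", "rightAmt")

    val result = reconcile(
      leftDF, rightDF,
      relaxedPairs = Seq("leftId" -> "rightId", "leftAltId" -> "rightAltId"),
      strictPairs = Seq("leftCur" -> "rightCur", "leftAmt" -> "rightAmt"))

    result.leftOnly.show(false)
    result.rightOnly.show(false)
    result.matched.show(false)
    result.strict.show(false)
    result.relaxed.show(false)

    spark.stop()
  }
}
```

One caveat that applies to the original code as well: if a compared currency or amount is null, the strict condition evaluates to null, so that row ends up in neither the strict nor the relaxed result.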
I hope this answer helps.