I have two DataFrames.
DF1
+----+--------+-------+
| id | amount | fee   |
+----+--------+-------+
| 1  | 10.00  | 5.0   |
| 2  | 20.0   | 3.0   |
| 3  | 90     | 130.0 |
| 4  | 120.0  | 35.0  |
+----+--------+-------+
DF2
+------+----------+-------+
| exId | exAmount | exFee |
+------+----------+-------+
| 1    | 10.00    | 5.0   |
| 2    | 20.0     | 3.0   |
| 3    | 20.0     | 3.0   |
| 4    | 120.0    | 3.0   |
+------+----------+-------+
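I need to do the following:
1. Find the common rows where all three columns match, e.g. ids 1 and 2 in the example above.
2. Find the common rows where (id, exId) match but the other columns differ, e.g. ids 3 and 4 in the example above. It would be useful to identify which columns do not match.
So the output would look like this: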
Exact match
+----+--------+-----+------+----------+-------+
| id | amount | fee | exId | exAmount | exFee |
+----+--------+-----+------+----------+-------+
| 1  | 10.00  | 5.0 | 1    | 10.00    | 5.0   |
| 2  | 20.0   | 3.0 | 2    | 20.00    | 3.0   |
+----+--------+-----+------+----------+-------+
Non-exact match
+----+--------+-------+------+----------+-------+----------------+
| id | amount | fee   | exId | exAmount | exFee | mismatchFields |
+----+--------+-------+------+----------+-------+----------------+
| 3  | 90.00  | 130.0 | 3    | 20.00    | 3.0   | [fee, amount]  |
| 4  | 120.0  | 35.0  | 4    | 120.00   | 3.0   | [fee]          |
+----+--------+-------+------+----------+-------+----------------+
Any ideas?
Answer 0 (score: 2)
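"Find the common rows where all three columns match, e.g. ids 1 and 2 in the example above."
This is simple: check all three columns for equality in the join condition.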
df1.join(df2, df1("id") === df2("exId") && df1("amount") === df2("exAmount") && df1("fee") === df2("exFee")).show(false)
This should give you
+---+------+---+----+--------+-----+
|id |amount|fee|exId|exAmount|exFee|
+---+------+---+----+--------+-----+
|1 |10.00 |5.0|1 |10.00 |5.0 |
|2 |20.0 |3.0|2 |20.0 |3.0 |
+---+------+---+----+--------+-----+
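"Find the common rows where (id, exId) match but the other columns differ, e.g. ids 3 and 4 in the example above. It would be useful to identify which columns do not match."
For this, join on equality of the first column together with inequality in at least one of the other two columns, then use when conditions to build the last column, mismatchFields.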
import org.apache.spark.sql.functions._

// Join on id, keeping only rows where at least one of the other columns differs
df1.join(df2, df1("id") === df2("exId") && (df1("amount") =!= df2("exAmount") || df1("fee") =!= df2("exFee")))
  .withColumn("mismatchFields",
    when(col("amount") =!= col("exAmount") && col("fee") =!= col("exFee"), array(lit("amount"), lit("fee")))
      .otherwise(
        when(col("amount") === col("exAmount") && col("fee") =!= col("exFee"), array(lit("fee")))
          .otherwise(array(lit("amount")))))
  .show(false)
This should give you
+---+------+-----+----+--------+-----+--------------+
|id |amount|fee |exId|exAmount|exFee|mismatchFields|
+---+------+-----+----+--------+-----+--------------+
|3 |90 |130.0|3 |20.0 |3.0 |[amount, fee] |
|4 |120.0 |35.0 |4 |120.0 |3.0 |[fee] |
+---+------+-----+----+--------+-----+--------------+
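The nested when chain grows quickly as more columns are compared. As a minimal sketch (not part of the original answer), the mismatch list can instead be built from a sequence of column pairs. The names `comparePairs`, `labels` and `mismatchCsv` are hypothetical, the sample amounts are typed as Double (an assumption, since the question does not state the column types), and it relies on `concat_ws` skipping null inputs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for experimentation; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("mismatch-sketch").getOrCreate()
import spark.implicits._

// Sample data mirroring the question (numeric columns typed as Double here, an assumption).
val df1 = Seq((1, 10.0, 5.0), (2, 20.0, 3.0), (3, 90.0, 130.0), (4, 120.0, 35.0)).toDF("id", "amount", "fee")
val df2 = Seq((1, 10.0, 5.0), (2, 20.0, 3.0), (3, 20.0, 3.0), (4, 120.0, 3.0)).toDF("exId", "exAmount", "exFee")

// Hypothetical helper: the column pairs to compare (left name -> right name).
val comparePairs = Seq("amount" -> "exAmount", "fee" -> "exFee")

// For each pair, yield the left column's name when the values differ, otherwise null.
val labels = comparePairs.map { case (l, r) => when(col(l) =!= col(r), lit(l)) }

val withMismatch = df1.join(df2, col("id") === col("exId"))
  // concat_ws skips nulls, so only the names of differing columns are joined
  .withColumn("mismatchCsv", concat_ws(",", labels: _*))

// An empty string means every compared pair was equal.
val exact = withMismatch.filter(col("mismatchCsv") === "").drop("mismatchCsv")

val mismatched = withMismatch.filter(col("mismatchCsv") =!= "")
  .withColumn("mismatchFields", split(col("mismatchCsv"), ","))
  .drop("mismatchCsv")

exact.show(false)
mismatched.show(false)
```

Adding another column to compare then only means appending a pair to `comparePairs`. Note that `=!=` evaluates to null when either side is null, so rows with missing values would not be labelled by this sketch.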
I hope the answer is helpful.
Answer 1 (score: 0)
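import org.apache.spark.sql.functions._

// Join on id and flag each row: 1 when both fee and amount match, 0 otherwise
val joinedDF = df1.join(df2, df1.col("id") === df2.col("exId"))
  .withColumn("match",
    when(col("fee") === col("exFee") && col("amount") === col("exAmount"), lit(1))
      .otherwise(lit(0)))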
Matched data:
val matchedDF = joinedDF.filter("match=1")
Non-matched data:
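// =!= builds a Column expression; != would compare the Column objects themselves.
// lit(...) puts the field names into the array rather than the column values.
val notMatchedDF = joinedDF.filter("match=0")
  .withColumn("mismatchedFields",
    when(col("fee") =!= col("exFee") && col("amount") =!= col("exAmount"), array(lit("fee"), lit("amount")))
      .otherwise(when(col("fee") =!= col("exFee"), array(lit("fee")))
        .otherwise(array(lit("amount")))))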
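One caveat with comparison-based flags like these: if a compared column can be null, an ordinary comparison yields null rather than true or false, so a missing value may not be reported as a mismatch. The following is a minimal sketch of the difference, using Spark's null-safe equality operator `<=>`. The data here is hypothetical (it is not from the question), as are the names `left`, `right`, `plainDiffers` and `nullSafeDiffers`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for experimentation; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("null-safe-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: id 2 has a missing fee on the left side.
val left = Seq[(Int, Option[Double])]((1, Some(5.0)), (2, None)).toDF("id", "fee")
val right = Seq((1, 5.0), (2, 3.0)).toDF("exId", "exFee")

val compared = left.join(right, col("id") === col("exId"))
  .select(
    col("id"),
    (col("fee") =!= col("exFee")).as("plainDiffers"),        // null when either side is null
    not(col("fee") <=> col("exFee")).as("nullSafeDiffers"))  // always true or false

compared.orderBy("id").show(false)
```

Whether a null on one side should count as a mismatch depends on the data; if it should, `not(a <=> b)` can stand in for `a =!= b` in the conditions above.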