I have two DataFrames:
Table1:
+----+----------+----+------------+
|Col1|Col2      |Col3|Col4        |
+----+----------+----+------------+
|AB  |California|1234|SFO         |
|AB  |Utah      |5678|SaltLakeCity|
|CD  |Nevada    |8901|Lasvegas    |
|CD  |Arizona   |2345|Pheonix     |
|EFGH|Washington|0009|Redmond     |
+----+----------+----+------------+
Table2:
+----+----------+----+------------+
|Col1|Col2      |Col3|Col4        |
+----+----------+----+------------+
|AB  |California|1234|SFO         |
|AB  |Utah      |0000|SaltLakeCity|
|CD  |Nevada    |8901|Lasvegas    |
|CD  |Arizona   |2345|Sedona      |
|EF  |Texas     |6789|ElPaso      |
+----+----------+----+------------+
Expected output:
+----+----------+----+------------+------------------------+
|Col1|Col2      |Col3|Col4        |Result                  |
+----+----------+----+------------+------------------------+
|AB  |California|1234|SFO         |Match                   |
|AB  |Utah      |0000|SaltLakeCity|Data in Col3 not matched|
|CD  |Nevada    |8901|Lasvegas    |Match                   |
|CD  |Arizona   |2345|Sedona      |Data in Col4 not matched|
|EF  |Texas     |6789|ElPaso      |Extra row in Table2     |
|EFGH|Washington|0009|Redmond     |Missing row in Table2   |
+----+----------+----+------------+------------------------+
I want to compare the two tables above, find out whether they differ, and output the differences.
I created two DataFrames for the tables:
// Assumes `spark` is the active SparkSession (as in spark-shell).
import spark.implicits._
import spark.sql

val df1 = sql("select * from table1")
val df2 = sql("select * from table2")
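I am very new to Spark and Scala, and would greatly appreciate any help with comparing these two tables, which have no key to join on, and returning the differences.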
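To make the goal concrete, here is a rough sketch of the kind of comparison I have in mind, not a verified solution. It rests on assumptions that are not stated anywhere above: that the pair (Col1, Col2) can stand in for a row key (which is what the expected output seems to imply), that this pair is unique within each table, and that Col3 and Col4 are strings. The object name `TableDiff` and the helper column names (`l3`, `r3`, `inT1`, `inT2`) are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit, when}

// Hypothetical helper; the name and the choice of key columns are assumptions.
object TableDiff {
  // Assumption: (Col1, Col2) identifies a row in both tables.
  private val keys = Seq("Col1", "Col2")

  def compare(table1: DataFrame, table2: DataFrame): DataFrame = {
    // Rename the non-key columns and add a presence flag to each side.
    val left = table1.select(col("Col1"), col("Col2"),
      col("Col3").as("l3"), col("Col4").as("l4"), lit(true).as("inT1"))
    val right = table2.select(col("Col1"), col("Col2"),
      col("Col3").as("r3"), col("Col4").as("r4"), lit(true).as("inT2"))

    // A full outer join keeps rows that exist on only one side.
    val joined = left.join(right, keys, "full_outer")

    // Names of the columns whose values differ. <=> is null-safe equality,
    // and concat_ws skips the nulls produced by a `when` without `otherwise`.
    val differing = concat_ws(", ",
      when(!(col("l3") <=> col("r3")), lit("Col3")),
      when(!(col("l4") <=> col("r4")), lit("Col4")))

    val result = when(col("inT2").isNull, lit("Missing row in Table2"))
      .when(col("inT1").isNull, lit("Extra row in Table2"))
      .when(differing === "", lit("Match"))
      .otherwise(concat(lit("Data in "), differing, lit(" not matched")))

    joined.select(
      col("Col1"), col("Col2"),
      // Show Table2's values when the row exists there, otherwise Table1's.
      when(col("inT2").isNotNull, col("r3")).otherwise(col("l3")).as("Col3"),
      when(col("inT2").isNotNull, col("r4")).otherwise(col("l4")).as("Col4"),
      result.as("Result"))
  }
}
```

With the DataFrames above this would be called as `TableDiff.compare(df1, df2).show(false)`. If (Col1, Col2) is not actually unique, the join would pair every duplicate with every other, so a different way of matching rows would be needed.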