Joining DataFrames on multiple columns in Scala

Time: 2017-07-24 13:55:29

Tags: scala

We have two DataFrames here:

Table1:
+----+----------+----+------------+
|Col1|Col2      |Col3|Col4        |
+----+----------+----+------------+
|AB  |California|1234|SFO         |
|AB  |Utah      |5678|SaltLakeCity|
|CD  |Nevada    |8901|Lasvegas    |
|CD  |Arizona   |2345|Pheonix     |
|EFGH|Washington|0009|Redmond     |
+----+----------+----+------------+

Table2:

+----+----------+----+------------+
|Col1|Col2      |Col3|Col4        |
+----+----------+----+------------+
|AB  |California|1234|SFO         |
|AB  |Utah      |0000|SaltLakeCity|
|CD  |Nevada    |8901|Lasvegas    |
|CD  |Arizona   |2345|Sedona      |
|EF  |Texas     |6789|ElPaso      |
+----+----------+----+------------+

Expected output:
+----+----------+----+------------+------------------------+
|Col1|Col2      |Col3|Col4        |Result                  |
+----+----------+----+------------+------------------------+
|AB  |California|1234|SFO         |Match                   |
|AB  |Utah      |0000|SaltLakeCity|Data in Col3 not matched|
|CD  |Nevada    |8901|Lasvegas    |Match                   |
|CD  |Arizona   |2345|Sedona      |Data in Col4 not matched|
|EF  |Texas     |6789|ElPaso      |Extra row in Table2     |
|EFGH|Washington|0009|Redmond     |Missing row in Table2   |
+----+----------+----+------------+------------------------+

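For the two tables above, I want to compare them, find out whether there are any differences between them, and output those differences.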

I created two DataFrames for the tables:

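    // `spark` is the active SparkSession; import its implicits only once
    // (importing a second set of implicits would make them ambiguous)
    import spark.implicits._
    import spark.sql

    // Load each table into its own DataFrame
    val df1 = sql("select * from table1")
    val df2 = sql("select * from table2")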
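
For anyone who wants to reproduce this without the original tables, here is a minimal sketch that registers the sample rows shown above as temporary views named `table1` and `table2`. The object `SampleTables` and its `register` method are hypothetical names introduced only for illustration, and the sketch assumes Spark 2.x or later is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object SampleTables {
  // Hypothetical helper (not from the original post): registers the sample
  // rows from the question as temp views "table1" and "table2".
  def register(spark: SparkSession): Unit = {
    import spark.implicits._
    val cols = Seq("Col1", "Col2", "Col3", "Col4")

    // Col3 is kept as a string so that leading zeros ("0009") are preserved.
    Seq(
      ("AB", "California", "1234", "SFO"),
      ("AB", "Utah", "5678", "SaltLakeCity"),
      ("CD", "Nevada", "8901", "Lasvegas"),
      ("CD", "Arizona", "2345", "Pheonix"),
      ("EFGH", "Washington", "0009", "Redmond")
    ).toDF(cols: _*).createOrReplaceTempView("table1")

    Seq(
      ("AB", "California", "1234", "SFO"),
      ("AB", "Utah", "0000", "SaltLakeCity"),
      ("CD", "Nevada", "8901", "Lasvegas"),
      ("CD", "Arizona", "2345", "Sedona"),
      ("EF", "Texas", "6789", "ElPaso")
    ).toDF(cols: _*).createOrReplaceTempView("table2")
  }
}
```

After calling `SampleTables.register(spark)` on a local session, `spark.sql("select * from table1")` should return the five rows of Table1.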

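I am very new to Spark and Scala, and would greatly appreciate any help with comparing these two tables, which have no join key, and returning the differences.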
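
One possible approach, sketched under a loud assumption: although the tables have no declared key, the expected output pairs rows by `Col1` and `Col2`, so the sketch below treats that pair as the join key and does a full outer join on both columns. The object `TableDiff` and its `compare` method are hypothetical names, and the code assumes Spark 2.x or later with non-null key columns.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

object TableDiff {
  // Hypothetical helper (not from the original post).
  // Assumption: (Col1, Col2) together identify a row in each table.
  def compare(table1: DataFrame, table2: DataFrame): DataFrame = {
    val a = table1.alias("a")
    val b = table2.alias("b")

    // Join on both key columns at once.
    val sameKey: Column =
      col("a.Col1") === col("b.Col1") && col("a.Col2") === col("b.Col2")

    // Classify each joined row. Only the first differing column is reported,
    // and a null in Col3 or Col4 falls through to "Match" in this sketch.
    val result: Column =
      when(col("b.Col1").isNull, lit("Missing row in Table2"))
        .when(col("a.Col1").isNull, lit("Extra row in Table2"))
        .when(col("a.Col3") =!= col("b.Col3"), lit("Data in Col3 not matched"))
        .when(col("a.Col4") =!= col("b.Col4"), lit("Data in Col4 not matched"))
        .otherwise(lit("Match"))

    // Prefer Table2's values, falling back to Table1's for rows missing there.
    a.join(b, sameKey, "outer").select(
      coalesce(col("b.Col1"), col("a.Col1")).as("Col1"),
      coalesce(col("b.Col2"), col("a.Col2")).as("Col2"),
      coalesce(col("b.Col3"), col("a.Col3")).as("Col3"),
      coalesce(col("b.Col4"), col("a.Col4")).as("Col4"),
      result.as("Result")
    )
  }
}
```

With the two DataFrames from the question, this would be called as `TableDiff.compare(df1, df2).show(false)`. If the real data has a different natural key, or if several columns can differ in the same row, the join condition and the `when` chain would need to be adapted.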

0 answers:

No answers yet