I have two DataFrames.
DF1
+----------+----+----+----+-----+
|      WEEK|DIM1|DIM2|  T1|   T2|
+----------+----+----+----+-----+
|2016-04-02|  14|NULL|9874|  880|
|2016-04-30|  14|  FR|9875|   13|
|2017-06-10|  15| PQR|9867|57721|
+----------+----+----+----+-----+
DF2
+----------+----+----+----+-----+
|      WEEK|DIM1|DIM2|  T1|   T2|
+----------+----+----+----+-----+
|2016-04-02|  14|NULL|9879|  820|
|2016-04-30|  14|  FR|9785|    9|
|2017-06-10|  15| XYZ|9967|57771|
+----------+----+----+----+-----+
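I want to write a comparator in Spark that compares T1 and T2 between the two DataFrames, matched on WEEK, DIM1 and DIM2: T1 and T2 in df1 should each be greater than their counterparts in df2 by 3. I want to return every row that does not meet this criterion, along with the differences in T1 and T2 between the DataFrames. I also want the rows that are present in df1 but not in df2, and vice versa, based on the combination of WEEK, DIM1 and DIM2.
The output should look like this: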
+----------+----+----+-------+-------+--------------+--------------+
|      WEEK|DIM1|DIM2|T1_DIFF|T2_DIFF|Present_In_DF1|Present_In_DF2|
+----------+----+----+-------+-------+--------------+--------------+
|2016-04-30|  14|  FR|     90|      4|             Y|             Y|
|2017-06-10|  15| PQR|   9867|  57721|             Y|             N|
|2017-06-10|  15| XYZ|   9967|  57771|             N|             Y|
+----------+----+----+-------+-------+--------------+--------------+
What is the best way to solve this?
I have implemented the following, but I don't know how to proceed from here -
import org.apache.spark.sql.functions._
// Outside spark-shell, toDF also needs: import spark.implicits._
val df1 = Seq(
  ("2016-04-02", "14", "NULL", 9874, 880),
  ("2016-04-30", "14", "FR", 9875, 13),
  ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")
val df2 = Seq(
  ("2016-04-02", "14", "NULL", 9879, 820),
  ("2016-04-30", "14", "FR", 9785, 9),
  ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")
// Full outer join on the three key columns keeps rows that exist on only one side
val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
The joined result looks like this -
+----------+----+----+----+-----+----+-----+
| WEEK|DIM1|DIM2| T1| T2| T1| T2|
+----------+----+----+----+-----+----+-----+
|2016-04-02| 14|NULL|9874| 880|9879| 820|
|2017-06-10| 15| PQR|9867|57721|null| null|
|2017-06-10| 15| XYZ|null| null|9967|57771|
|2016-04-30| 14| FR|9875| 13|9785| 9|
+----------+----+----+----+-----+----+-----+
I am not sure how to proceed from here in a clean way; I am relatively new to Scala.
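Answer 0 (score: 0)
A simple solution is to join df1 and df2 using WEEK as the unique key. In the joined data you need to keep all the columns from both df1 and df2.
You can then perform a map operation on the DataFrame to generate the remaining columns.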
Something like this:
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
val df = spark.sql("select df1.*, df2.DIM1 as df2_DIM1, df2.DIM2 as df2_DIM2, df2.T1 as df2_T1, df2.T2 as df2_T2 from df1 join df2 on df1.WEEK = df2.WEEK")
// Now map on the dataframe to produce the diff dataframe
// Or you can use the SQL to do that.
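The remaining step could be sketched with the DataFrame API as below. This is only a sketch under stated assumptions, not a definitive implementation. Unlike the SQL above, it uses a full outer join on all three keys (WEEK, DIM1, DIM2), as the question's own attempt does, so that rows present on only one side survive the join. The names `CompareFrames`, `diffReport`, `threshold`, `L_T1`, `R_T1`, `IN_DF1` and so on are hypothetical, and the output column names are modeled on the desired output. The rule in `meets` (each df1 value exceeds its df2 counterpart by at least `threshold`) is one reading of the question's prose; the sample output in the question may follow a different rule, so that predicate is the part to adjust.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

object CompareFrames {
  private val keys = Seq("WEEK", "DIM1", "DIM2")

  // Hypothetical helper: returns rows that fail the assumed rule,
  // plus rows that exist in only one of the two DataFrames.
  def diffReport(df1: DataFrame, df2: DataFrame, threshold: Int): DataFrame = {
    // Rename the value columns so both sides stay distinguishable after the join,
    // and add a marker column that is null when the row is missing on that side.
    val left = df1.select(keys.map(col) ++ Seq(
      col("T1").as("L_T1"), col("T2").as("L_T2"), lit("Y").as("IN_DF1")): _*)
    val right = df2.select(keys.map(col) ++ Seq(
      col("T1").as("R_T1"), col("T2").as("R_T2"), lit("Y").as("IN_DF2")): _*)

    val joined = left.join(right, keys, "full_outer")

    val inBoth = col("IN_DF1").isNotNull && col("IN_DF2").isNotNull
    // ASSUMPTION: "greater by 3" is read as "difference of at least threshold".
    val meets = (col("L_T1") - col("R_T1") >= threshold) &&
      (col("L_T2") - col("R_T2") >= threshold)

    joined
      .filter(!inBoth || !meets) // one-sided rows, or matched rows failing the rule
      .select(keys.map(col) ++ Seq(
        // When one side is missing the subtraction is null, so fall back to
        // whichever value exists (this mirrors the question's desired output).
        coalesce(col("L_T1") - col("R_T1"), col("L_T1"), col("R_T1")).as("T1_DIFF"),
        coalesce(col("L_T2") - col("R_T2"), col("L_T2"), col("R_T2")).as("T2_DIFF"),
        when(col("IN_DF1").isNotNull, "Y").otherwise("N").as("Present_In_DF1"),
        when(col("IN_DF2").isNotNull, "Y").otherwise("N").as("Present_In_DF2")): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("compare").getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      ("2016-04-02", "14", "NULL", 9874, 880),
      ("2016-04-30", "14", "FR", 9875, 13),
      ("2017-06-10", "15", "PQR", 9867, 57721)
    ).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")
    val df2 = Seq(
      ("2016-04-02", "14", "NULL", 9879, 820),
      ("2016-04-30", "14", "FR", 9785, 9),
      ("2017-06-10", "15", "XYZ", 9967, 57771)
    ).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")

    diffReport(df1, df2, threshold = 3).orderBy("WEEK", "DIM2").show()
    spark.stop()
  }
}
```

Keeping the rule in a single `Column` expression means only `meets` has to change if the intended criterion is different (for example, an absolute difference, or a separate threshold per column). Note that DIM2 holds the string "NULL" in the sample data; with real nulls in a key column, an equality join would not match those rows.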