Comparing DataFrames in Spark

Date: 2018-04-23 17:36:20

Tags: scala apache-spark dataframe spark-dataframe

I have two DataFrames:

DF1

+----------+----------------+--------------------+--------------+-------------+
|      WEEK|DIM1            |DIM2                |T1            |  T2         |
+----------+----------------+--------------------+--------------+-------------+
|2016-04-02|              14|                NULL|          9874|   880       |
|2016-04-30|              14|FR                  |          9875|    13       |
|2017-06-10|              15|                 PQR|          9867| 57721       |
+----------+----------------+--------------------+--------------+-------------+

DF2

+----------+----------------+--------------------+--------------+-------------+
|      WEEK|DIM1            |DIM2                |T1            |  T2         |
+----------+----------------+--------------------+--------------+-------------+
|2016-04-02|              14|                NULL|          9879|   820       |
|2016-04-30|              14|FR                  |          9785|    9        |
|2017-06-10|              15|                 XYZ|          9967| 57771       |
+----------+----------------+--------------------+--------------+-------------+

I want to write a comparator in Spark that matches the two DataFrames on WEEK, DIM1 and DIM2 and compares T1 and T2; T1 and T2 in df1 should be greater than the corresponding T1 and T2 in df2 by 3. I want to return all rows that do not satisfy this criterion, along with the difference in T1 and T2 between the DataFrames. I also want the rows that are present in df1 but not in df2, and vice versa, for the same combination of WEEK, DIM1 and DIM2.

The output should look like this (for example, in the FR row T1_DIFF = 9875 - 9785 = 90 and T2_DIFF = 13 - 9 = 4):

+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
|      WEEK|DIM1            |DIM2                |T1_DIFF       |  T2_DIFF    | Present_In_DF1   | Present_In_DF2  |
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+
|2016-04-30|              14|FR                  |            90|    4        | Y                | Y               |
|2017-06-10|              15|PQR                 |          9867|    57721    | Y                | N               |
|2017-06-10|              15|XYZ                 |          9967|    57771    | N                | Y               |
+----------+----------------+--------------------+--------------+-------------+------------------+-----------------+

What is the best way to approach this problem?

I have implemented the following, but I'm not sure how to proceed from here -

import spark.implicits._ // assumes a SparkSession named `spark`; needed for .toDF

val df1 = Seq(
  ("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13), ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")

val df2 = Seq(
  ("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9), ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")

import org.apache.spark.sql.functions._

val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")

The joined DataFrame looks like this -

+----------+----+----+----+-----+----+-----+
|      WEEK|DIM1|DIM2|  T1|   T2|  T1|   T2|
+----------+----+----+----+-----+----+-----+
|2016-04-02|  14|NULL|9874|  880|9879|  820|
|2017-06-10|  15| PQR|9867|57721|null| null|
|2017-06-10|  15| XYZ|null| null|9967|57771|
|2016-04-30|  14|  FR|9875|   13|9785|    9|
+----------+----+----+----+-----+----+-----+

I'm not sure how to proceed from here in a clean way; I'm relatively new to Scala.

1 Answer:

Answer 0: (score: 0)

A simple solution would be to join df1 and df2 using WEEK as the key. In the joined data, you need to keep all the columns from both df1 and df2.

Then you can run a map operation over the joined DataFrame to produce the remaining columns.

Something like this:

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
val df = spark.sql("select df1.*, df2.DIM1 as df2_DIM1, df2.DIM2 as df2_DIM2, df2.T1 as df2_T1, df2.T2 as df2_T2 from df1 join df2 on df1.WEEK = df2.WEEK")
// Now map over the joined dataframe to produce the diff dataframe,
// or do the same directly in SQL.
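
For completeness, here is a minimal sketch of that remaining step, building on the asker's `joined` DataFrame (the full outer join with aliases `l` and `r`) rather than the WEEK-only inner join above, since the full outer join also surfaces rows that exist on only one side. The output column names (T1_DIFF, Present_In_DF1, ...) are taken from the desired result; the filter is one assumed reading of the "greater by 3" criterion:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{abs, col, when}

// Diff helper: when a row is missing on one side, fall back to the value
// from the side that exists (as in the PQR / XYZ rows of the desired output).
def diff(name: String): Column =
  when(col(s"l.$name").isNull, col(s"r.$name"))
    .when(col(s"r.$name").isNull, col(s"l.$name"))
    .otherwise(col(s"l.$name") - col(s"r.$name"))

val result = joined
  .withColumn("T1_DIFF", diff("T1"))
  .withColumn("T2_DIFF", diff("T2"))
  .withColumn("Present_In_DF1", when(col("l.T1").isNotNull, "Y").otherwise("N"))
  .withColumn("Present_In_DF2", when(col("r.T1").isNotNull, "Y").otherwise("N"))
  // Assumed rule: keep rows where either metric differs by more than 3,
  // or the row is missing from one of the DataFrames.
  .filter(abs(col("T1_DIFF")) > 3 || abs(col("T2_DIFF")) > 3 ||
    col("Present_In_DF1") === "N" || col("Present_In_DF2") === "N")
  .select("WEEK", "DIM1", "DIM2", "T1_DIFF", "T2_DIFF",
    "Present_In_DF1", "Present_In_DF2")

result.show()

Note that under this particular filter the 2016-04-02 row is also kept (its diffs are -5 and 60), so the condition would need tightening to reproduce the expected output exactly.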