How do I union DataFrames and add only the missing rows?

Asked: 2017-06-03 19:05:01

Tags: scala apache-spark apache-spark-sql

I have a DataFrame df1 that contains the following data:

**customer_id**   **product**   **Val_id**    **rule_name**
     1               A            1               rule1
     2               B            X               rule1

I have another DataFrame df2 that contains the following data:

**customer_id**   **product**   **Val_id**    **rule_name**
     1               A            1               rule2
     2               B            X               rule2
     3               C            y               rule2

Within each DataFrame the rule_name value is always fixed (rule1 in df1, rule2 in df2).

I want a new, combined DataFrame df3. It should contain all customers from df1, plus any customers from df2 that do not exist in df1. So the final df3 should look like this:

**customer_id**   **product**   **Val_id**        **rule_name**
         1               A            1               rule1
         2               B            X               rule1
         3               C            y               rule2

Can anyone help me achieve this result? Any help would be appreciated.

1 Answer:

Answer 0 (score: 1)

Given the following datasets:

val df1 = Seq(
  (1, "A", "1", "rule1"),
  (2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df2 = Seq(
  (1, "A", "1", "rule2"),
  (2, "B", "X", "rule2"),
  (3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")

Requirement:

"It should contain all customers from DataFrame df1, plus any customers from DataFrame df2 that do not exist in df1."

My first solution is as follows:

val missingCustomers = df2.
  join(df1, Seq("customer_id"), "leftanti").
  select($"customer_id", df2("product"), df2("Val_id"), df2("rule_name"))
val all = df1.union(missingCustomers)
scala> all.show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
|          1|      A|     1|    rule1|
|          2|      B|     X|    rule1|
|          3|      C|     y|    rule2|
+-----------+-------+------+---------+
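
A possible simplification of the join above, as a minimal sketch only. It assumes Spark 2.3 or later (unionByName does not exist in earlier versions) and a local SparkSession built just for illustration; in spark-shell the session already exists. As far as I know, a leftanti join returns only the columns of the left side, so the explicit select should not be needed, and unionByName matches columns by name rather than by position, which guards against a different column order in the two DataFrames.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("union-missing-rows")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  (1, "A", "1", "rule1"),
  (2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df2 = Seq(
  (1, "A", "1", "rule2"),
  (2, "B", "X", "rule2"),
  (3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")

// leftanti keeps the df2 rows whose customer_id has no match in df1,
// and the result carries df2's columns only.
val missingCustomers = df2.join(df1, Seq("customer_id"), "leftanti")

// unionByName (Spark 2.3+) lines columns up by name, not by position.
val all = df1.unionByName(missingCustomers)

all.orderBy("customer_id").show()
```

On older Spark versions, plain union works the same way here because both DataFrames already share the same column order.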

Another (perhaps slower) solution could be as follows:

// Find the missing ids, i.e. ids in df2 that are not in df1.
// CAUTION: collect pulls every missing id onto the driver,
// so this only suits a small number of missing ids.
val missingIds = df2.
  select("customer_id").
  except(df1.select("customer_id")).
  as[Int].
  collect

// filter ids in df2 that match missing ids
val missingRows = df2.filter($"customer_id" isin (missingIds: _*))

scala> df1.union(missingRows).show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
|          1|      A|     1|    rule1|
|          2|      B|     X|    rule1|
|          3|      C|     y|    rule2|
+-----------+-------+------+---------+
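
Since the question is also tagged apache-spark-sql, the same idea can be sketched in plain SQL. This is only an illustration: the temporary view names customers1 and customers2 are made up for this example, and the local SparkSession is an assumption (spark-shell already provides one). Unlike the collect-based variant, nothing is pulled onto the driver.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("union-missing-rows-sql")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  (1, "A", "1", "rule1"),
  (2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df2 = Seq(
  (1, "A", "1", "rule2"),
  (2, "B", "X", "rule2"),
  (3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")

// Hypothetical view names, chosen only for this sketch.
df1.createOrReplaceTempView("customers1")
df2.createOrReplaceTempView("customers2")

// All rows of customers1, plus the customers2 rows with no matching customer_id.
val allSql = spark.sql("""
  SELECT * FROM customers1
  UNION ALL
  SELECT c2.* FROM customers2 c2
  LEFT ANTI JOIN customers1 c1 ON c2.customer_id = c1.customer_id
""")

allSql.orderBy("customer_id").show()
```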