I have a DataFrame df1 containing the following data:

| customer_id | product | Val_id | rule_name |
|-------------|---------|--------|-----------|
| 1           | A       | 1      | rule1     |
| 2           | B       | X      | rule1     |

I have another DataFrame df2 containing the following data:

| customer_id | product | Val_id | rule_name |
|-------------|---------|--------|-----------|
| 1           | A       | 1      | rule2     |
| 2           | B       | X      | rule2     |
| 3           | C       | y      | rule2     |

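The rule_name value is always fixed within each of the two DataFrames.
I want a new combined DataFrame df3. It should contain all customers from df1, plus any customers from df2 that do not exist in df1. The final df3 should look like this: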

| customer_id | product | Val_id | rule_name |
|-------------|---------|--------|-----------|
| 1           | A       | 1      | rule1     |
| 2           | B       | X      | rule1     |
| 3           | C       | y      | rule2     |

Can anyone help me achieve this result? Any help would be greatly appreciated.
Answer 0 (score: 1)
Given the following datasets:
val df1 = Seq(
(1, "A", "1", "rule1"),
(2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")
val df2 = Seq(
(1, "A", "1", "rule2"),
(2, "B", "X", "rule2"),
(3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")
Requirement:
It should contain all customers from df1, plus any customers from df2 that do not exist in df1.
My first solution is as follows:
val missingCustomers = df2.
join(df1, Seq("customer_id"), "leftanti").
select($"customer_id", df2("product"), df2("Val_id"), df2("rule_name"))
val all = df1.union(missingCustomers)
scala> all.show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
| 1| A| 1| rule1|
| 2| B| X| rule1|
| 3| C| y| rule2|
+-----------+-------+------+---------+
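The anti-join approach above can also be wrapped in a small reusable function. The following is only a sketch under some assumptions: the helper name `unionPreferring` is made up for illustration, a local `SparkSession` is created so the snippet runs as a standalone script (in spark-shell the session and implicits already exist, so those lines can be skipped), and `unionByName` requires Spark 2.3 or later. As far as I know, a join on `Seq(key)` moves the key column to the front of the result, so matching columns by name rather than by position seems safer when the key is not the first column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for experimentation only (assumption: not running inside spark-shell).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("prefer-left-union")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: every row of `preferred`, plus the rows of `fallback`
// whose `key` value does not occur in `preferred`.
def unionPreferring(preferred: DataFrame, fallback: DataFrame, key: String): DataFrame = {
  // A left anti join keeps only the fallback rows that have no match in `preferred`.
  val onlyInFallback = fallback.join(preferred, Seq(key), "leftanti")
  // Columns are matched by name, so a changed column order cannot misalign the data.
  preferred.unionByName(onlyInFallback)
}

val df1 = Seq(
  (1, "A", "1", "rule1"),
  (2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df2 = Seq(
  (1, "A", "1", "rule2"),
  (2, "B", "X", "rule2"),
  (3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df3 = unionPreferring(df1, df2, "customer_id")
df3.orderBy("customer_id").show()
```

Note that the left anti join already returns only the columns of its left side, so no extra `select` is needed inside the helper.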
Another (perhaps slower) solution could be as follows:
// find missing ids, i.e. ids in df2 that are not in df1
// BE EXTRA CAREFUL: "Downloading" all missing ids to the driver
val missingIds = df2.
select("customer_id").
except(df1.select("customer_id")).
as[Int].
collect
// filter ids in df2 that match missing ids
val missingRows = df2.filter($"customer_id" isin (missingIds: _*))
scala> df1.union(missingRows).show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
| 1| A| 1| rule1|
| 2| B| X| rule1|
| 3| C| y| rule2|
+-----------+-------+------+---------+
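If collecting the missing ids to the driver is a concern, the same idea can probably be kept fully distributed by joining df2 back against the `except` result instead of using `isin`. This is a sketch rather than a tuned solution: the name `missingIdsDf` and the local `SparkSession` setup are assumptions added for illustration (in spark-shell the session and implicits are already available).

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation only (assumption: not running inside spark-shell).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("missing-ids-no-collect")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(
  (1, "A", "1", "rule1"),
  (2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df2 = Seq(
  (1, "A", "1", "rule2"),
  (2, "B", "X", "rule2"),
  (3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")

// ids that occur in df2 but not in df1; this stays a distributed DataFrame,
// nothing is brought to the driver.
val missingIdsDf = df2.select("customer_id").except(df1.select("customer_id"))

// A left semi join keeps the df2 rows whose id appears in missingIdsDf
// and returns only df2's columns.
val missingRows = df2.join(missingIdsDf, Seq("customer_id"), "leftsemi")

// Positional union is fine here because customer_id is already the first column.
val result = df1.union(missingRows)
result.orderBy("customer_id").show()
```

Whether this is faster than the single left anti join from the first solution depends on the data; it mainly avoids the driver-side `collect`.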