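I have two Spark DataFrames that I need to join. Only the values from df2 that are also present in df1 should be selected, and no rows should be repeated.
For example: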
df1:
+-------------+---------------+----------+
|a |b |val |
+-------------+---------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
| 202003101730| 202003101700|1930860788|
| 202003101730| 202003101600| 101713|
| 202003101720| 202003101700|1261542412|
| 202003101720| 202003101600| 1824155|
| 202003101710| 202003101700| 912601761|
+-------------+---------------+----------+
df2:
+-------------+---------------+
|a |b |
+-------------+---------------+
| 202003101800| 202003101700|
| 202003101800| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101750| 202003101700|
| 202003101740| 202003101700|
| 202003101740| 202003101700|
+-------------+---------------+
I am doing the following:
df1.join(df2, Seq("a", "b"), "leftouter").where(col("val").isNotNull)
But my output contains several repeated rows.
+-------------+---------------+----------+
|a |b |val |
+-------------+---------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
| 202003101740| 202003101700|1590554927|
+-------------+---------------+----------+
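I am trying to achieve an except-like operation, as if val were dropped from df1. But except does not seem to work.
For example, the following is the desired operation: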
df1.drop(col("val")).except("df2")
The schema of df1 is as follows:
root
|-- a: String (nullable = true)
|-- b: String (nullable = true)
|-- val: long (nullable = true)
Also, what exactly is the difference between a left outer join and except?
Expected output:
+-------------+---------------+----------+
|a |b |val |
+-------------+---------------+----------+
| 202003101750| 202003101700|1712384842|
| 202003101740| 202003101700|1590554927|
+-------------+---------------+----------+
Answer 0 (score: 1)
You can use the dropDuplicates() function to remove all duplicate rows:
val uniqueDF = df.dropDuplicates()
Or you can specify the columns to match on:
val uniqueDF = df.dropDuplicates("a", "b")
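Answer 1 (score: 0)
A LeftOuter join takes all rows from the left table and the matching rows from the right table. When a left row matches several right rows, it appears once per match, which is why your output contains repeats.
Except returns the rows of the first DataFrame that do not exist in the second DataFrame (without duplicates).
In your case, you can combine an inner (or) outer join with dropDuplicates.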
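The difference between the two operations can be modelled with plain Scala collections, without Spark. This is only a sketch of the semantics: the object name `JoinVsExcept` and the shortened sample data are assumptions made for illustration, not part of the Spark API.

```scala
// Plain-Scala model of the semantics; no Spark is involved here.
object JoinVsExcept {
  // Hypothetical, shortened versions of df1 (a, b, val) and df2 (a, b).
  val df1: Seq[(String, String, Long)] = Seq(
    ("202003101750", "202003101700", 1712384842L),
    ("202003101740", "202003101700", 1590554927L),
    ("202003101730", "202003101700", 1930860788L)
  )
  val df2: Seq[(String, String)] = Seq(
    ("202003101750", "202003101700"),
    ("202003101750", "202003101700"),
    ("202003101740", "202003101700")
  )

  // Left outer join on (a, b): every df1 row is kept. A row with k matches
  // in df2 appears k times; a row with no match appears once.
  def leftOuter: Seq[(String, String, Long)] =
    df1.flatMap { row =>
      val key = (row._1, row._2)
      val matches = df2.count(_ == key)
      Seq.fill(math.max(matches, 1))(row)
    }

  // Except on the key columns: distinct df1 keys that do not occur in df2.
  def exceptKeys: Seq[(String, String)] =
    df1.map(row => (row._1, row._2)).distinct.filterNot(df2.contains)
}
```

In this model the first df1 row is emitted twice by `leftOuter` because its key occurs twice in df2, while `exceptKeys` keeps only the key that has no partner in df2.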
df1.join(df2, Seq("a", "b"), "inner").dropDuplicates().show()
//+------------+------------+----------+
//| a| b| val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+
df1.join(df2, Seq("a", "b"), "rightouter").where(col("val").isNotNull).dropDuplicates().show()
//+------------+------------+----------+
//| a| b| val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+
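As an alternative not mentioned in the answers above, a left semi join keeps each df1 row that has at least one match in df2 and never multiplies it, so dropDuplicates may not be needed. The following is a minimal sketch, assuming a local SparkSession; the object name `LeftSemiSketch`, the helper `keepMatching`, and the shortened sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LeftSemiSketch {
  // Hypothetical helper: rows of `left` whose (a, b) key occurs in `right`.
  // A left semi join returns only the left-side columns and emits each
  // left row at most once, however many matches it has on the right.
  def keepMatching(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, Seq("a", "b"), "leftsemi")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-semi-sketch")
      .getOrCreate()
    import spark.implicits._

    // Shortened sample data modelled on the question.
    val df1 = Seq(
      ("202003101750", "202003101700", 1712384842L),
      ("202003101740", "202003101700", 1590554927L),
      ("202003101730", "202003101700", 1930860788L)
    ).toDF("a", "b", "val")

    val df2 = Seq(
      ("202003101800", "202003101700"),
      ("202003101750", "202003101700"),
      ("202003101750", "202003101700"),
      ("202003101740", "202003101700")
    ).toDF("a", "b")

    keepMatching(df1, df2).show()
    spark.stop()
  }
}
```

Note that if df1 itself contained repeated rows, a left semi join would keep them; dropDuplicates would still be needed in that case.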