Question

I am trying to do a left outer join in Spark (1.6.2), but it doesn't work. My SQL query is like this:

sqlContext.sql("""select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p 
ON t.uuid = p.uuid 
where t.created_year = 2016 
and p.created_year = 2016""").show()

The result is like this:

+--------------------+--------------------+--------------------+
|                type|                uuid|                uuid|
+--------------------+--------------------+--------------------+
|              tained|89759dcc-50c0-490...|89759dcc-50c0-490...|
|             swapper|740cd0d4-53ee-438...|740cd0d4-53ee-438...|

I get the same result with either LEFT JOIN or LEFT OUTER JOIN (the second uuid is not null).

I would expect the second uuid column to be null only. How do I do a left outer join correctly?

=== Additional information ===

If I use DataFrames to do the left outer join, I get the correct result:

s = sqlCtx.sql('select * from symptom_type where created_year = 2016')
p = sqlCtx.sql('select * from plugin where created_year = 2016')

# The outer parentheses let the chained calls span several lines.
(s.join(p, s.uuid == p.uuid, 'left_outer')
 .select(s.type, s.uuid.alias('s_uuid'),
         p.uuid.alias('p_uuid'), s.created_date, p.created_year, p.created_month)
 .show())

Thanks,

Answer 1

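I don't see any issues in your code. Both "left join" and "left outer join" work fine. Please check your data again: the rows you are showing are the ones that match.

You can also perform a Spark SQL join using:

// Explicit left outer join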

df1.join(df2, df1("col1") === df2("col1"), "left_outer")

Answer 2

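You are filtering out the null values of p.created_year (and of p.uuid) with

where t.created_year = 2016 
and p.created_year = 2016

For symptom_type rows with no matching plugin row, the left join sets p.created_year to NULL, and the comparison NULL = 2016 is never true, so the WHERE clause drops exactly those rows.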

The way to avoid this is to move the filter on p into the ON clause:

sqlContext.sql("""select t.type, t.uuid, p.uuid
from symptom_type t LEFT JOIN plugin p 
ON t.uuid = p.uuid 
and p.created_year = 2016
where t.created_year = 2016""").show()

This is correct but inefficient, because we also need to filter on t.created_year before the join happens. So it is recommended to use subqueries:

sqlContext.sql("""select t.type, t.uuid, p.uuid
from (
  SELECT type, uuid FROM symptom_type WHERE created_year = 2016 
) t LEFT JOIN (
  SELECT uuid FROM plugin WHERE created_year = 2016
) p 
ON t.uuid = p.uuid""").show()
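
To compare the three queries above without a Spark cluster, here is a minimal sketch that runs them against Python's built-in sqlite3 module. This is a stand-in for Spark SQL, not Spark itself; it relies only on the standard SQL behaviour of LEFT JOIN and NULL comparisons. The table rows (including the 'oom' type and the short uuids) are made up for illustration.

```python
import sqlite3

# In-memory database with hypothetical rows (not the asker's data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE symptom_type (type TEXT, uuid TEXT, created_year INTEGER);
    CREATE TABLE plugin (uuid TEXT, created_year INTEGER);
    INSERT INTO symptom_type VALUES
        ('tained', 'a', 2016), ('swapper', 'b', 2016), ('oom', 'c', 2016);
    -- 'a' matches in 2016, 'b' matches only in 2015, 'c' has no plugin row.
    INSERT INTO plugin VALUES ('a', 2016), ('b', 2015);
""")

# Filter on p in WHERE: unmatched rows have p.created_year = NULL and are dropped.
where_filter = conn.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p ON t.uuid = p.uuid
    WHERE t.created_year = 2016 AND p.created_year = 2016
    ORDER BY t.uuid
""").fetchall()

# Filter on p in ON: every row of t survives, p.uuid is NULL where nothing matches.
on_filter = conn.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM symptom_type t LEFT JOIN plugin p
      ON t.uuid = p.uuid AND p.created_year = 2016
    WHERE t.created_year = 2016
    ORDER BY t.uuid
""").fetchall()

# Filter both sides in subqueries before joining.
subquery = conn.execute("""
    SELECT t.type, t.uuid, p.uuid
    FROM (SELECT type, uuid FROM symptom_type WHERE created_year = 2016) t
    LEFT JOIN (SELECT uuid FROM plugin WHERE created_year = 2016) p
      ON t.uuid = p.uuid
    ORDER BY t.uuid
""").fetchall()

print(where_filter)  # only the matching row remains
print(on_filter)     # all three rows, with None for the unmatched p.uuid
print(subquery)      # same rows as on_filter
```

With these rows, the WHERE version behaves like an inner join, while the ON and subquery versions keep every 2016 symptom_type row.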

Answer 3

I think you just need to use the LEFT OUTER JOIN keyword instead of LEFT JOIN. For more information, see the Spark documentation.

How do I do a left outer join in Spark SQL?
