Conditional join on different columns

Posted: 2018-11-02 10:11:58

Tags: sql, pyspark

What is the most efficient way to write this in pyspark?

Pseudocode:

table1 inner join table2 
on cookies if table1.cookie is not Null 
else join on ids

Table 1:

id, cookie
1, 1q2w
2, Null

Table 2:

id, cookie
1, 1q2w
2, 3e4r

5 Answers:

Answer 0 (score: 1)

You can use OR, as in the answers submitted so far. However, in my experience joins with an OR condition perform very poorly. You can also use UNION / UNION ALL:

select * 
from table1 
inner join table2 
on table1.cookie = table2.cookie

UNION (ALL) -- UNION removes duplicates, UNION ALL keeps them.

select * 
from table1 
inner join table2 
on table1.id = table2.id
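
For reference, a minimal PySpark sketch of the same UNION ALL idea, assuming table1 and table2 are DataFrames with the id and cookie columns from the question (the variable names here are assumptions, not from the original answer):

# Join once on cookie and once on id, then stack the results (like UNION ALL).
by_cookie = (table1.join(table2, table1.cookie == table2.cookie, 'inner')
             .select(table1.id, table2.cookie))
by_id = (table1.join(table2, table1.id == table2.id, 'inner')
         .select(table1.id, table2.cookie))
result = by_cookie.union(by_id)   # append .distinct() to mimic UNION instead of UNION ALL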

Answer 1 (score: 1)

On the pyspark side, you can create two DataFrames depending on whether table1.cookie is null, and then union them:

>>> import pyspark.sql.functions as F
>>> df1 = table1.where(F.isnull('cookie')==True).join(table2, table1.id == table2.id, 'inner').select(table1.id,table2.cookie)
>>> df2 = table1.where(F.isnull('cookie')==False).join(table2, table1.cookie == table2.cookie, 'inner').select(table1.id,table2.cookie)
>>> df1.union(df2).show()
+---+------+                                                                    
| id|cookie|
+---+------+
|  2|  3e4r|
|  1|  1q2w|
+---+------+

Answer 2 (score: 0)

You can try using OR:


select * from 
table1 inner join table2 
on table1.cookie = table2.cookie or table1.id = table2.id
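
A hedged PySpark sketch of the same OR join, assuming table1 and table2 are DataFrames (with an OR in the join condition Spark usually cannot use a hash join, which is one reason this approach tends to be slow):

import pyspark.sql.functions as F

# Inner join on cookie OR id; the OR condition typically forces a
# broadcast-nested-loop or cartesian-style plan.
result = table1.alias('t1').join(
    table2.alias('t2'),
    (F.col('t1.cookie') == F.col('t2.cookie')) | (F.col('t1.id') == F.col('t2.id')),
    'inner')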

Answer 3 (score: 0)

Join on cookie, falling back to id when table1.cookie is null:

select *
from table1 t1
join table2 t2 on t1.cookie = t2.cookie
               or (t1.cookie is null and t1.id = t2.id)
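
The same fallback condition can be written directly as a PySpark join expression; a sketch, again assuming table1 and table2 are DataFrames:

import pyspark.sql.functions as F

# Match on cookie, or on id only when the left-hand cookie is null.
cond = (F.col('t1.cookie') == F.col('t2.cookie')) | \
       (F.col('t1.cookie').isNull() & (F.col('t1.id') == F.col('t2.id')))
result = table1.alias('t1').join(table2.alias('t2'), cond, 'inner')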

Answer 4 (score: 0)

The most efficient method is often to use a left join:

select . . .,
       coalesce(t2c.colx, t2i.colx) as colx
from table1 t1 left join
     table2 t2c
     on t1.cookie = t2c.cookie left join
     table2 t2i
     on t1.id = t2i.id and t2c.cookie is null
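
A minimal PySpark sketch of this double-left-join-plus-coalesce pattern; colx is the placeholder column name from the SQL above, and the DataFrame names are assumptions:

import pyspark.sql.functions as F

t1 = table1.alias('t1')
t2c = table2.alias('t2c')   # copy of table2 matched by cookie
t2i = table2.alias('t2i')   # copy of table2 matched by id, used only when the cookie match failed

result = (t1
          .join(t2c, F.col('t1.cookie') == F.col('t2c.cookie'), 'left')
          .join(t2i, (F.col('t1.id') == F.col('t2i.id')) & F.col('t2c.cookie').isNull(), 'left')
          .select(F.col('t1.id'),
                  F.coalesce(F.col('t2c.colx'), F.col('t2i.colx')).alias('colx')))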