What is the most efficient way to write this in PySpark?
Pseudocode:
table1 inner join table2
on cookies if table1.cookie is not Null
else join on ids
Table 1:
id, cookie
1, 1q2w
2, Null
Table 2:
id, cookie
1, 1q2w
2, 3e4r
Answer 0 (score: 1)
You can use OR, as in the other answers submitted so far. However, in my experience, joins with OR perform very poorly. You can also use UNION or UNION ALL:
select *
from table1
inner join table2
on table1.cookie = table2.cookie
UNION (ALL) -- UNION removes duplicates, UNION ALL keeps them.
select *
from table1
inner join table2
on table1.id=table2.id
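
To check the UNION approach against the sample rows, here is a minimal sketch that runs it in an in-memory SQLite database. SQLite is only a stand-in for Spark SQL here, and the `where t1.cookie is null` filter on the second branch is an assumption added to match the pseudocode. Without it, row 1 would be returned by both branches under UNION ALL.

```python
import sqlite3

# In-memory SQLite database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, cookie TEXT);
    CREATE TABLE table2 (id INTEGER, cookie TEXT);
    INSERT INTO table1 VALUES (1, '1q2w'), (2, NULL);
    INSERT INTO table2 VALUES (1, '1q2w'), (2, '3e4r');
""")

# Branch 1: cookie join. A NULL cookie never compares equal, so rows with a
# NULL cookie drop out of this branch on their own.
# Branch 2: id join, restricted to rows whose table1.cookie is NULL
# (assumed filter, added to follow the pseudocode).
query = """
    SELECT t1.id, t2.cookie
    FROM table1 t1
    INNER JOIN table2 t2 ON t1.cookie = t2.cookie
    UNION ALL
    SELECT t1.id, t2.cookie
    FROM table1 t1
    INNER JOIN table2 t2 ON t1.id = t2.id
    WHERE t1.cookie IS NULL
"""
union_rows = sorted(conn.execute(query).fetchall())
print(union_rows)
```

Because the two branches are disjoint once the filter is in place, UNION ALL is enough and avoids the deduplication step that UNION performs.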
Answer 1 (score: 1)
On the PySpark side, you can create two DataFrames depending on whether table1.cookie is null, and then union them:
>>> import pyspark.sql.functions as F
>>> df1 = table1.where(F.col('cookie').isNull()).join(table2, table1.id == table2.id, 'inner').select(table1.id, table2.cookie)
>>> df2 = table1.where(F.col('cookie').isNotNull()).join(table2, table1.cookie == table2.cookie, 'inner').select(table1.id, table2.cookie)
>>> df1.union(df2).show()
+---+------+
| id|cookie|
+---+------+
|  2|  3e4r|
|  1|  1q2w|
+---+------+
Answer 2 (score: 0)
You can try using OR:
select * from
table1 inner join table2
on table1.cookie = table2.cookie or table1.id = table2.id
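
One thing worth checking with a plain OR is whether it matches more rows than the pseudocode intends. The sketch below uses SQLite as a stand-in for Spark SQL and compares the plain OR with a variant that only falls back to id when table1.cookie is NULL. Row 3 in both tables is a hypothetical addition that is not in the question: its cookie is non-null but has no cookie match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, cookie TEXT);
    CREATE TABLE table2 (id INTEGER, cookie TEXT);
    -- Rows 1 and 2 come from the question; row 3 is a hypothetical addition.
    INSERT INTO table1 VALUES (1, '1q2w'), (2, NULL), (3, 'zzzz');
    INSERT INTO table2 VALUES (1, '1q2w'), (2, '3e4r'), (3, '9o8i');
""")

# Plain OR: a row can match on id even when its cookie is not NULL.
plain_or = """
    SELECT t1.id, t2.cookie
    FROM table1 t1
    INNER JOIN table2 t2
      ON t1.cookie = t2.cookie OR t1.id = t2.id
"""

# Guarded OR: the id comparison only applies when table1.cookie is NULL.
guarded_or = """
    SELECT t1.id, t2.cookie
    FROM table1 t1
    INNER JOIN table2 t2
      ON t1.cookie = t2.cookie OR (t1.cookie IS NULL AND t1.id = t2.id)
"""

plain_rows = sorted(conn.execute(plain_or).fetchall())
guarded_rows = sorted(conn.execute(guarded_or).fetchall())
print(plain_rows)    # includes row 3, matched on id
print(guarded_rows)  # row 3 is absent: its cookie is not NULL and has no match
```

On the two rows from the question both queries return the same result, so the difference only shows up on data like the hypothetical row 3.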
Answer 3 (score: 0)
Join on cookie, or on id when table1.cookie is null:
select *
from table1 t1
join table2 t2 on t1.cookie = t2.cookie
or (t1.cookie is null and t1.id = t2.id)
Answer 4 (score: 0)
The most efficient approach is usually to use left join:
select . . .,
coalesce(t2c.colx, t2i.colx) as colx
from table1 t1 left join
table2 t2c
on t1.cookie = t2c.cookie left join
table2 t2i
-- t2c.cookie is null only when the cookie join found no match
on t1.id = t2i.id and t2c.cookie is null
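
A sketch of this coalesce pattern on the sample rows, again using SQLite as a stand-in for Spark SQL. It uses a left join for both lookups and coalesces table2's cookie column in place of `colx`. The final `where` clause is an assumption added here to drop rows that matched neither way, which restores inner-join behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, cookie TEXT);
    CREATE TABLE table2 (id INTEGER, cookie TEXT);
    INSERT INTO table1 VALUES (1, '1q2w'), (2, NULL);
    INSERT INTO table2 VALUES (1, '1q2w'), (2, '3e4r');
""")

# t2c is the cookie lookup, t2i the id lookup. The id lookup is only
# attempted for rows where the cookie lookup found nothing.
query = """
    SELECT t1.id, COALESCE(t2c.cookie, t2i.cookie) AS cookie
    FROM table1 t1
    LEFT JOIN table2 t2c ON t1.cookie = t2c.cookie
    LEFT JOIN table2 t2i ON t1.id = t2i.id AND t2c.cookie IS NULL
    -- Assumed filter: keep only rows that matched at least one lookup.
    WHERE t2c.cookie IS NOT NULL OR t2i.id IS NOT NULL
"""
coalesce_rows = sorted(conn.execute(query).fetchall())
print(coalesce_rows)
```

Note that this pattern falls back to id whenever the cookie lookup finds no match, not only when table1.cookie is NULL, so it can differ from the pseudocode for non-null cookies that have no counterpart in table2.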