I have 2 DataFrames:
A:
+----------+------+-------------+-------------+
|title |name |product |available |
+----------+------+-------------+-------------+
|AAAAA |WW |indoor camera|true |
|A121AA |AA |indoor camera|true |
|AACCCA |YY |indoor camera|true |
+----------+------+-------------+-------------+
B:
+-------------+----------+-------------------+
| product | title | name |
+-------------+----------+-------------------+
|indoor camera|FFFFF |WW |
|indoor camera|F1FFF |WW |
|indoor camera|FYFFF |YY |
|indoor camera|BBB |MNMN |
|indoor camera|CCC |MNMN |
|indoor camera|DDD |BBBNNN |
+-------------+----------+-------------------+
I need to get joined data that looks like this:
+----------+------+-------------+-------------+
|title |name |product |available |
+----------+------+-------------+-------------+
|AAAAA |WW |indoor camera|true |
|AACCCA |YY |indoor camera|true |
|A121AA |AA |indoor camera|true |
|BBB |MNMN |indoor camera|null |
|CCC |MNMN |indoor camera|null |
|DDD |BBBNNN|indoor camera|null |
+----------+------+-------------+-------------+
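I want to join on "product" and get the joined data. If a "name" is in A, the final joined data should carry the title from A for that name (so only one row for WW), and the remaining rows should come from B. I'm not sure which kind of join I need for this. Can anyone suggest an idea?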
Answer 0: (score: 0)
Use a full join:
a.join(b, ['title'], how='full').show()
Or coalesce the columns of the two tables:
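import pyspark.sql.functions as F
# Full join on title; take each column from A and fall back to B when A has no match
a.join(b, a.title == b.title, how='full').select(
    F.coalesce(a.title, b.title).alias('title'),
    F.coalesce(a.name, b.name).alias('name'),
    F.coalesce(a.product, b.product).alias('product'),
    a.available,
).show()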
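Answer 1: (score: 0)
Just to check that I've understood this correctly: you want to join the frames on "product", "name" and "title" together, but keep only the rows whose data exists in A. If so, you could try: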
a.join(b, on=['product', 'name', 'title'], how='left').show()
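Another possible reading of the expected output in the question is: keep every row of A, then append the rows of B whose name never occurs in A, with available left as null. The sketch below models that logic in plain Python on lists of dicts so it can be run without a Spark session. The function name combine and the row dictionaries are illustrative assumptions, not code from either answer.

```python
# Plain-Python sketch of "all rows of A, plus rows of B whose name is not in A".
# Rows are modelled as dicts copied from the tables in the question.
# `combine` is a hypothetical helper name used only for this illustration.

a_rows = [
    {"title": "AAAAA", "name": "WW", "product": "indoor camera", "available": True},
    {"title": "A121AA", "name": "AA", "product": "indoor camera", "available": True},
    {"title": "AACCCA", "name": "YY", "product": "indoor camera", "available": True},
]

b_rows = [
    {"product": "indoor camera", "title": "FFFFF", "name": "WW"},
    {"product": "indoor camera", "title": "F1FFF", "name": "WW"},
    {"product": "indoor camera", "title": "FYFFF", "name": "YY"},
    {"product": "indoor camera", "title": "BBB", "name": "MNMN"},
    {"product": "indoor camera", "title": "CCC", "name": "MNMN"},
    {"product": "indoor camera", "title": "DDD", "name": "BBBNNN"},
]


def combine(a_rows, b_rows):
    """Keep every row of A, then append rows of B whose name never occurs in A."""
    names_in_a = {row["name"] for row in a_rows}

    # Anti-join step: keep only the B rows that have no matching name in A.
    b_only = [row for row in b_rows if row["name"] not in names_in_a]

    # Union step: B has no `available` column, so it is filled with None (null).
    padded = [
        {
            "title": row["title"],
            "name": row["name"],
            "product": row["product"],
            "available": None,
        }
        for row in b_only
    ]
    return list(a_rows) + padded


result = combine(a_rows, b_rows)
for row in result:
    print(row["title"], row["name"], row["product"], row["available"])
```

In PySpark the same idea would roughly be a left_anti join of B against A on name followed by a union, for example a.unionByName(b.join(a.select('name'), 'name', 'left_anti'), allowMissingColumns=True), assuming Spark 3.1 or later for allowMissingColumns. Row order after a Spark union is not guaranteed, so add an explicit orderBy if the order matters.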