How to join 2 dataframes based on complex conditions

Date: 2019-03-06 06:25:17

Tags: sql scala apache-spark dataframe join

I have 2 dataframes:

A:

+----------+------+-------------+-------------+
|title     |name  |product      |available    |
+----------+------+-------------+-------------+
|AAAAA     |WW    |indoor camera|true         |
|A121AA    |AA    |indoor camera|true         |
|AACCCA    |YY    |indoor camera|true         |
+----------+------+-------------+-------------+

B:

+-------------+----------+-------------------+
| product     | title    | name              |
+-------------+----------+-------------------+
|indoor camera|FFFFF     |WW                 |
|indoor camera|F1FFF     |WW                 |
|indoor camera|FYFFF     |YY                 |
|indoor camera|BBB       |MNMN               |
|indoor camera|CCC       |MNMN               |
|indoor camera|DDD       |BBBNNN             |
+-------------+----------+-------------------+

I need the joined data to look like this:

+----------+------+-------------+-------------+
|title     |name  |product      |available    |
+----------+------+-------------+-------------+
|AAAAA     |WW    |indoor camera|true         |
|AACCCA    |YY    |indoor camera|true         |
|A121AA    |AA    |indoor camera|true         |
|BBB       |MNMN  |indoor camera|null         |
|CCC       |MNMN  |indoor camera|null         |
|DDD       |BBBNNN|indoor camera|null         |
+----------+------+-------------+-------------+

I want to join on "product" and get the joined data. If a "name" exists in A, the final joined data should keep that row's name and title from A (e.g. the WW rows), and the remaining rows should come from B. I'm not sure which kind of join I need for this. Can anyone suggest an approach?

2 Answers:

Answer 0 (score: 0)

Use a full join:

a.join(b, ['title'], how='full').show()

Or coalesce the title columns of the two tables:

import pyspark.sql.functions as F
a.join(b, a.title == b.title, how='full').select(
    F.coalesce(a.title, b.title).alias('title'), a.name, a.product, a.available
).show()

Answer 1 (score: 0)

Just to check whether I understood this correctly: you want to join the frames on 'product', 'name' and 'title', but keep only the rows that exist in A. If so, you can try:

a.join(b, on=['product', 'name', 'title'], how='left').show()