Question

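In Apache Hive I have two tables that I would like to left-join, keeping all the data from the left table and adding data from the right table where possible. I use two join conditions, because the join is based on two fields (material_id and location_id). This works fine with a traditional left join: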

SELECT 
   a.*, 
   b.*
FROM a
LEFT JOIN (some more complex select) b
   ON a.material_id=b.material_id 
   AND a.location_id=b.location_id;

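For location_id the database only contains two distinct values, say 1 and 2.

We now have the requirement that the join should fall back when there is no "perfect match". That is, when only material_id can be joined and table b has no row with the right combination of material_id and location_id (e.g. material_id = 100 and location_id = 1), the join should "default" or "fall back" to the other possible value of location_id (e.g. material_id = 100 and location_id = 2), and vice versa. This fallback should apply to location_id only.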

We have already looked into every approach we could think of, with CASE etc., but to no avail. A setup like

...
ON a.material_id=b.material_id AND a.location_id=
CASE WHEN a.location_id = b.location_id THEN b.location_id ELSE ...;

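is something we tried, but we could not figure out how to actually do it in the Hive query language.

Thank you for your help! Maybe someone has a clever idea.

Here is some sample data: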

Table a
| material_id | location_id | other_column_a |
| 100         | 1           | 45             |
| 101         | 1           | 45             |
| 103         | 1           | 45             |
| 103         | 2           | 45             |



Table b
| material_id | location_id | other_column_b |
| 100         | 1           | 66             |
| 102         | 1           | 76             |
| 103         | 2           | 88             |


Left-join result
| material_id | location_id | other_column_a | other_column_b                      |
| 100         | 1           | 45             | 66                                  |
| 101         | 1           | 45             | NULL (mat. not in b)                |
| 103         | 1           | 45             | DEFAULT TO where location_id=2 (88) |
| 103         | 2           | 45             | 88                                  |

PS: As has been noted elsewhere, EXISTS and the like do not work in the ON clause of a subquery.

Answer 1

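The solution is to left join without the a.location_id = b.location_id condition and to number all rows in order of preference. Then filter by row_number. In the code below, the join first duplicates rows, because every row of b with a matching material_id is joined. The row_number() function then assigns 1 to the row where a.location_id = b.location_id, if such a row exists, and 2 to the row where a.location_id <> b.location_id. If there is no exact match, the row where a.location_id <> b.location_id gets 1 instead. b.location_id is added to the ORDER BY inside row_number() so that it will "prefer" rows with the lower b.location_id in case there is no exact match. I hope you get the idea.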

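select * from 
(
SELECT 
   a.*, 
   b.other_column_b,  -- list b's columns explicitly: a.* plus b.* would expose material_id and location_id twice
   -- one partition per row of a (assumes material_id + location_id is unique in a)
   row_number() over(partition by a.material_id, a.location_id 
                     order by CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END, b.location_id ) as rn
FROM a
LEFT JOIN (some more complex select) b
   ON a.material_id=b.material_id 
)s 
where rn=1
;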
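
To check the idea against the sample data from the question, here is a minimal self-contained sketch. It replaces the real tables with inline CTEs named a and b, which assumes a Hive version that accepts SELECT without FROM and UNION ALL inside a CTE. Partitioning by both a.material_id and a.location_id is my assumption about the intended grouping (one ranking per row of a), and it only works if that pair is unique in a.

```sql
-- Inline copies of the sample tables from the question (illustration only).
WITH a AS (
  SELECT 100 AS material_id, 1 AS location_id, 45 AS other_column_a
  UNION ALL SELECT 101, 1, 45
  UNION ALL SELECT 103, 1, 45
  UNION ALL SELECT 103, 2, 45
),
b AS (
  SELECT 100 AS material_id, 1 AS location_id, 66 AS other_column_b
  UNION ALL SELECT 102, 1, 76
  UNION ALL SELECT 103, 2, 88
)
SELECT material_id, location_id, other_column_a, other_column_b
FROM (
  SELECT
    a.material_id,
    a.location_id,
    a.other_column_a,
    b.other_column_b,
    -- Rank the candidate b rows for each row of a:
    -- exact location match first, otherwise the lowest b.location_id.
    row_number() OVER (
      PARTITION BY a.material_id, a.location_id
      ORDER BY CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END,
               b.location_id
    ) AS rn
  FROM a
  LEFT JOIN b
    ON a.material_id = b.material_id
) s
WHERE rn = 1;
```

For material 101 there is no row in b at all, so the left join yields a single row with NULL in other_column_b, and that row still gets rn = 1. For material 103 at location 1 the only candidate is the b row at location 2, so its value 88 is used as the fallback.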

Answer 2

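Maybe this will help someone in the future:

We also came up with a different approach.

First, we create another table that calculates, from table b, the average per material_id over all (!) locations.

Second, in the join table we create three columns:

c1 - the value where material_id and location_id both match (the result of left-joining table a with table b). This column is NULL if there is no perfect match.

c2 - the value taken from the averages (fallback) table for this material_id, regardless of location.

c3 - the "actual value" column, where a CASE statement checks whether column 1 is NULL (there was no perfect match of material and location). If it is, we use the value from column 2 (the average over all the other locations for the material) for further calculations.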
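
The three columns described above could be sketched roughly as follows. This is an illustration under assumptions, not the exact query we ran: the averages table is written as a CTE with the hypothetical name b_avg, and other_column_b from the question's sample data stands in for the value being averaged.

```sql
-- b_avg is a hypothetical name for the averages (fallback) table:
-- one row per material_id, averaged over all locations in b.
WITH b_avg AS (
  SELECT material_id, AVG(other_column_b) AS avg_b
  FROM b
  GROUP BY material_id
)
SELECT
  a.material_id,
  a.location_id,
  a.other_column_a,
  b.other_column_b AS c1,   -- perfect match on material and location, else NULL
  b_avg.avg_b      AS c2,   -- fallback: average for the material over all locations
  CASE
    WHEN b.other_column_b IS NULL THEN b_avg.avg_b
    ELSE b.other_column_b
  END              AS c3    -- "actual value" used for further calculations
FROM a
LEFT JOIN b
  ON a.material_id = b.material_id
 AND a.location_id = b.location_id
LEFT JOIN b_avg
  ON a.material_id = b_avg.material_id;
```

Both joins use equality conditions only, so nothing beyond plain left joins is needed. Note that the CASE tests the value itself, so a perfect match whose value is NULL would also fall back to the average.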

Hive / SQL - Left join with fallback
