Question

我有两张桌子。 Table 1 包含我感兴趣的所有独特地点（30 行）：

<头>

地点
日本
中国
印度
...

Table 2 包含 ID、访问地点和日期的所有信息。

<头>

id	地点	日期
10001	日本	20210204
10001	澳大利亚	20210204
10001	中国	20210204
10001	阿根廷	20210205
10002	西班牙	20210204
10002	印度	20210204
10002	中国	20210205
10003	阿根廷	20210204
10003	葡萄牙	20210204

我感兴趣的是：

对于特定日期（比如 20210204）
从表 2 中找出所有访问过表 1 中至少一个 IDs 的唯一 places
将那些唯一的 IDs 保存到临时表中。

这是我尝试过的：

create temporary table imp.unique_ids_tmp
as select distinct(final.id) from
(select t2.id
from table2 as t2
where t2.date = '20210204'
and t2.places in 
(select * from table1)) final;

我正在努力合并“至少一个”逻辑，以便一旦找到令人满意的 id，它就不再查看那些 id 记录。

谢谢。

Answer 1

使用left semi join（以有效的方式实现不相关的EXISTS），它将只过滤加入的记录，然后应用distinct：

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
       left semi join table1 t1 on t2.places = t1.places
 where t2.date = '20210204'
;

至少满足“至少一次”条件：没有连接记录的 ID 将不会出现在数据集中。

另一种方法是使用相关的 EXISTS：

create temporary table imp.unique_ids_tmp as
select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204' 
   --this condition is true as soon as one match is found
   and exists (select 1 from table1 t1 where t2.places = t1.places)
;

IN 也可以。

Correlated EXIST 看起来接近于“一旦找到满意的 id，它就停止查看那些 id 记录”，但所有这些方法都是使用 Hive 中的 JOIN 实现的。执行 EXPLAIN，你会看到，它会生成相同的计划，尽管它取决于你的版本中的实现。潜在的 EXISTS 可以更快，因为不需要检查子查询中的所有记录。考虑到您的 30 行 table1 小到足以放入内存，MAP-JOIN (set hive.auto.convert.join=true;) 将为您提供最佳性能。

另一种使用数组或 IN(static_list) 的快速方法。它可用于小型和静态数组。有序数组可能会给你更好的性能：

select distinct t2.id --distinct is not a function, do not need ()
  from table2 t2
 where t2.date = '20210204'
       and array_contains(array('australia', 'china', 'japan', ... ), t2.places)
       --OR use t2.places IN ('australia', 'china', 'japan', ... )

为什么这个方法更快：因为不需要启动mapper和计算splits从hdfs读取表，只会读取table2。缺点是值列表是静态的。另一方面，您可以将整个列表作为参数传递，参见 here。

Hive：如果条件在两个表之间满足，则查找唯一值

1 个答案: