SQL / HiveQL: assign values to buckets based on a table

Date: 2018-11-09 01:49:41

Tags: hive hiveql

I have a table "bucket" that holds the minimum int value of each bucket, like this:

min_value bucket_id
--------- ---------
       0      1
   12345      2
   67890      3

That is, any value >= 0 and < 12345 belongs to bucket 1, ..., and any value >= 67890 belongs to bucket 3.

And a table "value" of int values like this:

id value
-- -----
11    10
22 20000
33 80000

I want to figure out which bucket each value belongs to. So

select id, bucket_id
from (some join, or whatever, of bucket and value)

should give me

id bucket_id
-- ---------
11     1
22     2
33     3

I am trying to implement this in HiveQL. Any ideas?

3 answers:

Answer 0: (score: 1)

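I assume that the condition for the bucket with the greatest min_value is min_value <= value (since there is no bucket with a larger min_value). I also assume that column value of table value and column min_value of table bucket are of integer type. This matters because the comparisons used by the query behave differently for string types, in which case you would need to cast the strings.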

The query works for non-negative values in column value of table value. If negative values are involved, you have to replace

max(if(a.value >= b.min_value, b.min_value, 0))

with

max(if(a.value >= b.min_value, b.min_value, <minimum possible value that "value" field may have>))
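
The complete query is not shown in this answer. As a minimal sketch only, assuming the aliases a and b stand for the value and bucket tables, that id is unique in value, and that min_value is unique in bucket, the expression could be used as follows: for each id, take the largest min_value that does not exceed the value, then join back to bucket to look up its bucket_id. The subquery alias m and the column name matched_min are hypothetical names chosen for illustration.

```sql
-- Sketch under stated assumptions, not the answerer's original query.
-- a = value table, b = bucket table; min_value must be unique per bucket.
select m.id, d.bucket_id
from (
  -- Largest min_value that does not exceed the value.
  -- The fallback 0 is only safe for non-negative data.
  select a.id,
         max(if(a.value >= b.min_value, b.min_value, 0)) as matched_min
  from value a
  cross join bucket b
  group by a.id
) m
join bucket d
  on d.min_value = m.matched_min;
```

On the sample data from the question this pairs ids 11, 22 and 33 with buckets 1, 2 and 3. The cross join compares every value with every bucket, which is acceptable while the bucket table stays small.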

Answer 1: (score: 1)

You can use a window function to derive the value range of each bucket id and then join with the bucket table. Check this out.

> select * from bucket;
+-------------------+-------------------+--+
| bucket.min_value  | bucket.bucket_id  |
+-------------------+-------------------+--+
| 0                 | 1                 |
| 12345             | 2                 |
| 67890             | 3                 |
+-------------------+-------------------+--+

> select * from buckvalue;
+---------------+------------------+--+
| buckvalue.id  | buckvalue.value  |
+---------------+------------------+--+
| 11            | 10               |
| 22            | 20000            |
| 33            | 80000            |
+---------------+------------------+--+

> select bucket_id, min_value, lead(min_value) over(order by bucket_id)  as max1 from bucket;
INFO  : OK
+------------+------------+--------+--+
| bucket_id  | min_value  |  max1  |
+------------+------------+--------+--+
| 1          | 0          | 12345  |
| 2          | 12345      | 67890  |
| 3          | 67890      | NULL   |
+------------+------------+--------+--+

> select t1.id, t1.value, t2.bucket_id from buckvalue t1 cross join ( select bucket_id, min_value, lead(min_value) over(order by bucket_id)  as max1 from bucket ) t2
where t1.value >= t2.min_value and (t2.max1 is null or t1.value < t2.max1); -- a NULL max1 marks the last bucket, which has no upper bound

+--------+-----------+---------------+--+
| t1.id  | t1.value  | t2.bucket_id  |
+--------+-----------+---------------+--+
| 11     | 10        | 1             |
| 22     | 20000     | 2             |
| 33     | 80000     | 3             |
+--------+-----------+---------------+--+

Answer 2: (score: 0)

I found a very simple query that does this. It works by finding every bucket whose min_value is less than or equal to the value and taking the largest bucket_id among them. This relies on bucket_id increasing with min_value, as it does in the sample data.

create temporary table bucket as select * from (select 0 min_value, 1 bucket_id union select 12345, 2 union select 67890, 3) a;
create temporary table value as select * from (select 11 id, 10 value union select 22, 20000 union select 33, 80000) a;

select value.id, max(bucket.bucket_id) bucket_id
from value
join bucket
where value.value >= bucket.min_value
group by value.id;
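
If bucket_id is not guaranteed to grow together with min_value, max(bucket_id) can pick the wrong bucket. One possible variant, shown here as a sketch rather than a tested solution, ranks the matching buckets per id by min_value and keeps the top one. The aliases v, b and ranked and the column rn are hypothetical names, and the table names follow the question.

```sql
-- Sketch: pick, for each id, the bucket with the largest min_value
-- that does not exceed the value, without relying on bucket_id order.
select id, bucket_id
from (
  select v.id,
         b.bucket_id,
         row_number() over (partition by v.id
                            order by b.min_value desc) as rn
  from value v
  cross join bucket b
  where v.value >= b.min_value
) ranked
where rn = 1;
```

Values smaller than every min_value match no bucket and therefore drop out of the result, the same as in the query above.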