I have a table "bucket" that contains the minimum int value of each bucket, like this:
min_value bucket_id
--------- ---------
0 1
12345 2
67890 3
That is, any value >= 0 and < 12345 falls into bucket 1, ..., and any value >= 67890 falls into bucket 3.
I also have a table of int values, "value", like this:
id value
-- -----
11 10
22 20000
33 80000
I'd like to figure out which bucket each value belongs to. So
select id, bucket_id
from (some join, or whatever, of bucket and value)
gives me
id bucket_id
-- ---------
11 1
22 2
33 3
I'm trying to do this in HiveQL. Any ideas?
Answer 0: (score: 1)
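I assume the condition for the bucket with the largest min_value is min_value <= value (since there is no bucket with a larger min_value). I also assume that column "value" of table "value" and column "min_value" of table "bucket" are of integer type. This matters because the query's comparisons behave differently on strings, so string columns would need to be cast first.
The query is built around the expression below, which is correct for non-negative "value" in table "value":
max(if(a.value >= b.min_value, b.min_value, 0))
If negative values are involved, replace it with:
max(if(a.value >= b.min_value, b.min_value, <minimum possible value that "value" field may have>))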
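The full query from this answer is not preserved in the text above. As a rough sketch only, one query consistent with that expression could look like the following. It assumes alias a is the "value" table and alias b is the "bucket" table (inferred from the expression), and the aliases c and d are hypothetical names introduced here. This is not necessarily what the answer's author wrote.

```sql
-- Sketch, assuming a = "value" table and b = "bucket" table.
-- Step 1: for each id, find the largest min_value that does not exceed value.
-- Step 2: join back to bucket to translate that min_value into a bucket_id.
select c.id, d.bucket_id
from (
  select a.id,
         -- 0 is the fallback, so this form is only valid for non-negative values
         max(if(a.value >= b.min_value, b.min_value, 0)) as min_value
  from value a
  cross join bucket b
  group by a.id
) c
join bucket d
  on c.min_value = d.min_value;
```

On the sample data this should map id 11 to bucket 1, id 22 to bucket 2, and id 33 to bucket 3. Because it pairs every value with every bucket, it may be rejected if Hive is configured to disallow cartesian products, and it can get expensive when both tables are large.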
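Answer 1: (score: 1)
You can use a window function to define the range covered by each bucket ID, and then join with the bucket table. Check this out.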
> select * from bucket;
+-------------------+-------------------+--+
| bucket.min_value | bucket.bucket_id |
+-------------------+-------------------+--+
| 0 | 1 |
| 12345 | 2 |
| 67890 | 3 |
+-------------------+-------------------+--+
> select * from buckvalue;
+---------------+------------------+--+
| buckvalue.id | buckvalue.value |
+---------------+------------------+--+
| 11 | 10 |
| 22 | 20000 |
| 33 | 80000 |
+---------------+------------------+--+
> select bucket_id, min_value, lead(min_value) over(order by bucket_id) as max1 from bucket;
INFO : OK
+------------+------------+--------+--+
| bucket_id | min_value | max1 |
+------------+------------+--------+--+
| 1 | 0 | 12345 |
| 2 | 12345 | 67890 |
| 3 | 67890 | NULL |
+------------+------------+--------+--+
> select t1.id, t1.value, t2.bucket_id from buckvalue t1 left outer join ( select bucket_id, min_value, lead(min_value) over(order by bucket_id) as max1 from bucket ) t2
where t1.value >= t2.min_value and t1.value < coalesce(t2.max1,99999);
+--------+-----------+---------------+--+
| t1.id | t1.value | t2.bucket_id |
+--------+-----------+---------------+--+
| 11 | 10 | 1 |
| 22 | 20000 | 2 |
| 33 | 80000 | 3 |
+--------+-----------+---------------+--+
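One caveat with the query above: coalesce(t2.max1,99999) treats 99999 as the upper bound of the last bucket, so a value of 99999 or more would match no bucket. A possible variant, sketched here under the same table names (buckvalue, bucket), leaves the last bucket open-ended by testing for NULL instead. It also orders the window by min_value rather than bucket_id, which is an assumption on my part so that it does not depend on bucket IDs being assigned in ascending order.

```sql
-- Sketch: same idea as above, but the last bucket has no upper bound.
select t1.id, t1.value, t2.bucket_id
from buckvalue t1
cross join (
  select bucket_id,
         min_value,
         -- next bucket's min_value is this bucket's exclusive upper bound;
         -- it is NULL for the last bucket
         lead(min_value) over (order by min_value) as max1
  from bucket
) t2
where t1.value >= t2.min_value
  and (t2.max1 is null or t1.value < t2.max1);
```

A cross join is used because the where clause already discards every non-matching pair, so the left outer join in the original query effectively behaves as an inner join anyway.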
Answer 2: (score: 0)
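I found a really simple query to do this. It works by finding, for each value, all buckets whose minimum value is less than or equal to that value, and then taking the largest bucket_id among them. This assumes bucket_id increases with min_value, as it does in the sample data.
create temporary table bucket as select * from (select 0 min_value, 1 bucket_id union select 12345, 2 union select 67890, 3) a;
create temporary table value as select * from (select 11 id, 10 value union select 22, 20000 union select 33, 80000) a;
select value.id, max(bucket.bucket_id) bucket_id
from value
join bucket
where value.value >= bucket.min_value
group by value.id;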