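Suppose I have a table, orders, with 20 columns. I am only interested in the first 4: id, department_id, region_id, and datetime, where id is the customer ID and datetime is the time the customer placed the order. The other columns hold product-level details (for example product_id), so a given order can span multiple rows. I am struggling to write a query that returns the earliest department and region for each customer, because the same customer can have multiple combinations of department_id and region_id.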
SELECT a.*
FROM (
SELECT id,
department_id,
region_id,
min(DATETIME) AS ts
FROM orders
GROUP BY id,
department_id,
region_id
) a
INNER JOIN (
SELECT id,
min(DATETIME) AS ts
FROM orders
GROUP BY id
) b
ON a.id = b.id
AND a.ts = b.ts
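This seems to work, but it is inefficient and poorly written. Is there a better way to write it? The table itself is large, so the query is slow.
Answer 0: (score: 0)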
I think you may be able to use HAVING, like this:
SELECT id, department_id, region_id, min(datetime) AS ts
FROM orders
GROUP BY id, department_id, region_id
HAVING ts=min(datetime)
Answer 1: (score: 0)
Use the dense_rank() analytic function:
SELECT
id,
department_id,
region_id,
min(DATETIME) AS ts
FROM
(
SELECT id,
department_id,
region_id,
DATETIME,
dense_rank() over(partition by id order by DATETIME) AS rnk
FROM orders
)s
WHERE rnk=1 --records with minimal date by id
GROUP BY id,
department_id,
region_id;
This query returns the same result as yours, but the table is scanned only once and no join is needed.
Answer 2: (score: 0)
I would do:
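To check the claim that both queries return the same rows, here is a minimal sketch that runs the original join query and the dense_rank() query against a tiny in-memory table. It uses Python's sqlite3 module as a stand-in for Hive (an assumption, and it needs SQLite 3.25 or later for window functions); the sample rows are made up for illustration, and it says nothing about Hive performance.

```python
import sqlite3

# Hypothetical sample data: (id, department_id, region_id, datetime).
ROWS = [
    (1, 10, 100, "2020-01-01 09:00:00"),
    (1, 10, 100, "2020-01-01 09:00:00"),  # same order, second product row
    (1, 11, 100, "2020-01-01 09:00:00"),  # another combination tied at the earliest time
    (1, 12, 101, "2020-02-01 09:00:00"),  # later order, should be excluded
    (2, 20, 200, "2020-03-01 12:00:00"),
    (2, 21, 200, "2020-03-05 12:00:00"),  # later order, should be excluded
]

# The query from the question: two aggregations joined on (id, ts).
JOIN_QUERY = """
SELECT a.id, a.department_id, a.region_id, a.ts
FROM (SELECT id, department_id, region_id, min(datetime) AS ts
      FROM orders
      GROUP BY id, department_id, region_id) a
INNER JOIN (SELECT id, min(datetime) AS ts
            FROM orders
            GROUP BY id) b
  ON a.id = b.id AND a.ts = b.ts
"""

# The dense_rank() query from this answer: one pass, no join.
RANK_QUERY = """
SELECT id, department_id, region_id, min(datetime) AS ts
FROM (SELECT id, department_id, region_id, datetime,
             dense_rank() OVER (PARTITION BY id ORDER BY datetime) AS rnk
      FROM orders) s
WHERE rnk = 1
GROUP BY id, department_id, region_id
"""


def run(query):
    """Load the sample rows into a fresh in-memory table and return sorted results."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(id INTEGER, department_id INTEGER, region_id INTEGER, datetime TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", ROWS)
    rows = sorted(conn.execute(query).fetchall())
    conn.close()
    return rows


join_rows = run(JOIN_QUERY)
rank_rows = run(RANK_QUERY)

print("same result:", join_rows == rank_rows)
for row in rank_rows:
    print(row)
```

Note that customer 1 comes back with two rows, because two department/region combinations share the earliest timestamp. Both queries keep such ties.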
SELECT id, department_id, region_id, datetime
FROM (SELECT o.*,
row_number() over (partition by id order by datetime) as seqnum
FROM orders o
) o
where seqnum = 1;
EDIT:
You can try this version to see whether it performs better:
select o.*
from orders o join
(select id, min(datetime) as min_datetime
from orders
group by id
) oo
on oo.id = o.id and oo.min_datetime = o.datetime;
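In most databases, the row_number() version would probably perform better. However, Hive can make mysterious optimization decisions, so this version may turn out to be faster.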
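One behavioral difference is worth seeing on a small example: row_number() returns exactly one row per customer, while dense_rank() keeps every row tied at the earliest time. The sketch below again uses Python's sqlite3 as a stand-in for Hive (an assumption; SQLite 3.25 or later is needed for window functions) with made-up sample rows. The extra department_id, region_id tiebreaker in the ORDER BY is my addition, not part of the answer above; without it, which tied row row_number() picks is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(id INTEGER, department_id INTEGER, region_id INTEGER, datetime TEXT)"
)
# Hypothetical sample data; customer 1 has two combinations at the same earliest time.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 11, 100, "2020-01-01 09:00:00"),
        (1, 10, 100, "2020-01-01 09:00:00"),
        (1, 12, 101, "2020-02-01 09:00:00"),
        (2, 20, 200, "2020-03-01 12:00:00"),
    ],
)


def earliest(rank_fn, order_by):
    """Return the rows ranked first per id by the given window function and ordering."""
    query = f"""
    SELECT id, department_id, region_id, datetime
    FROM (SELECT o.*,
                 {rank_fn}() OVER (PARTITION BY id ORDER BY {order_by}) AS seqnum
          FROM orders o) ranked
    WHERE seqnum = 1
    ORDER BY id, department_id, region_id
    """
    return conn.execute(query).fetchall()


# row_number() with an explicit tiebreaker (assumption): one deterministic row per id.
one_per_id = earliest("row_number", "datetime, department_id, region_id")

# dense_rank() on datetime alone: all combinations tied at the earliest time survive.
all_ties = earliest("dense_rank", "datetime")

print("row_number:", one_per_id)
print("dense_rank:", all_ties)
```

So the choice depends on what "earliest department and region" should mean when a customer's first order touches several combinations: pick row_number() if one row per customer is required, dense_rank() if all tied combinations should be returned.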