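Imagine we have two tables: customer and purchases. The purchases table has customerID, purchaseDateTime, and so on.
What is the best way to select the most recent purchase for every customer in Hive or Impala SQL?
I have seen this approach: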
with recent as (
    select customerID, max(purchaseDateTime) as dt
    from purchases
    group by customerID
)
select *
from customer c
join recent r
    on c.customerID = r.customerID
join purchases p
    on r.customerID = p.customerID
   and p.purchaseDateTime = r.dt
That does not seem very efficient...
Answer 0 (score: 1)
I would use row_number():
select c.*, p.*
from customer c join
     (select p.*,
             -- seqnum = 1 marks each customer's most recent purchase
             row_number() over (partition by p.customerID order by p.purchaseDateTime desc) as seqnum
      from purchases p
     ) p
     on c.customerID = p.customerID
where seqnum = 1;
row_number() is an ANSI standard function, so this is standard SQL. In general, it should be faster than an explicit group by and join.
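One difference shows up when there are ties, that is, when a customer has several purchases sharing the same latest purchaseDateTime. The row_number() version returns exactly one row per customer, picking one of the tied rows arbitrarily. Your query returns all of the tied rows. If you want that behavior, change row_number() to rank().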
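To see the tie behavior concretely, here is a small sketch that runs the same window-function subquery against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Hive or Impala here, and it needs SQLite 3.25 or later for window functions. The sample rows, the item column, and the latest_purchases helper are made up for illustration.

```python
import sqlite3

# Hypothetical sample data: customer 1 has two purchases tied for the latest time.
rows = [
    (1, "2020-01-01 10:00:00", "book"),
    (1, "2020-03-05 09:30:00", "pen"),
    (1, "2020-03-05 09:30:00", "ink"),
    (2, "2020-02-14 18:45:00", "lamp"),
]

# Same shape as the subquery in the answer; {fn} is filled with row_number or rank.
QUERY = """
select customerID, item
from (
    select p.*,
           {fn}() over (partition by customerID
                        order by purchaseDateTime desc) as seqnum
    from purchases p
) p
where seqnum = 1
order by customerID, item
"""


def latest_purchases(conn, fn):
    # fn is only ever one of two fixed names, so string formatting is safe here.
    return conn.execute(QUERY.format(fn=fn)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table purchases (customerID integer, purchaseDateTime text, item text)"
)
conn.executemany("insert into purchases values (?, ?, ?)", rows)

by_row_number = latest_purchases(conn, "row_number")
by_rank = latest_purchases(conn, "rank")

print(len(by_row_number))  # 2: one row per customer, tie broken arbitrarily
print(by_rank)             # both tied rows for customer 1, plus customer 2's row
```

With row_number() the row kept for customer 1 may be either of the tied purchases. If a deterministic pick matters, adding a tie-breaking column to the order by clause is one option.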