We have a requirement: find the top N regions by their total price, and then find the top N customers within each of those regions.
Sample data:
REGION_NAME,CUSTOMER_NAME,PRICE
RG1,Customer1,100
RG1,Customer2,200
RG1,Customer3,100
RG2,Customer4,100
RG2,Customer5,200
RG2,Customer6,400
RG3,Customer7,100
RG3,Customer8,200
RG3,Customer9,500
RG3,Customer9,200
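Suppose we want the top 2 regions and, within each of them, the top 2 customers, both ranked by summed price. The expected output is:
Region_name,Region_sum,Customer_name,Customer_price(Sum)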
RG3,1000,Customer9,700 (Sum of customer price)
RG3,1000,Customer8,200
RG2,700,Customer6,400
RG2,700,Customer5,200
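How can we write a Hive query for this? We cannot work out how to express it in Hive. Would we have to write MapReduce or Pig instead?
Answer 0 (score: 0)
You can do this in Hive using analytic functions and a self-join: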
select regions_ranked.region_name, regions_ranked.region_sum,
       customers_ranked.customer_name, customers_ranked.customer_sum
from
(
    -- rank customers within each region by their summed price
    select region_name, customer_name, customer_sum,
           rank() over (partition by region_name order by customer_sum desc) as customer_rank
    from (
        select region_name, customer_name, sum(price) as customer_sum
        from foo group by region_name, customer_name
    ) customers_sum
) customers_ranked
join
(
    -- rank regions by their summed price
    select region_name, region_sum,
           rank() over (order by region_sum desc) as region_rank
    from (
        select region_name, sum(price) as region_sum
        from foo group by region_name
    ) regions_sum
) regions_ranked
on customers_ranked.region_name = regions_ranked.region_name
-- keep the top 2 customers of the top 2 regions
where region_rank <= 2 and customer_rank <= 2;
This gives exactly the output you are looking for, although not in order. You can add an "order by" clause at the end if you want the rows sorted.
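If you want to sanity-check the logic without a Hive cluster, here is a minimal sketch that runs the same query shape against the sample data using Python's standard sqlite3 module. This is SQLite, not Hive, so treat it only as a check of the ranking logic. It assumes SQLite 3.25 or newer (needed for window functions). The function name `top_regions_and_customers`, the short aliases `c` and `r`, and the particular `order by` columns are my own choices for illustration; the table name `foo` is taken from the query above.

```python
import sqlite3

# Sample rows from the question: (region_name, customer_name, price)
ROWS = [
    ("RG1", "Customer1", 100),
    ("RG1", "Customer2", 200),
    ("RG1", "Customer3", 100),
    ("RG2", "Customer4", 100),
    ("RG2", "Customer5", 200),
    ("RG2", "Customer6", 400),
    ("RG3", "Customer7", 100),
    ("RG3", "Customer8", 200),
    ("RG3", "Customer9", 500),
    ("RG3", "Customer9", 200),
]

# Same structure as the Hive query, plus an explicit "order by" at the end.
QUERY = """
select r.region_name, r.region_sum, c.customer_name, c.customer_sum
from (
    select region_name, customer_name, customer_sum,
           rank() over (partition by region_name
                        order by customer_sum desc) as customer_rank
    from (
        select region_name, customer_name, sum(price) as customer_sum
        from foo group by region_name, customer_name
    ) customers_sum
) c
join (
    select region_name, region_sum,
           rank() over (order by region_sum desc) as region_rank
    from (
        select region_name, sum(price) as region_sum
        from foo group by region_name
    ) regions_sum
) r
on c.region_name = r.region_name
where r.region_rank <= ? and c.customer_rank <= ?
order by r.region_sum desc, c.customer_sum desc
"""


def top_regions_and_customers(rows, n_regions=2, n_customers=2):
    """Load rows into an in-memory table and return the ranked result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table foo (region_name text, customer_name text, price integer)"
        )
        conn.executemany("insert into foo values (?, ?, ?)", rows)
        return conn.execute(QUERY, (n_regions, n_customers)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in top_regions_and_customers(ROWS):
        print(row)
```

One thing to keep in mind: rank() gives tied rows the same rank, so if two customers (or regions) tie on their sum you can get more than N rows back. If you need exactly N, row_number() is the usual alternative, at the cost of breaking ties arbitrarily.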