Question

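I have a table named purchase_data in Hive that lists all purchases. I need to query this table and find the cust_id, product_id and price of the most expensive product purchased by each customer. The data in the purchase_data table looks like this: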

cust_id         product_id      price   purchase_data
--------------------------------------------------------
aiman_sarosh    apple_iphone5s  55000   01-01-2014
aiman_sarosh    apple_iphone6s  65000   01-01-2017
jeff_12         apple_iphone6s  65000   01-01-2017
jeff_12         dell_vostro     70000   01-01-2017
missy_el        lenovo_thinkpad 70000   01-02-2017

I wrote the query below, but it does not return the right rows; some rows are repeated:

select master.cust_id, master.product_id, master.price
from
(
  select cust_id, product_id, price
  from purchase_data
) as master
join
(
  select cust_id, max(price) as price
  from purchase_data
  group by cust_id
) as max_amt_purchase
on max_amt_purchase.price = master.price;

Output:

aiman_sarosh    apple_iphone6s  65000.0
jeff_12         apple_iphone6s  65000.0
jeff_12         dell_vostro     70000.0
jeff_12         dell_vostro     70000.0
missy_el        lenovo_thinkpad 70000.0
missy_el        lenovo_thinkpad 70000.0
Time taken: 21.666 seconds, Fetched: 6 row(s)

Is there something wrong with the code?

Answer 1

Use row_number():

select pd.*
from (select pd.*,
             row_number() over (partition by cust_id order by price desc) as seqnum
      from purchase_data pd
     ) pd
where seqnum = 1;

This returns one row per cust_id, even when there are ties. If you want multiple rows when there are ties, use rank() or dense_rank() instead of row_number().
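
As a rough sketch of the tie-preserving variant, the same query with rank() might look like the following. It assumes the purchase_data table and columns from the question; the alias rnk is just an illustrative name. In the sample data no customer has two products at the same top price, so it should return the same three rows as the row_number() version.

```sql
-- rank() assigns the same rank to rows with equal price inside a partition,
-- so every product tied for a customer's highest price is kept.
select cust_id, product_id, price
from (select pd.*,
             rank() over (partition by cust_id order by price desc) as rnk
      from purchase_data pd
     ) pd
where rnk = 1;
```

For rows with rank 1, dense_rank() behaves the same way; the two only differ in how they number the rows after a tie.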

Answer 2

I changed the code, and it now works:

select master.cust_id, master.product_id, master.price
from
purchase_data as master,
(
  select cust_id, max(price) as price
  from purchase_data
  group by cust_id
) as max_price
where master.cust_id=max_price.cust_id and master.price=max_price.price;

Output:

aiman_sarosh    apple_iphone6s  65000.0
missy_el        lenovo_thinkpad 70000.0
jeff_12         dell_vostro     70000.0

Time taken: 55.788 seconds, Fetched: 3 row(s)
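
The difference from the original query is the extra cust_id condition. Joining on price alone matches each row against every customer whose maximum equals that price: 70000 is the maximum for both jeff_12 and missy_el, so the dell_vostro and lenovo_thinkpad rows each appeared twice, and the 65000 row for jeff_12 matched the maximum of aiman_sarosh. The same fix can also be written with an explicit join, roughly as below (a sketch using the same table and aliases as above):

```sql
select master.cust_id, master.product_id, master.price
from purchase_data master
join (
  -- highest price paid by each customer
  select cust_id, max(price) as price
  from purchase_data
  group by cust_id
) max_price
  on  master.cust_id = max_price.cust_id  -- the condition the original query was missing
  and master.price   = max_price.price;
```

Like rank(), this form returns more than one row for a customer if several of their products share the highest price.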

Hive: unable to fetch columns that are not in the GROUP BY
