Question

I want to compute a percentile over all rows while computing the other values grouped by room_id, like this:

   select 
        distinct room_id,
        count(user_id) over (partition by room_id) as user_cnt,
        sum(price) over (partition by room_id) as price,
        percentile(cast(price as bigint),0.5) over () as price_median 
    from
        ods.ods_trade
    where day = '2017-08-08' and trade_status = 1

The query above runs correctly in SparkSQL, but Hive reports:

At least 1 group must only depend on input columns ... Expression not in GROUP BY key 'price'

percentile() over () also returns a single value, so why does this error occur, and how can I fix it? Any help would be appreciated.

For example, given this data:

room  user price(consume)
  a    u1    1
  a    u1    5
  a    u2    3
  b    u1    2
  b    u3    4
  c    u4    6
  c    u4    7

Expected result:

  room_id  user_cnt   price  price_median
    a        2         9         4
    b        2         6         4
    c        1         13        4

Answer 1

The error means that price is not in the GROUP BY clause. The following query should work:

    select room, count(distinct user_id), sum(price), price_median
    from (
        select room, user_id, price,
               percentile(cast(price as bigint), 0.5) over () as price_median
        from ods.ods_trade
        group by room, user_id, price
    ) k1
    group by room, price_median

Note: the column names may differ slightly.
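As an alternative sketch (not tested against the asker's data), the global median can be computed in its own single-row subquery and cross-joined onto the per-room aggregates, which avoids mixing a windowed percentile with GROUP BY. The table and column names (ods.ods_trade, room_id, user_id, price, day, trade_status) are taken from the question. The aliases r and m are made up for illustration, and count(DISTINCT user_id) is an assumption based on the expected output, since the question's query used a plain count(user_id).

```sql
-- Sketch only: names follow the question; aliases r and m are illustrative.
SELECT
    r.room_id,
    r.user_cnt,
    r.price,
    m.price_median
FROM (
    -- Per-room aggregates over the filtered rows.
    SELECT
        room_id,
        count(DISTINCT user_id) AS user_cnt,  -- assumption: distinct users per room
        sum(price)              AS price
    FROM ods.ods_trade
    WHERE day = '2017-08-08' AND trade_status = 1
    GROUP BY room_id
) r
CROSS JOIN (
    -- Single-row subquery: median of price over all filtered rows.
    SELECT percentile(cast(price AS bigint), 0.5) AS price_median
    FROM ods.ods_trade
    WHERE day = '2017-08-08' AND trade_status = 1
) m;
```

Because the second subquery returns exactly one row, the cross join simply attaches the same median to every room. Unlike grouping by room, user_id, and price, this form does not collapse rows that happen to share the same room, user, and price before summing.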

Hive percentile() over () requires GROUP BY
