I am trying to understand how the avg window function works, and it does not seem to behave the way I expect.
Here is the dataset:
select * from winsales;
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| winsales.salesid | winsales.dateid | winsales.sellerid | winsales.buyerid | winsales.qty | winsales.qty_shipped |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| 30001 | NULL | 3 | b | 10 | 10 |
| 10001 | NULL | 1 | c | 10 | 10 |
| 10005 | NULL | 1 | a | 30 | NULL |
| 40001 | NULL | 4 | a | 40 | NULL |
| 20001 | NULL | 2 | b | 20 | 20 |
| 40005 | NULL | 4 | a | 10 | 10 |
| 20002 | NULL | 2 | c | 20 | 20 |
| 30003 | NULL | 3 | b | 15 | NULL |
| 30004 | NULL | 3 | b | 20 | NULL |
| 30007 | NULL | 3 | c | 30 | NULL |
| 30001 | NULL | 3 | b | 10 | 10 |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
When I run the following query:
select salesid, sellerid, qty, avg(qty) over (order by sellerid) as avg_qty from winsales order by sellerid,salesid;
I get the following:
+----------+-----------+------+---------------------+--+
| salesid | sellerid | qty | avg_qty |
+----------+-----------+------+---------------------+--+
| 10001 | 1 | 10 | 20.0 |
| 10005 | 1 | 30 | 20.0 |
| 20001 | 2 | 20 | 20.0 |
| 20002 | 2 | 20 | 20.0 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30003 | 3 | 15 | 18.333333333333332 |
| 30004 | 3 | 20 | 18.333333333333332 |
| 30007 | 3 | 30 | 18.333333333333332 |
| 40001 | 4 | 40 | 19.545454545454547 |
| 40005 | 4 | 10 | 19.545454545454547 |
+----------+-----------+------+---------------------+--+
The question is: how is avg(qty) being calculated? Since I am not using partition by, I expected avg(qty) to be the same for all rows.
Any ideas?
Answer 0 (score: 0)
If you want the same avg(qty) on every row, remove order by sellerid from the over clause. Every row will then have the value 19.545454545454547.
Query to get the overall avg(qty) on every row:
hive> select salesid, sellerid, qty, avg(qty) over () as avg_qty from winsales order by sellerid,salesid;
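If we include order by sellerid in the over clause, you get a cumulative average up to and including each sellerid. When order by is given without an explicit window frame, Hive defaults to range between unbounded preceding and current row, so rows that share a sellerid are peers and all receive the same value.
That is, for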
sellerid 1: 2 records (2 records so far) with qty 10, 30, so the avg is
(10+30)/2 = 20.0
sellerid 2: 2 records (4 records so far) with qty 20, 20, so the avg is
(10+30+20+20)/4 = 20.0
sellerid 3: 5 records (9 records so far) with qty 10, 10, 15, 20, 30, so the avg is
(10+30+20+20+10+10+15+20+30)/9 = 18.333
sellerid 4: 2 records (11 records so far) with qty 40, 10, so the avg is
(10+30+20+20+10+10+15+20+30+40+10)/11 = 19.545454545454547
This is the expected behavior in Hive when we include order by in the over clause.
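To double-check the arithmetic outside Hive, here is a minimal Python sketch that emulates the cumulative average from the question. It assumes that rows sharing a sellerid enter the average together, which is what the query output shows. The function name running_avg_by_peer_group is made up for illustration and is not part of Hive.

```python
from itertools import groupby

# (salesid, sellerid, qty) rows copied from the winsales table in the question.
rows = [
    (30001, 3, 10), (10001, 1, 10), (10005, 1, 30), (40001, 4, 40),
    (20001, 2, 20), (40005, 4, 10), (20002, 2, 20), (30003, 3, 15),
    (30004, 3, 20), (30007, 3, 30), (30001, 3, 10),
]


def running_avg_by_peer_group(rows):
    """Emulate avg(qty) over (order by sellerid), treating equal sellerids as peers.

    Hypothetical helper for illustration only. Returns {sellerid: cumulative average}.
    """
    ordered = sorted(rows, key=lambda r: r[1])  # order by sellerid
    total, count, result = 0, 0, {}
    for sellerid, peers in groupby(ordered, key=lambda r: r[1]):
        # Add every row of this sellerid before computing the average,
        # so all peers end up with the same value.
        for _, _, qty in peers:
            total += qty
            count += 1
        result[sellerid] = total / count
    return result


avg_by_seller = running_avg_by_peer_group(rows)
for sellerid, avg in avg_by_seller.items():
    print(sellerid, avg)
```

The value for the last sellerid equals the plain average over all 11 rows (215/11), which is why over () returns that same number on every row.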