How to compute the most popular device, OS and browser in each city in HiveQL?

Time: 2018-12-24 11:54:09

Tags: hive mapreduce hiveql apache-tez

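I have a table containing a user agent string (which I parse into browser, os and device columns) and a city id. I want to find the most popular browser, os and device for each city.

Here is my attempt: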

select device, os, browser, name, MAX(hits) as pop from 
(select uap.device, uap.os, uap.browser, name, COUNT(*) as hits 
from (select * from browserdata join citydata on cityid=id) t 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser 
GROUP BY uap.device, uap.os, uap.browser, name) t2 
GROUP BY name;

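The innermost subquery, aliased as t, just joins my table to another table that maps id to the city name, so that the output shows the actual name rather than the city id.

The subquery named t2 then counts the rows for each composite key (device, browser, os, city). The outer query groups everything by city and is meant to pull out the row with the highest number of users.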

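The error I get is:

FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'device'

I understand what this means. It says I need to include device in the GROUP BY, but if I do that, the query will no longer compute what I want. How can I fix my query?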

Also, I noticed that some of my Hive queries run on MapReduce but not on Tez. Why is that?

2 answers:

Answer 0 (score: 1)

Using analytic functions, you can eliminate the unnecessary join:

WITH 
t1 as 
(select * from browserdata join citydata on cityid=id),

t2 as 
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname 
from t1 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
(select t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, count(*) as count from t2 group by t2.cityname, t2.os, t2.device, t2.browser)

select cityname, maximum,  device, os, browser
 from
     (select cityname, device, browser, os, 
             max(count) over(partition by cityname)                         as maximum,
             dense_rank() over (partition by cityname order by count desc ) as rnk      
      from t3
     ) s  where rnk =1 
;
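One thing to be aware of with the query above: dense_rank() keeps every combination that ties for first place, so a city can come back with several rows. If exactly one row per city is wanted, a minimal sketch with row_number() could look like the following. The table, column and UDTF names are the ones from the question; the hits alias and the tie-breaking order (device, os, browser) are assumptions made only for illustration.

```sql
-- Sketch only: same CTEs as above, but row_number() returns a single row per city.
-- The tie-break columns in ORDER BY are an arbitrary, assumed choice.
WITH
t1 as
(select * from browserdata join citydata on cityid=id),

t2 as
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname
from t1
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
(select cityname, device, os, browser, count(*) as hits
from t2
group by cityname, device, os, browser)

select cityname, hits, device, os, browser
from
    (select cityname, device, os, browser, hits,
            -- rank combinations within each city, highest count first
            row_number() over (partition by cityname
                               order by hits desc, device, os, browser) as rn
     from t3
    ) s
where rn = 1
;
```

Whether ties should be kept (dense_rank) or collapsed to one row (row_number) depends on how the result is going to be used.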

Answer 1 (score: 0)

WITH t1 as 
(select * from browserdata join citydata on cityid=id),

t2 as 
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname 
from t1 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
  (SELECT t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, COUNT(*) as count FROM t2 GROUP BY t2.cityname, t2.os, t2.device, t2.browser),

t4 as
    (select cityname, MAX(count) as maximum from t3 group by cityname)

select t4.cityname, t4.maximum, t3.device, t3.os, t3.browser
from t4 join t3 on t4.cityname=t3.cityname and t4.maximum=t3.count;

This works, but I wonder whether there is a way to optimize it...