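Hive: group by where at least one record in the group matches a where-style condition

Time: 2018-04-14 21:45:58

Tags: hive hiveql

I am trying to filter groups down to only those sessions that include a participant whose time in the session exceeded 5 minutes.

My current query: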

select 
U.session_id,
U.session_date,
U.participant_duration,
U.email
from data.usage U
left outer join
  (select 
  distinct M.session_id
  from data.usage M
  where email like '%gmail.com%'
  and data_date >= '20180101'
  and name in
    ( 
    select 
    lower(name)
    from data.users
    where role like 'Person%' 
    and isactive = TRUE
    and data_date = '20180412'
    ))M
on U.session_id = M.session_id

Once the data comes out...

session_id   session_date   participant_duration   email
143          20180401       0.4                    huy@gmail.com
143          20180401       1.5                    t@gmail.com
143          20180401       1.6                    att@gmail.com
143          20180401       2.3                    m@gmail.com
124          20180401       5.6                    p@gmail.com
124          20180401       3.2                    alex@gmail.com
165          20180401       4.1                    jeff@gmail.com
165          20180401       3.1                    nader@gmail.com

I want to filter this with a where clause that returns only the groups containing at least 1 record with participant_duration >= 5.

Something like: group by session_id having participant_duration >= 5

Is this way off?

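2 answers:

Answer 0 (score: 0)

Yes, you had the right idea with group by and having. The condition in having just needs to be an aggregate over the group: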

group by session_id
having sum(cast(participant_duration >= 5 as int)) >= 1
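To check this having condition against the sample rows from the question, here is a minimal sketch in Python that uses the standard-library sqlite3 module as a stand-in for Hive. That substitution is an assumption made only so the snippet runs anywhere; the in-memory table `usage` and the helpers `load_usage` and `sessions_with_long_participant` are hypothetical names, not part of the original query. SQLite evaluates a comparison to 0 or 1, so the same sum(cast(... as int)) expression applies.

```python
import sqlite3

# Sample rows copied from the question:
# (session_id, session_date, participant_duration, email)
ROWS = [
    (143, "20180401", 0.4, "huy@gmail.com"),
    (143, "20180401", 1.5, "t@gmail.com"),
    (143, "20180401", 1.6, "att@gmail.com"),
    (143, "20180401", 2.3, "m@gmail.com"),
    (124, "20180401", 5.6, "p@gmail.com"),
    (124, "20180401", 3.2, "alex@gmail.com"),
    (165, "20180401", 4.1, "jeff@gmail.com"),
    (165, "20180401", 3.1, "nader@gmail.com"),
]


def load_usage():
    """Build an in-memory stand-in for data.usage (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table usage ("
        "session_id integer, session_date text, "
        "participant_duration real, email text)"
    )
    conn.executemany("insert into usage values (?, ?, ?, ?)", ROWS)
    return conn


def sessions_with_long_participant(conn, threshold=5):
    """Return session ids having at least one row at or above the threshold."""
    sql = """
        select session_id
        from usage
        group by session_id
        -- each matching row contributes 1, every other row contributes 0
        having sum(cast(participant_duration >= ? as int)) >= 1
        order by session_id
    """
    return [row[0] for row in conn.execute(sql, (threshold,))]


if __name__ == "__main__":
    print(sessions_with_long_participant(load_usage()))
```

On the sample data only session 124 has a participant at 5 or above (5.6), so it is the only group kept.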

Also, your query can be simplified to

select *
from (select U.session_id,U.session_date,U.participant_duration,U.email,
      SUM(cast(U.participant_duration >= 5 as int)) OVER(PARTITION BY U.session_id) as dur_gt_5
      from data.usage U
      join data.users M on U.session_id = M.session_id and U.name=lower(M.name)
      where M.role like 'Person%' and M.isactive = TRUE and M.data_date = '20180412'
      and U.email like '%gmail.com%' and U.data_date >= '20180101'
     ) t
where dur_gt_5>=1
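The window-function form differs from a plain group by in that it keeps every row of a qualifying session instead of collapsing the session to one row. A minimal sketch of just that part, again using Python's sqlite3 as a stand-in for Hive (an assumption; it needs SQLite 3.25 or newer for window functions). The table `usage` and the helper names are hypothetical, and the join against data.users is left out because the question does not show that table's contents.

```python
import sqlite3

# Sample rows copied from the question:
# (session_id, session_date, participant_duration, email)
ROWS = [
    (143, "20180401", 0.4, "huy@gmail.com"),
    (143, "20180401", 1.5, "t@gmail.com"),
    (143, "20180401", 1.6, "att@gmail.com"),
    (143, "20180401", 2.3, "m@gmail.com"),
    (124, "20180401", 5.6, "p@gmail.com"),
    (124, "20180401", 3.2, "alex@gmail.com"),
    (165, "20180401", 4.1, "jeff@gmail.com"),
    (165, "20180401", 3.1, "nader@gmail.com"),
]


def load_usage():
    """Build an in-memory stand-in for data.usage (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table usage ("
        "session_id integer, session_date text, "
        "participant_duration real, email text)"
    )
    conn.executemany("insert into usage values (?, ?, ?, ?)", ROWS)
    return conn


def rows_in_qualifying_sessions(conn, threshold=5):
    """Return all rows of sessions with at least one row at or above threshold."""
    sql = """
        select session_id, participant_duration, email
        from (
            select session_id, participant_duration, email,
                   -- per-session count of rows at or above the threshold,
                   -- repeated on every row of that session
                   sum(cast(participant_duration >= ? as int))
                       over (partition by session_id) as dur_gt_5
            from usage
        ) t
        where dur_gt_5 >= 1
        order by participant_duration
    """
    return conn.execute(sql, (threshold,)).fetchall()


if __name__ == "__main__":
    for row in rows_in_qualifying_sessions(load_usage()):
        print(row)
```

With the sample data this returns both participants of session 124, including the one at 3.2, because the filter is applied to the session as a whole rather than to individual rows.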

Answer 1 (score: 0)

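If you group by the session_id field, you need to apply aggregate functions (such as sum, min, max) to the other fields in the select list.

I assume session_id and session_date are the same across a session's records, so either put both fields in the group by, or, if you don't want session_date in the group by, apply an aggregate function to it, for example max(session_date).

Use the sum aggregate function on participant_duration, then test participant_duration in the having clause to keep only the groups that contain a record with a value of 5 or more.

The only field left in the select statement is email, which is not in the group by clause, so I used the max aggregate function to get a single value for the email field.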

With session_date in the group by clause:

select 
U.session_id,
U.session_date,
sum(U.participant_duration) participant_duration,
max(U.email) email
from data.usage U
left outer join
  (select 
  distinct M.session_id
  from data.usage M
  where email like '%gmail.com%'
  and data_date >= '20180101'
  and name in
    ( 
    select 
    lower(name)
    from data.users
    where role like 'Person%' 
    and isactive = TRUE
    and data_date = '20180412'
    ))M
on U.session_id = M.session_id
group by U.session_id,U.session_date
having sum(cast(U.participant_duration >= 5 as int)) >= 1;

(or)

With session_date not in the group by clause:

select 
U.session_id,
max(U.session_date) session_date,
sum(U.participant_duration) participant_duration,
max(U.email) email
from data.usage U
left outer join
  (select 
  distinct M.session_id
  from data.usage M
  where email like '%gmail.com%'
  and data_date >= '20180101'
  and name in
    ( 
    select 
    lower(name)
    from data.users
    where role like 'Person%' 
    and isactive = TRUE
    and data_date = '20180412'
    ))M
on U.session_id = M.session_id
group by U.session_id
having sum(cast(U.participant_duration >= 5 as int)) >= 1;
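It matters which expression goes inside the having clause. The sketch below compares three variants on the question's sample rows: the conditional sum used in the queries above, a max-based test (an alternative that is not in the answers but, as far as I can tell, selects the same groups), and a plain sum of participant_duration, which answers a different question. It uses Python's sqlite3 as a stand-in for Hive, which is an assumption; the table `usage`, `HAVING_VARIANTS`, and the helper names are hypothetical.

```python
import sqlite3

# Sample rows copied from the question:
# (session_id, session_date, participant_duration, email)
ROWS = [
    (143, "20180401", 0.4, "huy@gmail.com"),
    (143, "20180401", 1.5, "t@gmail.com"),
    (143, "20180401", 1.6, "att@gmail.com"),
    (143, "20180401", 2.3, "m@gmail.com"),
    (124, "20180401", 5.6, "p@gmail.com"),
    (124, "20180401", 3.2, "alex@gmail.com"),
    (165, "20180401", 4.1, "jeff@gmail.com"),
    (165, "20180401", 3.1, "nader@gmail.com"),
]

# Fixed set of having conditions to compare (not user input).
HAVING_VARIANTS = {
    # at least one row in the group is >= 5
    "conditional_sum": "sum(cast(participant_duration >= 5 as int)) >= 1",
    # the largest row in the group is >= 5 (same groups as above)
    "max": "max(participant_duration) >= 5",
    # the group's TOTAL duration is >= 5 (a different condition)
    "plain_sum": "sum(participant_duration) >= 5",
}


def load_usage():
    """Build an in-memory stand-in for data.usage (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table usage ("
        "session_id integer, session_date text, "
        "participant_duration real, email text)"
    )
    conn.executemany("insert into usage values (?, ?, ?, ?)", ROWS)
    return conn


def sessions_kept(conn, variant):
    """Return the session ids that survive the named having condition."""
    sql = (
        "select session_id from usage "
        "group by session_id "
        f"having {HAVING_VARIANTS[variant]} "
        "order by session_id"
    )
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    conn = load_usage()
    for name in HAVING_VARIANTS:
        print(name, sessions_kept(conn, name))
```

On the sample data the conditional sum and the max test both keep only session 124, while the plain sum keeps all three sessions, since several short participants add up to more than 5. Note also that any of these grouped queries returns one row per session, so the individual participant rows are not preserved.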