I have a Hive table with the following data (the first row is the header):
session,ts,status,color
a,1,new,red
a,2,check,blue
a,3,new,green
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue
c,4,check,blue
I'm having trouble writing a SQL query that meets the following criteria: 1) return all rows for sessions that contain a status of new; 2) if a session contains multiple rows with status=new, keep only the first one (the one with the lowest ts).
The expected output would be:
a,1,new,red
a,2,check,blue
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue
Rows a,3,new,green and c,4,check,blue are omitted.
I've written this query, which does indeed do the trick if you are only looking at the session, ts and status columns, but I don't like the second query as it has a group-by in it:
select session, ts, status from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a
where a.status = 'new'
)
union
select session, min(ts), status from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status
However, as soon as you add the color column, it falls apart. Because color is now part of the grouping key, you get both rows for session=a and status=new: one for green and one for red.
select session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a
where a.status = 'new'
)
union
select session, min(ts), status, color from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status, color
Lastly, is there a better way to write this query as a whole? Maybe one without the union?
Answer 0 (score: 1)
If you are using Teradata SQL:
select session, ts, status, color
from mp_logon3
where status='new'
and session in (select distinct a.session from mp_logon3 a
where a.status = 'new'
)
qualify row_number() over (partition by session,status order by ts)=1
union
select session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a
where a.status = 'new'
)
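For engines that lack QUALIFY, the same "first new row per session" filter can be sketched with a derived table. This is an untested rewrite of the query above under the assumption that the table and columns are exactly those in the question; the aliases n and rn are made up for illustration. The IN check is dropped from the first branch because every row with status = 'new' trivially belongs to a session that has one.

```sql
-- First branch: only the earliest 'new' row of each session.
SELECT session, ts, status, color
FROM (
    SELECT session, ts, status, color,
           -- number the 'new' rows of each session by time
           ROW_NUMBER() OVER (PARTITION BY session ORDER BY ts) AS rn
    FROM mp_logon3
    WHERE status = 'new'
) n
WHERE rn = 1
UNION ALL
-- Second branch: every non-'new' row of sessions that have a 'new' row.
SELECT session, ts, status, color
FROM mp_logon3
WHERE status <> 'new'
  AND session IN (SELECT session FROM mp_logon3 WHERE status = 'new');
```

The two branches cannot overlap (one selects status = 'new', the other excludes it), so UNION ALL is used to skip the deduplication step. It differs from UNION only if the table itself holds exact duplicate rows.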
Answer 1 (score: 1)
Here is a HiveQL solution to your problem:
WITH sessions
AS (SELECT DISTINCT session
FROM mp_logon3
WHERE STATUS = 'new')
,logons
AS (SELECT session
,ts
,STATUS
,color
,row_number() OVER (
PARTITION BY session
,STATUS ORDER BY ts
) AS r_num
FROM mp_logon3)
SELECT l.*
FROM logons l
INNER JOIN sessions s ON (s.session = l.session)
WHERE l.STATUS <> 'new'
OR l.r_num = 1
ORDER BY l.session
,l.ts;
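If the join is also unwanted, the session filter can be folded into a second window function. The following is a minimal sketch, assuming the mp_logon3 table from the question; the column alias new_cnt and the subquery alias t are hypothetical names introduced here. It counts the 'new' rows per session in the same pass that numbers the rows, so neither a union nor a join is needed.

```sql
SELECT session, ts, status, color
FROM (
    SELECT session, ts, status, color,
           -- position of each row within its (session, status) group, by time
           ROW_NUMBER() OVER (PARTITION BY session, status ORDER BY ts) AS r_num,
           -- how many 'new' rows the whole session contains
           SUM(CASE WHEN status = 'new' THEN 1 ELSE 0 END)
               OVER (PARTITION BY session) AS new_cnt
    FROM mp_logon3
) t
WHERE new_cnt > 0                        -- keep only sessions that have a 'new' row
  AND (status <> 'new' OR r_num = 1)     -- of the 'new' rows, keep only the earliest
ORDER BY session, ts;
```

On the sample data this keeps all of session b, drops session c (no 'new' row), and drops a,3,new,green as the second 'new' row of session a. Listing the columns explicitly instead of using l.* also keeps the helper columns out of the result. If two 'new' rows in a session share the same ts, ROW_NUMBER picks one of them arbitrarily, so a tie-breaking column would have to be added to the ORDER BY.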