Hive SQL: Filtering out rows that contain duplicate values for a specific column

Time: 2017-11-08 22:04:23

Tags: sql filter hiveql

I have a Hive table with the following data (the first row is the header):

session,ts,status,color
a,1,new,red
a,2,check,blue
a,3,new,green
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue
c,4,check,blue

I'm having trouble writing a SQL query that meets the following criteria: 1) return all rows for sessions that contain a status of new; 2) if a session contains multiple rows with status=new, keep only the first one (the one with the lowest ts).

The output would be

a,1,new,red
a,2,check,blue
a,4,amount,blue
a,5,end,blue
b,1,new,red
b,2,bottle,blue
b,3,end,blue

Rows a,3,new,green and c,4,check,blue are omitted.

I've written this query, which does the trick if you only look at the session, ts and status columns, but I don't like the second half of the union because it has a group-by in it:

select  session, ts, status from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
union
select session, min(ts), status from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status 

However, as soon as you add the color column, it falls apart: you get both rows for session=a and status=new, one for green and one for red.

select  session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
union
select session, min(ts), status, color from mp_logon3
where status='new'
and session in (select distinct b.session from mp_logon3 b
where b.status = 'new'
)
group by session, status, color

Lastly, is there a better way to write this query as a whole? Maybe one without the union?

2 answers:

Answer 0: (score: 1)

If you are using Teradata SQL:

select  session, ts, status, color
from mp_logon3
where status='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 
qualify row_number() over (partition by session,status order by ts)=1
union
select  session, ts, status, color from mp_logon3
where status!='new'
and session in (select distinct a.session from mp_logon3 a 
where a.status = 'new'
) 

Answer 1: (score: 1)

Here is a HiveQL solution for your problem:

WITH sessions
AS (SELECT DISTINCT session
    FROM mp_logon3
    WHERE STATUS = 'new')
,logons
AS (SELECT session
        ,ts
        ,STATUS
        ,color
        ,row_number() OVER (
            PARTITION BY session
            ,STATUS ORDER BY ts
            ) AS r_num
    FROM mp_logon3)
SELECT l.*
FROM logons l
INNER JOIN sessions s ON (s.session = l.session)
WHERE l.STATUS <> 'new'
    OR l.r_num = 1
ORDER BY l.session
    ,l.ts;
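
As a possible variation on the answer above, the join against the list of sessions can be replaced by a second window function that counts the status=new rows per session. The sketch below is not from the original answers. It loads the sample data from the question into an in-memory SQLite database (used here only as a local stand-in for Hive, and assuming SQLite 3.25 or later for window function support) and runs the query. The column names new_cnt and r_num and the helper run_query are illustrative names. The SQL itself uses only constructs that Hive also supports, but it has not been run against Hive here.

```python
import sqlite3

# Sample rows from the question: (session, ts, status, color).
ROWS = [
    ("a", 1, "new", "red"),
    ("a", 2, "check", "blue"),
    ("a", 3, "new", "green"),
    ("a", 4, "amount", "blue"),
    ("a", 5, "end", "blue"),
    ("b", 1, "new", "red"),
    ("b", 2, "bottle", "blue"),
    ("b", 3, "end", "blue"),
    ("c", 4, "check", "blue"),
]

# Single pass over the table: no UNION and no join.
# new_cnt = number of 'new' rows in the session (0 means drop the session).
# r_num   = rank of the row within (session, status), ordered by ts.
QUERY = """
SELECT session, ts, status, color
FROM (
    SELECT session, ts, status, color,
           SUM(CASE WHEN status = 'new' THEN 1 ELSE 0 END)
               OVER (PARTITION BY session) AS new_cnt,
           ROW_NUMBER()
               OVER (PARTITION BY session, status ORDER BY ts) AS r_num
    FROM mp_logon3
) t
WHERE new_cnt > 0
  AND (status <> 'new' OR r_num = 1)
ORDER BY session, ts
"""


def run_query():
    """Create the sample table in memory, run QUERY, and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mp_logon3 "
        "(session TEXT, ts INTEGER, status TEXT, color TEXT)"
    )
    conn.executemany("INSERT INTO mp_logon3 VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_query():
        print(",".join(str(v) for v in row))
```

On the sample data this returns the seven rows listed in the question's expected output. Session c is dropped because its new_cnt is 0, and a,3,new,green is dropped because it is the second status=new row in session a.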