Trying to summarize data using a mix of group by, rank, and dense_rank, but no luck

Asked: 2019-05-21 13:51:12

Tags: sql group-by amazon-redshift rank dense-rank

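I'm struggling with some pretty horrible legacy datasets and need to summarize the data to make it more useful. I'm not sure whether I need rank, dense_rank, group by, or some combination of the three (or something new).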

The data is structured as follows:

--[Table:]
hashed_id | visit_id | datetime            | page_name | ...
----------+----------+---------------------+-----------+-----
abc       | 1        | 2019-01-01 00:00:01 | page1     | ...
abc       | 1        | 2019-01-01 00:00:02 | page1     | ...
abc       | 1        | 2019-01-01 00:00:03 | page1     | ...
abc       | 1        | 2019-01-01 00:00:10 | page1     | ...
abc       | 1        | 2019-01-01 00:00:20 | page2     | ...
abc       | 1        | 2019-01-01 00:00:32 | page2     | ...
abc       | 1        | 2019-01-01 00:00:53 | page1     | ...
abc       | 1        | 2019-01-01 00:00:54 | page1     | ...

I want to keep only the first row of each run of consecutive rows with the same page_name:

--[Table:]
hashed_id | visit_id | datetime            | page_name | ...
----------+----------+---------------------+-----------+-----
abc       | 1        | 2019-01-01 00:00:01 | page1     | ...
abc       | 1        | 2019-01-01 00:00:20 | page2     | ...
abc       | 1        | 2019-01-01 00:00:53 | page1     | ... 

I've tried rank, dense_rank, and group by, but I can't seem to get the expected result. Am I being an idiot :)?

2 answers:

Answer 0 (score: 2)

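Use lag() to keep only the rows where the page differs from the previous row's page within the same visit, i.e. the first occurrence of each run: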

select t.*
from (select t.*,
             lag(page_name) over (partition by hashed_id, visit_id order by datetime) as prev_page_name
      from t
     ) t
where prev_page_name is null or prev_page_name <> page_name
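
To check the lag() approach against the sample rows from the question, here is a minimal sketch that runs the same query in an in-memory SQLite database from Python. SQLite (3.25 or later, for window functions) is only a stand-in for Redshift here, and the helper name `first_rows_of_each_run` and the column types are assumptions made for illustration.

```python
import sqlite3

# Sample rows copied from the question; the table name `t` follows the answer.
ROWS = [
    ("abc", 1, "2019-01-01 00:00:01", "page1"),
    ("abc", 1, "2019-01-01 00:00:02", "page1"),
    ("abc", 1, "2019-01-01 00:00:03", "page1"),
    ("abc", 1, "2019-01-01 00:00:10", "page1"),
    ("abc", 1, "2019-01-01 00:00:20", "page2"),
    ("abc", 1, "2019-01-01 00:00:32", "page2"),
    ("abc", 1, "2019-01-01 00:00:53", "page1"),
    ("abc", 1, "2019-01-01 00:00:54", "page1"),
]

# Same logic as the answer: compare each row with the previous row of the
# same (hashed_id, visit_id) partition and keep it when the page changed.
QUERY = """
select hashed_id, visit_id, datetime, page_name
from (select t.*,
             lag(page_name) over (partition by hashed_id, visit_id
                                  order by datetime) as prev_page_name
      from t
     ) t
where prev_page_name is null or prev_page_name <> page_name
order by hashed_id, visit_id, datetime
"""


def first_rows_of_each_run(rows):
    """Load rows into an in-memory table and return the first row of each run."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (hashed_id text, visit_id integer, "
        "datetime text, page_name text)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in first_rows_of_each_run(ROWS):
        print(row)
```

On the sample data this keeps the rows at 00:00:01 (page1), 00:00:20 (page2), and 00:00:53 (page1). The `prev_page_name is null` test is what keeps the very first row of each visit, since lag() has no previous row to return there.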

Answer 1 (score: 0)

Looking at your data, it seems you need a join between the table and a subquery that takes min(datetime) grouped by hashed_id, visit_id:

select *
from my_table m
inner join (
  select hashed_id, visit_id, min(datetime) as min_date
  from my_table
  group by hashed_id, visit_id
) t on t.hashed_id = m.hashed_id
   and t.visit_id = m.visit_id
   and t.min_date = m.datetime
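
As a rough check of what this join returns, the sketch below runs it on the question's sample rows in an in-memory SQLite database from Python. SQLite stands in for Redshift, and the helper name `earliest_row_per_visit` and the column types are assumptions made for illustration.

```python
import sqlite3

# Sample rows copied from the question; `my_table` follows the answer.
ROWS = [
    ("abc", 1, "2019-01-01 00:00:01", "page1"),
    ("abc", 1, "2019-01-01 00:00:02", "page1"),
    ("abc", 1, "2019-01-01 00:00:03", "page1"),
    ("abc", 1, "2019-01-01 00:00:10", "page1"),
    ("abc", 1, "2019-01-01 00:00:20", "page2"),
    ("abc", 1, "2019-01-01 00:00:32", "page2"),
    ("abc", 1, "2019-01-01 00:00:53", "page1"),
    ("abc", 1, "2019-01-01 00:00:54", "page1"),
]

# The join from the answer, selecting only the columns of the base table.
QUERY = """
select m.hashed_id, m.visit_id, m.datetime, m.page_name
from my_table m
inner join (
  select hashed_id, visit_id, min(datetime) as min_date
  from my_table
  group by hashed_id, visit_id
) t on t.hashed_id = m.hashed_id
   and t.visit_id = m.visit_id
   and t.min_date = m.datetime
order by m.hashed_id, m.visit_id, m.datetime
"""


def earliest_row_per_visit(rows):
    """Load rows into an in-memory table and return the join's result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table my_table (hashed_id text, visit_id integer, "
        "datetime text, page_name text)"
    )
    conn.executemany("insert into my_table values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in earliest_row_per_visit(ROWS):
        print(row)
```

Because the subquery groups only by hashed_id and visit_id, this returns a single row per visit (the 00:00:01 row on the sample data), so it does not separate the later page2 and page1 runs shown in the desired output.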