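I am wrestling with some pretty horrible legacy datasets and need to aggregate the data to make it more useful. I am not sure whether I need rank, dense_rank, group by, or some combination of the three (or something new).
The data is structured as follows: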
--[Table:]
hashed_id | visit_id | datetime | page_name | ...
----------+----------+---------------------+-----------+-----
abc | 1 | 2019-01-01 00:00:01 | page1 | ...
abc | 1 | 2019-01-01 00:00:02 | page1 | ...
abc | 1 | 2019-01-01 00:00:03 | page1 | ...
abc | 1 | 2019-01-01 00:00:10 | page1 | ...
abc | 1 | 2019-01-01 00:00:20 | page2 | ...
abc | 1 | 2019-01-01 00:00:32 | page2 | ...
abc | 1 | 2019-01-01 00:00:53 | page1 | ...
abc | 1 | 2019-01-01 00:00:54 | page1 | ...
What I want is:
--[Table:]
hashed_id | visit_id | datetime | page_name | ...
----------+----------+---------------------+-----------+-----
abc | 1 | 2019-01-01 00:00:01 | page1 | ...
abc | 1 | 2019-01-01 00:00:20 | page2 | ...
abc | 1 | 2019-01-01 00:00:53 | page1 | ...
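I have tried rank, dense_rank and group by, but I don't seem to get the expected result. Am I being an idiot :)?
Answer 0: (score: 2)
Use lag() to get the first occurrence of each page that differs from the previous page: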
select t.*
from (select t.*,
lag(page_name) over (partition by hashed_id, visit_id order by datetime) as prev_page_name
from t
) t
where prev_page_name is null or prev_page_name <> page_name
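As a minimal sketch of how this behaves on the sample data from the question, the script below builds a small table and runs the same lag() filter. The table name t comes from the query above; the column types, the subquery alias s, and the choice to list columns explicitly instead of t.* are assumptions made for illustration. It is written for dialects with window functions, such as PostgreSQL or SQLite 3.25+.

```sql
-- Hypothetical table definition; the real table has more columns ("...").
create table t (
    hashed_id varchar(32),
    visit_id  integer,
    datetime  timestamp,
    page_name varchar(32)
);

-- Sample rows copied from the question.
insert into t (hashed_id, visit_id, datetime, page_name) values
    ('abc', 1, '2019-01-01 00:00:01', 'page1'),
    ('abc', 1, '2019-01-01 00:00:02', 'page1'),
    ('abc', 1, '2019-01-01 00:00:03', 'page1'),
    ('abc', 1, '2019-01-01 00:00:10', 'page1'),
    ('abc', 1, '2019-01-01 00:00:20', 'page2'),
    ('abc', 1, '2019-01-01 00:00:32', 'page2'),
    ('abc', 1, '2019-01-01 00:00:53', 'page1'),
    ('abc', 1, '2019-01-01 00:00:54', 'page1');

-- Keep a row only when its page differs from the previous row's page
-- within the same visit (or when there is no previous row at all).
select hashed_id, visit_id, datetime, page_name
from (select t.*,
             lag(page_name) over (partition by hashed_id, visit_id
                                  order by datetime) as prev_page_name
      from t
     ) s
where prev_page_name is null or prev_page_name <> page_name
order by hashed_id, visit_id, datetime;

-- Rows returned for the sample data:
-- abc | 1 | 2019-01-01 00:00:01 | page1
-- abc | 1 | 2019-01-01 00:00:20 | page2
-- abc | 1 | 2019-01-01 00:00:53 | page1
```

The first row of each visit has no previous page, so lag() yields null there; the "is null" branch of the filter is what keeps that row.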
Answer 1: (score: 0)
To select your data, it seems you need a join between the table and min(datetime) grouped by hashed_id, visit_id:
select * from my_table m
inner join (
select hashed_id, visit_id, min(datetime) min_date
from my_table
group by hashed_id, visit_id
) t on t.hashed_id = m.hashed_id
and t.visit_id = m.visit_id
and t.min_date = m.datetime
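Note that min(datetime) here is computed once per (hashed_id, visit_id), so on the sample data this join keeps only the 00:00:01 row instead of the three rows the question asks for. If a group by formulation is preferred, one possible sketch (not part of this answer) is to group each consecutive run of the same page by the difference between two row_number() values. The inline my_table CTE, the numbered CTE, and the names grp and first_seen are assumptions for illustration; the syntax targets PostgreSQL or SQLite 3.25+.

```sql
-- Hypothetical inline copy of the sample rows, standing in for my_table.
with my_table (hashed_id, visit_id, datetime, page_name) as (
    values ('abc', 1, '2019-01-01 00:00:01', 'page1'),
           ('abc', 1, '2019-01-01 00:00:02', 'page1'),
           ('abc', 1, '2019-01-01 00:00:03', 'page1'),
           ('abc', 1, '2019-01-01 00:00:10', 'page1'),
           ('abc', 1, '2019-01-01 00:00:20', 'page2'),
           ('abc', 1, '2019-01-01 00:00:32', 'page2'),
           ('abc', 1, '2019-01-01 00:00:53', 'page1'),
           ('abc', 1, '2019-01-01 00:00:54', 'page1')
),
numbered as (
    -- Within one visit, the difference between the overall row number and
    -- the per-page row number is constant across a consecutive run of
    -- the same page, so (page_name, grp) identifies each run.
    select m.*,
           row_number() over (partition by hashed_id, visit_id
                              order by datetime)
         - row_number() over (partition by hashed_id, visit_id, page_name
                              order by datetime) as grp
    from my_table m
)
select hashed_id, visit_id, min(datetime) as first_seen, page_name
from numbered
group by hashed_id, visit_id, page_name, grp
order by hashed_id, visit_id, first_seen;

-- Rows returned for the sample data:
-- abc | 1 | 2019-01-01 00:00:01 | page1
-- abc | 1 | 2019-01-01 00:00:20 | page2
-- abc | 1 | 2019-01-01 00:00:53 | page1
```

Grouping by page_name alone would merge the two separate page1 runs into one; the extra grp key is what keeps them apart.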