I have a table of online sessions like this:
ip_address | start_time | stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:12
10.10.10.10 | 2016-04-02 08:11 | 2016-04-02 08:20
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:10
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:08
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:11
10.10.10.10 | 2016-04-02 09:02 | 2016-04-02 09:15
10.10.10.10 | 2016-04-02 09:10 | 2016-04-02 09:12
10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
I need the "envelope" online time spans:
ip_address | full_start_time | full_stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 08:20
10.10.10.10 | 2016-04-02 09:00 | 2016-04-02 09:15
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
I have this query, which returns the desired result:
WITH t AS
-- Determine full time-range of each IP
(SELECT ip_address, MIN(start_time) AS min_start_time, MAX(stop_time) AS max_stop_time FROM IP_SESSIONS GROUP BY ip_address),
t2 AS
-- compose ticks
(SELECT DISTINCT ip_address, min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE AS ts
FROM t
CONNECT BY min_start_time + (LEVEL-1) * INTERVAL '1' MINUTE <= max_stop_time),
t3 AS
-- get all "online" ticks
(SELECT DISTINCT ip_address, ts
FROM t2
JOIN IP_SESSIONS USING (ip_address)
WHERE ts BETWEEN start_time AND stop_time),
t4 AS
(SELECT ip_address, ts,
LAG(ts) OVER (PARTITION BY ip_address ORDER BY ts) AS previous_ts
FROM t3),
t5 AS
(SELECT ip_address, ts,
SUM(DECODE(previous_ts,NULL,1,0 + (CASE WHEN previous_ts + INTERVAL '1' MINUTE <> ts THEN 1 ELSE 0 END)))
OVER (PARTITION BY ip_address ORDER BY ts ROWS UNBOUNDED PRECEDING) session_no
FROM t4)
SELECT ip_address, MIN(ts) AS full_start_time, MAX(ts) AS full_stop_time
FROM t5
GROUP BY ip_address, session_no
ORDER BY 1,2;
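However, I am concerned about performance. The table has hundreds of millions of rows, and the time resolution is milliseconds (not one minute as in the example), so CTE t3 would be huge. Does anyone have a solution that avoids the self-join and the "CONNECT BY"?
A single smart analytic function would be great.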
Answer 0 (score: 3)
Try this one, too. I tested it as best I could, and I believe it covers all the cases, including coalescing adjacent intervals (10:15 to 10:30 and 10:30 to 10:40 are combined into a single interval, 10:15 to 10:40). It should also be quite fast: it uses only analytic functions and one GROUP BY, with no joins and no CONNECT BY.
with m as
(
select ip_address, start_time,
max(stop_time) over (partition by ip_address order by start_time
rows between unbounded preceding and 1 preceding) as m_time
from ip_sessions
union all
select ip_address, NULL, max(stop_time) from ip_sessions group by ip_address
),
n as
(
select ip_address, start_time, m_time
from m
where start_time > m_time or start_time is null or m_time is null
),
f as
(
select ip_address, start_time,
lead(m_time) over (partition by ip_address order by start_time) as stop_time
from n
)
select * from f where start_time is not null
/
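As a quick check of the adjacency claim, the sketch below loads the two touching intervals mentioned above. The address 10.0.0.1 and the date are made up for illustration, and the inserts assume ip_sessions has exactly the three columns from the question (ip_address, start_time, stop_time), in that order. Tracing the query by hand, the second row is dropped in n because its start_time is not greater than the running maximum m_time, so one merged row should remain.

```sql
-- Hypothetical rows: two sessions that touch at 10:30.
insert into ip_sessions values ('10.0.0.1', timestamp '2016-04-02 10:15:00', timestamp '2016-04-02 10:30:00');
insert into ip_sessions values ('10.0.0.1', timestamp '2016-04-02 10:30:00', timestamp '2016-04-02 10:40:00');

-- Re-running the query above should then return a single row for this address:
-- 10.0.0.1 | 2016-04-02 10:15 | 2016-04-02 10:40
```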
Answer 1 (score: 1)
Please test this solution. It works for your examples, but there may be cases I have not noticed. No CONNECT BY, no self-join.
with io as (
select * from (
select ip_address, t1, io, sum(io) over (partition by ip_address order by t1) sio
from (
select ip_address, start_time t1, 1 io from ip_sessions
union all
select ip_address, stop_time, -1 io from ip_sessions ) )
where (io = 1 and sio = 1) or (io = -1 and sio = 0) )
select ip_address, t1, t2
from (
select io.*, lead(t1) over (partition by ip_address order by t1) as t2 from io)
where io = 1
Test data:
create table ip_sessions (ip_address varchar2(15), start_time date, stop_time date);
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:00:00', timestamp '2016-04-02 08:12:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 08:11:00', timestamp '2016-04-02 08:20:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:00:00', timestamp '2016-04-02 09:10:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:05:00', timestamp '2016-04-02 09:08:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:02:00', timestamp '2016-04-02 09:15:00');
insert into ip_sessions values ('10.10.10.10', timestamp '2016-04-02 09:10:00', timestamp '2016-04-02 09:12:00');
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:05:00', timestamp '2016-04-02 08:07:00');
insert into ip_sessions values ('10.66.44.22', timestamp '2016-04-02 08:03:00', timestamp '2016-04-02 08:11:00');
Output:
IP_ADDRESS T1 T2
----------- ------------------- -------------------
10.10.10.10 2016-04-02 08:00:00 2016-04-02 08:20:00
10.10.10.10 2016-04-02 09:00:00 2016-04-02 09:15:00
10.66.44.22 2016-04-02 08:03:00 2016-04-02 08:11:00
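One case worth checking in this query is tied timestamps. With `order by t1` alone, Oracle's default window (RANGE up to the current row) gives every event at the same timestamp the same running total. If two sessions open an island at the same instant, both start rows get sio = 2 and the start goes undetected. The variant below is a sketch of one possible fix, not the answerer's original query. The tie-break `io desc` (starts before stops) and the ROWS clause are my additions, and the CTE names are arbitrary. Table and column names are taken from the test data.

```sql
with events as (
  -- one +1 event per session start, one -1 event per session stop
  select ip_address, start_time as t1, 1 as io from ip_sessions
  union all
  select ip_address, stop_time, -1 from ip_sessions
),
running as (
  -- row-by-row count of open sessions; at equal timestamps starts are
  -- counted before stops, so touching sessions never drop to zero
  select ip_address, t1, io,
         sum(io) over (partition by ip_address
                       order by t1, io desc
                       rows unbounded preceding) as sio
  from events
),
edges as (
  -- keep only island boundaries: a start that raises the count from 0 to 1,
  -- or a stop that lowers it from 1 to 0
  select ip_address, t1, io
  from running
  where (io = 1 and sio = 1) or (io = -1 and sio = 0)
)
select ip_address, t1, t2
from (
  -- boundaries alternate start/stop, so the next row holds the island's end
  select ip_address, t1, io,
         lead(t1) over (partition by ip_address order by t1, io desc) as t2
  from edges
)
where io = 1
order by ip_address, t1;
```

On the test data above this should return the same three rows as the original query. The difference only shows up when several events share a timestamp.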
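Answer 2 (score: 0)
I think using lag() and a cumulative sum would perform better:
select ip_address, min(start_time) as full_start_time,
max(stop_time) as full_stop_time
from (select t.*,
-- a row opens a new group when the previous stop time is before its start
sum(case when prev_et >= start_time then 0 else 1 end) over
(partition by ip_address order by start_time) as grp
from (select s.*,
lag(stop_time) over (partition by ip_address order by stop_time) as prev_et
from ip_sessions s) t
)
group by grp, ip_address
order by 1, 2;
Run against the sample data from the question, this gives the following result, which does not match the requested envelopes: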
ip_address | full_start_time | full_stop_time
------------|------------------|------------------
10.10.10.10 | 2016-04-02 08:00 | 2016-04-02 09:15
10.10.10.10 | 2016-04-02 09:05 | 2016-04-02 09:12
10.66.44.22 | 2016-04-02 08:03 | 2016-04-02 08:11
10.66.44.22 | 2016-04-02 08:05 | 2016-04-02 08:07
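The result above differs from the envelopes requested in the question: for example, 08:00 to 09:15 spans the gap between 08:20 and 09:00. A likely cause is that lag() looks only at the single preceding row, whereas an overlap test needs the latest stop_time among all earlier sessions. The sketch below is my own variation, not the answerer's query. It replaces lag() with a running MAX over the preceding rows ordered by start_time, and assumes the ip_sessions columns from the test data (start_time, stop_time). The name prev_max_stop is made up for illustration.

```sql
select ip_address,
       min(start_time) as full_start_time,
       max(stop_time)  as full_stop_time
from (
  select t.*,
         -- a session opens a new group when it starts after every
         -- earlier session of the same address has already stopped
         sum(case when prev_max_stop >= start_time then 0 else 1 end)
           over (partition by ip_address order by start_time) as grp
  from (
    select s.*,
           -- latest stop_time among all sessions that sort earlier;
           -- NULL for the first session of each address
           max(stop_time) over (partition by ip_address
                                order by start_time
                                rows between unbounded preceding
                                         and 1 preceding) as prev_max_stop
    from ip_sessions s
  ) t
)
group by ip_address, grp
order by 1, 2;
```

Traced by hand on the sample data, this should return the three envelopes from the question. Because the comparison is `>=`, sessions that merely touch (one stops exactly when the next starts) are merged into one span.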