Creating a custom session identifier by comparing previous and next records

Time: 2018-02-02 20:41:12

Tags: mysql sql-server teradata amazon-redshift vertica

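I am trying to create sessions based on my data (in Vertica, but SQL for any other OLAP database should work).

I have a simple table with the columns Vehicle-ID, Event and Event-Code; "Session-ID" is the column I want to populate.

I have tried PARTITION BY, LEAD, LAG and other analytic functions, but had no luck. The logic for creating sessions is as follows.

A session starts the first time a Started event (1) is encountered and ends at the last Stopped event (2). Once a session has started, any further Started events inside it are ignored, and we look for the last Stopped event. Example: Session-Id-1.

If, for some reason, the event that follows a Stopped event is not a Started event (e.g. it is Running), this is a bad session, and we want to capture that bad session until we find a new Started event. Example: session-id-2.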
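
The rules above can be read as a small state machine. Below is a minimal Python sketch under that reading; the function name `assign_sessions` and the returned `(session_id, is_bad)` tuples are illustrative assumptions, not part of the original table.

```python
S, R, T = "Started", "Running", "Stopped"

def assign_sessions(events):
    """Return a (session_id, is_bad) pair for every event name, in order.

    Assumed reading of the rules: a good session opens on 'Started' and
    runs through its last consecutive 'Stopped'; any other event that
    follows a stop opens a bad session, which lasts until the next 'Started'.
    """
    result = []
    session_id = 0
    prev = None       # previous event name, None before the first row
    in_bad = False    # True while the current session is a bad one
    for event in events:
        if prev is None or (prev == T and event != T):
            # first row, or first non-stop row after a stop
            session_id += 1
            in_bad = event != S
        elif in_bad and event == S:
            # a fresh start ends the bad session
            session_id += 1
            in_bad = False
        result.append((session_id, in_bad))
        prev = event
    return result

# The first 20 sample rows of the question, as event names only.
events = [S, S, R, S, R, R, R, T, T, R, R, R, S, S, R, S, R, R, R, T]
session_ids = [sid for sid, _ in assign_sessions(events)]
print(session_ids)
```

On these 20 rows the sketch assigns session 1 to the first nine events, session 2 (bad) to the three Running events after the double stop, and session 3 to the rest.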

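I am trying to create flags by using LEAD and LAG to look at the previous and next records and adding markers such as first_start, final_end and so on, but it is getting unwieldy.

Updated SQL query to create the sessions: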

SELECT * , SUM(FLAG) OVER ( PARTITION BY Vehicle_ID ORDER BY Event_Time ROWS UNBOUNDED PRECEDING) AS SESSION_ID
FROM (
    SELECT * ,
    Case when Prev_Start_time < Prev_Stop_time and Event != 'Started' Then 1 else 0 end as bad_data ,
    Case when 
        ( Event = 'Started' and Prev_Start_time < Prev_Stop_time ) OR 
        --( Event = 'Stopped' and Prev_Event = 'Stopped' ) OR 
        ( Event = 'Running' and Prev_Start_time < Prev_Stop_time) OR 
        ( Prev_Event IS NULL) 
    THEN 1 END AS FLAG
    --Case when ( Event = 'Stopped' and Next_Event = 'Stopped' ) OR ( Event != 'Started' and Prev_Start_time < Prev_Stop_time) OR ( Prev_Event IS NULL) THEN 1 END AS FLAG
    FROM (
    WITH
        input(Vehicle_ID,Event_time,Event,Event_Code) AS (
                  SELECT 1,TIME '09:01:00','Started',1
        UNION ALL SELECT 1,TIME '09:02:00','Started',1
        UNION ALL SELECT 1,TIME '09:03:00','Running',3
        UNION ALL SELECT 1,TIME '09:04:00','Started',1
        UNION ALL SELECT 1,TIME '09:05:00','Running',3
        UNION ALL SELECT 1,TIME '09:06:00','Running',3
        UNION ALL SELECT 1,TIME '09:07:00','Running',3
        UNION ALL SELECT 1,TIME '09:08:00','Stopped',2
        UNION ALL SELECT 1,TIME '09:09:00','Stopped',2
        UNION ALL SELECT 1,TIME '09:10:00','Running',3
        UNION ALL SELECT 1,TIME '09:11:00','Running',3
        UNION ALL SELECT 1,TIME '09:12:00','Running',3
        UNION ALL SELECT 1,TIME '09:13:00','Started',1
        UNION ALL SELECT 1,TIME '09:14:00','Started',1
        UNION ALL SELECT 1,TIME '09:15:00','Running',3
        UNION ALL SELECT 1,TIME '09:16:00','Started',1
        UNION ALL SELECT 1,TIME '09:17:00','Running',3
        UNION ALL SELECT 1,TIME '09:18:00','Running',3
        UNION ALL SELECT 1,TIME '09:19:00','Running',3
        UNION ALL SELECT 1,TIME '09:20:00','Stopped',2
        UNION ALL SELECT 1,TIME '09:21:00','Started',1
        UNION ALL SELECT 1,TIME '09:22:00','Started',1
        UNION ALL SELECT 1,TIME '09:23:00','Running',3
        UNION ALL SELECT 1,TIME '09:24:00','Started',1
        UNION ALL SELECT 1,TIME '09:25:00','Running',3
        UNION ALL SELECT 1,TIME '09:26:00','Running',3
        UNION ALL SELECT 1,TIME '09:27:00','Running',3
        UNION ALL SELECT 1,TIME '09:28:00','Stopped',2
        )
    SELECT *, 
    Max( Case Event when 'Started' then Event_time end ) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time Rows between unbounded preceding and 1 preceding ) AS Prev_Start_time,
    Max( Case Event when 'Stopped' then Event_time end ) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time Rows between unbounded preceding and 1 preceding ) AS Prev_Stop_time,
    LAG(Event) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time ) AS Prev_Event,
    LEAD(Event) OVER (PARTITION BY Vehicle_ID ORDER BY Event_time ) AS Next_Event
    FROM input ) AS T1
) AS T2 

Result of the updated query:

Vehicle_ID  Event_time  Event    Event_Code  Prev_Start_time  Prev_Stop_time  Prev_Event  Next_Event  bad_data  FLAG  SESSION_ID
1           9:01:00     Started  1           NULL             NULL            NULL        Started     0         1     1
1           9:02:00     Started  1           9:01:00          NULL            Started     Running     0         NULL  1
1           9:03:00     Running  3           9:02:00          NULL            Started     Started     0         NULL  1
1           9:04:00     Started  1           9:02:00          NULL            Running     Running     0         NULL  1
1           9:05:00     Running  3           9:04:00          NULL            Started     Running     0         NULL  1
1           9:06:00     Running  3           9:04:00          NULL            Running     Running     0         NULL  1
1           9:07:00     Running  3           9:04:00          NULL            Running     Stopped     0         NULL  1
1           9:08:00     Stopped  2           9:04:00          NULL            Running     Stopped     0         NULL  1
1           9:09:00     Stopped  2           9:04:00          9:08:00         Stopped     Running     1         NULL  1
1           9:10:00     Running  3           9:04:00          9:09:00         Stopped     Running     1         1     2
1           9:11:00     Running  3           9:04:00          9:09:00         Running     Running     1         1     3
1           9:12:00     Running  3           9:04:00          9:09:00         Running     Started     1         1     4
1           9:13:00     Started  1           9:04:00          9:09:00         Running     Started     0         1     5
1           9:14:00     Started  1           9:13:00          9:09:00         Started     Running     0         NULL  5
1           9:15:00     Running  3           9:14:00          9:09:00         Started     Started     0         NULL  5
1           9:16:00     Started  1           9:14:00          9:09:00         Running     Running     0         NULL  5
1           9:17:00     Running  3           9:16:00          9:09:00         Started     Running     0         NULL  5
1           9:18:00     Running  3           9:16:00          9:09:00         Running     Running     0         NULL  5
1           9:19:00     Running  3           9:16:00          9:09:00         Running     Stopped     0         NULL  5
1           9:20:00     Stopped  2           9:16:00          9:09:00         Running     Started     0         NULL  5
1           9:21:00     Started  1           9:16:00          9:20:00         Stopped     Started     0         1     6
1           9:22:00     Started  1           9:21:00          9:20:00         Started     Running     0         NULL  6
1           9:23:00     Running  3           9:22:00          9:20:00         Started     Started     0         NULL  6
1           9:24:00     Started  1           9:22:00          9:20:00         Running     Running     0         NULL  6
1           9:25:00     Running  3           9:24:00          9:20:00         Started     Running     0         NULL  6
1           9:26:00     Running  3           9:24:00          9:20:00         Running     Running     0         NULL  6
1           9:27:00     Running  3           9:24:00          9:20:00         Running     Stopped     0         NULL  6
1           9:28:00     Stopped  2           9:24:00          9:20:00         Running     NULL        0         NULL  6

2 Answers:

Answer 0 (score: 3)

In Vertica, I would use the MATCH() clause. It also leaves out the unwanted rows: the 'Running' ones that make no sense. Try this:

    WITH
    -- your input as you gave it 
    input(tm,Vehicle_ID,Col1,Event,Event_Code,Session_ID) AS (
              SELECT TIME '09:01:00',1,'A','Started',1,1
    UNION ALL SELECT TIME '09:02:00',1,'B','Started',1,1
    UNION ALL SELECT TIME '09:03:00',1,'C','Running',3,1
    UNION ALL SELECT TIME '09:04:00',1,'A','Started',1,1
    UNION ALL SELECT TIME '09:05:00',1,'B','Running',3,1
    UNION ALL SELECT TIME '09:06:00',1,'C','Running',3,1
    UNION ALL SELECT TIME '09:07:00',1,'A','Running',3,1
    UNION ALL SELECT TIME '09:08:00',1,'A','Stopped',2,1
    UNION ALL SELECT TIME '09:09:00',1,'B','Stopped',2,1
    UNION ALL SELECT TIME '09:10:00',1,'C','Running',3,2
    UNION ALL SELECT TIME '09:11:00',1,'A','Running',3,2
    UNION ALL SELECT TIME '09:12:00',1,'B','Running',3,2
    UNION ALL SELECT TIME '09:13:00',1,'A','Started',1,3
    UNION ALL SELECT TIME '09:14:00',1,'B','Started',1,3
    UNION ALL SELECT TIME '09:15:00',1,'C','Running',3,3
    UNION ALL SELECT TIME '09:16:00',1,'A','Started',1,3
    UNION ALL SELECT TIME '09:17:00',1,'B','Running',3,3
    UNION ALL SELECT TIME '09:18:00',1,'C','Running',3,3
    UNION ALL SELECT TIME '09:19:00',1,'A','Running',3,3
    UNION ALL SELECT TIME '09:20:00',1,'A','Stopped',2,3
    )
    -- here is where the real select starts ..
    SELECT
      pattern_id()
    , match_id()
    , event_name()
    , *
    FROM input
    MATCH(
      PARTITION BY vehicle_id
      ORDER BY tm
      DEFINE
        started_event AS (event='Started')
      , running_event AS (event='Running')
      , stopped_event AS (event='Stopped')
      PATTERN p AS (started_event+ (running_event|started_event)* stopped_event+)
    )
    ;


    pattern_id|match_id|event_name   |tm      |Vehicle_ID|Col1|Event  |Event_Code|Session_ID
             1|       1|started_event|09:01:00|         1|A   |Started|         1|         1
             1|       2|started_event|09:02:00|         1|B   |Started|         1|         1
             1|       3|running_event|09:03:00|         1|C   |Running|         3|         1
             1|       4|started_event|09:04:00|         1|A   |Started|         1|         1
             1|       5|running_event|09:05:00|         1|B   |Running|         3|         1
             1|       6|running_event|09:06:00|         1|C   |Running|         3|         1
             1|       7|running_event|09:07:00|         1|A   |Running|         3|         1
             1|       8|stopped_event|09:08:00|         1|A   |Stopped|         2|         1
             1|       9|stopped_event|09:09:00|         1|B   |Stopped|         2|         1
             2|       1|started_event|09:13:00|         1|A   |Started|         1|         3
             2|       2|started_event|09:14:00|         1|B   |Started|         1|         3
             2|       3|running_event|09:15:00|         1|C   |Running|         3|         3
             2|       4|started_event|09:16:00|         1|A   |Started|         1|         3
             2|       5|running_event|09:17:00|         1|B   |Running|         3|         3
             2|       6|running_event|09:18:00|         1|C   |Running|         3|         3
             2|       7|running_event|09:19:00|         1|A   |Running|         3|         3
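
For readers without Vertica at hand, the PATTERN above behaves roughly like a regular expression over the ordered event sequence. The following Python sketch emulates that idea with `re`; the one-letter encoding (`S`, `R`, `T`) and the function name `match_sessions` are assumptions made for illustration, and the sketch does not claim to reproduce every detail of Vertica's matching semantics.

```python
import re

# Hypothetical one-letter encoding of the three event names.
CODES = {"Started": "S", "Running": "R", "Stopped": "T"}

def match_sessions(events):
    """Return (pattern_id, match_id, row_index) for every row inside a match.

    Mirrors started_event+ (running_event|started_event)* stopped_event+
    as the regular expression S+[RS]*T+ over the encoded sequence.
    """
    encoded = "".join(CODES[e] for e in events)
    rows = []
    for pattern_id, m in enumerate(re.finditer(r"S+[RS]*T+", encoded), start=1):
        for match_id, idx in enumerate(range(m.start(), m.end()), start=1):
            rows.append((pattern_id, match_id, idx))
    return rows

names = {"S": "Started", "R": "Running", "T": "Stopped"}
events = [names[c] for c in "SSRSRRRTTRRRSSRSRRRT"]  # the 20 input rows
matched = match_sessions(events)
print(matched)
```

In this emulation the three Running rows that follow the double stop (row indexes 9 to 11) belong to no match, and the second match covers eight rows, including the closing Stopped event.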

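Answer 1 (score: 1)

The following seems to match your description. I assume there is a column (named whatever here) that uniquely orders your data (probably a timestamp).

In Teradata this results in two STAT steps: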

SELECT dt.*
  ,Sum(flag) -- cumulative sum (or COUNT(flag)) to create the session id
   Over (PARTITION BY Vehicle_ID
         ORDER BY whatever
         ROWS Unbounded Preceding) AS session_id_
FROM
(
   SELECT mytable.* 

     -- previous start
     ,Max(CASE event_code WHEN 1 THEN whatever END) 
      Over (PARTITION BY Vehicle_ID
            ORDER BY whatever 
            ROWS BETWEEN Unbounded Preceding AND 1 Preceding) AS prev_start

     -- previous stop
     ,Max(CASE event_code WHEN 2 THEN whatever END) 
      Over (PARTITION BY Vehicle_ID
            ORDER BY whatever
            ROWS BETWEEN Unbounded Preceding AND 1 Preceding) AS prev_stop

     -- previous event
     ,Lag(event_code) 
      Over (PARTITION BY Vehicle_ID
            ORDER BY whatever) AS lag_event

     -- no new session started after previous stop and current event is not start = bad data
     ,CASE WHEN prev_start < prev_stop AND event_code <> 1 THEN 1 ELSE 0 END AS bad_data

                                                               -- new session starts at
     ,CASE WHEN   (event_code <> 2 AND (lag_event = 2) )       -- first row after a stop (ignore consecutive stops)
               OR (event_code =  1 AND prev_start < prev_stop) -- first row after bad data
               OR lag_event IS NULL                            -- first row
           THEN 1 
      END AS flag

   FROM mytable
) AS dt

For Oracle / SQL Server you need to add another nesting level to be able to use the aliases inside the CASE expressions.
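
A sketch of what that extra nesting level could look like: the window functions are computed in the innermost query, the CASE flags one level up, and the running sum outermost. It is run here against SQLite through Python's `sqlite3` module only to keep the example self-contained. The table name `events` and the column `ts` (standing in for `whatever`) are assumptions, and window functions need SQLite 3.25 or later.

```python
import sqlite3

# Event codes of the first 20 sample rows (1 = Started, 2 = Stopped, 3 = Running).
codes = [1, 1, 3, 1, 3, 3, 3, 2, 2, 3, 3, 3, 1, 1, 3, 1, 3, 3, 3, 2]
rows = [(1, f"09:{minute:02d}:00", code) for minute, code in enumerate(codes, start=1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (vehicle_id INTEGER, ts TEXT, event_code INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

QUERY = """
SELECT ts, bad_data,
       SUM(flag) OVER (PARTITION BY vehicle_id ORDER BY ts
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session_id
FROM (
    -- middle level: the aliases from the inner level are now plain columns
    SELECT w.*,
           CASE WHEN prev_start < prev_stop AND event_code <> 1
                THEN 1 ELSE 0 END AS bad_data,
           CASE WHEN (event_code <> 2 AND lag_event = 2)
                  OR (event_code = 1 AND prev_start < prev_stop)
                  OR lag_event IS NULL
                THEN 1 END AS flag
    FROM (
        -- inner level: previous start, previous stop, previous event
        SELECT e.*,
               MAX(CASE event_code WHEN 1 THEN ts END)
                   OVER (PARTITION BY vehicle_id ORDER BY ts
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_start,
               MAX(CASE event_code WHEN 2 THEN ts END)
                   OVER (PARTITION BY vehicle_id ORDER BY ts
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_stop,
               LAG(event_code)
                   OVER (PARTITION BY vehicle_id ORDER BY ts) AS lag_event
        FROM events AS e
    ) AS w
) AS f
ORDER BY ts
"""

result = conn.execute(QUERY).fetchall()
session_ids = [session_id for _, _, session_id in result]
bad_flags = [bad_data for _, bad_data, _ in result]
print(session_ids)
```

The CASE logic is unchanged from the query above; only the placement of the expressions differs, so each alias is defined one level below the place where it is used.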