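Efficient query to get step durations from an event log table into an accumulating snapshot fact in SQL

Date: 2019-06-25 20:45:51

Tags: mysql sql-server data-warehouse

This example was built in SQL Server 2016, but the approach should also work in MySQL 8.x.

I have event log data stored in the table fact_user_event_activity, with the following sample data: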

event_date_key  user_key    step_key    session_id  event_timestamp
20140411        123         1           1000        2014-04-11 08:00:00.000
20140411        123         2           1000        2014-04-11 08:10:00.000
20140411        123         3           1000        2014-04-11 08:20:00.000
20140411        123         4           1000        2014-04-11 08:30:00.000
20140411        125         1           1001        2014-04-11 09:10:00.000
20140411        123         5           1000        2014-04-11 08:31:00.000
20140411        125         2           1001        2014-04-11 09:30:00.000
20140411        125         3           1001        2014-04-11 09:50:00.000  <-- 
20140411        125         3           1001        2014-04-11 09:51:00.000  <--
20140411        125         4           1001        2014-04-11 09:52:00.000

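Assumptions

  • All records for a given user_key arrive ordered by date. However, the records are not ordered by user_key. For example, see user_key 125 at 2014-04-11 09:10:00.000
  • The steps are predictable. The process always consists of 5 steps, and the last step represents the exit
  • A step in the same session can be logged more than once, with different timestamps (see the rows marked <-- above)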
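
The third assumption can be checked directly against the data. A minimal sketch, assuming the fact_user_event_activity table created in the setup section further down, that lists every step logged more than once within a session (the times_logged and first_logged_at aliases are only illustrative):

```sql
-- Lists (user, session, step) combinations that were logged more than once.
SELECT user_key,
       session_id,
       step_key,
       COUNT(*)             AS times_logged,
       MIN(event_timestamp) AS first_logged_at
FROM fact_user_event_activity
GROUP BY user_key, session_id, step_key
HAVING COUNT(*) > 1;
```

With the setup data this should return step 3 for sessions 1001 and 1005, which are the repeats any duration query has to collapse to a single timestamp.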

Expected result

What is the most efficient way to query for the following?

user_key  session_id  step_1_duration_mins  step_2_duration_mins  step_3_duration_mins  step_4_duration_mins
123       1000        10                    10                    10                    1
125       1001        20                    20                    2                     NULL

This will be used as the ETL query for an accumulating snapshot.

Setup

DROP TABLE IF EXISTS  [fact_user_event_activity]
;
CREATE TABLE [fact_user_event_activity] (
  [event_date_key] INT DEFAULT NULL,
  [user_key] BIGINT NOT NULL,
  [step_key] BIGINT NOT NULL,
  [session_id] BIGINT NOT NULL,
  [event_timestamp] datetime NOT NULL
)
;
INSERT INTO [fact_user_event_activity]
VALUES (20140411, 123, 1, 1000, N'2014-04-11 08:00:00'),
(20140411, 123, 2, 1000, N'2014-04-11 08:10:00'),
(20140411, 123, 3, 1000, N'2014-04-11 08:20:00'),
(20140411, 123, 4, 1000, N'2014-04-11 08:30:00'),
(20140411, 125, 1, 1001, N'2014-04-11 09:10:00'),
(20140411, 123, 5, 1000, N'2014-04-11 08:31:00'),
(20140411, 125, 2, 1001, N'2014-04-11 09:30:00'),
(20140411, 125, 3, 1001, N'2014-04-11 09:50:00'),
(20140411, 125, 3, 1001, N'2014-04-11 09:51:00'),
(20140411, 125, 4, 1001, N'2014-04-11 09:52:00'),
(20140411, 129, 1, 1005, N'2014-04-11 09:08:00'),
(20140411, 129, 2, 1005, N'2014-04-11 09:10:00'),
(20140411, 129, 3, 1005, N'2014-04-11 09:12:00'),
(20140411, 129, 3, 1005, N'2014-04-11 09:13:00'),
(20140411, 129, 4, 1005, N'2014-04-11 09:14:00'),
(20140411, 129, 5, 1005, N'2014-04-11 09:18:00')
;

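My attempt

To keep the code easy to follow, I did it in two steps:

  1. Get the duration of each step measured from the start of the session
  2. Compute the difference between the duration_from_start values of consecutive steps

This returns the result I expect, but I am fairly sure I am overcomplicating things. It will run against ~500M records, so I would like to know whether there is a better way or whether I am missing something.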

-- Step 1
-- to improve performance, use temp table instead of CTE
-- Use TIMESTAMPDIFF in MySQL instead of DATEDIFF
WITH durations_from_start_tmp AS
    (
    SELECT session_id, user_key, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp) first_login,
    DENSE_RANK() OVER(PARTITION BY user_key, step_key, fuea.session_id ORDER BY fuea.event_timestamp) AS rnk,
    CASE WHEN step_key = 2 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_1_duration_from_start,
    CASE WHEN step_key = 3 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_2_duration_from_start,
    CASE WHEN step_key = 4 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_3_duration_from_start,
    CASE WHEN step_key = 5 THEN DATEDIFF(MINUTE, FIRST_VALUE(fuea.event_timestamp) OVER(PARTITION BY user_key, fuea.session_id ORDER BY fuea.event_timestamp), fuea.event_timestamp) END AS step_4_duration_from_start
    FROM [fact_user_event_activity] fuea
    --WHERE event_timestamp > watermark --for incremental load
    )

-- Step 2
SELECT user_key, session_id, SUM(step_1_duration_from_start) AS step_1_duration_mins,
 SUM(step_2_duration_from_start) - SUM(step_1_duration_from_start) AS step_2_duration_mins ,
 SUM(step_3_duration_from_start) - SUM(step_2_duration_from_start) AS step_3_duration_mins ,
 SUM(step_4_duration_from_start) - SUM(step_3_duration_from_start) AS step_4_duration_mins
 FROM durations_from_start_tmp
 -- deals with repeated steps
 WHERE rnk = 1
 GROUP BY  user_key, session_id
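
On the DATEDIFF / TIMESTAMPDIFF swap mentioned in the comments above: the two functions take the same argument order here, but they do not round the same way. SQL Server's DATEDIFF counts the minute boundaries crossed, while MySQL's TIMESTAMPDIFF counts complete minutes elapsed. The sample timestamps all fall on whole minutes, so both produce the same numbers for this data. The literal values below are made up purely to show the difference, and each statement runs on its own engine:

```sql
-- SQL Server: counts minute boundaries crossed,
-- so two timestamps one second apart can still be "1 minute" apart.
SELECT DATEDIFF(MINUTE, '2014-04-11 08:00:59', '2014-04-11 08:01:00') AS diff_mins;      -- 1

-- MySQL 8: counts complete minutes elapsed between the two timestamps.
SELECT TIMESTAMPDIFF(MINUTE, '2014-04-11 08:00:59', '2014-04-11 08:01:00') AS diff_mins; -- 0
```

If real event timestamps carry seconds, the two engines may therefore report durations that differ by one minute for the same rows.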

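Reference

This is probably not relevant to answering the question; it is only here in case you are not familiar with the data modeling concept.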

Accumulating Snapshots Definition

1 answer:

Answer 0 (score: 1)

One approach you might take is to add an index (assuming you are able to add one), something like:

CREATE INDEX [SomeIndexName] ON [fact_user_event_activity] (user_key, session_id, step_key, event_timestamp);

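(Or, if you are worried about the index size at 500M rows, make step_key and event_timestamp included columns instead of key columns.)

Then skip the window functions and use a query like this: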
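
A sketch of that included-columns variant for SQL Server, assuming the same table. The index name is hypothetical, and whether it actually ends up smaller or faster on 500M rows would need to be measured:

```sql
-- Key columns cover the GROUP BY; step_key and event_timestamp are stored
-- only at the leaf level, so the index still covers the whole query.
CREATE INDEX [IX_fuea_user_session_incl]
    ON [fact_user_event_activity] (user_key, session_id)
    INCLUDE (step_key, event_timestamp);
```

INCLUDE is SQL Server syntax; MySQL has no equivalent clause, so there the four-column key is the closer match.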

SELECT user_key,
       session_id,
       step_1_duration = DATEDIFF(MINUTE, step_1_timestamp, step_2_timestamp),
       step_2_duration = DATEDIFF(MINUTE, step_2_timestamp, step_3_timestamp),
       step_3_duration = DATEDIFF(MINUTE, step_3_timestamp, step_4_timestamp),
       step_4_duration = DATEDIFF(MINUTE, step_4_timestamp, step_5_timestamp)
FROM 
(
    SELECT user_key, session_id,
           step_1_timestamp = MIN(CASE WHEN step_key = 1 THEN event_timestamp END),
           step_2_timestamp = MIN(CASE WHEN step_key = 2 THEN event_timestamp END),
           step_3_timestamp = MIN(CASE WHEN step_key = 3 THEN event_timestamp END),
           step_4_timestamp = MIN(CASE WHEN step_key = 4 THEN event_timestamp END),
           step_5_timestamp = MIN(CASE WHEN step_key = 5 THEN event_timestamp END)
    FROM fact_user_event_activity
    GROUP BY user_key, session_id
) AS T;

(In theory, this should need only an index scan, without any sorts.)
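
Since the question also targets MySQL 8.x, here is a hedged sketch of the same conditional-aggregation query in MySQL syntax. The `alias = expression` form is specific to SQL Server, so it is written with AS, and DATEDIFF(MINUTE, ...) becomes TIMESTAMPDIFF(MINUTE, ...). The _mins suffixes and the ORDER BY are additions, made only to match the column names in the question's expected result and to keep the output stable:

```sql
SELECT user_key,
       session_id,
       TIMESTAMPDIFF(MINUTE, step_1_timestamp, step_2_timestamp) AS step_1_duration_mins,
       TIMESTAMPDIFF(MINUTE, step_2_timestamp, step_3_timestamp) AS step_2_duration_mins,
       TIMESTAMPDIFF(MINUTE, step_3_timestamp, step_4_timestamp) AS step_3_duration_mins,
       TIMESTAMPDIFF(MINUTE, step_4_timestamp, step_5_timestamp) AS step_4_duration_mins
FROM
(
    -- One row per session; MIN() keeps the first occurrence of a repeated step.
    SELECT user_key,
           session_id,
           MIN(CASE WHEN step_key = 1 THEN event_timestamp END) AS step_1_timestamp,
           MIN(CASE WHEN step_key = 2 THEN event_timestamp END) AS step_2_timestamp,
           MIN(CASE WHEN step_key = 3 THEN event_timestamp END) AS step_3_timestamp,
           MIN(CASE WHEN step_key = 4 THEN event_timestamp END) AS step_4_timestamp,
           MIN(CASE WHEN step_key = 5 THEN event_timestamp END) AS step_5_timestamp
    FROM fact_user_event_activity
    GROUP BY user_key, session_id
) AS t
ORDER BY user_key, session_id;
```

A missing step leaves its timestamp NULL, so any duration that depends on it comes out NULL as well, which is how session 1001 gets a NULL step_4_duration_mins in the expected result.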