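Snowflake analytical query design

Time: 2021-04-17 13:38:14

Tags: sql snowflake-cloud-data-platform analytical

I have a tricky query design requirement. I have tried different types and combinations of analytic functions to get my result from the dataset below. My other plan is to write a stored procedure, but I wanted to reach out to the experts before changing direction.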

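Input dataset:

(image of the input dataset; not preserved)

Required output dataset with a group column: whenever the session id changes, and the same session id later comes back, that returning run must get a different group. I tried LEAD/LAG combinations but could not get the required output below; one case or another keeps breaking.

(image of the required output dataset; not preserved)

Thanks!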

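3 answers:

Answer 0 (score: 1)

SQL is expressive enough to provide declarative solutions for complex requirements.

Snowflake recently implemented the SQL:2016 standard clause MATCH_RECOGNIZE, which is designed to solve this kind of case in a very direct way. The Snowflake documentation describes it as follows: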

Identifying Sequences of Rows That Match a Pattern

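In some cases, you might need to identify sequences of table rows that match a pattern. For example, you might need to:

  • Determine which users followed a specific sequence of pages and actions on your website before opening a support ticket or making a purchase.

  • Find the stocks with prices that followed a V-shaped or W-shaped recovery over a period of time.

  • Look for patterns in sensor data that might indicate an upcoming system failure.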

Data preparation:

CREATE OR REPLACE TABLE t
AS
SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:30:00'::DATE AS Trans_dt, 1 AS VERSION_ID
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:35:00'::DATE AS Trans_dt, 2
UNION ALL SELECT 102 SESS_ID, 1 POL_ID, '2021-04-17 09:37:00'::DATE AS Trans_dt, 3
UNION ALL SELECT 102 SESS_ID, 1 POL_ID, '2021-04-17 09:38:00'::DATE AS Trans_dt, 4
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:39:00'::DATE AS Trans_dt, 5
UNION ALL SELECT 101 SESS_ID, 1 POL_ID, '2021-04-17 09:40:00'::DATE AS Trans_dt, 6;

Query:

SELECT *
FROM t
MATCH_RECOGNIZE (
    PARTITION BY POL_ID
    ORDER BY VERSION_ID
    MEASURES MATCH_NUMBER() AS group_id
            --,CLASSIFIER() as cks
    ALL ROWS PER MATCH
    PATTERN (a+b*)
    DEFINE a as sess_id = FIRST_VALUE(sess_id)
          ,b AS sess_id != FIRST_VALUE(sess_id)
) mr
ORDER BY POL_ID, VERSION_ID;

db<>fiddle demo(Oracle)

Output:

SESS_ID  POL_ID  TRANS_DT    VERSION_ID  GROUP_ID
101      1       2021-04-17  1           1
101      1       2021-04-17  2           1
102      1       2021-04-17  3           1
102      1       2021-04-17  4           1
101      1       2021-04-17  5           2
101      1       2021-04-17  6           2

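How it works:

  1. Define a pattern: (a+b*) is a Perl-style regular expression meaning a (one or more) followed by b (zero or more).
  2. Define the pattern components: a (sess_id is the same as the first element of the group) and b (sess_id differs from the first element of the group).
  3. Define the measure MATCH_NUMBER(): "returns the sequential number of the match".
  4. Do this for each POL_ID, using VERSION_ID as the ordering column.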
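
To see how steps 1 to 3 play out row by row, the CLASSIFIER() measure that is commented out in the query can be switched on. The following is a minimal sketch, not a definitive implementation: the inline `data` CTE is a hypothetical stand-in for table t so the statement runs on its own, and `matched_as` is an illustrative alias.

```sql
-- Hypothetical inline copy of the sample rows, so the sketch is self-contained.
WITH data AS (
    SELECT * FROM VALUES
      (101, 1, '2021-04-17 09:30:00', 1),
      (101, 1, '2021-04-17 09:35:00', 2),
      (102, 1, '2021-04-17 09:37:00', 3),
      (102, 1, '2021-04-17 09:38:00', 4),
      (101, 1, '2021-04-17 09:39:00', 5),
      (101, 1, '2021-04-17 09:40:00', 6)
    v(sess_id, pol_id, trans_dt, version_id)
)
SELECT sess_id, pol_id, version_id, group_id, matched_as
FROM data
MATCH_RECOGNIZE (
    PARTITION BY pol_id
    ORDER BY version_id
    MEASURES MATCH_NUMBER() AS group_id    -- sequence number of the match
            ,CLASSIFIER()   AS matched_as  -- pattern variable the row matched
    ALL ROWS PER MATCH
    PATTERN (a+ b*)
    DEFINE a AS sess_id =  FIRST_VALUE(sess_id)  -- same session as the match start
          ,b AS sess_id != FIRST_VALUE(sess_id)  -- any other session
) mr
ORDER BY pol_id, version_id;
```

Each row should be labelled with the variable it matched, which makes it easier to check why a match ends when the starting session id reappears.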

Answer 1 (score: 1)

In what follows it is not obvious how you want group_id to relate to pol_id, so I have ignored it.

So, using a CTE just to fake the data:

WITH data AS (
    SELECT * FROM VALUES
      (101, 1, '2021-04-17 09:30:00', 1),
      (101, 1, '2021-04-17 09:35:00', 2),
      (102, 1, '2021-04-17 09:37:00', 3),
      (102, 1, '2021-04-17 09:38:00', 4),
      (101, 1, '2021-04-17 09:39:00', 5),
      (101, 1, '2021-04-17 09:40:00', 6)
    v(sess_id, pol_id, trans_dt, version_id)
)

I then want to write these operations:

SELECT *
    ,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
    ,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
    ,r1- r2 as r3
    ,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
    ,IFF(lag_r3 != r3, 1, 0) as sess_edge
    ,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM data

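So r1 and r2 are there to find where a sess_id has a gap along trans_dt: r3 = r1 - r2 stays constant while a session's rows are consecutive and changes once other sessions have come in between. Comparing r3 with lag_r3 (the previous r3 for the same sess_id, in trans_dt order) detects those gaps. These are the edges you want to count, hence the SUM; it is zero-based, hence the +1 to get the values you want.

The query above is not valid in Snowflake as written, so it has to be layered into subqueries to work: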

SELECT
    *
    ,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (  
    SELECT
        *
        ,LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) as lag_r3
        ,IFF(lag_r3 != r3, 1, 0) as sess_edge
    FROM (
        SELECT *
            ,ROW_NUMBER() OVER (ORDER BY trans_dt) AS r1
            ,ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) AS r2
            ,r1- r2 as r3
        FROM data
    )
)
ORDER BY trans_dt;

Which gives:

SESS_ID POL_ID  TRANS_DT            VERSION_ID  R1  R2  R3  LAG_R3  SESS_EDGE   GROUP_ID
101     1       2021-04-17 09:30:00 1           1   1   0   null    0           1
101     1       2021-04-17 09:35:00 2           2   2   0   0       0           1
102     1       2021-04-17 09:37:00 3           3   1   2   null    0           1
102     1       2021-04-17 09:38:00 4           4   2   2   2       0           1
101     1       2021-04-17 09:39:00 5           5   3   2   0       1           2
101     1       2021-04-17 09:40:00 6           6   4   2   2       0           2   

So you can see how it works. It can then be compressed to:

SELECT
    sess_id
    ,pol_id
    ,trans_dt
    ,version_id
    ,SUM(sess_edge) OVER (ORDER BY trans_dt)+1 as GROUP_ID
FROM (  
    SELECT
        *
        ,IFF(LAG(r3) OVER (PARTITION BY sess_id ORDER BY trans_dt ) != r3, 1, 0) as sess_edge
    FROM (
        SELECT *
            ,ROW_NUMBER() OVER (ORDER BY trans_dt)- ROW_NUMBER() OVER (PARTITION BY sess_id ORDER BY trans_dt) as r3
        FROM data
    )
)
ORDER BY trans_dt;

This is much more complex than Gordon's answer, which, rewritten into the same form as mine, is:

select *
    ,sum(edge) over ( partition by pol_id, sess_id order by trans_dt ) as grouping
from (
    select *
        ,lag(sess_id) over (partition by pol_id order by trans_dt) as prev_session_id
        ,iff(prev_session_id = sess_id, 0, 1) AS edge
    from data
)
ORDER BY 2,3;

That is rather clever, because it is SUMing per sess_id.

But if you add extra data:

WITH data AS (
    SELECT * FROM VALUES
      (101, 1, '2021-04-17 09:30:00', 1),
      (101, 1, '2021-04-17 09:35:00', 2),
      (102, 1, '2021-04-17 09:37:00', 3),
      (102, 1, '2021-04-17 09:38:00', 4),
      (101, 1, '2021-04-17 09:39:00', 5),
      (101, 1, '2021-04-17 09:40:00', 6),
      (102, 1, '2021-04-17 09:41:00', 7),
      (102, 1, '2021-04-17 09:42:00', 8),
      (103, 1, '2021-04-17 09:43:00', 9),
      (103, 1, '2021-04-17 09:44:00', 10)
    v(sess_id, pol_id, trans_dt, VERSION_ID)
)

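Gordon's answer will assign the last two sessions to group 1, whereas mine will assign them 2, as does Lukasz's; which one is right depends on your intended result.

Also, when pol_id changes, do you want group to be a global count, or should the second pol start again at 1?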
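
If the group number should restart for each policy, one option is to add pol_id to every window's PARTITION BY. The following is a minimal sketch under that assumption, based on the compressed query above; the pol_id 2 rows (sessions 201 and 202) are hypothetical data added only for illustration.

```sql
WITH data AS (
    SELECT * FROM VALUES
      (101, 1, '2021-04-17 09:30:00', 1),
      (101, 1, '2021-04-17 09:35:00', 2),
      (102, 1, '2021-04-17 09:37:00', 3),
      (102, 1, '2021-04-17 09:38:00', 4),
      (101, 1, '2021-04-17 09:39:00', 5),
      (101, 1, '2021-04-17 09:40:00', 6),
      -- hypothetical second policy
      (201, 2, '2021-04-17 09:41:00', 1),
      (202, 2, '2021-04-17 09:42:00', 2),
      (201, 2, '2021-04-17 09:43:00', 3)
    v(sess_id, pol_id, trans_dt, version_id)
)
SELECT
    sess_id
    ,pol_id
    ,trans_dt
    ,version_id
    -- running count of edges within each policy, so numbering restarts per pol_id
    ,SUM(sess_edge) OVER (PARTITION BY pol_id ORDER BY trans_dt) + 1 AS group_id
FROM (
    SELECT
        *
        -- an edge is a change in r3 between consecutive rows of the same session
        ,IFF(LAG(r3) OVER (PARTITION BY pol_id, sess_id ORDER BY trans_dt) != r3, 1, 0) AS sess_edge
    FROM (
        SELECT *
            ,ROW_NUMBER() OVER (PARTITION BY pol_id ORDER BY trans_dt)
             - ROW_NUMBER() OVER (PARTITION BY pol_id, sess_id ORDER BY trans_dt) AS r3
        FROM data
    )
)
ORDER BY pol_id, trans_dt;
```

With this data the pol_id 1 rows keep groups 1, 1, 1, 1, 2, 2, and the pol_id 2 rows start again at 1 (groups 1, 1, 2). Dropping pol_id from the PARTITION BY clauses gives the global numbering instead.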

Answer 2 (score: 0)

Basically, you want to use lag() to see when the session id changes. Then you want a cumulative sum, but only within each session id:

select t.*,
       sum(case when prev_session_id = session_id then 0 else 1 end) over (
           partition by pol_id, session_id
           order by trans_dt
          ) as grouping
from (select t.*,
             lag(session_id) over (partition by pol_id order by trans_dt) as prev_session_id
      from t
     ) t;

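This is a tricky variation of the gaps-and-islands problem. The more usual situation would be to enumerate the three pairs of rows as 1, 2 and 3. For that, you would just remove session_id from the partition by of the sum().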
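
The more usual variant described above can be sketched as follows. This is an illustration rather than the answerer's own code: the inline `data` CTE and the `sess_id` column name are assumptions borrowed from the sample data in the other answers, and `grp` is a hypothetical alias.

```sql
WITH data AS (
    SELECT * FROM VALUES
      (101, 1, '2021-04-17 09:30:00', 1),
      (101, 1, '2021-04-17 09:35:00', 2),
      (102, 1, '2021-04-17 09:37:00', 3),
      (102, 1, '2021-04-17 09:38:00', 4),
      (101, 1, '2021-04-17 09:39:00', 5),
      (101, 1, '2021-04-17 09:40:00', 6)
    v(sess_id, pol_id, trans_dt, version_id)
)
SELECT d.*
      -- the sum is partitioned by pol_id only, so every change of session
      -- starts a new island: 1, 1, 2, 2, 3, 3 for the rows above
      ,SUM(CASE WHEN prev_sess_id = sess_id THEN 0 ELSE 1 END) OVER (
           PARTITION BY pol_id
           ORDER BY trans_dt
       ) AS grp
FROM (
    SELECT data.*
          -- session id of the previous row within the same policy (NULL on the first row)
          ,LAG(sess_id) OVER (PARTITION BY pol_id ORDER BY trans_dt) AS prev_sess_id
    FROM data
) d
ORDER BY pol_id, trans_dt;
```

The first row has no previous session, so the comparison is NULL and the CASE falls through to 1, which is what starts the numbering at 1.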
