Cohort analysis using SQL (Snowflake)

Time: 2021-03-23 01:35:47

Tags: sql group-by aggregate-functions snowflake-cloud-data-platform common-table-expression

I am doing a cohort analysis on the table TRANSACTIONS. Below is the table schema,

USER_ID              NUMBER,
PAYMENT_DATE_UTC     DATE,
IS_PAYMENT_ADDED     BOOLEAN

Below is a quick query to see how USER_ID 12345 (an example) falls into different cohorts depending on the date filter provided,

WITH RESULT AS (
SELECT
USER_ID,
TO_DATE(PAYMENT_DATE_UTC) AS PAYMENT_DATE,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY 1,2
HAVING PAYMENT_ADDED_COUNT>=1
ORDER BY 2
)
SELECT
COUNT(DISTINCT r.USER_ID),
SUM(r.PAYMENT_ADDED_COUNT)
FROM RESULT r
WHERE r.USER_ID=12345
AND (r.PAYMENT_DATE>='2021-02-01' AND r.PAYMENT_DATE<'2021-02-15')

The result of this query for the time range (two weeks) is

| 1 | 55 |

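and this USER_ID would be classified into the regular user cohort (users who added payments more than 10 times) for the date filter provided.

If the same query is run with a different time range, say '2021-02-07', the result would be

| 1 | 10 |

For that date filter, this USER_ID would be classified into the occasional user cohort (users who added payments 1 to 10 times).

I have the following query, which puts each USER_ID into one of two cohorts based on the total number of payments added,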

WITH
ALL_USER_COHORT AS 
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
),
OCASSIONAL_USER_COHORT AS 
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
HAVING (PAYMENT_ADDED_COUNT>=1 AND PAYMENT_ADDED_COUNT<=10)
),
REGULAR_USER_COHORT AS 
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
HAVING PAYMENT_ADDED_COUNT>10
)
SELECT
COUNT(DISTINCT ou.USER_ID) AS "OCCASIONAL USERS",
COUNT(DISTINCT ru.USER_ID) AS "REGULAR USERS"
FROM ALL_USER_COHORT au
LEFT JOIN OCASSIONAL_USER_COHORT ou ON au.USER_ID=ou.USER_ID
LEFT JOIN REGULAR_USER_COHORT ru ON au.USER_ID=ru.USER_ID
LEFT JOIN TRANSACTIONS t ON au.USER_ID=t.USER_ID
WHERE au.USER_ID=12345
AND TO_DATE(t.PAYMENT_DATE_UTC)>='2021-02-07'

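Ideally, USER_ID 12345 should be classified as "OCCASIONAL USERS" for the date filter provided, but the query classifies it as "REGULAR USERS".

1 answer:

Answer 0 (score: 1)

For starters, the redundancy in the CTEs can be removed like this: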

WITH all_user_cohort AS (
    SELECT
        USER_ID,
        SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
    FROM transactions
    GROUP BY user_id
), ocassional_user_cohort AS (
    SELECT * FROM all_user_cohort
    WHERE PAYMENT_ADDED_COUNT between 1 AND 10
), regular_user_cohort AS (
    SELECT * FROM all_user_cohort
    WHERE PAYMENT_ADDED_COUNT > 10
)
SELECT
COUNT(DISTINCT ou.user_id) AS "OCCASIONAL USERS",
COUNT(DISTINCT ru.user_id) AS "REGULAR USERS"
FROM all_user_cohort AS au
LEFT JOIN ocassional_user_cohort ou ON au.user_id=ou.user_id
LEFT JOIN regular_user_cohort ru ON au.user_id=ru.user_id
LEFT JOIN transactions t ON au.user_id=t.user_id
WHERE au.user_id=12345
AND TO_DATE(t.payment_date_utc)>='2021-03-01'

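But the reason you are hitting this problem is that the counts are computed over all of a user's transactions. The date filter on the joined transactions table only removes joined rows; it does not change payment_added_count, which was already summed inside the CTE.

What you want is to move the date filter into all_user_cohort. And instead of building separate cohort tables, you can simply sum the rows that meet each condition.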

WITH all_user_cohort AS (
    SELECT
        USER_ID,
        SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
    FROM transactions
    WHERE TO_DATE(payment_date_utc)>='2021-03-01'
    GROUP BY user_id
)   
SELECT
    SUM(IFF(payment_added_count BETWEEN 1 AND 10, 1, 0)) AS "OCCASIONAL USERS",
    SUM(IFF(payment_added_count > 10, 1, 0)) AS "REGULAR USERS"
FROM all_user_cohort AS au
WHERE au.user_id=12345
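
To see why the position of the filter matters, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for Snowflake, purely so it can run anywhere. The rows for user 12345 are made up, CASE replaces Snowflake's IFF, and the cutoff '2021-02-07' is borrowed from the question rather than from the queries above.

```python
import sqlite3

# Assumption: SQLite stands in for Snowflake; dates are ISO strings, so
# plain string comparison orders them correctly.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "user_id INTEGER, payment_date_utc TEXT, is_payment_added INTEGER)"
)

# Made-up data: 45 payments before the cutoff and 10 on or after it.
rows = [(12345, "2021-02-01", 1)] * 45 + [(12345, "2021-02-10", 1)] * 10
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

CUTOFF = "2021-02-07"

# Date filter applied AFTER aggregation: the CTE has already counted every row,
# so the filter only trims the joined rows, not the count.
late_filter = """
WITH all_user_cohort AS (
    SELECT user_id,
           SUM(CASE WHEN is_payment_added = 1 THEN 1 ELSE 0 END) AS payment_added_count
    FROM transactions
    GROUP BY user_id
)
SELECT DISTINCT au.payment_added_count
FROM all_user_cohort AS au
LEFT JOIN transactions AS t ON au.user_id = t.user_id
WHERE au.user_id = 12345
  AND t.payment_date_utc >= ?
"""

# Date filter applied INSIDE the CTE: only rows in the window are counted.
early_filter = """
WITH all_user_cohort AS (
    SELECT user_id,
           SUM(CASE WHEN is_payment_added = 1 THEN 1 ELSE 0 END) AS payment_added_count
    FROM transactions
    WHERE payment_date_utc >= ?
    GROUP BY user_id
)
SELECT payment_added_count
FROM all_user_cohort
WHERE user_id = 12345
"""


def cohort(count):
    """Map a payment count to the cohort labels used in the question."""
    if count > 10:
        return "REGULAR USERS"
    if count >= 1:
        return "OCCASIONAL USERS"
    return "users with zero payments"


late_count = conn.execute(late_filter, (CUTOFF,)).fetchone()[0]
early_count = conn.execute(early_filter, (CUTOFF,)).fetchone()[0]

print(late_count, cohort(late_count))    # all-time count drives the cohort
print(early_count, cohort(early_count))  # windowed count drives the cohort
```

With this sample data the late filter still reports the all-time count, which is the behaviour described in the question, while the early filter counts only the rows inside the window.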

It can also be done a different way, if that fits your needs better or for other reasons.

WITH all_user_cohort AS (
    SELECT
        USER_ID,
        SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
    FROM transactions
    WHERE TO_DATE(payment_date_utc)>='2021-03-01'
    GROUP BY user_id
), classify_users AS (
    SELECT user_id
        ,CASE 
            WHEN payment_added_count between 1 AND 10 THEN 'OCCASIONAL USERS'
            WHEN payment_added_count > 10 THEN 'REGULAR USERS'
            ELSE 'users with zero payments'
        END AS classified
    FROM all_user_cohort
)
SELECT classified
    ,count(*)
FROM classify_users
WHERE user_id=12345
GROUP BY 1
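
As a rough illustration of how the classification behaves across several users, the sketch below runs the same shape of query without the user_id filter. It again uses Python's sqlite3 as a stand-in for Snowflake. The four users and their rows are invented, and the two-sided window (window_start, window_end) is a hypothetical parameterisation modelled on the first query in the question.

```python
import sqlite3

# Assumption: SQLite stands in for Snowflake; CASE is used instead of IFF.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "user_id INTEGER, payment_date_utc TEXT, is_payment_added INTEGER)"
)

# Invented sample data for four users.
rows = (
    [(1, "2021-02-03", 1)] * 3      # 3 payments inside the window
    + [(2, "2021-02-05", 1)] * 12   # 12 payments inside the window
    + [(3, "2021-02-06", 0)] * 2    # rows inside the window, no payment added
    + [(4, "2021-01-15", 1)] * 20   # rows only outside the window
)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

query = """
WITH all_user_cohort AS (
    SELECT user_id,
           SUM(CASE WHEN is_payment_added = 1 THEN 1 ELSE 0 END) AS payment_added_count
    FROM transactions
    WHERE payment_date_utc >= :window_start
      AND payment_date_utc < :window_end
    GROUP BY user_id
), classify_users AS (
    SELECT user_id,
           CASE
               WHEN payment_added_count BETWEEN 1 AND 10 THEN 'OCCASIONAL USERS'
               WHEN payment_added_count > 10 THEN 'REGULAR USERS'
               ELSE 'users with zero payments'
           END AS classified
    FROM all_user_cohort
)
SELECT classified, COUNT(*) AS user_count
FROM classify_users
GROUP BY classified
ORDER BY classified
"""

params = {"window_start": "2021-02-01", "window_end": "2021-02-15"}
cohort_counts = dict(conn.execute(query, params).fetchall())
print(cohort_counts)
```

One thing this makes visible: a user with no rows at all inside the window (user 4 here) never reaches all_user_cohort, so they are not counted in any bucket, including the zero-payments one. If such users need to appear, the user list would have to come from somewhere other than the filtered transactions.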