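I am working with club membership data.
The goal is to calculate, in MySQL, the number of new, cancelled and rejoined members for a period between two dates.
For example, suppose we have the following membership data: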
msid | id | start | cancelled |
---|---|---|---|
1 | 1 | 2020-01-01 09:00:00 | null |
2 | 2 | 2020-01-01 09:00:00 | 2020-12-31 09:00:00 |
3 | 2 | 2021-01-01 09:00:00 | null |
4 | 3 | 2020-01-01 09:00:00 | 2020-06-30 09:00:00 |
5 | 3 | 2020-02-01 09:00:00 | 2020-06-30 09:00:00 |
6 | 3 | 2020-07-01 09:00:00 | null |

where `msid` is the primary key of the table and `id` is the member ID.
For the following selected periods, we should get back the following:

period_start | period_end | new | cancelled | rejoined |
---|---|---|---|---|
2020-01-01 00:00:00 | 2020-01-01 23:59:59 | 3 | 0 | 0 |
2020-01-01 00:00:00 | 2020-06-30 23:59:59 | 3 | 1 | 0 |
2020-01-01 00:00:00 | 2020-07-01 23:59:59 | 3 | 1 | 1 |
2020-01-01 00:00:00 | 2020-12-31 23:59:59 | 3 | 2 | 1 |
2020-01-01 00:00:00 | 2021-01-01 23:59:59 | 3 | 2 | 2 |
2020-07-01 00:00:00 | 2021-01-01 23:59:59 | 0 | 1 | 2 |
2021-01-01 00:00:00 | 2021-01-01 23:59:59 | 0 | 0 | 1 |
A member can hold several current memberships at once, as with ID 3, but should only be counted once when cancelling.
Here is a db<>fiddle with the membership data in table `dt` and the periods in table `periods`.
Answer 0 (score: 2)
As I am still learning myself, I am not sure this is the most minimal solution, but it does work.
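As far as I understand, you count a member as new when they start their first membership (not when they start a second one, otherwise member `3` would count as new twice). This means that selecting the minimum start date per member gives you the respective dates.

    SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start FROM dt

Cancellation dates are pretty straightforward: they are simply the non-`NULL` entries in the respective column.

    SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL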
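As a quick local check of these two building blocks, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a stand-in for MySQL here, and the `GROUP BY` form of the first query is an assumed equivalent of the window-function version, not the form used in this answer.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL table `dt`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt (msid INTEGER PRIMARY KEY, id INTEGER, start TEXT, cancelled TEXT)"
)
conn.executemany(
    "INSERT INTO dt VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2020-01-01 09:00:00", None),
        (2, 2, "2020-01-01 09:00:00", "2020-12-31 09:00:00"),
        (3, 2, "2021-01-01 09:00:00", None),
        (4, 3, "2020-01-01 09:00:00", "2020-06-30 09:00:00"),
        (5, 3, "2020-02-01 09:00:00", "2020-06-30 09:00:00"),
        (6, 3, "2020-07-01 09:00:00", None),
    ],
)

# First membership start per member (GROUP BY variant of the MIN() OVER query).
first_starts = conn.execute(
    "SELECT id, MIN(start) AS start FROM dt GROUP BY id ORDER BY id"
).fetchall()

# Distinct cancellation events; member 3's two memberships that were
# cancelled at the same timestamp collapse into a single row.
cancellations = conn.execute(
    "SELECT DISTINCT id, cancelled FROM dt "
    "WHERE cancelled IS NOT NULL ORDER BY id, cancelled"
).fetchall()

print(first_starts)
print(cancellations)
```

Each of the three members gets exactly one first start date, and member 3 contributes only one cancellation row even though two of their memberships were cancelled.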
Judging by your example, rejoin dates are start dates that come after a previous cancellation, so we can reuse the second query above to get them. I chose `DISTINCT` to make sure each date is listed only once per member (for example, if a member cancelled and rejoined several times).

    WITH
    tab_cancelled AS (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL)
    SELECT DISTINCT dt.id, dt.start AS rejoined
    FROM dt
    INNER JOIN tab_cancelled tc
       ON tc.id = dt.id
      AND dt.start > tc.cancelled

Putting it all together, I first used nested `LEFT JOIN`s to check whether the respective dates fall within each period, and `COUNT(...) OVER (PARTITION BY ...)` to get the counts for each period:
    WITH
    tab_start AS (SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start FROM dt),
    tab_cancelled AS (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
    tab_rejoined AS (SELECT DISTINCT dt.id, dt.start AS rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
    SELECT DISTINCT period_start,
           period_end,
           new,
           cancelled,
           COUNT(dt3.id) OVER (PARTITION BY period_start, period_end) AS rejoined
    FROM (SELECT DISTINCT period_start,
                 period_end,
                 new,
                 COUNT(dt2.id) OVER (PARTITION BY period_start, period_end) AS cancelled
          FROM (SELECT DISTINCT period_start,
                       period_end,
                       COUNT(dt1.id) OVER (PARTITION BY period_start, period_end) AS new
                FROM periods
                LEFT JOIN tab_start AS dt1
                  ON dt1.start BETWEEN period_start AND period_end) u
          LEFT JOIN tab_cancelled AS dt2
            ON dt2.cancelled BETWEEN period_start AND period_end) v
    LEFT JOIN tab_rejoined AS dt3
      ON dt3.rejoined BETWEEN period_start AND period_end

Alternatively, one can use `COUNT(DISTINCT x)` and `GROUP BY` to achieve the same without window functions:

    WITH
    tab_start AS (SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start FROM dt),
    tab_cancelled AS (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
    tab_rejoined AS (SELECT DISTINCT dt.id, dt.start AS rejoined FROM dt INNER JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled)
    SELECT DISTINCT period_start,
           period_end,
           COUNT(DISTINCT dt1.id) AS new,
           COUNT(DISTINCT dt2.id, dt2.cancelled) AS cancelled,
           COUNT(DISTINCT dt3.id, dt3.rejoined) AS rejoined
    FROM periods
    LEFT JOIN tab_start AS dt1
      ON dt1.start BETWEEN period_start AND period_end
    LEFT JOIN tab_cancelled AS dt2
      ON dt2.cancelled BETWEEN period_start AND period_end
    LEFT JOIN tab_rejoined AS dt3
      ON dt3.rejoined BETWEEN period_start AND period_end
    GROUP BY period_start, period_end

Either way, this should give you the desired result:

period_start | period_end | new | cancelled | rejoined |
---|---|---|---|---|
2020-01-01 00:00:00 | 2020-01-01 23:59:59 | 3 | 0 | 0 |
2020-01-01 00:00:00 | 2020-06-30 23:59:59 | 3 | 1 | 0 |
2020-01-01 00:00:00 | 2020-07-01 23:59:59 | 3 | 1 | 1 |
2020-01-01 00:00:00 | 2020-12-31 23:59:59 | 3 | 2 | 1 |
2020-01-01 00:00:00 | 2021-01-01 23:59:59 | 3 | 2 | 2 |
2020-07-01 00:00:00 | 2021-01-01 23:59:59 | 0 | 1 | 2 |
2021-01-01 00:00:00 | 2021-01-01 23:59:59 | 0 | 0 | 1 |

Note that both queries will (by design) count multiple cancellations separately for a single member with several memberships if those cancellations occur on different dates. The same goes if the same member leaves and rejoins several times within one period. If you want neither to happen, you can use `COUNT(DISTINCT id)` within each period, i.e. count distinct ids instead of datetime-id combinations.
For your sample data the result is the same, so you will have to look at your edge cases and decide how you want to count them.
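To make the difference between the two counting rules concrete, here is a small sketch on hypothetical data that is not part of the question: one member whose two memberships are cancelled on different dates within the same period. It runs against an in-memory SQLite database through Python's `sqlite3` module as a stand-in for MySQL, and the period bounds are made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt (msid INTEGER PRIMARY KEY, id INTEGER, start TEXT, cancelled TEXT)"
)
conn.executemany(
    "INSERT INTO dt VALUES (?, ?, ?, ?)",
    [
        # Member 3 cancels two memberships on two different dates.
        (1, 3, "2020-01-01 09:00:00", "2020-05-31 09:00:00"),
        (2, 3, "2020-02-01 09:00:00", "2020-06-30 09:00:00"),
    ],
)

# Hypothetical period that contains both cancellation dates.
period = ("2020-01-01 00:00:00", "2020-12-31 23:59:59")

per_event, per_member = conn.execute(
    """
    SELECT COUNT(*)           AS per_event,   -- one per (id, cancelled) pair
           COUNT(DISTINCT id) AS per_member   -- one per member
    FROM (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL) AS c
    WHERE c.cancelled BETWEEN ? AND ?
    """,
    period,
).fetchone()

print(per_event, per_member)
```

Counting id-datetime combinations reports two cancellations for this member, while counting distinct ids reports one, which is the behaviour the question asks for.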
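As I wrote, maybe this could be condensed or made more performant somehow, but this is what I could come up with.
You can find the corresponding db<>fiddle here.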
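For checking the approach locally rather than in the fiddle, the following is a minimal sketch that replays the sample data and periods from the question in an in-memory SQLite database through Python's `sqlite3` module. It is an approximation of the MySQL setup, not the original fiddle. It uses the `COUNT(DISTINCT id)` variant because SQLite does not accept `COUNT(DISTINCT a, b)` with two arguments, and the aliases `new_members`, `cancelled_members` and `rejoined_members` are names chosen for this sketch.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL tables `dt` and `periods`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt (msid INTEGER PRIMARY KEY, id INTEGER, start TEXT, cancelled TEXT)"
)
conn.execute("CREATE TABLE periods (period_start TEXT, period_end TEXT)")
conn.executemany(
    "INSERT INTO dt VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2020-01-01 09:00:00", None),
        (2, 2, "2020-01-01 09:00:00", "2020-12-31 09:00:00"),
        (3, 2, "2021-01-01 09:00:00", None),
        (4, 3, "2020-01-01 09:00:00", "2020-06-30 09:00:00"),
        (5, 3, "2020-02-01 09:00:00", "2020-06-30 09:00:00"),
        (6, 3, "2020-07-01 09:00:00", None),
    ],
)
conn.executemany(
    "INSERT INTO periods VALUES (?, ?)",
    [
        ("2020-01-01 00:00:00", "2020-01-01 23:59:59"),
        ("2020-01-01 00:00:00", "2020-06-30 23:59:59"),
        ("2020-01-01 00:00:00", "2020-07-01 23:59:59"),
        ("2020-01-01 00:00:00", "2020-12-31 23:59:59"),
        ("2020-01-01 00:00:00", "2021-01-01 23:59:59"),
        ("2020-07-01 00:00:00", "2021-01-01 23:59:59"),
        ("2021-01-01 00:00:00", "2021-01-01 23:59:59"),
    ],
)

QUERY = """
WITH
tab_start AS (SELECT id, MIN(start) AS start FROM dt GROUP BY id),
tab_cancelled AS (SELECT DISTINCT id, cancelled FROM dt WHERE cancelled IS NOT NULL),
tab_rejoined AS (
    SELECT DISTINCT dt.id, dt.start AS rejoined
    FROM dt
    JOIN tab_cancelled tc ON tc.id = dt.id AND dt.start > tc.cancelled
)
SELECT p.period_start,
       p.period_end,
       COUNT(DISTINCT s.id) AS new_members,
       COUNT(DISTINCT c.id) AS cancelled_members,
       COUNT(DISTINCT r.id) AS rejoined_members
FROM periods p
LEFT JOIN tab_start s     ON s.start     BETWEEN p.period_start AND p.period_end
LEFT JOIN tab_cancelled c ON c.cancelled BETWEEN p.period_start AND p.period_end
LEFT JOIN tab_rejoined r  ON r.rejoined  BETWEEN p.period_start AND p.period_end
GROUP BY p.period_start, p.period_end
ORDER BY p.period_start, p.period_end
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

On the sample data this reproduces the seven rows of the expected result in the question. Timestamps are stored as ISO-formatted text here, so `BETWEEN` compares them as strings, which orders them correctly only because every value uses the same `YYYY-MM-DD HH:MM:SS` layout.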