Calculate club membership data (new, cancelled, and rejoined) for a period of time

Time: 2021-05-14 20:22:31

Tags: mysql sql

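I am working with club membership data.

The goal is to calculate the following in MySQL for a period between two dates:

  • The number of new members in the period.
  • The number of members who cancelled in the period.
  • The number of members who rejoined in the period.

For example, suppose we have the following membership data: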

msid  id  start                 cancelled
1     1   2020-01-01 09:00:00   null
2     2   2020-01-01 09:00:00   2020-12-31 09:00:00
3     2   2021-01-01 09:00:00   null
4     3   2020-01-01 09:00:00   2020-06-30 09:00:00
5     3   2020-02-01 09:00:00   2020-06-30 09:00:00
6     3   2020-07-01 09:00:00   null

Here msid is the primary key of the table and id is the member ID.

For the following selected periods, we should get the following results:

period_start         period_end           new  cancelled  rejoined
2020-01-01 00:00:00  2020-01-01 23:59:59  3    0          0
2020-01-01 00:00:00  2020-06-30 23:59:59  3    1          0
2020-01-01 00:00:00  2020-07-01 23:59:59  3    1          1
2020-01-01 00:00:00  2020-12-31 23:59:59  3    2          1
2020-01-01 00:00:00  2021-01-01 23:59:59  3    2          2
2020-07-01 00:00:00  2021-01-01 23:59:59  0    1          2
2021-01-01 00:00:00  2021-01-01 23:59:59  0    0          1

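A member can hold several current memberships at once, as with ID 3, but should only be counted once when cancelling.

Here is a db<>fiddle with the membership data in table dt and the periods in table periods.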
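To try the queries locally without the fiddle, the two tables could be set up roughly as follows. The column types (INT, DATETIME) and the lack of indexes are assumptions, since the fiddle's actual DDL is not shown here; the rows are copied from the two tables above.

```sql
-- Assumed schema: only the column names come from the question.
CREATE TABLE dt (
    msid      INT PRIMARY KEY,    -- membership ID
    id        INT NOT NULL,       -- member ID
    start     DATETIME NOT NULL,
    cancelled DATETIME NULL       -- NULL while the membership is still active
);

INSERT INTO dt (msid, id, start, cancelled) VALUES
    (1, 1, '2020-01-01 09:00:00', NULL),
    (2, 2, '2020-01-01 09:00:00', '2020-12-31 09:00:00'),
    (3, 2, '2021-01-01 09:00:00', NULL),
    (4, 3, '2020-01-01 09:00:00', '2020-06-30 09:00:00'),
    (5, 3, '2020-02-01 09:00:00', '2020-06-30 09:00:00'),
    (6, 3, '2020-07-01 09:00:00', NULL);

CREATE TABLE periods (
    period_start DATETIME NOT NULL,
    period_end   DATETIME NOT NULL
);

INSERT INTO periods (period_start, period_end) VALUES
    ('2020-01-01 00:00:00', '2020-01-01 23:59:59'),
    ('2020-01-01 00:00:00', '2020-06-30 23:59:59'),
    ('2020-01-01 00:00:00', '2020-07-01 23:59:59'),
    ('2020-01-01 00:00:00', '2020-12-31 23:59:59'),
    ('2020-01-01 00:00:00', '2021-01-01 23:59:59'),
    ('2020-07-01 00:00:00', '2021-01-01 23:59:59'),
    ('2021-01-01 00:00:00', '2021-01-01 23:59:59');
```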

1 answer:

Answer 0: (score: 2)

Since I am still learning myself, I am not sure this is the most compact solution, but it does work.

  1. As far as I understand, you count a member as new when they start their first membership (not when they start a second one, otherwise member 3 would be counted as new twice). This means that selecting the minimum start date for each member gives you the corresponding dates.

    SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start
    FROM dt

  2. The cancellation dates are straightforward: they are simply the non-NULL entries in the corresponding column.

    SELECT DISTINCT id, cancelled
    FROM dt
    WHERE cancelled IS NOT NULL

  3. Based on your example, I assume the rejoin dates should be the start dates that come after a previous cancellation. So we can reuse the second query above to get them. I chose DISTINCT to make sure each date is listed only once per user (for example, if a member cancelled and rejoined several times).

    WITH tab_cancelled AS (
        SELECT DISTINCT id, cancelled
        FROM dt
        WHERE cancelled IS NOT NULL
    )
    SELECT DISTINCT dt.id, dt.start AS rejoined
    FROM dt
    INNER JOIN tab_cancelled tc
        ON tc.id = dt.id AND dt.start > tc.cancelled

Putting it all together, I first use nested LEFT JOINs to check whether the respective dates fall within each period, and COUNT() OVER (PARTITION BY ...) to get the counts for each period:

    WITH tab_start AS (
        SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start
        FROM dt
    ),
    tab_cancelled AS (
        SELECT DISTINCT id, cancelled
        FROM dt
        WHERE cancelled IS NOT NULL
    ),
    tab_rejoined AS (
        SELECT DISTINCT dt.id, dt.start AS rejoined
        FROM dt
        INNER JOIN tab_cancelled tc
            ON tc.id = dt.id AND dt.start > tc.cancelled
    )
    SELECT DISTINCT period_start, period_end, new, cancelled,
           COUNT(dt3.id) OVER (PARTITION BY period_start, period_end) AS rejoined
    FROM (
        SELECT DISTINCT period_start, period_end, new,
               COUNT(dt2.id) OVER (PARTITION BY period_start, period_end) AS cancelled
        FROM (
            SELECT DISTINCT period_start, period_end,
                   COUNT(dt1.id) OVER (PARTITION BY period_start, period_end) AS new
            FROM periods
            LEFT JOIN tab_start AS dt1
                ON dt1.start BETWEEN period_start AND period_end
        ) u
        LEFT JOIN tab_cancelled AS dt2
            ON dt2.cancelled BETWEEN period_start AND period_end
    ) v
    LEFT JOIN tab_rejoined AS dt3
        ON dt3.rejoined BETWEEN period_start AND period_end

Alternatively, COUNT(DISTINCT x) together with GROUP BY achieves the same without window functions in the final query:

    WITH tab_start AS (
        SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start
        FROM dt
    ),
    tab_cancelled AS (
        SELECT DISTINCT id, cancelled
        FROM dt
        WHERE cancelled IS NOT NULL
    ),
    tab_rejoined AS (
        SELECT DISTINCT dt.id, dt.start AS rejoined
        FROM dt
        INNER JOIN tab_cancelled tc
            ON tc.id = dt.id AND dt.start > tc.cancelled
    )
    SELECT DISTINCT period_start, period_end,
           COUNT(DISTINCT dt1.id) AS new,
           COUNT(DISTINCT dt2.id, dt2.cancelled) AS cancelled,
           COUNT(DISTINCT dt3.id, dt3.rejoined) AS rejoined
    FROM periods
    LEFT JOIN tab_start AS dt1
        ON dt1.start BETWEEN period_start AND period_end
    LEFT JOIN tab_cancelled AS dt2
        ON dt2.cancelled BETWEEN period_start AND period_end
    LEFT JOIN tab_rejoined AS dt3
        ON dt3.rejoined BETWEEN period_start AND period_end
    GROUP BY period_start, period_end

Either way, this should give you the desired result:

period_start         period_end           new  cancelled  rejoined
2020-01-01 00:00:00  2020-01-01 23:59:59  3    0          0
2020-01-01 00:00:00  2020-06-30 23:59:59  3    1          0
2020-01-01 00:00:00  2020-07-01 23:59:59  3    1          1
2020-01-01 00:00:00  2020-12-31 23:59:59  3    2          1
2020-01-01 00:00:00  2021-01-01 23:59:59  3    2          2
2020-07-01 00:00:00  2021-01-01 23:59:59  0    1          2
2021-01-01 00:00:00  2021-01-01 23:59:59  0    0          1

Note that both queries will (by design) count several cancellations separately for a single user with multiple memberships if those cancellations fall on different dates. The same applies if the same user quits and rejoins several times within one period. If you want neither of these to happen, you can use COUNT(DISTINCT id) within each period, i.e. count distinct ids instead of datetime-id combinations.
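A sketch of that stricter variant, where each member is counted at most once per period and category, might look like the following. It reuses the same three CTEs and only changes the COUNT expressions; it assumes MySQL 8.0 or later, which the CTEs and the window function require.

```sql
WITH tab_start AS (
    -- first membership start per member
    SELECT DISTINCT id, MIN(start) OVER (PARTITION BY id) AS start
    FROM dt
),
tab_cancelled AS (
    -- every distinct cancellation date per member
    SELECT DISTINCT id, cancelled
    FROM dt
    WHERE cancelled IS NOT NULL
),
tab_rejoined AS (
    -- every start date that follows an earlier cancellation of the same member
    SELECT DISTINCT dt.id, dt.start AS rejoined
    FROM dt
    INNER JOIN tab_cancelled tc
        ON tc.id = dt.id AND dt.start > tc.cancelled
)
SELECT period_start,
       period_end,
       COUNT(DISTINCT dt1.id) AS new,
       COUNT(DISTINCT dt2.id) AS cancelled,  -- member counted once, however many dates
       COUNT(DISTINCT dt3.id) AS rejoined    -- likewise for rejoins
FROM periods
LEFT JOIN tab_start AS dt1
    ON dt1.start BETWEEN period_start AND period_end
LEFT JOIN tab_cancelled AS dt2
    ON dt2.cancelled BETWEEN period_start AND period_end
LEFT JOIN tab_rejoined AS dt3
    ON dt3.rejoined BETWEEN period_start AND period_end
GROUP BY period_start, period_end;
```

Because the three LEFT JOINs multiply rows within a period, the DISTINCT inside each COUNT is what keeps the totals from being inflated.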

For your sample data the results are the same, so you will have to look at your edge cases and decide how they should be counted.
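One such edge case can be made concrete with a hypothetical member 4 (not part of the question's data) who holds two overlapping memberships and cancels them on different dates. The sketch below assumes the dt table already contains the six sample rows from the question and compares the two counting rules for the period 2020-01-01 to 2020-06-30.

```sql
-- Hypothetical rows, added only to illustrate the difference.
INSERT INTO dt (msid, id, start, cancelled) VALUES
    (7, 4, '2020-01-01 09:00:00', '2020-03-01 09:00:00'),
    (8, 4, '2020-01-15 09:00:00', '2020-04-01 09:00:00');

SELECT COUNT(DISTINCT id, cancelled) AS per_cancellation_date,
       COUNT(DISTINCT id)            AS per_member
FROM dt
WHERE cancelled BETWEEN '2020-01-01 00:00:00' AND '2020-06-30 23:59:59';
```

With those rows, per_cancellation_date comes out as 3 (member 3 once, member 4 twice) while per_member is 2, so the two rules only diverge once a member has more than one cancellation date inside the same period.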

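As I wrote, this can perhaps be condensed or made more performant somehow, but it is the best I could come up with.

You can find the corresponding db<>fiddle here.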