Query for count per day, including multi-week date constraints

Time: 2014-11-13 00:50:00

Tags: sql postgresql date aggregate-functions postgresql-9.3

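I am trying to find the number of active users per day.

A user is active when they made more than 10 requests per week for 4 consecutive weeks.

I.e., on Oct 31, 2014, a user is active if they made more than 10 requests in total per week during each of:

  1. Oct 24 to Oct 30, 2014 AND
  2. Oct 17 to Oct 23, 2014 AND
  3. Oct 10 to Oct 16, 2014 AND
  4. Oct 3 to Oct 9, 2014

I have a table of requests: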

    CREATE TABLE requests (
      id text PRIMARY KEY, -- id of the request
      amount bigint,       -- sum of requests made by accounts_id to recipient_id,
                           -- aggregated on a daily basis based on "date"
      accounts_id text,    -- id of the user
      recipient_id text,   -- id of the recipient
      date timestamp       -- date that the request was made in YYYY-MM-DD
    );
    

Sample values:

    INSERT INTO requests
    VALUES
        ('1',  19, 'a1', 'b1', '2014-10-05 00:00:00'),
        ('2',  19, 'a2', 'b2', '2014-10-06 00:00:00'),
        ('3',  85, 'a3', 'b3', '2014-10-07 00:00:00'),
        ('4',  11, 'a1', 'b4', '2014-10-13 00:00:00'),
        ('5',  2,  'a2', 'b5', '2014-10-14 00:00:00'),
        ('6',  50, 'a3', 'b5', '2014-10-15 00:00:00'),
        ('7',  787323, 'a1', 'b6', '2014-10-17 00:00:00'),
        ('8',  33, 'a2', 'b8', '2014-10-18 00:00:00'),
        ('9',  14, 'a3', 'b9', '2014-10-19 00:00:00'),
        ('10', 11, 'a4', 'b10', '2014-10-19 00:00:00'),
        ('11', 1628, 'a1', 'b11', '2014-10-25 00:00:00'),
        ('13', 101, 'a2', 'b11', '2014-10-25 00:00:00');
    

Sample output:

    Date       | # Active users
    -----------+---------------
    10-01-2014 | 600
    10-02-2014 | 703
    10-03-2014 | 891
    

Here is my attempt at finding the number of active users for one given date (e.g. 10-01-2014):

    SELECT count(*)
    FROM
      (SELECT accounts_id
       FROM requests
       WHERE "date" BETWEEN '2014-10-01'::date - interval '2 weeks' AND '2014-10-01'::date - interval '1 week'
       GROUP BY accounts_id HAVING sum(amount) > 10) week_1
    JOIN
      (SELECT accounts_id
       FROM requests
       WHERE "date" BETWEEN '2014-10-01'::date - interval '3 weeks' AND '2014-10-01'::date - interval '2 week'
       GROUP BY accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
    JOIN
      (SELECT accounts_id
       FROM requests
       WHERE "date" BETWEEN '2014-10-01'::date - interval '4 weeks' AND '2014-10-01'::date - interval '3 week'
       GROUP BY accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
    JOIN
      (SELECT accounts_id
       FROM requests
       WHERE "date" BETWEEN '2014-10-01'::date - interval '5 weeks' AND '2014-10-01'::date - interval '4 week'
       GROUP BY accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
    

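Since this query only returns the number for a single day, and I need that number for every day, I think the idea is to join against a series of dates. So I tried something like this: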

    SELECT week_1."Date_series",
           count(*)
    FROM
      (SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
              accounts_id
       FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
       WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '2 weeks' AND requests.date::date - interval '1 week'
       GROUP BY "Date_series",
                accounts_id HAVING sum(amount) > 10) week_1
    JOIN
      (SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
              accounts_id
       FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
       WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '3 weeks' AND requests.date::date - interval '2 week'
       GROUP BY "Date_series",
                accounts_id HAVING sum(amount) > 10) week_2 ON week_1.accounts_id = week_2.accounts_id
    AND week_1."Date_series" = week_2."Date_series"
    JOIN
      (SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
              accounts_id
       FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
       WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '4 weeks' AND requests.date::date - interval '3 week'
       GROUP BY "Date_series",
                accounts_id HAVING sum(amount) > 10) week_3 ON week_2.accounts_id = week_3.accounts_id
    AND week_2."Date_series" = week_3."Date_series"
    JOIN
      (SELECT to_char(DAY::date, 'YYYY-MM-DD') AS "Date_series",
              accounts_id
       FROM generate_series('2014-10-01'::date, CURRENT_DATE, '1 day') DAY, requests
       WHERE to_char(DAY::date, 'YYYY-MM-DD')::date BETWEEN requests.date::date - interval '5 weeks' AND requests.date::date - interval '4 week'
       GROUP BY "Date_series",
                accounts_id HAVING sum(amount) > 10) week_4 ON week_3.accounts_id = week_4.accounts_id
    AND week_3."Date_series" = week_4."Date_series"
    GROUP BY week_1."Date_series"
    

However, I don't think I am getting the right answer, and I am not sure why. Any tips / guidance / pointers are much appreciated! :) :)

PS. I am using Postgres 9.3

1 answer:

Answer 0: (score: 6)

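This is a long answer on how to make your query short. :)

Building on my own table (written before you provided the table definition with different (odd!) data types):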

CREATE TABLE requests (
   id           int
 , accounts_id  int  -- (id of the user)
 , recipient_id int  -- (id of the recipient)
 , date         date -- (date that the request was made in YYYY-MM-DD)
 , amount       int  -- (# of requests by accounts_id for the day)
);

Active users for a given day

List of "active users" for one given day:

SELECT accounts_id
FROM  (
   SELECT w.w, r.accounts_id
   FROM  (
      SELECT w
           , day - 6 - 7 * w AS w_start
           , day     - 7 * w AS w_end   
      FROM  (SELECT '2014-10-31'::date - 1 AS day) d  -- effective date here
           , generate_series(0,3) w
      ) w
   JOIN   requests r ON r."date" BETWEEN w_start AND w_end
   GROUP  BY w.w, r.accounts_id
   HAVING sum(r.amount) > 10
   ) sub
GROUP  BY 1
HAVING count(*) = 4;

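Step 1

In the innermost subquery w (for "week"), build the bounds of the 4 weeks of interest from a CROSS JOIN of the given day - 1 with the output of generate_series(0, 3).

To add or subtract a number of days to or from a date (not a timestamp!), just add or subtract integer numbers. The expression day - 7 * w subtracts 0 to 3 times 7 days from the given date, arriving at the end date of each week (w_end).
Subtract another 6 days (not 7!) from each to compute the respective start (w_start).
Additionally, keep the week number w (0-3) for the later aggregation.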
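The date arithmetic above can be cross-checked outside the database. The following is a minimal Python sketch (not part of the SQL answer; the function name `week_bounds` is made up for illustration) that reproduces w, w_start and w_end for the effective date used in the query:

```python
from datetime import date, timedelta

def week_bounds(effective, weeks=4):
    """Return (w, w_start, w_end) for the `weeks` full weeks before `effective`."""
    day = effective - timedelta(days=1)          # mirrors '2014-10-31'::date - 1
    bounds = []
    for w in range(weeks):                       # mirrors generate_series(0, 3)
        w_end = day - timedelta(days=7 * w)      # day - 7 * w
        w_start = w_end - timedelta(days=6)      # day - 6 - 7 * w
        bounds.append((w, w_start, w_end))
    return bounds

for w, start, end in week_bounds(date(2014, 10, 31)):
    print(w, start, end)
```

For 2014-10-31 this yields the same four ranges the question lists, Oct 24 to Oct 30 down to Oct 3 to Oct 9. Subtracting 6 rather than 7 keeps each week at exactly 7 days with inclusive bounds and no overlap.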

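Step 2

In subquery sub, join rows from requests to the set of 4 weeks where the date lies between start and end date. GROUP BY the week number w and accounts_id. Only weeks with more than 10 requests in total qualify.

Step 3

In the outer SELECT, count the number of weeks each user (accounts_id) qualified for. It must be 4 to qualify as an "active user".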
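To see the three steps work on actual data, here is a small Python sketch of the same logic, run against the sample rows from the question (reduced to accounts_id, date and amount). It is an illustration only; `active_users` and `ROWS` are names chosen here, and the real work should stay in SQL.

```python
from collections import defaultdict
from datetime import date, timedelta

# (accounts_id, date, amount) taken from the sample rows in the question
ROWS = [
    ("a1", date(2014, 10, 5), 19), ("a2", date(2014, 10, 6), 19),
    ("a3", date(2014, 10, 7), 85), ("a1", date(2014, 10, 13), 11),
    ("a2", date(2014, 10, 14), 2), ("a3", date(2014, 10, 15), 50),
    ("a1", date(2014, 10, 17), 787323), ("a2", date(2014, 10, 18), 33),
    ("a3", date(2014, 10, 19), 14), ("a4", date(2014, 10, 19), 11),
    ("a1", date(2014, 10, 25), 1628), ("a2", date(2014, 10, 25), 101),
]

def active_users(rows, effective, weeks=4, min_amount=10):
    day = effective - timedelta(days=1)
    totals = defaultdict(int)                     # (w, accounts_id) -> sum(amount)
    for account, d, amount in rows:
        for w in range(weeks):
            w_end = day - timedelta(days=7 * w)
            if w_end - timedelta(days=6) <= d <= w_end:   # BETWEEN w_start AND w_end
                totals[(w, account)] += amount
    qualifying = defaultdict(int)                 # accounts_id -> qualifying weeks
    for (w, account), total in totals.items():
        if total > min_amount:                    # HAVING sum(r.amount) > 10
            qualifying[account] += 1
    # HAVING count(*) = 4
    return sorted(a for a, n in qualifying.items() if n == weeks)

print(active_users(ROWS, date(2014, 10, 31)))
```

With the sample rows, only a1 passes all four weeks for 2014-10-31: a2 has just 2 requests in the week of Oct 10 to Oct 16, and a3 has nothing in the week of Oct 24 to Oct 30.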

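Number of active users per day

This is dynamite. It is wrapped in a simple SQL function to simplify general use, but the query can be used on its own just as well: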

CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
  RETURNS TABLE (day date, users int) AS
$func$
WITH r AS (
   SELECT accounts_id, date, sum(amount)::int AS amount
   FROM   requests
   WHERE  date BETWEEN _now - (27 + _days) AND _now - 1
   GROUP  BY accounts_id, date
   )
SELECT date + 1, count(w_ct = 4 OR NULL)::int
FROM  (
   SELECT accounts_id, date
        , count(w_amount > 10 OR NULL)
                         OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
                         ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
   FROM  (
      SELECT accounts_id, date, dow   
           , sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
                         ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
      FROM  (SELECT _now - i AS date, i%7 AS dow
             FROM   generate_series(1, 27 + _days) i) d -- period of interest
      CROSS  JOIN (
             SELECT accounts_id FROM r
             GROUP  BY 1
             HAVING count(*) > 3 AND sum(amount) > 39  -- enough rows & requests
             AND    max(date) > min(date) + 14) a      -- can cover 4 weeks (span of 15+ days)
      LEFT   JOIN r USING (accounts_id, date)
      ) sub1
   WHERE date > _now - (22 + _days)  -- cut off 6 trailing days now - useful?
   ) sub2
GROUP  BY date
ORDER  BY date DESC
LIMIT  _days
$func$ LANGUAGE sql STABLE;

The function takes any day (_now), "today" by default, and the number of days (_days) in the result, 3 by default. Call:

SELECT * FROM f_active_users('2014-10-31', 5);

Or without parameters to use the defaults:

SELECT * FROM f_active_users();

The approach differs from the first query:

SQL Fiddle with both queries and variants for your table definition.

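Step 0

In the CTE r, pre-aggregate amounts per (accounts_id, date), only for the period of interest, for better performance. The table is scanned only once, and the suggested index (see below) will kick in here.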
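As a rough illustration of what the CTE produces, the following Python sketch performs the same filter and grouping on in-memory tuples. The function name `preaggregate` and the three input rows are hypothetical, made up to show one merge and one exclusion.

```python
from collections import defaultdict
from datetime import date, timedelta

def preaggregate(rows, now, days=3):
    """Sum amounts per (accounts_id, date) inside the period of interest."""
    first = now - timedelta(days=27 + days)       # _now - (27 + _days)
    last = now - timedelta(days=1)                # _now - 1
    r = defaultdict(int)
    for account, d, amount in rows:
        if first <= d <= last:                    # WHERE date BETWEEN ... AND ...
            r[(account, d)] += amount             # GROUP BY accounts_id, date
    return dict(r)

# Hypothetical rows: two recipients on the same day, plus one row that is too old
rows = [
    ("a1", date(2014, 10, 25), 1000),
    ("a1", date(2014, 10, 25), 628),
    ("a1", date(2014, 9, 1), 50),
]
print(preaggregate(rows, date(2014, 10, 31)))
```

The two rows for Oct 25 collapse into a single entry of 1628, and the Sep 1 row falls outside the 30-day period, so later steps see at most one row per user and day.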

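Step 1

In the inner subquery d, generate the necessary list of days: 27 + _days rows, where _days is the desired number of rows in the output, so effectively 28 days or more.
While at it, compute the day of the week (dow) to be used for aggregating in step 3. i%7 coincides with weekly intervals; the query works for any interval, though.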
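Note that dow here is only a 7-day bucket number, not a calendar weekday name. A short Python sketch (the helper `days_of_interest` is a name invented for this example) shows that days sharing a bucket are exactly one week apart:

```python
from datetime import date, timedelta

def days_of_interest(now, days=3):
    """Mirror of subquery d: one (date, dow) pair per i in 1 .. 27 + days."""
    return [(now - timedelta(days=i), i % 7) for i in range(1, 27 + days + 1)]

d = days_of_interest(date(2014, 10, 31), days=3)
same_bucket = [day for day, dow in d if dow == 1]

print(len(d))
print(same_bucket)
```

For 2014-10-31 and 3 output days this gives 30 rows, and bucket 1 holds Oct 30, 23, 16, 9 and 2. That is what later allows a window partitioned by dow to step back week by week.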

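In the inner subquery a, generate a unique list of users (accounts_id) that exist in the CTE r and pass some preliminary, superficial tests (enough rows, spanning enough time, with enough total requests).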

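Step 2

Generate a Cartesian product of d and a with a CROSS JOIN to have one row for every relevant day for every relevant user. LEFT JOIN to r to append the amount of requests (if any). There is no WHERE condition: we want every day in the result, even if there are no active users at all.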
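The shape of that intermediate result can be sketched in a few lines of Python. This is only an analogy for the two joins; `grid` and the single pre-aggregated row are hypothetical.

```python
from datetime import date, timedelta
from itertools import product

def grid(days, users, r):
    """CROSS JOIN days x users, then LEFT JOIN r: missing pairs keep amount None."""
    return [(u, d, r.get((u, d))) for d, u in product(days, users)]

now = date(2014, 10, 31)
days = [now - timedelta(days=i) for i in range(1, 4)]      # three days, for brevity
r = {("a1", date(2014, 10, 30)): 12}                       # hypothetical pre-aggregated row
rows = grid(days, ["a1", "a2"], r)

print(len(rows))
print(rows[0])
```

Three days times two users give six rows, five of which carry None (NULL in SQL) as amount. Having a row for every day, with or without requests, is what makes row-based window frames in the next step line up with calendar days.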

Compute the total amount for the past week (w_amount) in the same step, using a window function with a custom frame.
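Because the grid from the joins has exactly one row per user and day, a frame of "current row and 6 following" in descending date order covers the current day and the 6 days before it. A minimal Python sketch of that rolling sum follows, assuming one user's daily amounts in a dict (`rolling_week_sum` is an illustrative name):

```python
from datetime import date, timedelta

def rolling_week_sum(daily, day):
    """w_amount for `day`: that day plus the 6 days before it.

    `daily` maps date -> amount for one user; days without a row are skipped,
    and a window with no rows at all yields None, like SQL sum() over NULLs.
    """
    values = [daily[day - timedelta(days=k)] for k in range(7)
              if day - timedelta(days=k) in daily]
    return sum(values) if values else None

# Daily amounts of user a1 from the question's sample rows
a1 = {date(2014, 10, 13): 11, date(2014, 10, 17): 787323}
print(rolling_week_sum(a1, date(2014, 10, 19)))   # covers Oct 13 .. Oct 19
print(rolling_week_sum(a1, date(2014, 10, 20)))   # covers Oct 14 .. Oct 20
print(rolling_week_sum(a1, date(2014, 10, 12)))   # no rows in the window
```

The window ending Oct 19 contains both rows (787334 in total), the one ending Oct 20 has lost Oct 13 (787323), and an empty window gives None.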

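Step 3

Now cut off the trailing 6 days; this is optional and may or may not help performance. Test it: WHERE date >= _now - (21 + _days), which is equivalent to the date > _now - (22 + _days) used in the function.

Count the number of weeks meeting the minimum amount (w_ct) in a similar window function, this time additionally partitioned by dow, so that the frame only holds the same weekday for the past 4 weeks (each carrying the sum of its respective past week). The expression count(w_amount > 10 OR NULL) only counts rows with more than 10 requests.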
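The count(... OR NULL) idiom relies on two SQL rules: count(expr) ignores NULL, and FALSE OR NULL evaluates to NULL while TRUE OR NULL stays TRUE. The following Python sketch imitates that with None standing in for NULL; all three helper names are invented for this illustration.

```python
def sql_or_null(condition):
    """Three-valued `condition OR NULL`: TRUE stays TRUE, FALSE and NULL become NULL."""
    return True if condition is True else None

def sql_count(values):
    """SQL count(expr) counts only non-NULL inputs."""
    return sum(v is not None for v in values)

def qualifying_weeks(week_sums, min_amount=10):
    """week_sums: w_amount for the current day and the same dow 1, 2 and 3 weeks back."""
    flags = [None if s is None else s > min_amount for s in week_sums]
    return sql_count(sql_or_null(f) for f in flags)

print(qualifying_weeks([1628, 787323, 11, 19]))   # all four weeks above 10
print(qualifying_weeks([101, 35, None, 19]))      # one week without any requests
print(qualifying_weeks([101, 35, 2, 19]))         # one week with too few requests
```

The three calls print 4, 3 and 3. Only the first case reaches w_ct = 4 and would count as an active user in the outer query.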

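Step 4

In the outer SELECT, group by date and count the users that passed all 4 weeks (count(w_ct = 4 OR NULL)). Add 1 to the date to compensate for the off-by-one, then ORDER and LIMIT to the requested number of days.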
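For cross-checking the function's output on small data, a brute-force reference can help. The Python sketch below does not use the window-function approach at all. It simply applies the definition of an active user to every reported day, using the question's sample rows; `week_total` and `active_users_per_day` are names chosen for this example.

```python
from datetime import date, timedelta

# (accounts_id, date, amount) taken from the sample rows in the question
ROWS = [
    ("a1", date(2014, 10, 5), 19), ("a2", date(2014, 10, 6), 19),
    ("a3", date(2014, 10, 7), 85), ("a1", date(2014, 10, 13), 11),
    ("a2", date(2014, 10, 14), 2), ("a3", date(2014, 10, 15), 50),
    ("a1", date(2014, 10, 17), 787323), ("a2", date(2014, 10, 18), 33),
    ("a3", date(2014, 10, 19), 14), ("a4", date(2014, 10, 19), 11),
    ("a1", date(2014, 10, 25), 1628), ("a2", date(2014, 10, 25), 101),
]

def week_total(rows, account, w_end):
    """Total amount of one user in the 7 days ending on w_end (inclusive)."""
    return sum(a for acc, d, a in rows
               if acc == account and w_end - timedelta(days=6) <= d <= w_end)

def active_users_per_day(rows, now, days=3):
    users = sorted({acc for acc, _, _ in rows})
    result = []
    for k in range(days):                         # newest day first, like ORDER BY date DESC
        day = now - timedelta(days=k)             # reported day, i.e. date + 1 in the function
        last = day - timedelta(days=1)            # last day of the most recent full week
        count = sum(
            all(week_total(rows, u, last - timedelta(days=7 * w)) > 10 for w in range(4))
            for u in users
        )
        result.append((day, count))
    return result

for day, users in active_users_per_day(ROWS, date(2014, 10, 31), days=5):
    print(day, users)
```

With the sample rows this reports 1 active user (a1) for Oct 28 through Oct 31 and 0 for Oct 27, because on Oct 27 a1 has no requests in the week of Oct 6 to Oct 12. Days without active users still appear with a count of 0, as in the SQL function.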

Performance and outlook

The perfect index for both queries would be:

CREATE INDEX foo ON requests (date, accounts_id, amount);

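Performance should be good, but it should get even (much) better with the upcoming Postgres 9.4, thanks to the new moving-aggregate support:

Moving-aggregate support in the Postgres Wiki.
Moving aggregates in the 9.4 manual

Aside: don't call a timestamp column "date"; it is a timestamp, not a date. Better yet, never use basic type names like date or timestamp as identifiers. Ever.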