Selecting distinct sets of users by time range

Time: 2013-04-17 03:17:10

Tags: sql postgresql date correlated-subquery window-functions

I have a table with the following information:

 |date | user_id | week_beg | month_beg|

SQL to create the table with test values:

CREATE TABLE uniques
(
  date DATE,
  user_id INT,
  week_beg DATE,
  month_beg DATE
);
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01');

INPUT TABLE:

 | date       | user_id     | week_beg   | month_beg  |    
 | 2013-01-01 | 1           | 2012-12-30 | 2013-01-01 |    
 | 2013-01-03 | 3           | 2012-12-30 | 2013-01-01 |    
 | 2013-01-06 | 4           | 2013-01-06 | 2013-01-01 |    
 | 2013-01-07 | 4           | 2013-01-06 | 2013-01-01 |  

OUTPUT TABLE:

 | date       | time_series | cnt        |                 
 | 2013-01-01 | D           | 1          |                 
 | 2013-01-01 | W           | 1          |                 
 | 2013-01-01 | M           | 1          |                 
 | 2013-01-03 | D           | 1          |                 
 | 2013-01-03 | W           | 2          |                 
 | 2013-01-03 | M           | 2          |                 
 | 2013-01-06 | D           | 1          |                 
 | 2013-01-06 | W           | 1          |                 
 | 2013-01-06 | M           | 3          |                 
 | 2013-01-07 | D           | 1          |                 
 | 2013-01-07 | W           | 1          |                 
 | 2013-01-07 | M           | 3          |

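I want to count the number of distinct user_ids for a date:

  1. for that date

  2. for the week up to that date (week to date)

  3. for the month up to that date (month to date)

1 is easy to calculate. For 2 and 3 I tried queries like this: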

    SELECT
      date,
      'W' AS "time_series",
      (COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY week_beg) AS "cnt"
      FROM user_subtitles
    
    SELECT
      date,
      'M' AS "time_series",
      (COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY month_beg) AS "cnt"
      FROM user_subtitles
    

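Postgres does not allow window functions with DISTINCT, so this approach does not work.

I have also tried a GROUP BY approach, but it does not work because it gives me the numbers for the whole week / month.

What is the best way to approach this?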

4 answers:

Answer 0 (score: 3)

Count all rows

SELECT date, '1_D' AS time_series,  count(DISTINCT user_id) AS cnt
FROM   uniques
GROUP  BY 1

UNION  ALL
SELECT DISTINCT ON (1)
       date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM   uniques

UNION  ALL
SELECT DISTINCT ON (1)
       date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM   uniques
ORDER  BY 1, time_series
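
  • Your columns week_beg and month_beg are 100% redundant and can easily be replaced by date_trunc('week', date + 1)::date - 1 and date_trunc('month', date), respectively.

  • Your week seems to start on Sunday (off by one), hence the + 1 .. - 1.

  • The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That is exactly what you need.

  • Use UNION ALL, not UNION.

  • Your unfortunate choice of values for time_series (D, W, M) does not sort well; I renamed them to make the final ORDER BY easier.

  • This query can deal with multiple rows per day. Counts include all peers for a day.

  • More about DISTINCT ON: Select first row in each GROUP BY group?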
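
The date arithmetic in the first two bullets can be checked outside the database. Below is a minimal Python sketch, assuming the week starts on Sunday as in the sample data. The helper names week_beg and month_beg are only illustrative and mirror the column names from the question; they are not part of any library.

```python
from datetime import date, timedelta


def week_beg(d: date) -> date:
    """Sunday-based week start, mirroring date_trunc('week', d + 1)::date - 1."""
    shifted = d + timedelta(days=1)                        # d + 1
    monday = shifted - timedelta(days=shifted.weekday())   # truncate to Monday
    return monday - timedelta(days=1)                      # ... - 1 gives Sunday


def month_beg(d: date) -> date:
    """First day of the month, mirroring date_trunc('month', d)::date."""
    return d.replace(day=1)


# Dates taken from the sample rows in the question.
for d in (date(2013, 1, 1), date(2013, 1, 3), date(2013, 1, 6), date(2013, 1, 7)):
    print(d, week_beg(d), month_beg(d))
```

For the four sample dates this reproduces the stored week_beg and month_beg values, which is why the two columns carry no extra information.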

DISTINCT users per day

To count every user only once per day, use a CTE with DISTINCT ON:

WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series,  count(user_id) AS cnt
FROM   x
GROUP  BY 1

UNION ALL
SELECT DISTINCT ON (1)
       date, '2_W'
      ,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
                      ORDER BY date)
FROM   x

UNION ALL
SELECT DISTINCT ON (1)
       date, '3_M'
      ,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM   x
ORDER BY 1, 2
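
To see what the running window count produces on the sample rows, here is a small sketch that runs the week part of the query through Python's built-in sqlite3 module. SQLite only stands in for PostgreSQL here, which is an assumption made for illustration: it has no DISTINCT ON or date_trunc(), so the sketch uses SELECT DISTINCT, the stored week_beg column, and a column named day instead of date.

```python
import sqlite3

# Sample rows from the question: (day, user_id, week_beg, month_beg).
ROWS = [
    ("2013-01-01", 1, "2012-12-30", "2013-01-01"),
    ("2013-01-03", 3, "2012-12-30", "2013-01-01"),
    ("2013-01-06", 4, "2013-01-06", "2013-01-01"),
    ("2013-01-07", 4, "2013-01-06", "2013-01-01"),
]

# Week-to-date running count over per-day deduplicated rows.
# With ORDER BY in OVER and no explicit frame, the frame defaults to
# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
WEEK_QUERY = """
WITH x AS (SELECT DISTINCT day, user_id, week_beg FROM uniques)
SELECT DISTINCT day,
       count(*) OVER (PARTITION BY week_beg ORDER BY day) AS cnt
FROM   x
ORDER  BY day
"""


def running_week_counts(rows=ROWS):
    """Return [(day, week_to_date_count), ...]; needs SQLite 3.25+ for window functions."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE uniques (day TEXT, user_id INTEGER, week_beg TEXT, month_beg TEXT)"
    )
    con.executemany("INSERT INTO uniques VALUES (?, ?, ?, ?)", rows)
    result = con.execute(WEEK_QUERY).fetchall()
    con.close()
    return result


print(running_week_counts())
```

On the sample data the week count for 2013-01-07 comes out as 2, because user 4 is counted once for each day of that week. This variant counts each user once per day, not once per period.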

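DISTINCT users over dynamic period of time

You can always resort to correlated subqueries. They tend to be slow with big tables!
Building on the previous queries: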

WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
    ,d  AS (
    SELECT date
          ,(date_trunc('week', date + 1)::date - 1) AS week_beg
          ,date_trunc('month', date)::date AS month_beg
    FROM   uniques
    GROUP  BY 1
    )
SELECT date, '1_D' AS time_series,  count(user_id) AS cnt
FROM   du
GROUP  BY 1

UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
                     WHERE  du.date BETWEEN d.week_beg AND d.date )
FROM   d
GROUP  BY date, week_beg

UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
                     WHERE  du.date BETWEEN d.month_beg AND d.date)
FROM   d
GROUP  BY date, month_beg
ORDER  BY 1,2;

SQL Fiddle for all three solutions.
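
As a cross-check of the correlated-subquery logic, the same "distinct users between period start and current date" count can be written in a few lines of plain Python. This is only a reference sketch over the sample rows from the question; the function name to_date_counts is hypothetical, and the period starts are read from the stored week_beg and month_beg values.

```python
# Sample rows from the question: (date, user_id, week_beg, month_beg).
# ISO date strings compare in chronological order, so plain string
# comparison is enough for the BETWEEN-style range check.
ROWS = [
    ("2013-01-01", 1, "2012-12-30", "2013-01-01"),
    ("2013-01-03", 3, "2012-12-30", "2013-01-01"),
    ("2013-01-06", 4, "2013-01-06", "2013-01-01"),
    ("2013-01-07", 4, "2013-01-06", "2013-01-01"),
]


def to_date_counts(rows):
    """Return [(date, time_series, cnt), ...] with distinct users per period to date."""
    du = {(day, user) for day, user, _, _ in rows}            # distinct (date, user_id)
    d = sorted({(day, wk, mo) for day, _, wk, mo in rows})    # one row per date
    out = []
    for day, week_beg, month_beg in d:
        for label, start in (("1_D", day), ("2_W", week_beg), ("3_M", month_beg)):
            # Equivalent of: count(DISTINCT user_id) WHERE date BETWEEN start AND day
            users = {user for other_day, user in du if start <= other_day <= day}
            out.append((day, label, len(users)))
    return out


for row in to_date_counts(ROWS):
    print(row)
```

The counts match the output table in the question, with the renamed time_series labels.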

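Faster with dense_rank()

@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.

Building on a simplified and sanitized data model, without the redundant columns, and with day as the column name instead of date.

date is a reserved word in standard SQL and a basic type name in PostgreSQL, and should not be used as an identifier.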

CREATE TABLE uniques(
   day date     -- instead of "date"
  ,user_id int
);

Improved query:

WITH du AS (
   SELECT DISTINCT ON (1, 2)
          day, user_id 
         ,date_trunc('week',  day + 1)::date - 1 AS week_beg
         ,date_trunc('month', day)::date         AS month_beg
   FROM   uniques
   )
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM  (
    SELECT user_id, day
          ,dense_rank() OVER(PARTITION BY week_beg  ORDER BY user_id) AS w
          ,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
    FROM   du
    ) s
GROUP  BY day
ORDER  BY day;

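SQL Fiddle demonstrating the performance of 4 faster variants. Which one is fastest depends on your data distribution. All of them are about 10x as fast as the correlated-subquery version (which isn't bad for correlated subqueries).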
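
The dense_rank() idea can be tried on the sample rows with Python's built-in sqlite3 module. This is a sketch under stated assumptions: SQLite stands in for PostgreSQL, the stored week_beg and month_beg columns replace the date_trunc() expressions, and SELECT DISTINCT replaces DISTINCT ON. The function name dense_rank_counts is hypothetical.

```python
import sqlite3

# Sample rows from the question: (day, user_id, week_beg, month_beg).
ROWS = [
    ("2013-01-01", 1, "2012-12-30", "2013-01-01"),
    ("2013-01-03", 3, "2012-12-30", "2013-01-01"),
    ("2013-01-06", 4, "2013-01-06", "2013-01-01"),
    ("2013-01-07", 4, "2013-01-06", "2013-01-01"),
]

# Rank users inside each week / month, then take the highest rank per day.
QUERY = """
WITH du AS (
    SELECT DISTINCT day, user_id, week_beg, month_beg FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM  (
    SELECT user_id, day,
           dense_rank() OVER (PARTITION BY week_beg  ORDER BY user_id) AS w,
           dense_rank() OVER (PARTITION BY month_beg ORDER BY user_id) AS m
    FROM   du
    ) AS s
GROUP  BY day
ORDER  BY day
"""


def dense_rank_counts(rows=ROWS):
    """Return [(day, d, w, m), ...]; needs SQLite 3.25+ for window functions."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE uniques (day TEXT, user_id INTEGER, week_beg TEXT, month_beg TEXT)"
    )
    con.executemany("INSERT INTO uniques VALUES (?, ?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in dense_rank_counts():
    print(row)
```

On the sample data this gives the same D, W and M values as the output table in the question. The per-day maximum is the rank of the largest user_id seen on that day, so it is worth cross-checking against the correlated-subquery result on data where user ids do not grow over time.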

Answer 1 (score: 2)

Without correlated subqueries. SQL Fiddle

with u as (
    select
        "date", user_id,
        date_trunc('week', "date" + 1)::date - 1 week_beg,
        date_trunc('month', "date")::date month_beg
    from uniques
)
select
    "date", count(distinct user_id) D,
    max(week_dr) W, max(month_dr) M
from (
    select
        user_id, "date",
        dense_rank() over(partition by week_beg order by user_id) week_dr,
        dense_rank() over(partition by month_beg order by user_id) month_dr
    from u
    ) s
group by "date"
order by "date"

Answer 2 (score: 0)

Try

SELECT
  * 
FROM 
(
  SELECT dates, count(user_id), 'D' as timesereis FROM users_data GROUP BY dates
  UNION
  SELECT max(dates), count(user_id), 'W' FROM users_data GROUP BY date_part('year',dates), date_part('week',dates)
  UNION
  SELECT max(dates), count(user_id), 'M' FROM users_data GROUP BY date_part('year',dates), date_part('month',dates)
) tEMP order by dates, timesereis

SQLFIDDLE

Answer 3 (score: -1)

Try a query like this

SELECT count(distinct user_id), to_char(date, 'YYYY-MM-DD') as date_period
FROM uniques
GROUP By date_period