Rolling counts based on rolling cohorts

Time: 2016-07-16 15:35:41

Tags: sql postgresql crosstab window-functions generate-series

Using Postgres 9.5. Test data:

create temp table rental (
    customer_id smallint
    ,rental_date timestamp without time zone
    ,customer_name text
);

insert into rental values
    (1, '2006-05-01', 'james'),
    (1, '2006-06-01', 'james'),
    (1, '2006-07-01', 'james'),
    (1, '2006-07-02', 'james'),
    (2, '2006-05-02', 'jacinta'),
    (2, '2006-05-03', 'jacinta'),
    (3, '2006-05-04', 'juliet'),
    (3, '2006-07-01', 'juliet'),
    (4, '2006-05-03', 'julia'),
    (4, '2006-06-01', 'julia'),
    (5, '2006-05-05', 'john'),
    (5, '2006-06-01', 'john'),
    (5, '2006-07-01', 'john'),
    (6, '2006-07-01', 'jacob'),
    (7, '2006-07-02', 'jasmine'),
    (7, '2006-07-04', 'jasmine');

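I am trying to understand the behaviour of existing customers. I would like to answer this question:

What is the likelihood that a customer orders again, depending on when their last order was (current month, previous month (m-1) ... back to m-12)?

The likelihood is calculated as: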

distinct count of people who ordered in current month /
distinct count of people in their cohort.

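So I need to generate a table that lists, for each cohort, the count of people in that cohort who ordered in the current month.

What, then, are the rules for being in a cohort?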

- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc

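I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/

Here is an example of the cohort rules and aggregations, with July as the "month's orders being analysed" (note: the "month's orders being analysed" column is the first column of the desired output table below):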

customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james       | 1  1  | 1     | 1     | <- member of jul cohort, made order in jul
jasmine     | 1  1  |       |       | <- member of jul cohort, made order in jul
jacob       | 1     |       |       | <- member of jul cohort, did NOT make order in jul
john        | 1     | 1     | 1     | <- member of jun cohort, made order in jul
julia       |       | 1     | 1     | <- member of jun cohort, did NOT make order in jul
juliet      | 1     |       | 1     | <- member of may cohort, made order in jul
jacinta     |       |       | 1 1   | <- member of may cohort, did NOT make order in jul

This data would produce the following table:

--where m = month's orders being analysed

month's orders |how many people |how many people from  |how many people   |how many people from    |how many people   |how many people from    |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16         |5               |1                     |                  |                        |                  |                        |
jun-16         |                |                      |5                 |3                       |                  |                        |
jul-16         |3               |2                     |2                 |1                       |2                 |1                       |
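
Applying the likelihood definition to the jul-16 row above: cohort m has 3 members, of whom 2 ordered in July. A small sketch of that arithmetic in Postgres, with the numbers hard-coded for illustration. The cast matters, because dividing one integer by another (as two count() results would be) truncates:

```sql
SELECT 2 / 3                    AS truncated    -- integer division: 0
     , round(2::numeric / 3, 2) AS likelihood;  -- 0.67
```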

My attempts so far have been along the lines of:

generate_series()

row_number() over (partition by customer_id order by rental_id desc)

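I have not been able to bring it all together (I have been trying for many hours without solving it).

For readability, I think it is better to post my work in parts (if anyone wants me to post the SQL query in full, please comment and I will add it).

Series query: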

(select
    generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),interval '1 month') as month_being_analysed
from
    rental) as series

Rank query:

(select
    *,
    row_number() over (partition by customer_id order by rental_id desc) as rnk
from
    rental
where
    date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked

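I want to do something like this: run the orders_ranked query for every row returned by the series query, then do basic aggregations on each result of orders_ranked.

Something like: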

(--this query counts the customers in cohort m-1
select
    count(distinct customer_id)
from
    (--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
    select
        *,
        row_number() over (partition by customer_id order by rental_id desc) as rnk
    from
        rental
    where
        date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
    (rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
    OR
    (rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,


(--this query counts the customers in cohort m-1 who ordered in month m
select
    count(distinct customer_id)
from
    (--this query returns the orders by customers in cohort m-1
    select
        count(distinct customer_id)
    from
        (--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
        select
            *,
            row_number() over (partition by customer_id order by rental_id desc) as rnk
        from
            rental
        where
            date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
    where
        (rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
        OR
        (rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
    rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
    (select
        generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),interval '1 month') as month_being_analysed
    from
        rental) as series
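
The idea described above, running the ranking query once per row of the series, corresponds to a LATERAL join (available since Postgres 9.3), which lets a subquery refer to columns of an earlier FROM item. A minimal sketch under two assumptions: orders are ranked by rental_date, because the test table has no rental_id column, and the aliases s and o are made up for illustration. It only produces the ranked rows per analysed month and does not implement the cohort rules:

```sql
SELECT s.month_being_analysed, o.customer_id, o.rental_date, o.rnk
FROM  (
   -- one row per calendar month between the first and the last order
   SELECT generate_series(date_trunc('month', min(rental_date))
                        , date_trunc('month', max(rental_date))
                        , interval '1 month') AS month_being_analysed
   FROM   rental
   ) s
CROSS  JOIN LATERAL (
   -- evaluated once per row of s: orders up to and including that month,
   -- ranked per customer from most recent (rnk = 1) backwards
   SELECT r.customer_id, r.rental_date
        , row_number() OVER (PARTITION BY r.customer_id
                             ORDER BY r.rental_date DESC) AS rnk
   FROM   rental r
   WHERE  date_trunc('month', r.rental_date) <= s.month_being_analysed
   ) o
ORDER  BY 1, 2, 4;
```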

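1 answer:

Answer 0 (score: 2)

This query does it all. It operates on the whole table and works for any time range.

It is based on some assumptions, including the current Postgres version 9.5. It should work with at least pg 9.1. Since your definition of "cohort" is unclear to me, I skipped the "how many people are in cohort" columns.

I would expect it to be faster than anything you have tried so far. By orders of magnitude.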
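
crosstab() is not part of core Postgres. It is provided by the additional module tablefunc, which has to be installed once per database before the query below will run. Assuming sufficient privileges:

```sql
CREATE EXTENSION IF NOT EXISTS tablefunc;
```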

SELECT *
FROM   crosstab (
   $$
   SELECT mon
        , sum(count(*)) OVER (PARTITION BY mon)::int AS m0
        , gap   -- count of months since last order
        , count(*) AS gap_ct
   FROM  (
      SELECT mon
           , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
      FROM  (
         SELECT DISTINCT ON (1,2)
                date_trunc('month', rental_date)::date AS mon
              , customer_id                            AS c_id
              , extract(YEAR  FROM rental_date)::int * 12
              + extract(MONTH FROM rental_date)::int   AS mon_int
         FROM   rental
         ) dist_customer
      ) gap_to_last_month
   GROUP  BY mon, gap
   ORDER  BY mon, gap
   $$
 , 'SELECT generate_series(1,12)'
   ) ct (mon date, m0 int
       , m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
       , m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);

Result:

    mon     | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
 2015-01-01 | 63 |  36 |  15 |   5 |   3 |   3 |     |     |     |     |     |     |
 2015-02-01 | 56 |  35 |   9 |   9 |   2 |     |   1 |     |     |     |     |     |
...

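m0  .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (nothing in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before, with no order in between
etc.

How?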

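  1. In subquery dist_customer, reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON.

    To simplify later calculations, also add a running count of months for the date (mon_int).

    If there are many orders per (month, customer), faster query techniques exist for this first step.

  2. In subquery gap_to_last_month, add the column gap: the number of months between this month and the last earlier month with any order by the same customer. This uses the window function lag().

  3. In the outer SELECT, aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for the month as m0.

  4. Feed this query to crosstab() to pivot the result into the desired tabular form. m0 is carried through as an "extra" column.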
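
To see what steps 1 to 3 produce before pivoting, the inner query can be run on its own. A sketch against the question's rental test data (not the data behind the result shown above); the rows in the trailing comment were worked out by hand from those 16 test rows:

```sql
SELECT mon
     , sum(count(*)) OVER (PARTITION BY mon)::int AS m0
     , gap   -- months since the same customer's previous order month
     , count(*) AS gap_ct
FROM  (
   SELECT mon
        , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
   FROM  (
      SELECT DISTINCT ON (1,2)
             date_trunc('month', rental_date)::date AS mon
           , customer_id                            AS c_id
           , extract(YEAR  FROM rental_date)::int * 12
           + extract(MONTH FROM rental_date)::int   AS mon_int
      FROM   rental
      ) dist_customer
   ) gap_to_last_month
GROUP  BY mon, gap
ORDER  BY mon, gap;

--     mon     | m0 | gap | gap_ct
-- ------------+----+-----+--------
--  2006-05-01 |  5 |     |      5   -- first order month: no predecessor, gap is NULL
--  2006-06-01 |  3 |   1 |      3   -- customers 1, 4, 5 also ordered in May
--  2006-07-01 |  5 |   1 |      2   -- customers 1, 5 also ordered in June
--  2006-07-01 |  5 |   2 |      1   -- customer 3: last order month was May
--  2006-07-01 |  5 |     |      2   -- customers 6, 7: first order month
```

Rows with a NULL gap are customers ordering for the first time; they count towards m0 but have no matching category among 1 to 12.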