Using Postgres 9.5. Test data:
create temp table rental (
customer_id smallint
,rental_date timestamp without time zone
,customer_name text
);
insert into rental values
(1, '2006-05-01', 'james'),
(1, '2006-06-01', 'james'),
(1, '2006-07-01', 'james'),
(1, '2006-07-02', 'james'),
(2, '2006-05-02', 'jacinta'),
(2, '2006-05-03', 'jacinta'),
(3, '2006-05-04', 'juliet'),
(3, '2006-07-01', 'juliet'),
(4, '2006-05-03', 'julia'),
(4, '2006-06-01', 'julia'),
(5, '2006-05-05', 'john'),
(5, '2006-06-01', 'john'),
(5, '2006-07-01', 'john'),
(6, '2006-07-01', 'jacob'),
(7, '2006-07-02', 'jasmine'),
(7, '2006-07-04', 'jasmine');
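I'm trying to understand the behaviour of existing customers. I want to answer this question:
How likely is a customer to order again, based on when their last order was (current month, previous month (m-1) ... back to m-12)?
The likelihood is calculated as: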
distinct count of people who ordered in current month /
distinct count of people in their cohort.
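So I need to generate a table that lists, for each cohort, how many of its members ordered in the current month.
So what are the rules for being in a cohort?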
- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc
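I'm using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/
Here is an example of the cohort rules and aggregation, with July as the "month's orders being analysed"
(note: the "month's orders being analysed" column is the first column in the "desired output" table below):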
customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james | 1 1 | 1 | 1 | <- member of jul cohort, made order in jul
jasmine | 1 1 | | | <- member of jul cohort, made order in jul
jacob | 1 | | | <- member of jul cohort, did NOT make order in jul
john | 1 | 1 | 1 | <- member of jun cohort, made order in jul
julia | | 1 | 1 | <- member of jun cohort, did NOT make order in jul
juliet | 1 | | 1 | <- member of may cohort, made order in jul
jacinta | | | 1 1 | <- member of may cohort, did NOT make order in jul
This data would output the following table:
--where m = month's orders being analysed
month's orders |how many people |how many people from |how many people |how many people from |how many people |how many people from |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16 |5 |1 | | | | |
jun-16 | | |5 |3 | | |
jul-16 |3 |2 |2 |1 |2 |1 |
My attempts so far have used:
generate_series()
and
row_number() over (partition by customer_id order by rental_id desc)
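I haven't been able to put it all together yet (I've been trying for hours and still haven't solved it).
For readability, I think it's better to post only the relevant parts of my work (if anyone wants me to post the full SQL query, please comment and I'll add it).
Series query: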
(select
generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
from
rental) as series
Ranking query:
(select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked
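I want to do something like this: run the orders_ranked query for each row returned by the series query, then do a basic aggregation on each result of orders_ranked.
Something like: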
(--this query counts the customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,
(--this query counts the customers in cohort m-1 who ordered in month m
select
count(distinct customer_id)
from
(--this query returns the orders by customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
(select
generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
from
rental) as series
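For what it's worth, the "run orders_ranked once per row of series" idea could probably be written as a LATERAL join (available since Postgres 9.3). This is only a rough sketch of that one piece, not a solution to the cohort counts. It orders by rental_date because the test table above has no rental_id column, and the aliases s and r are just placeholders:

```sql
-- Sketch: for every month in the series, rank each customer's orders
-- placed up to and including that month (rnk = 1 is the most recent).
SELECT s.month_being_analysed
     , r.customer_id
     , r.rental_date
     , r.rnk
FROM  (
   SELECT generate_series(date_trunc('month', min(rental_date))
                        , date_trunc('month', max(rental_date))
                        , interval '1 month') AS month_being_analysed
   FROM   rental
   ) s
CROSS  JOIN LATERAL (
   SELECT customer_id
        , rental_date
        , row_number() OVER (PARTITION BY customer_id
                             ORDER BY rental_date DESC) AS rnk
   FROM   rental
   WHERE  date_trunc('month', rental_date) <= s.month_being_analysed
   ) r
ORDER  BY s.month_being_analysed, r.customer_id, r.rnk;
```

The subquery after LATERAL may reference s.month_being_analysed, which is what a plain subquery in the FROM list cannot do.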
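Answer 0 (score: 2)
This query does it all. It operates on the whole table and works for any time frame.
It is based on some assumptions and assumes the current Postgres version, 9.5. It should work with at least pg 9.1. Since your definition of a "cohort" is unclear to me, I skipped the "how many people are in cohort" columns.
I would expect it to be faster than anything you have tried so far, by orders of magnitude.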
SELECT *
FROM crosstab (
$$
SELECT mon
, sum(count(*)) OVER (PARTITION BY mon)::int AS m0
, gap -- count of months since last order
, count(*) AS gap_ct
FROM (
SELECT mon
, mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
FROM (
SELECT DISTINCT ON (1,2)
date_trunc('month', rental_date)::date AS mon
, customer_id AS c_id
, extract(YEAR FROM rental_date)::int * 12
+ extract(MONTH FROM rental_date)::int AS mon_int
FROM rental
) dist_customer
) gap_to_last_month
GROUP BY mon, gap
ORDER BY mon, gap
$$
, 'SELECT generate_series(1,12)'
) ct (mon date, m0 int
, m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
, m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);
Result:
    mon     | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
 2015-01-01 | 63 |  36 |  15 |   5 |   3 |   3 |     |     |     |     |     |     |
 2015-02-01 | 56 |  35 |   9 |   9 |   2 |     |   1 |     |     |     |     |     |
...
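m0 .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (nothing in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before, with no order in between
etc.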
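In subquery dist_customer, reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON.
To simplify later calculations, add a count of months for the date (mon_int).
If there are many orders per (month, customer), there are faster query techniques for the first step.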
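In subquery gap_to_last_month, add the column gap, indicating the time gap between this month and the last month with any orders by the same customer. The window function lag() is used for this.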
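To see what the two inner subqueries produce, they can be run on their own. The following is just those two levels lifted out of the query above, with c_id added to the output and an ORDER BY for readability (both are additions for illustration only):

```sql
-- One row per (customer, month) plus the gap in months to that
-- customer's previous month with any order.
SELECT c_id
     , mon
     , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
FROM  (
   SELECT DISTINCT ON (1,2)
          date_trunc('month', rental_date)::date AS mon
        , customer_id                            AS c_id
        , extract(YEAR  FROM rental_date)::int * 12
        + extract(MONTH FROM rental_date)::int   AS mon_int
   FROM   rental
   ) dist_customer
ORDER  BY c_id, mon;
```

With the test data from the question, the first month of every customer gets gap NULL (there is no previous row for lag() to look at), and customer 3, who ordered in May and July only, gets gap 2 on the July row.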
In the outer SELECT, aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for this month, m0.
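The aggregate can also be checked before it is pivoted. This is the dollar-quoted source query from above, run directly without crosstab():

```sql
-- Counts per (month, gap), plus the per-month total m0 computed as a
-- window function over the aggregate.
SELECT mon
     , sum(count(*)) OVER (PARTITION BY mon)::int AS m0
     , gap
     , count(*) AS gap_ct
FROM  (
   SELECT mon
        , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
   FROM  (
      SELECT DISTINCT ON (1,2)
             date_trunc('month', rental_date)::date AS mon
           , customer_id                            AS c_id
           , extract(YEAR  FROM rental_date)::int * 12
           + extract(MONTH FROM rental_date)::int   AS mon_int
      FROM   rental
      ) dist_customer
   ) gap_to_last_month
GROUP  BY mon, gap
ORDER  BY mon, gap;
```

Against the test data in the question I would expect five rows: May with gap NULL (5 customers), June with gap 1 (3 customers), and July with gap 1 (2), gap 2 (1) and gap NULL (2), with m0 being 5, 3 and 5 respectively. Rows with gap NULL are first-time customers of that month. They count towards m0 but match none of the categories 1 to 12, so they fill no m01 .. m12 column after pivoting.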
Feed this query to crosstab() to pivot the result into the desired tabular form.
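Note that crosstab() is not built in. It is provided by the additional module tablefunc, which has to be installed once per database. A minimal sketch of the two-parameter form used here, with made-up values and made-up column names (row_name, c1, c2, c3) purely for illustration:

```sql
-- Install the module that provides crosstab() (once per database).
CREATE EXTENSION IF NOT EXISTS tablefunc;

-- Source rows are (row_name, category, value); the second parameter
-- lists all categories, one output column each, in that order.
SELECT *
FROM   crosstab(
   $$VALUES ('a', 1, 10), ('a', 2, 20), ('b', 2, 30)$$
 , 'SELECT generate_series(1,3)'
   ) AS ct (row_name text, c1 int, c2 int, c3 int);
```

This should return one row for 'a' with c1 = 10 and c2 = 20, and one row for 'b' with only c2 = 30. Categories without a matching source row stay NULL, which is why the sparse months in the result above have empty cells. In the main query, m0 sits between the row name and the category column, so it is passed through as an "extra" column.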