Question

I have a table with the following data (PayPal transactions):

    txn_type    |            date            |   subscription_id
----------------+----------------------------+---------------------
 subscr_signup  | 2014-01-01 07:53:20        | S-XXX01
 subscr_signup  | 2014-01-05 10:37:26        | S-XXX02
 subscr_signup  | 2014-01-08 08:54:00        | S-XXX03
 subscr_eot     | 2014-03-01 08:53:57        | S-XXX01
 subscr_eot     | 2014-03-05 08:58:02        | S-XXX02

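I'd like to get the overall average subscription length for a given time period (subscr_eot marks the end of a subscription). If a subscription is still ongoing ('S-XXX03'), I want it included, counted from its start date until now. How would I go about doing this with an SQL statement in Postgres?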
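To try the queries in the answers, the sample data can be loaded into a scratch table. This is only a sketch: the table name t matches Answers 1 and 3, and the column types (text and timestamp) are assumptions, since the question does not state them. Answer 2 calls the table paypal_transactions and Answer 3 renames the date column to ts, so adjust the names accordingly.

```sql
-- Hypothetical setup: table name and column types are assumptions.
CREATE TABLE t (
    txn_type        text      NOT NULL,
    date            timestamp NOT NULL,  -- column name as shown in the question
    subscription_id text      NOT NULL
);

-- Sample rows copied from the question.
INSERT INTO t (txn_type, date, subscription_id) VALUES
    ('subscr_signup', '2014-01-01 07:53:20', 'S-XXX01'),
    ('subscr_signup', '2014-01-05 10:37:26', 'S-XXX02'),
    ('subscr_signup', '2014-01-08 08:54:00', 'S-XXX03'),
    ('subscr_eot',    '2014-03-01 08:53:57', 'S-XXX01'),
    ('subscr_eot',    '2014-03-05 08:58:02', 'S-XXX02');
```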

Answer 1

SQL Fiddle. The subscription length for each subscription:

select
    subscription_id,
    coalesce(t2.date, current_timestamp) - t1.date as subscription_length
from
    (
        select *
        from t
        where txn_type = 'subscr_signup'
    ) t1
    left join
    (
        select *
        from t
        where txn_type = 'subscr_eot'
    ) t2 using (subscription_id)
order by t1.subscription_id
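As a quick check of the interval arithmetic, the two finished subscriptions from the question can be computed from literals. This is only an illustration using the sample timestamps; the still-running subscription is left out because its length depends on when the query is run.

```sql
-- timestamp - timestamp yields an interval in Postgres.
SELECT timestamp '2014-03-01 08:53:57' - timestamp '2014-01-01 07:53:20' AS s_xxx01,  -- 59 days 01:00:37
       timestamp '2014-03-05 08:58:02' - timestamp '2014-01-05 10:37:26' AS s_xxx02;  -- 58 days 22:20:36
```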

The average:

select
    avg(coalesce(t2.date, current_timestamp) - t1.date) as subscription_length_avg
from
    (
        select *
        from t
        where txn_type = 'subscr_signup'
    ) t1
    left join
    (
        select *
        from t
        where txn_type = 'subscr_eot'
    ) t2 using (subscription_id)

Answer 2

I used a couple of common table expressions; you can easily take them apart to see what each one does.

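One reason this SQL is complicated is that you're storing column names as data. (subscr_signup and subscr_eot are really column names, not data.) This is an SQL anti-pattern; expect it to cause you a lot of pain.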

with subscription_dates as (
  select 
      p1.subscription_id, 
      -- cast to date so the day arithmetic below also works on timestamp columns
      p1.date::date as subscr_start,
      coalesce((select min(p2.date)::date
                from paypal_transactions p2
                where p2.subscription_id = p1.subscription_id
                  and p2.txn_type = 'subscr_eot'
                  and p2.date > p1.date), current_date) as subscr_end
  from paypal_transactions p1
  where txn_type = 'subscr_signup'
), subscription_days as (
  -- date - date yields an integer number of days; + 1 counts both the first and the last day
  select subscription_id, subscr_start, subscr_end, (subscr_end - subscr_start) + 1 as subscr_days
  from subscription_dates 
)
select avg(subscr_days) as avg_days
from subscription_days
-- add your date range here.

avg_days
--
75.6666666666666667
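On the point about storing column names as data: the answer does not spell out an alternative design, so the following is only one possible reading. The table name subscriptions and its column types are hypothetical. With one row per subscription, the average no longer needs a self-join or a subquery.

```sql
-- Hypothetical alternative design: one row per subscription.
CREATE TABLE subscriptions (
    subscription_id text      PRIMARY KEY,
    subscr_signup   timestamp NOT NULL,
    subscr_eot      timestamp            -- NULL while the subscription is still running
);

-- Running subscriptions are counted up to the current time.
SELECT avg(coalesce(subscr_eot, localtimestamp) - subscr_signup) AS avg_length
FROM   subscriptions;
```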

I didn't add your date range as a WHERE clause, because it's not clear to me what you mean by "a given time period".
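A sketch of what such a filter might look like, under one possible interpretation: "given time period" is taken to mean subscriptions that signed up within a half-open range. Both the interpretation and the bounds shown are assumptions, not something the question confirms. Unlike the query above, this version returns an interval rather than a day count.

```sql
-- Assumption: only subscriptions that started in January 2014 are counted.
select avg(coalesce(e.date, localtimestamp) - s.date) as avg_length
from paypal_transactions s
left join paypal_transactions e
       on e.subscription_id = s.subscription_id
      and e.txn_type = 'subscr_eot'
where s.txn_type = 'subscr_signup'
  and s.date >= timestamp '2014-01-01'   -- placeholder start of the period
  and s.date <  timestamp '2014-02-01';  -- placeholder end of the period (exclusive)
```

Other readings, such as subscriptions that were active at any point during the period, would need a different WHERE clause.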

Answer 3

Using the window function lag(), this gets quite short:

SELECT avg(ts_end - ts) AS avg_subscr
FROM  (
   SELECT txn_type, ts, lag(ts, 1, localtimestamp)
                OVER (PARTITION BY subscription_id ORDER BY txn_type) AS ts_end
   FROM  t
   ) sub
WHERE txn_type = 'subscr_signup';

SQL Fiddle.

lag() conveniently takes a default value for missing rows. That is exactly what we need here, so we don't need COALESCE.

The query builds on the fact that subscr_eot sorts before subscr_signup.
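The sort-order assumption can be checked directly, and it can also be made explicit in the window definition. The second query is a sketch of a variant, not part of the original answer: it orders by a boolean expression, where false sorts before true, so the subscr_eot row still comes first. Both forms assume each subscription has at most these two transaction types.

```sql
-- Check the assumption: 'subscr_eot' sorts before 'subscr_signup'.
SELECT 'subscr_eot'::text < 'subscr_signup'::text AS eot_sorts_first;

-- Variant stating the intended order explicitly instead of relying on
-- alphabetical order: false sorts before true, so subscr_eot comes first.
SELECT avg(ts_end - ts) AS avg_subscr
FROM  (
   SELECT txn_type, ts, lag(ts, 1, localtimestamp)
                OVER (PARTITION BY subscription_id
                      ORDER BY (txn_type = 'subscr_signup')) AS ts_end
   FROM  t
   ) sub
WHERE  txn_type = 'subscr_signup';
```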

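Probably faster than the alternatives presented so far, since it needs only a single sequential scan, even though the window function adds some cost.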

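I use the column name ts instead of date for three reasons:

Your "date" is actually a timestamp.
"date" is a reserved word in standard SQL (even though it is allowed in Postgres).
Never use basic type names as identifiers.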

I use localtimestamp instead of now() or current_timestamp, since you are obviously working with timestamp [without time zone].
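The type difference behind this choice can be inspected with pg_typeof(). A small illustration:

```sql
-- localtimestamp matches a timestamp [without time zone] column;
-- now() and current_timestamp carry a time zone.
SELECT pg_typeof(localtimestamp)    AS local_ts,    -- timestamp without time zone
       pg_typeof(now())             AS now_ts,      -- timestamp with time zone
       pg_typeof(current_timestamp) AS current_ts;  -- timestamp with time zone
```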

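Also, your columns txn_type and subscription_id shouldn't be text. Perhaps a more compact type (an enum, for example) for txn_type and integer for subscription_id. That would make the table and its indexes smaller and faster.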
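One way to read this advice, assuming an enum is the kind of compact type meant for txn_type. The type name txn_type_enum is hypothetical, and the definition is written as a replacement for the table t used in the query above. Note that enum values sort in declaration order, so subscr_eot is declared first to keep the ordering the lag() query relies on.

```sql
-- Hypothetical enum type; declaration order defines the sort order.
CREATE TYPE txn_type_enum AS ENUM ('subscr_eot', 'subscr_signup');

CREATE TABLE t (
    txn_type        txn_type_enum NOT NULL,
    ts              timestamp     NOT NULL,
    subscription_id integer       NOT NULL  -- assumes IDs can be mapped to integers
);
```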

For the query at hand, the whole table has to be read, so an index will not help, except for a covering index in Postgres 9.2+, if you need the read performance.
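A sketch of what such a covering index might look like for the lag() query. The index name and the column order are assumptions; the idea is that the index contains every column the query reads, so Postgres 9.2 or later can answer it with an index-only scan.

```sql
-- Hypothetical covering index: includes all columns the query touches.
CREATE INDEX t_subscr_covering_idx ON t (subscription_id, txn_type, ts);
```

Index-only scans depend on the visibility map being reasonably up to date, so the benefit tends to show on tables that are vacuumed regularly.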

Get the average interval between pairs of rows in a table
