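Summing values across a timeline in SQL

Time: 2014-03-07 15:04:16

Tags: sql postgresql aggregate-functions date-arithmetic window-functions

Question

I have a PostgreSQL database and I am trying to sum up the revenue of a cash register. A cash register can have the status ACTIVE or INACTIVE, but I only want to sum the earnings generated while it was ACTIVE during a given period of time.

I have two tables; one records the revenue and the other records the cash register status: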

CREATE TABLE counters
(
  id bigserial NOT NULL,
  "timestamp" timestamp with time zone,
  total_revenue bigint,
  id_of_machine character varying(50),
  CONSTRAINT counters_pkey PRIMARY KEY (id)
);

CREATE TABLE machine_lifecycle_events
(
  id bigserial NOT NULL,
  event_type character varying(50),
  "timestamp" timestamp with time zone,
  id_of_affected_machine character varying(50),
  CONSTRAINT machine_lifecycle_events_pkey PRIMARY KEY (id)
);

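A counters entry is added every minute, and total_revenue only increases. A machine_lifecycle_events entry is added every time the status of the machine changes.

I have added an image illustrating the problem. It is the revenue during the blue periods that should be summed up.

Timeline showing problem.

What I have tried so far

I have created a query that gives me the total revenue at a given instant: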

SELECT total_revenue 
  FROM counters 
 WHERE timestamp < '2014-03-05 11:00:00' 
       AND id_of_machine='1' 
ORDER BY 
       timestamp desc 
 LIMIT 1

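Questions

  1. How do I calculate the revenue earned between two timestamps?
  2. How do I determine the start and end timestamps of the blue periods when I have to compare the timestamps in machine_lifecycle_events with the input period?
  3. Any ideas on how to solve this problem?

    Update

    Sample data: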

    INSERT INTO counters VALUES
       (1,  '2014-03-01 00:00:00', 100,  '1')
     , (2,  '2014-03-01 12:00:00', 200,  '1')
     , (3,  '2014-03-02 00:00:00', 300,  '1')
     , (4,  '2014-03-02 12:00:00', 400,  '1')
     , (5,  '2014-03-03 00:00:00', 500,  '1')
     , (6,  '2014-03-03 12:00:00', 600,  '1')
     , (7,  '2014-03-04 00:00:00', 700,  '1')
     , (8,  '2014-03-04 12:00:00', 800,  '1')
     , (9,  '2014-03-05 00:00:00', 900,  '1')
     , (10, '2014-03-05 12:00:00', 1000, '1')
     , (11, '2014-03-06 00:00:00', 1100, '1')
     , (12, '2014-03-06 12:00:00', 1200, '1')
     , (13, '2014-03-07 00:00:00', 1300, '1')
     , (14, '2014-03-07 12:00:00', 1400, '1');
    
    INSERT INTO machine_lifecycle_events VALUES
       (1, 'ACTIVE',   '2014-03-01 08:00:00', '1')
     , (2, 'INACTIVE', '2014-03-03 00:00:00', '1')
     , (3, 'ACTIVE',   '2014-03-05 00:00:00', '1')
     , (4, 'INACTIVE', '2014-03-06 12:00:00', '1');
    

    SQL Fiddle with sample data.

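    Example query:
    The revenue between '2014-03-02 08:00:00' and '2014-03-06 08:00:00' is 300: 100 in the first ACTIVE period and 200 in the second ACTIVE period.

3 answers:

Answer 0 (score: 2)

Database design

To make my work easier, I sanitized the database design before tackling the questions: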

CREATE TEMP TABLE counter (
    id            bigserial PRIMARY KEY
  , ts            timestamp NOT NULL
  , total_revenue bigint NOT NULL
  , machine_id    int NOT NULL
);

CREATE TEMP TABLE machine_event (
    id            bigserial PRIMARY KEY
  , ts            timestamp NOT NULL
  , machine_id    int NOT NULL
  , status_active bool NOT NULL
);

Test case in the fiddle.

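Key points

  • Use ts instead of "timestamp". Never use basic type names as column names.
  • Simplified and unified the name to machine_id and made it integer instead of varchar(50).
  • event_type varchar(50) should also be an integer foreign key or an enum. Or even just a boolean if the only states are active / inactive. Simplified to status_active bool.
  • Simplified and sanitized the INSERT statements.

Answers

Assumptions

  • total_revenue only increases (per the question).
  • The borders of the outer time frame are included.
  • Every "next" row per machine in machine_event has the opposite status_active.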
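
The third assumption can be checked before relying on it. The following is only a sketch against the machine_event table defined above; the alias prev_status is introduced here for illustration. It lists every event whose status repeats that of the preceding event for the same machine, so an empty result means the assumption holds.

```sql
-- Sketch: find events that break the "statuses alternate" assumption.
-- prev_status is a hypothetical alias, not part of the original schema.
SELECT machine_id, ts, status_active
FROM  (
   SELECT machine_id, ts, status_active
        , lag(status_active) OVER (PARTITION BY machine_id ORDER BY ts) AS prev_status
   FROM   machine_event
   ) sub
WHERE  status_active = prev_status;  -- first event per machine has NULL prev_status and is skipped
```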
  

1. How do I calculate the revenue earned between two timestamps?

WITH span AS (
   SELECT '2014-03-02 12:00'::timestamp AS s_from  -- start of time range
        , '2014-03-05 11:00'::timestamp AS s_to    -- end of time range
   )
SELECT machine_id, s.s_from, s.s_to
     , max(total_revenue) - min(total_revenue) AS earned
FROM   counter c
     , span s
WHERE  ts BETWEEN s_from AND s_to                  -- borders included!
AND    machine_id =  1
GROUP  BY 1,2,3;
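
A different way to answer the same question, closer to the point-in-time lookup the question already uses, is to subtract the latest reading at or before the start from the latest reading at or before the end. This is a sketch under the same assumptions (the counter table above, total_revenue only increases), not a drop-in replacement: it also counts revenue recorded between the last reading before s_from and s_from itself, and it returns NULL when there is no reading at or before s_from.

```sql
-- Sketch only: difference of two point-in-time lookups on counter.
WITH span AS (
   SELECT '2014-03-02 12:00'::timestamp AS s_from  -- start of time range
        , '2014-03-05 11:00'::timestamp AS s_to    -- end of time range
   )
SELECT s.s_from, s.s_to
     , (SELECT c.total_revenue                     -- latest reading at or before s_to
        FROM   counter c
        WHERE  c.machine_id = 1
        AND    c.ts <= s.s_to
        ORDER  BY c.ts DESC
        LIMIT  1)
     - (SELECT c.total_revenue                     -- latest reading at or before s_from
        FROM   counter c
        WHERE  c.machine_id = 1
        AND    c.ts <= s.s_from
        ORDER  BY c.ts DESC
        LIMIT  1) AS earned
FROM   span s;
```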
  

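2. How do I determine the start and end timestamps of the blue periods when I have to compare the timestamps in machine_event with the input period?

This query works for all machines in the given time frame (span). Add WHERE machine_id = 1 in the CTE cte to select a specific machine.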

WITH span AS (
   SELECT '2014-03-02 08:00'::timestamp AS s_from  -- start of time range
        , '2014-03-06 08:00'::timestamp AS s_to    -- end of time range
   )
, cte AS (
   SELECT machine_id, ts, status_active, s_from
        , lead(ts, 1, s_to) OVER w AS period_end
        , first_value(ts)   OVER w AS first_ts
   FROM   span          s
   JOIN   machine_event e ON e.ts BETWEEN s.s_from AND s.s_to
   WINDOW w AS (PARTITION BY machine_id ORDER BY ts)
   )
SELECT machine_id, ts AS period_start, period_end -- start in time frame
FROM   cte
WHERE  status_active

UNION  ALL                             -- active start before time frame
SELECT machine_id, s_from, ts
FROM   cte
WHERE  NOT status_active
AND    ts =  first_ts
AND    ts <> s_from

UNION  ALL       -- active start before time frame, no end in time frame
SELECT machine_id, s_from, s_to
FROM  (
   SELECT DISTINCT ON (1)
          e.machine_id, e.status_active, s.s_from, s.s_to
   FROM   span          s
   JOIN   machine_event e ON e.ts < s.s_from  -- only from before time range
   LEFT   JOIN cte c USING (machine_id)
   WHERE  c.machine_id IS NULL                -- not in selected time range
   ORDER  BY e.machine_id, e.ts DESC          -- only the latest entry
   ) sub
WHERE  status_active -- only if active
ORDER  BY 1, 2;

The result is the list of blue periods in your image. SQL Fiddle demonstrating both.
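
To get from the list of periods to revenue per period, the periods can be joined back to counter. The sketch below does not reuse the UNION ALL query above; it swaps in a simpler technique, computing every status interval with lead() and clipping the active ones to the time range with greatest() and least(). The CTE names period and clipped are introduced here for illustration, and the same assumptions apply (alternating statuses, total_revenue only increases). Because it uses max minus min of the readings inside each period, revenue earned between the period start and the first reading in the period is not counted.

```sql
-- Sketch only: revenue per active period, clipped to the requested time range.
WITH span AS (
   SELECT '2014-03-02 08:00'::timestamp AS s_from  -- start of time range
        , '2014-03-06 08:00'::timestamp AS s_to    -- end of time range
   )
, period AS (                                      -- every status interval per machine
   SELECT machine_id, status_active
        , ts AS period_start
        , lead(ts, 1, 'infinity'::timestamp)
             OVER (PARTITION BY machine_id ORDER BY ts) AS period_end
   FROM   machine_event
   )
, clipped AS (                                     -- active intervals overlapping the range
   SELECT p.machine_id
        , greatest(p.period_start, s.s_from) AS period_start
        , least(p.period_end, s.s_to)        AS period_end
   FROM   period p
   CROSS  JOIN span s
   WHERE  p.status_active
   AND    p.period_start < s.s_to
   AND    p.period_end   > s.s_from
   )
SELECT cl.machine_id, cl.period_start, cl.period_end
     , max(c.total_revenue) - min(c.total_revenue) AS earned
FROM   clipped cl
JOIN   counter c ON c.machine_id = cl.machine_id
                AND c.ts BETWEEN cl.period_start AND cl.period_end
GROUP  BY 1, 2, 3
ORDER  BY 1, 2;
```

Assuming the sample rows from the question are loaded into counter and machine_event, this should give 100 for the first active period and 200 for the second, in line with the expected result stated in the question.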

Recent similar question:
Sum of time difference between rows

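Answer 1 (score: 0)

OK, I have an answer, but I had to assume that the id of machine_lifecycle_events can be used to determine successor and predecessor. So for my solution to work better, you should have a link between the active and inactive events. There may be other ways to solve it, but those would add even more complexity.

First, to get the revenue for all active periods per machine, you can do the following: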

select c.id_of_machine, cycle_id, cycle_start, cycle_end,
       max(total_revenue) - min(total_revenue) as earned -- total_revenue is a running total, so sum() would overcount
from counters c join (
    select e1.id as cycle_id, 
           e1.timestamp as cycle_start, 
           e2.timestamp as cycle_end,
           e1.id_of_affected_machine as cycle_machine_id
    from machine_lifecycle_events e1 join machine_lifecycle_events e2 
        on e1.id + 1 = e2.id and -- this should be replaced with a specific column to find cycles which belong together
           e1.id_of_affected_machine = e2.id_of_affected_machine
    where e1.event_type = 'ACTIVE'
        ) cycle
    on c.id_of_machine = cycle_machine_id and 
       cycle_start <= c.timestamp and c.timestamp <= cycle_end
group by c.id_of_machine, cycle_id, cycle_start, cycle_end
order by c.id_of_machine, cycle_id

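You can take this query further and add more conditions in the where clause to get the revenue only within a time frame or for specific machines: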

select sum(earned)
from (
    -- total_revenue is a running total, so take the difference per cycle and add those up
    select max(total_revenue) - min(total_revenue) as earned
    from counters c join (
        select e1.id as cycle_id, 
               e1.timestamp as cycle_start, 
               e2.timestamp as cycle_end,
               e1.id_of_affected_machine as cycle_machine_id
        from machine_lifecycle_events e1 join machine_lifecycle_events e2 
            on e1.id + 1 = e2.id and -- this should be replaced with a specific column to find cycles which belong together
               e1.id_of_affected_machine = e2.id_of_affected_machine
        where e1.event_type = 'ACTIVE'
            ) cycle
        on c.id_of_machine = cycle_machine_id and 
           cycle_start <= c.timestamp and c.timestamp <= cycle_end
    where '2014-03-02 08:00:00' <= c.timestamp and c.timestamp <= '2014-03-06 08:00:00'
        and c.id_of_machine = '1'
    group by cycle_id
) per_cycle

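As mentioned at the beginning and in the comments, my way of finding connected events does not work for any more complex example with multiple machines. The easiest fix would be another column that always points to the preceding event. Another way would be a function that finds those events, but that solution could not make use of indexes.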
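
One more option that needs no schema change is to pair events by timestamp order per machine instead of by consecutive id. This is a sketch, not part of the approach above: it uses the lead() window function and assumes that the events of each machine alternate between ACTIVE and INACTIVE.

```sql
-- Sketch: pair each ACTIVE event with the next event of the same machine.
-- Assumes events per machine alternate between ACTIVE and INACTIVE.
select cycle_machine_id, cycle_start, cycle_end
from (
    select id_of_affected_machine as cycle_machine_id,
           event_type,
           "timestamp" as cycle_start,
           lead("timestamp") over (partition by id_of_affected_machine
                                   order by "timestamp") as cycle_end -- null while the machine is still active
    from machine_lifecycle_events
     ) e
where event_type = 'ACTIVE'
```

The result has the same columns as the cycle subquery above (apart from cycle_id), so it could stand in for the self-join on e1.id + 1 = e2.id, as long as the open-ended cycle with a null cycle_end is handled in the join condition.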

Answer 2 (score: 0)

Use a self-join to build a table of intervals together with the actual status of each interval.

with intervals as (
    select e1.timestamp time1, e2.timestamp time2, e1.EVENT_TYPE as status
    from machine_lifecycle_events e1
    left join machine_lifecycle_events e2 on e2.id = e1.id + 1
) select * from counters c
join intervals i on c."timestamp" >= i.time1 and (c."timestamp" <= i.time2 or i.time2 is null) -- last interval is open-ended
    and i.status = 'ACTIVE';

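I did not use aggregation so that the result set stays visible; I think you can add that yourself. I also left out the machine id to keep the demonstration of the pattern simple.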