Question

I have a very strange problem. I have a table with 44 million records that looks like this:

SKU | Timestamp           | Status
A   | 21-09-2016 12:30:00 | 1  
B   | 21-09-2016 12:30:00 | 1  
C   | 21-09-2016 12:30:00 | 1  
D   | 21-09-2016 12:30:00 | 1  
A   | 21-09-2016 12:39:00 | 0  
B   | 21-09-2016 12:40:00 | 0  
C   | 21-09-2016 12:40:00 | 0  
D   | 21-09-2016 12:45:00 | 0  
A   | 21-09-2016 12:52:00 | 1  
A   | 21-09-2016 12:56:00 | 1  
A   | 21-09-2016 12:58:00 | 1  
B   | 21-09-2016 12:59:00 | 1  
A   | 21-09-2016 21:30:00 | 0

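The requirement is that we should only consider records where the status changes. For example, in the table above, SKU A starts with status 1 at 21-09-2016 12:30:00. We then look at the later records to see when the status changes, and we see the next change at 21-09-2016 12:39:00, when the status becomes 0. We need a table with the following output: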

SKU | Timestamp           | Status
A   | 21-09-2016 12:30:00 | 1  
A   | 21-09-2016 12:39:00 | 0  
A   | 21-09-2016 12:52:00 | 1  
A   | 21-09-2016 21:30:00 | 0  
B   | 21-09-2016 12:30:00 | 1  
B   | 21-09-2016 12:40:00 | 0  
B   | 21-09-2016 12:59:00 | 1  
C   | 21-09-2016 12:30:00 | 1  
C   | 21-09-2016 12:40:00 | 0  
D   | 21-09-2016 12:30:00 | 1  
D   | 21-09-2016 12:45:00 | 0

Answer 1

select sku, timestamp, status
from (
    select *, lag(status) over (partition by sku order by timestamp) as prev_status
    from example
    ) s
where prev_status is distinct from status;
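
To try the query on the question's sample rows, a minimal reproduction might look like the following. This is a sketch under assumptions: the table is created as a temporary table named example (the name used in the query above), the column types are guesses, and the timestamps are written in ISO format so they parse regardless of the DateStyle setting.

```sql
-- Assumed schema: the column types are guesses, not taken from the question.
create temp table example (
    sku       text,
    timestamp timestamp,
    status    smallint
);

-- The 13 sample rows from the question.
insert into example (sku, timestamp, status) values
    ('A', '2016-09-21 12:30:00', 1),
    ('B', '2016-09-21 12:30:00', 1),
    ('C', '2016-09-21 12:30:00', 1),
    ('D', '2016-09-21 12:30:00', 1),
    ('A', '2016-09-21 12:39:00', 0),
    ('B', '2016-09-21 12:40:00', 0),
    ('C', '2016-09-21 12:40:00', 0),
    ('D', '2016-09-21 12:45:00', 0),
    ('A', '2016-09-21 12:52:00', 1),
    ('A', '2016-09-21 12:56:00', 1),
    ('A', '2016-09-21 12:58:00', 1),
    ('B', '2016-09-21 12:59:00', 1),
    ('A', '2016-09-21 21:30:00', 0);

-- Keep a row only when its status differs from the previous row of the same SKU.
-- The first row of each SKU has prev_status = NULL and is therefore kept.
select sku, timestamp, status
from (
    select *, lag(status) over (partition by sku order by timestamp) as prev_status
    from example
    ) s
where prev_status is distinct from status
order by sku, timestamp;
```

On these 13 rows the query should return the 11 rows listed in the question's expected output; the repeated status 1 rows for A at 12:56:00 and 12:58:00 are dropped.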


Answer 2

I think you want lag():

select t.*
from (select t.*,
             lag(status) over (partition by sku order by timestamp) as prev_status
      from t
     ) t
where prev_status is distinct from status;

Note: is distinct from is very similar to <>, but it handles NULL values more intuitively.
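
The difference shows up on the first row of each SKU, where lag() returns NULL because there is no previous row. A quick check, with literal values standing in for prev_status and status (the column aliases are made up for illustration):

```sql
select
    null::int <> 1                       as ne_with_null,        -- NULL, so a WHERE clause would drop the row
    null::int is distinct from 1         as distinct_with_null,  -- true, so the row is kept
    1 is distinct from 1                 as distinct_same,       -- false
    null::int is distinct from null::int as distinct_both_null;  -- false
```

With <> in the WHERE clause, the first row of every SKU would be filtered out, because a NULL comparison result is not treated as true.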

Answer 3

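In addition to klin's and Gordon's answers, and to answer

how much time should we expect for this 44 million record table

It depends heavily on the RAM available to PostgreSQL, because the result of the subquery has to be stored somewhere (and then scanned again).

If there is enough RAM to hold the intermediate result, everything is fine; if not, you are in trouble.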
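
One setting that is likely to matter here is work_mem, which limits how much memory a single sort operation can use before PostgreSQL spills to temporary files on disk. The sketch below is an assumption about how one might check this, not something taken from this answer's test: the 256MB value is purely illustrative, and the table name example is borrowed from the first answer.

```sql
-- Show the current per-operation memory limit.
show work_mem;

-- Session-local change; the value is illustrative and should fit the server's RAM.
set work_mem = '256MB';

-- If the plan contains a Sort node, check its "Sort Method" line:
-- "quicksort  Memory: ..." means the sort fit in memory,
-- "external merge  Disk: ..." means it spilled to disk.
-- (If an index already supplies the order, there may be no Sort node at all.)
explain (analyze, buffers)
select sku, timestamp, status
from (
    select *, lag(status) over (partition by sku order by timestamp) as prev_status
    from example
    ) s
where prev_status is distinct from status;

-- Restore the default for the rest of the session.
reset work_mem;
```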

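For example, in my test on a table with 10,000,000 rows, I cancelled the plain query after waiting more than 15 minutes.

Alternatively, using a stored function, it completed in about 4 minutes, which is not much more than a simple ordered select (about 2 minutes).

Here is my test: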

-- Create data

--drop function if exists foo();
--drop table if exists test;
create table test (i bigserial primary key, sku char(1), ts timestamp, status smallint);

insert into test (sku, ts, status) 
  select
    chr(ascii('A') + (random()*3)::int),
    now()::date + ((random()*100)::int || ' minutes')::interval,
    (random()::int)
  from generate_series(1,10000000);

create index idx on test(sku, ts);

analyse test;

-- And function

create or replace function foo() returns setof test language plpgsql as $$
declare 
  r test;
  p test;
begin
  for r in select * from test order by sku, ts loop
    if p.status is distinct from r.status or p.sku is distinct from r.sku then
      return next r;
    end if;
    p := r;
  end loop;
  return;
end $$;

-- Test queries

explain (analyse, verbose) 
select i, sku, ts, status
from (
    select *, lag(status) over (partition by sku order by ts) as prev_status
    from test
    ) s
where prev_status is distinct from status;
-- Not completed, still working after ~ 15 min

explain analyse select * from test order by sku, ts;
-- Complete in ~2 min

explain (analyse, verbose) select * from foo();
-- Complete in ~3:30 min

Filter only value changes in a Postgres table
