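Detect and remove gaps in time series

Asked: 2014-02-24 09:36:29

Tags: sql postgresql gaps-and-islands

My dataset contains daily time series for different companies, and I work with PostgreSQL. My goal is to exclude companies whose time series are too incomplete. Therefore, I want to exclude all companies that have 3 or more consecutive missing values. In addition, I want to exclude all companies for which more than 50% of the values between their first and last dates in the dataset are missing.

We can use the following example data: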

date             company    value
2012-01-01       A          5
2012-01-01       B          2
2012-01-02       A          NULL
2012-01-02       B          2
2012-01-02       C          4
2012-01-03       A          NULL
2012-01-03       B          NULL
2012-01-03       C          NULL
2012-01-04       A          NULL
2012-01-04       B          NULL
2012-01-04       C          NULL
2012-01-05       A          8
2012-01-05       B          9
2012-01-05       C          3
2012-01-06       A          8
2012-01-06       B          9
2012-01-06       C          NULL

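So A must be excluded because it has a gap of three consecutive missing values, and C must be excluded because more than 50% of its values between its first and last date are missing (3 of 5).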
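
To try the queries in this thread, the sample data above can be loaded as follows. This is only a sketch: the column types (a DATE column, a TEXT company and an INTEGER value) are assumptions, since the question does not state them.

```sql
-- Assumed schema; the question does not give the actual column types.
CREATE TABLE mytable (
    date    date,
    company text,
    value   integer
);

-- The 17 sample rows from the question; missing observations are stored as NULL.
INSERT INTO mytable (date, company, value) VALUES
    ('2012-01-01', 'A', 5),    ('2012-01-01', 'B', 2),
    ('2012-01-02', 'A', NULL), ('2012-01-02', 'B', 2),    ('2012-01-02', 'C', 4),
    ('2012-01-03', 'A', NULL), ('2012-01-03', 'B', NULL), ('2012-01-03', 'C', NULL),
    ('2012-01-04', 'A', NULL), ('2012-01-04', 'B', NULL), ('2012-01-04', 'C', NULL),
    ('2012-01-05', 'A', 8),    ('2012-01-05', 'B', 9),    ('2012-01-05', 'C', 3),
    ('2012-01-06', 'A', 8),    ('2012-01-06', 'B', 9),    ('2012-01-06', 'C', NULL);
```

Note that every company has one row per day between its first and last date, with missing observations stored as NULL rather than as absent rows.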

Combining other answers from this forum, I put together the following code:

  1. Add an auto-increment primary key to identify each row

    CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date; 
    CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE; 
    ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL('id_seq'); 
    UPDATE test SET id = NEXTVAL('id_seq');
    
    ALTER TABLE test ADD PRIMARY KEY (id);
    
  2. Detect the gaps in the time series

    CREATE TABLE to_del AS WITH count3 AS 
    ( SELECT *, 
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company ORDER BY id 
               ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) 
      AS cnt FROM test) 
    SELECT company, id FROM count3 WHERE cnt >= 3;
    
  3. Delete the companies with gaps from mytable

    DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
    
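This seems to work for detecting and deleting gaps of 3 or more consecutive missing values, but the approach is very cumbersome, and I cannot figure out how to also exclude all companies with more than 50% missing values.

Can you think of a more efficient solution than mine (I am just learning PostgreSQL) that also manages to exclude companies with more than 50% missing values?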

2 Answers:

Answer 0: (score: 2)

I would create just one query:

DELETE FROM mytable 
WHERE company in (
  SELECT Company 
  FROM (
    SELECT Company, 
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company ORDER BY id 
               ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company)
      / 
      COUNT(*) 
         OVER (PARTITION BY company) As p50
  ) alias
  WHERE cnt >= 3 OR p50 > 0.5
)

A composite index on the (company, value) columns may help this query run as fast as possible.
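
As a sketch, such an index could be created like this. The index name is hypothetical, and whether the planner actually uses the index for this query depends on the data, so checking the plan with EXPLAIN is advisable.

```sql
-- Hypothetical index name; the column list follows the suggestion above.
CREATE INDEX mytable_company_value_idx ON mytable (company, value);

-- Inspect the plan of a simple lookup to see whether the index is considered.
EXPLAIN
SELECT company
FROM mytable
WHERE company = 'A' AND value IS NULL;
```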


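Edit


The query above does not work. I corrected it slightly; here is a demo: http://sqlfiddle.com/#!15/c9bfe/7
Two things have changed:
- PARTITION BY company ORDER BY date instead of ORDER BY id (mytable has no id column)
- an explicit cast to numeric (because the integer division was truncated to 0):
  OVER (PARTITION BY company)::numeric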

  SELECT company, cnt, p50
  FROM (
    SELECT company, 
      COUNT(CASE WHEN value IS NULL THEN 1 END) 
         OVER (PARTITION BY company ORDER BY date 
               ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
      SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) 
         OVER (PARTITION BY company)::numeric
      / 
      COUNT(*) 
         OVER (PARTITION BY company) As p50
    FROM mytable
  ) alias
--  WHERE cnt >= 3 OR p50 > 0.5 
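
The need for the cast can be seen in isolation. In PostgreSQL, dividing one integer by another truncates the result, so a ratio below 1 becomes 0 unless one operand is cast first. A minimal illustration with made-up numbers (2 NULL rows out of 6):

```sql
SELECT 2 / 6          AS int_ratio,      -- integer division: yields 0
       2::numeric / 6 AS numeric_ratio;  -- numeric division: roughly 0.33
```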

Now the delete query should work:

DELETE FROM mytable 
WHERE company in (
      SELECT company
      FROM (
        SELECT company, 
          COUNT(CASE WHEN value IS NULL THEN 1 END) 
             OVER (PARTITION BY company ORDER BY date 
                   ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
          SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) 
             OVER (PARTITION BY company)::numeric
          / 
          COUNT(*) 
             OVER (PARTITION BY company) As p50
        FROM mytable
      ) alias
    WHERE cnt >= 3 OR p50 > 0.5
)
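
Before running the DELETE, it may be worth previewing which companies it would remove. The following is a sketch that reuses the same window expressions and aggregates them per company; the alias s and the output column names are my own choices. It assumes, as in the sample data, one row per company per day with missing values stored as NULL.

```sql
-- Preview of the companies that match either exclusion rule.
SELECT company,
       MAX(cnt)           AS max_nulls_in_window,  -- most NULLs seen in any 3-row window
       ROUND(MAX(p50), 2) AS null_share            -- overall share of NULL rows
FROM (
    SELECT company,
           COUNT(CASE WHEN value IS NULL THEN 1 END)
               OVER (PARTITION BY company ORDER BY date
                     ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS cnt,
           SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
               OVER (PARTITION BY company)::numeric
           / COUNT(*) OVER (PARTITION BY company) AS p50
    FROM mytable
) AS s
GROUP BY company
HAVING MAX(cnt) >= 3 OR MAX(p50) > 0.5
ORDER BY company;
```

On the sample data from the question this should list A (a window with 3 NULLs, NULL share 0.50) and C (at most 2 NULLs per window, NULL share 0.60), leaving only B.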

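Answer 1: (score: 1)

For the 50% criterion, you can select all companies where the number of distinct dates is less than half the number of days between the minimum and maximum dates.

I have not tested this, but it should give you the idea. I used a CTE to make it easier to read.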

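WITH MinMax AS 
(
    -- Casting to timestamp makes the subtraction return an interval in whole days;
    -- + 1 counts both the first and the last date.
    SELECT Company,
           DATE_PART('day', MAX(date)::timestamp - MIN(date)::timestamp) + 1 AS calendar_days,
           COUNT(DISTINCT date) AS days
      FROM mytable
     GROUP BY Company
)
SELECT Company FROM MinMax
 WHERE (calendar_days / 2) > days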
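
One caveat for the question's sample data: missing days there are stored as rows with a NULL value, so COUNT(DISTINCT date) would count them as present. A possible variant, sketched under the assumption that date is of type DATE (so that subtracting two dates yields an integer number of days) and that there is at most one row per company per day, counts only the non-NULL values instead. The names coverage and observed_days are my own.

```sql
WITH coverage AS (
    SELECT company,
           MAX(date) - MIN(date) + 1 AS calendar_days,  -- inclusive span; assumes a DATE column
           COUNT(value)              AS observed_days   -- COUNT(column) skips NULLs
    FROM mytable
    GROUP BY company
)
SELECT company
FROM coverage
WHERE observed_days < calendar_days / 2.0;  -- fewer than half of the days have a value
```

On the sample data this should return only C (2 observed values over a span of 5 days); A has exactly half of its values missing and therefore does not exceed the 50% threshold.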