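My dataset contains daily time series for different companies, and I am using PostgreSQL. My goal is to exclude companies whose time series are too incomplete. Specifically, I want to exclude every company that has 3 or more consecutive missing values. In addition, I want to exclude every company for which more than 50% of the values between its first and last date in the dataset are missing.
We can use the following sample data: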
date company value
2012-01-01 A 5
2012-01-01 B 2
2012-01-02 A NULL
2012-01-02 B 2
2012-01-02 C 4
2012-01-03 A NULL
2012-01-03 B NULL
2012-01-03 C NULL
2012-01-04 A NULL
2012-01-04 B NULL
2012-01-04 C NULL
2012-01-05 A 8
2012-01-05 B 9
2012-01-05 C 3
2012-01-06 A 8
2012-01-06 B 9
2012-01-06 C NULL
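So A must be excluded because it has a gap of three consecutive missing values, and C because more than 50% of its values between its first and last date are missing.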
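To make the sample easy to reproduce, here is a minimal setup sketch. The table and column names follow the question; the column types (date, text, integer) are assumptions, since the question does not state them.

```sql
-- Hypothetical setup: names follow the question, types are assumed.
CREATE TABLE mytable (
    date    date NOT NULL,
    company text NOT NULL,
    value   integer          -- NULL marks a missing observation
);

INSERT INTO mytable (date, company, value) VALUES
    ('2012-01-01', 'A', 5),
    ('2012-01-01', 'B', 2),
    ('2012-01-02', 'A', NULL),
    ('2012-01-02', 'B', 2),
    ('2012-01-02', 'C', 4),
    ('2012-01-03', 'A', NULL),
    ('2012-01-03', 'B', NULL),
    ('2012-01-03', 'C', NULL),
    ('2012-01-04', 'A', NULL),
    ('2012-01-04', 'B', NULL),
    ('2012-01-04', 'C', NULL),
    ('2012-01-05', 'A', 8),
    ('2012-01-05', 'B', 9),
    ('2012-01-05', 'C', 3),
    ('2012-01-06', 'A', 8),
    ('2012-01-06', 'B', 9),
    ('2012-01-06', 'C', NULL);

-- Sanity check: rows and missing values per company.
-- COUNT(value) skips NULLs, so the difference is the number of missing values.
SELECT company,
       COUNT(*)                AS n_rows,
       COUNT(*) - COUNT(value) AS n_missing
FROM mytable
GROUP BY company
ORDER BY company;
-- A: 6 rows, 3 missing; B: 6 rows, 2 missing; C: 5 rows, 3 missing
```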
Combining other answers from this forum, I came up with the following code:
Add an auto-increment primary key to identify each row
CREATE TABLE test AS SELECT * FROM mytable ORDER BY company, date;
CREATE SEQUENCE id_seq; ALTER TABLE test ADD id INT UNIQUE;
ALTER TABLE test ALTER COLUMN id SET DEFAULT NEXTVAL('id_seq');
UPDATE test SET id = NEXTVAL('id_seq');
ALTER TABLE test ADD PRIMARY KEY (id);
Detect the gaps in the time series
CREATE TABLE to_del AS WITH count3 AS
( SELECT *,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
AS cnt FROM test)
SELECT company, id FROM count3 WHERE cnt >= 3;
Delete the companies with gaps from mytable
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_del);
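This seems to detect and delete gaps of 3 or more consecutive missing values in the time series. But the approach is very cumbersome, and I cannot figure out how to also catch the companies with more than 50% missing values.
Can you think of a more efficient solution than mine (I am only just learning PostgreSQL) that also excludes the companies with more than 50% missing values?
Answer 0 (score: 2)
I would do it in a single query: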
DELETE FROM mytable
WHERE company in (
SELECT Company
FROM (
SELECT Company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY id
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company)
/
COUNT(*)
OVER (PARTITION BY company) As p50
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
A composite index on the (company, value) columns should help get the best speed out of this query.
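Edit
The query above does not work.
I corrected it slightly; here is a demo: http://sqlfiddle.com/#!15/c9bfe/7
Two things changed:
- PARTITION BY company ORDER BY date instead of ORDER BY id
- an explicit cast to numeric (because the integer division was truncated to 0):
OVER (PARTITION BY company)::numeric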
SELECT company, cnt, p50
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
-- WHERE cnt >= 3 OR p50 > 0.5
Now the delete query should work:
DELETE FROM mytable
WHERE company in (
SELECT company
FROM (
SELECT company,
COUNT(CASE WHEN value IS NULL THEN 1 END)
OVER (PARTITION BY company ORDER BY date
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) As cnt,
SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY company)::numeric
/
COUNT(*)
OVER (PARTITION BY company) As p50
FROM mytable
) alias
WHERE cnt >= 3 OR p50 > 0.5
)
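Before running a DELETE like this, it can help to preview which companies would be removed and why. The following is only a sketch built on the same window logic. The alias `w` and the output column names are made up for illustration, and it assumes that consecutive rows of a company are consecutive days (a day with no row at all is not counted as missing).

```sql
-- Preview sketch: one row per company that would be deleted.
SELECT company,
       MAX(cnt) AS max_nulls_in_3_rows,   -- 3 means three NULLs in a row
       ROUND(AVG(CASE WHEN value IS NULL THEN 1.0 ELSE 0.0 END), 2) AS null_share
FROM (
    SELECT company,
           value,
           -- NULLs among the current row and the two rows after it
           COUNT(CASE WHEN value IS NULL THEN 1 END)
               OVER (PARTITION BY company ORDER BY date
                     ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS cnt
    FROM mytable
) w
GROUP BY company
HAVING MAX(cnt) >= 3
    OR AVG(CASE WHEN value IS NULL THEN 1.0 ELSE 0.0 END) > 0.5
ORDER BY company;
```

On the sample data from the question this should list A (three NULLs in a row, NULL share 0.50) and C (NULL share 0.60), while B is kept.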
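Answer 1 (score: 1)
For the 50% criterion, you could select all companies where the number of distinct dates is less than half the number of days between the minimum and maximum dates.
I haven't tested this, but it should give you the idea. I used a CTE to make it easier to read.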
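WITH MinMax AS
(
    -- calendar_days: inclusive span between a company's first and last date
    -- days: number of distinct dates that actually have a row
    -- (rows whose value is NULL still count here; use COUNT(value) to skip them)
    SELECT company,
           MAX(date)::date - MIN(date)::date + 1 AS calendar_days,
           COUNT(DISTINCT date) AS days
    FROM mytable
    GROUP BY company
)
SELECT company FROM MinMax
WHERE calendar_days / 2.0 > days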
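If missing days are stored as rows with a NULL value, as in the question's sample data, counting dates alone will not flag anything. A possible variant, offered as a sketch rather than a tested solution, counts only non-NULL values and compares them with the inclusive calendar span. It assumes the table is called mytable and that the date column is of type date (or can be cast to it).

```sql
-- Sketch: companies with more than 50% missing values
-- between their first and last date.
SELECT company
FROM mytable
GROUP BY company
-- COUNT(value) ignores NULLs; date - date yields a whole number of days.
-- Dividing by 2.0 avoids integer division.
HAVING COUNT(value) < (MAX(date)::date - MIN(date)::date + 1) / 2.0
ORDER BY company;
```

On the sample data this returns only C (2 values over a 5-day span). A has exactly 50% missing, which is not more than 50%, so it is only caught by the consecutive-gap rule.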