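My dataset contains daily time series for different companies, and I am using PostgreSQL. The dataset has an indicator variable that takes the values 1 and -1 and is 0 most of the time. If the indicator is non-zero and the company has a missing value in another column on that day (the indicator day) or on the following day, the company should be excluded from the dataset entirely.
Consider the following example data: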
date company indicator value
2012-01-02 A 0 2
2012-01-02 B 0 9
2012-01-02 C 0 1
2012-01-02 D 0 3
2012-01-03 A 1 NULL
2012-01-03 B 0 NULL
2012-01-03 C -1 1
2012-01-03 D 0 2
2012-01-04 A 0 1
2012-01-04 B 0 1
2012-01-04 C 0 NULL
2012-01-04 D 1 4
2012-01-05 A 0 4
2012-01-05 B 0 2
2012-01-05 C 0 1
2012-01-05 D 0 7
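So A has to be excluded because it has a missing value on the indicator day, and C because it has a missing value on the day after the indicator day.
I tried the following: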
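To reproduce the example, the sample data can be loaded with a script along these lines. This is only a sketch: the column types (`date`, `text`, `integer`) are assumptions, since the question shows the values but not the table definition.

```sql
-- Assumed schema; the real table may use different types.
CREATE TABLE mytable (
   date      date,
   company   text,
   indicator integer,
   value     integer
);

-- The sixteen rows from the example above.
INSERT INTO mytable (date, company, indicator, value) VALUES
   ('2012-01-02', 'A',  0, 2),
   ('2012-01-02', 'B',  0, 9),
   ('2012-01-02', 'C',  0, 1),
   ('2012-01-02', 'D',  0, 3),
   ('2012-01-03', 'A',  1, NULL),
   ('2012-01-03', 'B',  0, NULL),
   ('2012-01-03', 'C', -1, 1),
   ('2012-01-03', 'D',  0, 2),
   ('2012-01-04', 'A',  0, 1),
   ('2012-01-04', 'B',  0, 1),
   ('2012-01-04', 'C',  0, NULL),
   ('2012-01-04', 'D',  1, 4),
   ('2012-01-05', 'A',  0, 4),
   ('2012-01-05', 'B',  0, 2),
   ('2012-01-05', 'C',  0, 1),
   ('2012-01-05', 'D',  0, 7);
```

Note that B also has a missing value on 2012-01-03, but its indicator is never non-zero, so B stays in the dataset.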
CREATE TABLE to_delete
AS SELECT * FROM mytable
WHERE company IN(
SELECT company
FROM mytable
WHERE date BETWEEN (SELECT date FROM mytable WHERE indicator != 0)
AND (SELECT date+1 FROM mytable WHERE indicator != 0)
AND indicator != 0)
AND date BETWEEN (SELECT date FROM mytable WHERE indicator != 0)
AND (SELECT date+1 FROM mytable WHERE indicator != 0);
DELETE FROM mytable WHERE company in (SELECT DISTINCT company FROM to_delete);
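This works if the example dataset contains only one non-zero indicator value. With more than one, PostgreSQL returns an error saying that my subquery returns more than one row.
I am really struggling with this. Do you know a solution, or maybe a completely different approach to get the desired result?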
Answer 0 (score: 1)
I would simplify with an EXISTS semi-join.
SELECT * FROM tbl t
-- DELETE FROM tbl t
WHERE indicator <> 0
AND EXISTS (
SELECT 1
FROM tbl t1
WHERE t1.day IN (t.day, t.day + 1)
AND t1.company = t.company
AND t1.value IS NULL
)
I use the column name day instead of date, because I never use basic type names as identifiers. The expression day + 1 works if day is of data type date (as it should be).
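The date arithmetic can be checked in isolation. A minimal sketch, using a literal date from the sample data instead of a table column:

```sql
-- date + integer adds whole days and yields a date.
SELECT DATE '2012-01-03' + 1 AS next_day;   -- 2012-01-04

-- If the column were a timestamp instead (an assumption, not the case
-- described in the question), an integer cannot be added directly;
-- an interval is needed.
SELECT TIMESTAMP '2012-01-03 00:00:00' + INTERVAL '1 day' AS next_day;
```

With a timestamp column the IN comparison would also only match rows with exactly the same time of day, which is one more reason to store calendar days as date.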
The question also says that the company should be excluded from the dataset entirely:
DELETE FROM tbl t
USING (
SELECT DISTINCT company
FROM tbl t
WHERE indicator <> 0
AND EXISTS (
SELECT 1
FROM tbl t1
WHERE t1.day IN (t.day, t.day + 1)
AND t1.company = t.company
AND t1.value IS NULL
)
) del
WHERE t.company = del.company
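Before deleting for real, it may be worth running the statement inside a transaction and looking at what it removes. The following is a sketch of that idea, not part of the statement above: it uses the same table tbl and column day, renames the inner alias to c purely for readability, and adds a RETURNING clause.

```sql
BEGIN;

DELETE FROM tbl t
USING (
   SELECT DISTINCT c.company
   FROM   tbl c
   WHERE  c.indicator <> 0
   AND    EXISTS (
      SELECT 1
      FROM   tbl t1
      WHERE  t1.company = c.company
      AND    t1.day IN (c.day, c.day + 1)
      AND    t1.value IS NULL
   )
) del
WHERE  t.company = del.company
RETURNING t.company, t.day;   -- lists every row that was removed

-- Companies that are left; with the question's sample data
-- this should be B and D.
SELECT DISTINCT company FROM tbl ORDER BY company;

ROLLBACK;   -- replace with COMMIT once the result looks right
```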
Answer 1 (score: 1)
DELETE FROM test WHERE company IN (
WITH
for_check AS (SELECT date, company FROM test WHERE indicator != 0)
SELECT test.company
FROM test
INNER JOIN for_check fc
ON test.date IN (fc.date, fc.date + 1)
AND fc.company = test.company
WHERE test.value IS NULL
)