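My dataset contains daily time series (actual business days only) for different companies from different industries, and I work with PostgreSQL. The dataset has an indicator variable that takes the values 1, -1 and, most of the time, 0. For readability, I refer to days on which the indicator variable is nonzero as indicator events.
For every indicator event that is preceded by another indicator event in the same industry within the previous three business days, the indicator variable should be updated to zero.
Consider the following example dataset: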
day         company  industry   indicator
2012-01-12  A        financial   1
2012-01-12  B        consumer    0
2012-01-13  A        financial   1
2012-01-13  B        consumer   -1
2012-01-16  A        financial   0
2012-01-16  B        consumer    0
2012-01-17  A        financial   0
2012-01-17  B        consumer    0
2012-01-17  C        consumer    0
2012-01-18  A        financial   0
2012-01-18  B        consumer    0
2012-01-18  C        consumer    1
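So the indicator values that should be updated to zero are the entry for company A on 2012-01-13 and the entry for company C on 2012-01-18, because each of them is preceded by another indicator event in the same industry within 3 business days.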
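To make the example reproducible, the sample data could be loaded with a setup along these lines. This is only a sketch: the table name test is the one used in the queries of this question, but the column types and NOT NULL constraints are assumptions, since the actual table definition is not shown.

```sql
-- Hypothetical setup: column types and constraints are assumed.
CREATE TABLE test (
    day       date    NOT NULL,
    company   text    NOT NULL,
    industry  text    NOT NULL,
    indicator integer NOT NULL DEFAULT 0
);

INSERT INTO test (day, company, industry, indicator) VALUES
    ('2012-01-12', 'A', 'financial',  1),
    ('2012-01-12', 'B', 'consumer',   0),
    ('2012-01-13', 'A', 'financial',  1),
    ('2012-01-13', 'B', 'consumer',  -1),
    ('2012-01-16', 'A', 'financial',  0),
    ('2012-01-16', 'B', 'consumer',   0),
    ('2012-01-17', 'A', 'financial',  0),
    ('2012-01-17', 'B', 'consumer',   0),
    ('2012-01-17', 'C', 'consumer',   0),
    ('2012-01-18', 'A', 'financial',  0),
    ('2012-01-18', 'B', 'consumer',   0),
    ('2012-01-18', 'C', 'consumer',   1);
```

Note that 2012-01-14 and 2012-01-15 fall on a weekend, which is why 2012-01-13 is exactly three business days before 2012-01-18.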
I tried to accomplish this as follows:
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
    SELECT day, industry
    FROM (
        SELECT industry, day,
               COUNT(CASE WHEN indicator <> 0 THEN 1 END)
                   OVER (PARTITION BY industry ORDER BY day
                         ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS cnt
        FROM test
    ) alias
    WHERE cnt >= 2)
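My idea was to count the indicator events on the current day and the 3 preceding days, partitioned by industry. If the count is greater than 1, the indicator value is updated to zero.
The weakness is that, so far, it counts over the previous three rows (partitioned by industry) rather than over the previous three business days. So in the example data it fails to update company C on 2012-01-18, because it counts the last three rows where industry = consumer instead of all rows where industry = consumer within the last three business days.
I tried different approaches, such as adding another subquery at the end of the third line of the code, or adding a WHERE EXISTS clause after the third-to-last line, to make sure the code counts over the previous three dates. But nothing worked. I really don't know how to do it (I am just learning to use PostgreSQL).
Do you have any idea how to fix it?
Or maybe I am thinking in a completely wrong direction and you know another approach to solve my problem?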
Answer 0 (score: 1)
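First off, your table should be normalized. industry should be a small foreign key column (typically integer) referencing the industry_id of an industry table. Maybe you already have that and just simplified for the sake of the question. Your actual table definition would go a long way.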
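A normalized layout along those lines might look like the following sketch. Everything beyond the names industry, industry_id and tbl is an assumption for illustration, since the actual table definition is not part of the question: the name column, the data types and the primary key are hypothetical.

```sql
-- Hypothetical lookup table: one row per industry.
CREATE TABLE industry (
    industry_id serial PRIMARY KEY,
    name        text NOT NULL UNIQUE   -- assumed column, e.g. 'financial'
);

-- Hypothetical fact table: industry is now a small integer foreign key.
CREATE TABLE tbl (
    day       date    NOT NULL,
    company   text    NOT NULL,
    industry  integer NOT NULL REFERENCES industry (industry_id),
    indicator integer NOT NULL DEFAULT 0,
    PRIMARY KEY (company, day)         -- assumed: one row per company and day
);
```

Keeping the column name industry in tbl means the index and queries in this answer would work unchanged against such a layout.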
Since rows with an indicator are rare but highly interesting, create a (possibly "covering") partial index to make any solution faster:
CREATE INDEX tbl_indicator_idx ON tbl (industry, day)
WHERE indicator <> 0;
Equality first, range last.
This assumes indicator is defined NOT NULL. If industry were an integer, this index would be very efficient.
This query identifies the rows to reset:
WITH x AS (  -- only rows with an indicator
   SELECT DISTINCT industry, day
   FROM   tbl t
   WHERE  indicator <> 0
   )
SELECT industry, day
FROM  (
   SELECT i.industry, d.day, x.day IS NOT NULL AS incident
        , count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
                             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
      FROM  (
         SELECT generate_series(min(day), max(day), interval '1d')::date AS day
         FROM   x
         ) d
      WHERE  extract('ISODOW' FROM d.day) < 6
      ) d
   CROSS  JOIN (SELECT DISTINCT industry FROM x) i
   LEFT   JOIN x USING (industry, day)
   ) sub
WHERE  incident
AND    ct > 1
ORDER  BY 1, 2;
ISODOW as the extract() parameter is convenient for cutting off weekends.
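The difference between the two day-of-week fields can be seen with a small standalone query over the weekend inside the sample period. This is an illustrative sketch and not part of the solution itself; the date range is chosen only because it spans a Friday through a Monday.

```sql
-- Compare dow (Sunday = 0 .. Saturday = 6) with isodow (Monday = 1 .. Sunday = 7).
SELECT d::date                AS day,
       to_char(d, 'Dy')       AS weekday,
       extract(dow    FROM d) AS dow,
       extract(isodow FROM d) AS isodow
FROM   generate_series(timestamp '2012-01-13'
                     , timestamp '2012-01-16'
                     , interval '1 day') AS d;
```

With isodow, Saturday and Sunday are 6 and 7, so the single comparison isodow < 6 keeps Monday through Friday. With dow, Sunday is 0, so a range test such as BETWEEN 1 AND 5 would be needed instead.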
Integrated into an UPDATE:
WITH x AS (  -- only rows with an indicator
   SELECT DISTINCT industry, day
   FROM   tbl t
   WHERE  indicator <> 0
   )
UPDATE tbl t
SET    indicator = 0
FROM  (
   SELECT i.industry, d.day, x.day IS NOT NULL AS incident
        , count(x.day) OVER (PARTITION BY industry ORDER BY day_nr
                             ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT *, row_number() OVER (ORDER BY d.day) AS day_nr
      FROM  (
         SELECT generate_series(min(day), max(day), interval '1d')::date AS day
         FROM   x
         ) d
      WHERE  extract('isodow' FROM d.day) < 6
      ) d
   CROSS  JOIN (SELECT DISTINCT industry FROM x) i
   LEFT   JOIN x USING (industry, day)
   ) u
WHERE  u.incident
AND    u.ct > 1
AND    t.industry = u.industry
AND    t.day = u.day;
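This should be much faster than solutions with correlated subqueries and a function call for every row. Even though that approach is based on a previous answer of my own, it is not ideal for this case.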
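For comparison, a shorter variant might be possible with a RANGE frame over a business-day number. This is a sketch under stated assumptions, not the method of this answer: it needs PostgreSQL 11 or later (RANGE frames with an offset), and it assumes that every business day in the period appears at least once somewhere in tbl, so that dense_rank() over day can serve as the business-day number. The names day_nr, numbered and counted are made up for the example.

```sql
-- Sketch only: identifies indicator events with another event in the same
-- industry within the current and the 3 preceding business days.
SELECT industry, day, company
FROM  (
   SELECT industry, day, company, indicator
        , count(*) FILTER (WHERE indicator <> 0)
             OVER (PARTITION BY industry ORDER BY day_nr
                   RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS ct
   FROM  (
      SELECT industry, day, company, indicator
             -- assumed: no business day is missing from tbl entirely
           , (dense_rank() OVER (ORDER BY day))::int AS day_nr
      FROM   tbl
      ) numbered
   ) counted
WHERE  indicator <> 0
AND    ct > 1
ORDER  BY 1, 2;
```

Unlike the query above, this counts event rows rather than distinct event days, so two events in the same industry on the same day would both be flagged even without an earlier event. Whether that is acceptable depends on the data.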
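Answer 1 (score: 0)
In the meantime I found a possible solution myself (I hope this does not violate forum etiquette).
Note that this is only one possible solution. You are very welcome to comment on it or improve it if you like.
For the first part, a function addbusinessdays that adds business days to (or subtracts them from) a given date, I refer to: http://osssmb.wordpress.com/2009/12/02/business-days-working-days-sql-for-postgres-2/ (I modified it only slightly, because I don't care about holidays, only about weekends)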
CREATE OR REPLACE FUNCTION addbusinessdays(date, integer)
  RETURNS date AS
$BODY$
with alldates as (
    -- candidate dates, stepping forward or backward from $1
    SELECT i,
           $1 + (i * case when $2 < 0 then -1 else 1 end) AS date
    FROM generate_series(0, (abs($2) + 5) * 2) i
),
days as (
    select i, date, extract('dow' from date) as dow
    from alldates
),
businessdays as (
    -- keep Monday (1) through Friday (5)
    select i, date, d.dow
    from days d
    where d.dow between 1 and 5
)
select date
from businessdays
where case when $2 > 0 then date >= $1
           when $2 < 0 then date <= $1
           else date = $1 end
order by i  -- sort here: an ORDER BY inside a CTE is not guaranteed to carry over
limit 1
offset abs($2)
$BODY$
  LANGUAGE sql VOLATILE
  COST 100;
ALTER FUNCTION addbusinessdays(date, integer) OWNER TO postgres;
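A quick sanity check of the function against the sample period might look like this. It is a sketch that assumes the function above has been created as shown; the column aliases are made up for the example.

```sql
-- 2012-01-18 is a Wednesday and 2012-01-13 a Friday; the weekend of
-- 2012-01-14/15 lies between them and should be skipped in both directions.
SELECT addbusinessdays(date '2012-01-18', -3) AS three_days_back,
       addbusinessdays(date '2012-01-13',  1) AS one_day_forward;
```

Counting back from 2012-01-18 over the business days 17, 16 and 13 should give 2012-01-13, and one business day after 2012-01-13 should be 2012-01-16.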
For the second part, I refer to this related question and apply Erwin Brandstetter's correlated subquery approach: Window Functions or Common Table Expressions: count previous rows within range
UPDATE test SET indicator = 0
WHERE (day, industry) IN (
    SELECT day, industry
    FROM (
        SELECT industry, day,
               (SELECT COUNT(CASE WHEN indicator <> 0 THEN 1 END)
                FROM test t1
                WHERE t1.industry = t.industry
                  AND t1.day BETWEEN addbusinessdays(t.day, -3) AND t.day) AS cnt
        FROM test t
    ) alias
    WHERE cnt >= 2)