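The dataset contains daily (business-day) time series for different companies. There is also an indicator variable (ind) that takes the value 1 or 0. If ind is 1 for a given company, I want to build a subsample of the dataset containing all entries for that company within a certain time range before the indicator event.
Consider the following sample data: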
day company ind
2012-01-11 A 0
2012-01-11 B 0
2012-01-11 C 0
2012-01-12 A 0
2012-01-12 B 0
2012-01-12 C 0
2012-01-13 A 0
2012-01-13 B 1
2012-01-13 C 0
2012-01-16 A 0
2012-01-16 B 0
2012-01-16 C 0
2012-01-17 A 1
2012-01-17 B 0
2012-01-17 C 0
2012-01-18 A 0
2012-01-18 B 1
2012-01-18 C 0
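My goal is a subsample containing the event companies A and B over the time range (day -2 to day -1) before each of their events (ensuring that there is no other event for the respective company within that range). This would be my desired result: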
day company ind
2012-01-11 B 0
2012-01-12 B 0
2012-01-13 A 0
2012-01-13 B 0
2012-01-16 A 0
2012-01-16 B 0
2012-01-17 B 0
If the dataset contains only one company with only one indicator event, the following code works:
-- Number each company's rows in date order.
CREATE TABLE temp AS
SELECT ROW_NUMBER() OVER (PARTITION BY company ORDER BY day) AS rowid, *
FROM mytable;

-- Keep the two rows immediately before the (single) event row.
CREATE TABLE window AS
SELECT *
FROM temp t1
WHERE company IN (
        SELECT company
        FROM temp t2
        WHERE t2.ind = 1)
  AND rowid BETWEEN ((SELECT rowid FROM temp WHERE ind = 1) - 2)
                AND ((SELECT rowid FROM temp WHERE ind = 1) - 1);
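But I'm really struggling to extend this to the case of multiple event companies, where each company may have multiple events, as in the sample dataset.
Do you have any ideas on how to solve this?
Answer 0 (score: 3)
Since your attempt partitions by company, I assume you don't want the following row in the result: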
2012-01-13 B 0
If that's the case, you can use LEAD() to look ahead 1 or 2 rows and check whether the ind flag is set:
WITH cte AS (
    SELECT *,
           LEAD(ind)    OVER (PARTITION BY company ORDER BY day) AS Lead1,
           LEAD(ind, 2) OVER (PARTITION BY company ORDER BY day) AS Lead2
    FROM Table1
)
SELECT Day, Company, Ind
FROM cte
WHERE Lead1 = 1
   OR Lead2 = 1
ORDER BY day, company;
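Demo: SQL Fiddle
Update: for larger ranges this approach is better, because you can specify how many following rows to look at (the demo has been updated to include both):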
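As a quick sanity check of the LEAD() query against the sample data, here is a minimal sketch in Python using the standard-library sqlite3 module. It assumes an SQLite build with window-function support (3.25 or later), which is not necessarily the engine the question uses; the table name Table1 follows the query above, while DAYS, EVENTS, ROWS and run_lead_query are names made up for illustration.

```python
import sqlite3

# Sample data from the question, rebuilt as (day, company, ind) tuples.
DAYS = ["2012-01-11", "2012-01-12", "2012-01-13",
        "2012-01-16", "2012-01-17", "2012-01-18"]
EVENTS = {("2012-01-13", "B"), ("2012-01-17", "A"), ("2012-01-18", "B")}
ROWS = [(d, c, int((d, c) in EVENTS)) for d in DAYS for c in "ABC"]

# Same logic as the LEAD() query: keep a row if the next row or the one
# after it (within the same company, in date order) has ind = 1.
LEAD_QUERY = """
WITH cte AS (
    SELECT day, company, ind,
           LEAD(ind)    OVER (PARTITION BY company ORDER BY day) AS lead1,
           LEAD(ind, 2) OVER (PARTITION BY company ORDER BY day) AS lead2
    FROM Table1
)
SELECT day, company, ind
FROM cte
WHERE lead1 = 1 OR lead2 = 1
ORDER BY day, company
"""


def run_lead_query():
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Table1 (day TEXT, company TEXT, ind INTEGER)")
        conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", ROWS)
        return conn.execute(LEAD_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_lead_query():
        print(*row)
```

On the sample data this returns six rows (2012-01-11 B, 2012-01-12 B, 2012-01-13 A, 2012-01-16 A, 2012-01-16 B, 2012-01-17 B), i.e. the desired result without the 2012-01-13 B row.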
WITH cte AS (
    SELECT *,
           MAX(ind) OVER (PARTITION BY company ORDER BY day
                          ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) AS Lead1
    FROM Table1
)
SELECT Day, Company, Ind
FROM cte
WHERE Lead1 = 1
ORDER BY day, company;
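The window-frame version can be parameterised by how many rows to look ahead. The following is a minimal sketch, assuming Python's standard-library sqlite3 module with SQLite 3.25 or later; sample_rows, rows_before_events, width and the column alias upcoming are hypothetical names, and the frame bound is written into the SQL text as an integer literal after validation.

```python
import sqlite3

DAYS = ["2012-01-11", "2012-01-12", "2012-01-13",
        "2012-01-16", "2012-01-17", "2012-01-18"]
EVENTS = {("2012-01-13", "B"), ("2012-01-17", "A"), ("2012-01-18", "B")}


def sample_rows():
    """Rebuild the question's sample data as (day, company, ind) tuples."""
    return [(d, c, int((d, c) in EVENTS)) for d in DAYS for c in "ABC"]


def rows_before_events(rows, width):
    """Return rows with an event for the same company in the next `width` rows."""
    width = int(width)
    if width < 1:
        raise ValueError("width must be a positive integer")
    # MAX(ind) over the frame is 1 if any of the following rows is an event,
    # and NULL when the frame is empty (last row of a company).
    query = f"""
        WITH cte AS (
            SELECT day, company, ind,
                   MAX(ind) OVER (PARTITION BY company ORDER BY day
                                  ROWS BETWEEN 1 FOLLOWING
                                           AND {width} FOLLOWING) AS upcoming
            FROM Table1
        )
        SELECT day, company, ind
        FROM cte
        WHERE upcoming = 1
        ORDER BY day, company
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Table1 (day TEXT, company TEXT, ind INTEGER)")
        conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in rows_before_events(sample_rows(), 2):
        print(*row)
```

Note that with a wider frame a row that is itself an event can qualify when another event for the same company follows closely: with width 3, the 2012-01-13 B row (ind = 1) is returned. Adding AND ind = 0 to the WHERE clause would exclude such rows, if that is what is wanted.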