Hi, I've run into a tricky problem:
I have a weather forecast table (Oracle 9i), millions of records in size. Its makeup is as follows:
stationid     forecastdate   forecastinterval  forecastcreated  forecastvalue
---------------------------------------------------------------------------------
varchar (pk)  datetime (pk)  integer (pk)      datetime (pk)    integer
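Where:
stationid refers to one of the many weather stations that may create a forecast;
forecastdate refers to the date the forecast is for (date only, not time);
forecastinterval refers to the hour within forecastdate that the forecast is for (0 - 23);
forecastcreated refers to the time the forecast was made, which can be many days beforehand;
forecastvalue refers to the actual value of the forecast (as the name implies).
I need to determine, for a given stationid and a given forecastdate and forecastinterval pair, the records where forecastvalue increments by more than a nominal number (say 500). I'll show a table of the condition here: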
stationid   forecastdate  forecastinterval  forecastcreated     forecastvalue
---------------------------------------------------------------------------------
'stationa'  13-dec-09     10                10-dec-09 04:50:10  0
'stationa'  13-dec-09     10                10-dec-09 17:06:13  0
'stationa'  13-dec-09     10                12-dec-09 05:20:50  300
'stationa'  13-dec-09     10                13-dec-09 09:20:50  300
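In the scenario above, I'd like to pull out the third record. This is the record where the forecast value rose by more than a nominal amount (say 100).
The task is proving very difficult because of the sheer size of the table (millions upon millions of records): the queries take so long to finish that, in fact, my query has never returned.
Here is what I've tried so far to grab those values: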
select
    wtr.stationid,
    wtr.forecastcreated,
    wtr.forecastvalue,
    (wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
    (select inner.*
            rank() over (partition by stationid,
                                      (inner.forecastdate + inner.forecastinterval),
                                      inner.forecastcreated
                         order by stationid,
                                  (inner.forecastdate + inner.forecastinterval) asc,
                                  inner.forecastcreated asc
                        ) rk
     from weathertable inner) wtr
where
    wtr.forecastvalue - 100 > (
        select lastvalue
        from (select y.*,
                     rank() over (partition by stationid,
                                               (forecastdate + forecastinterval),
                                               forecastcreated
                                  order by stationid,
                                           (forecastdate + forecastinterval) asc,
                                           forecastcreated asc) rk
              from weathertable y
             ) z
        where z.stationid = wtr.stationid
          and z.forecastdate = wtr.forecastdate
          and (z.forecastinterval =
               wtr.forecastinterval)
          /* here is where i try to get the 'previous' forecast value.*/
          and wtr.rk = z.rk + 1)
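Answer 0 (score: 1)
Rexem's suggestion of using LAG() is the right approach, but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...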
SQL> select * from t
     /

STATIONID  FORECASTDATE INTERVAL FORECASTCREATED     FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa   13-12-2009         10 10-12-2009 04:50:10             0
stationa   13-12-2009         10 10-12-2009 17:06:13             0
stationa   13-12-2009         10 12-12-2009 05:20:50           300
stationa   13-12-2009         10 13-12-2009 09:20:50           300
stationa   13-12-2009         11 13-12-2009 09:20:50           400
stationb   13-12-2009         11 13-12-2009 09:20:50           500

6 rows selected.

SQL> SELECT v.stationid,
            v.forecastcreated,
            v.forecastvalue,
            (v.forecastdate + v.forecastinterval / 24) fcst_date
     FROM (SELECT t.stationid,
                  t.forecastdate,
                  t.forecastinterval,
                  t.forecastcreated,
                  t.forecastvalue,
                  t.forecastvalue - LAG(t.forecastvalue, 1)
                      OVER (ORDER BY t.forecastcreated) as difference
           FROM t) v
     WHERE v.difference >= 100
     /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00
stationa   13-12-2009 09:20:50           400 13-12-2009 11:00:00
stationb   13-12-2009 09:20:50           500 13-12-2009 11:00:00

SQL>
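To eliminate the false positives, we partition the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL from the first calculation of each partition window: a NULL difference never satisfies the >= 100 test, so the first forecast in each group is filtered out.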
SQL> SELECT v.stationid,
            v.forecastcreated,
            v.forecastvalue,
            (v.forecastdate + v.forecastinterval / 24) fcst_date
     FROM (SELECT t.stationid,
                  t.forecastdate,
                  t.forecastinterval,
                  t.forecastcreated,
                  t.forecastvalue,
                  t.forecastvalue - LAG(t.forecastvalue, 1)
                      OVER (PARTITION BY t.stationid
                                       , t.forecastdate
                                       , t.forecastinterval
                            ORDER BY t.forecastcreated) as difference
           FROM t) v
     WHERE v.difference >= 100
     /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00

SQL>
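The dependence on NULL is easier to see by running the inner query on its own. The following is only a sketch against the sample table t shown above; previous_value is an illustrative column alias that does not appear in the original query.

```sql
-- Sketch only: shows the value LAG() sees for each row of the sample table t.
-- previous_value is an illustrative alias, not part of the original query.
SELECT t.stationid,
       t.forecastdate,
       t.forecastinterval,
       t.forecastcreated,
       t.forecastvalue,
       LAG(t.forecastvalue, 1)
           OVER (PARTITION BY t.stationid
                            , t.forecastdate
                            , t.forecastinterval
                 ORDER BY t.forecastcreated) AS previous_value
FROM t
ORDER BY t.stationid, t.forecastdate, t.forecastinterval, t.forecastcreated;
```

For the earliest forecast in each (STATIONID, FORECASTDATE, FORECASTINTERVAL) group there is no earlier row, so previous_value is NULL and the subtraction yields NULL as well. If the first forecast should instead be compared against a baseline, LAG() accepts a default as an optional third argument, for example LAG(t.forecastvalue, 1, 0).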
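Handling large volumes of data
You describe your table as containing many millions of rows. Such huge tables are like black holes: they have a different physics. There are various potential approaches, depending on your needs, timescales, finances, database version and edition, and any other uses of your system's data. This is more than a five-minute answer.
But here's the five-minute answer anyway.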
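Presuming your table is the live table, it is probably being populated by adding forecasts as they occur, which is basically an append operation. This means the forecasts for any given station are scattered throughout the table. Consequently, indexes on just STATIONID or even FORECASTDATE would have a poor clustering factor.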
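One way to test that presumption is to compare an index's clustering factor with the table's block and row counts in the data dictionary. This is only a sketch: it assumes the table is called WEATHERTABLE (the name used in the question), is owned by the current user, and has up-to-date optimizer statistics.

```sql
-- Sketch only: assumes table WEATHERTABLE in the current schema with
-- optimizer statistics gathered; without statistics these columns are NULL.
SELECT idx.index_name,
       idx.clustering_factor,
       tab.blocks,
       tab.num_rows
FROM user_indexes idx, user_tables tab
WHERE tab.table_name = idx.table_name
AND idx.table_name = 'WEATHERTABLE';
```

A clustering factor close to BLOCKS suggests the rows are stored in roughly index order. A value close to NUM_ROWS suggests they are scattered across the table, which is the situation described above.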
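On that presumption, the one thing I'd suggest you try first is building an index on (STATIONID, FORECASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE). This will take some time (and disk space) to build, but it ought to speed up your subsequent queries considerably, because it holds all the columns needed to satisfy the query without touching the table at all.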
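As a sketch, such an index could be created along these lines. The table name WEATHERTABLE comes from the question; the index name weathertable_fcst_ix is made up for illustration, and tablespace and storage clauses are left out.

```sql
-- Sketch only: weathertable_fcst_ix is a hypothetical index name.
-- The leading columns match the PARTITION BY and ORDER BY of the LAG() query;
-- forecastvalue is appended so the query never has to visit the table.
CREATE INDEX weathertable_fcst_ix
    ON weathertable (stationid,
                     forecastdate,
                     forecastinterval,
                     forecastcreated,
                     forecastvalue);
```

After building it, it is worth checking the execution plan of the LAG() query to confirm that the optimizer actually reads from the index rather than scanning the table.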