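This post has been completely rewritten to make the question easier to understand.
Setup
PostgreSQL 9.5 running on Ubuntu Server 14.04 LTS.
Data model
I have dataset tables in which I store data (time series) separately. All of these tables must share the same structure: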
CREATE TABLE IF NOT EXISTS %s(
Id SERIAL NOT NULL,
ChannelId INTEGER NOT NULL,
GranulityIdIn INTEGER,
GranulityId INTEGER NOT NULL,
TimeValue TIMESTAMP NOT NULL,
FloatValue FLOAT DEFAULT(NULL),
Status BIGINT DEFAULT(NULL),
QualityCodeId INTEGER NOT NULL,
DataArray FLOAT[] DEFAULT(NULL),
DataCount BIGINT DEFAULT(NULL),
Performance FLOAT DEFAULT(NULL),
StepCount INTEGER NOT NULL DEFAULT(0),
TableRegClass regclass NOT NULL,
Updated TIMESTAMP NOT NULL,
Tags TEXT[] DEFAULT(NULL),
--
CONSTRAINT PK_%s PRIMARY KEY(Id),
CONSTRAINT FK_%s_Channel FOREIGN KEY(ChannelId) REFERENCES scientific.Channel(Id),
CONSTRAINT FK_%s_GranulityIn FOREIGN KEY(GranulityIdIn) REFERENCES quality.Granulity(Id),
CONSTRAINT FK_%s_Granulity FOREIGN KEY(GranulityId) REFERENCES quality.Granulity(Id),
CONSTRAINT FK_%s_QualityCode FOREIGN KEY(QualityCodeId) REFERENCES quality.QualityCode(Id),
CONSTRAINT UQ_%s UNIQUE(QualityCodeId, ChannelId, GranulityId, TimeValue)
);
CREATE INDEX IDX_%s_Channel ON %s USING btree(ChannelId);
CREATE INDEX IDX_%s_Quality ON %s USING btree(QualityCodeId);
CREATE INDEX IDX_%s_Granulity ON %s USING btree(GranulityId) WHERE GranulityId > 2;
CREATE INDEX IDX_%s_TimeValue ON %s USING btree(TimeValue);
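This definition comes from a FUNCTION, so %s stands for the dataset name.
The UNIQUE constraint ensures that there are no duplicate records within a given dataset. A record in a dataset is a value (floatvalue) of a given channel (channelid), sampled at a given time (timevalue) over a given interval (granulityid), with a given quality (qualitycodeid). Whatever the value is, there cannot be a duplicate of (channelid, timevalue, granulityid, qualitycodeid).
Records in a dataset look like this: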
1;25;;1;"2015-01-01 00:00:00";0.54;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
2;25;;1;"2015-01-01 00:30:00";0.49;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
3;25;;1;"2015-01-01 01:00:00";0.47;160;6;"";;;0;"datastore.rtu";"2016-05-07 16:38:29.28106";""
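I also have a satellite table in which I store the significant digit for each channel. This parameter can change over time. I store it as follows: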
CREATE TABLE SVPOLFactor (
Id SERIAL NOT NULL,
ChannelId INTEGER NOT NULL,
StartTimestamp TIMESTAMP NOT NULL,
Factor FLOAT NOT NULL,
UnitsId VARCHAR(8) NOT NULL,
--
CONSTRAINT PK_SVPOLFactor PRIMARY KEY(Id),
CONSTRAINT FK_SVPOLFactor_Units FOREIGN KEY(UnitsId) REFERENCES Units(Id),
CONSTRAINT UQ_SVPOLFactor UNIQUE(ChannelId, StartTimestamp)
);
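When a significant digit is defined for a channel, a row is added to this table, and the factor applies from that date onward. The first record for a channel always has the sentinel value '-infinity'::TIMESTAMP, which means that the factor applies from the beginning. Subsequent rows must have a real, defined value. If there is no row for a given channel, the significant digit is unitary.
Records in this table look like this: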
123;277;"-infinity";0.1;"_C"
124;1001;"-infinity";0.01;"-"
125;1001;"2014-03-01 00:00:00";0.1;"-"
126;1001;"2014-06-01 00:00:00";1;"-"
127;1001;"2014-09-01 00:00:00";10;"-"
5001;5181;"-infinity";0.1;"ug/m3"
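To make the lookup rule concrete, here is a minimal sketch of how the factor in force for one channel at one instant could be read. The scratch table svpol_demo and the probe date 2014-04-15 are assumptions made for illustration only. The rows are the channel 1001 samples shown above.

```sql
-- Hypothetical scratch table holding the channel 1001 sample rows.
CREATE TEMP TABLE svpol_demo (
    ChannelId      INTEGER   NOT NULL,
    StartTimestamp TIMESTAMP NOT NULL,
    Factor         FLOAT     NOT NULL
);

INSERT INTO svpol_demo (ChannelId, StartTimestamp, Factor) VALUES
    (1001, '-infinity',           0.01),
    (1001, '2014-03-01 00:00:00', 0.1),
    (1001, '2014-06-01 00:00:00', 1),
    (1001, '2014-09-01 00:00:00', 10);

-- The factor in force is the one from the latest row that started
-- on or before the probe instant (an arbitrary date chosen for the demo).
SELECT Factor
FROM svpol_demo
WHERE ChannelId = 1001
  AND StartTimestamp <= TIMESTAMP '2014-04-15 00:00:00'
ORDER BY StartTimestamp DESC
LIMIT 1;
-- returns 0.1
```

A channel with no row at all returns nothing here, which is why the audit query below needs a LEFT JOIN and a COALESCE.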
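Goal
My goal is to perform a comparison audit of two datasets that were populated by different processes. For this purpose, I wrote the following query, whose behavior I do not understand: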
WITH
-- Join records before records (regard to uniqueness constraint) from datastore templated tables in order to make audit comparison:
S0 AS (
SELECT
A.ChannelId
,A.GranulityIdIn AS gidInRef
,B.GranulityIdIn AS gidInAudit
,A.GranulityId AS GranulityId
,A.QualityCodeId
,A.TimeValue
,A.FloatValue AS xRef
,B.FloatValue AS xAudit
,A.StepCount AS scRef
,B.StepCount AS scAudit
,A.DataCount AS dcRef
,B.DataCount AS dcAudit
,round(A.Performance::NUMERIC, 4) AS pRef
,round(B.Performance::NUMERIC, 4) AS pAudit
FROM
datastore.rtu AS A JOIN datastore.audit0 AS B USING(ChannelId, GranulityId, QualityCodeId, TimeValue)
),
-- Join before SVPOL factors in order to determine decimal factor applied to records:
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
),
-- Audit computation:
S2 AS (
SELECT
S1.*
,xaudit - xref AS dx
,(xaudit - xref)/NULLIF(xref, 0) AS rdx
,round(xaudit*pow(10, k))*pow(10, -k) AS xroundfloat
,round(xaudit::NUMERIC, k) AS xroundnum
,0.5*pow(10, -k) AS epsilon
FROM S1
)
SELECT
*
,ABS(dx) AS absdx
,ABS(rdx) AS absrdx
,(xroundfloat - xref) AS dxroundfloat
,(xroundnum - xref) AS dxroundnum
,(ABS(dx) - epsilon) AS dxeps
,(ABS(dx) - epsilon)/epsilon AS rdxeps
,(xroundfloat - xroundnum) AS dfround
FROM
S2
ORDER BY
k DESC
,ABS(rdx) DESC
,ChannelId;
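The query may be a bit hard to follow. Roughly, I expect it to: join the records of the two datasets on the columns of the uniqueness constraint (S0); fetch, with a LEFT JOIN, the significant digit that applies to each record (S1); and perform the audit computations (S2 and the final SELECT).
Problem
When I run the query above, I am missing rows. For example, channelid=123 with granulityid=4 has 12 records in common in both tables (datastore.rtu and datastore.audit0). When I run the whole query and store the result in a MATERIALIZED VIEW, there are fewer than 12 rows. I then started investigating why records were missing, and I ran into strange behavior with the WHERE clause. If I run EXPLAIN ANALYZE on this query, I get: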
"Sort (cost=332212.76..332212.77 rows=1 width=232) (actual time=6042.736..6157.235 rows=61692 loops=1)"
" Sort Key: s2.k DESC, (abs(s2.rdx)) DESC, s2.channelid"
" Sort Method: external merge Disk: 10688kB"
" CTE s0"
" -> Merge Join (cost=0.85..332208.25 rows=1 width=84) (actual time=20.408..3894.071 rows=63635 loops=1)"
" Merge Cond: ((a.qualitycodeid = b.qualitycodeid) AND (a.channelid = b.channelid) AND (a.granulityid = b.granulityid) AND (a.timevalue = b.timevalue))"
" -> Index Scan using uq_rtu on rtu a (cost=0.43..289906.29 rows=3101628 width=52) (actual time=0.059..2467.145 rows=3102319 loops=1)"
" -> Index Scan using uq_audit0 on audit0 b (cost=0.42..10305.46 rows=98020 width=52) (actual time=0.049..108.138 rows=98020 loops=1)"
" CTE s1"
" -> Unique (cost=4.37..4.38 rows=1 width=148) (actual time=4445.865..4509.839 rows=61692 loops=1)"
" -> Sort (cost=4.37..4.38 rows=1 width=148) (actual time=4445.863..4471.002 rows=63635 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: external merge Disk: 5624kB"
" -> Hash Right Join (cost=0.03..4.36 rows=1 width=148) (actual time=4102.842..4277.641 rows=63635 loops=1)"
" Hash Cond: (sf.channelid = s0.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.013..0.083 rows=168 loops=1)"
" -> Hash (cost=0.02..0.02 rows=1 width=132) (actual time=4102.002..4102.002 rows=63635 loops=1)"
" Buckets: 65536 (originally 1024) Batches: 2 (originally 1) Memory Usage: 3841kB"
" -> CTE Scan on s0 (cost=0.00..0.02 rows=1 width=132) (actual time=20.413..4038.078 rows=63635 loops=1)"
" CTE s2"
" -> CTE Scan on s1 (cost=0.00..0.07 rows=1 width=168) (actual time=4445.910..4972.832 rows=61692 loops=1)"
" -> CTE Scan on s2 (cost=0.00..0.05 rows=1 width=232) (actual time=4445.934..5312.884 rows=61692 loops=1)"
"Planning time: 1.782 ms"
"Execution time: 6201.148 ms"
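I know that I must have 67106 rows.
At the time of writing, I know that S0 returns the correct number of rows, so the problem must lie in a later CTE.
What I find really strange is that: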
EXPLAIN ANALYZE
WITH
S0 AS (
SELECT * FROM datastore.audit0
),
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
)
SELECT * FROM S1 WHERE Channelid=123 AND GranulityId=4 -- POST-FILTERING
returns 10 rows:
"CTE Scan on s1 (cost=24554.34..24799.39 rows=1 width=196) (actual time=686.211..822.803 rows=10 loops=1)"
" Filter: ((channelid = 123) AND (granulityid = 4))"
" Rows Removed by Filter: 94890"
" CTE s0"
" -> Seq Scan on audit0 (cost=0.00..2603.20 rows=98020 width=160) (actual time=0.009..26.092 rows=98020 loops=1)"
" CTE s1"
" -> Unique (cost=21215.99..21951.14 rows=9802 width=176) (actual time=590.337..705.070 rows=94900 loops=1)"
" -> Sort (cost=21215.99..21461.04 rows=98020 width=176) (actual time=590.335..665.152 rows=99151 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: external merge Disk: 12376kB"
" -> Hash Left Join (cost=5.78..4710.74 rows=98020 width=176) (actual time=0.143..346.949 rows=99151 loops=1)"
" Hash Cond: (s0.channelid = sf.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> CTE Scan on s0 (cost=0.00..1960.40 rows=98020 width=160) (actual time=0.012..116.543 rows=98020 loops=1)"
" -> Hash (cost=3.68..3.68 rows=168 width=20) (actual time=0.096..0.096 rows=168 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 12kB"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.006..0.045 rows=168 loops=1)"
"Planning time: 0.385 ms"
"Execution time: 846.179 ms"
whereas the next one returns the correct number of rows:
EXPLAIN ANALYZE
WITH
S0 AS (
SELECT * FROM datastore.audit0
WHERE Channelid=123 AND GranulityId=4 -- PRE FILTERING
),
S1 AS (
SELECT
DISTINCT ON(ChannelId, TimeValue)
S0.*
,SF.Factor::NUMERIC AS svpolfactor
,COALESCE(-log(SF.Factor), 0)::INTEGER AS k
FROM
S0 LEFT JOIN settings.SVPOLFactor AS SF ON ((S0.ChannelId = SF.ChannelId) AND (SF.StartTimestamp <= S0.TimeValue))
ORDER BY
ChannelId, TimeValue, StartTimestamp DESC
)
SELECT * FROM S1
which gives:
"CTE Scan on s1 (cost=133.62..133.86 rows=12 width=196) (actual time=0.580..0.598 rows=12 loops=1)"
" CTE s0"
" -> Bitmap Heap Scan on audit0 (cost=83.26..128.35 rows=12 width=160) (actual time=0.401..0.423 rows=12 loops=1)"
" Recheck Cond: ((channelid = 123) AND (granulityid = 4))"
" Heap Blocks: exact=12"
" -> BitmapAnd (cost=83.26..83.26 rows=12 width=0) (actual time=0.394..0.394 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_audit0_channel (cost=0.00..11.12 rows=377 width=0) (actual time=0.055..0.055 rows=377 loops=1)"
" Index Cond: (channelid = 123)"
" -> Bitmap Index Scan on idx_audit0_granulity (cost=0.00..71.89 rows=3146 width=0) (actual time=0.331..0.331 rows=3120 loops=1)"
" Index Cond: (granulityid = 4)"
" CTE s1"
" -> Unique (cost=5.19..5.28 rows=12 width=176) (actual time=0.576..0.581 rows=12 loops=1)"
" -> Sort (cost=5.19..5.22 rows=12 width=176) (actual time=0.576..0.576 rows=12 loops=1)"
" Sort Key: s0.channelid, s0.timevalue, sf.starttimestamp DESC"
" Sort Method: quicksort Memory: 20kB"
" -> Hash Right Join (cost=0.39..4.97 rows=12 width=176) (actual time=0.522..0.552 rows=12 loops=1)"
" Hash Cond: (sf.channelid = s0.channelid)"
" Join Filter: (sf.starttimestamp <= s0.timevalue)"
" -> Seq Scan on svpolfactor sf (cost=0.00..3.68 rows=168 width=20) (actual time=0.006..0.022 rows=168 loops=1)"
" -> Hash (cost=0.24..0.24 rows=12 width=160) (actual time=0.446..0.446 rows=12 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 6kB"
" -> CTE Scan on s0 (cost=0.00..0.24 rows=12 width=160) (actual time=0.403..0.432 rows=12 loops=1)"
"Planning time: 0.448 ms"
"Execution time: 4.510 ms"
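So the problem seems to lie in S1. No significant digit is defined for channelid = 123, so without the LEFT JOIN these records would not be generated at all. But that does not explain why some of them are missing.
Questions
I use a LEFT JOIN to keep the correct cardinality when fetching significant digits, so it must not remove records. After that, it is just arithmetic.
This looks wrong to me. If I do not use a WHERE clause, all records (or combinations) are generated (I know that a JOIN is a kind of WHERE clause), and then the computation takes place. When I do not use an additional WHERE (the original query), I miss rows (as shown in the examples). When I add a WHERE clause to filter, the results differ (which might be fine if post-filtering returned more records than pre-filtering).
Any constructive answer that points out my mistakes and my misunderstanding of the query is welcome. Thank you.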
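Answer 0 (score: 2)
You are probably missing rows because of the DISTINCT ON clause in S1. It appears that you are using it to pick only the most recent applicable row of SVPOLFactor. However, you wrote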
DISTINCT ON(ChannelId, TimeValue)
whereas in query S0, unique rows may also differ by GranulityId and/or QualityCodeId. So, for example, suppose you have rows in both rtu and audit0 with the following columns:
Id | ChannelId | GranulityId | TimeValue | QualityCodeid
----|-----------+-------------+---------------------+---------------
1 | 123 | 4 | 2015-01-01 00:00:00 | 2
2 | 123 | 5 | 2015-01-01 00:00:00 | 2
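Then S0 with no WHERE filtering returns both rows, because they differ in GranulityId. But one of them will be dropped by the DISTINCT ON clause in S1, because they have the same values for ChannelId and TimeValue. Even worse, because you only sort by ChannelId and TimeValue (and StartTimestamp, which does not tell the two records apart), which row is kept and which is dropped is not determined by anything in your query. It is left to chance!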
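One way to check whether this is what happens in your data is to look for (ChannelId, TimeValue) pairs that occur more than once in a dataset. The query below is only a sketch against the datastore.audit0 table from the question. Because of the UNIQUE constraint, any pair it reports must differ in GranulityId or QualityCodeId, and each such group is collapsed to a single row by DISTINCT ON(ChannelId, TimeValue).

```sql
-- List the (ChannelId, TimeValue) groups that DISTINCT ON(ChannelId, TimeValue)
-- would collapse, with the granularities found in each group.
SELECT
    ChannelId,
    TimeValue,
    count(*) AS n_rows,
    array_agg(GranulityId ORDER BY GranulityId) AS granulities
FROM datastore.audit0
GROUP BY ChannelId, TimeValue
HAVING count(*) > 1
ORDER BY n_rows DESC, ChannelId, TimeValue;
```

The second plan in the question is consistent with this: 98020 rows are scanned from audit0, but only 94900 leave the Unique node.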
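In your "post-filtering" example with WHERE ChannelId = 123 AND GranulityId = 4, both rows are in S0. It is then possible, depending on an ordering you do not control, that the DISTINCT ON in S1 filters out row 1 instead of row 2. Row 2 is then filtered out at the end by the WHERE clause, leaving you with neither row. The mistake in the DISTINCT ON clause lets row 2 (which you did not even want to see) eliminate row 1 in an intermediate query.
In your "pre-filtering" example, you filter out row 2 in S0 before it can interfere with row 1, so row 1 makes it to the final query.
One way to stop these rows from being excluded is to expand the DISTINCT ON and ORDER BY clauses to include GranulityId and QualityCodeId: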
DISTINCT ON(ChannelId, TimeValue, GranulityId, QualityCodeId)
-- ...
ORDER BY ChannelId, TimeValue, GranulityId, QualityCodeId, StartTimestamp DESC
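The difference between the two keys can be tried on the two example rows above. This is a minimal sketch, and s0_demo is a hypothetical scratch table standing in for the output of S0.

```sql
-- Hypothetical stand-in for S0, holding the two example rows.
CREATE TEMP TABLE s0_demo (
    Id            INTEGER,
    ChannelId     INTEGER,
    GranulityId   INTEGER,
    TimeValue     TIMESTAMP,
    QualityCodeId INTEGER
);

INSERT INTO s0_demo VALUES
    (1, 123, 4, '2015-01-01 00:00:00', 2),
    (2, 123, 5, '2015-01-01 00:00:00', 2);

-- Narrow key: only one of the two rows survives DISTINCT ON, and which one
-- is not defined, so the outer filter may return row 1 or nothing at all.
SELECT *
FROM (
    SELECT DISTINCT ON (ChannelId, TimeValue) *
    FROM s0_demo
    ORDER BY ChannelId, TimeValue
) AS t
WHERE GranulityId = 4;

-- Full key: both rows survive DISTINCT ON, so the outer filter
-- returns row 1 every time.
SELECT *
FROM (
    SELECT DISTINCT ON (ChannelId, TimeValue, GranulityId, QualityCodeId) *
    FROM s0_demo
    ORDER BY ChannelId, TimeValue, GranulityId, QualityCodeId
) AS t
WHERE GranulityId = 4;
```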
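Of course, if you filter the results of S0 so that they all have the same values for some of these columns, you can omit those columns from the DISTINCT ON. In your example of pre-filtering S0 on ChannelId and GranulityId, this could be: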
DISTINCT ON(TimeValue, QualityCodeId)
-- ...
ORDER BY TimeValue, QualityCodeId, StartTimestamp DESC
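But I doubt that you would save much time doing so, so it is probably safest to keep all of those columns, in case you change the query again some day and forget to change the DISTINCT ON.
I want to mention that the PostgreSQL docs warn about these sorts of problems with DISTINCT ON (emphasis mine):
A set of rows for which all the [DISTINCT ON] expressions are equal are considered duplicates, and only the first row of the set is kept in the output. Note that the "first row" of a set is unpredictable unless the query is sorted on enough columns to guarantee a unique ordering of the rows arriving at the DISTINCT filter. (DISTINCT ON processing occurs after ORDER BY sorting.)
The DISTINCT ON clause is not part of the SQL standard and is sometimes considered bad style because of the potentially indeterminate nature of its results. With judicious use of GROUP BY and subqueries in FROM, this construct can be avoided, but it is often the most convenient alternative.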
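As one illustration of avoiding the construct, a LEFT JOIN LATERAL with ORDER BY ... LIMIT 1 picks at most one factor row per record, so the record count cannot change. This is a sketch rather than a drop-in replacement: rec_demo and factor_demo are hypothetical scratch tables, and LATERAL requires PostgreSQL 9.3 or later.

```sql
-- Hypothetical scratch tables standing in for S0 and settings.SVPOLFactor.
CREATE TEMP TABLE rec_demo (
    ChannelId   INTEGER,
    GranulityId INTEGER,
    TimeValue   TIMESTAMP
);
CREATE TEMP TABLE factor_demo (
    ChannelId      INTEGER,
    StartTimestamp TIMESTAMP,
    Factor         FLOAT
);

INSERT INTO rec_demo VALUES
    (1001, 4, '2014-04-15 00:00:00'),
    (1001, 5, '2014-04-15 00:00:00'),
    (123,  4, '2015-01-01 00:00:00');

INSERT INTO factor_demo VALUES
    (1001, '-infinity',           0.01),
    (1001, '2014-03-01 00:00:00', 0.1);

-- For each record, the lateral subquery returns the latest factor that
-- started on or before the record's TimeValue, or no row at all.
SELECT R.*, SF.Factor
FROM rec_demo AS R
LEFT JOIN LATERAL (
    SELECT F.Factor
    FROM factor_demo AS F
    WHERE F.ChannelId = R.ChannelId
      AND F.StartTimestamp <= R.TimeValue
    ORDER BY F.StartTimestamp DESC
    LIMIT 1
) AS SF ON TRUE;
-- Three rows come back: both channel 1001 records with Factor 0.1,
-- and the channel 123 record with a NULL Factor.
```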
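Answer 1 (score: 2)
You already got the correct answer; this is just an addition. When you calculate the start and end of each period in a derived table, the join returns a single row and you do not need DISTINCT ON (this might be more efficient, too):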
...
FROM S0 LEFT JOIN
(
    SELECT *,
        -- find the next StartTimestamp = end of the current period;
        -- the last period has no successor, so it runs to infinity
        COALESCE(LEAD(StartTimestamp)
                 OVER (PARTITION BY ChannelId
                       ORDER BY StartTimestamp), 'infinity') AS EndTimestamp
    FROM SVPOLFactor AS t
) AS SF
ON (S0.ChannelId = SF.ChannelId)
AND (S0.TimeValue >= SF.StartTimestamp)
AND (S0.TimeValue < SF.EndTimestamp)
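To see what the derived table produces, the sketch below runs the window function over the channel 1001 sample rows from the question. Supplying them inline through VALUES is an assumption made so that the snippet stands alone.

```sql
-- Turn a list of start dates into half-open [StartTimestamp, EndTimestamp) periods.
SELECT
    ChannelId,
    StartTimestamp,
    COALESCE(
        LEAD(StartTimestamp) OVER (PARTITION BY ChannelId ORDER BY StartTimestamp),
        'infinity'
    ) AS EndTimestamp,
    Factor
FROM (
    VALUES
        (1001, TIMESTAMP '-infinity',           0.01),
        (1001, TIMESTAMP '2014-03-01 00:00:00', 0.1),
        (1001, TIMESTAMP '2014-06-01 00:00:00', 1),
        (1001, TIMESTAMP '2014-09-01 00:00:00', 10)
) AS t (ChannelId, StartTimestamp, Factor)
ORDER BY StartTimestamp;
-- Periods: -infinity to 2014-03-01, 2014-03-01 to 2014-06-01,
--          2014-06-01 to 2014-09-01, 2014-09-01 to infinity
```

Because the periods of one channel do not overlap, each TimeValue falls into at most one of them, which is what keeps the join at one row per record.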
Answer 2 (score: 0)
Because of the different order of operations between DISTINCT ON(ChannelId, TimeValue) ... ORDER BY ChannelId, TimeValue, StartTimestamp and WHERE Channelid=123 AND GranulityId=4, these are really two logically different queries. Take a look:
create table sample(
distinctkey int,
orderkey int,
valkey int
);
insert into sample (distinctkey,orderkey,valkey)
select 1,10,150
union all
select 1,20,100;
Here are two queries similar to yours:
select distinctkey, orderkey, valkey
from (
select distinct on(distinctkey) distinctkey, orderkey, valkey
from sample
order by distinctkey, orderkey) t
where distinctkey = 1 and valkey = 100;
returns no rows, whereas
select distinct on(distinctkey) distinctkey, orderkey, valkey
from (
select distinctkey, orderkey,valkey
from sample
where distinctkey = 1 and valkey = 100) t
order by distinctkey, orderkey;
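returns 1 row.
Your queries may return different numbers of rows depending on the data. You should pick the one logic that matches the task you are facing.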