Question

I have two tables, conttagtable (t) and contfloattable (cf). t has about 43k rows; cf has over 9 billion.

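I created an index on the tagindex column of both tables. This column can be thought of as a unique identifier for conttagtable and as a foreign key into conttagtable for contfloattable. I did not explicitly create a PK or foreign key on either table relating it to the other, although the data is logically related through the tagindex column on both tables, as if conttagtable.tagindex were a PRIMARY KEY and contfloattable.tagindex were a FOREIGN KEY (tagindex) REFERENCES conttagtable(tagindex). The data came from a Microsoft Access dump, and I did not know whether I could trust tagindex to be unique, so uniqueness is not enforced.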

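The data itself is extremely large.

For each conttagtable.tagid, for every 15-minute interval of contfloattable.dateandtime, I need to get a single arbitrarily chosen row from contfloattable. So if a given tag has 4000 samples spanning 30 minutes, I need one sample from the 0-14 minute range and one sample from the 15-30 minute range. Any one sample within the 15-minute range is acceptable: first, last, random, whatever.

In short, I need one sample per 15 minutes, but only one sample per t.tagname. The samples are currently recorded every 5 seconds, and the data spans two years. This is a big-data problem and way over my head as far as SQL goes. Every time-interval solution I have tried from Googling or searching SO has produced query times so long that they are not practical.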

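Are my indexes sufficient for a fast join? (They appear to be when I leave out the time-interval part.)
Would I benefit from adding any other indexes?
What is the best/fastest query that accomplishes the above?

Here is an SQLFiddle containing the schema and some sample data: http://sqlfiddle.com/#!1/c7d2f/2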

Schema:

        Table "public.conttagtable" (t)
   Column    |  Type   | Modifiers
-------------+---------+-----------
 tagname     | text    |
 tagindex    | integer |
 tagtype     | integer |
 tagdatatype | integer |
Indexes:
    "tagindex" btree (tagindex)


             Table "public.contfloattable" (CF)
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 dateandtime | timestamp without time zone |
 millitm     | integer                     |
 tagindex    | integer                     |
 Val         | double precision            |
 status      | text                        |
 marker      | text                        |
Indexes:
    "tagindex_contfloat" btree (tagindex)

The output I would like to see is something like this:

tagid

...etc...

Following Clodoaldo's suggestion, this is my latest attempt; any suggestions to speed it up?

Query plan for the above: http://explain.depesz.com/s/loR

Answer 1

For 15-minute intervals:

with i as (
    select cf.tagindex, min(dateandtime) dateandtime
    from contfloattable cf
    group by
        floor(extract(epoch from dateandtime) / 60 / 15),
        cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from
    contfloattable cf
    inner join
    conttagtable t on cf.tagindex = t.tagindex
    inner join
    i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by cf.dateandtime, t.tagname

Show the explain output for this query (if it works) so we can try to optimize it. You can post it in this answer.

Explain output

"Sort  (cost=15102462177.06..15263487805.24 rows=64410251271 width=57)"
"  Sort Key: cf.dateandtime, t.tagname"
"  CTE i"
"    ->  HashAggregate  (cost=49093252.56..49481978.32 rows=19436288 width=12)"
"          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
"  ->  Hash Join  (cost=270117658.06..1067549320.69 rows=64410251271 width=57)"
"        Hash Cond: (cf.tagindex = t.tagindex)"
"        ->  Merge Join  (cost=270117116.39..298434544.23 rows=1408582784 width=25)"
"              Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
"              ->  Sort  (cost=2741707.02..2790297.74 rows=19436288 width=12)"
"                    Sort Key: i.tagindex, i.dateandtime"
"                    ->  CTE Scan on i  (cost=0.00..388725.76 rows=19436288 width=12)"
"              ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
"                    ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
"                          Sort Key: cf.tagindex, cf.dateandtime"
"                          ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
"        ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
"              ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"

It looks like you need this index:

create index cf_tag_datetime on contfloattable (tagindex, dateandtime)

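Run analyze after creating it. Note that any index on a big table has a significant performance impact on data changes (inserts and so on), because it has to be updated on every change.

Update

I added the cf_tag_datetime index (tagindex, dateandtime), and here is the new explain: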

"Sort  (cost=15349296514.90..15512953953.25 rows=65462975340 width=57)"
"  Sort Key: cf.dateandtime, t.tagname"
"  CTE i"
"    ->  HashAggregate  (cost=49093252.56..49490287.76 rows=19851760 width=12)"
"          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
"  ->  Hash Join  (cost=270179293.86..1078141313.22 rows=65462975340 width=57)"
"        Hash Cond: (cf.tagindex = t.tagindex)"
"        ->  Merge Join  (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
"              Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
"              ->  Sort  (cost=2803342.82..2852972.22 rows=19851760 width=12)"
"                    Sort Key: i.tagindex, i.dateandtime"
"                    ->  CTE Scan on i  (cost=0.00..397035.20 rows=19851760 width=12)"
"              ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
"                    ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
"                          Sort Key: cf.tagindex, cf.dateandtime"
"                          ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
"        ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
"              ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"

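The estimated cost seems to have gone up :( However, if I remove the order by clause (not exactly what I need, but it would work), this is what happens, a big reduction: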

"Hash Join  (cost=319669581.62..1127631600.98 rows=65462975340 width=57)"
"  Hash Cond: (cf.tagindex = t.tagindex)"
"  CTE i"
"    ->  HashAggregate  (cost=49093252.56..49490287.76 rows=19851760 width=12)"
"          ->  Seq Scan on contfloattable cf  (cost=0.00..38528881.68 rows=1408582784 width=12)"
"  ->  Merge Join  (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
"        Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
"        ->  Sort  (cost=2803342.82..2852972.22 rows=19851760 width=12)"
"              Sort Key: i.tagindex, i.dateandtime"
"              ->  CTE Scan on i  (cost=0.00..397035.20 rows=19851760 width=12)"
"        ->  Materialize  (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
"              ->  Sort  (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
"                    Sort Key: cf.tagindex, cf.dateandtime"
"                    ->  Seq Scan on contfloattable cf  (cost=0.00..24443053.84 rows=1408582784 width=21)"
"  ->  Hash  (cost=335.74..335.74 rows=16474 width=44)"
"        ->  Seq Scan on conttagtable t  (cost=0.00..335.74 rows=16474 width=44)"

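I have not tried this index yet... but I will. Stand by.

Looking at it again, I think the inverse index could be even better, since it can be used not only for the Merge Join but also for the final Sort: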

create index cf_datetime_tag on contfloattable (dateandtime, tagindex)
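
For comparison, PostgreSQL's DISTINCT ON can pick one row per (tag, 15-minute bucket) without joining the aggregate back to the table. This is a minimal sketch and not a tested replacement for the query in this answer: cf_demo, cf_demo_sampled and the sample rows are hypothetical stand-ins for contfloattable, and whether this form is any faster on the full data set is an open question.

```sql
-- Hypothetical miniature of contfloattable; only the columns used here.
CREATE TEMP TABLE cf_demo (
    dateandtime timestamp without time zone,
    tagindex    integer,
    "Val"       double precision
);

INSERT INTO cf_demo VALUES
    ('2012-01-01 00:00:05', 1, 1.0),
    ('2012-01-01 00:07:00', 1, 1.1),   -- same 15-minute bucket as the row above
    ('2012-01-01 00:15:00', 1, 1.2),   -- start of the next bucket
    ('2012-01-01 00:29:55', 1, 1.3),   -- same bucket as 00:15:00
    ('2012-01-01 00:03:00', 2, 2.0);

-- DISTINCT ON keeps the first row of each (tagindex, bucket) group in
-- ORDER BY order, which here is the earliest sample in the bucket.
-- Dividing by 900 is the same bucketing as / 60 / 15.
CREATE TEMP VIEW cf_demo_sampled AS
SELECT DISTINCT ON (tagindex, floor(extract(epoch FROM dateandtime) / 900))
       dateandtime, tagindex, "Val"
FROM cf_demo
ORDER BY tagindex, floor(extract(epoch FROM dateandtime) / 900), dateandtime;

SELECT * FROM cf_demo_sampled ORDER BY tagindex, dateandtime;
```

The leading ORDER BY expressions have to match the DISTINCT ON list, so the final presentation order (by dateandtime, then tagname) would need an outer query.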

Answer 2

Here is another formulation. I am quite curious to see how it scales on the full data set. Create this index first:

CREATE INDEX contfloattable_tag_and_timeseg
ON contfloattable(tagindex, (floor(extract(epoch FROM dateandtime) / 60 / 15) ));

Then run this with as much work_mem as you can afford:

SELECT 
  (first_value(x) OVER (PARTITION BY x.tagindex, floor(extract(epoch FROM x.dateandtime) / 60 / 15))).*,
  (SELECT t.tagname FROM conttagtable t WHERE t.tagindex = x.tagindex) AS tagname
FROM contfloattable x ORDER BY dateandtime, tagname;

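Sneaky Wombat: Explain from the above SQL on the full data set (without the suggested index): http://explain.depesz.com/s/kGo

Alternatively, here is one that should need only a single sequential pass over contfloattable, with the values collected into a tuplestore that is then JOINed to get the tag name. It requires lots of work_mem: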

SELECT cf.dateandtime, cf.dataVal, cf.status, t.tagname
FROM 
  (
    SELECT (first_value(x) OVER (PARTITION BY x.tagindex, floor(extract(epoch FROM x.dateandtime) / 60 / 15))).*
    FROM contfloattable x
  ) cf
  INNER JOIN
  conttagtable t ON cf.tagindex = t.tagindex
ORDER BY cf.dateandtime, t.tagname;

Sneaky Wombat: Explain from the above SQL on the full data set (without the suggested index): http://explain.depesz.com/s/57q
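
One thing worth checking on a small sample before running either window-function query at full scale: a window function emits one output row per input row, so every row in a bucket repeats that bucket's first_value. The sketch below shows the effect and one possible way to collapse it with DISTINCT. cf_win_demo and its rows are hypothetical, and the ORDER BY inside the window is an added assumption that makes the chosen row deterministic.

```sql
-- Hypothetical miniature of contfloattable.
CREATE TEMP TABLE cf_win_demo (
    dateandtime timestamp without time zone,
    tagindex    integer
);

INSERT INTO cf_win_demo VALUES
    ('2012-01-01 00:00:05', 1),
    ('2012-01-01 00:07:00', 1),
    ('2012-01-01 00:15:00', 1),
    ('2012-01-01 00:29:55', 1),
    ('2012-01-01 00:03:00', 2);

-- Five input rows give five output rows: each row carries the first row
-- of its (tagindex, 15-minute bucket) partition.
SELECT (first_value(x) OVER w).*
FROM cf_win_demo x
WINDOW w AS (PARTITION BY x.tagindex,
                          floor(extract(epoch FROM x.dateandtime) / 60 / 15)
             ORDER BY x.dateandtime);

-- DISTINCT collapses the repeats to one row per bucket (three rows here).
SELECT DISTINCT (first_value(x) OVER w).*
FROM cf_win_demo x
WINDOW w AS (PARTITION BY x.tagindex,
                          floor(extract(epoch FROM x.dateandtime) / 60 / 15)
             ORDER BY x.dateandtime);
```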

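If it works, you will want to throw as much work_mem at the query as you can. You have not mentioned how much RAM your system has, but you will want a decent chunk of it; try: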

SET work_mem = '500MB';

...or more, if you have at least 4GB of RAM and are on a 64-bit CPU. Again, I would be really interested to see how it performs on the full data set.
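
If raising work_mem for the whole session is a concern, one option is to scope it to a single transaction with SET LOCAL. This is a sketch: the 500MB figure is simply the value suggested above, not a recommendation for any particular machine.

```sql
-- Raise work_mem for one transaction only; the setting reverts at
-- COMMIT or ROLLBACK.
BEGIN;
SET LOCAL work_mem = '500MB';
SHOW work_mem;      -- confirm the value inside the transaction
-- ... run the large query here ...
COMMIT;

-- Outside the transaction the session's previous value applies again.
SHOW work_mem;
```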

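By the way, for these queries to be correct I recommend that you ALTER TABLE conttagtable ADD PRIMARY KEY (tagindex); and then DROP INDEX t_tagindex;. It will take some time, because it builds a unique index. Most of the queries mentioned here assume that t.tagindex is unique in conttagtable, and that really should be enforced. A unique index can be used for additional optimisations that the old non-unique t_tagindex cannot, and it produces much better statistics estimates.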
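
Since the question says tagindex may not be trustworthy as a unique key, it may be worth listing the offending values before adding the primary key. This is a minimal sketch: tag_demo and its rows are hypothetical, and the same SELECT would be pointed at conttagtable in practice.

```sql
-- Hypothetical stand-in for conttagtable: one duplicated tagindex and one NULL.
CREATE TEMP TABLE tag_demo (tagname text, tagindex integer);

INSERT INTO tag_demo VALUES
    ('a', 1),
    ('b', 2),
    ('b_again', 2),
    ('c', NULL);

-- Values that would make ADD PRIMARY KEY (tagindex) fail:
-- duplicates, plus NULLs (a primary key implies NOT NULL).
SELECT tagindex, count(*) AS n
FROM tag_demo
GROUP BY tagindex
HAVING count(*) > 1 OR tagindex IS NULL;
```

An empty result means the column is already unique and non-null, so the ALTER TABLE should succeed.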

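Also, when comparing query plans, note that cost is not necessarily strictly proportional to real-world execution time. If the estimates are good it should correlate roughly, but estimates are only that. Sometimes you will see a high-cost plan execute faster than a supposedly low-cost plan, for reasons such as bad row-count or index-selectivity estimates, limits on the query planner's ability to infer relationships, unexpected correlations, or cost parameters such as random_page_cost that do not match the real system.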
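
One way to see how far estimates drift from reality is to run EXPLAIN (ANALYZE) on a small sample, which prints actual row counts and timings next to the planner's estimates. In this sketch cf_plan_demo is a hypothetical table generated to mimic two tags sampled every 5 seconds for one day; on the real table, EXPLAIN ANALYZE executes the query in full, so it is best tried on a subset first.

```sql
-- Hypothetical sample: 2 tags, one reading every 5 seconds for one day each.
CREATE TEMP TABLE cf_plan_demo AS
SELECT ts AS dateandtime, tag AS tagindex
FROM generate_series(timestamp '2012-01-01 00:00:00',
                     timestamp '2012-01-01 23:59:55',
                     interval '5 seconds') AS ts,
     generate_series(1, 2) AS tag;

-- Refresh planner statistics so the estimates are based on the data.
ANALYZE cf_plan_demo;

-- Plain EXPLAIN shows estimates only; ANALYZE runs the query and adds
-- actual rows and timing, BUFFERS adds block-level I/O counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT tagindex, min(dateandtime)
FROM cf_plan_demo
GROUP BY tagindex, floor(extract(epoch FROM dateandtime) / 900);
```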

Efficiently querying a huge time series table for one row every 15 minutes

2 answers: