Question

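I am writing a node.js application that enables searching a PostgreSQL database. To enable Twitter typeahead in the search box, I have to fetch a set of keywords from the database to initialize Bloodhound before the page loads. The query looks like this: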

SELECT distinct handlerid from lotintro where char_length(lotid)=7;

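For a large table (lotintro) this is expensive; it is also silly, because the query result most likely stays the same for different web visitors over a period of time.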

What is the proper way to handle this? I am considering a few options:

1) Put the query in a stored procedure and call it from node.js:

   SELECT * from getallhandlerid()

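Does this mean the query will be compiled and the database will automatically return the same result set without actually running the query, knowing that the result has not changed?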

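2) Or, create a separate table to store the distinct handlerid values and update it with a trigger that runs every day? (I know that ideally the trigger should run on every insert/update to the table, but that would cost too much.)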

3) Create a partial index as suggested. Here is what I gathered:

Query

SELECT distinct handlerid from lotintro where length(lotid) = 7;

Index

CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE  length(lotid) = 7;

With the index, the query takes about 250 ms:

explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=5542.64..5542.65 rows=1 width=6) (actual rows=151 loops=1)"
"  ->  Bitmap Heap Scan on lotintro  (cost=39.08..5537.50 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Recheck Cond: (length(lotid) = 7)"
"        Rows Removed by Index Recheck: 55285"
"        ->  Bitmap Index Scan on lotid7_idx  (cost=0.00..38.57 rows=2056 width=0) (actual rows=298350 loops=1)"
"Total runtime: 243.686 ms"

Without the index, the query takes about 210 ms:

explain (analyze on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
"  ->  Seq Scan on lotintro  (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Filter: (length(lotid) = 7)"
"        Rows Removed by Filter: 112915"
"Total runtime: 214.235 ms"

What am I doing wrong here?

4) Using the index and query suggested by alexius:

create index on lotintro using btree(char_length(lotid), handlerid);

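But it is not the optimal solution. Since there are only a few distinct values, you can use a trick called loose index scan, which should run faster in your case: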

explain (analyze on, BUFFERS on, TIMING OFF)
WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1)  -- parentheses required
   UNION ALL
   SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
   FROM t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;

"CTE Scan on t  (cost=444.52..446.54 rows=100 width=32) (actual rows=151 loops=1)"
"  Filter: (handlerid IS NOT NULL)"
"  Rows Removed by Filter: 1"
"  Buffers: shared hit=608"
"  CTE t"
"    ->  Recursive Union  (cost=0.42..444.52 rows=101 width=32) (actual rows=152 loops=1)"
"          Buffers: shared hit=608"
"          ->  Limit  (cost=0.42..4.17 rows=1 width=6) (actual rows=1 loops=1)"
"                Buffers: shared hit=4"
"                ->  Index Scan using lotid_btree on lotintro lotintro_1  (cost=0.42..7704.41 rows=2056 width=6) (actual rows=1 loops=1)"
"                      Index Cond: (char_length(lotid) = 7)"
"                      Buffers: shared hit=4"
"          ->  WorkTable Scan on t t_1  (cost=0.00..43.83 rows=10 width=32) (actual rows=1 loops=152)"
"                Filter: (handlerid IS NOT NULL)"
"                Rows Removed by Filter: 0"
"                Buffers: shared hit=604"
"                SubPlan 1"
"                  ->  Limit  (cost=0.42..4.36 rows=1 width=6) (actual rows=1 loops=151)"
"                        Buffers: shared hit=604"
"                        ->  Index Scan using lotid_btree on lotintro  (cost=0.42..2698.13 rows=685 width=6) (actual rows=1 loops=151)"
"                              Index Cond: ((char_length(lotid) = 7) AND (handlerid > t_1.handlerid))"
"                              Buffers: shared hit=604"
"Planning time: 1.574 ms"
**"Execution time: 25.476 ms"**

=========More info about the db============================

dataloggerDB=# \d lotintro
Table "public.lotintro"

    Column    |            Type             |  Modifiers
 --------------+-----------------------------+--------------
  lotstartdt   | timestamp without time zone | not null
  lotid        | text                        | not null
  ftc          | text                        | not null
  deviceid     | text                        | not null
  packageid    | text                        | not null
  testprogname | text                        | not null
  testprogdir  | text                        | not null
  testgrade    | text                        | not null
  testgroup    | text                        | not null
  temperature  | smallint                    | not null
  testerid     | text                        | not null
  handlerid    | text                        | not null
  numofsite    | text                        | not null
  masknum      | text                        |
  soaktime     | text                        |
  xamsqty      | smallint                    |
  scd          | text                        |
  speedgrade   | text                        |
  loginid      | text                        |
  operatorid   | text                        | not null
  loadboardid  | text                        | not null
  checksum     | text                        |
  lotenddt     | timestamp without time zone | not null
  totaltest    | integer                     | default (-1)
  totalpass    | integer                     | default (-1)
  earnhour     | real                        | default 0
  avetesttime  | real                        | default 0
  Indexes:
  "pkey_lotintro" PRIMARY KEY, btree (lotstartdt, testerid)
  "lotid7_idx" btree (handlerid) WHERE length(lotid) = 7

your version of Postgres,         [PostgreSQL 9.2]
cardinalities (how many rows?),   [411K rows for table lotintro]
percentage for length(lotid) = 7. [298350/411000=  73%]

=============After porting everything to PG 9.4=====================

With the index:

explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=5542.78..5542.79 rows=1 width=6) (actual rows=151 loops=1)"
"  Group Key: handlerid"
"  Buffers: shared hit=14242"
"  ->  Bitmap Heap Scan on lotintro  (cost=39.22..5537.64 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Recheck Cond: (length(lotid) = 7)"
"        Heap Blocks: exact=13313"
"        Buffers: shared hit=14242"
"        ->  Bitmap Index Scan on lotid7_idx  (cost=0.00..38.70 rows=2056 width=0) (actual rows=298350 loops=1)"
"              Buffers: shared hit=929"
"Planning time: 0.256 ms"
"Execution time: 154.657 ms"

Without the index:

explain (analyze on, BUFFERS on, TIMING OFF) SELECT distinct handlerid from lotintro where length(lotid) = 7

"HashAggregate  (cost=19490.11..19490.12 rows=1 width=6) (actual rows=151 loops=1)"
"  Group Key: handlerid"
"  Buffers: shared hit=13316"
"  ->  Seq Scan on lotintro  (cost=0.00..19484.97 rows=2056 width=6) (actual rows=298350 loops=1)"
"        Filter: (length(lotid) = 7)"
"        Rows Removed by Filter: 112915"
"        Buffers: shared hit=13316"
"Planning time: 0.168 ms"
"Execution time: 176.466 ms"

Answer 1

You need to index the exact expression used in your WHERE clause: http://www.postgresql.org/docs/9.4/static/indexes-expressional.html

CREATE INDEX char_length_lotid_idx ON lotintro (char_length(lotid));

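You can also create a STABLE or IMMUTABLE function to wrap this query, as you suggested: http://www.postgresql.org/docs/9.4/static/sql-createfunction.html

Your last suggestion is also viable. What you are looking for is MATERIALIZED VIEWS: http://www.postgresql.org/docs/9.4/static/sql-creatematerializedview.html This saves you from writing custom triggers to update a denormalized table.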
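
A minimal sketch of the materialized-view route, assuming PostgreSQL 9.4 (materialized views arrived in 9.3, REFRESH ... CONCURRENTLY in 9.4). The view name distinct_handlerid is made up for illustration, and how often to refresh is left to you:

```sql
-- Hypothetical view name; the query is the one from the question.
CREATE MATERIALIZED VIEW distinct_handlerid AS
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  char_length(lotid) = 7;

-- REFRESH ... CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX ON distinct_handlerid (handlerid);

-- The view is not kept up to date automatically; re-run this on a
-- schedule that suits the data (for example once a day from cron).
REFRESH MATERIALIZED VIEW CONCURRENTLY distinct_handlerid;

-- The page-load query then reads the small precomputed result.
SELECT handlerid FROM distinct_handlerid;
```

With CONCURRENTLY, readers are not blocked while the view is being refreshed, which seems to fit a page that every visitor loads.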

Answer 2

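Since 3/4 of the rows satisfy your condition (length(lotid) = 7), an index by itself won't help. You may get better performance with this index because of index-only scans: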

create index on lotintro using btree(char_length(lotid), handlerid);

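But this is not the optimal solution. Since there are only a few distinct values, you can use a trick called loose index scan, which should work faster in your case: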

WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 ORDER BY handlerid LIMIT 1)  -- parentheses required
   UNION ALL
   SELECT (SELECT handlerid FROM lotintro WHERE char_length(lotid)=7 AND handlerid > t.handlerid ORDER BY handlerid LIMIT 1)
   FROM t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t WHERE handlerid IS NOT NULL;

For this query you also need to create the index mentioned above.

Answer 3

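1)

No, a function does not keep a snapshot of the result in any way. There is some potential for performance optimization if you define the function STABLE (which would be correct). Per documentation:

A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement.

IMMUTABLE would be wrong here and could cause errors.

So this can hugely benefit multiple calls within the same statement - but that doesn't fit your use case ...

plpgsql functions work like prepared statements, giving you a similar bonus within the same session: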

Difference between language sql and language plpgsql in PostgreSQL functions
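
For reference, a STABLE wrapper for the query in the question could look like the sketch below. The function name getallhandlerid comes from the question; the body and return type are assumptions, since the original function was not shown:

```sql
-- Assumed definition: returns the distinct handlerid values as a set of text.
CREATE OR REPLACE FUNCTION getallhandlerid()
  RETURNS SETOF text
  LANGUAGE sql STABLE AS
$func$
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  length(lotid) = 7;
$func$;

-- Called as in the question. The underlying query still runs on every call;
-- STABLE does not cache results across statements.
SELECT * FROM getallhandlerid();
```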

2)

Try a MATERIALIZED VIEW. With or without an MV (or some other caching technique), a partial index would be most efficient for your special case:

CREATE INDEX lotid7_idx ON lotintro (handlerid)
WHERE  length(lotid) = 7;

Remember to include the index condition in queries that are supposed to use the index, even if that seems redundant:

PostgreSQL does not use a partial index
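
To illustrate with the partial index lotid7_idx defined above, here is a sketch of two queries you could compare with EXPLAIN. As far as I know, the planner matches the index predicate by expression rather than by meaning, so the second form, written with char_length(), should not be recognized as implying length(lotid) = 7, even though both functions return the same value for text. Treat that as an assumption to verify on your own installation:

```sql
-- Predicate written exactly as in the index definition:
-- lotid7_idx is a candidate for this query.
EXPLAIN
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  length(lotid) = 7;

-- Same rows, but written with char_length(): assumed not to match the
-- partial index predicate, so lotid7_idx is not expected in this plan.
EXPLAIN
SELECT DISTINCT handlerid
FROM   lotintro
WHERE  char_length(lotid) = 7;
```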

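However, as you supplied:

percentage for length(lotid) = 7. [298350/411000 = 73%]

That index will only help if you can get an index-only scan out of it, because the condition is hardly selective. Since the table has very wide rows, index-only scans can be substantially faster.

Loose index scan

Also, rows=298350 are folded to rows=151, so a loose index scan will pay, as I explained here: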

Optimize GROUP BY query to retrieve latest record per user

Or in the Postgres Wiki - which is actually based on this post.

WITH RECURSIVE t AS (
   (SELECT handlerid FROM lotintro
    WHERE  length(lotid) = 7
    ORDER  BY 1 LIMIT 1)

   UNION ALL
   SELECT (SELECT handlerid FROM lotintro
           WHERE  length(lotid) = 7
           AND    handlerid > t.handlerid
           ORDER  BY 1 LIMIT 1)
   FROM  t
   WHERE t.handlerid IS NOT NULL
   )
SELECT handlerid FROM t
WHERE  handlerid IS NOT NULL;

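This will be faster yet in combination with the partial index I suggested. Since the partial index is only about half the size and updated less often (depending on access patterns), it is cheaper overall.

It is faster still if you keep the table vacuumed to allow index-only scans. If you have a lot of writes, you can set more aggressive storage parameters for just this table: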

PostgreSQL Initial Database Size
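
A sketch of what per-table settings might look like. The parameter names are standard autovacuum storage parameters; the values are illustrative assumptions and not tuned for this workload:

```sql
-- Make autovacuum visit lotintro more often than the global defaults would.
-- The numbers are placeholders to adjust after observing the write load.
ALTER TABLE lotintro SET (
  autovacuum_vacuum_scale_factor  = 0.02,
  autovacuum_analyze_scale_factor = 0.01
);

-- One-off manual run to update the visibility map and planner statistics.
VACUUM ANALYZE lotintro;
```

Index-only scans depend on the visibility map, which VACUUM maintains, so more frequent vacuuming means fewer heap visits for this query.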

Finally, you can get this even faster with a materialized view based on this query.
