Question

Wizards out there: I am puzzled by the strange behavior of the following SQL query. It consists of the following SQL code (run on PostgreSQL):

WITH temp_song_id_with_all_stemmed_words AS 
    (SELECT song_id FROM stemmed_words
    WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
    GROUP BY song_id
    HAVING COUNT(*)=5)
SELECT *
FROM words
WHERE song_id IN(
    SELECT song_id
    FROM temp_song_id_with_all_stemmed_words
)
ORDER BY song_id, global_position;

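With the amount of data in the tables, it takes about 10 seconds to compute. I tried various ways to optimize this query:

Putting the "with" clause inside the query itself
Putting the "with" clause into a temporary table and then querying it
Indexing every possible column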

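But none of it helped. Computation time still stays in the range of about 10 seconds (assuming everything is already cached in memory... otherwise it can take even a minute).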

Then I noticed that when I split the query into its component parts, things behave completely differently:

SELECT song_id FROM stemmed_words
        WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
        GROUP BY song_id
        HAVING COUNT(*)=5

This query takes about 500 ms to compute, tops, and the result yields 3 IDs.

When I use these results to compute the enclosing query:

SELECT *
FROM words
WHERE song_id IN(337409,328981,304231)
ORDER BY song_id, global_position;

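it takes about 30 ms to complete.

I don't know what is going on here, but I would expect a proper SQL optimizer to do what I did above.

When I look at the EXPLAIN output, I see the following:

-- UPDATE -- Included the EXPLAIN (ANALYZE, VERBOSE) output instead of plain EXPLAIN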

"Merge Join  (cost=20253.29..706336.00 rows=6312654 width=21) (actual time=240731.380..259453.350 rows=356 loops=1)"
"  Output: words.song_id, words.word, words.global_position, words.line_number, words.verse_number"
"  Merge Cond: (words.song_id = temp_song_id_with_all_stemmed_words.song_id)"
"  CTE temp_song_id_with_all_stemmed_words"
"    ->  HashAggregate  (cost=19799.62..19936.11 rows=13649 width=4) (actual time=43.168..44.916 rows=3 loops=1)"
"          Output: stemmed_words.song_id"
"          Group Key: stemmed_words.song_id"
"          Filter: (count(*) = 5)"
"          Rows Removed by Filter: 17181"
"          ->  Bitmap Heap Scan on public.stemmed_words  (cost=474.02..19714.55 rows=17014 width=4) (actual time=10.254..31.899 rows=21099 loops=1)"
"                Output: stemmed_words.stemmed_word, stemmed_words.song_id"
"                Recheck Cond: (stemmed_words.stemmed_word = ANY ('{yesterdai,troubl,seem,awai,believ}'::text[]))"
"                Heap Blocks: exact=12239"
"                ->  Bitmap Index Scan on stemmed_words_pkey  (cost=0.00..469.76 rows=17014 width=0) (actual time=6.052..6.052 rows=21099 loops=1)"
"                      Index Cond: (stemmed_words.stemmed_word = ANY ('{yesterdai,troubl,seem,awai,believ}'::text[]))"
"  ->  Index Scan using words_song_id_global_position_idx on public.words  (cost=0.44..653025.11 rows=12625308 width=21) (actual time=0.117..257820.366 rows=7860598 loops=1)"
"        Output: words.song_id, words.word, words.global_position, words.line_number, words.verse_number"
"  ->  Sort  (cost=316.75..317.25 rows=200 width=4) (actual time=44.953..45.017 rows=274 loops=1)"
"        Output: temp_song_id_with_all_stemmed_words.song_id"
"        Sort Key: temp_song_id_with_all_stemmed_words.song_id"
"        Sort Method: quicksort  Memory: 25kB"
"        ->  HashAggregate  (cost=307.10..309.10 rows=200 width=4) (actual time=44.928..44.929 rows=3 loops=1)"
"              Output: temp_song_id_with_all_stemmed_words.song_id"
"              Group Key: temp_song_id_with_all_stemmed_words.song_id"
"              ->  CTE Scan on temp_song_id_with_all_stemmed_words  (cost=0.00..272.98 rows=13649 width=4) (actual time=43.171..44.921 rows=3 loops=1)"
"                    Output: temp_song_id_with_all_stemmed_words.song_id"
"Planning time: 0.481 ms"
"Execution time: 259454.102 ms"

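But honestly, I don't understand what is going on there... it looks like Chinese to me.

So, to sum up my question: I have a feeling that I can optimize this as a single query instead of splitting it into two separate queries.

Why does it take so long to complete in its current form?
How can I optimize it so that it performs as if I had split it into two separate queries, as I did above?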

Answer 1

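The problem here is that PostgreSQL cannot correctly estimate the number of rows that the CTE (= WITH query) will return.

PostgreSQL estimates 13649 rows, while you tell us that the correct number is 3.

I would expect good results from your second technique (putting the "with" clause into a temporary table and then querying it), as long as you ANALYZE the temporary table between those two operations, because then PostgreSQL knows exactly how many values it has to deal with.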
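
A minimal sketch of the temporary-table approach with ANALYZE, reusing the tables and columns from the question. The table name tmp_song_ids is made up for illustration, and whether the planner then chooses a better plan depends on your data and PostgreSQL version, so check the result with EXPLAIN (ANALYZE).

```sql
-- Step 1: materialize the matching song ids in a temporary table.
-- The name tmp_song_ids is hypothetical, not from the question.
CREATE TEMPORARY TABLE tmp_song_ids AS
SELECT song_id
FROM stemmed_words
WHERE stemmed_word IN ('yesterdai', 'troubl', 'seem', 'awai', 'believ')
GROUP BY song_id
HAVING COUNT(*) = 5;

-- Step 2: collect statistics so the planner sees the real row count
-- (3 rows in the question) rather than a default estimate.
ANALYZE tmp_song_ids;

-- Step 3: run the outer query against the analyzed temporary table.
SELECT *
FROM words
WHERE song_id IN (SELECT song_id FROM tmp_song_ids)
ORDER BY song_id, global_position;

-- Optional cleanup; temporary tables are dropped at session end anyway.
DROP TABLE tmp_song_ids;
```

The explicit ANALYZE matters because autovacuum does not process temporary tables, so without it the planner has no statistics for tmp_song_ids.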
