WITH temp_song_id_with_all_stemmed_words AS
(SELECT song_id FROM stemmed_words
WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
GROUP BY song_id
HAVING COUNT(*)=5)
SELECT *
FROM words
WHERE song_id IN(
SELECT song_id
FROM temp_song_id_with_all_stemmed_words
)
ORDER BY song_id, global_position;
With the amount of data in the tables, this takes about 10 seconds to compute. I tried various ways to optimize this query,
but none of it helped: the computation time stays in the 10-second range (assuming everything is already cached in memory; otherwise it takes over a minute).
Then I noticed that things behave completely differently when I split the query into its component parts:
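For context, one rewrite that is commonly tried in this situation is inlining the CTE as a plain subquery; a minimal sketch, using the same tables and columns as the question (whether the planner treats this differently from the WITH form depends on the PostgreSQL version):

```sql
SELECT w.*
FROM words w
WHERE w.song_id IN (
    SELECT song_id
    FROM stemmed_words
    WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
    GROUP BY song_id
    HAVING COUNT(*) = 5
)
ORDER BY w.song_id, w.global_position;
```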
SELECT song_id FROM stemmed_words
WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
GROUP BY song_id
HAVING COUNT(*)=5
This query takes about 500 ms tops and yields 3 IDs.
When I use those results to compute the enclosing query:
SELECT *
FROM words
WHERE song_id IN(337409,328981,304231)
ORDER BY song_id, global_position;
it completes in about 30 ms.
I don't know what is going on here, but I would think a proper SQL optimizer could do the above automatically.
When I look at the EXPLAIN output, I see the following:
-- UPDATE: the plan below was produced with EXPLAIN (ANALYZE, VERBOSE) rather than plain EXPLAIN
"Merge Join (cost=20253.29..706336.00 rows=6312654 width=21) (actual time=240731.380..259453.350 rows=356 loops=1)"
" Output: words.song_id, words.word, words.global_position, words.line_number, words.verse_number"
" Merge Cond: (words.song_id = temp_song_id_with_all_stemmed_words.song_id)"
" CTE temp_song_id_with_all_stemmed_words"
" -> HashAggregate (cost=19799.62..19936.11 rows=13649 width=4) (actual time=43.168..44.916 rows=3 loops=1)"
" Output: stemmed_words.song_id"
" Group Key: stemmed_words.song_id"
" Filter: (count(*) = 5)"
" Rows Removed by Filter: 17181"
" -> Bitmap Heap Scan on public.stemmed_words (cost=474.02..19714.55 rows=17014 width=4) (actual time=10.254..31.899 rows=21099 loops=1)"
" Output: stemmed_words.stemmed_word, stemmed_words.song_id"
" Recheck Cond: (stemmed_words.stemmed_word = ANY ('{yesterdai,troubl,seem,awai,believ}'::text[]))"
" Heap Blocks: exact=12239"
" -> Bitmap Index Scan on stemmed_words_pkey (cost=0.00..469.76 rows=17014 width=0) (actual time=6.052..6.052 rows=21099 loops=1)"
" Index Cond: (stemmed_words.stemmed_word = ANY ('{yesterdai,troubl,seem,awai,believ}'::text[]))"
" -> Index Scan using words_song_id_global_position_idx on public.words (cost=0.44..653025.11 rows=12625308 width=21) (actual time=0.117..257820.366 rows=7860598 loops=1)"
" Output: words.song_id, words.word, words.global_position, words.line_number, words.verse_number"
" -> Sort (cost=316.75..317.25 rows=200 width=4) (actual time=44.953..45.017 rows=274 loops=1)"
" Output: temp_song_id_with_all_stemmed_words.song_id"
" Sort Key: temp_song_id_with_all_stemmed_words.song_id"
" Sort Method: quicksort Memory: 25kB"
" -> HashAggregate (cost=307.10..309.10 rows=200 width=4) (actual time=44.928..44.929 rows=3 loops=1)"
" Output: temp_song_id_with_all_stemmed_words.song_id"
" Group Key: temp_song_id_with_all_stemmed_words.song_id"
" -> CTE Scan on temp_song_id_with_all_stemmed_words (cost=0.00..272.98 rows=13649 width=4) (actual time=43.171..44.921 rows=3 loops=1)"
" Output: temp_song_id_with_all_stemmed_words.song_id"
"Planning time: 0.481 ms"
"Execution time: 259454.102 ms"
But honestly, I don't understand what is happening there... it's all Greek to me.
So, to summarize my question: I have a feeling it should be possible to optimize this into a single query instead of splitting it into two separate ones.
Answer 0 (score: 2)
The problem here is that PostgreSQL cannot correctly estimate how many rows the CTE (= WITH query) will return.
PostgreSQL estimates 13649 rows, while you tell us the correct number is 3.
I would expect your second technique (materializing the WITH clause into a temporary table and then querying it) to give good results, provided you ANALYZE
the temporary table between the two operations, because then PostgreSQL knows exactly how many values it has to deal with.
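A minimal sketch of that two-step approach, reusing the tables and filter from the question (`matching_songs` is a hypothetical name for the temporary table):

```sql
-- Step 1: materialize the small result set into a temporary table.
CREATE TEMP TABLE matching_songs AS
SELECT song_id
FROM stemmed_words
WHERE stemmed_word IN ('yesterdai','troubl','seem','awai','believ')
GROUP BY song_id
HAVING COUNT(*) = 5;

-- Step 2: gather statistics so the planner sees the real row count
-- instead of the default estimate for a table it knows nothing about.
ANALYZE matching_songs;

-- Step 3: with accurate statistics, the planner can choose a plan
-- suited to a handful of song_ids rather than millions of rows.
SELECT *
FROM words
WHERE song_id IN (SELECT song_id FROM matching_songs)
ORDER BY song_id, global_position;
```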