Column order in an INNER JOIN condition drastically affects performance

Time: 2018-12-13 02:34:07

Tags: sql ruby-on-rails postgresql

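I have two tables that are linked to each other, as follows:

Table answered_questions, with the following columns and indexes:

  • id: primary key
  • taken_test_id: integer (foreign key)
  • question_id: integer (foreign key, links to another table named questions)
  • indexes: (taken_test_id, question_id)

Table taken_tests:

  • id: primary key
  • user_id: (foreign key, links to the table Users)
  • indexes: user_id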
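
For readers who want to reproduce the setup, the schema described above could look roughly like the following. This is only a sketch: the table, column, and index names come from the question and its EXPLAIN output, but the column types and the foreign-key constraint are assumptions, and the remaining columns are omitted.

```sql
-- Sketch of the schema described above.
-- Column types and the REFERENCES clause are assumed, not taken from the question.
CREATE TABLE taken_tests (
    id      serial PRIMARY KEY,
    user_id integer                 -- points at the users table in the real app
);

CREATE TABLE answered_questions (
    id            serial PRIMARY KEY,
    taken_test_id integer REFERENCES taken_tests (id),
    question_id   integer           -- points at the questions table in the real app
);

-- Index names as they appear in the EXPLAIN output.
CREATE INDEX index_taken_tests_on_user_id
    ON taken_tests (user_id);

CREATE INDEX index_answered_questions_on_taken_test_id_and_question_id
    ON answered_questions (taken_test_id, question_id);
```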

First query (with EXPLAIN ANALYZE output):

EXPLAIN ANALYZE 
SELECT 
  "answered_questions".* 
FROM 
  "answered_questions" 
  INNER JOIN "taken_tests" ON "answered_questions"."taken_test_id" = "taken_tests"."id" 
WHERE 
  "taken_tests"."user_id" = 1;

Output:

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.025..2.208 rows=653 loops=1)
   ->  Index Scan using index_taken_tests_on_user_id on taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)
         Index Cond: (user_id = 1)
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=0.002..0.003 rows=2 loops=371)
         Index Cond: (taken_test_id = taken_tests.id)
 Planning time: 0.276 ms
 Execution time: 2.365 ms
(7 rows)

The other query (Rails generates this one automatically when the joins method is used in ActiveRecord):

EXPLAIN ANALYZE 
SELECT 
  "answered_questions".* 
FROM 
  "answered_questions" 
  INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id" 
WHERE 
  "taken_tests"."user_id" = 1;

Output:

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=23.611..1257.807 rows=653 loops=1)
   ->  Index Scan using index_taken_tests_on_user_id on taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)
         Index Cond: (user_id = 1)
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=2.071..3.195 rows=2 loops=371)
         Index Cond: (taken_test_id = taken_tests.id)
 Planning time: 0.302 ms
 Execution time: 1258.035 ms
(7 rows)

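The only difference is the column order in the INNER JOIN condition. In the first query it is ON "answered_questions"."taken_test_id" = "taken_tests"."id", whereas in the second it is ON "taken_tests"."id" = "answered_questions"."taken_test_id". Yet the execution times differ enormously.

Do you know why this happens? I have read some articles saying that the column order in a JOIN condition should not affect execution time (for example: Best practices for the order of joined columns in a sql join?).

I am using Postgres 9.6. The answered_questions table has more than 40 million rows, and the taken_tests table has more than 3 million rows.

Update 1:

When I run EXPLAIN with (analyze true, verbose true, buffers true), the second query performs much better (very similar to the first query):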

EXPLAIN (ANALYZE TRUE, VERBOSE TRUE, BUFFERS TRUE) 
SELECT
  "answered_questions".* 
FROM
  "answered_questions"
  INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id" 
WHERE
  "taken_tests"."user_id" = 1;

Output:

Nested Loop  (cost=0.99..116504.61 rows=1472 width=61) (actual time=0.030..2.192 rows=653 loops=1)
   Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
   Buffers: shared hit=1986
   ->  Index Scan using index_taken_tests_on_user_id on public.taken_tests  (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.441 rows=371 loops=1)
         Output: taken_tests.id
         Index Cond: (taken_tests.user_id = 1)
         Buffers: shared hit=269
   ->  Index Scan using index_answered_questions_on_taken_test_id_and_question_id on public.answered_questions  (cost=0.56..1273.61 rows=365 width=61) (actual time=0.002..0.003 rows=2 loops=371)
         Output: answered_questions.id, answered_questions.question_id, answered_questions.answer_text, answered_questions.created_at, answered_questions.updated_at, answered_questions.taken_test_id, answered_questions.correct, answered_questions.answer
         Index Cond: (answered_questions.taken_test_id = taken_tests.id)
         Buffers: shared hit=1717
 Planning time: 0.238 ms
 Execution time: 2.335 ms

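1 answer:

Answer 0 (score: 1)

As you can see from the results of the original EXPLAIN ANALYZE statements, the two queries produce equivalent query plans and are executed in exactly the same way.

The difference is in the execution time of the same plan node: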

-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=0.014..0.483 rows=371 loops=1)

-> Index Scan using index_taken_tests_on_user_id on taken_tests (cost=0.43..274.18 rows=91 width=4) (actual time=10.451..71.474 rows=371 loops=1)

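As commenters have already pointed out (see the documentation link in the question comments), the query plan for an inner join should not depend on the order in which the tables are written; the query planner decides the join order. This means you should really look at other parts of query execution that affect performance. One of them is the memory used for caching (shared buffers). It looks like the query time depends heavily on whether the data is already loaded in memory. As you have already noticed, the execution time goes up after you wait for a while. That points to a cache eviction problem rather than a planning problem. Increasing the size of shared buffers may help, but the first execution of a query will always take longer; that is simply your disk access speed.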

For more tips on memory configuration for a Pg database, see: https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
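
One way to check whether caching explains the gap is to compare buffer hits against buffer reads. The fast run in Update 1 reports only "shared hit" pages, which is consistent with everything being served from memory. The statements below are a minimal sketch of that check, not a definitive diagnosis; they assume you can run them on the same database and that the statistics collector is enabled.

```sql
-- 1. Configured size of PostgreSQL's own page cache.
SHOW shared_buffers;

-- 2. Re-run the slow query with buffer statistics.
--    "shared hit"  = pages found in shared buffers
--    "shared read" = pages that had to be requested from the OS cache or disk
EXPLAIN (ANALYZE, BUFFERS)
SELECT "answered_questions".*
FROM "answered_questions"
INNER JOIN "taken_tests" ON "taken_tests"."id" = "answered_questions"."taken_test_id"
WHERE "taken_tests"."user_id" = 1;

-- 3. Cumulative block read/hit counters per table since the last stats reset.
SELECT relname, heap_blks_read, heap_blks_hit, idx_blks_read, idx_blks_hit
FROM pg_statio_user_tables
WHERE relname IN ('answered_questions', 'taken_tests');
```

If a slow run shows a large "shared read" count where the fast run shows only "shared hit", the difference comes from cold versus warm data rather than from the column order in the ON clause. When the cold first run itself is the problem, the contrib extension pg_prewarm can load a relation into the cache ahead of time; whether that is worthwhile depends on how much memory is available.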

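Note: the VACUUM or ANALYZE commands are unlikely to help here, since both queries already use the same plan. Keep in mind, however, that because of PostgreSQL's transaction isolation mechanism (MVCC), it may have to read the underlying table rows to verify that they are still visible to the current transaction after fetching results from the index. This can be improved by updating the visibility map (see https://www.postgresql.org/docs/10/storage-vm.html) during vacuuming.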
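
To see how much of each table the visibility map currently covers, something like the following can be used. It is a sketch under the assumption that the catalog statistics are reasonably fresh (relpages and relallvisible are estimates maintained by VACUUM and ANALYZE). Also note that the visibility map mainly lets index-only scans skip heap fetches; the plans in the question show plain Index Scan nodes and select answered_questions.*, so the gain here may be limited.

```sql
-- Share of heap pages marked all-visible, per table (estimates from pg_class).
SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname IN ('answered_questions', 'taken_tests');

-- Vacuuming updates the visibility map; VERBOSE reports what was processed.
VACUUM (VERBOSE) answered_questions;
```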