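Postgres foreign data wrapper issue with 'is null' where clause

Time: 2017-11-09 13:12:06

Tags: sql postgresql foreign-data-wrapper

I'm trying to build a report using queries that access different Postgres databases through FDW.

I can't figure out why it behaves the way it does. The first query, without a where clause, is fine: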

SELECT s.student_id, p.surname 
FROM rep_student s inner JOIN rep_person p ON p.id = s.person_id

But adding the where clause makes this query hundreds of times slower (40 s vs 0.1 s):

SELECT s.student_id, p.surname 
FROM rep_student s inner JOIN rep_person p ON p.id = s.person_id
WHERE s.learning_end_date IS NULL

Result of EXPLAIN VERBOSE:

Nested Loop  (cost=200.00..226.39 rows=1 width=734)
  Output: s.student_id, p.surname
  Join Filter: ((s.person_id)::text = (p.id)::text)
  ->  Foreign Scan on public.rep_student s  (cost=100.00..111.80 rows=1 width=436)
        Output: s.student_id, s.version, s.person_id, s.curriculum_flow_id, s.learning_start_date, s.learning_end_date, s.learning_end_reason, s.last_update_timestamp, s.aud_created_ts, s.aud_created_by, s.aud_last_updated_ts, s.aud_last_updated_by
        Remote SQL: SELECT student_id, person_id FROM public.rep_student WHERE ((learning_end_date IS NULL))
  ->  Foreign Scan on public.rep_person p  (cost=100.00..113.24 rows=108 width=734)
        Output: p.id, p.version, p.surname, p.name, p.middle_name, p.birthdate, p.info, p.photo, p.last_update_timestamp, p.is_archived, p.gender, p.aud_created_ts, p.aud_created_by, p.aud_last_updated_ts, p.aud_last_updated_by, p.full_name
        Remote SQL: SELECT id, surname FROM public.rep_person

Result of EXPLAIN ANALYZE:

Nested Loop  (cost=200.00..226.39 rows=1 width=734) (actual time=27.138..38996.303 rows=939 loops=1)
  Join Filter: ((s.person_id)::text = (p.id)::text)
  Rows Removed by Join Filter: 15194898
  ->  Foreign Scan on rep_student s  (cost=100.00..111.80 rows=1 width=436) (actual time=0.685..4.259 rows=939 loops=1)
  ->  Foreign Scan on rep_person p  (cost=100.00..113.24 rows=108 width=734) (actual time=1.380..39.094 rows=16183 loops=939)
Planning time: 0.251 ms
Execution time: 38997.914 ms

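The tables are relatively small. Almost all rows in the student table have NULL in the learning_end_date column.

Students: ~1000 rows. Persons: ~15000 rows.

It looks like Postgres has trouble filtering on NULL through FDW, because this query runs fast again: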

SELECT s.student_id, p.surname 
FROM rep_student s inner JOIN rep_person p ON p.id = s.person_id
WHERE s.learning_start_date < current_date

Result of EXPLAIN VERBOSE:

Hash Join  (cost=214.59..231.83 rows=36 width=734)
  Output: s.student_id, p.surname
  Hash Cond: ((s.person_id)::text = (p.id)::text)
  ->  Foreign Scan on public.rep_student s  (cost=100.00..116.65 rows=59 width=436)
        Output: s.student_id, s.version, s.person_id, s.curriculum_flow_id, s.learning_start_date, s.learning_end_date, s.learning_end_reason, s.last_update_timestamp, s.aud_created_ts, s.aud_created_by, s.aud_last_updated_ts, s.aud_last_updated_by
        Filter: (s.learning_start_date < ('now'::cstring)::date)
        Remote SQL: SELECT student_id, person_id, learning_start_date FROM public.rep_student
  ->  Hash  (cost=113.24..113.24 rows=108 width=734)
        Output: p.surname, p.id
        ->  Foreign Scan on public.rep_person p  (cost=100.00..113.24 rows=108 width=734)
              Output: p.surname, p.id
              Remote SQL: SELECT id, surname FROM public.rep_person

Result of EXPLAIN ANALYZE:

Hash Join  (cost=214.59..231.83 rows=36 width=734) (actual time=41.614..46.347 rows=940 loops=1)
  Hash Cond: ((s.person_id)::text = (p.id)::text)
  ->  Foreign Scan on rep_student s  (cost=100.00..116.65 rows=59 width=436) (actual time=0.718..3.829 rows=940 loops=1)
        Filter: (learning_start_date < ('now'::cstring)::date)
  ->  Hash  (cost=113.24..113.24 rows=108 width=734) (actual time=40.812..40.812 rows=16183 loops=1)
        Buckets: 16384 (originally 1024)  Batches: 2 (originally 1)  Memory Usage: 921kB
        ->  Foreign Scan on rep_person p  (cost=100.00..113.24 rows=108 width=734) (actual time=2.252..35.079 rows=16183 loops=1)
Planning time: 0.208 ms
Execution time: 47.176 ms

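I tried adding an index on learning_end_date, but it had no effect.

What do I need to change to make the query with the 'IS NULL' where clause run faster? Any ideas would be appreciated!

1 Answer:

Answer 0 (score: 1):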

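Your problem is that you don't have good table statistics for these foreign tables, so the PostgreSQL optimizer's row count estimates are pretty much arbitrary.

This causes the optimizer to choose a nested loop join in the case you report as slow, which is an inappropriate plan. In that plan the estimate for rep_student is rows=1 while the scan actually returns 939 rows, so the scan on rep_person is repeated 939 times (loops=939).

It is only a coincidence that this happens with a certain IS NULL condition.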

Collect statistics on the foreign tables with:

ANALYZE rep_student;
ANALYZE rep_person;

Then performance will be much better.
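
To confirm that the statistics were actually collected, one could inspect the catalog data the planner uses and then re-check the plan. This is only a minimal sketch: it assumes the foreign tables are on the current search_path and that no other tables with the same names exist in other schemas. The numbers you see will depend on your data.

```sql
-- Row and page counts the planner will use for each foreign table
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname IN ('rep_student', 'rep_person');

-- Fraction of NULLs recorded for the filtered column
SELECT tablename, attname, null_frac
FROM pg_stats
WHERE tablename = 'rep_student'
  AND attname = 'learning_end_date';

-- Re-check which join strategy the planner picks now
EXPLAIN
SELECT s.student_id, p.surname
FROM rep_student s
INNER JOIN rep_person p ON p.id = s.person_id
WHERE s.learning_end_date IS NULL;
```

If the estimates are now close to the real row counts (roughly 1000 students and 15000 persons), the planner has what it needs to avoid the nested loop, although the exact plan it chooses may still vary.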

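Note that while autovacuum automatically collects statistics for local tables, it does not do so for foreign tables, because it doesn't know how many rows have changed. You should therefore regularly ANALYZE foreign tables whose data change.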
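
One possible way to do this regularly is a small script that analyzes every foreign table visible to the current user. The sketch below is an illustration, not part of the original answer: it reads table names from information_schema.foreign_tables and would have to be scheduled by some external tool (cron or similar, which is an assumption here).

```sql
DO $$
DECLARE
    t record;
BEGIN
    -- Loop over all foreign tables the current user can see
    FOR t IN
        SELECT foreign_table_schema, foreign_table_name
        FROM information_schema.foreign_tables
    LOOP
        -- %I quotes the identifiers safely
        EXECUTE format('ANALYZE %I.%I',
                       t.foreign_table_schema,
                       t.foreign_table_name);
    END LOOP;
END
$$;
```

Analyzing a foreign table has to sample rows from the remote server, so on large remote tables it may be preferable to list only the tables that matter for your reports.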