I have these two tables in my database:
Student Table                     Student Semester Table
| Column     : Type     |         | Column     : Type     |
|------------|----------|         |------------|----------|
| student_id : integer  |         | student_id : integer  |
| satquan    : smallint |         | semester   : integer  |
| actcomp    : smallint |         | enrolled   : boolean  |
| entryyear  : smallint |         | major      : text     |
|-----------------------|         | college    : text     |
                                  |-----------------------|
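Here student_id is a unique key in the student table and a foreign key in the student semester table. The semester integer is simply 1 for the first semester, 2 for the second, and so on.
I am running queries where I want to select students by their entry year (and sometimes by their SAT and/or ACT scores), then fetch all of the associated rows for those students from the student semester table.
Currently, my queries look something like this: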
SELECT * FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student WHERE entryyear = 2006
) AND college = 'AS' AND ...
)
ORDER BY student_id, semester;
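However, this results in relatively long-running queries (400 ms) when I am selecting ~1k students. According to the execution plan, most of the time is spent doing a hash join. To improve this, I added the satquan, actcomp, and entryyear columns to the student_semester table. That cuts the query time by about 90%, but it leaves a lot of redundant data. Is there a better way to do this?
These are the indexes that I currently have (along with the implicit indexes on student_id):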
CREATE INDEX act_sat_entryyear ON student USING btree (entryyear, actcomp, sattotal)
CREATE INDEX student_id_major_college ON student_semester USING btree (student_id, major, college)
Query plan
QUERY PLAN
Hash Join (cost=17311.74..35895.38 rows=81896 width=65) (actual time=121.097..326.934 rows=25680 loops=1)
Hash Cond: (public.student_semester.student_id = public.student_semester.student_id)
-> Seq Scan on student_semester (cost=0.00..14307.20 rows=698820 width=65) (actual time=0.015..154.582 rows=698820 loops=1)
-> Hash (cost=17284.89..17284.89 rows=2148 width=8) (actual time=121.062..121.062 rows=1284 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 51kB
-> HashAggregate (cost=17263.41..17284.89 rows=2148 width=8) (actual time=120.708..120.871 rows=1284 loops=1)
-> Hash Semi Join (cost=1026.68..17254.10 rows=3724 width=8) (actual time=4.828..119.619 rows=6184 loops=1)
Hash Cond: (public.student_semester.student_id = student.student_id)
-> Seq Scan on student_semester (cost=0.00..16054.25 rows=42908 width=4) (actual time=0.013..109.873 rows=42331 loops=1)
Filter: ((college)::text = 'AS'::text)
-> Hash (cost=988.73..988.73 rows=3036 width=4) (actual time=4.801..4.801 rows=3026 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 107kB
-> Bitmap Heap Scan on student (cost=71.78..988.73 rows=3036 width=4) (actual time=0.406..3.223 rows=3026 loops=1)
Recheck Cond: (entryyear = 2006)
-> Bitmap Index Scan on student_act_sat_entryyear_index (cost=0.00..71.03 rows=3036 width=0) (actual time=0.377..0.377 rows=3026 loops=1)
Index Cond: (entryyear = 2006)
Total runtime: 327.708 ms
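I was mistaken in thinking that there was no Seq Scan in the query. I think the Seq Scan is being done because of the number of rows that match the college condition; when I change it to a college with fewer students, an index is used. Source: https://stackoverflow.com/a/5203827/880928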
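One way to check that selectivity hypothesis is to count the rows per college and then compare the plans for a common and a rare value. This is only a sketch: `'XX'` is a hypothetical placeholder for a college with few students, not a value taken from the question.

```sql
-- How many rows match each college? A value that matches a large share of
-- the table tends to make a sequential scan the cheaper choice.
SELECT college, count(*) AS n_rows
FROM student_semester
GROUP BY college
ORDER BY n_rows DESC;

-- Plan for the common value used in the question.
EXPLAIN ANALYZE
SELECT student_id FROM student_semester WHERE college = 'AS';

-- Plan for a rarer value ('XX' is a hypothetical placeholder).
EXPLAIN ANALYZE
SELECT student_id FROM student_semester WHERE college = 'XX';
```

If the second plan switches away from a Seq Scan while the first does not, that is consistent with the planner choosing by estimated row count rather than ignoring the index.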
Query with the entryyear column added to the student semester table
SELECT * FROM student_semester
WHERE student_id IN(
SELECT student_id FROM student_semester
WHERE entryyear = 2006 AND collgs = 'AS'
) ORDER BY student_id, semester;
Query plan
Sort (cost=18597.13..18800.49 rows=81343 width=65) (actual time=72.946..74.003 rows=25680 loops=1)
Sort Key: public.student_semester.student_id, public.student_semester.semester
Sort Method: quicksort Memory: 3546kB
-> Nested Loop (cost=9843.87..11962.91 rows=81343 width=65) (actual time=24.617..40.751 rows=25680 loops=1)
-> HashAggregate (cost=9843.87..9845.73 rows=186 width=4) (actual time=24.590..24.836 rows=1284 loops=1)
-> Bitmap Heap Scan on student_semester (cost=1612.75..9834.63 rows=3696 width=4) (actual time=10.401..23.637 rows=6184 loops=1)
Recheck Cond: (entryyear = 2006)
Filter: ((collgs)::text = 'AS'::text)
-> Bitmap Index Scan on entryyear_act_sat_semester_enrolled_cumdeg_index (cost=0.00..1611.82 rows=60192 width=0) (actual time=10.259..10.259 rows=60520 loops=1)
Index Cond: (entryyear = 2006)
-> Index Scan using student_id_index on student_semester (cost=0.00..11.13 rows=20 width=65) (actual time=0.003..0.010 rows=20 loops=1284)
Index Cond: (student_id = public.student_semester.student_id)
Total runtime: 74.938 ms
Answer 0 (score: 1)
A cleaner version of your query is
select ss.*
from
student s
inner join
student_semester ss using(student_id)
where
s.entryyear = 2006
and exists (
select 1
from student_semester
where
college = 'AS'
and student_id = s.student_id
)
order by ss.student_id, semester
Answer 1 (score: 1)
Another way to write the query is with a window function.
select t.* -- Has the extra NumMatches column. To eliminate it, list the columns you want
from (select ss.*,
             -- count, per student, the rows that satisfy both conditions
             sum(case when ss.college = 'AS' and s.entryyear = 2006 then 1 else 0 end) over
                 (partition by ss.student_id) as NumMatches
      from student_semester ss join
           student s
           on ss.student_id = s.student_id
     ) t
where NumMatches > 0;
Window functions are generally faster than joining to an aggregation, so I suspect this might perform well.
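As the comment in the query notes, the helper column can be dropped by listing the wanted columns explicitly. A minimal sketch, assuming the student_semester columns shown in the question (student_id, semester, enrolled, major, college) and the question's column name entryyear:

```sql
-- Same window-function approach, but the outer SELECT names the columns,
-- so the helper column nummatches is not returned.
SELECT t.student_id, t.semester, t.enrolled, t.major, t.college
FROM (
    SELECT ss.*,
           sum(CASE WHEN ss.college = 'AS' AND s.entryyear = 2006
                    THEN 1 ELSE 0 END)
               OVER (PARTITION BY ss.student_id) AS nummatches
    FROM student_semester ss
    JOIN student s ON ss.student_id = s.student_id
) t
WHERE t.nummatches > 0
ORDER BY t.student_id, t.semester;
```

The ORDER BY is added here to match the ordering of the original query; the window function itself does not sort the output.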
Answer 2 (score: 0)
What you want is: the students who entered in 2006 and who have ever been in the AS college.
First version.
SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entryyear=2006
AND student_id IN (SELECT student_id
FROM student_semester s2 WHERE s2.college='AS')
/* AND other criteria */
ORDER BY sem.student_id, semester;
Second version
SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entryyear=2006
AND EXISTS
(SELECT 1 FROM student_semester s2
WHERE s2.student_id = s.student_id AND s2.college='AS')
-- CREATE INDEX foo on student_semester(student_id, college);
/* AND other criteria */
ORDER BY sem.student_id, semester;
I expect both to be fast, but whether one performs better than the other (or whether they produce exactly the same plan) is a PG mystery.
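Rather than guessing, the two versions can be compared directly with EXPLAIN ANALYZE. The sketch below uses the question's column name entryyear, leaves out the "other criteria", and creates the index from the comment in the second version; whether that index helps on this data is an assumption to verify, not a given.

```sql
-- Index suggested in the comment above ("foo" is the answer's own name for it).
CREATE INDEX foo ON student_semester (student_id, college);
ANALYZE student_semester;  -- refresh planner statistics

-- First version: IN (subquery)
EXPLAIN ANALYZE
SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entryyear = 2006
  AND student_id IN (SELECT student_id
                     FROM student_semester s2
                     WHERE s2.college = 'AS')
ORDER BY sem.student_id, semester;

-- Second version: EXISTS (correlated subquery)
EXPLAIN ANALYZE
SELECT sem.*
FROM student s JOIN student_semester sem USING (student_id)
WHERE s.entryyear = 2006
  AND EXISTS (SELECT 1
              FROM student_semester s2
              WHERE s2.student_id = s.student_id
                AND s2.college = 'AS')
ORDER BY sem.student_id, semester;
```

If both outputs show the same plan shape, the choice between IN and EXISTS comes down to readability.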
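[Edit] Here is a version without the semi-joins. I don't expect it to perform well, since it produces multiple hits, one for every semester the student was in AS.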
SELECT DISTINCT ON ( /* PK of sem */ ) sem.*
FROM student s
JOIN student_semester sem USING (student_id)
JOIN student_semester s2 USING (student_id)
WHERE s.entryyear=2006
AND s2.college='AS'
ORDER BY sem.student_id, semester;