I am writing a query that uses aggregate functions to reduce row duplication, because the query joins a dozen tables.
Consider this query (simplified to show the bottleneck):
SELECT r.source_uri AS su_on_r,
       tag.voted_tag AS vt_on_tag,
       tag.votes     AS v_on_tag
FROM release r
INNER JOIN release_barcode barcode
        ON r.source_uri = barcode.source_uri
       AND barcode.barcode IN ('75992731324', '075992731324', '0075992731324')
LEFT JOIN (
    SELECT source_uri,
           string_agg(voted_tag, '|')   AS voted_tag,
           string_agg(votes::text, '|') AS votes
    FROM release_voted_tag
    GROUP BY source_uri
) tag
  ON r.source_uri = tag.source_uri
The filter on release_barcode restricts r to 21 rows (out of roughly 8M).
The plan for the LEFT JOIN looks like this:
->  Merge Left Join  (cost=1461.05..157205.05 rows=125 width=242) (actual time=23.322..1994.827 rows=21 loops=1)
      Merge Cond: ((r.source_uri)::text = (release_voted_tag.source_uri)::text)
      ->  Sort  (cost=1460.50..1460.81 rows=125 width=178) (actual time=0.974..0.991 rows=21 loops=1)
            Sort Key: r.source_uri
            Sort Method: quicksort  Memory: 30kB
            ->  Nested Loop  (cost=0.99..1456.15 rows=125 width=178) (actual time=0.071..0.870 rows=21 loops=1)
                  ->  Index Scan using release_barcode_barcode_idx on release_barcode barcode  (cost=0.43..382.71 rows=125 width=62) (actual time=0.029..0.061 rows=21 loops=1)
                        Index Cond: ((barcode)::text = ANY ('{75992731324,075992731324,0075992731324}'::text[]))
                  ->  Index Scan using release_source_uri_idx on release r  (cost=0.56..8.58 rows=1 width=169) (actual time=0.037..0.037 rows=1 loops=21)
                        Index Cond: ((source_uri)::text = (barcode.source_uri)::text)
      ->  Materialize  (cost=0.55..155340.82 rows=161233 width=132) (actual time=0.026..1625.598 rows=321318 loops=1)
            ->  GroupAggregate  (cost=0.55..153325.41 rows=161233 width=132) (actual time=0.024..1446.457 rows=321318 loops=1)
                  Group Key: release_voted_tag.source_uri
                  ->  Index Scan using release_voted_tag_source_uri_idx on release_voted_tag  (cost=0.55..136510.34 rows=1151726 width=82) (actual time=0.007..647.964 rows=1151726 loops=1)
Here is the full plan (for the complete query, including the filter clauses):
https://explain.depesz.com/s/dkNC
The way I see it, the problem is the number of rows the LEFT JOIN has to process. Over 1M rows is nowhere near what I would expect if the filter on r had been applied first. I would expect 84 rows back, which is what this returns:
SELECT release_barcode.source_uri, voted_tag
FROM release_barcode, release_voted_tag
WHERE release_voted_tag.source_uri = release_barcode.source_uri
  AND barcode IN ('75992731324', '075992731324', '0075992731324');
I thought this would limit the number of records selected from release_voted_tag, since the ON filter is applied outside the subquery.
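As an aside (this is not in the original question or answers, and assumes PostgreSQL 9.3+): the planner does not push the outer ON condition into an aggregated derived table, but a LATERAL join makes the per-row restriction explicit, so only the matching source_uri values get aggregated. A sketch:

```sql
-- Sketch only: LATERAL runs the aggregate subquery once per outer row,
-- so only the ~21 matched source_uri values are aggregated instead of
-- all ~321k groups in release_voted_tag.
SELECT r.source_uri AS su_on_r,
       tag.voted_tag AS vt_on_tag,
       tag.votes     AS v_on_tag
FROM release r
JOIN release_barcode barcode
  ON r.source_uri = barcode.source_uri
 AND barcode.barcode IN ('75992731324', '075992731324', '0075992731324')
LEFT JOIN LATERAL (
    SELECT string_agg(t.voted_tag, '|')   AS voted_tag,
           string_agg(t.votes::text, '|') AS votes
    FROM release_voted_tag t
    WHERE t.source_uri = r.source_uri
) tag ON true;
```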
As mentioned, there are other 1:M joins. I originally wrote the query as:
SELECT r.source_uri AS su_on_r,
       string_agg(tag.voted_tag, '|') AS vt_on_tag,
       string_agg(tag.votes::text, '|') AS v_on_tag,
       t.title,
       string_agg(DISTINCT tComposer.composer, '|') AS c_on_tComposer
FROM release r
JOIN release_barcode barcode
  ON r.source_uri = barcode.source_uri
 AND barcode.barcode IN ('75992731324', '075992731324', '0075992731324')
LEFT JOIN release_voted_tag tag
  ON r.source_uri = tag.source_uri
LEFT JOIN medium m
  ON r.source_uri = m.source_uri
LEFT JOIN track t
  ON m.id = t.medium
LEFT JOIN track_composer tComposer
  ON t.id = tComposer.track
GROUP BY r.source_uri, t.title;
However, because track_composer joins in through medium and track rather than through release_voted_tag, we end up with multiple tag rows whenever there is more than one track_composer, and the aggregates pick up duplicates. For example, with two track_composer rows, each tag.voted_tag value appears twice in string_agg(tag.voted_tag, '|').
Note that we cannot simply add distinct here, because tag.voted_tag and tag.votes must stay correlated with each other.
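To illustrate the correlation problem (a sketch, not from the original post): DISTINCT inside string_agg de-duplicates each column independently, so the n-th voted_tag would no longer line up with the n-th votes value. One hypothetical workaround is to aggregate the pair as a single value so they stay together:

```sql
-- Sketch: concatenating the pair before aggregating keeps each
-- voted_tag aligned with its own vote count even under DISTINCT.
SELECT source_uri,
       string_agg(DISTINCT voted_tag || ':' || votes::text, '|') AS tag_with_votes
FROM release_voted_tag
GROUP BY source_uri;
```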
I found I could work around this with correlated subqueries that perform the aggregation, but isn't that inefficient? Doesn't the subquery run once for every row?
That is why I moved to the subquery inside the join: that way I can do the aggregation in the join and guarantee that only a single row comes back, which keeps the joins to the other 1:M tables... sane.
So why is the aggregate subquery in the join so expensive, and how can I make it efficient?
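For reference, the correlated-subquery form alluded to above might look like this (a sketch under the assumption that this is what was meant; each subquery does execute once per outer row, but the outer row count here is only 21, so that is cheap in practice):

```sql
-- Sketch: per-row correlated aggregates in the SELECT list.
SELECT r.source_uri AS su_on_r,
       (SELECT string_agg(t.voted_tag, '|')
        FROM release_voted_tag t
        WHERE t.source_uri = r.source_uri) AS vt_on_tag,
       (SELECT string_agg(t.votes::text, '|')
        FROM release_voted_tag t
        WHERE t.source_uri = r.source_uri) AS v_on_tag
FROM release r
JOIN release_barcode barcode
  ON r.source_uri = barcode.source_uri
 AND barcode.barcode IN ('75992731324', '075992731324', '0075992731324');
```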
Answer 0 (score: 1)
I would rewrite it as:
SELECT r.source_uri AS su_on_r,
       string_agg(tag.voted_tag, '|') AS vt_on_tag,
       string_agg(tag.votes::text, '|') AS v_on_tag
FROM release r
JOIN release_barcode barcode
  ON r.source_uri = barcode.source_uri
 AND barcode.barcode IN ('75992731324', '075992731324', '0075992731324')
LEFT JOIN release_voted_tag tag
  ON r.source_uri = tag.source_uri
GROUP BY r.source_uri;
Or even:
SELECT r.source_uri AS su_on_r,
       string_agg(tag.voted_tag, '|') AS vt_on_tag,
       string_agg(tag.votes::text, '|') AS v_on_tag
FROM release r
LEFT JOIN release_voted_tag tag
  ON r.source_uri = tag.source_uri
WHERE r.source_uri IN (SELECT source_uri
                       FROM release_barcode
                       WHERE barcode IN ('75992731324', '075992731324', '0075992731324'))
GROUP BY r.source_uri;
Answer 1 (score: 0)
The reason you are getting a million rows is that the inner query runs first and the filter is applied afterwards. If you want the filter applied first, you need to add it to the inner query (which may not be a good idea, since it creates a correlated query), or rewrite the joins as described in the other answer (lad2025's).
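A sketch of "adding the filter to the inner query" described above (hypothetical; it repeats the barcode filter inside the derived table so the aggregate only sees the matching groups):

```sql
-- Sketch: restricting the derived table to the barcode-matched
-- source_uri values before aggregating, at the cost of duplicating
-- the filter.
SELECT r.source_uri AS su_on_r,
       tag.voted_tag AS vt_on_tag,
       tag.votes     AS v_on_tag
FROM release r
JOIN release_barcode barcode
  ON r.source_uri = barcode.source_uri
 AND barcode.barcode IN ('75992731324', '075992731324', '0075992731324')
LEFT JOIN (
    SELECT source_uri,
           string_agg(voted_tag, '|')   AS voted_tag,
           string_agg(votes::text, '|') AS votes
    FROM release_voted_tag
    WHERE source_uri IN (SELECT source_uri
                         FROM release_barcode
                         WHERE barcode IN ('75992731324', '075992731324', '0075992731324'))
    GROUP BY source_uri
) tag
  ON r.source_uri = tag.source_uri;
```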