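Duplicating WHERE conditions in joined subqueries

Time: 2017-11-24 10:16:03

Tags: sql postgresql query-optimization

I wrote a query that joins fourteen tables. When the condition matches a large number of rows, the query takes a long time. This is the original query, with a large IN condition: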

SELECT r.source_uri AS su_on_r, r.title AS t_on_r, r.subtitle AS s_on_r, r.artist_name AS an_on_r
     , r.asin AS a_on_r, r.country AS c_on_r, r.release_date AS rd_on_r
     , string_agg(DISTINCT barcode.barcode::TEXT, '|') AS b_on_barcode
     , string_agg(DISTINCT genre.genre::TEXT, '|') AS g_on_genre
     , string_agg(DISTINCT typ.type::TEXT, '|') AS t_on_typ
     , string_agg(tag.voted_tag::TEXT, '|') AS vt_on_tag
     , IMAGE.uri AS u_on_image, IMAGE.width AS w_on_image
     , IMAGE.height AS h_on_image, IMAGE.score AS s_on_image
     , string_agg(DISTINCT imageType.image_type::TEXT, '|') AS it_on_imageType
     , string_agg(tag.votes::TEXT, '|') AS v_on_tag
     , string_agg(DISTINCT url.url::TEXT, '|') AS u_on_url
     , event.label_name AS ln_on_event, event.cat AS c_on_event
     , m.position AS p_on_m, m.title AS t_on_m, m.format AS f_on_m
     , t.position AS p_on_t, t.title AS t_on_t
     , string_agg(DISTINCT t.duration::TEXT, '|') AS d_on_t
     , string_agg(DISTINCT tArtist.artist::TEXT, '|') AS a_on_tArtist
     , string_agg(DISTINCT tComposer.composer::TEXT, '|') AS c_on_tComposer
     , string_agg(DISTINCT tIsrc.isrc::TEXT, '|') AS i_on_tIsrc
FROM release r
LEFT JOIN release_barcode barcode ON r.source_uri = barcode.source_uri
LEFT JOIN release_genre genre ON r.source_uri = genre.source_uri
LEFT JOIN release_type typ ON r.source_uri = typ.source_uri
LEFT JOIN release_voted_tag tag ON r.source_uri = tag.source_uri
LEFT JOIN release_image IMAGE ON r.source_uri = IMAGE.source_uri
LEFT JOIN release_image_type imageType ON IMAGE.id = imageType.image_id
LEFT JOIN release_url url ON r.source_uri = url.source_uri
LEFT JOIN release_event event ON r.source_uri = event.source_uri
LEFT JOIN medium m ON r.source_uri = m.source_uri
LEFT JOIN track t ON m.id = t.medium
LEFT JOIN track_artist tArtist ON t.id = tArtist.track
LEFT JOIN track_composer tComposer ON t.id = tComposer.track
LEFT JOIN track_isrc tIsrc ON t.id = tIsrc.track
WHERE r.source_uri IN (
  'https://api.discogs.com/releases/1955915'
  ,'https://api.discogs.com/releases/8602631'
  ,[and so on for about thirty more URIs]
  )
GROUP BY su_on_r, t_on_r, s_on_r, an_on_r, a_on_r, c_on_r, rd_on_r, u_on_image, w_on_image, h_on_image, s_on_image, ln_on_event, c_on_event, p_on_m, t_on_m, f_on_m, p_on_t, t_on_t;

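Looking at the EXPLAIN output, most of the work goes into sorting because of the large GROUP BY clause: https://explain.depesz.com/s/dV5o

You can see that the aggregate works on more than 90k rows. The row count is this large because of the number of joins: each of the many 1:m tables multiplies the rows produced so far.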
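The multiplication can be made visible by counting the child rows per release before joining anything. The following is only a diagnostic sketch, assuming the table and column names from the query above and showing just four of the 1:m tables; the correlated aliases (b, g, vt, u) are made up for illustration.

```sql
-- Per-release row counts for some of the 1:m tables.
-- With LEFT JOINs, the joined row count per release is roughly the product
-- of these counts, where a count of 0 still contributes one (NULL) row.
SELECT r.source_uri
     , (SELECT count(*) FROM release_barcode b    WHERE b.source_uri  = r.source_uri) AS barcodes
     , (SELECT count(*) FROM release_genre g      WHERE g.source_uri  = r.source_uri) AS genres
     , (SELECT count(*) FROM release_voted_tag vt WHERE vt.source_uri = r.source_uri) AS tags
     , (SELECT count(*) FROM release_url u        WHERE u.source_uri  = r.source_uri) AS urls
FROM release r
WHERE r.source_uri IN (
  'https://api.discogs.com/releases/1955915'
  ,'https://api.discogs.com/releases/8602631'
  );
```

As a made-up illustration, a release with 2 barcodes, 3 genres, 10 tags and 4 URLs would already contribute 2 × 3 × 10 × 4 = 240 joined rows before the image, medium and track tables are joined in.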

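First attempt: moving the aggregation into joined subqueries

So I wondered how to rewrite the query so that all of these rows never have to be combined. I decided to write the joins as subqueries and to move the aggregation into those subqueries.

My first attempt was (shown for release_barcode only; the same pattern is repeated for every table):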

LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode
    GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri

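The reasoning was that fewer rows are returned, and that I no longer need a huge sort because the top-level query has no GROUP BY.

However, this was much slower! The query planner does not seem to apply the top-level query's criteria first. Instead, it joins everything together.

Next attempt: repeating the criteria in the subqueries

So I tried something different: to force the filter into every subquery, I simply duplicated the criteria: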

LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode
    WHERE source_uri IN (
      'https://api.discogs.com/releases/1955915'
      ,'https://api.discogs.com/releases/8602631'
      ,[and so on for about thirty more URIs]
      )
    GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri

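The WHERE clause is simply duplicated in every subquery.

The results speak for themselves: https://explain.depesz.com/s/exSw

A more complicated query, but 100 times faster!

But of course, the duplicated criteria are a real code smell.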
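One way the literal list could at least be written only once is to put it into a CTE and have every subquery filter against that. This is a sketch rather than something tried in this post: the CTE name wanted is hypothetical, only two of the subqueries are shown, and whether the planner handles it as well as the literal IN lists would need to be checked with EXPLAIN.

```sql
-- "wanted" is a hypothetical name; it holds the URI list in a single place.
WITH wanted (source_uri) AS (
    VALUES ('https://api.discogs.com/releases/1955915')
         , ('https://api.discogs.com/releases/8602631')
         -- , (further URIs go here)
)
SELECT r.source_uri AS su_on_r, r.title AS t_on_r
     , barcode.b_on_barcode, genre.g_on_genre
FROM release r
JOIN wanted w ON w.source_uri = r.source_uri
LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode
    WHERE source_uri IN (SELECT source_uri FROM wanted)  -- same filter, no repeated literals
    GROUP BY source_uri
) AS barcode ON r.source_uri = barcode.source_uri
LEFT JOIN (
    SELECT source_uri, string_agg(DISTINCT genre::TEXT, '|') AS g_on_genre
    FROM release_genre
    WHERE source_uri IN (SELECT source_uri FROM wanted)
    GROUP BY source_uri
) AS genre ON r.source_uri = genre.source_uri;
```

The filter itself is still repeated in each subquery, but changing the set of releases now means editing one place.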

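So my question is twofold:

  • Does this type of optimization have a name, and is it frowned upon?
  • Is there a better way to do this that avoids the duplication (see my first attempt)?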
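For context on the second point, one candidate that keeps the criteria in a single place is a correlated LATERAL subquery, available in PostgreSQL 9.3 and later. It is not one of the attempts described in this post and has not been measured against this data, so treat it as a sketch; only the release_barcode join is shown, and the alias b is made up.

```sql
SELECT r.source_uri AS su_on_r, r.title AS t_on_r, barcode.b_on_barcode
FROM release r
LEFT JOIN LATERAL (
    -- The subquery refers to r.source_uri, so it is evaluated per selected
    -- release instead of aggregating the whole release_barcode table.
    SELECT string_agg(DISTINCT b.barcode::TEXT, '|') AS b_on_barcode
    FROM release_barcode b
    WHERE b.source_uri = r.source_uri
) AS barcode ON true
WHERE r.source_uri IN (
  'https://api.discogs.com/releases/1955915'
  ,'https://api.discogs.com/releases/8602631'
  );
```

Because the aggregate has no GROUP BY, the subquery always returns exactly one row per release, which is why the join condition is simply ON true.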

1 answer:

Answer 0 (score: 0)

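  • Increase geqo_threshold (and possibly even join_collapse_limit). Note: this may push the planning time above one second.

  • Reduce the number of range table entries by splitting closely related tables off into CTEs (see the query below).

  • [The row sizes are large] Avoid fat indexes and fat tables (e.g. the %uri fields: move them into a separate table and reference that table through a surrogate key).
  • [A possible next step: put the whole query (without the aggregation) into a third CTE and do the aggregation in the main query.]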
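The first suggestion can be tried for a single transaction without touching the server configuration. In this sketch the values 16 are illustrative assumptions, chosen only because they exceed the fourteen joined tables, and the two-join EXPLAIN is a stand-in for the full query.

```sql
-- Inspect the current values (PostgreSQL defaults: geqo_threshold = 12, join_collapse_limit = 8).
SHOW geqo_threshold;
SHOW join_collapse_limit;

BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL join_collapse_limit = 16;  -- illustrative: lets the planner reorder all 14 joins
SET LOCAL geqo_threshold = 16;       -- illustrative: keeps the exhaustive search for 14 tables
-- Stand-in for the full fourteen-table query; EXPLAIN shows the plan chosen under these settings.
EXPLAIN
SELECT r.source_uri, barcode.barcode, genre.genre
FROM release r
LEFT JOIN release_barcode barcode ON r.source_uri = barcode.source_uri
LEFT JOIN release_genre genre ON r.source_uri = genre.source_uri
WHERE r.source_uri IN (
  'https://api.discogs.com/releases/1955915'
  ,'https://api.discogs.com/releases/8602631'
  );
ROLLBACK;
```

Raising both limits makes the planner consider many more join orders, which is where the extra planning time mentioned above comes from.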
WITH rel AS (
        SELECT * FROM release 
        WHERE source_uri IN (
  'https://api.discogs.com/releases/1955915'
  ,'https://api.discogs.com/releases/8602631'
  -- ,[and so on for about thirty more URIs]
        )
        )
, media AS (
        SELECT *
        FROM medium m -- ON r.source_uri = m.source_uri
        LEFT JOIN track t ON m.id = t.medium
        LEFT JOIN track_artist tArtist ON t.id = tArtist.track
        LEFT JOIN track_composer tComposer ON t.id = tComposer.track
        LEFT JOIN track_isrc tIsrc ON t.id = tIsrc.track
        )
SELECT r.source_uri AS su_on_r, r.title AS t_on_r, r.subtitle AS s_on_r, r.artist_name AS an_on_r
        , r.asin AS a_on_r, r.country AS c_on_r, r.release_date AS rd_on_r
        , string_agg(DISTINCT barcode.barcode::TEXT, '|') AS b_on_barcode
        , string_agg(DISTINCT genre.genre::TEXT, '|') AS g_on_genre
        , string_agg(DISTINCT typ.type::TEXT, '|') AS t_on_typ
        , string_agg(tag.voted_tag::TEXT, '|') AS vt_on_tag
        , img.uri AS u_on_image, img.width AS w_on_image
        , img.height AS h_on_image, img.score AS s_on_image
        , string_agg(DISTINCT imageType.image_type::TEXT, '|') AS it_on_imageType
        , string_agg(tag.votes::TEXT, '|') AS v_on_tag
        , string_agg(DISTINCT url.url::TEXT, '|') AS u_on_url
        , event.label_name AS ln_on_event, event.cat AS c_on_event
        , m.position AS p_on_m, m.title AS t_on_m, m.format AS f_on_m
        , m.position AS p_on_t, m.title AS t_on_t -- <<-- !! need to fix these in the CTE
        , string_agg(DISTINCT m.duration::TEXT, '|') AS d_on_t
        , string_agg(DISTINCT m.artist::TEXT, '|') AS a_on_tArtist
        , string_agg(DISTINCT m.composer::TEXT, '|') AS c_on_tComposer
        , string_agg(DISTINCT m.isrc::TEXT, '|') AS i_on_tIsrc
FROM rel r -- <<--- ##########################  CTE
LEFT JOIN release_barcode barcode       ON      r.source_uri = barcode.source_uri
LEFT JOIN release_genre genre           ON      r.source_uri = genre.source_uri
LEFT JOIN release_type typ              ON      r.source_uri = typ.source_uri
LEFT JOIN release_voted_tag tag         ON      r.source_uri = tag.source_uri
LEFT JOIN release_image img             ON      r.source_uri = img.source_uri
  LEFT JOIN release_image_type imageType         ON img.id = imageType.image_id
LEFT JOIN release_url url               ON      r.source_uri = url.source_uri
LEFT JOIN release_event event           ON      r.source_uri = event.source_uri
LEFT JOIN media m                       ON      r.source_uri = m.source_uri -- <<--- ##########################  CTE
GROUP BY su_on_r, t_on_r, s_on_r, an_on_r
        , a_on_r, c_on_r, rd_on_r
        , u_on_image, w_on_image, h_on_image
        , s_on_image, ln_on_event, c_on_event
        , p_on_m, t_on_m, f_on_m, p_on_t, t_on_t
        ;

Note: I made some mistakes when moving the terms into the media CTE, and there is still some renaming to do...
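A sketch of what that renaming might look like: the media CTE lists its columns explicitly so that medium and track no longer both expose position and title. The column names are taken from the question's query; the aliases m_position, m_title, t_position and t_title are made up here, and the statement is reduced to the media part so that it stands on its own.

```sql
WITH media AS (
        SELECT m.source_uri
             , m.position AS m_position, m.title AS m_title, m.format
             , t.position AS t_position, t.title AS t_title, t.duration
             , tArtist.artist, tComposer.composer, tIsrc.isrc
        FROM medium m
        LEFT JOIN track t ON m.id = t.medium
        LEFT JOIN track_artist tArtist ON t.id = tArtist.track
        LEFT JOIN track_composer tComposer ON t.id = tComposer.track
        LEFT JOIN track_isrc tIsrc ON t.id = tIsrc.track
        )
SELECT m.source_uri
     , m.m_position AS p_on_m, m.m_title AS t_on_m, m.format AS f_on_m
     , m.t_position AS p_on_t, m.t_title AS t_on_t
     , string_agg(DISTINCT m.duration::TEXT, '|') AS d_on_t
     , string_agg(DISTINCT m.artist::TEXT, '|') AS a_on_tArtist
     , string_agg(DISTINCT m.composer::TEXT, '|') AS c_on_tComposer
     , string_agg(DISTINCT m.isrc::TEXT, '|') AS i_on_tIsrc
FROM media m
WHERE m.source_uri IN (
  'https://api.discogs.com/releases/1955915'
  ,'https://api.discogs.com/releases/8602631'
  )
GROUP BY m.source_uri, p_on_m, t_on_m, f_on_m, p_on_t, t_on_t;
```

One caveat: before PostgreSQL 12 a CTE is always materialized, so the outer filter is not pushed into media and the CTE is computed for every medium. Filtering inside the CTE, for example by joining it to rel, would likely be needed to keep it small.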