Query optimization for a large join

Asked: 2019-09-02 12:47:10

Tags: postgresql query-performance postgresql-11

(PostgreSQL version 11.2)

I have the following table structure:


    CREATE TABLE site_tally
    (
        id               serial,
        dt_created       timestamp WITHOUT TIME ZONE DEFAULT now() NOT NULL,
        dt_updated       timestamp WITHOUT TIME ZONE,
        geo              text    NOT NULL,
        dt_tally         date    NOT NULL,
        parent_site      text    NOT NULL,
        site_id          integer NOT NULL,
        tracked          boolean NOT NULL,
        utm_parameter_id integer NOT NULL,
        device           text    NOT NULL,
        layout_id        integer NOT NULL,
        views            integer DEFAULT 0,
        PRIMARY KEY (id, geo)
    ) PARTITION BY LIST (geo);

    CREATE UNIQUE INDEX site_tally_uindex
        ON site_tally (geo, dt_tally, parent_site, site_id, tracked, utm_parameter_id, device, layout_id);

    CREATE TABLE site_tally_uk PARTITION OF site_tally FOR VALUES IN ('UK');
    CREATE TABLE site_tally_us PARTITION OF site_tally FOR VALUES IN ('US');
    CREATE TABLE site_tally_au PARTITION OF site_tally FOR VALUES IN ('AU');

    CREATE TABLE utm_parameters
    (
        id         serial                  NOT NULL PRIMARY KEY,
        dt_created timestamp DEFAULT now() NOT NULL,
        source     text DEFAULT 'default'::text NOT NULL,
        medium     text DEFAULT 'default'::text NOT NULL,
        campaign   text DEFAULT 'default'::text NOT NULL,
        term       text DEFAULT 'default'::text NOT NULL,
        content    text DEFAULT 'default'::text NOT NULL
    );

    CREATE UNIQUE INDEX utm_parameters_source_medium_campaign_term_content_uindex
        ON utm_parameters (source, medium, campaign, term, content);

The table is partitioned purely for performance reasons, since we never need to query more than one geo at a time.

I have hit an edge case where one of our queries takes a very long time to run:

SELECT SUM(views) AS views,
       term       AS utm
FROM site_tally
         INNER JOIN utm_parameters ON (utm_parameters.id = utm_parameter_id)
WHERE geo = 'UK'
    AND dt_tally >= '2019-08-01'
    AND dt_tally <= '2019-08-31'
    AND parent_site = 'site1'
    AND source = 'source1'
    AND medium = 'medium1'
    AND campaign = 'campaign1'
    AND tracked = FALSE
GROUP BY source,
         medium,
         campaign,
         term;

EXPLAIN ANALYZE:

    GroupAggregate  (cost=1.11..12152.56 rows=1 width=74) (actual time=88.064..163032.380 rows=351 loops=1)
      Group Key: utm_parameters.source, utm_parameters.medium, utm_parameters.campaign, utm_parameters.term
      ->  Nested Loop  (cost=1.11..12152.53 rows=1 width=70) (actual time=59.993..163025.340 rows=15823 loops=1)
            ->  Index Scan using utm_parameters_source_medium_campaign_term_content_uindex on utm_parameters  (cost=0.55..8.57 rows=1 width=70) (actual time=0.024..39.883 rows=5994 loops=1)
                  Index Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text) AND (campaign = 'campaign1'::text))
            ->  Append  (cost=0.56..12143.95 rows=1 width=8) (actual time=26.022..27.188 rows=3 loops=5994)
                  ->  Index Scan using site_tally_uk_geo_dt_tally_parent_site_site_id_tracked_utm__idx on site_tally_uk  (cost=0.56..12143.95 rows=1 width=8) (actual time=26.020..27.185 rows=3 loops=5994)
                        Index Cond: ((geo = 'UK'::text) AND (dt_tally >= '2019-08-01'::date) AND (dt_tally <= '2019-08-31'::date) AND (parent_site = 'site1'::text) AND (tracked = false) AND (utm_parameter_id = utm_parameters.id))
                        Filter: (NOT tracked)
    Planning Time: 0.693 ms
    Execution Time: 163032.762 ms

In this particular case there are many terms to group by, and the query without term behaves very differently:

SELECT SUM(views) AS views,
       campaign   AS utm
FROM site_tally
         INNER JOIN utm_parameters ON (utm_parameters.id = utm_parameter_id)
WHERE geo = 'UK'
  AND dt_tally >= '2019-08-01'
  AND dt_tally <= '2019-08-31'
  AND parent_site = 'site1'
  AND source = 'source1'
  AND medium = 'medium1'
  AND tracked = FALSE
GROUP BY source,
         medium,
         campaign;

EXPLAIN ANALYZE:

    GroupAggregate  (cost=87129.06..87129.13 rows=3 width=48) (actual time=54.451..54.451 rows=1 loops=1)
      Group Key: utm_parameters.source, utm_parameters.medium, utm_parameters.campaign
      ->  Sort  (cost=87129.06..87129.07 rows=3 width=44) (actual time=50.572..51.398 rows=15823 loops=1)
            Sort Key: utm_parameters.campaign
            Sort Method: quicksort  Memory: 2610kB
            ->  Hash Join  (cost=1583.46..87129.04 rows=3 width=44) (actual time=11.359..46.521 rows=15823 loops=1)
                  Hash Cond: (site_tally_uk.utm_parameter_id = utm_parameters.id)
                  ->  Append  (cost=1322.54..86645.61 rows=84764 width=8) (actual time=4.268..31.765 rows=53612 loops=1)
                        ->  Bitmap Heap Scan on site_tally_uk  (cost=1322.54..86221.79 rows=84764 width=8) (actual time=4.267..28.157 rows=53612 loops=1)
                              Recheck Cond: ((dt_tally <= '2019-08-31'::date) AND (geo = 'UK'::text) AND (dt_tally >= '2019-08-01'::date) AND (parent_site = 'site1'::text) AND (NOT tracked))
                              Heap Blocks: exact=5237
                              ->  Bitmap Index Scan on site_tally_uk_geo_dt_tally_parent_site_tracked_idx  (cost=0.00..1301.35 rows=84764 width=0) (actual time=3.519..3.519 rows=53612 loops=1)
                                    Index Cond: (dt_tally <= '2019-08-31'::date)
                  ->  Hash  (cost=260.09..260.09 rows=66 width=44) (actual time=7.083..7.084 rows=5994 loops=1)
                        Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 556kB
                        ->  Bitmap Heap Scan on utm_parameters  (cost=5.23..260.09 rows=66 width=44) (actual time=1.346..5.862 rows=5994 loops=1)
                              Recheck Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text))
                              Heap Blocks: exact=2655
                              ->  Bitmap Index Scan on utm_parameters_source_medium_campaign_term_content_uindex  (cost=0.00..5.21 rows=66 width=0) (actual time=0.991..0.992 rows=5994 loops=1)
                                    Index Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text))
    Planning Time: 0.571 ms
    Execution Time: 54.773 ms

Note: site_tally contains more integer-based columns (after views) which are also used as SUM values in the SELECT. I decided to leave them out of the question, as it is already a lengthy one!

So, ideally, I would like to speed this query up. I tried an alternative indexing strategy, and even tried to be specific to the query with an index like:

    CREATE INDEX testing ON site_tally (geo, dt_tally, parent_site, tracked)
        WHERE geo = 'UK' AND dt_tally >= '2019-08-01' AND parent_site = 'site1' AND tracked = FALSE;

The query planner still would not pick that index.

At this moment in time, I cannot change the unique index (other queries rely on that particular column ordering).

I would like to understand what is actually happening in this query (I am not very good at reading EXPLAIN output).

1 Answer:

Answer 0 (score: 3):

It looks like the estimate for the index scan on utm_parameters is way off.

  • First, try a simple

    ANALYZE utm_parameters;
    

    and see whether that fixes the problem.

  • If that does not fix the problem, try collecting more detailed statistics:

    ALTER TABLE utm_parameters
       ALTER source SET STATISTICS 1000,
       ALTER medium SET STATISTICS 1000,
       ALTER campaign SET STATISTICS 1000;
    
    ANALYZE utm_parameters;
    
  • If that does not improve the estimates either, the problem may be a correlation between the columns. Try creating extended statistics:

    CREATE STATISTICS utm_parameters_stats (dependencies)
       ON source, medium, campaign FROM utm_parameters;
    
    ANALYZE utm_parameters;
    
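To see why correlated columns need extended statistics at all, here is a small standalone sketch. It is plain Python with invented data, not PostgreSQL itself: it builds rows in which campaign functionally determines medium (the kind of dependency that CREATE STATISTICS ... (dependencies) is designed to detect) and compares the "multiply the per-column selectivities" estimate with the real combined selectivity.

```python
import random

random.seed(42)

# Synthetic data: `medium` is fully determined by `campaign`, so the two
# columns are strongly correlated rather than independent.
campaigns = [f"campaign{i}" for i in range(10)]
medium_of = {c: f"medium{i % 2}" for i, c in enumerate(campaigns)}
rows = []
for _ in range(10_000):
    c = random.choice(campaigns)
    rows.append({"source": "source1", "medium": medium_of[c], "campaign": c})

def selectivity(pred):
    """Fraction of rows satisfying the predicate."""
    return sum(1 for r in rows if pred(r)) / len(rows)

# Per-column selectivities, as single-column statistics would see them.
s_source = selectivity(lambda r: r["source"] == "source1")        # 1.0
s_medium = selectivity(lambda r: r["medium"] == "medium0")        # about 0.5
s_campaign = selectivity(lambda r: r["campaign"] == "campaign0")  # about 0.1

# Independence assumption: multiply the per-column selectivities.
independent_estimate = s_source * s_medium * s_campaign

# Reality: campaign0 implies medium0, so the combined predicate is exactly
# as selective as the campaign condition alone.
actual = selectivity(lambda r: r["source"] == "source1"
                     and r["medium"] == "medium0"
                     and r["campaign"] == "campaign0")

print(f"independent estimate: {independent_estimate:.4f}")
print(f"actual selectivity:   {actual:.4f}")  # roughly twice the estimate
```

The gap between the two numbers is exactly the kind of underestimate the answer describes; functional-dependency statistics let the planner notice that the medium condition adds no extra filtering once campaign is fixed.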

It seems that the last option did the trick for you. So what was going on?

  • PostgreSQL has reasonably good statistics for estimating the selectivity of a condition of the form column = value.

  • Let's assume that each of the three conditions has a selectivity of 0.1, that is, each filters out 90% of the rows. Knowing nothing better, PostgreSQL assumes the conditions are statistically independent, so it estimates the combined selectivity of all three as the product 0.1 * 0.1 * 0.1 = 0.001.

  • Now what happens if the conditions are not independent: for example, if campaign is the same for two rows, then medium is most likely the same for those rows as well. In that case, PostgreSQL's estimate will be lower than reality.

  • This low estimate makes PostgreSQL choose a nested loop join, which is the best access path for a small outer table. But if the outer table is actually big, a nested loop join performs badly. So fixing the estimate improves performance.
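The last point can be sketched with a toy cost model. The numbers below are invented for illustration (loosely shaped after the cost figures in the two plans above) and are not PostgreSQL's real cost constants; the point is only how an underestimated outer row count flips the plan choice.

```python
# Toy join cost model: nested loop pays a per-outer-row index probe,
# hash join pays one big scan up front plus cheap per-row work.

def nested_loop_cost(outer_rows, inner_lookup_cost=12_000):
    # One parameterized index probe into the partitioned table per outer row.
    return outer_rows * inner_lookup_cost

def hash_join_cost(outer_rows, inner_scan_cost=86_000, per_row_cost=1):
    # One full scan of the big table to build/probe the hash table,
    # plus a small per-row cost for each outer row.
    return inner_scan_cost + outer_rows * per_row_cost

estimated_outer = 1    # what the planner believed (rows=1)
actual_outer = 5994    # what really came out of utm_parameters

# With the bad estimate, the nested loop looks far cheaper...
assert nested_loop_cost(estimated_outer) < hash_join_cost(estimated_outer)
# ...but with the real row count, the hash join wins by a wide margin.
assert hash_join_cost(actual_outer) < nested_loop_cost(actual_outer)

print("nested loop at actual rows:", nested_loop_cost(actual_outer))
print("hash join at actual rows:  ", hash_join_cost(actual_outer))
```

This mirrors the two plans above: with rows=1 estimated on the utm_parameters scan, the planner picked the nested loop (163 s); once the row count was realistic, the hash join variant took 55 ms.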