Using Postgres 9.4, I am getting a disturbing performance penalty on a very simple query. It is a real blocker for my migration.
Note: the problem does not exist on my SQL Server 2014 DWH.
The idea is simple: count the rows of dim_activity, grouped by product_key and country_key. However, I want to retrieve product_string and country_string instead of product_key and country_key, so I join.
QUERY 1
SELECT
c.country_code,
b.real_box_sku,
COUNT(*)
FROM "DWH_activity".dim_activity a
LEFT JOIN "DWH_main".dim_box b ON a.box_key=b.box_key
LEFT JOIN "DWH_main".dim_country c ON a.country_key=c.country_key
WHERE a.event_type IN('New Customer','Reactivation','Active','Simple')
GROUP BY
c.country_code,
b.real_box_sku
This takes more than 4 minutes; here is the query plan:
"GroupAggregate (cost=1648008.80..1724929.95 rows=52200 width=25) (actual time=223523.127..271740.258 rows=425 loops=1)"
" Output: c.country_code, b.real_box_sku, count(*)"
" Group Key: c.country_code, b.real_box_sku"
" Buffers: shared hit=2 read=77611, temp read=51224 written=51224"
" -> Sort (cost=1648008.80..1667108.59 rows=7639915 width=25) (actual time=223518.029..269632.659 rows=7628149 loops=1)"
" Output: c.country_code, b.real_box_sku"
" Sort Key: c.country_code, b.real_box_sku"
" Sort Method: external merge Disk: 186416kB"
" Buffers: shared hit=2 read=77611, temp read=51224 written=51224"
" -> Hash Left Join (cost=59.51..408988.74 rows=7639915 width=25) (actual time=0.688..9803.950 rows=7628149 loops=1)"
" Output: c.country_code, b.real_box_sku"
" Hash Cond: (a.country_key = c.country_key)"
" Buffers: shared hit=2 read=77611"
" -> Hash Left Join (cost=35.79..303916.18 rows=7639915 width=15) (actual time=0.661..7129.092 rows=7628149 loops=1)"
" Output: a.country_key, b.real_box_sku"
" Hash Cond: (a.box_key = b.box_key)"
" Buffers: shared hit=2 read=77610"
" -> Seq Scan on "DWH_activity".dim_activity a (cost=0.00..198831.57 rows=7639915 width=6) (actual time=0.020..4032.800 rows=7628149 loops=1)"
" Output: a.country_key, a.vertical, a.cust_key, a.sub_key, a.fact_key, a.event_date_at, a.event_time_at, a.box_key, a.sub_type, a.activity_before, a.event_type, a.reason_key"
" Filter: ((a.event_type)::text = ANY ('{"New Customer",Reactivation,Active,Simple}'::text[]))"
" Rows Removed by Filter: 454422"
" Buffers: shared read=77593"
" -> Hash (cost=26.46..26.46 rows=746 width=17) (actual time=0.631..0.631 rows=746 loops=1)"
" Output: b.real_box_sku, b.box_key"
" Buckets: 1024 Batches: 1 Memory Usage: 37kB"
" Buffers: shared hit=2 read=17"
" -> Seq Scan on "DWH_main".dim_box b (cost=0.00..26.46 rows=746 width=17) (actual time=0.011..0.359 rows=746 loops=1)"
" Output: b.real_box_sku, b.box_key"
" Buffers: shared hit=2 read=17"
" -> Hash (cost=16.10..16.10 rows=610 width=16) (actual time=0.019..0.019 rows=14 loops=1)"
" Output: c.country_code, c.country_key"
" Buckets: 1024 Batches: 1 Memory Usage: 1kB"
" Buffers: shared read=1"
" -> Seq Scan on "DWH_main".dim_country c (cost=0.00..16.10 rows=610 width=16) (actual time=0.009..0.013 rows=14 loops=1)"
" Output: c.country_code, c.country_key"
" Buffers: shared read=1"
"Planning time: 0.447 ms"
"Execution time: 271781.990 ms"
QUERY 2: the same, but with only one JOIN (the performance is exactly the same whichever table I join).
SELECT
c.country_code,
COUNT(*)
FROM "DWH_activity".dim_activity a
LEFT JOIN "DWH_main".dim_country c ON a.country_key=c.country_key
GROUP BY
c.country_code
This takes 5 seconds; here is the query plan:
"HashAggregate (cost=309990.64..309992.64 rows=200 width=12) (actual time=5943.200..5943.200 rows=7 loops=1)"
" Output: c.country_code, count(*)"
" Group Key: c.country_code"
" Buffers: shared read=77594"
" -> Hash Left Join (cost=23.73..269577.79 rows=8082571 width=12) (actual time=0.037..3873.109 rows=8082571 loops=1)"
" Output: c.country_code"
" Hash Cond: (a.country_key = c.country_key)"
" Buffers: shared read=77594"
" -> Seq Scan on "DWH_activity".dim_activity a (cost=0.00..158418.71 rows=8082571 width=2) (actual time=0.016..1261.439 rows=8082571 loops=1)"
" Output: a.country_key, a.vertical, a.cust_key, a.sub_key, a.fact_key, a.event_date_at, a.event_time_at, a.box_key, a.sub_type, a.activity_before, a.event_type, a.reason_key"
" Buffers: shared read=77593"
" -> Hash (cost=16.10..16.10 rows=610 width=16) (actual time=0.013..0.013 rows=14 loops=1)"
" Output: c.country_code, c.country_key"
" Buckets: 1024 Batches: 1 Memory Usage: 1kB"
" Buffers: shared read=1"
" -> Seq Scan on "DWH_main".dim_country c (cost=0.00..16.10 rows=610 width=16) (actual time=0.006..0.011 rows=14 loops=1)"
" Output: c.country_code, c.country_key"
" Buffers: shared read=1"
"Planning time: 0.140 ms"
"Execution time: 5943.249 ms"
QUERY 3: I do the count first, then join on the aggregate.
SELECT
c.country_code,
b.real_box_sku,
COUNT(*)
FROM (
SELECT
country_key,
box_key,
COUNT(*)
FROM "DWH_activity".dim_activity a
GROUP BY
country_key,
box_key
) a
LEFT JOIN "DWH_main".dim_box b ON a.box_key=b.box_key
LEFT JOIN "DWH_main".dim_country c ON a.country_key=c.country_key
GROUP BY
c.country_code,
b.real_box_sku
This takes 3 seconds; here is the query plan:
"HashAggregate (cost=219263.82..219294.06 rows=3024 width=25) (actual time=3990.415..3990.492 rows=425 loops=1)"
" Output: c.country_code, b.real_box_sku, count(*)"
" Group Key: c.country_code, b.real_box_sku"
" Buffers: shared hit=35 read=77578"
" -> Hash Left Join (cost=219097.50..219241.14 rows=3024 width=25) (actual time=3989.832..3990.232 rows=440 loops=1)"
" Output: b.real_box_sku, c.country_code"
" Hash Cond: (a.country_key = c.country_key)"
" Buffers: shared hit=35 read=77578"
" -> Hash Left Join (cost=219073.78..219175.84 rows=3024 width=15) (actual time=3989.815..3990.073 rows=440 loops=1)"
" Output: a.country_key, b.real_box_sku"
" Hash Cond: (a.box_key = b.box_key)"
" Buffers: shared hit=34 read=77578"
" -> HashAggregate (cost=219037.99..219068.23 rows=3024 width=6) (actual time=3989.414..3989.508 rows=440 loops=1)"
" Output: a.country_key, a.box_key, count(*)"
" Group Key: a.country_key, a.box_key"
" Buffers: shared hit=32 read=77561"
" -> Seq Scan on "DWH_activity".dim_activity a (cost=0.00..158418.71 rows=8082571 width=6) (actual time=0.024..1115.551 rows=8082571 loops=1)"
" Output: a.country_key, a.vertical, a.cust_key, a.sub_key, a.fact_key, a.event_date_at, a.event_time_at, a.box_key, a.sub_type, a.activity_before, a.event_type, a.reason_key"
" Buffers: shared hit=32 read=77561"
" -> Hash (cost=26.46..26.46 rows=746 width=17) (actual time=0.378..0.378 rows=746 loops=1)"
" Output: b.real_box_sku, b.box_key"
" Buckets: 1024 Batches: 1 Memory Usage: 37kB"
" Buffers: shared hit=2 read=17"
" -> Seq Scan on "DWH_main".dim_box b (cost=0.00..26.46 rows=746 width=17) (actual time=0.011..0.210 rows=746 loops=1)"
" Output: b.real_box_sku, b.box_key"
" Buffers: shared hit=2 read=17"
" -> Hash (cost=16.10..16.10 rows=610 width=16) (actual time=0.010..0.010 rows=14 loops=1)"
" Output: c.country_code, c.country_key"
" Buckets: 1024 Batches: 1 Memory Usage: 1kB"
" Buffers: shared hit=1"
" -> Seq Scan on "DWH_main".dim_country c (cost=0.00..16.10 rows=610 width=16) (actual time=0.003..0.006 rows=14 loops=1)"
" Output: c.country_code, c.country_key"
" Buffers: shared hit=1"
"Planning time: 0.220 ms"
"Execution time: 3990.589 ms"
So it looks like aggregating over multiple joins makes my queries unusable.
As a consequence, most of our reports would have to be deprecated, or simply cannot run...
Thanks for your help ;)
EDIT 1: I have replaced the plain `EXPLAIN ANALYZE` output with `EXPLAIN (ANALYZE, VERBOSE, BUFFERS)`, following a_horse_with_no_name's request.