I have a table, parameters_products, with about 300k records. Can this query be optimized?
SELECT parameter_id AS id,
COUNT(product_id) AS COUNT
FROM "parameters_products"
WHERE product_id IN
(SELECT product_id
FROM parameters_products
WHERE parameter_id IN ('2'))
GROUP BY parameter_id
Query output:
2;274669
EXPLAIN ANALYZE VERBOSE ... output:
HashAggregate (cost=23628.54..23628.56 rows=2 width=8) (actual time=2231.367..2231.368 rows=1 loops=1)
Output: parameters_products.parameter_id, count(parameters_products.product_id)
Group Key: parameters_products.parameter_id
-> Hash Semi Join (cost=9607.86..22256.43 rows=274421 width=8) (actual time=692.586..1893.261 rows=274669 loops=1)
Output: parameters_products.parameter_id, parameters_products.product_id
Hash Cond: (parameters_products.product_id = parameters_products_1.product_id)
-> Seq Scan on public.parameters_products (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.025..353.358 rows=299728 loops=1)
Output: parameters_products.parameter_id, parameters_products.product_id
-> Hash (cost=5105.60..5105.60 rows=274421 width=4) (actual time=692.331..692.331 rows=274669 loops=1)
Output: parameters_products_1.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2425kB
-> Seq Scan on public.parameters_products parameters_products_1 (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.013..344.656 rows=274669 loops=1)
Output: parameters_products_1.product_id
Filter: (parameters_products_1.parameter_id = 2)
Rows Removed by Filter: 25059
Planning time: 0.279 ms
Execution time: 2231.499 ms
This is PostgreSQL 9.4.1, with VACUUM enabled.
I also just tried this query, but it is slow as well:
SELECT pp1.parameter_id,
count(pp1.product_id)
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id IN (2)
GROUP BY pp1.parameter_id
EXPLAIN ANALYZE VERBOSE output:
HashAggregate (cost=23742.42..23742.44 rows=2 width=8) (actual time=2361.654..2361.654 rows=1 loops=1)
Output: pp1.parameter_id, count(pp1.product_id)
Group Key: pp1.parameter_id
-> Hash Join (cost=9607.86..22370.31 rows=274421 width=8) (actual time=715.409..2012.345 rows=274669 loops=1)
Output: pp1.parameter_id, pp1.product_id
Hash Cond: (pp1.product_id = pp2.product_id)
-> Seq Scan on public.parameters_products pp1 (cost=0.00..4356.28 rows=299728 width=8) (actual time=0.012..360.789 rows=299728 loops=1)
Output: pp1.parameter_id, pp1.product_id
-> Hash (cost=5105.60..5105.60 rows=274421 width=4) (actual time=715.176..715.176 rows=274669 loops=1)
Output: pp2.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2425kB
-> Seq Scan on public.parameters_products pp2 (cost=0.00..5105.60 rows=274421 width=4) (actual time=0.009..353.386 rows=274669 loops=1)
Output: pp2.product_id
Filter: (pp2.parameter_id = 2)
Rows Removed by Filter: 25059
Planning time: 0.135 ms
Execution time: 2361.735 ms
Indexes:
CREATE INDEX parameters_products_parameter_id_idx
ON parameters_products
USING btree
(parameter_id);
CREATE INDEX parameters_products_product_id_idx
ON parameters_products
USING btree
(product_id);
CREATE INDEX parameters_products_product_id_parameter_id_idx
ON parameters_products
USING btree
(product_id, parameter_id);
EXPLAIN ANALYZE VERBOSE
SELECT pp1.parameter_id
FROM parameters_products pp1
LEFT JOIN parameters_products pp2 ON pp1.product_id = pp2.product_id
Output:
Hash Left Join (cost=9241.88..22699.06 rows=299728 width=4) (actual time=727.683..2080.798 rows=299728 loops=1)
Output: pp1.parameter_id
Hash Cond: (pp1.product_id = pp2.product_id)
-> Seq Scan on public.parameters_products pp1 (cost=0.00..4324.28 rows=299728 width=8) (actual time=0.031..355.656 rows=299728 loops=1)
Output: pp1.parameter_id, pp1.product_id
-> Hash (cost=4324.28..4324.28 rows=299728 width=4) (actual time=727.579..727.579 rows=299728 loops=1)
Output: pp2.product_id
Buckets: 16384 Batches: 4 Memory Usage: 2644kB
-> Seq Scan on public.parameters_products pp2 (cost=0.00..4324.28 rows=299728 width=4) (actual time=0.008..350.797 rows=299728 loops=1)
Output: pp2.product_id
Planning time: 0.472 ms
Execution time: 2392.582 ms
SET enable_seqscan = OFF;
This reduces the execution time, but not significantly.
Answer 0: (score: 2)
The first thing I would try is replacing the IN with EXISTS:
SELECT parameter_id AS id,
COUNT(product_id) AS COUNT
FROM parameters_products pp
WHERE EXISTS (SELECT 1
FROM parameters_products pp2
WHERE pp2.product_id = pp.product_id AND
pp2.parameter_id = 2
)
GROUP BY parameter_id;
Also, make sure you have an index on parameters_products(product_id, parameter_id).
Another idea is to use a window function:
select parameter_id, count(*)
from (select pp.*,
sum(case when pp.parameter_id = 2 then 1 else 0 end) over (partition by product_id) as cnt2
from parameters_products pp
) pp
where cnt2 > 0
group by parameter_id;
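To make it easier to see what the window-function version computes, here is a minimal sketch on a tiny temporary table. The table name pp_demo and its five rows are hypothetical and made up for illustration; they are not from the question.

```sql
-- Hypothetical sample data (not from the question).
CREATE TEMP TABLE pp_demo (product_id int, parameter_id int);
INSERT INTO pp_demo VALUES
    (1, 2), (1, 3),   -- product 1 has parameter 2
    (2, 3),           -- product 2 does not
    (3, 2), (3, 4);   -- product 3 has parameter 2

-- cnt2 counts, per product, how many of its rows have parameter_id = 2.
-- Keeping rows with cnt2 > 0 keeps every row of a product that has parameter 2.
SELECT parameter_id, count(*)
FROM (SELECT pp.*,
             sum(CASE WHEN pp.parameter_id = 2 THEN 1 ELSE 0 END)
                 OVER (PARTITION BY product_id) AS cnt2
      FROM pp_demo pp
     ) pp
WHERE cnt2 > 0
GROUP BY parameter_id
ORDER BY parameter_id;
-- Result: (2, 2), (3, 1), (4, 1)
```

Products 1 and 3 carry parameter 2, so their four rows survive the filter, while the single row of product 2 is dropped. This is the same set of rows the IN and EXISTS versions select, obtained with one pass over the table plus a sort or hash on product_id.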
Answer 1: (score: 1)
Try:
SELECT pp1.parameter_id AS ID, COUNT(pp1.product_id) AS COUNT
FROM parameters_products pp1
JOIN
parameters_products pp2
ON
pp2.parameter_id = 2
AND
pp1.product_id = pp2.product_id
GROUP BY
pp1.parameter_id
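Moving the filter condition from the WHERE clause into the ON clause should reduce the total number of rows involved in the JOIN. Hopefully this gives you the same savings you saw in the comments and brings the execution time below 1 second.
Answer 2: (score: 0)
RhodiumToad in #postgresql on freenode recommended a window function like the one below. Note that it differs from Gordon Linoff's window function in that it uses bool_or rather than sum(case ...):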
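One way to check whether moving the condition really changes anything is to compare the plans of the two forms side by side. This is only a sketch against the table from the question. For an inner join, PostgreSQL generally treats a condition in ON the same as that condition in WHERE, so the two plans may well turn out identical.

```sql
-- Form 1: filter in the ON clause.
EXPLAIN
SELECT pp1.parameter_id, count(pp1.product_id)
FROM parameters_products pp1
JOIN parameters_products pp2
  ON pp2.parameter_id = 2
 AND pp1.product_id = pp2.product_id
GROUP BY pp1.parameter_id;

-- Form 2: same join, filter in the WHERE clause.
EXPLAIN
SELECT pp1.parameter_id, count(pp1.product_id)
FROM parameters_products pp1
JOIN parameters_products pp2
  ON pp1.product_id = pp2.product_id
WHERE pp2.parameter_id = 2
GROUP BY pp1.parameter_id;
```

If both statements print the same plan, any speed-up has to come from somewhere else, for example caching or memory settings.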
SELECT parameter_id, count(product_id)
FROM
(SELECT *, bool_or(parameter_id = 2)
OVER
(partition by product_id) AS matching
FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;
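RhodiumToad also mentioned that, whether you use a window function, a join, or a subselect, the work_mem parameter is probably too small for any query at this scale. He suggested increasing work_mem so that the sort routines do not spill to disk.
If any of this helps you, the credit goes to RhodiumToad.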
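The plans in the question show "Batches: 4" on the Hash node, which indicates that the hash table did not fit in work_mem and was processed in several batches. A minimal sketch of how the suggestion could be tried for a single session follows. The value 64MB is an assumption chosen for illustration, not a figure from RhodiumToad; pick one that suits the server's RAM and the number of concurrent sessions.

```sql
-- Check the current setting first.
SHOW work_mem;

-- Assumption: 64MB is an illustrative value only.
-- SET changes the value for the current session, not for the whole server.
SET work_mem = '64MB';

-- Re-run the plan and compare it with the earlier output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT parameter_id, count(product_id)
FROM (SELECT *,
             bool_or(parameter_id = 2)
                 OVER (PARTITION BY product_id) AS matching
      FROM parameters_products) s
WHERE matching
GROUP BY parameter_id;

-- Return to the configured default for this session.
RESET work_mem;
```

In the new plan, a sort that fits in memory is reported with a line such as "Sort Method: quicksort  Memory: ...", whereas a spilled sort shows "Sort Method: external merge  Disk: ...". For the hash join variants, "Batches: 1" means the hash table fit in memory.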