Question

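In a Postgres database, are the following two queries equivalent in effect? Everyone says "joins are always faster than subqueries", but does the Postgres query planner optimize subqueries into joins behind the scenes?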

Query 1:

UPDATE table_a SET col_1 = 'a fixed value' WHERE col_2 IN ( SELECT col_2 FROM table_b );

Explain plan:

Update on table_a  (cost=0.00..9316.10 rows=1 width=827)
  ->  Nested Loop Semi Join  (cost=0.00..9316.10 rows=1 width=827)
        ->  Seq Scan on table_a  (cost=0.00..9287.20 rows=1 width=821)
        ->  Index Scan using idx_table_b on table_b  (cost=0.00..14.45 rows=1 width=14)
              Index Cond: (col_2 = (table_a.col_2)::numeric)

Query 2:

UPDATE table_a ta SET col_1 = 'a fixed value' FROM table_b tb WHERE ta.col_2 = tb.col_2;

Explain plan:

Update on table_a ta  (cost=0.00..9301.67 rows=1 width=827)
  ->  Nested Loop  (cost=0.00..9301.67 rows=1 width=827)
        ->  Seq Scan on table_a ta  (cost=0.00..9287.20 rows=1 width=821)
        ->  Index Scan using idx_table_b on table_b tb  (cost=0.00..14.45 rows=1 width=14)
              Index Cond: (col_2 = (ta.col_2)::numeric)

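I believe they are equivalent in their results (please tell me if I'm wrong). I tried several explain plans with a large amount of data. They appear to be equivalent in performance, both when updating the full table and when restricting table_a.col_2 to a small subset.

I want to be sure I'm not missing anything. If they are equivalent, which one would you choose, and why?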

Answer 1

Does the Postgres query planner optimize subqueries into joins behind the scenes?

Usually, yes.

Don't guess; look at the query plan.

Given:

CREATE TABLE table_a (col_1 text, col_2 integer );
CREATE TABLE table_b (col_2 integer);
INSERT INTO table_b(col_2) VALUES (1),(2),(4),(NULL);
INSERT INTO table_a (col_1, col_2) VALUES ('a fixed value', 2), ('a fixed value', NULL), ('some other value', 2);
ANALYZE table_a;
ANALYZE table_b;

Compare:

regress=> explain UPDATE table_a
SET col_1 = 'a fixed value'
WHERE col_2 IN (
    SELECT col_2 FROM table_b
);
                                QUERY PLAN                                
--------------------------------------------------------------------------
 Update on table_a  (cost=1.09..2.15 rows=2 width=16)
   ->  Hash Semi Join  (cost=1.09..2.15 rows=2 width=16)
         Hash Cond: (table_a.col_2 = table_b.col_2)
         ->  Seq Scan on table_a  (cost=0.00..1.03 rows=3 width=10)
         ->  Hash  (cost=1.04..1.04 rows=4 width=10)
               ->  Seq Scan on table_b  (cost=0.00..1.04 rows=4 width=10)
(6 rows)

regress=> explain UPDATE table_a ta
regress-> SET col_1 = 'a fixed value'
regress-> FROM table_b tb
regress-> WHERE ta.col_2 = tb.col_2;
                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Update on table_a ta  (cost=1.07..2.14 rows=1 width=16)
   ->  Hash Join  (cost=1.07..2.14 rows=1 width=16)
         Hash Cond: (tb.col_2 = ta.col_2)
         ->  Seq Scan on table_b tb  (cost=0.00..1.04 rows=4 width=10)
         ->  Hash  (cost=1.03..1.03 rows=3 width=10)
               ->  Seq Scan on table_a ta  (cost=0.00..1.03 rows=3 width=10)
(6 rows)

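See? Effectively the same plan. The subquery has been converted into a join (a Hash Semi Join here, versus a plain Hash Join for the FROM form).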
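Plain EXPLAIN only shows estimated costs. To compare what the two statements actually do, a minimal sketch along these lines may help, assuming the sample tables defined above. EXPLAIN ANALYZE really executes the UPDATE, so each run is wrapped in a transaction that is rolled back.

```sql
-- Sketch only: assumes table_a and table_b from the setup above.
-- EXPLAIN ANALYZE executes the statement, so roll back to leave the data untouched.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE table_a
SET col_1 = 'a fixed value'
WHERE col_2 IN (SELECT col_2 FROM table_b);
ROLLBACK;

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE table_a ta
SET col_1 = 'a fixed value'
FROM table_b tb
WHERE ta.col_2 = tb.col_2;
ROLLBACK;
```

On tables this small the timings mean little; the comparison is only informative on realistically sized data.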

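It's usually cleaner to express this using EXISTS rather than IN. This matters more for NOT IN vs NOT EXISTS, since those differ semantically with respect to NULLs, so it's a good habit anyway. You'd write: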

UPDATE table_a a
SET col_1 = 'a fixed value'
WHERE EXISTS (SELECT 1 FROM table_b b WHERE b.col_2 = a.col_2);

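This will also tend to produce the same plan, but IMO it is a bit nicer, not least because if it does not get planned as a join, a correlated subquery is usually less horrible than a giant IN list scan.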
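The NULL difference between NOT IN and NOT EXISTS can be seen with the sample data above, where table_b contains a NULL. This is an illustrative sketch, not part of the original queries:

```sql
-- Sketch only: assumes table_a and table_b from the setup above.

-- NOT IN: because the subquery result contains a NULL, the predicate can
-- never evaluate to true (only false or NULL), so no rows are returned.
SELECT *
FROM table_a
WHERE col_2 NOT IN (SELECT col_2 FROM table_b);

-- NOT EXISTS: returns the rows of table_a that have no equal col_2 in table_b.
-- With the sample data this is the row whose col_2 is NULL, since
-- "b.col_2 = NULL" is never true and therefore no match exists for it.
SELECT *
FROM table_a a
WHERE NOT EXISTS (SELECT 1 FROM table_b b WHERE b.col_2 = a.col_2);
```

If the NOT IN form is kept, filtering NULLs out of the subquery (WHERE col_2 IS NOT NULL) avoids the "never true" behavior, though rows of table_a with a NULL col_2 are still not returned.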

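Answer 2

IN:

Returns true if a specified value matches any value in a subquery or a list.

EXISTS:

Returns true if a subquery contains any rows.

JOIN:

Joins two result sets on the joining column.

If the joining column is UNIQUE, then the JOIN is faster.

If it is not, then IN is faster than a JOIN on DISTINCT.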
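Why uniqueness matters can be sketched with a small self-contained example. The inline tables a and b below are hypothetical and built with VALUES purely for illustration; b.col_2 is deliberately not unique.

```sql
-- Plain join: the row a.col_2 = 2 matches two rows in b, so it appears twice.
WITH a(col_2) AS (VALUES (1), (2)),
     b(col_2) AS (VALUES (2), (2), (3))
SELECT a.col_2
FROM a
JOIN b ON a.col_2 = b.col_2;

-- IN: a membership test, so each row of a appears at most once.
WITH a(col_2) AS (VALUES (1), (2)),
     b(col_2) AS (VALUES (2), (2), (3))
SELECT a.col_2
FROM a
WHERE a.col_2 IN (SELECT col_2 FROM b);

-- Join on DISTINCT: de-duplicating b first gives the same rows as IN,
-- at the cost of an extra de-duplication step.
WITH a(col_2) AS (VALUES (1), (2)),
     b(col_2) AS (VALUES (2), (2), (3))
SELECT a.col_2
FROM a
JOIN (SELECT DISTINCT col_2 FROM b) d ON a.col_2 = d.col_2;
```

For the UPDATE in the question this duplication does not change the result, because every matched row is set to the same fixed value, but it does matter for SELECTs and for updates that take values from the joined table.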

See this article in my blog for performance details for Microsoft SQL Server, which may be relevant to PostgreSQL too:

IN vs. JOIN vs. EXISTS in Microsoft SQL Server

Update using a subquery vs. update using a join: which is better in terms of performance?
