Question

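I am trying to update every row in order_item. status is a newly created column and must hold the latest value from the order_update table. One item can have several updates.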

I am using PostgreSQL 9.1.

I have this UPDATE statement.
Table order_item has 800K records; table order_update has 5 million records.

update order_item
set status = (
    select production_stage
    from order_update
    where id = (
        select max(id)
        from order_update
        where order_item_id = order_item.id
    )
);

How can I make this SQL perform as well as possible? I know the update will take some time; I just want it to finish as fast as possible.

I noticed this when running the following SQL against the 5 million records:

select max(id) from order_update where order_item_id = 100;

EXPLAIN output:

Result  (cost=784.10..784.11 rows=1 width=0)
  InitPlan 1 (returns $0)
    ->  Limit  (cost=0.00..784.10 rows=1 width=8)
          ->  Index Scan Backward using order_update_pkey on order_update  (cost=0.00..104694554.13 rows=133522 width=8)
                Index Cond: (id IS NOT NULL)
                Filter: (order_item_id = 100)

This takes about 6 seconds.

When I run the same SQL against 1 million records:
EXPLAIN output:

Aggregate  (cost=13.43..13.44 rows=1 width=8)
  ->  Index Scan using order_update_order_item_id_idx on order_update  (cost=0.00..13.40 rows=11 width=8)
        Index Cond: (order_item_id = 100)

This takes about 11 ms. 11 ms versus 6 seconds: why such a huge difference?

To narrow it down, I tried this:

select id from order_update where order_item_id = 100 order by id asc limit 1;

Total query runtime: 41 ms.

Then this:

select id from order_update where order_item_id = 100 order by id desc limit 1;

Total query runtime: 5310 ms.

A huge difference between asc and desc.

Solution: create an index:

CREATE INDEX order_update_mult_idx ON order_update (order_item_id, id DESC);

Update:

UPDATE order_item i
SET    test_print_provider_id = u.test_print_provider_id
FROM  (
   SELECT DISTINCT ON (1)
          test_print_provider_id
   FROM   orders
   ORDER  BY 1, id DESC
   ) u
WHERE  i.order_id = u.id
AND    i.test_print_provider_id IS DISTINCT FROM u.test_print_provider_id;

Answer 1

My educated guess: this will be substantially faster.

UPDATE order_item i
SET    status = u.production_stage
FROM  (
   SELECT DISTINCT ON (1)
          order_item_id, production_stage
   FROM   order_update
   ORDER  BY 1, id DESC
   ) u
WHERE  i.id = u.order_item_id
AND    i.status IS DISTINCT FROM u.production_stage;   -- avoid empty updates

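There is a subtle difference from the query in the question. The original updates every row of order_item. If no matching row is found in order_update, that results in status being set to NULL. This query leaves those rows alone (original value kept, no update).
A detailed explanation of the subquery with DISTINCT ON is in this closely related answer:
Select first row in each GROUP BY group?
Generally, a single subquery should easily outperform the approach with correlated subqueries. Even more so with an optimized query.
If order_item.status should be defined NOT NULL, the last line can be simplified with <>.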
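If you do need the original behaviour for items that have no row in order_update, a follow-up statement along these lines could reset them. This is only a sketch: it assumes status may already hold non-NULL values. For a freshly added column every row is NULL already, so it would change nothing.

```sql
-- Sketch, not part of the query above: reproduce the original behaviour
-- for items that have no matching row in order_update.
UPDATE order_item i
SET    status = NULL
WHERE  i.status IS NOT NULL              -- avoid empty updates
AND    NOT EXISTS (
   SELECT 1
   FROM   order_update u
   WHERE  u.order_item_id = i.id
   );
```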
A multicolumn index like this might help:
```
CREATE INDEX order_update_mult_idx ON order_update(order_item_id, id DESC);
```
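The descending order on the second column is essential. However, since you are using all or most of the table in a single scan, an index is probably not going to help, except perhaps for a covering index in Postgres 9.2 or later: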
```
CREATE INDEX order_update_mult_idx
ON order_update(order_item_id, id DESC, production_stage);
```

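EXPLAIN only gives you the plan Postgres came up with. Those numbers can be off if the planner's estimates and the cost parameters are not set accurately. To get actual performance data, you have to run EXPLAIN ANALYZE, which of course takes a long time on big tables, since it test-executes the query.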
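Because EXPLAIN ANALYZE really executes the statement, one way to time an UPDATE without keeping its changes is to wrap the test run in a transaction and roll it back. A sketch, reusing the UPDATE from this answer:

```sql
BEGIN;

-- EXPLAIN ANALYZE runs the UPDATE for real and reports actual times.
EXPLAIN ANALYZE
UPDATE order_item i
SET    status = u.production_stage
FROM  (
   SELECT DISTINCT ON (1)
          order_item_id, production_stage
   FROM   order_update
   ORDER  BY 1, id DESC
   ) u
WHERE  i.id = u.order_item_id
AND    i.status IS DISTINCT FROM u.production_stage;

ROLLBACK;  -- discard the changes made by the test run
```

Keep in mind that the rolled-back run still leaves dead row versions behind, which VACUUM has to clean up later.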

Answer 2

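It would help if you had an index on order_update on id that also includes order_item_id and production_stage. Other than that, this is fairly straightforward. Using a temporary table instead of the subquery may be an option, but I don't see much else that could be improved.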
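The temporary-table idea could look roughly like this. It is a sketch only: the names latest_update and latest_update_item_idx are made up for illustration, and whether it beats a single UPDATE with a subquery would have to be measured.

```sql
-- Hypothetical temp table holding the latest update per item.
CREATE TEMP TABLE latest_update AS
SELECT DISTINCT ON (order_item_id)
       order_item_id, production_stage
FROM   order_update
ORDER  BY order_item_id, id DESC;

-- Hypothetical index name; gives the planner an indexed join column.
CREATE INDEX latest_update_item_idx ON latest_update (order_item_id);

-- Autovacuum does not analyze temp tables, so collect statistics manually.
ANALYZE latest_update;

UPDATE order_item i
SET    status = u.production_stage
FROM   latest_update u
WHERE  i.id = u.order_item_id;
```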

Answer 3

~~How about the following rewrite?~~

update order_item
set status = (
    select a.production_stage from (
        select ou.id, ou.production_stage
        from order_update ou
        where ou.order_item_id = order_item.id
        order by ou.id desc
    ) a limit 1
);

Edit: since the above turned out to be slower, how about the following rewrite?

update order_item
set status = (
    select a.production_stage from (
/********************************************** INNER QUERY START **/
        select ou.order_item_id, ou.production_stage
        from order_update ou
        INNER JOIN (
            select order_item_id, max(id) as max_id
            from order_update
            group by order_item_id
        ) ou_max ON (ou.order_item_id = ou_max.order_item_id
                     AND ou.id = ou_max.max_id)
/********************************************** INNER QUERY END **/
    ) a where a.order_item_id = order_item.id
);

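Here, your DBMS will execute the inner query only once, to create a temporary table A. After that, it will simply behave like: update order_item set status = (select a.production_stage from a where a.order_item_id = order_item.id);. This will be quite fast, since A is already created and available as a fixed table for the whole update, and is not re-created for each order_item_id.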
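If you would rather not depend on how the planner treats the correlated subquery, the same max(id) join can be moved into the FROM clause of the UPDATE. A sketch under that assumption; unlike the statement above, it leaves items without any row in order_update untouched instead of setting them to NULL:

```sql
UPDATE order_item i
SET    status = ou.production_stage
FROM   order_update ou
JOIN  (
   -- latest update id per item
   SELECT order_item_id, max(id) AS max_id
   FROM   order_update
   GROUP  BY order_item_id
   ) ou_max ON  ou.order_item_id = ou_max.order_item_id
            AND ou.id = ou_max.max_id
WHERE  i.id = ou.order_item_id;
```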

Slow query on update on large table
