Question

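I'm trying to optimize a query that is easy to explain but hard to write. I have a website that lets users upload images to folders and publish those folders on a specific date.

I want to show, for the most recently published folders, the image with the smallest filename in each folder (e.g. given 0.jpg and 1.jpg, I pick 0.jpg), and only from image folders that have already been published.

The database structure is as follows (I have omitted irrelevant columns for brevity):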

CREATE TABLE image (
    id SERIAL PRIMARY KEY,
    "imageFileId" integer REFERENCES image_file(id),
    "imageFolderId" integer REFERENCES image_folder(id)
);

CREATE UNIQUE INDEX "PK_d6db1ab4ee9ad9dbe86c64e4cc3" ON image(id int4_ops);
CREATE INDEX "IDX_IMAGE_IMAGE_FOLDER" ON image("imageFolderId" int4_ops);
CREATE INDEX "IDX_IMAGE_IMAGE_FILE" ON image("imageFileId" int4_ops);


CREATE TABLE image_file (
    id SERIAL PRIMARY KEY,
    filename character varying NOT NULL DEFAULT 'file.jpg'::character varying
);

CREATE UNIQUE INDEX "PK_a63c149156c13fef954c6f56398" ON image_file(id int4_ops);
CREATE INDEX "IDX_IMAGE_FILE_FILENAME" ON image_file(filename text_ops);

CREATE TABLE image_folder (
    id SERIAL PRIMARY KEY,
    "publicationDate" timestamp without time zone
);

CREATE UNIQUE INDEX "PK_7913e2df97a29ff24201598251e" ON image_folder(id int4_ops);
CREATE INDEX "IDX_IMAGE_FOLDER_PUBLICATION_DATE" ON image_folder("publicationDate" timestamp_ops);

We came up with this query. It did get faster after setting random_page_cost to 1, but it is still slow:

SELECT DISTINCT
    ON (image_folder."publicationDate", image."imageFolderId") image.*
FROM image
INNER JOIN 
    (SELECT "imageFolderId", min(image_file.filename) AS "firstFileName"
    FROM image
    INNER JOIN image_file
        ON image_file.id = image."imageFileId"
    GROUP BY  image."imageFolderId" ) AS first_image_file
    ON first_image_file."imageFolderId" = image."imageFolderId"
INNER JOIN image_folder
    ON image_folder.id = image."imageFolderId"
INNER JOIN image_file
    ON image_file.id = image."imageFileId"
WHERE image_file.filename = first_image_file."firstFileName"
        AND image_folder."publicationDate" IS NOT NULL
        AND image_folder."publicationDate" <= now()

ORDER BY  image_folder."publicationDate" DESC,
        image."imageFolderId" DESC,
        image_file.filename ASC LIMIT 40 OFFSET 0

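Is there anything I can do to optimize this query? I'm considering simplifying the database and getting rid of image_file, but since this is a very image-centric site I will probably need to attach some metadata to those files, which is why it was designed this way.

Update: this only started getting slow once I reached about 500k rows in each table. That number will certainly grow soon, and the query will probably get slower.

Update 2: the query plan: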

Limit  (cost=47064.65..47064.65 rows=1 width=53)
  ->  Unique  (cost=47064.65..47064.65 rows=1 width=53)
        ->  Sort  (cost=47064.65..47064.65 rows=1 width=53)
              Sort Key: image_folder."publicationDate" DESC, image."imageFolderId" DESC, image_file.filename
              ->  Nested Loop  (cost=35419.77..47064.65 rows=1 width=53)
                    Join Filter: (image_1."imageFolderId" = image_folder.id)
                    ->  Nested Loop  (cost=35419.71..47064.58 rows=1 width=49)
                          Join Filter: (image_1."imageFolderId" = image."imageFolderId")
                          ->  Nested Loop  (cost=35419.63..46000.90 rows=9454 width=21)
                                ->  HashAggregate  (cost=35419.55..35447.66 rows=9371 width=40)
                                      Group Key: image_1."imageFolderId"
                                      ->  Hash Join  (cost=11870.20..34935.82 rows=483723 width=17)
                                            Hash Cond: (image_file_1.id = image_1."imageFileId")
                                            ->  Seq Scan on image_file image_file_1  (cost=0.00..21237.56 rows=502521 width=17)
                                            ->  Hash  (cost=10177.17..10177.17 rows=483723 width=8)
                                                  ->  Seq Scan on image image_1  (cost=0.00..10177.17 rows=483723 width=8)
                                ->  Index Scan using "IDX_IMAGE_FILE_FILENAME" on image_file  (cost=0.08..1.12 rows=1 width=17)
                                      Index Cond: ((filename)::text = (min((image_file_1.filename)::text)))
                          ->  Index Scan using "IDX_IMAGE_IMAGE_FILE" on image  (cost=0.08..0.11 rows=1 width=32)
                                Index Cond: ("imageFileId" = image_file.id)
                    ->  Index Scan using "PK_7913e2df97a29ff24201598251e" on image_folder  (cost=0.06..0.06 rows=1 width=12)
                          Index Cond: (id = image."imageFolderId")
                          Filter: (("publicationDate" IS NOT NULL) AND ("publicationDate" <= now()))

Answer 1

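OK, here is what I see:

Your query has almost no filtering conditions. In theory, you are reading most of the rows.
Your query has a LIMIT clause that returns only 40 rows. That can act as an effective filter only if the query can be "pipelined". It looks like your query may be
Your query contains a subquery, in the form of a table expression that you join against. This subquery has no filtering condition, so it reads every row in image and image_file. To me, that looks like the culprit.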

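Now, is there any chance you could "save" this subquery into a separate table and query that instead? Perhaps you could update it every hour and add appropriate indexes on it. If that is possible, I think that alone would give you a real improvement in the query.

Instead of a table, you could use a Materialized View and "refresh" it roughly once an hour, or after whatever specific events you identify.

In any case, I would get the execution plan and add it to your question. That will give us a good idea of what PostgreSQL's optimizer is doing. To get the execution plan, put explain in front of your select, for example: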
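
To make the materialized view idea concrete, here is a rough sketch based on the schema in the question. The names folder_first_image, "imageId" and folder_first_image_folder_idx are made up for illustration, and I have not run this against the real data, so treat it as a starting point rather than a finished solution.

```sql
-- Hypothetical materialized view: one row per folder, holding the image
-- whose file has the smallest filename in that folder.
CREATE MATERIALIZED VIEW folder_first_image AS
SELECT DISTINCT ON (image."imageFolderId")
       image."imageFolderId",
       image.id AS "imageId",
       image_file.filename
FROM image
INNER JOIN image_file
    ON image_file.id = image."imageFileId"
WHERE image."imageFolderId" IS NOT NULL
ORDER BY image."imageFolderId", image_file.filename, image.id;

-- A unique index is required for REFRESH ... CONCURRENTLY and also
-- serves the join from image_folder.
CREATE UNIQUE INDEX folder_first_image_folder_idx
    ON folder_first_image ("imageFolderId");

-- Run this on a schedule (e.g. hourly) or after uploads/publications.
REFRESH MATERIALIZED VIEW CONCURRENTLY folder_first_image;

-- The page query then only touches the 40 newest published folders.
SELECT image.*
FROM image_folder
INNER JOIN folder_first_image
    ON folder_first_image."imageFolderId" = image_folder.id
INNER JOIN image
    ON image.id = folder_first_image."imageId"
WHERE image_folder."publicationDate" <= now()
ORDER BY image_folder."publicationDate" DESC,
         image_folder.id DESC
LIMIT 40;
```

The trade-off is staleness: images uploaded after the last refresh are not considered until the next one.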

explain
SELECT DISTINCT
ON (image_folder."publicationDate", image."imageFolderId") image.*
FROM image
...
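
Plain explain only shows the planner's estimates. If you can afford to actually run the statement, the ANALYZE and BUFFERS options add real timings, row counts and buffer usage. As a sketch, here it is applied to just the unfiltered subquery from the question, which is the part I suspect:

```sql
-- Note: ANALYZE really executes the query, so expect it to take as long
-- as the query itself.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "imageFolderId", min(image_file.filename) AS "firstFileName"
FROM image
INNER JOIN image_file
    ON image_file.id = image."imageFileId"
GROUP BY image."imageFolderId";
```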

Answer 2

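Try pushing the LIMIT down, rewriting the query to use the row_number() window function to get the lexicographically smallest filename per folder, and simplifying from there.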

SELECT x.id,
       x."imageFileId",
       x."imageFolderId"
       FROM (SELECT im.id,
                    im."imageFileId",
                    im."imageFolderId",
                    imfo."publicationDate",
                    row_number() OVER (PARTITION BY imfo.id
                                       ORDER BY imfi.filename ASC) rn
                    FROM (SELECT *
                                 FROM image_folder
                                 WHERE "publicationDate" <= now()
                                 ORDER BY "publicationDate" DESC
                                 LIMIT 40) imfo
                         INNER JOIN image im
                                    ON im."imageFolderId" = imfo.id
                         INNER JOIN image_file imfi
                                    ON imfi.id = im."imageFileId"
                    WHERE imfo."publicationDate" <= now()) x
       WHERE x.rn = 1
       ORDER BY x."publicationDate" DESC,
                x."imageFolderId" DESC;

Also try indexes ON image ("imageFolderId", "imageFileId") and ON image_folder ("publicationDate" DESC). If you are on version 10 or later, you could also try hash indexes ON image_file USING HASH (id) and/or ON image_folder USING HASH (id).
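
Spelled out as statements, those index suggestions could look like the following. The index names are invented here to match the naming style in the question; check with explain whether each index is actually used before keeping it.

```sql
-- Composite index: find a folder's images and their file ids together.
CREATE INDEX "IDX_IMAGE_FOLDER_FILE"
    ON image ("imageFolderId", "imageFileId");

-- Descending index to serve ORDER BY "publicationDate" DESC LIMIT 40.
CREATE INDEX "IDX_IMAGE_FOLDER_PUBLICATION_DATE_DESC"
    ON image_folder ("publicationDate" DESC);

-- Optional hash indexes for equality lookups on id (version 10 or later).
CREATE INDEX "IDX_IMAGE_FILE_ID_HASH"
    ON image_file USING HASH (id);
CREATE INDEX "IDX_IMAGE_FOLDER_ID_HASH"
    ON image_folder USING HASH (id);
```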

How to optimize this query that uses joins on 3 tables with a lot of data, a LIMIT, and a minimum
