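I am trying to optimize a query that is easy to explain but hard to write. I have a website that lets users upload images into folders and publish those folders on a specific date.
For each of the most recently published folders, I want to display the image with the smallest filename (i.e. given 0.jpg and 1.jpg, I pick 0.jpg). Only folders that have already been published should be included.
The database structure is as follows (I have omitted irrelevant columns for brevity):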
CREATE TABLE image_file (
    id SERIAL PRIMARY KEY,
    filename character varying NOT NULL DEFAULT 'file.jpg'::character varying
);
CREATE UNIQUE INDEX "PK_a63c149156c13fef954c6f56398" ON image_file(id int4_ops);
CREATE INDEX "IDX_IMAGE_FILE_FILENAME" ON image_file(filename text_ops);
CREATE TABLE image_folder (
    id SERIAL PRIMARY KEY,
    "publicationDate" timestamp without time zone
);
CREATE UNIQUE INDEX "PK_7913e2df97a29ff24201598251e" ON image_folder(id int4_ops);
CREATE INDEX "IDX_IMAGE_FOLDER_PUBLICATION_DATE" ON image_folder("publicationDate" timestamp_ops);
-- image is created last because it references the two tables above
CREATE TABLE image (
    id SERIAL PRIMARY KEY,
    "imageFileId" integer REFERENCES image_file(id),
    "imageFolderId" integer REFERENCES image_folder(id)
);
CREATE UNIQUE INDEX "PK_d6db1ab4ee9ad9dbe86c64e4cc3" ON image(id int4_ops);
CREATE INDEX "IDX_IMAGE_IMAGE_FOLDER" ON image("imageFolderId" int4_ops);
CREATE INDEX "IDX_IMAGE_IMAGE_FILE" ON image("imageFileId" int4_ops);
We came up with the query below. Setting random_page_cost to 1 did make it faster, but it is still slow:
SELECT DISTINCT ON (image_folder."publicationDate", image."imageFolderId") image.*
FROM image
INNER JOIN
    (SELECT "imageFolderId", min(image_file.filename) AS "firstFileName"
     FROM image
     INNER JOIN image_file
         ON image_file.id = image."imageFileId"
     GROUP BY image."imageFolderId") AS first_image_file
    ON first_image_file."imageFolderId" = image."imageFolderId"
INNER JOIN image_folder
    ON image_folder.id = image."imageFolderId"
INNER JOIN image_file
    ON image_file.id = image."imageFileId"
WHERE image_file.filename = first_image_file."firstFileName"
  AND image_folder."publicationDate" IS NOT NULL
  AND image_folder."publicationDate" <= now()
ORDER BY image_folder."publicationDate" DESC,
         image."imageFolderId" DESC,
         image_file.filename ASC
LIMIT 40 OFFSET 0
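Is there anything I can do to optimize this query? I am considering simplifying the database and getting rid of image_file, but because this is a very image-centric website I will probably need to attach metadata to those files, which is why it was designed this way.
Update: this only started getting slow once I reached roughly 500,000 records in each table. That number will certainly grow soon, and the query will probably get slower.
Update 2: the query plan: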
Limit (cost=47064.65..47064.65 rows=1 width=53)
  -> Unique (cost=47064.65..47064.65 rows=1 width=53)
        -> Sort (cost=47064.65..47064.65 rows=1 width=53)
              Sort Key: image_folder."publicationDate" DESC, image."imageFolderId" DESC, image_file.filename
              -> Nested Loop (cost=35419.77..47064.65 rows=1 width=53)
                    Join Filter: (image_1."imageFolderId" = image_folder.id)
                    -> Nested Loop (cost=35419.71..47064.58 rows=1 width=49)
                          Join Filter: (image_1."imageFolderId" = image."imageFolderId")
                          -> Nested Loop (cost=35419.63..46000.90 rows=9454 width=21)
                                -> HashAggregate (cost=35419.55..35447.66 rows=9371 width=40)
                                      Group Key: image_1."imageFolderId"
                                      -> Hash Join (cost=11870.20..34935.82 rows=483723 width=17)
                                            Hash Cond: (image_file_1.id = image_1."imageFileId")
                                            -> Seq Scan on image_file image_file_1 (cost=0.00..21237.56 rows=502521 width=17)
                                            -> Hash (cost=10177.17..10177.17 rows=483723 width=8)
                                                  -> Seq Scan on image image_1 (cost=0.00..10177.17 rows=483723 width=8)
                                -> Index Scan using "IDX_IMAGE_FILE_FILENAME" on image_file (cost=0.08..1.12 rows=1 width=17)
                                      Index Cond: ((filename)::text = (min((image_file_1.filename)::text)))
                          -> Index Scan using "IDX_IMAGE_IMAGE_FILE" on image (cost=0.08..0.11 rows=1 width=32)
                                Index Cond: ("imageFileId" = image_file.id)
                    -> Index Scan using "PK_7913e2df97a29ff24201598251e" on image_folder (cost=0.06..0.06 rows=1 width=12)
                          Index Cond: (id = image."imageFolderId")
                          Filter: (("publicationDate" IS NOT NULL) AND ("publicationDate" <= now()))
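Answer 0: (score: 0)
OK, here is what I see:
Your query has virtually no filtering conditions. In theory, you are reading most of the rows.
Your query has a LIMIT clause that shows only 40 rows. A LIMIT acts as an effective filter only if the query can be "pipelined". It seems your query may be
Your query contains a subquery, in the form of a table expression that you join to. This subquery has no filtering condition, so it reads every row in image and image_file. To me, that sounds like the culprit.
Now, is there any chance you could "save" this subquery into a separate table and query that instead? Perhaps you could update it every hour and add appropriate indexes to it. If that is possible, I think that change alone would give you a real improvement in the query.
Instead of a table, you could use a materialized view and "refresh" it roughly once an hour, or after specific events that you identify.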
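A minimal sketch of the materialized view idea, assuming PostgreSQL 9.4 or later (needed for REFRESH ... CONCURRENTLY). The names folder_first_image, folder_first_image_folder_idx and "imageId" are hypothetical, and the view stores the chosen image per folder rather than only the minimum filename, which is a small variation on the subquery in the question:

```sql
-- Hypothetical view: one row per folder, holding the image with the smallest filename.
CREATE MATERIALIZED VIEW folder_first_image AS
SELECT DISTINCT ON (image."imageFolderId")
       image."imageFolderId",
       image.id AS "imageId",
       image."imageFileId",
       image_file.filename
FROM image
INNER JOIN image_file
    ON image_file.id = image."imageFileId"
ORDER BY image."imageFolderId",
         image_file.filename ASC,
         image.id ASC;

-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
CREATE UNIQUE INDEX folder_first_image_folder_idx
    ON folder_first_image ("imageFolderId");

-- Run on a schedule (for example hourly) or after uploads and publications.
REFRESH MATERIALIZED VIEW CONCURRENTLY folder_first_image;

-- The listing query now only touches image_folder and the precomputed rows.
SELECT ffi."imageId" AS id,
       ffi."imageFileId",
       ffi."imageFolderId"
FROM image_folder
INNER JOIN folder_first_image ffi
    ON ffi."imageFolderId" = image_folder.id
WHERE image_folder."publicationDate" <= now()
ORDER BY image_folder."publicationDate" DESC,
         image_folder.id DESC
LIMIT 40 OFFSET 0;
```

The trade-off is staleness: images uploaded after the last refresh are not visible to the listing query until the next refresh runs.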
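In any case, I would get the execution plan and add it to your question. It will give us a good idea of what PostgreSQL's optimizer is doing. To get the execution plan, put explain in front of your select, for example: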
explain
SELECT DISTINCT
ON (image_folder."publicationDate", image."imageFolderId") image.*
FROM image
...
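Plain explain prints only the planner's estimates. As a sketch of a more detailed variant, the ANALYZE and BUFFERS options make PostgreSQL actually execute the statement and report real row counts, timings and buffer usage. The example below applies it to the unfiltered subquery from the question, which is the part suspected of reading both tables in full:

```sql
-- ANALYZE executes the query, so expect it to take as long as the query itself.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "imageFolderId", min(image_file.filename) AS "firstFileName"
FROM image
INNER JOIN image_file
    ON image_file.id = image."imageFileId"
GROUP BY image."imageFolderId";
```

Comparing the estimated and actual row counts in that output is usually the quickest way to see where the planner's assumptions go wrong.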
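Answer 1: (score: 0)
Try pushing the LIMIT down, rewriting the query to use the row_number() window function to get the lexicographically smallest filename per folder, and simplifying it along the way.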
SELECT x.id,
       x."imageFileId",
       x."imageFolderId"
FROM (SELECT im.id,
             im."imageFileId",
             im."imageFolderId",
             imfo."publicationDate",
             -- rn = 1 marks the lexicographically smallest filename in each folder
             row_number() OVER (PARTITION BY imfo.id
                                ORDER BY imfi.filename ASC) rn
      -- LIMIT pushed down: only the 40 most recently published folders are joined
      FROM (SELECT *
            FROM image_folder
            WHERE "publicationDate" <= now()
            ORDER BY "publicationDate" DESC
            LIMIT 40) imfo
      INNER JOIN image im
          ON im."imageFolderId" = imfo.id
      INNER JOIN image_file imfi
          ON imfi.id = im."imageFileId"
      WHERE imfo."publicationDate" <= now()) x
WHERE x.rn = 1
ORDER BY x."publicationDate" DESC,
         x."imageFolderId" DESC;
Also try indexes ON image ("imageFolderId", "imageFileId") and ON image_folder ("publicationDate" DESC). If you are on version 10 or later, you can additionally try hash indexes ON image_file USING HASH (id) and/or ON image_folder USING HASH (id).
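Written out as full statements, the suggested indexes could look like the sketch below. The index names are hypothetical and only follow the naming style used in the question. Whether each index helps depends on the plan PostgreSQL actually picks, so it is worth re-running explain after creating them:

```sql
-- Composite index: lets the join from folder to image also supply "imageFileId".
CREATE INDEX "IDX_IMAGE_FOLDER_FILE"
    ON image ("imageFolderId", "imageFileId");

-- Descending index matching ORDER BY "publicationDate" DESC ... LIMIT 40.
CREATE INDEX "IDX_IMAGE_FOLDER_PUBLICATION_DATE_DESC"
    ON image_folder ("publicationDate" DESC);

-- Hash indexes for equality lookups on id (PostgreSQL 10 or later).
CREATE INDEX "IDX_IMAGE_FILE_ID_HASH"
    ON image_file USING HASH (id);
CREATE INDEX "IDX_IMAGE_FOLDER_ID_HASH"
    ON image_folder USING HASH (id);

-- Refresh planner statistics so the new indexes are costed with current data.
ANALYZE image;
ANALYZE image_file;
ANALYZE image_folder;
```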