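I wrote a crawler application whose data is persisted to a PostgreSQL 9.4 database. The schema I wrote contains two tables relevant to this question: `sources` and `webpages`.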
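The crawler needs to be able to find the next n `webpages` rows that meet the following conditions:

- the webpage's `next` column value must be lower than `now()`
- the webpage's `locked` column value must be `FALSE`
- the `next` column value of the webpage's source must be lower than `now()`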
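In addition:

- the webpage's `locked` column value is set to `TRUE` (after the HTTP request, the application updates the value of its `next` column)
- the source's `next` column value is set to `now() + frequency`, where `frequency` is just a custom integer variable

This is a concurrent system, so the update must be done under a lock; otherwise some webpages would be returned multiple times by concurrent queries.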
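To make the race concrete, here is a minimal sketch of a naive claim with no locking at all. It is an illustration only, not one of the queries tried below, and the batch size `10` is an arbitrary value. Under the default READ COMMITTED isolation level, two sessions running it at the same moment can both pick the same candidate rows in the subquery, so the same URLs may be handed out twice.

```sql
-- NOT safe under concurrency: nothing stops two sessions from
-- selecting the same candidate rows in the subquery.
UPDATE webpages
SET locked = TRUE
WHERE url IN (
    SELECT url
    FROM webpages
    WHERE next <= now()
      AND locked = FALSE
    LIMIT 10          -- arbitrary batch size, for illustration only
)
RETURNING url;
```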
The problems with my current queries are:

Here are the two queries I tried:

Query n°1:
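    -- `limit` and `frequency` are placeholders filled in by the application.
    WITH ssss AS (
        -- Claim due sources and push their next visit into the future.
        UPDATE sources sss
        SET next = now() + interval '1 milliseconds' * frequency
        FROM (
            SELECT s.domain, s.compression
            FROM (
                SELECT id, domain, compression
                FROM sources
                WHERE next <= now()
                ORDER BY next ASC
                LIMIT limit
                OFFSET 0  -- meant to keep this subquery from being inlined
            ) s
            -- Try to take an advisory lock only on rows that passed the LIMIT.
            WHERE pg_try_advisory_xact_lock(s.id)
        ) ss
        WHERE sss.domain = ss.domain
        RETURNING sss.domain, sss.compression
    )
    -- Mark due, unlocked webpages of the claimed sources as locked.
    UPDATE webpages www
    SET locked = TRUE
    FROM (
        SELECT w.url, ssss.compression
        FROM webpages w
        INNER JOIN ssss
            ON w.domain = ssss.domain
        WHERE w.next <= now()
          AND w.locked = FALSE
        LIMIT limit
    ) ww
    WHERE www.url = ww.url
    RETURNING www.refreshpow, www.url, www.type, www.domain, ww.compression;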
Query n°2:
    -- `limit` and `frequency` are placeholders filled in by the application.
    WITH ssss AS (
        -- Claim due sources and push their next visit into the future.
        UPDATE sources sss
        SET next = now() + interval '1 milliseconds' * frequency
        FROM (
            SELECT s.domain, s.compression
            FROM (
                SELECT id, domain, compression
                FROM sources
                WHERE next <= now()
                ORDER BY next ASC
                LIMIT limit
                OFFSET 0  -- meant to keep this subquery from being inlined
            ) s
            -- Try to take an advisory lock only on rows that passed the LIMIT.
            WHERE pg_try_advisory_xact_lock(s.id)
        ) ss
        WHERE sss.domain = ss.domain
        RETURNING sss.domain, sss.compression
    )
    -- Same as query n°1, but webpages are also guarded by an advisory lock.
    UPDATE webpages wwww
    SET locked = TRUE
    FROM (
        SELECT ww.id, ww.url, ww.compression
        FROM (
            SELECT w.id, w.url, ssss.compression
            FROM webpages w
            INNER JOIN ssss
                ON w.domain = ssss.domain
            WHERE w.next <= now()
              AND w.locked = FALSE
            LIMIT limit
            OFFSET 0
        ) ww
        WHERE pg_try_advisory_xact_lock(ww.id)
    ) www
    WHERE wwww.url = www.url
    RETURNING wwww.refreshpow, wwww.url, wwww.type, wwww.domain, www.compression;
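Both queries lean on `pg_try_advisory_xact_lock`, so for reference here is a minimal sketch of how that function behaves across two sessions. The key `42` and the class numbers `1` and `2` are arbitrary values picked for illustration. The two-argument form is shown only as one possible way to keep `sources.id` and `webpages.id` from sharing a single key space; it is an assumption about what might help, not something the queries above do.

```sql
-- Session A
BEGIN;
SELECT pg_try_advisory_xact_lock(42);   -- returns true: A now holds key 42

-- Session B, while A's transaction is still open
BEGIN;
SELECT pg_try_advisory_xact_lock(42);   -- returns false immediately instead of waiting
ROLLBACK;

-- Session A
COMMIT;                                 -- the transaction-level lock is released here

-- Two-key form (int, int): a separate key space from the single bigint form.
BEGIN;
SELECT pg_try_advisory_xact_lock(1, 42);  -- hypothetical class 1 = sources, id 42
SELECT pg_try_advisory_xact_lock(2, 42);  -- hypothetical class 2 = webpages, id 42
COMMIT;
```

With the single-argument form, a source with `id = 42` and a webpage with `id = 42` map to the same key, so a lock held by one session for the source would make another session skip that webpage.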
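Below are some of the issues I ran into while developing the SQL query, so you can see how I ended up with it:

- There is no `UPDATE ... LIMIT 1`, which means you have to use a subquery or a CTE (see: UPDATE ... LIMIT 1 answer)
- The rows matched by the `SELECT` need to be locked with `FOR UPDATE` or `pg_try_advisory_xact_lock` (see: UPDATE ... LIMIT 1 answer)
- In a `(a = b) AND (c = d)` statement, `(c = d)` can be evaluated before `(a = b)`. So if you only want to lock the rows matched up to `LIMIT limit`, you need to put the `SELECT` in another subquery/CTE nested inside the enclosing subquery/CTE `SELECT`. (see: UPDATE ... LIMIT 1 answer)
- `OFFSET 0` is used to prevent inlining

Here are my table schemas: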
`sources` table schema:
    CREATE TABLE sources
    (
        id serial NOT NULL,
        domain text NOT NULL,
        created timestamp with time zone,
        topic character varying(255),
        last timestamp with time zone,
        next timestamp with time zone NOT NULL,
        compression boolean DEFAULT true,
        CONSTRAINT sources_pkey PRIMARY KEY (domain)
    );

    CREATE INDEX next_index
        ON sources
        USING btree
        (next);

    CREATE UNIQUE INDEX sources_domain
        ON sources
        USING btree
        (domain COLLATE pg_catalog."default");

    CREATE UNIQUE INDEX sources_id
        ON sources
        USING btree
        (id);
`webpages` table schema:
    CREATE TABLE webpages
    (
        id serial NOT NULL,
        url text NOT NULL,
        created timestamp with time zone,
        locked boolean DEFAULT false,
        type enum_webpages_type,
        last timestamp with time zone,
        next timestamp with time zone NOT NULL,
        refreshpow integer NOT NULL DEFAULT 2,
        locale character varying(255),
        title text,
        image text,
        date timestamp with time zone,
        tags text[],
        authors text[],
        summary text,
        html text,
        domain text,
        parent text,
        error uuid,
        CONSTRAINT webpages_pkey PRIMARY KEY (url),
        CONSTRAINT webpages_domain_fkey FOREIGN KEY (domain)
            REFERENCES sources (domain) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE CASCADE,
        CONSTRAINT webpages_error_fkey FOREIGN KEY (error)
            REFERENCES errors (id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION,
        CONSTRAINT webpages_parent_fkey FOREIGN KEY (parent)
            REFERENCES webpages (url) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION
    );

    CREATE INDEX webpages_domain
        ON webpages
        USING btree
        (domain COLLATE pg_catalog."default");

    CREATE INDEX webpages_error
        ON webpages
        USING btree
        (error);

    CREATE UNIQUE INDEX webpages_id
        ON webpages
        USING btree
        (id);

    CREATE INDEX webpages_last
        ON webpages
        USING btree
        (last);

    CREATE INDEX webpages_next_locked
        ON webpages
        USING btree
        (next, locked);

    CREATE UNIQUE INDEX webpages_url
        ON webpages
        USING btree
        (url COLLATE pg_catalog."default");
This is all based on my limited understanding of SQL and PostgreSQL, so I would be glad to get any help.