PostgreSQL query problem in a concurrent web crawler

Time: 2015-11-01 22:12:06

Tags: sql performance postgresql concurrency web-crawler

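I wrote a crawler application whose data is persisted to a PostgreSQL 9.4 database. The schema I wrote contains two tables that are relevant to the question:

  • sources, a list of web domains
  • webpages, a list of web pages, each linked to its source

The crawler needs to be able to find the next n webpages rows that satisfy the following conditions:

  • the webpage's next column value must be lower than now()
  • the webpage's locked column value must be FALSE
  • the next column value of the webpage's source must be lower than now()

In addition:

  • every matched webpages row must be updated so that its locked column is set to TRUE (the application updates the next column after the HTTP request)
  • every sources row linked to a matched webpages row must be updated so that its next column is set to now() + frequency, where frequency is just a custom integer variable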
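
To make the selection criteria above concrete, here is a minimal sketch of them as a plain SELECT, with no locking and no updates yet. The batch size of 10 and the ORDER BY are illustrative assumptions, and <= is used to match the queries further down.

```sql
-- Selection criteria only: this does NOT lock or update anything,
-- so two concurrent callers could still receive the same rows.
SELECT w.url, w.domain
FROM webpages w
INNER JOIN sources s ON s.domain = w.domain
WHERE w.next <= now()        -- the page is due
  AND w.locked = FALSE       -- the page is not already being crawled
  AND s.next <= now()        -- the page's source is due as well
ORDER BY w.next ASC          -- assumption: "next n" means oldest due first
LIMIT 10;                    -- assumption: n = 10
```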

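This is a concurrent system, so the updates must be done under locks; otherwise some webpages will be returned more than once by concurrent queries.

The problems with my current queries are:

  • the queries are relatively slow (several hundred milliseconds on 130k sources and 21M webpages)
  • duplicate webpages are returned by concurrent queries
  • I am also looking for general improvements

Here are the two queries I tried:

Query n°1: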

-- limit and frequency are placeholders for values supplied by the application.
WITH ssss AS (
    UPDATE sources sss
    SET
        next = now() + interval '1 milliseconds' * frequency
    FROM (
        SELECT s.domain, s.compression
        FROM (
            SELECT id, domain, compression
            FROM sources
            WHERE
                next <= now()
            ORDER BY next ASC
            LIMIT limit
            OFFSET 0
        ) s
        WHERE
            pg_try_advisory_xact_lock(s.id)
    ) ss
    WHERE sss.domain = ss.domain
    RETURNING sss.domain, sss.compression
)
UPDATE webpages www
SET
    locked = TRUE
FROM (
    SELECT w.url, ssss.compression
    FROM webpages w
    INNER JOIN ssss
    ON w.domain = ssss.domain
    WHERE
        next <= now()
        AND
        locked = FALSE
    LIMIT limit
) ww
WHERE www.url = ww.url
RETURNING www.refreshpow, www.url, www.type, www.domain, ww.compression;

Query n°2:

-- limit and frequency are placeholders for values supplied by the application.
WITH ssss AS (
    UPDATE sources sss
    SET
        next = now() + interval '1 milliseconds' * frequency
    FROM (
        SELECT s.domain, s.compression
        FROM (
            SELECT id, domain, compression
            FROM sources
            WHERE
                next <= now()
            ORDER BY next ASC
            LIMIT limit
            OFFSET 0
        ) s
        WHERE
            pg_try_advisory_xact_lock(s.id)
    ) ss
    WHERE sss.domain = ss.domain
    RETURNING sss.domain, sss.compression
)
UPDATE webpages wwww
SET
    locked = TRUE
FROM (
    SELECT ww.id, ww.url, ww.compression
    FROM (
        SELECT w.id, w.url, ssss.compression
        FROM webpages w
        INNER JOIN ssss
        ON w.domain = ssss.domain
        WHERE
            next <= now()
            AND
            locked = FALSE
        LIMIT limit
        OFFSET 0
    ) ww
    WHERE
        pg_try_advisory_xact_lock(ww.id)
) www
WHERE wwww.url = www.url
RETURNING wwww.refreshpow, wwww.url, wwww.type, wwww.domain, www.compression;

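Listed below are some of the issues I ran into while developing these SQL queries, so that you can see how I ended up with them:

  • PostgreSQL has no UPDATE ... LIMIT 1, which means you have to use a subquery or a CTE (see: UPDATE ... LIMIT 1 answer)
  • Because this subquery/CTE introduces a race condition, the rows matched by the inner SELECT need to be locked with FOR UPDATE / pg_try_advisory_xact_lock (see: UPDATE ... LIMIT 1 answer)
  • PostgreSQL's optimizer does not respect the order in which conditions are written. For example, in (a = b) AND (c = d), (c = d) may be evaluated before (a = b). So if you only want to lock the matched rows up to LIMIT limit, you need to put the SELECT in another subquery/CTE nested inside the subquery/CTE SELECT (see: UPDATE ... LIMIT 1 answer)
  • OFFSET 0 is used to prevent the subquery from being inlined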
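
For comparison with the advisory-lock approach described in the list above, here is a hedged sketch of row-level locking with FOR UPDATE SKIP LOCKED. It assumes PostgreSQL 9.5 or later, since SKIP LOCKED does not exist in the 9.4 release used here. The CTE name picked and the batch size of 10 are illustrative, and the sketch covers only the webpages side; the sources.next condition and update are left out.

```sql
-- Assumes PostgreSQL 9.5+; SKIP LOCKED is not available in 9.4.
WITH picked AS (
    SELECT w.url
    FROM webpages w
    WHERE w.next <= now()
      AND w.locked = FALSE
    ORDER BY w.next ASC
    LIMIT 10                 -- assumption: batch size of 10
    FOR UPDATE SKIP LOCKED   -- rows locked by another transaction are skipped, not waited on
)
UPDATE webpages w
SET locked = TRUE
FROM picked
WHERE w.url = picked.url
RETURNING w.refreshpow, w.url, w.type, w.domain;
```

Because the row locks are taken by the inner SELECT itself, this form needs neither the extra nesting level nor the OFFSET 0 fence.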

Here are my table schemas:

sources table schema:

CREATE TABLE sources
(
  id serial NOT NULL,
  domain text NOT NULL,
  created timestamp with time zone,
  topic character varying(255),
  last timestamp with time zone,
  next timestamp with time zone NOT NULL,
  compression boolean DEFAULT true,
  CONSTRAINT sources_pkey PRIMARY KEY (domain)
);

CREATE INDEX next_index
  ON sources
  USING btree
  (next);

CREATE UNIQUE INDEX sources_domain
  ON sources
  USING btree
  (domain COLLATE pg_catalog."default");

CREATE UNIQUE INDEX sources_id
  ON sources
  USING btree
  (id);

webpages table schema:

CREATE TABLE webpages
(
  id serial NOT NULL,
  url text NOT NULL,
  created timestamp with time zone,
  locked boolean DEFAULT false,
  type enum_webpages_type,
  last timestamp with time zone,
  next timestamp with time zone NOT NULL,
  refreshpow integer NOT NULL DEFAULT 2,
  locale character varying(255),
  title text,
  image text,
  date timestamp with time zone,
  tags text[],
  authors text[],
  summary text,
  html text,
  domain text,
  parent text,
  error uuid,
  CONSTRAINT webpages_pkey PRIMARY KEY (url),
  CONSTRAINT webpages_domain_fkey FOREIGN KEY (domain)
      REFERENCES sources (domain) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE,
  CONSTRAINT webpages_error_fkey FOREIGN KEY (error)
      REFERENCES errors (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT webpages_parent_fkey FOREIGN KEY (parent)
      REFERENCES webpages (url) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE INDEX webpages_domain
  ON webpages
  USING btree
  (domain COLLATE pg_catalog."default");

CREATE INDEX webpages_error
  ON webpages
  USING btree
  (error);

CREATE UNIQUE INDEX webpages_id
  ON webpages
  USING btree
  (id);

CREATE INDEX webpages_last
  ON webpages
  USING btree
  (last);

CREATE INDEX webpages_next_locked
  ON webpages
  USING btree
  (next, locked);

CREATE UNIQUE INDEX webpages_url
  ON webpages
  USING btree
  (url COLLATE pg_catalog."default");
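
One index variation that might be worth testing against the slow lookups, as a sketch only: a partial index that contains just the unlocked rows, so that a search on next <= now() AND locked = FALSE never has to visit entries for locked pages. The index name webpages_next_unlocked is hypothetical, and whether this helps on this data set is untested.

```sql
-- Hypothetical partial index: only rows with locked = FALSE are indexed.
-- The planner can use it when the query's WHERE clause includes locked = FALSE.
CREATE INDEX webpages_next_unlocked
  ON webpages
  USING btree
  (next)
  WHERE locked = FALSE;
```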

This is all based on my limited understanding of SQL and PostgreSQL, so I would be glad to get any help.
