Question

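When extracting links from a web page, I want to get all links except those inside the header. To avoid extracting links from the header, I created an XPath that matches everything except the header or footer: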

*[not(ancestor-or-self::*[contains(@id,"header") or contains(@class,"header") or header or contains(@id,"footer") or contains(@class,"footer") or footer])]

The problem is that it does not work as expected when I pass it as the restrict_xpaths parameter: I get more links than I do without specifying restrict_xpaths at all.

LinkExtractor(allow_domains=netloc).extract_links(response)

returns 120 links.

LinkExtractor(allow_domains=netloc, restrict_xpaths=["""//*[not(ancestor-or-self::*[contains(@id,"header") or contains(@class,"header") or header or contains(@id,"footer") or contains(@class,"footer") or footer])]"""]).extract_links(response)

returns 199 links, including the links inside the header (<div id="header"> .... </div>)
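One possible explanation for the higher count can be sketched without Scrapy. The XPath also matches outer elements such as html and body, because neither they nor their ancestors look like a header or footer. If each matched element is then scanned together with all of its descendants (an assumption about how restrict_xpaths regions are treated, not something verified here), the header links are reached through body, and the same link is seen once per enclosing region. The sketch below re-implements the XPath predicate in plain Python over xml.etree.ElementTree, since that module's XPath subset has no ancestor axis. The page, is_chrome and matched are hypothetical names invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, made up for illustration; not the real page from the question.
PAGE = """
<html><body>
  <div id="header"><a href="/home">Home</a></div>
  <div id="content"><a href="/article">Article</a></div>
  <div id="footer"><a href="/about">About</a></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent = {child: el for el in root.iter() for child in el}


def is_chrome(el):
    """Inner test of the XPath: id/class mentions header or footer,
    or the element has a <header>/<footer> child."""
    marks = el.get("id", "") + " " + el.get("class", "")
    return ("header" in marks or "footer" in marks
            or el.find("header") is not None
            or el.find("footer") is not None)


def matched(el):
    """Equivalent of not(ancestor-or-self::*[...]) for one element."""
    while el is not None:
        if is_chrome(el):
            return False
        el = parent.get(el)
    return True


# Elements the XPath would select.
regions = [el for el in root.iter() if matched(el)]
region_tags = [el.tag for el in regions]  # ['html', 'body', 'div', 'a']

# Assumed behaviour: every selected element is scanned with its whole subtree.
hrefs_from_regions = [a.get("href") for el in regions for a in el.iter("a")]

# Alternative: apply the same test to the <a> elements themselves.
hrefs_from_anchors = [a.get("href") for a in root.iter("a") if matched(a)]

print(region_tags)
print(len(hrefs_from_regions), "/home" in hrefs_from_regions)  # 8 True
print(hrefs_from_anchors)  # ['/article']
```

On this toy page the three links turn into eight hits, header link included, while testing the a elements directly leaves only /article. Whether an XPath of the form //a[not(ancestor::*[...])] behaves the same way when passed to restrict_xpaths would need to be checked against the installed Scrapy version.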

Scrapy LinkExtractor - restrict_xpaths - exclude tags

0 answers: