Question

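I'm trying to crawl a website with Scrapy, but the site has no sitemap or page index. How can I crawl every page of the site with Scrapy?

I only need to download all of the site's pages, without extracting any items. Is it enough to set a spider Rule that follows all links? I'm not sure whether Scrapy will avoid fetching duplicate URLs when set up this way.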

Answer 1

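I found the answer myself. With the CrawlSpider class, we only need to set allow=() in the SgmlLinkExtractor. As the documentation says:

allow (a regular expression (or list of)) - a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.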

Answer 2

In your Spider, define allowed_domains as a list of the domains you want to crawl.

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']

Then you can use response.follow() to follow the links. See the docs for Spiders and the tutorial.

Alternatively, you can filter the domains with a LinkExtractor (as David Thompson mentioned).