I'm trying to crawl a website and check that none of its links redirect to pages that are down. Since no sitemap is available, I'm using Scrapy to crawl the site and collect every link on each page, but I can't manage to output a file that lists all the links found together with their status codes. The site I'm using to test the code is quotes.toscrape.com, and my code is:
from scrapy.spiders import Spider
from scrapy.http import Request
from mytest.items import MytestItem
import re

class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        # We store already crawled links in this list
        crawledLinks = []
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            if link not in crawledLinks:
                link = "http://quotes.toscrape.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)
I tried adding the following lines after the yield:
item = MytestItem()
item['url'] = link
item['status'] = response.status
yield item
but it gives me a bunch of duplicates and no URLs with a 404 or 301 status. Does anyone know how I can get the URLs for every status?
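For what it's worth, yielding one item per fetched page (rather than one per extracted link) avoids the duplicates, although by itself it still only sees pages Scrapy treats as successful; the answers below cover how to let the other statuses through. A minimal sketch, assuming the MytestItem fields url and status from above (the rest of the structure is illustrative, not the original spider):

from scrapy.spiders import Spider
from scrapy.http import Request
from mytest.items import MytestItem


class MySpider(Spider):
    name = "sample"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]
    crawledLinks = set()  # shared set so each link is requested only once

    def parse(self, response):
        # one item per fetched page, so each URL/status pair is reported once
        item = MytestItem()
        item['url'] = response.url
        item['status'] = response.status
        yield item

        for href in response.xpath('//a/@href').extract():
            link = response.urljoin(href)
            if link not in self.crawledLinks:
                self.crawledLinks.add(link)
                yield Request(link, self.parse)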
Answer 0 (score: 2)
By default, Scrapy does not return any unsuccessful requests, but if you set an errback on the request you can catch them in one of your own functions and handle them there.
def parse(self, response):
    # some code
    yield Request(link, self.parse, errback=self.parse_error)

def parse_error(self, failure):
    # log the response as an error
The failure parameter will contain more information on the exact reason for the failure, since it could be an HTTP error (where you can fetch the response) but also, for example, a DNS lookup error (where there is no response).
The documentation contains an example of how to use the failure to determine the error reason and to access the Response if it is available:
# imports needed for this example:
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    # in case you want to do something special for some errors,
    # you may need the failure's type:

    if failure.check(HttpError):
        # these exceptions come from HttpError spider middleware
        # you can get the non-200 response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
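Tying this back to the question, here is a hedged sketch that yields one MytestItem per successfully fetched page from parse and one per failure from the errback (the field names url and status come from the question; the spider itself is illustrative, not the original code). In recent Scrapy versions, output yielded from an errback is processed like callback output, so both paths end up in the feed export; if that does not hold in your version, the errback can simply log instead. Failures without an HTTP response (DNS errors, timeouts) are recorded with a status of None:

from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.spidermiddlewares.httperror import HttpError
from mytest.items import MytestItem


class LinkCheckSpider(Spider):
    name = "linkcheck"  # hypothetical name
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]
    seen = set()

    def make_item(self, url, status):
        item = MytestItem()
        item['url'] = url
        item['status'] = status
        return item

    def parse(self, response):
        # successful (2xx) pages arrive here
        yield self.make_item(response.url, response.status)
        for href in response.xpath('//a/@href').extract():
            url = response.urljoin(href)
            if url not in self.seen:
                self.seen.add(url)
                yield Request(url, self.parse, errback=self.on_error)

    def on_error(self, failure):
        if failure.check(HttpError):
            # a non-2xx response was received: record its real status code
            response = failure.value.response
            yield self.make_item(response.url, response.status)
        else:
            # DNS errors, timeouts, etc. never produced a response
            yield self.make_item(failure.request.url, None)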
Answer 1 (score: 2)
You should either use HTTPERROR_ALLOW_ALL in your settings or set the meta key handle_httpstatus_all = True on all of your requests; see the documentation for details.
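As an illustration of that approach, a minimal sketch using the per-spider custom_settings attribute (the item fields follow the question; the spider name and the rest of the structure are assumptions). Note that 3xx responses are normally consumed by the redirect middleware, so to record 301s you would additionally disable redirects (e.g. REDIRECT_ENABLED = False) or handle them explicitly:

from scrapy.spiders import Spider
from scrapy.http import Request
from mytest.items import MytestItem


class StatusSpider(Spider):
    name = "status"  # hypothetical name
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["http://quotes.toscrape.com"]

    # per-spider equivalent of putting HTTPERROR_ALLOW_ALL = True in settings.py;
    # alternatively pass meta={'handle_httpstatus_all': True} on each Request
    custom_settings = {"HTTPERROR_ALLOW_ALL": True}

    seen = set()

    def parse(self, response):
        item = MytestItem()
        item['url'] = response.url
        item['status'] = response.status  # 404s etc. now reach parse()
        yield item

        # only follow links from pages that actually loaded
        if response.status == 200:
            for href in response.xpath('//a/@href').extract():
                url = response.urljoin(href)
                if url not in self.seen:
                    self.seen.add(url)
                    yield Request(url, self.parse)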