Question

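I am trying to crawl a website for broken links. So far I have code that successfully logs in and crawls the site, but it only records responses with HTTP status 200: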

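import requests
import scrapy
from lxml import html
from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor

# HttpResponseItem is the project's own Item class (its definition is not shown here)


class HttpStatusSpider(scrapy.Spider):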
    name = 'httpstatus'
    handle_httpstatus_all = True

    link_extractor = LinkExtractor()

    def start_requests(self):
        """This method ensures we login before we begin spidering"""
        # Little bit of magic to handle the CSRF protection on the login form
        resp = requests.get('http://localhost:8000/login/')
        tree = html.fromstring(resp.content)
        csrf_token = tree.cssselect('input[name=csrfmiddlewaretoken]')[0].value

        return [FormRequest('http://localhost:8000/login/', callback=self.parse,
                            formdata={'username': 'mischa_cs',
                                      'password': 'letmein',
                                      'csrfmiddlewaretoken': csrf_token},
                            cookies={'csrftoken': resp.cookies['csrftoken']})]

    def parse(self, response):
        item = HttpResponseItem()
        item['url'] = response.url
        item['status'] = response.status
        item['referer'] = response.request.headers.get('Referer', '')
        yield item

        for link in self.link_extractor.extract_links(response):
            r = Request(link.url, self.parse)
            r.meta.update(link_text=link.text)
            yield r

The docs and these answers led me to believe that handle_httpstatus_all = True should cause scrapy to pass errored requests to my parse method, but so far I have not been able to capture any.

I have also tried handle_httpstatus_list and a custom errback handler in different iterations of the code.

What do I need to change to capture the HTTP error codes scrapy encounters?

Answer 1

handle_httpstatus_list can be defined at the spider level, but handle_httpstatus_all can only be defined at the Request level, by including it in the request's meta argument.

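I would still recommend using an errback for these cases, but if everything is under control, setting it in meta should not create new problems.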

Answer 2

So, I don't know if this is the proper scrapy way, but it does allow me to handle all HTTP status codes (including 5xx).

I disabled the HttpErrorMiddleware by adding the following snippet to my project's settings.py:

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': None
}
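As a possible alternative to removing the middleware entirely, Scrapy also has an HTTPERROR_ALLOW_ALL setting. The sketch below assumes it goes in the same settings.py; it leaves HttpErrorMiddleware enabled but tells it to pass every response through regardless of status code.

```python
# settings.py (alternative sketch): keep HttpErrorMiddleware enabled,
# but let responses with any status code reach the spider callbacks.
HTTPERROR_ALLOW_ALL = True
```

To let through only specific codes, the related HTTPERROR_ALLOWED_CODES setting takes a list of status codes instead.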

Capturing HTTP errors with scrapy
