I have a functional scraper that scrapes a large number of websites listed in a database and writes the results back to the same database. I take the domain from the database and manually prepend https://www. to build the URL. Even when this URL is not the canonical one, the vast majority of sites redirect the spider correctly, but for a few sites I get a DNSLookupError because that hostname does not resolve, even though the site clearly exists and is accessible from a browser.
My question is: is there a way to retry a scrape that gets a DNSLookupError, but with a different URL? I currently handle errors in an errback, where I insert the relevant information into the database depending on the kind of error I get. Is there a way to schedule a new scrape from the result of a failed one?
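For context, the start URLs are built roughly like this (a simplified sketch; the sqlite3 usage and the table/column names are made up, the real project uses its own database):

import sqlite3

conn = sqlite3.connect("sites.db")
start_urls = []
for (domain,) in conn.execute("SELECT domain FROM sites"):
    # Prepending the scheme and "www." works for most sites because they
    # redirect, but raises DNSLookupError when the "www." host does not resolve.
    start_urls.append("https://www." + domain)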
Answer (score: 0)
When you yield a Request to some URL, besides callback you can also set errback, where you can catch such cases. The official documentation has a pretty good example of its usage: http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks
import scrapy
from twisted.internet.error import DNSLookupError

# methods of your spider class
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.parse_httpbin,
                             errback=self.errback_httpbin,
                             dont_filter=True)

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    if failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)
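To address the "retry with a different URL" part of the question: in reasonably recent Scrapy versions, anything you return or yield from an errback is processed like callback output, so you can yield a brand-new Request with a fallback URL right there. A minimal sketch extending the errback above (the "strip www." fallback rule and the errback_final method name are assumptions, adapt them to your spider):

import scrapy
from twisted.internet.error import DNSLookupError

def errback_httpbin(self, failure):
    self.logger.error(repr(failure))

    if failure.check(DNSLookupError):
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

        # Hypothetical fallback: retry the same domain without the "www." prefix.
        # Requests yielded from an errback are scheduled like callback output.
        if '://www.' in request.url:
            fallback = request.url.replace('://www.', '://', 1)
            yield scrapy.Request(fallback,
                                 callback=self.parse_httpbin,
                                 errback=self.errback_final,  # hypothetical second errback
                                 dont_filter=True)

def errback_final(self, failure):
    # Both attempts failed: record the error in the database as before.
    self.logger.error('Fallback also failed: %s', repr(failure))

dont_filter=True keeps the dupefilter from dropping the retry, and giving the fallback request its own errback lets you write the error to the database only after both attempts have failed.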
Also check whether RetryMiddleware fits your goals. See the official Scrapy docs here: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry
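For completeness: RetryMiddleware is enabled by default and, as far as I know, already retries DNSLookupError, but it always retries the same URL, so it only helps with transient DNS hiccups, not with a hostname that simply does not exist. The relevant knobs live in settings.py (illustrative values, not defaults you must use):

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3          # retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

For the wrong-hostname case described in the question, the errback approach above is probably the better fit.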