I have a functional scraper that scrapes a large number of websites listed in a database and writes the results back to the same database. I take the domain from the database and manually prepend https://www. to build the URL. Even when this URL is not the canonical one, the vast majority of sites redirect the spider correctly, but for a few sites I get a DNSLookupError because that hostname does not resolve, even though the site clearly exists and is accessible from a browser.
My question is: is there a way to retry a scrape that gets a DNSLookupError, but with a different URL? I currently handle errors in an errback, where I insert the relevant information into the database depending on the kind of error I get. Is there a way to schedule a new scrape from the result of a failed one?
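For context, the start URLs are built roughly like this (a simplified sketch; the sqlite3 usage and the table/column names are made up, the real project uses its own database):

import sqlite3

conn = sqlite3.connect("sites.db")
start_urls = []
for (domain,) in conn.execute("SELECT domain FROM sites"):
    # Prepending the scheme and "www." works for most sites because they
    # redirect, but raises DNSLookupError when the "www." host does not resolve.
    start_urls.append("https://www." + domain)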
Answer (score: 0)
When you yield a Request to some URL, besides callback you can also set errback, where you can catch such cases. The official documentation has a pretty good example of its usage: http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks
import scrapy
from twisted.internet.error import DNSLookupError

# methods of your spider class
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.parse_httpbin,
                             errback=self.errback_httpbin,
                             dont_filter=True)

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))

    if failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)
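To address the "retry with a different URL" part of the question: in reasonably recent Scrapy versions, anything you return or yield from an errback is processed like callback output, so you can yield a brand-new Request with a fallback URL right there. A minimal sketch extending the errback above (the "strip www." fallback rule and the errback_final method name are assumptions, adapt them to your spider):

import scrapy
from twisted.internet.error import DNSLookupError

def errback_httpbin(self, failure):
    self.logger.error(repr(failure))

    if failure.check(DNSLookupError):
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

        # Hypothetical fallback: retry the same domain without the "www." prefix.
        # Requests yielded from an errback are scheduled like callback output.
        if '://www.' in request.url:
            fallback = request.url.replace('://www.', '://', 1)
            yield scrapy.Request(fallback,
                                 callback=self.parse_httpbin,
                                 errback=self.errback_final,  # hypothetical second errback
                                 dont_filter=True)

def errback_final(self, failure):
    # Both attempts failed: record the error in the database as before.
    self.logger.error('Fallback also failed: %s', repr(failure))

dont_filter=True keeps the dupefilter from dropping the retry, and giving the fallback request its own errback lets you write the error to the database only after both attempts have failed.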
Also check whether RetryMiddleware fits your goals. See the official Scrapy docs here: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry
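For completeness: RetryMiddleware is enabled by default and, as far as I know, already retries DNSLookupError, but it always retries the same URL, so it only helps with transient DNS hiccups, not with a hostname that simply does not exist. The relevant knobs live in settings.py (illustrative values, not defaults you must use):

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3          # retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

For the wrong-hostname case described in the question, the errback approach above is probably the better fit.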