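I am trying to use Scrapy to find all the links on a website whose DNS lookup fails.
The problem is that every site that loads without an error gets printed in the parse_obj method, but when a URL fails with a DNS lookup error, the parse_obj callback is never called.
I want to collect every domain that fails with the "DNS lookup failed" error. How can I do that?
Log: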
2016-03-08 12:55:12 [scrapy] INFO: Spider opened
2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 12:55:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 12:55:12 [scrapy] DEBUG: Crawled (200) <GET http://domain.com> (referer: None)
2016-03-08 12:55:12 [scrapy] DEBUG: Retrying <GET http://expired-domain.com/> (failed 1 times): DNS lookup failed: address 'expired-domain.com' not found: [Errno 11001] getaddrinfo failed.
Code:
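from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

from scrapy.item import Item, Field
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyItem(Item):
    url = Field()


class someSpider(CrawlSpider):
    name = 'Crawler'
    start_urls = ['http://domain.com']
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        for link in LxmlLinkExtractor(allow=()).extract_links(response):
            parsed_uri = urlparse(link.url)
            # keep only the scheme and host of each extracted link
            url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            print(url)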
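Answer 0 (score: 4)
CrawlSpider rules do not let you pass an errback (which is a shame), at least not in the Scrapy 1.0.5 release used here.
Here is a variation of another answer I gave for catching DNS errors: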
# -*- coding: utf-8 -*-
import random
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class HttpbinSpider(CrawlSpider):
    name = "httpbin"

    # this will generate test links so that we can see CrawlSpider in action
    start_urls = (
        'https://httpbin.org/links/10/0',
    )

    rules = (
        Rule(LinkExtractor(),
             callback='parse_page',
             # hook to be called when this Rule generates a Request
             process_request='add_errback'),
    )

    # this is just to not retry errors for this example spider
    custom_settings = {
        'RETRY_ENABLED': False
    }

    # method to be called for each Request generated by the Rules above,
    # here, adding an errback to catch all sorts of errors
    def add_errback(self, request):
        self.logger.debug("add_errback: patching %r" % request)

        # this is a hack to trigger a DNS error randomly
        rn = random.randint(0, 2)
        if rn == 1:
            newurl = request.url.replace('httpbin.org', 'httpbin.organisation')
            self.logger.debug("add_errback: patching url to %s" % newurl)
            return request.replace(url=newurl,
                                   errback=self.errback_httpbin)

        # this is the general case: adding errback to all requests
        return request.replace(errback=self.errback_httpbin)

    def parse_page(self, response):
        self.logger.info("parse_page: %r" % response)

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
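
The errback above only logs each failure. To get the list of failing domains that the question asks for, the DNSLookupError branch could hand the request URL to a small collector. The following is a minimal sketch of that idea using only the standard library; base_url, FailedDomainLog, and the self.failed attribute are hypothetical names that are not part of Scrapy or of the spider above, and the wiring into the spider is shown only as a comment.

```python
from urllib.parse import urlparse


def base_url(url):
    """Reduce a URL to 'scheme://host/', the same way parse_obj in the question does."""
    parsed = urlparse(url)
    return '{uri.scheme}://{uri.netloc}/'.format(uri=parsed)


class FailedDomainLog(object):
    """Hypothetical helper: remembers each failing domain once, in the order seen."""

    def __init__(self):
        self._seen = set()
        self.domains = []

    def record(self, url):
        domain = base_url(url)
        if domain not in self._seen:
            self._seen.add(domain)
            self.domains.append(domain)
        return domain


# Inside the spider this might be wired up roughly as follows (an assumption,
# not exercised here):
#     elif failure.check(DNSLookupError):
#         self.failed.record(failure.request.url)
failed = FailedDomainLog()
failed.record('https://httpbin.organisation/links/10/5')
failed.record('https://httpbin.organisation/links/10/9')
failed.record('http://expired-domain.com/')
print(failed.domains)
```

Two failing links on the same host collapse into a single entry, so the result is a list of domains rather than a list of URLs.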
Running the HttpbinSpider spider above prints the following on the console:
$ scrapy crawl httpbin
2016-03-08 15:16:30 [scrapy] INFO: Scrapy 1.0.5 started (bot: httpbinlinks)
2016-03-08 15:16:30 [scrapy] INFO: Optional features available: ssl, http11
2016-03-08 15:16:30 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'httpbinlinks.spiders', 'SPIDER_MODULES': ['httpbinlinks.spiders'], 'BOT_NAME': 'httpbinlinks'}
2016-03-08 15:16:30 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-08 15:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-08 15:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-08 15:16:30 [scrapy] INFO: Enabled item pipelines:
2016-03-08 15:16:30 [scrapy] INFO: Spider opened
2016-03-08 15:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 15:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 15:16:30 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/0> (referer: None)
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/1>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/2>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/3>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/4>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/5>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/6>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/7>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/8>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/9>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/8> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/8>
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/7> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/6> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/3> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/4> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/1> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/2> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/7>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/6>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/3>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/4>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/1>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/2>
2016-03-08 15:16:31 [scrapy] INFO: Closing spider (finished)
2016-03-08 15:16:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/request_bytes': 2577,
'downloader/request_count': 10,
'downloader/request_method_count/GET': 10,
'downloader/response_bytes': 3968,
'downloader/response_count': 8,
'downloader/response_status_count/200': 8,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 8, 14, 16, 31, 761515),
'log_count/DEBUG': 20,
'log_count/ERROR': 4,
'log_count/INFO': 14,
'request_depth_max': 1,
'response_received_count': 8,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2016, 3, 8, 14, 16, 30, 427657)}
2016-03-08 15:16:31 [scrapy] INFO: Spider closed (finished)
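
The stats dump above already counts the DNS failures under the key 'downloader/exception_type_count/twisted.internet.error.DNSLookupError', which gives a cheap cross-check that the errback saw every failure. Below is a small sketch that pulls the per-exception counts out of such a dict. The exception_counts helper is a hypothetical name, and the dict is a hand-copied subset of the dump above; in a running spider the same dict should be obtainable from self.crawler.stats.get_stats().

```python
PREFIX = 'downloader/exception_type_count/'


def exception_counts(stats):
    """Map exception class path -> count, taken from a Scrapy-style stats dict."""
    return {key[len(PREFIX):]: value
            for key, value in stats.items()
            if key.startswith(PREFIX)}


# A few entries copied by hand from the stats dump above.
stats = {
    'downloader/exception_count': 2,
    'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
    'downloader/request_count': 10,
    'downloader/response_count': 8,
}

counts = exception_counts(stats)
dns_failures = counts.get('twisted.internet.error.DNSLookupError', 0)
print(dns_failures)  # 2 for the stats dump above

# In this run every request either got a response or raised a downloader exception.
balanced = stats['downloader/request_count'] == (
    stats['downloader/response_count'] + stats['downloader/exception_count'])
print(balanced)
```

Here the count of 2 matches the two "DNSLookupError on ..." lines logged by the errback, so no failure went unreported.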