I want to stop the spider if a page throws up a CAPTCHA. To do that, I wrote the following code in my custom downloader middleware:
def process_response(self, request, response, spider):
    title = response.xpath('//title/text()').extract_first()
    if title == 'Robot Check':
        print('CAPTCHA THROWWWEEED')
        raise CloseSpider('Closing Spider...')
    else:
        return response
My spider (I removed some parts):
class GoToPageSpider(scrapy.Spider):
    name = 'gotopage'
    allowed_domains = ["www.amazon.com"]
    start_urls = ['https://www.amazon.com']

    def __init__(self):
        ....

    def parse(self, response):
        yield Request(response.url, callback=self.start_crawling)

    def start_crawling(self, response):
        for isbn in self.queue_results:
            link = self.link + isbn
            yield Request(link, callback=self.parse_book, meta={'ISBN':isbn})

    def parse_book(self, response):
        isbn = response.meta['ISBN']
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        if title:
            title = title.split("(")[0]
        rank = ''
        rank = response.xpath('//*[@id="SalesRank"]/text()').extract()
        if rank:
            yield {'ISBN': isbn, 'RANK': rank, 'TITLE': title}

    def close(self, spider, reason):
        ....
Problem: although I see the 'Closing Spider...' message, the spider does not stop.
I also tried putting raise CloseSpider('Closing Spider...') in the spider itself. This time, the message was not even shown.
This is the traceback I get. I notice it also contains a During handling of the above exception, another exception occurred: error.
ERROR: Error downloading <GET https://www.amazon.com/dp/0486460169>
Traceback (most recent call last):
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://www.amazon.com/dp/0486460169>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\scrapy\core\downloader\middleware.py", line 53, in process_response
    spider=spider)
  File "C:\Users\HP\PycharmProjects\Amazon_v1\amazon_books\amazon_books\middlewares.py", line 95, in process_response
    raise CloseSpider(reason='Closing Spider...')
scrapy.exceptions.CloseSpider