Question

If a page throws a CAPTCHA, I want to stop the spider. So I wrote the following code in my custom downloader middleware:

from scrapy.exceptions import CloseSpider

# Method of the custom downloader middleware class
def process_response(self, request, response, spider):
    title = response.xpath('//title/text()').extract_first()
    if title == 'Robot Check':
        print('CAPTCHA THROWWWEEED')
        raise CloseSpider('Closing Spider...')
    else:
        return response

My spider (I removed some parts):

import scrapy
from scrapy import Request

class GoToPageSpider(scrapy.Spider):
    name = 'gotopage'
    allowed_domains = ["www.amazon.com"]
    start_urls = ['https://www.amazon.com']

    def __init__(self):
        ....

    def parse(self, response):
        yield Request(response.url, callback=self.start_crawling)

    def start_crawling(self, response):
        for isbn in self.queue_results:
            link = self.link + isbn
            yield Request(link, callback=self.parse_book, meta={'ISBN': isbn})

    def parse_book(self, response):
        isbn = response.meta['ISBN']

        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        if title:
            title = title.split("(")[0]

        # extract() returns a list, which is empty when no SalesRank element is found
        rank = response.xpath('//*[@id="SalesRank"]/text()').extract()
        if rank:
            yield {'ISBN': isbn, 'RANK': rank, 'TITLE': title}

    def close(self, spider, reason):
        ....

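Problem: although I see the 'Closing Spider...' message, the spider does not stop. I also tried putting raise CloseSpider('Closing Spider...') in the spider itself. That time, it did not even show the message.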

This is the traceback I get. I also see the error "During handling of the above exception, another exception occurred:".

ERROR: Error downloading <GET https://www.amazon.com/dp/0486460169>
Traceback (most recent call last):
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\twisted\internet\defer.py", line 1363, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://www.amazon.com/dp/0486460169>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\hp\pycharmprojects\amazon_v1\venv\lib\site-packages\scrapy\core\downloader\middleware.py", line 53, in process_response
    spider=spider)
  File "C:\Users\HP\PycharmProjects\Amazon_v1\amazon_books\amazon_books\middlewares.py", line 95, in process_response
    raise CloseSpider(reason='Closing Spider...')
scrapy.exceptions.CloseSpider
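
A possible workaround, offered as a sketch rather than a confirmed fix: the Scrapy documentation describes CloseSpider as an exception to raise from a spider callback, which would explain why raising it in process_response only produces the "Error downloading" log above. A downloader middleware can instead ask the engine to shut down through spider.crawler.engine.close_spider(spider, reason). The class name CaptchaStopMiddleware and the reason string 'captcha_detected' are made up for illustration, and the engine call should be checked against the installed Scrapy version.

```python
class CaptchaStopMiddleware:
    """Hypothetical downloader middleware: stop the crawl on a CAPTCHA page."""

    CAPTCHA_TITLE = 'Robot Check'  # page title taken from the question

    def __init__(self):
        # Remember whether shutdown was already requested, so it is asked for only once.
        self._close_requested = False

    def process_response(self, request, response, spider):
        title = response.xpath('//title/text()').extract_first()
        if title and title.strip() == self.CAPTCHA_TITLE:
            if not self._close_requested:
                self._close_requested = True
                spider.logger.warning('CAPTCHA page at %s, closing spider', response.url)
                # Ask the engine to close the spider instead of raising CloseSpider here.
                spider.crawler.engine.close_spider(spider, reason='captcha_detected')
        # process_response must return a Response or Request (or raise IgnoreRequest).
        return response
```

To try it, the class would need to be registered under DOWNLOADER_MIDDLEWARES in settings.py. Requests already in flight may still finish before the shutdown completes. Raising scrapy.exceptions.IgnoreRequest instead of returning the response should keep the CAPTCHA page away from the callbacks. In this sketch the page is returned, and parse_book yields nothing for it as long as the CAPTCHA page contains no SalesRank element.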

CloseSpider: unable to close the spider

0 answers: