Question

I have a middleware that raises IgnoreRequest() if the URL contains "https".

from scrapy.exceptions import IgnoreRequest

class MiddlewareSkipHTTPS(object):
    def process_response(self, request, response, spider):
        if (response.url.find("https") > -1):
            raise IgnoreRequest()
        else:
            return response

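Is there a way to completely prevent Scrapy from performing a GET request for HTTPS URLs? I get the same values for response_bytes / response_count with and without my snippet that raises IgnoreRequest(). I am looking for zero values and for the URL to be skipped entirely. I don't want Scrapy to fetch and download all the bytes from https pages; it should just move on to the next URL.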

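Note: it must be a middleware; I don't want to use rules embedded in the spiders. There are hundreds of spiders and I want to consolidate the logic in one place.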

Answer 1

Don't use process_response; it is called after the request has already been made and the response downloaded.

You need to use

def process_request(self, request, spider):
    request.url  # URL being scraped

This method is called before the request is actually sent to the downloader.

See here:

https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_request
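
Putting that together, a minimal sketch of the middleware might look like the following. The class name SkipHttpsMiddleware is made up for illustration, and the check is on the URL scheme rather than the substring test from the question, so an http URL that merely contains "https" somewhere in its path is not skipped. The try/except around the import is only there so the snippet can be tried without Scrapy installed; in a real project you would import IgnoreRequest directly.

```python
from urllib.parse import urlparse

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    # Assumption: stand-in so the sketch runs without Scrapy installed.
    class IgnoreRequest(Exception):
        """Placeholder for scrapy.exceptions.IgnoreRequest."""


class SkipHttpsMiddleware:
    """Downloader middleware that drops HTTPS requests before download."""

    def process_request(self, request, spider):
        # Runs before the request reaches the downloader, so no bytes
        # are fetched for requests rejected here.
        if urlparse(request.url).scheme == "https":
            raise IgnoreRequest(f"Skipping HTTPS URL: {request.url}")
        # Returning None lets Scrapy continue processing the request.
        return None
```

To apply it to every spider, register the class once under DOWNLOADER_MIDDLEWARES in the project settings, using whatever module path it lives at in your project (for example a hypothetical 'myproject.middlewares.SkipHttpsMiddleware').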

Answer 2

Doing this in your settings should work fine:

DOWNLOAD_HANDLERS = {
    'https': None
}

Scrapy middleware to ignore a URL and prevent crawling

2 answers: