Question

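Is there a way to set a new proxy IP (e.g. from a pool) based on the HTTP response status code? For example, start with one IP from an IP list and keep using it until it gets a 503 response (or another HTTP error code), then switch to the next one until that gets blocked, and so on. Something like: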

if http_status_code in [403, 503, ..., n]:
    proxy_ip = 'new ip'
    # Then keep using it until it gets another error code

Any ideas?

Answer 1

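Scrapy has a downloader middleware, enabled by default, that handles proxies. It is called HttpProxyMiddleware, and it lets you supply a 'proxy' meta key on a Request; that proxy is then used for the request.

There are a few ways to do this. The first is to use it directly in your spider code: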

from scrapy import Request

def parse(self, response):
    if response.status in range(400, 600):
        return Request(response.url,
                       meta={'proxy': 'http://myproxy:8010'},
                       dont_filter=True)  # bypass the duplicate filter: this URL was already requested once

Another, more elegant way is to use a custom downloader middleware, which covers every callback and keeps your spider code clean:

import logging

from scrapy import Request

from project.settings import PROXY_URL

class MyDM(object):
    def process_response(self, request, response, spider):
        if response.status in range(400, 600):
            logging.debug('retrying [{}]{} with proxy: {}'.format(response.status, response.url, PROXY_URL))
            # Re-issue the request through the proxy, bypassing the duplicate filter
            return Request(response.url,
                           meta={'proxy': PROXY_URL},
                           dont_filter=True)
        return response
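
To get the behaviour the question asks for (stay on one proxy until it is blocked, then move to the next), the same idea can be extended to a pool. The following is only a rough sketch under some assumptions: the proxy URLs, the BLOCKED_CODES set, the MAX_PROXY_RETRIES cap, and the 'proxy_retries' meta key are all made up for illustration and are not part of Scrapy. It relies only on the 'proxy' meta key and on Request.replace().

```python
from itertools import cycle

# Assumptions: placeholder proxy URLs and status codes; replace with your own.
PROXY_POOL = ['http://proxy1:8010', 'http://proxy2:8010', 'http://proxy3:8010']
BLOCKED_CODES = {403, 503}
MAX_PROXY_RETRIES = 5  # hypothetical cap so a request is not retried forever


class RotatingProxyMiddleware(object):
    """Use one proxy until a blocking status comes back, then move to the next."""

    def __init__(self, proxies=PROXY_POOL):
        self._proxies = cycle(proxies)
        self.current = next(self._proxies)

    def process_request(self, request, spider):
        # Route every outgoing request through the proxy currently in use.
        request.meta['proxy'] = self.current
        return None

    def process_response(self, request, response, spider):
        if response.status not in BLOCKED_CODES:
            return response
        retries = request.meta.get('proxy_retries', 0)  # hypothetical meta key
        if retries >= MAX_PROXY_RETRIES:
            return response  # give up and pass the response on unchanged
        # Rotate only if the failing request used the proxy that is still
        # current, so several concurrent failures do not skip healthy proxies.
        if request.meta.get('proxy') == self.current:
            self.current = next(self._proxies)
        new_meta = dict(request.meta, proxy=self.current, proxy_retries=retries + 1)
        # Returning a Request from process_response reschedules it.
        return request.replace(meta=new_meta, dont_filter=True)
```

Because cycle() wraps around, a proxy that was blocked earlier is tried again once the rest of the pool has been used; whether that is acceptable depends on how long the target site keeps an IP blocked.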

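Note that by default Scrapy does not pass responses with status codes other than 200 on to your callbacks. It handles 3xx redirect codes automatically with the Redirect middleware, and the HttpError middleware raises errors for 4xx and 5xx responses. To handle responses other than 200 you need to do one of the following.

Specify it in the Request meta: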

Request(url, meta={'handle_httpstatus_list': [404,505]})
# or for all 
Request(url, meta={'handle_httpstatus_all': True})

Or set it project-wide or spider-wide in the settings:

HTTPERROR_ALLOW_ALL = True  # for all
HTTPERROR_ALLOWED_CODES = [404, 505]  # for specific

See http://doc.scrapy.org/en/latest/topics/spider-middleware.html#httperror-allowed-codes
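
A custom downloader middleware such as MyDM only runs once it is registered in the DOWNLOADER_MIDDLEWARES setting. A possible settings.py fragment might look like this; the module path 'project.middlewares.MyDM', the priority 543, and the proxy URL are assumptions for illustration and should be adapted to your project layout:

```python
# settings.py (sketch; paths and values below are assumptions)

# Proxy used by the middleware when a request has to be retried.
PROXY_URL = 'http://myproxy:8010'

# Register the custom downloader middleware. The dotted path is hypothetical
# and the number only fixes its position relative to other middlewares.
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyDM': 543,
}

# Let these error responses reach spider callbacks instead of being filtered.
HTTPERROR_ALLOWED_CODES = [403, 503]
```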

How to detect the HTTP response status code in Scrapy and set a proxy accordingly?
