How to scrape HTTPS websites through an SSL proxy with Scrapy

Time: 2019-02-22 03:15:18

Tags: ssl https proxy scrapy

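I have an SSL proxy server and I want to scrape an https site. By that I mean the connection between Scrapy and the proxy is encrypted, and the proxy then opens the connection to the website. After some debugging, I found that Scrapy currently behaves as follows: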

If the site is http, ScrapyProxyAgent is used. It sends a client hello to the proxy and then sends the request for the website to the proxy.

But if the site is https, TunnelingAgent is used. It does not send a client hello to the proxy, so the connection is terminated.
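For context, an HTTPS request through a proxy normally begins with a plaintext HTTP CONNECT request to the proxy, and the TLS handshake with the target site follows only after the proxy answers with a 2xx status. If the proxy itself expects TLS first, that plaintext CONNECT would presumably arrive where a client hello is expected, which is consistent with the termination described above. The sketch below builds such a request and checks the reply. The helper names `build_connect_request` and `tunnel_established` are hypothetical and are not part of Scrapy.

```python
from typing import Optional


def build_connect_request(host: str, port: int,
                          proxy_auth: Optional[bytes] = None) -> bytes:
    """Build the CONNECT request a client sends to an HTTP proxy."""
    target = f"{host}:{port}".encode("ascii")
    lines = [b"CONNECT " + target + b" HTTP/1.1", b"Host: " + target]
    if proxy_auth is not None:
        # Value of the Proxy-Authorization header, e.g. b"Basic ..."
        lines.append(b"Proxy-Authorization: " + proxy_auth)
    # An empty line terminates the header block.
    return b"\r\n".join(lines) + b"\r\n\r\n"


def tunnel_established(reply: bytes) -> bool:
    """Return True if the proxy's status line carries a 2xx code."""
    status_line = reply.split(b"\r\n", 1)[0]
    parts = status_line.split(b" ", 2)
    return len(parts) >= 2 and parts[1].isdigit() and parts[1].startswith(b"2")
```

With an SSL proxy, these bytes would have to travel inside the already encrypted connection to the proxy rather than over the bare TCP socket.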

I need to tell Scrapy to first establish the connection through ScrapyProxyAgent and then use TunnelingAgent, but I am not sure how to do that.
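The ordering asked for here can be written down as a list of handshake steps. The function below is only a way to reason about the layers and is not Scrapy code. `connection_steps`, its step labels, and the default proxy ports are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def connection_steps(target_url: str, proxy_url: str) -> list:
    """List, in order, the handshakes needed to reach target_url via proxy_url."""
    target = urlsplit(target_url)
    proxy = urlsplit(proxy_url)
    # Assumed defaults when the proxy URL carries no explicit port.
    proxy_port = proxy.port or (443 if proxy.scheme == "https" else 80)
    steps = [f"TCP connect to {proxy.hostname}:{proxy_port}"]
    if proxy.scheme == "https":
        # Outer TLS layer: encrypts the hop between Scrapy and the proxy.
        steps.append(f"TLS handshake with proxy {proxy.hostname}")
    if target.scheme == "https":
        target_port = target.port or 443
        # CONNECT travels inside the outer TLS layer when there is one.
        steps.append(f"CONNECT {target.hostname}:{target_port}")
        # Inner TLS layer: end to end with the website, through the tunnel.
        steps.append(f"TLS handshake with target {target.hostname}")
    else:
        steps.append(f"HTTP request for {target_url} sent to proxy")
    return steps
```

For an https site behind an https proxy this yields four steps, with two separate TLS handshakes: one with the proxy and one with the website.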

I tried to write a custom https handler for DOWNLOAD_HANDLERS, but I am not an expert in this:

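# Import paths below are assumed from Scrapy's http11 download handler module.
from time import time
from urllib.parse import urldefrag

from scrapy.core.downloader.handlers.http11 import (
    HTTP11DownloadHandler, ScrapyProxyAgent, TunnelingAgent,
    _RequestBodyProducer)
from scrapy.core.downloader.webclient import _parse
from scrapy.utils.python import to_bytes, to_unicode
from twisted.internet import reactor
from twisted.web.http_headers import Headers as TxHeaders


class MyHTTPDownloader(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""

        timeout = request.meta.get('download_timeout') or self._connectTimeout
        bindaddress = request.meta.get('bindaddress')
        proxy = request.meta.get('proxy')
        # Agent that sends the request to the proxy itself
        agent = ScrapyProxyAgent(reactor, proxyURI=to_bytes(proxy, encoding='ascii'),
                    connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
        proxyHost = to_unicode(proxyHost)
        url = urldefrag(request.url)[0]
        method = to_bytes(request.method)
        headers = TxHeaders(request.headers)
        omitConnectTunnel = b'noconnect' in proxyParams
        proxyConf = (proxyHost, proxyPort,
                     request.headers.get(b'Proxy-Authorization', None))
        # Pick a body producer: the request body, an empty body for POST, or none
        if request.body:
            bodyproducer = _RequestBodyProducer(request.body)
        elif method == b'POST':
            bodyproducer = _RequestBodyProducer(b'')
        else:
            bodyproducer = None
        start_time = time()
        # Agent that opens a CONNECT tunnel through the proxy.
        # It is created but not used yet; chaining it after `agent` is the open question.
        tunnelingAgent = TunnelingAgent(reactor, proxyConf,
                             contextFactory=self._contextFactory, connectTimeout=timeout,
                             bindAddress=bindaddress, pool=self._pool)

        agent.request(method, to_bytes(url, encoding='ascii'), headers, bodyproducer)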
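
For a handler like the one above to be picked up at all, it would have to be registered in the project settings. A minimal sketch, assuming the class lives in a hypothetical module `myproject/handlers.py`:

```python
# settings.py
# DOWNLOAD_HANDLERS maps a URL scheme to the import path of a handler class.
# "myproject.handlers" is a hypothetical module path; adjust it to the project.
DOWNLOAD_HANDLERS = {
    "https": "myproject.handlers.MyHTTPDownloader",
}
```

The proxy itself is still chosen per request through `request.meta['proxy']`, which is the value the handler reads.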

Once the proxy agent has connected, I need to establish a tunnel. Is that possible?

Thanks in advance.

0 answers:

No answers