Question

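I have the following problem with my Scrapy middleware:

I make requests to a site over HTTPS and also use a proxy. When I define a middleware and use process_response in it, response.headers contains only the headers from the website. Is there any way to get the headers from the response to the CONNECT request that establishes the proxy tunnel? The proxy we use adds some information as headers in that response, and we want to use it in the middleware. I found that in TunnelingTCP4ClientEndpoint.processProxyResponse the parameter rcvd_bytes has all the information I need, but I have not found a way to get rcvd_bytes in the middleware.

I also found a similar (identical) question from a year ago that was never resolved: Not receiving headers Scrapy ProxyMesh

Here is an example from the proxy's website:

For HTTPS, the IP is in the CONNECT response header x-hola-ip. Example for a proxy peer IP of 5.6.7.8: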

Request
CONNECT example.com:80 HTTP/1.1
Host: example.com:80
Accept: */*

Response:
HTTP/1.1 200 OK
Content-Type: text/html
x-hola-ip: 5.6.7.8

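In this example, I want to get x-hola-ip.

When using curl, as in curl --proxy mysuperproxy https://stackoverflow.com, I also get the correct data in the CONNECT response.

If this is not possible, my fallback so far is to monkey patch the classes somehow, or maybe you know a better solution in Python.

Thanks in advance for your help.

Note: I have also posted this question on Scrapy's GitHub issues, and if I find a solution I will update both sites :)

Working solution, with Matthew's help: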

from scrapy.core.downloader.handlers.http11 import (
    HTTP11DownloadHandler, ScrapyAgent, TunnelingTCP4ClientEndpoint, TunnelError, TunnelingAgent
)
from scrapy import twisted_version

class MyHTTPDownloader(HTTP11DownloadHandler):
    i = ''
    def download_request(self, request, spider):
        # we're just overriding here to monkey patch the attribute
        agent = ScrapyAgent(contextFactory=self._contextFactory, pool=self._pool,
            maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
            warnsize=getattr(spider, 'download_warnsize', self._default_warnsize),
            fail_on_dataloss=self._fail_on_dataloss)


        agent._TunnelingAgent = MyTunnelingAgent

        return agent.download_request(request)

class MyTunnelingAgent(TunnelingAgent):
    if twisted_version >= (15, 0, 0):
        def _getEndpoint(self, uri):
            return MyTunnelingTCP4ClientEndpoint(
                self._reactor, uri.host, uri.port, self._proxyConf,
                self._contextFactory, self._endpointFactory._connectTimeout,
                self._endpointFactory._bindAddress)
    else:
        def _getEndpoint(self, scheme, host, port):
            return MyTunnelingTCP4ClientEndpoint(
                self._reactor, host, port, self._proxyConf,
                self._contextFactory, self._connectTimeout,
                self._bindAddress)

class MyTunnelingTCP4ClientEndpoint(TunnelingTCP4ClientEndpoint):
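    def processProxyResponse(self, rcvd_bytes):
        # Stash the raw CONNECT response (status line plus proxy headers) on the
        # handler class so that other code, such as a middleware, can read it.
        MyHTTPDownloader.i = rcvd_bytes
        return super(MyTunnelingTCP4ClientEndpoint, self).processProxyResponse(rcvd_bytes)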

In your settings:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.MyHTTPDownloader.MyHTTPDownloader',
    'https': 'crawler.MyHTTPDownloader.MyHTTPDownloader',
}
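
The bytes stored in MyHTTPDownloader.i are the raw CONNECT response, so they still have to be split into a status line and headers before x-hola-ip can be read. Below is a minimal sketch of that step in plain Python. The function name parse_connect_headers is made up for illustration, and it assumes rcvd_bytes holds a complete header block terminated by a blank line, which may not hold if the proxy response arrives in several chunks.

```python
def parse_connect_headers(rcvd_bytes):
    """Split a raw CONNECT response into (status_line, headers).

    Hypothetical helper, not part of Scrapy. Header names are lower-cased
    so lookups are case-insensitive.
    """
    # Everything before the first blank line is the status line plus headers.
    head, _, _ = rcvd_bytes.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    status_line = lines[0].decode("latin-1")
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # skip malformed lines that have no colon
        key = name.strip().lower().decode("latin-1")
        headers[key] = value.strip().decode("latin-1")
    return status_line, headers


# Sample bytes modelled on the proxy example in the question.
raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nx-hola-ip: 5.6.7.8\r\n\r\n"
status_line, headers = parse_connect_headers(raw)
print(status_line)
print(headers["x-hola-ip"])
```

A middleware could call this on MyHTTPDownloader.i inside process_response, bearing in mind that the class attribute only holds the most recently captured CONNECT response.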

Answer 1

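I saw in #3329 that someone from Scrapinghub said they are unlikely to add that feature, and recommended creating a custom subclass to get the behavior you want. So, with that in mind:

I believe that after creating the subclass, you can tell Scrapy to use it by setting the http and https keys in DOWNLOAD_HANDLERS to point to your subclass.

Bear in mind that I don't have a local HTTP proxy that sends extra headers to test with, so this is only a "napkin sketch" of what I think needs to happen: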

from scrapy.core.downloader.handlers.http11 import (
    HTTP11DownloadHandler, ScrapyAgent, TunnelingAgent,
)

class MyHTTPDownloader(HTTP11DownloadHandler):
    def download_request(self, request, spider):
        # we're just overriding here to monkey patch the attribute
        ScrapyAgent._TunnelingAgent = MyTunnelingAgent
        return super(MyHTTPDownloader, self).download_request(request, spider)

class MyTunnelingAgent(TunnelingAgent):
    # ... and here is where it would get weird
    pass

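That last bit is hand-wavy because I believe I have a clear understanding of the methods that need to be overridden to capture the bytes you want, but I don't have enough of the Twisted framework in my head to know where to put them so that they are exposed on the Response going back to the spider.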
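
One possible way around the "where to put them" problem, sketched here without any Twisted or Scrapy imports, is to have the overridden endpoint write the bytes into a shared slot and let a downloader middleware copy them into request.meta, which the spider can then read through response.meta. The names CAPTURED, ProxyConnectMiddleware and proxy_connect_raw are all hypothetical. A single shared slot is only safe if requests do not overlap: with concurrent requests it may hold another request's data, and a tunnel reused from the connection pool likely produces no new CONNECT response at all.

```python
from types import SimpleNamespace

# Hypothetical shared slot: the overridden processProxyResponse would store
# rcvd_bytes here instead of on a class attribute.
CAPTURED = {"raw": b""}


class ProxyConnectMiddleware:
    """Hypothetical downloader middleware that exposes the captured bytes."""

    def process_response(self, request, response, spider):
        # Copy the last captured CONNECT response into request.meta so the
        # spider callback can reach it via response.meta.
        request.meta["proxy_connect_raw"] = CAPTURED["raw"]
        return response


# Stand-in objects so the sketch runs without Scrapy installed.
CAPTURED["raw"] = b"HTTP/1.1 200 OK\r\nx-hola-ip: 5.6.7.8\r\n\r\n"
request = SimpleNamespace(meta={})
response = SimpleNamespace(status=200)
returned = ProxyConnectMiddleware().process_response(request, response, spider=None)
print(request.meta["proxy_connect_raw"])
```

In a real project the class would be registered under DOWNLOADER_MIDDLEWARES in the settings, next to the DOWNLOAD_HANDLERS entries.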
