How do I write a Scrapy DownloadHandler that makes requests through SocksiPy?

Asked: 2014-02-17 21:32:27

Tags: python web-scraping scrapy twisted socks

I'm trying to use Scrapy over Tor. I've been trying to work out how to write a DownloadHandler for Scrapy that makes its connections through SocksiPy.

Scrapy's HTTP11DownloadHandler is here: https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py

Here is an example of creating a custom download handler: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/dhandler.py

And here is the code for creating a SocksiPyConnection class: http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/

import httplib

import socks  # SocksiPy


class SocksiPyConnection(httplib.HTTPConnection):
    def __init__(self, proxytype, proxyaddr, proxyport=None, rdns=True, username=None, password=None, *args, **kwargs):
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        httplib.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        # Swap the plain socket for a SocksiPy socket configured with the proxy settings
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))
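
For reference, a minimal usage sketch of this connection class, assuming a local Tor SOCKS listener on 127.0.0.1:9050 (the target host is only an example):

import socks

# Hypothetical usage: tunnel an HTTP request through a local Tor SOCKS5 proxy
conn = SocksiPyConnection(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050,
                          host="check.torproject.org")
conn.request("GET", "/")
response = conn.getresponse()
print response.status, response.reason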

Because of the complexity of the Twisted reactor in the Scrapy code, I can't work out how to plug something like this into it. Any ideas?

Please don't answer with alternatives such as Privoxy, and please don't post answers saying "Scrapy doesn't work with SOCKS proxies" - I know that, which is why I'm trying to write a custom downloader that makes its requests with SocksiPy.

2 Answers:

Answer 0 (score: 7):

I was able to get this working using https://github.com/habnabit/txsocksx.

After doing pip install txsocksx, I needed to replace the Agent that Scrapy's ScrapyAgent uses with txsocksx.http.SOCKS5Agent.

I simply copied the code of HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http11.py, subclassed them, and wrote this code:

from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from txsocksx.http import SOCKS5Agent
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
from scrapy.core.downloader.webclient import _parse


class TorProxyDownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapyTorAgent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)


class ScrapyTorAgent(ScrapyAgent):
    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            omitConnectTunnel = proxyParams.find('noconnect') >= 0
            if scheme == 'https' and not omitConnectTunnel:
                # HTTPS still goes through Scrapy's CONNECT tunneling agent
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get('Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                    contextFactory=self._contextFactory, connectTimeout=timeout,
                    bindAddress=bindaddress, pool=self._pool)
            else:
                # Plain HTTP requests are routed through the SOCKS5 proxy endpoint
                _, _, host, port, proxyParams = _parse(request.url)
                proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                    timeout=timeout, bindAddress=bindaddress)
                agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
                return agent

        # No proxy configured: fall back to the standard Agent
        return self._Agent(reactor, contextFactory=self._contextFactory,
            connectTimeout=timeout, bindAddress=bindaddress, pool=self._pool)

In settings.py, something like this is needed:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler'
}

Now proxying Scrapy through a SOCKS proxy such as Tor works.
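
For illustration, here is a minimal, hypothetical spider showing how a request's meta['proxy'] value drives ScrapyTorAgent into the SOCKS path (the URL is only an example, and 127.0.0.1:9050 is an assumption, Tor's default local SOCKS port; import paths follow the Scrapy versions of that era):

from scrapy.http import Request
from scrapy.spider import Spider


class TorCheckSpider(Spider):
    name = 'tor_check'

    def start_requests(self):
        # The 'proxy' meta key is what ScrapyTorAgent._get_agent inspects;
        # 127.0.0.1:9050 assumes Tor's default local SOCKS port.
        yield Request('http://check.torproject.org/',
                      meta={'proxy': 'socks5://127.0.0.1:9050'},
                      callback=self.parse)

    def parse(self, response):
        self.log("Fetched %s through the SOCKS proxy" % response.url)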

Answer 1 (score: 3):

Try txsocksx from one of the Twisted developers.

Thanks to the strengths of Twisted and Scrapy, you can easily plug in SOCKS as a proxy:

downloader.py

import scrapy.core.downloader.handlers.http11 as handler
from twisted.internet import reactor
from txsocksx.http import SOCKS5Agent
from twisted.internet.endpoints import TCP4ClientEndpoint
from scrapy.core.downloader.webclient import _parse


class TorScrapyAgent(handler.ScrapyAgent):
    _Agent = SOCKS5Agent

    def _get_agent(self, request, timeout):
        proxy = request.meta.get('proxy')

        if proxy:
            proxy_scheme, _, proxy_host, proxy_port, _ = _parse(proxy)

            if proxy_scheme == 'socks5':
                endpoint = TCP4ClientEndpoint(reactor, proxy_host, proxy_port)

                return self._Agent(reactor, proxyEndpoint=endpoint)

        return super(TorScrapyAgent, self)._get_agent(request, timeout)


class TorHTTPDownloadHandler(handler.HTTP11DownloadHandler):
    def download_request(self, request, spider):
        agent = TorScrapyAgent(contextFactory=self._contextFactory, pool=self._pool,
                               maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
                               warnsize=getattr(spider, 'download_warnsize', self._default_warnsize))

        return agent.download_request(request)

Register the new handlers in settings.py:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.downloader.TorHTTPDownloadHandler',
    'https': 'crawler.downloader.TorHTTPDownloadHandler'
}

Now you only need to tell the crawler to use the proxy. I recommend doing it with a downloader middleware:

class ProxyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'socks5://127.0.0.1:9050'  # Tor's default local SOCKS port
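
The middleware then has to be enabled in settings.py; something along these lines should work (the module path crawler.middlewares and the priority 100 are assumptions, adjust them to your project layout):

# settings.py -- module path and priority value are assumptions for this sketch
DOWNLOADER_MIDDLEWARES = {
    'crawler.middlewares.ProxyDownloaderMiddleware': 100,
}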