Connection errors in a Scrapy spider using the toripchanger module

Asked: 2019-02-20 12:24:11

Tags: python proxy scrapy tor polipo

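First of all, most items were scraped (I missed 3 of them), so the Internet connection works, but the behaviour is not what I expected. I should add that I have no training in Internet protocols; I only have a vague understanding of the principles and of how they work.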

So I use Scrapy to crawl a website.

I try to be as anonymous as possible so that I do not get banned, even though my spider is polite.

Setup

My middlewares.py and settings.py in my Scrapy project are configured like this, as far as the Internet connection is concerned:

settings.py

# proxy for Polipo
HTTP_PROXY = 'http://127.0.0.1:8123'

RETRY_ENABLED = True
RETRY_TIMES = 5  # initial request + 5 retries = up to 6 requests
RETRY_HTTP_CODES = [401, 403, 404, 408, 500, 502, 503, 504]
...
DOWNLOADER_MIDDLEWARES = {
    'folder.middlewares.RandomUserAgentMiddleware': 400,
    'folder.middlewares.ProxyMiddleware': 410,
    # pre-1.0 module path; current Scrapy names it
    # scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

middlewares.py

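from scrapy.utils.project import get_project_settings  # assumed source of `settings`
from toripchanger import TorIpChanger

settings = get_project_settings()

# Reuse a Tor IP only after 10 other IPs have been used.
ip_changer = TorIpChanger(reuse_threshold=10)


class ProxyMiddleware(object):
    _requests_count = 0

    def process_request(self, request, spider):
        # Ask Tor for a new IP once the counter passes 10 requests.
        self._requests_count += 1
        if self._requests_count > 10:
            self._requests_count = 0
            ip_changer.get_new_ip()

        # Route every request through the local Polipo proxy.
        request.meta['proxy'] = settings.get('HTTP_PROXY')
        spider.log('Proxy : %s' % request.meta['proxy'])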

So here you can see that I want to change the IP every ten requests.
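As a side note on the counting logic only, here is a minimal sketch of an "every N requests" trigger that is independent of Tor. The class name `RotationCounter` and its `tick` method are hypothetical, made up for this illustration. With the `> 10` test in the middleware above, the first rotation would fire on the 11th request and then every 11 requests; using `>=` makes it fire on every 10th.

```python
class RotationCounter:
    """Signals when a fixed number of requests has gone through."""

    def __init__(self, every=10):
        if every < 1:
            raise ValueError("every must be at least 1")
        self.every = every
        self.count = 0

    def tick(self):
        """Count one request; return True when a rotation is due."""
        self.count += 1
        if self.count >= self.every:
            self.count = 0
            return True
        return False


# Hypothetical usage: record on which of 30 requests a rotation would fire.
counter = RotationCounter(every=10)
rotations = [n for n in range(1, 31) if counter.tick()]
print(rotations)  # [10, 20, 30]
```

In a middleware, `tick()` would be called once per `process_request`, and the IP change would only be requested when it returns True.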

How I start Tor and Polipo

I use Vidalia to start Tor and the Polipo proxy. It is configured as follows.

General settings

  • Proxy application (optional)
    • Start a proxy application when Tor starts: checked.
      • C:\Users\truc\Documents\Tor\polipo.exe
    • Proxy application arguments
      • -c "C:\Users\truc\Documents\Tor\config"

Note that I translated these labels from French, so it is normal if you do not see exactly the same titles.

Polipo's config file is configured as:

# Uncomment this if you want to use a parent SOCKS proxy:
socksParentProxy = "localhost:9050"
socksProxyType = socks5
diskCacheRoot = ""

# Uncomment one of these if you want to allow remote clients to connect:
# proxyAddress = "::0"        # both IPv4 and IPv6
proxyAddress = "0.0.0.0"    # IPv4 only

Nothing else in this file is uncommented.
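The config above does not set a proxy port, so Polipo should be listening on its default port, 8123, which matches HTTP_PROXY in settings.py. The log further down, however, shows a connection attempt to 127.0.0.1:8118 (Privoxy's default port). As a rough diagnostic sketch, the helper below checks which local ports accept TCP connections; `port_is_open` is a hypothetical name, and the port list is only my guess at what is worth probing.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections (WinError 10061) and timeouts.
        return False


if __name__ == "__main__":
    # 8123: Polipo default, 8118: Privoxy default,
    # 9050: the socksParentProxy port from the Polipo config.
    for port in (8123, 8118, 9050):
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print("127.0.0.1:%d is %s" % (port, state))
```

If 8123 is open and 8118 is closed, anything that is hard-wired to 8118 would fail with a "connection refused" error while the spider's own requests still go through.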

Advanced settings

  • Control port
    • Use TCP connection (ControlPort): checked
      • 127.0.0.1:9051
  • Tor configuration file
    • C:\Users\truc\Documents\Tor\Data\Tor\torrc

Tor's torrc file is configured as:

ControlPort 9051
DataDirectory C:/Users/truc/Documents/Tor/Data/Tor
HashedControlPassword  ******* 
Log notice stdout
SocksPort 9050
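One point about HashedControlPassword that may matter here: torrc stores only a salted hash, and a controller client has to send the plaintext password, which Tor then hashes and compares. The sketch below reproduces, to the best of my understanding, the format written by `tor --hash-password` (8-byte salt, one count byte, SHA-1 digest). The function names are hypothetical, and this is an illustration rather than a replacement for the Tor command.

```python
import hashlib
import os


def tor_hash_password(password, salt=None):
    """Hash a control password in the '16:<hex>' format (salted, iterated SHA-1)."""
    if salt is None:
        salt = os.urandom(8)
    indicator = 0x60  # encodes how many bytes are fed to the hash
    count = (16 + (indicator & 15)) << ((indicator >> 4) + 6)  # 65536 bytes
    data = salt + password.encode("utf-8")
    repeats, remainder = divmod(count, len(data))
    digest = hashlib.sha1(data * repeats + data[:remainder]).digest()
    return "16:" + (salt + bytes([indicator]) + digest).hex().upper()


def password_matches(password, hashed):
    """Check a plaintext password against a HashedControlPassword value."""
    raw = bytes.fromhex(hashed[3:])  # drop the leading '16:'
    return tor_hash_password(password, salt=raw[:8]) == hashed.upper()


if __name__ == "__main__":
    hashed = tor_hash_password("my password")  # placeholder password
    print(hashed)
    print(password_matches("my password", hashed))
    print(password_matches("", hashed))
```

The practical consequence is that whatever connects to port 9051 needs the original password, not the hashed value from torrc, and an empty or missing password will not match.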

How I launch the spider

py -m scrapy crawl spider -a arg1=0 -a arg2=30

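So my spider scrapes 30 different URLs, which means at least 30 different requests, not counting the login page.

How I check my IP

In my spider file, I send a request to http://checkip.dyndns.org/ to check whether my IP has changed.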

def parse_page(self, response):
    ...  # parsing and building the item
    # Extra request that is only used to print the current exit IP.
    yield scrapy.Request('http://checkip.dyndns.org/', meta={'item': item},
                         callback=self.checkip, dont_filter=True)
    yield item

def checkip(self, response):
    # Raw string so the regex backslashes are not treated as escapes.
    ips = response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
    print('IP: {}'.format(ips[0]))
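The IP check above indexes the first regex match directly, which raises an IndexError when the page contains no address. As a small sketch of a more defensive extraction, the helper below returns None in that case and also rejects strings such as 999.1.1.1 that match the pattern but are not valid addresses. The name `extract_ipv4` and the sample body are assumptions for illustration; the real page may be formatted differently.

```python
import ipaddress
import re

IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ipv4(text):
    """Return the first valid IPv4 address found in text, or None."""
    for candidate in IPV4_PATTERN.findall(text):
        try:
            return str(ipaddress.IPv4Address(candidate))
        except ValueError:
            continue  # matched the pattern but is not a valid address
    return None


# Assumed shape of the response body, for illustration only.
sample = "<html><body>Current IP Address: 195.176.3.20</body></html>"
print(extract_ipv4(sample))  # 195.176.3.20
```

In the spider, the callback could pass `response.text` to this helper and log a warning when it returns None.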

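The exceptions I got

Contrary to what I expected, the IP does not change. The console printed IP: 195.176.3.20 25 times (it would have been 30 times without the exceptions), whereas I expected a change every 10 requests. Yes, it is odd to get only 27 items and only 25 IP checks; that is because some requests to http://checkip.dyndns.org/ hit the same exception that you can see below.

The log file contains the following lines: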

2019-02-19 17:07:22 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: I_AM_A_POLITE_ROBOT)
2019-02-19 17:07:22 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.5, 
cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, 
Twisted 18.9.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) 
[MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), 
cryptography 2.5, Platform Windows-7-6.1.7601-SP1
2019-02-19 17:07:22 [scrapy.crawler] INFO: Overridden settings:{'AUTOTHROTTLE_ENABLED': True, 
'BOT_NAME': 'I_AM_A_POLITE_ROBOT',
'DOWNLOAD_DELAY': 2, 'LOG_FILE': 'monlog.log',
'NEWSPIDER_MODULE':'folder.spiders', 
'RETRY_HTTP_CODES': [401, 403, 404, 408, 500, 502, 503, 504],
'RETRY_TIMES': 5, 'ROBOTSTXT_OBEY': True, 
'SPIDER_MODULES': ['folder.spiders']}

2019-02-19 17:07:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-02-19 17:07:23 [spider_name] DEBUG: Proxy : http://127.0.0.1:8123

2019-02-19 17:07:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.website.com/robots.txt> (referer: None)
2019-02-19 17:07:25 [spider_name] DEBUG: Proxy : http://127.0.0.1:8123
2019-02-19 17:07:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.website.com/compte/login> (referer: None)
... # here the 200 shows that the connection is well established, but right after that there is a problem I do not understand.
2019-02-19 17:07:42 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:8118
2019-02-19 17:07:43 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "[WinError 10058] Une demande d’envoi ou de réception de données n’a pas été autorisée car le socket avait déjà été éteint dans cette direction par un appel d’arrêt précédent"
# translation: A request to send or receive data was disallowed because the socket had already been shut down in that direction with a previous shutdown call
2019-02-19 17:07:43 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
2019-02-19 17:07:43 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
2019-02-19 17:07:43 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
... # and repeated many times.
2019-02-19 17:07:43 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.website.com/page1>
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
 (self._dns_host, self.port), self.timeout, **extra_kw)
  File "C:\Python37\lib\site-packages\urllib3\util\connection.py", line 80, in create_connection
raise err
  File "C:\Python37\lib\site-packages\urllib3\util\connection.py", line 70, in create_connection
sock.connect(sa)
ConnectionRefusedError: [WinError 10061] Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée
# translation: No connection could be made because the target machine actively refused it

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
  File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
  File "C:\Python37\lib\http\client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Python37\lib\http\client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Python37\lib\http\client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Python37\lib\http\client.py", line 1016, in _send_output
self.send(msg)
  File "C:\Python37\lib\http\client.py", line 956, in send
self.connect()
  File "C:\Python37\lib\site-packages\urllib3\connection.py", line 181, in connect
conn = self._new_conn()
  File "C:\Python37\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x04EF7BD0>: Failed to establish a new connection: [WinError 10061]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
  File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
  File "C:\Python37\lib\site-packages\urllib3\util\retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: 
HTTPConnectionPool(host='127.0.0.1', port=8118): Max retries exceeded with url: 
http://icanhazip.com/ (Caused by ProxyError('Cannot connect to proxy.',
NewConnectionError('<urllib3.connection.HTTPConnection object at 0x04EF7BD0>: Failed to establish a new connection: 
[WinError 10061] Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\toripchanger\changer.py", line 107, in get_new_ip
current_ip = self.get_current_ip()
  File "C:\Python37\lib\site-packages\toripchanger\changer.py", line 84, in get_current_ip
response = get(ICANHAZIP, proxies={'http': self.local_http_proxy})
  File "C:\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
  File "C:\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
  File "C:\Python37\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
  File "C:\Python37\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
  File "C:\Python37\lib\site-packages\requests\adapters.py", line 510, in send
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPConnectionPool(host='127.0.0.1', port=8118): 
Max retries exceeded with url: http://icanhazip.com/ (Caused by ProxyError('Cannot connect to proxy.',
NewConnectionError('<urllib3.connection.HTTPConnection object at 0x04EF7BD0>: 
Failed to establish a new connection: [WinError 10061] Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
   File "C:\Python37\lib\site-packages\scrapy\core\downloader\middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
  File "C:\Users\truc\scrapy_project\middlewares.py", line 39, in process_request
ip_changer.get_new_ip()
  File "C:\Python37\lib\site-packages\toripchanger\changer.py", line 109, in get_new_ip
self._obtain_new_ip()
  File "C:\Python37\lib\site-packages\toripchanger\changer.py", line 180, in _obtain_new_ip
controller.authenticate(password=self.tor_password)
  File "C:\Python37\lib\site-packages\stem\control.py", line 1100, in authenticate
stem.connection.authenticate(self, *args, **kwargs)
  File "C:\Python37\lib\site-packages\stem\connection.py", line 625, in authenticate
raise auth_exc
  File "C:\Python37\lib\site-packages\stem\connection.py", line 579, in authenticate
authenticate_password(controller, password, False)
  File "C:\Python37\lib\site-packages\stem\connection.py", line 735, in authenticate_password
raise IncorrectPassword(str(auth_response), auth_response)
stem.connection.IncorrectPassword: Authentication failed: Password did not match HashedControlPassword value from configuration

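Sorry that this is so long, but I wanted it to be complete since, as I said, I have no training in these matters. Note that the exceptions above repeat: the same pattern occurs each time, so I cut the log short.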
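One detail in the log above is that two different local proxy ports appear: the spider logs 127.0.0.1:8123, while the failing connection in the traceback targets 127.0.0.1:8118. As a small sketch for spotting this kind of mismatch in a long log, the helper below counts every 127.0.0.1:<port> endpoint in a piece of text. The name `local_ports` is hypothetical and the excerpt is copied from the log lines in this post.

```python
import re
from collections import Counter

LOCAL_ENDPOINT = re.compile(r"127\.0\.0\.1:(\d+)")


def local_ports(log_text):
    """Count how often each 127.0.0.1:<port> endpoint appears in a log."""
    return Counter(int(port) for port in LOCAL_ENDPOINT.findall(log_text))


excerpt = """\
[spider_name] DEBUG: Proxy : http://127.0.0.1:8123
[urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:8118
[scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
"""

for port, hits in sorted(local_ports(excerpt).items()):
    print(port, hits)
```

To run it on the full log, the text could be read from the file named in LOG_FILE (monlog.log in the settings shown in the log) and passed to `local_ports`.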

What should I do?

Tor version: 0.3.4.8, Vidalia: 0.2.21, Scrapy: 1.6.0, Polipo: I don't know, Python: 3.7.2

0 answers