First of all, most of the items do get scraped (I missed 3 items), so the Internet connection is there, but the behaviour is not what I expected. I should say that I have no training in Internet protocols: I only have a vague idea of the principles involved and of how all this works.

So I use Scrapy to crawl a website. I do my best to stay anonymous so that my spider, polite as it is, does not get banned. To that end, middlewares.py and settings.py in my Scrapy project are configured as follows with regard to the connection:
settings.py
# proxy for Polipo
HTTP_PROXY = 'http://127.0.0.1:8123'

RETRY_ENABLED = True
RETRY_TIMES = 5  # initial response + 5 retries = 6 requests
RETRY_HTTP_CODES = [401, 403, 404, 408, 500, 502, 503, 504]
...
DOWNLOADER_MIDDLEWARES = {
    'folder.middlewares.RandomUserAgentMiddleware': 400,
    'folder.middlewares.ProxyMiddleware': 410,
    # the key must match the built-in path for the None override to disable it
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
middlewares.py
from toripchanger import TorIpChanger
from scrapy.utils.project import get_project_settings  # one way to access the settings here; imports were elided above

settings = get_project_settings()
ip_changer = TorIpChanger(reuse_threshold=10)


class ProxyMiddleware(object):
    _requests_count = 0

    def process_request(self, request, spider):
        # Change the Tor identity every ten requests.
        self._requests_count += 1
        if self._requests_count > 10:
            self._requests_count = 0
            ip_changer.get_new_ip()
        request.meta['proxy'] = settings.get('HTTP_PROXY')
        spider.log('Proxy : %s' % request.meta['proxy'])
So here you can see that I want to change the IP every ten requests.
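If I understand correctly, get_new_ip() works roughly like this (a sketch I pieced together from the traceback further down, not the library's actual code; the 127.0.0.1:8118 proxy address and the empty password are the defaults the traceback suggests):

import requests
from stem import Signal
from stem.control import Controller

def get_new_ip_sketch(local_http_proxy='http://127.0.0.1:8118', tor_password=''):
    # 1. Fetch the current exit IP through a local HTTP proxy.
    current_ip = requests.get('http://icanhazip.com/',
                              proxies={'http': local_http_proxy}).text.strip()
    # 2. Authenticate on Tor's control port and request a new circuit.
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=tor_password)
        controller.signal(Signal.NEWNYM)
    return current_ip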
I start Tor and the Polipo proxy with Vidalia, configured as follows. Note that I am translating from French, so it is normal if some option names do not look exactly the same. Polipo's config file is set up as:
# Uncomment this if you want to use a parent SOCKS proxy:
socksParentProxy = "localhost:9050"
socksProxyType = socks5
diskCacheRoot = ""
# Uncomment one of these if you want to allow remote clients to connect:
# proxyAddress = "::0" # both IPv4 and IPv6
proxyAddress = "0.0.0.0" # IPv4 only
Nothing else is uncommented in this file.
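As a standalone check of the proxy chain (outside Scrapy), something like this should print a Tor exit IP; a minimal sketch, assuming Polipo listens on 8123 as in HTTP_PROXY above:

import requests

# Sanity check: fetch my apparent IP through the Polipo -> Tor chain.
proxies = {'http': 'http://127.0.0.1:8123', 'https': 'http://127.0.0.1:8123'}
print(requests.get('http://icanhazip.com/', proxies=proxies, timeout=10).text.strip())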
Tor's torrc configuration file is set up as:
ControlPort 9051
DataDirectory C:/Users/truc/Documents/Tor/Data/Tor
HashedControlPassword *******
Log notice stdout
SocksPort 9050
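For reference, the HashedControlPassword value comes from running tor --hash-password on the plain-text password. A minimal sketch to verify that the control-port password matches, assuming Tor is running with the torrc above and stem is installed:

from stem.control import Controller

# Try authenticating on the control port with the plain-text password
# (the one that was hashed into HashedControlPassword, not the hash itself).
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='my_plaintext_password')  # placeholder value
    print(controller.get_version())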
I launch the crawl with:

py -m scrapy crawl spider -a arg1=0 -a arg2=30

So my spider scrapes 30 different addresses, which means at least 30 requests here, not counting the login page.
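For context, the two -a flags arrive as string arguments on the spider's constructor; a minimal sketch (the class name and the meaning of arg1/arg2 are placeholders, since the spider code itself is not shown):

import scrapy

class MySpider(scrapy.Spider):  # placeholder class name
    name = 'spider'

    def __init__(self, arg1=0, arg2=30, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a arg1=0 -a arg2=30 arrive here as strings
        self.first_index, self.last_index = int(arg1), int(arg2)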
In my spider file, I send a request to http://checkip.dyndns.org/ to check whether my IP changes:
def parse_page(self, response):
    ...  # parsing and returning item
    yield scrapy.Request('http://checkip.dyndns.org/', meta={'item': item},
                         callback=self.checkip, dont_filter=True)
    yield item

def checkip(self, response):
    print('IP: {}'.format(response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]))
Contrary to what I expected, the IP does not change: the console printed IP: 195.176.3.20 25 times (it would have been 30 times had there been no exceptions), whereas I expected a change every 10 requests. And yes, it is odd that I got only 27 items and only 25 IP printouts; that is because some requests, including those to the http://checkip.dyndns.org/ page, failed with the same exception you can see below.
In the log file, there are the following lines:
2019-02-19 17:07:22 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: I_AM_A_POLITE_ROBOT)
2019-02-19 17:07:22 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.5,
cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0,
Twisted 18.9.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52)
[MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018),
cryptography 2.5, Platform Windows-7-6.1.7601-SP1
2019-02-19 17:07:22 [scrapy.crawler] INFO: Overridden settings:{'AUTOTHROTTLE_ENABLED': True,
'BOT_NAME': 'I_AM_A_POLITE_ROBOT',
'DOWNLOAD_DELAY': 2, 'LOG_FILE': 'monlog.log',
'NEWSPIDER_MODULE':'folder.spiders',
'RETRY_HTTP_CODES': [401, 403, 404, 408, 500, 502, 503, 504],
'RETRY_TIMES': 5, 'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['folder.spiders']}
2019-02-19 17:07:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-02-19 17:07:23 [spider_name] DEBUG: Proxy : http://127.0.0.1:8123
2019-02-19 17:07:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.website.com/robots.txt> (referer: None)
2019-02-19 17:07:25 [spider_name] DEBUG: Proxy : http://127.0.0.1:8123
2019-02-19 17:07:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.website.com/compte/login> (referer: None)
... # here we see from the 200 responses that the connection is well established, but just after that comes the problem I do not understand
2019-02-19 17:07:42 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:8118
2019-02-19 17:07:43 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "[WinError 10058] Une demande d’envoi ou de réception de données n’a pas été autorisée car le socket avait déjà été éteint dans cette direction par un appel d’arrêt précédent"
# translation: A request to send or receive data was disallowed because the socket had already been shut down in that direction with a previous shutdown call
2019-02-19 17:07:43 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
2019-02-19 17:07:43 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
2019-02-19 17:07:43 [stem] INFO: Error while receiving a control message (SocketClosed): received exception "peek of closed file"
...#and repeated many times.
2019-02-19 17:07:43 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.website.com/page1>
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "C:\Python37\lib\site-packages\urllib3\util\connection.py", line 80, in create_connection
raise err
File "C:\Python37\lib\site-packages\urllib3\util\connection.py", line 70, in create_connection
sock.connect(sa)
ConnectionRefusedError: [WinError 10061] Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée
# translation: No connection could be made because the target machine actively refused it
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen chunked=chunked)
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request conn.request(method, url, **httplib_request_kw)
File "C:\Python37\lib\http\client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Python37\lib\http\client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Python37\lib\http\client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Python37\lib\http\client.py", line 1016, in _send_output
self.send(msg)
File "C:\Python37\lib\http\client.py", line 956, in send
self.connect()
File "C:\Python37\lib\site-packages\urllib3\connection.py", line 181, in connect conn = self._new_conn()
File "C:\Python37\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x04EF7BD0>: Failed to establish a new connection: [WinError 10061]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "C:\Python37\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Python37\lib\site-packages\urllib3\util\retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError:
HTTPConnectionPool(host='127.0.0.1', port=8118): Max retries exceeded with url:
http://icanhazip.com/ (Caused by ProxyError('Cannot connect to proxy.',
NewConnectionError('<urllib3.connection.HTTPConnection object at 0x04EF7BD0>: Failed to establish a new connection:
[WinError 10061] Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\toripchanger\changer.py", line 107, in get_new_ip
current_ip = self.get_current_ip()
File "C:\Python37\lib\site-packages\toripchanger\changer.py", line 84, in get_current_ip
response = get(ICANHAZIP, proxies={'http': self.local_http_proxy})
File "C:\Python37\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Python37\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python37\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "C:\Python37\lib\site-packages\requests\adapters.py", line 510, in send
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPConnectionPool(host='127.0.0.1', port=8118):
Max retries exceeded with url: http://icanhazip.com/ (Caused by ProxyError('Cannot connect to proxy.',
NewConnectionError('<urllib3.connection.HTTPConnection object at 0x04EF7BD0>:
Failed to establish a new connection: [WinError 10061] Aucune connexion n’a pu être établie car l’ordinateur cible l’a expressément refusée')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "C:\Python37\lib\site-packages\scrapy\core\downloader\middleware.py", line 37, in process_request
response = yield method(request=request, spider=spider)
File "C:\Users\truc\scrapy_project\middlewares.py", line 39, in process_request
ip_changer.get_new_ip()
File "C:\Python37\lib\site-packages\toripchanger\changer.py", line 109, in get_new_ip
self._obtain_new_ip()
File "C:\Python37\lib\site-packages\toripchanger\changer.py", line 180, in _obtain_new_ip
controller.authenticate(password=self.tor_password)
File "C:\Python37\lib\site-packages\stem\control.py", line 1100, in authenticate
stem.connection.authenticate(self, *args, **kwargs)
File "C:\Python37\lib\site-packages\stem\connection.py", line 625, in authenticate
raise auth_exc
File "C:\Python37\lib\site-packages\stem\connection.py", line 579, in authenticate
authenticate_password(controller, password, False)
File "C:\Python37\lib\site-packages\stem\connection.py", line 735, in authenticate_password
raise IncorrectPassword(str(auth_response), auth_response)
stem.connection.IncorrectPassword: Authentication failed: Password did not match HashedControlPassword value from configuration
OK, I am sorry this is so long, but at least it is complete; as I said, I have no training in these skills. Note that the pattern above repeats: the other failed requests raise the same exceptions, so I cut the log short.

What should I do?

Versions: Tor 0.3.4.8, Vidalia 0.2.21, Scrapy 1.6.0, Polipo: I don't know, Python 3.7.2.