Recently I had to crawl a large batch of URLs, and many of the sites failed to load, took too long to respond, didn't exist, and so on.
When my spider hits a run of broken URLs like these, it shuts itself down. How can I change this behaviour so that it doesn't sweat the failed URLs and simply skips them?
Here is my ugly error traceback:
Error during info_callback
Traceback (most recent call last):
File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
self._write(bytes)
File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
sent = self._tlsConnection.send(toSend)
File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 949, in send
result = _lib.SSL_write(self._ssl, buf, len(buf))
File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 702, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
return wrapped(connection, where, ret)
File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
transport = connection.get_app_data()
File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1224, in get_app_data
return self._app_data
File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 838, in __getattr__
return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'
From callback <function infoCallback at 0x7feaa9e3a8c0>:
Traceback (most recent call last):
File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 702, in wrapper
callback(Connection._reverse_mapping[ssl], where, return_code)
File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1059, in infoCallback
connection.get_app_data().failVerification(f)
File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1224, in get_app_data
return self._app_data
File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 838, in __getattr__
return getattr(self._socket, name)
AttributeError: 'NoneType' object has no attribute '_app_data'
Besides the error above, which I don't understand, I am also getting a lot of TimeoutErrors
and Twisted Failures:
2015-10-05 12:30:10 [scrapy] DEBUG: Retrying <GET http://www.example.com> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
What are these errors? Why does my spider shut down on them? And how can I change that?
Answer 0 (score: 3)
The first error is caused by a known bug in scrapy, and can be fixed by installing service_identity:
pip install service_identity
The second problem is that Twisted cannot connect to the example domain. There is nothing to do in that case: the URL is skipped without any trouble, and the failure is merely logged because there is nothing on the other end. I don't think this is what is shutting your spider down; that happens because of the error above.
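For completeness: if the goal is to log failed URLs and keep crawling, Scrapy's documented mechanism for intercepting per-request failures is the `errback` argument of `Request` (the errback receives a Twisted `Failure`). Scrapy itself is not needed to show the idea, so here is a minimal stdlib sketch of the same skip-and-record pattern; `crawl_all` and `fake_fetch` are hypothetical names, and `fetch` is just a stand-in for the downloader:

```python
def crawl_all(urls, fetch):
    """Visit every URL; record failures instead of aborting the whole run.

    `fetch` is any callable that returns a response or raises on error
    (a stand-in for Scrapy's downloader).
    """
    results = {}
    failed = []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: any failure just skips the URL
            failed.append((url, repr(exc)))
    return results, failed


# Example with a fake fetcher: one URL "breaks", the crawl still finishes.
def fake_fetch(url):
    if "broken" in url:
        raise IOError("connection refused")
    return "<html>ok</html>"

results, failed = crawl_all(
    ["http://a.example", "http://broken.example", "http://b.example"],
    fake_fetch,
)
print(len(results), len(failed))  # → 2 1
```

In a real spider the same effect comes from yielding `scrapy.Request(url, callback=self.parse, errback=self.on_error)` and logging inside `on_error`; the `DOWNLOAD_TIMEOUT` and `RETRY_TIMES` settings control how long Scrapy waits for a response and how often it retries before giving up on a URL.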