Spider closing on many failed URLs

Time: 2015-10-05 08:54:27

Tags: python scrapy

Recently I had to scrape a big batch of URLs, many of which failed to load, took too long to load, did not exist, and so on.

When my spider hits a string of such broken URLs, it shuts itself down. How can I change this behavior and tell it not to sweat the failed URLs, but simply skip them?

Here is my ugly error traceback:

Error during info_callback
Traceback (most recent call last):
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
    self._write(bytes)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
    sent = self._tlsConnection.send(toSend)
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 949, in send
    result = _lib.SSL_write(self._ssl, buf, len(buf))
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 702, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
    return wrapped(connection, where, ret)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
    transport = connection.get_app_data()
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1224, in get_app_data
    return self._app_data
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 838, in __getattr__
    return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'

From callback <function infoCallback at 0x7feaa9e3a8c0>:
Traceback (most recent call last):
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 702, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1059, in infoCallback
    connection.get_app_data().failVerification(f)
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1224, in get_app_data
    return self._app_data
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 838, in __getattr__
    return getattr(self._socket, name)
AttributeError: 'NoneType' object has no attribute '_app_data'

Besides the error above, which I don't understand, I am also getting a lot of TimeoutErrors and Twisted failures:

2015-10-05 12:30:10 [scrapy] DEBUG: Retrying <GET http://www.example.com> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]

What are these errors? Why is my spider closing because of them? And how can I change that?

1 Answer:

Answer 0 (score: 3)

The first error is caused by a bug in scrapy: http://jsfiddle.net/0w5ejd5q/2/

It can be fixed by installing service_identity:

pip install service_identity
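
A quick way to verify the package is in place (just a sanity check, nothing scrapy-specific) is to try importing it:

python -c "import service_identity"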

The second problem is that Twisted cannot connect to the example domain. There is nothing to do in this case: the URL is simply skipped without any problem, and it is only logged that there is nothing on the other end. I don't think this is related to your spider closing; that is more likely caused by the error above.
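
If you want the skipping to be explicit, Scrapy's Request takes an errback that is called with a Twisted Failure whenever a request errors out. Here is a minimal sketch of that idea (the spider name and start URL are placeholders, and self.logger assumes Scrapy 1.0 or later):

import scrapy
from twisted.internet.error import ConnectionDone, TimeoutError

class SkipFailuresSpider(scrapy.Spider):
    # Hypothetical example spider; the name and URLs are placeholders.
    name = "skip_failures"
    start_urls = ["http://www.example.com"]

    def start_requests(self):
        for url in self.start_urls:
            # errback receives a twisted Failure when the request fails
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info("Got %s", response.url)

    def on_error(self, failure):
        # Log the broken URL and carry on; the crawl keeps running
        if failure.check(TimeoutError, ConnectionDone):
            self.logger.warning("Skipping %s: %s", failure.request.url, failure.value)
        else:
            self.logger.error(repr(failure))

With something like this in place, timeouts and dropped connections on individual URLs are logged and skipped while the rest of the crawl continues.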