Spider closing on many failed URLs

Time: 2015-10-05 08:54:27

Tags: python scrapy

Recently I had to scrape a big batch of URLs, many of which failed to load, took too long to load, did not exist, and so on.

When my spider hits a string of such broken URLs, it shuts itself down. How can I change this behavior and tell it not to sweat the failed URLs, but simply skip them?

Here is my ugly error traceback:

Error during info_callback
Traceback (most recent call last):
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
    self._write(bytes)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
    sent = self._tlsConnection.send(toSend)
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 949, in send
    result = _lib.SSL_write(self._ssl, buf, len(buf))
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 702, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
    return wrapped(connection, where, ret)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
    transport = connection.get_app_data()
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1224, in get_app_data
    return self._app_data
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 838, in __getattr__
    return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'

From callback <function infoCallback at 0x7feaa9e3a8c0>:
Traceback (most recent call last):
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 702, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1059, in infoCallback
    connection.get_app_data().failVerification(f)
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1224, in get_app_data
    return self._app_data
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 838, in __getattr__
    return getattr(self._socket, name)
AttributeError: 'NoneType' object has no attribute '_app_data'

Besides the error above, which I don't understand, I am also getting a lot of TimeoutErrors and Twisted failures:

2015-10-05 12:30:10 [scrapy] DEBUG: Retrying <GET http://www.example.com> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]

What are these errors? Why is my spider closing because of them? And how can I change that?

1 Answer:

Answer 0 (score: 3)

The first error is caused by a bug in scrapy: http://jsfiddle.net/0w5ejd5q/2/

It can be fixed by installing service_identity:

pip install service_identity
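
A quick way to verify the package is in place (just a sanity check, nothing scrapy-specific) is to try importing it:

python -c "import service_identity"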

The second problem is that Twisted cannot connect to the example domain. There is nothing to do in this case: the URL is simply skipped without any problem, and it is only logged that there is nothing on the other end. I don't think this is related to your spider closing; that is more likely caused by the error above.
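
If you want the skipping to be explicit, Scrapy's Request takes an errback that is called with a Twisted Failure whenever a request errors out. Here is a minimal sketch of that idea (the spider name and start URL are placeholders, and self.logger assumes Scrapy 1.0 or later):

import scrapy
from twisted.internet.error import ConnectionDone, TimeoutError

class SkipFailuresSpider(scrapy.Spider):
    # Hypothetical example spider; the name and URLs are placeholders.
    name = "skip_failures"
    start_urls = ["http://www.example.com"]

    def start_requests(self):
        for url in self.start_urls:
            # errback receives a twisted Failure when the request fails
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info("Got %s", response.url)

    def on_error(self, failure):
        # Log the broken URL and carry on; the crawl keeps running
        if failure.check(TimeoutError, ConnectionDone):
            self.logger.warning("Skipping %s: %s", failure.request.url, failure.value)
        else:
            self.logger.error(repr(failure))

With something like this in place, timeouts and dropped connections on individual URLs are logged and skipped while the rest of the crawl continues.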