Question

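I want to use the Scrapy framework to crawl https://dms.psc.sc.gov/Web/dockets, which uses TLS v1.2. But when the URL is requested, it fails to load and raises [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>].

A related issue is discussed on GitHub at https://github.com/scrapy/scrapy/issues/981, but the suggestions there did not work for me. I have Scrapy v0.24.5 and Twisted version >= 14.

When I crawl another website that also uses TLS v1.2, it works fine, but it does not work for https://dms.psc.sc.gov. How can I solve this?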

Answer 1

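A PR fixing this problem has been merged into Scrapy. More recently (February 2016) another pull request fixed a similar bug.

With a recent Scrapy version I can crawl your page correctly, but the problem persists in older versions.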

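In general, if you run into HTTPS problems with Scrapy, the solution is:

1. Upgrade Scrapy to the latest version.
2. Check which Twisted version you are using; if it is not recent, update to the latest Twisted release (version 14 at the time of writing, which is confirmed to be significantly better with respect to SSL).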
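The two checks above can be sketched as a small script that reports the installed versions. This is only an illustration: the helper names are hypothetical, the Twisted threshold of 14.0 comes from this answer, and the Scrapy threshold of 1.0 is an assumption chosen for the example, not a documented minimum.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '14.0.2' into (14, 0, 2); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_at_least(found, minimum):
    """True if the version string `found` is >= the `minimum` tuple."""
    if found is None:
        return False
    got = version_tuple(found)
    # Pad with zeros so that '14' compares equal to (14, 0).
    got += (0,) * (len(minimum) - len(got))
    return got >= minimum


if __name__ == "__main__":
    # Thresholds: Twisted 14.0 is taken from the answer above;
    # Scrapy 1.0 is an assumed placeholder, adjust as needed.
    for package, minimum in (("Scrapy", (1, 0)), ("Twisted", (14, 0))):
        found = installed_version(package)
        status = "ok" if is_at_least(found, minimum) else "upgrade suggested"
        print(package, found or "not installed", status)
```

If either line reports "upgrade suggested", upgrading both packages is the first thing to try before writing a custom context factory.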

If you still have problems after updating Scrapy and Twisted, you may need to subclass ScrapyClientContextFactory; see the answers below for details.

More details are in this GitHub issue.

Answer 2

1. Add DOWNLOADER_CLIENTCONTEXTFACTORY = 'testproject.CustomContext.CustomClientContextFactory' to settings.py in the project directory.
2. Create a file named CustomContext.py and add the following code:

    from OpenSSL import SSL
    from twisted.internet.ssl import ClientContextFactory
    from twisted.internet._sslverify import ClientTLSOptions
    from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory


    class CustomClientContextFactory(ScrapyClientContextFactory):
        def getContext(self, hostname=None, port=None):
            ctx = ClientContextFactory.getContext(self)
            # Enable all workarounds to SSL bugs as documented by
            # http://www.openssl.org/docs/ssl/SSL_CTX_set_options.html
            ctx.set_options(SSL.OP_ALL)
            if hostname:
                ClientTLSOptions(hostname, ctx)
            return ctx

Note: this works for crawling https websites on Windows, but when I tried the same thing on Ubuntu 14.04, it threw the following error:

    from twisted.internet._sslverify import ClientTLSOptions
    exceptions.ImportError: cannot import name ClientTLSOptions

It would be great if someone could add a solution for the above error.

Edit

Instead of using from twisted.internet._sslverify import ClientTLSOptions, I changed it to the following:

    try:
        # available since twisted 14.0
        from twisted.internet._sslverify import ClientTLSOptions
    except ImportError:
        ClientTLSOptions = None
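One thing to watch with this fallback: when the import fails, ClientTLSOptions is None, so a later call such as ClientTLSOptions(hostname, ctx) would itself fail with "'NoneType' object is not callable". A minimal sketch of a guard follows; apply_tls_options is a hypothetical helper name introduced here for illustration, not part of Twisted or Scrapy.

```python
try:
    # Private Twisted API; per the answer above, available since Twisted 14.0.
    from twisted.internet._sslverify import ClientTLSOptions
except ImportError:
    ClientTLSOptions = None


def apply_tls_options(hostname, ctx, tls_options=ClientTLSOptions):
    """Attach hostname-aware TLS options to ctx when that is possible.

    Returns True if the options were applied, False if the step was
    skipped (no hostname given, or the Twisted class is unavailable).
    `tls_options` is injectable so the guard can be exercised without
    Twisted installed.
    """
    if not hostname or tls_options is None:
        return False
    tls_options(hostname, ctx)
    return True
```

Inside getContext, the line pair "if hostname: ClientTLSOptions(hostname, ctx)" could then become a single call to apply_tls_options(hostname, ctx), which silently skips the step on older Twisted versions.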

Answer 3

For anyone getting "TypeError: unbound method getContext() must be called with ClientContextFactory instance as first argument ...", replace

    ctx = ClientContextFactory.getContext(self)

with

    ctx = ScrapyClientContextFactory.getContext(self)
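The TypeError quoted above is what Python 2 raises when a method is called through a class with an instance that does not inherit from that class. Delegating to the actual parent avoids hard-coding a class that may not be in the inheritance chain. The sketch below uses stand-in classes (BaseFactory, UnrelatedFactory and CustomFactory are all hypothetical) rather than the real Scrapy and Twisted ones, and a dictionary in place of a real SSL context.

```python
class UnrelatedFactory:
    """Stand-in for a class that is NOT a base of the custom factory."""

    def getContext(self):
        return {"source": "UnrelatedFactory", "options": []}


class BaseFactory:
    """Stand-in for the real parent class of the custom factory."""

    def getContext(self, hostname=None, port=None):
        return {"source": "BaseFactory", "options": []}


class CustomFactory(BaseFactory):
    def getContext(self, hostname=None, port=None):
        # Delegate to whatever the actual parent is, instead of naming
        # a class that may not be in the inheritance chain.
        ctx = super().getContext(hostname, port)
        # Placeholder for ctx.set_options(SSL.OP_ALL) on a real context.
        ctx["options"].append("OP_ALL")
        return ctx
```

Whether the real parent class exposes getContext with this signature depends on the Scrapy version in use, so this pattern is a starting point to check against the installed source rather than a guaranteed fix.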

Answer 4

Vinodh Velumayil's answer is correct. But I had to change this line:

    ctx = ClientContextFactory.getContext(self)

to this:

    inst = ClientContextFactory()
    ctx = inst.getContext()

Crawling an SSL site with Scrapy
