Question

I am working on a data scraping project. My code uses Scrapy (version 1.0.4) and Selenium (version 2.47.1).

import time

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class TradesySpider(CrawlSpider):
    name = 'tradesy'
    start_urls = ['My Start url']

    def __init__(self, *args, **kwargs):
        # Let the base spider initialize itself before starting the browser
        super(TradesySpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            # Read the page Selenium currently shows, so each click on
            # "next" yields the links of the new page
            page = Selector(text=self.driver.page_source)
            tradesy_urls = page.xpath('//div[@id="right-panel"]')
            data_urls = tradesy_urls.xpath('div[@class="item streamline"]/a/@href').extract()
            for link in data_urls:
                url = 'My base url' + link
                yield Request(url=url, callback=self.parse_data)
                time.sleep(10)
            try:
                data_path = self.driver.find_element_by_xpath('//*[@id="page-next"]')
            except NoSuchElementException:
                # No "next" button left: this was the last page
                break
            data_path.click()
            time.sleep(10)

    def parse_data(self, response):
        'Scrapy Operations...'

When I run my code, I get the expected output for some URLs, but for others I get the following error:

2016-01-19 15:45:17 [scrapy] DEBUG: Retrying <GET MY_URL> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]

Please suggest a solution for this problem.

Answer 1

According to this reported issue, you can create your own ContextFactory to handle SSL.

context.py:

from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory


class CustomContextFactory(ScrapyClientContextFactory):
    """
    Custom context factory that allows SSL negotiation.
    """

    def __init__(self):
        # Use SSLv23_METHOD so we can use protocol negotiation
        self.method = SSL.SSLv23_METHOD

settings.py:

DOWNLOADER_CLIENTCONTEXTFACTORY = 'yourproject.context.CustomContextFactory'

Answer 2

A variation of eLRuLL's answer that does not require an extra file. It "decorates" the __init__ method of the ScrapyClientContextFactory class.

from OpenSSL import SSL
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

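# Keep a reference to the original __init__ so the wrapper can call it
init = ScrapyClientContextFactory.__init__

def init2(self, *args, **kwargs):
    init(self, *args, **kwargs)
    # Use SSLv23_METHOD so the protocol version can be negotiated
    self.method = SSL.SSLv23_METHOD

ScrapyClientContextFactory.__init__ = init2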

Answer 3

I ran into this error when using Scrapy 1.5.0:

Error downloading: https://my.website.com>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'tls12_check_peer_sigalg', 'wrong curve')]>]

What finally worked was updating my Twisted version (from 17.9.0 to 19.10.0). I also updated Scrapy to 2.4.0, along with a few other packages:

cryptography: 2.2.2 -> 2.3
parsel: 1.4.0 -> 1.5.0
pyOpenSSL: 17.5.0 -> 19.0.0
urllib3: 1.22 -> 1.24.3
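
Before upgrading, it can help to record which versions are currently installed. The following is a small sketch, assuming Python 3.8 or newer (for importlib.metadata). The helper name installed_versions and the PACKAGES list are illustrative choices, not something from this answer.

```python
from importlib.metadata import PackageNotFoundError, version

# Distributions mentioned in this answer.
PACKAGES = ["Scrapy", "Twisted", "pyOpenSSL", "cryptography", "parsel", "urllib3"]


def installed_versions(names=PACKAGES):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print("{}: {}".format(name, ver or "not installed"))
```

Running `scrapy version -v` on the command line prints similar information for Scrapy's main dependencies, including Twisted and pyOpenSSL.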

python.failure.Failure OpenSSL.SSL.Error in Scrapy (version 1.0.4)
